Behind the Scenes: Introducing Really Old Website Resurrector (ROWR)

From time to time the University Archives finds copies of departmental websites stored on CDs or DVDs as directories of HTML and other associated files. These websites are usually no longer available on the web. When we receive CDs/DVDs from University departments, the files are carefully copied from the optical media (a relatively unstable storage medium) and deposited in our repository, which is designed specifically for digital materials preservation. However, accessing a website as a directory of individual files rather than as web pages in a browser leaves something to be desired. The content might be available, but the use is very different from what was originally intended. Additionally, from an archival standpoint we would like all archived websites to be stored in the WARC file format (an international standard).

In reviewing these items in our collections last year, I began to wonder if it would be possible to temporarily host the websites again. Once hosted online, we could crawl the website with Archive-It, the tool we use for website archiving. This method would allow us to provide access to the webpages as a site, connect the websites with the rest of our archived website collections, and generate a WARC file copy of the site’s contents. Luke Aeschleman, then of the library’s software development department, helped me with this project by creating a tool, ROWR, to clean up links and prepare the site for hosting.

ROWR prompts the user for a directory of website files as well as appropriate actions for modifying or removing links. ROWR creates a copy of the site prior to making changes, so it is possible to reset and start over if needed. ROWR also keeps track of the files and folders it has scanned, so it is possible to stop and continue reviewing the site later.
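To give a flavor of the kind of link cleanup involved, here is a minimal sketch (not ROWR’s actual code) of scanning a directory of HTML files and rewriting absolute links that point at a site’s defunct domain so they resolve relatively when re-hosted. The domain name and function are hypothetical examples.

```python
import os
import re

# Hypothetical example domain; the real pattern would be the defunct
# department site's original address.
OLD_DOMAIN = re.compile(r'https?://www\.example-dept\.unc\.edu/', re.IGNORECASE)

def rewrite_links(site_dir):
    """Scan .html/.htm files under site_dir and make old absolute links relative.

    Returns the list of files that were modified.
    """
    changed = []
    for root, _dirs, files in os.walk(site_dir):
        for name in files:
            if not name.lower().endswith((".html", ".htm")):
                continue
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                text = f.read()
            # Replace "http://www.example-dept.unc.edu/..." with "/..."
            fixed = OLD_DOMAIN.sub("/", text)
            if fixed != text:
                with open(path, "w", encoding="utf-8") as f:
                    f.write(fixed)
                changed.append(path)
    return changed
```

A real tool also needs interactive decisions (e.g., removing links to pages that were never on the disc), which is why ROWR prompts the user rather than rewriting blindly.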

ROWR essentially produces a website that has a new artificial URL to facilitate temporary hosting of the website through a library server. This URL is then added to the Archive-It application and we run a standard crawl of the site. Once the crawl is tested and finalized, we take the website down from the library server.  

We tested this approach with two websites. Overall the process works fairly well, but I did come up against some unique collection management and description needs. For example:

  • Do we need to keep a copy of the files from the CD/DVD, or can we discard them and just use the Archive-It version?
  • The crawl date in Archive-It is completely different from the date the website was created and originally used. How should we represent these dates to users in metadata and other description?  
  • ROWR is changing the content and we are creating an artificial URL, so how do we communicate this to users and what would they want to know about these changes?  
  • It can be time consuming to use ROWR and clean-up all the links.

I decided that we should keep the copy from the CD/DVD available in the repository, as it is representative of the original website and the version in Archive-It contains an artificial URL. To address the other issues, we added some language to finding aids:


“An alternative version made for access is available here. This website was transferred to the University Archives on optical disc. To aid preservation and access, the website files were temporarily re-hosted online and archived with Archive-It in 2017.”

In Archive-It, I also created a URL group (“website cd archives”) for the websites that were part of this test project in an attempt to set them apart from our typical web archiving work. I’ve not yet found a satisfactory way to provide context for these websites in the Archive-It access portal with Dublin Core metadata, but I hope that the group tag can be a clue that more information exists if a user were to ask us.

These two approaches to description are likely not the permanent solutions to the collection management challenges, but they are a starting point that provides an easier way to access these particular websites online. A future project for us will be to assess metadata description in Archive-It for all of our archived websites.

ROWR is in an early iteration and is not being actively developed at this time, but you can find the code on the UNC Libraries’ GitHub. In the months since we wrapped up this project in the summer of 2017, the Webrecorder team introduced a tool called warcit. The tool can transform a directory of website files into the WARC file format. The resulting WARC file could then be accessed in the Webrecorder Player application. This new tool is something else we’ll be exploring as we continue to improve procedures for the preservation and access of website archives transferred to us as file directories.
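For readers curious what the warcit workflow looks like, a basic invocation follows the pattern below (the domain and directory here are hypothetical examples; see the warcit documentation for current options):

```shell
# Install the tool (a Webrecorder project)
pip install warcit

# Convert a directory of website files into a WARC, using the site's
# original URL as the prefix so links resolve on replay.
warcit http://www.example-dept.unc.edu/ ./website-files/

# The resulting .warc.gz file can then be opened in the Webrecorder
# Player for offline browsing.
```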

Carolina Tweets #archiveunc

When you think of archives you might think of dusty old books and papers tucked away to be used by historians and other academics. Here at the University Archives we preserve plenty of old University records (that are kept dust-free, by the way), but our day-to-day work is actually very focused on the current moment. Without collecting materials that document the present day, researchers can’t study the University in the future.

One way we archive the current moment is through collecting student life materials and UNC-related web content. With only three full-time staff members it can be tough to keep up with all the conversations, events, and activism happening on campus. We can’t do this alone. This is where you come in!

You can actively contribute to the documentation of what’s happening at UNC by using the hashtag #archiveunc on your public tweets or Instagram posts. That’s all you have to do! By using the hashtag, you opt in to having the posts archived for long-term preservation and research access.

How is the content archived? We will periodically use a tool called Archive-It to “crawl” the tweets or posts tagged with the #archiveunc hashtag. Once the posts have been crawled by the Archive-It tool, the data is preserved by the Internet Archive and we provide access through our Archive-It website.

What kind of tweets are we looking for? We’re open to any tweets or Instagram posts related to UNC academics, campus life, and events. For example:

  • Promoting a student organization event? #archiveunc
  • Protesting? #archiveunc
  • Promoting a cause? #archiveunc
  • Sharing activities or chalk messages seen on campus? #archiveunc

If you don’t use #archiveunc, we may be in touch to ask permission to add your social media content or website to the Archives. Collecting social media content as it unfolds is new for us. We’re experimenting, so how we ask for permission and the technology used may evolve over time. As things change, we’ll keep you in the loop.

We hope you’ll join us in this exciting new effort!

Not interested in social media? Other ways to get involved and help document Carolina history:

  • Submit photos of UNC shirts to the UNC T-Shirt Archive.
  • Connect with us regarding donation of student organization records, digital or print photos, videos, or campus posters/flyers. If it documents something happening at UNC, we’re happy to talk about adding it to the archives. Please email us (archives@unc.edu) to get the process started.
  • Nominate a UNC website for archiving. First check to see if we’ve already archived the website: https://archive-it.org/collections/3491. If the website can’t be found in our web archives, send us an email (archives@unc.edu) to get the process started.

C-A-R-O-L-I-N-A: www.unc.edu circa 1997

The UNC Libraries started a web archiving project in January 2013 (read more about that here), but the Internet Archive has been saving websites for much, much longer. In fact, they have saved over 366 BILLION web pages since 1996, accessible through the Wayback Machine.

In the Wayback Machine you can see an archive of UNC.edu since 1997, not to mention tons of other websites. Take a moment to search for some of your favorite websites and see what they looked like 10 (or more!) years ago. Not surprisingly, the Web has changed quite a bit since then.

Here is a snapshot of UNC’s homepage from April 27, 1997, featuring a very creative and informative acrostic linking to University departments and offices.

[Screenshot: www.unc.edu homepage, April 27, 1997]

Does anyone else think we should bring back the acrostic? What would your acrostic be?

Saving UNC’s Slice of the Web

[Image: Wayback Machine banner]
If you have ever stumbled across a webpage with this banner across the top of it, you’ve encountered the Wayback Machine. The Wayback Machine was developed by the Internet Archive in 1996 to start archiving the web, and since then it has collected around 240 billion web pages.

In 2006 the Internet Archive launched Archive-It, which is a hosted service that allows institutions to create their own web archives.

In January of 2013, the UNC Libraries began archiving websites in five different collections. These collections support existing collecting areas in the Libraries and include:

You can browse all of our collections through Archive-It, and individual websites have been cataloged for access through the UNC Libraries’ catalog.

Additionally, websites that are part of existing archival collections are described in that collection’s finding aid. For example, you can see description of and get access to an archived version of the North Carolina Literary Festival’s 2009 website from the finding aid for the records of the North Carolina Literary Festival.

Here’s a snippet from that web site, showing the banner that Archive-It uses to let the viewer know that they’re looking at an archived web page.

[Screenshot: archived North Carolina Literary Festival web page showing the Archive-It banner]