In the world of digital curation, websites are a tricky business. How do you curate an object that is constantly changing, or that can disappear entirely in the blink of an eye? There are already many discussions about the “Digital Dark Age,” in which much of the early internet has been lost, so attention is now turning to how we can preserve the internet as it exists today. But how are those sites captured? Who does the work, and what kinds of tools are available to do it?

Archive-It

Archive-It home page https://www.archive-it.org/

One such tool is called Archive-It. Based in San Francisco and built by the Internet Archive, Archive-It is a subscription-based web archiving service that provides the tools and guidance to help develop parameters for collecting and archiving websites and topics of interest. The collections are then stored and, once “published,” can be accessed through the Wayback Machine.

Seed management

Target websites are entered as “seeds,” which can then be managed and tested to make sure the correct sites are being crawled and returning all of the wanted data and nothing else. This includes entering metadata and running test crawls to see what is actually being collected. You can also choose the frequency of collection, anywhere from multiple crawls a day down to a single, one-time crawl. That flexibility means a frequently changing political campaign website could be saved up to twice a day, while a site that rarely or never updates can be archived once or twice a year. A single crawl can capture everything available on a topic at the moment of collection, providing a snapshot of a single idea in time and space.
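To make the idea of seeds, metadata, and crawl frequencies more concrete, here is a minimal sketch of how a small collection plan might be laid out before it is entered into Archive-It. This is purely illustrative Python, not Archive-It’s actual interface; the seed URLs, metadata fields, and frequency labels are assumptions for the example.

```python
# Illustrative only: a rough collection plan for seeds, metadata, and crawl
# frequency, sketched in Python before entering it into Archive-It.
# The URLs, fields, and frequency labels below are hypothetical examples.

seeds = [
    {
        "url": "https://example-campaign.org/",          # hypothetical seed
        "metadata": {"title": "Campaign site", "subject": "Election coverage"},
        "frequency": "twice daily",                       # changes constantly
    },
    {
        "url": "https://example-history-society.org/",   # hypothetical seed
        "metadata": {"title": "Local history society", "subject": "Local history"},
        "frequency": "annual",                            # rarely updates
    },
    {
        "url": "https://example-exhibit.org/",            # hypothetical seed
        "metadata": {"title": "One-time exhibit", "subject": "Photography"},
        "frequency": "one-time",                          # single snapshot
    },
]

# A quick summary like this helps when comparing test-crawl results to the plan.
for seed in seeds:
    print(f"{seed['url']}: crawl {seed['frequency']} ({seed['metadata']['title']})")
```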

From my own experience using Archive-It for a group project last semester, I’ve learned this tool is great for text-based collections with a limited number of photos, but it quickly runs into problems when you try to document, say, a photographic exhibit presented at the Smithsonian two years ago that includes videos and voice recordings. That was the subject my group chose, and we nearly lost our collective minds over the experience. The working title of our presentation was “To Archive-It and Back Again” (we used humor to cope with not understanding why what we wanted and what we were getting weren’t the same thing), and it didn’t help that in the last week before our project was due, San Francisco shut down due to heavy rains. On the bright side, Archive-It’s help section cleared up quite a few issues we ran into.

Other groups that chose more text-oriented topics did not run into issues nearly as often as we did, so this is probably not the tool for you if you’re looking to collect every picture your favorite artist has shown in an exhibition, or the talks they’ve given that are hosted on YouTube.

Because…

Archive-It doesn’t recommend using their crawler on large social media sites like YouTube or Twitter, due to the amount of unwanted extra data that gets collected! And trust me when I say you’ll get A LOT of unwanted data if you crawl YouTube. We learned this the hard way.

Another consideration is the copyright status of the websites you’re selecting for curation and whether they use robots.txt exclusions. A robots.txt file in particular can prevent the Archive-It crawler from retrieving website data. It is possible, however, to contact the organization responsible for the website you’re trying to collect and request an exception for your crawl. If the websites you are curating do not belong to your institution, gaining permission ahead of time (for both copyright and the robots.txt exclusions) will save a lot of heartache and frustration in the long run.
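If you want to check ahead of time whether a prospective seed’s robots.txt would block a crawler, Python’s standard library can read the file for you. The sketch below is a rough illustration: the seed URL is made up, and the user-agent string is just a stand-in for whatever crawler identifier applies to your service, not a confirmed Archive-It value.

```python
# Rough pre-crawl check of a site's robots.txt using Python's standard library.
# The seed URL and the crawler user-agent string are assumptions for
# illustration, not confirmed Archive-It values.
from urllib import robotparser

seed_url = "https://example-exhibit.org/collection/"   # hypothetical seed
crawler_agent = "archive.org_bot"                      # stand-in user-agent

parser = robotparser.RobotFileParser()
parser.set_url("https://example-exhibit.org/robots.txt")
parser.read()  # fetches and parses the site's robots.txt

if parser.can_fetch(crawler_agent, seed_url):
    print("robots.txt allows this crawl; no exception request needed.")
else:
    print("robots.txt blocks this crawl; contact the site owner for an exception.")
```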

Ultimately, Archive-It is a useful subscription tool for curating websites. With some pre-planning and a solid idea of what you want to curate, Archive-It can help you preserve a piece of the ever-changing web landscape.