Stanford web archiving work cycle kickoff

Nicholas Taylor

Apr 27, 2017, 7:34:29 PM
to WASAPI-Community

Hi everybody,

 

On behalf of the project team – John Martin, Naomi Dushay, Tommy Ingulfsen, and myself – I wanted to briefly share some details of the Stanford web archiving work cycle now getting underway.

 

Imagine a meteor struck the Internet Archive. Aside from gravely injuring many of our friends and colleagues, it would secondarily impact (crater, even) the many web archive collections and data we’ve collected using their Archive-It subscription web archiving service for the last ten years. The Internet Archive as a single point-of-failure for our web archive collections is one of the key reasons that we endeavor to preserve all of our web archives in SDR, as well.

 

The Internet Archive and the web archiving community more broadly share an interest in more distributed preservation of the archived web. To this end, we partnered with Internet Archive for an IMLS grant (https://www.imls.gov/grants/awarded/lg-71-15-0174-15) to create APIs and related utilities to support more seamless and standardized mechanisms for web archive data transfer. The current web archiving work cycle comes at a time when we need to develop those “related utilities” in any case, and it serendipitously aligns with our existing downloading mechanism becoming unworkable. We’re moreover hoping to use this opportunity to automate not just the downloading but also the accessioning pipeline, to ensure more reliable and timely synchronization of the web archive data generated at Archive-It with what’s preserved in SDR.

 

The inception deck for the work cycle is available from the list of presentations in the GitHub repository: https://github.com/WASAPI-Community/data-transfer-apis. The specific goal is to create a download utility that can retrieve web archive crawl data from the Archive-It data transfer API implemented as part of the grant. Stretch goals are to make incremental progress toward bridging and automating the steps between download and accessioning.
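
To give a rough sense of what that download utility will involve, here is a minimal, hypothetical sketch in Python. It assumes an Archive-It WASAPI endpoint at https://partner.archive-it.org/wasapi/v1/webdata, HTTP basic auth with an Archive-It partner account, and a paged JSON response whose file entries carry filename, locations, and checksums fields along the lines of the draft spec in the repository above; the endpoint URL, field names, credentials, and collection ID are all illustrative assumptions, not the finished tool.

#!/usr/bin/env python
"""Minimal sketch of a WASAPI download client (illustrative assumptions only)."""
import hashlib
import os

import requests

WASAPI_ENDPOINT = "https://partner.archive-it.org/wasapi/v1/webdata"  # assumed endpoint
AUTH = ("ait_user", "ait_password")  # placeholder credentials


def list_webdata(collection_id):
    """Page through the WASAPI result set for one collection."""
    url = WASAPI_ENDPOINT
    params = {"collection": collection_id}
    while url:
        response = requests.get(url, params=params, auth=AUTH)
        response.raise_for_status()
        result = response.json()
        for webdata_file in result.get("files", []):
            yield webdata_file
        url = result.get("next")   # None once the last page is reached
        params = None              # the "next" URL already carries the query string


def download(webdata_file, dest_dir="."):
    """Fetch one WARC file and verify its reported checksum."""
    location = webdata_file["locations"][0]          # assumed field name
    filename = webdata_file["filename"]              # assumed field name
    expected_md5 = webdata_file.get("checksums", {}).get("md5")
    path = os.path.join(dest_dir, filename)
    digest = hashlib.md5()
    with requests.get(location, auth=AUTH, stream=True) as response:
        response.raise_for_status()
        with open(path, "wb") as fh:
            for chunk in response.iter_content(chunk_size=1 << 20):
                fh.write(chunk)
                digest.update(chunk)
    if expected_md5 and digest.hexdigest() != expected_md5:
        raise ValueError("checksum mismatch for %s" % filename)
    return path


if __name__ == "__main__":
    for webdata_file in list_webdata(collection_id=1234):  # hypothetical collection ID
        print(download(webdata_file))

The real utility will of course need to handle retries, resumption of interrupted transfers, and logging, which this sketch glosses over.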

 

We will be working in week-long sprints, with communication on Slack in the DLSS #web-archiving channel. We expect to spend the first week of May on onboarding to the web archiving codebases and architecture, task analysis and breakdown, and work planning, with implementation of the core requirement taking 1-3 weeks thereafter. If time allows, we’ll pursue the stretch goals.

 

Please let me know if you have any questions.

 

Thanks!

 

~Nicholas
