revisiting the export API use cases


Nicholas Taylor

Feb 25, 2016, 7:14:45 PM
to WASAPI-Community
Hi everybody,

The Technical Working Group (https://www.imls.gov/sites/default/files/proposal_narritive_lg-71-15-0174_internet_archive.pdf?page=9) will hold its first face-to-face meeting of the grant at the end of March. With a month to go (and having consulted the handy project timeline: https://www.imls.gov/sites/default/files/proposal_narritive_lg-71-15-0174_internet_archive.pdf?page=13), now seems like a good time to revisit the motivating use cases and possible features for the proposed export API(s).

We articulated two high-level use cases in the proposal:
  • Facilitating distributed local preservation: at the most basic level, this means making it easier to replicate W/ARC data from one repository to another. It may also logically include associated metadata to enable integration with the collection management layer.
  • Standardizing research data delivery: enabling researchers to request and receive W/ARC and/or derivative web archive data over the network.
Some questions I have as I think about translating these use cases into the grant work package and/or candidate features for the future roadmap:
  • What other high level use cases, other than those above, or what practical examples do folks have of what the export API(s) will enable you to do? Two examples we're interested in are streamlining replication of data from Archive-It to our local Hydra repository and packaging up the contents of the LOCKSS-USDOCS network into WARCs to upload into the Internet Archive Wayback Machine.
  • What web archive data formats could/should be served by an export API, beyond W/ARC?
  • It seems like a query and extraction API, one that would allow a user to request repackaged or derivative W/ARC data that requires processing by the source repository, would be the next most logical complement to an export API. Do folks agree and, if so, do we think that is in the scope of work for the grant? On a related note, the OpenWayback team has been soliciting feedback for CDX Server APIs (https://github.com/iipc/openwayback/wiki/CDX-Server-requirements). These could be a good framework for a query and extraction API and, in fact, the ArchiveSpark (https://github.com/helgeho/ArchiveSpark) developers have been using them in just that way.
Other questions or comments? Any input you could provide would be greatly appreciated, and will help shape the first (among hopefully many) web archiving APIs we work on together.

Thanks!

~Nicholas

Jefferson Bailey

Feb 28, 2016, 11:56:58 PM
to WASAPI-Community
Thanks for kicking this off, Nicholas. I'll take a stab at some questions and raise a few others.
  • What other high level use cases, other than those above, or what practical examples do folks have of what the export API(s) will enable you to do? Two examples we're interested in are streamlining replication of data from Archive-It to our local Hydra repository and packaging up the contents of the LOCKSS-USDOCS network into WARCs to upload into the Internet Archive Wayback Machine.

  • 1) Transfer from service to partner: similar to what we call the "GetArcs" portal (https://webarchive.jira.com/wiki/display/ARIH/Partner+Guide+to+Downloading+Archive-It+Data). In this case a logged-in user defines a collection and gets a segmented list of W/ARCs. AIT partners have built local tooling around this method for automated replication. The use case is mostly local preservation. Methods for grabbing the files vary, from manual to automated. We could get more info from those doing this as needed.
    2) Transfer from collector to service: Use case above, but also externally crawled W/ARCs to, for instance, AIT or a cloud service.
    3) Transfer from multi-collection repository to researcher or non-custodian. Basic researcher use case.
    4) Transfer of a portion (collection, timespan, derivative, etc) of an institution's WARCs from a service or 3rd party repo to an approved researcher or 3rd party user. Less basic researcher use case.
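
For the automated replication in use case 1, here is a minimal sketch of the client-side step, assuming the service hands back a manifest of W/ARC filenames with SHA-1 checksums. The manifest shape and the field names `filename` and `sha1` are hypothetical, not what GetArcs actually returns. It only decides which files still need to be fetched; the download itself is left out.

```python
import hashlib
from pathlib import Path


def sha1_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_fetch(manifest: list[dict], local_dir: Path) -> list[dict]:
    """Return manifest entries that are missing locally or fail their checksum.

    Each entry is assumed (hypothetically) to look like
    {"filename": "...warc.gz", "sha1": "<hex digest>"}.
    """
    pending = []
    for entry in manifest:
        local = local_dir / entry["filename"]
        if not local.is_file() or sha1_of(local) != entry["sha1"]:
            pending.append(entry)
    return pending
```

Running `files_to_fetch` again after a transfer gives a simple fixity check: an empty list means every listed W/ARC is present and matches its checksum.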

  • What web archive data formats could/should be served by an export API, beyond W/ARC?

  • WATs, WANEs, CDX, "Parsed" (i.e. extracted) text, link/embed info, screenshots, etc. I'm sure there are others we can come up with. A question likely to emerge here is how much is worth processing pre-export vs. how much is worth doing locally after receipt of whatever.gz. The route we have gone with the ARS Workshop (https://github.com/vinaygoel/ars-workshop), IPython notebooks, and ArchiveSpark is a handful of defined pre-derived formats with more fine-grained extraction done locally.

  • It seems like a query and extraction API, one that would allow a user to request repackaged or derivative W/ARC data that requires processing by the source repository, would be the next most logical complement to an export API. Do folks agree and, if so, do we think that is in the scope of work for the grant? On a related note, the OpenWayback team has been soliciting feedback for CDX Server APIs (https://github.com/iipc/openwayback/wiki/CDX-Server-requirements). These could be a good framework for a query and extraction API and, in fact, the ArchiveSpark (https://github.com/helgeho/ArchiveSpark) developers have been using them in just that way.

  • Yes, in scope. We (IA/AIT) have been working on some of these features outside of the grant work and can report in March on work on APIs for derivatives, CDX stuff, ArchiveSpark, and other fun. More to come there.
I suppose one thing that comes to mind is our potential move to writing seed-specific WARCs in AIT and how that might (or might not) impact an API. Things like seed, crawl job, collection, and collector could all potentially be queryable via an export API. This is more applicable to AIT than to others, I realize. W/ARCs are such arbitrary blobs that you can both apply lots of data to them and apply very little data to them! Good times.
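
To make the queryable-fields idea a bit more concrete, here is a rough sketch of filtering a W/ARC file listing by those attributes. The `WarcFile` record and `select` function are purely illustrative names, not an existing AIT interface, and `seed` is optional since it would only be known if WARCs were written per seed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WarcFile:
    """Hypothetical per-file metadata an export API might expose."""
    filename: str
    collection: str
    crawl_job: str
    collector: str
    seed: Optional[str] = None  # only known if WARCs are written per seed


def select(files, **criteria):
    """Return the files whose attributes equal every given criterion."""
    unknown = set(criteria) - set(WarcFile.__dataclass_fields__)
    if unknown:
        raise ValueError(f"unknown query fields: {sorted(unknown)}")
    return [
        f for f in files
        if all(getattr(f, key) == value for key, value in criteria.items())
    ]
```

For example, `select(files, collection="1234", crawl_job="job-7")` would return only the files from that crawl job in that collection, while a W/ARC with no seed recorded would simply never match a `seed=` query.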

I'd also be interested to hear if we think things like authentication and permissioning are in scope or are more local/implementation concerns.

Cheers,
Jefferson