Configuring a remote ResourceStore


Ben O'Brien

Sep 5, 2016, 8:22:44 PM
to openwayback-dev
Hello all,


I've found myself wanting to set up and test a remote resource store in OpenWayback recently. Initially I was excited to see a link on the Advanced-configuration wiki page, 'Configuring a remote ResourceStore', only to find it was a placeholder :(

So in the interest of generating some content for that page - does anybody have an example of configuring a remote ResourceStore?


Cheers,
Ben

Lauren Ko

Sep 8, 2016, 4:24:32 PM
to openway...@googlegroups.com
Hi Ben,
If you are using a FlatFileResourceFileLocationDB as described at https://github.com/iipc/openwayback/wiki/How-to-configure#telling-openwayback-where-to-find-your-arc-and-warc-files then in your path-index.txt file you would put the URL where the ARC/WARC files are being served instead of just a local path. You can then serve the WARC files from wherever you want via any web server, such as Apache. Is that what you want to do?
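
To illustrate the idea, here is a minimal sketch of how a path-index line maps a file name to a location. It assumes the tab-separated `name<TAB>location` layout described on the wiki page; the class name `PathIndexSketch`, the file names, and the host `warcs.example.org` are made up for illustration and are not part of OpenWayback.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not OpenWayback code. It only shows that the
// location column can hold an HTTP URL just as well as a local path.
//
// Example path-index.txt lines (made-up names and host, tab-separated):
//   example-0001.warc.gz<TAB>http://warcs.example.org/example-0001.warc.gz
//   example-0002.warc.gz<TAB>/data/warcs/example-0002.warc.gz
final class PathIndexSketch {

    /** Maps each file name to its location; lines without a tab are skipped. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> locations = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length >= 2) {
                // First field is the ARC/WARC file name, second is where to fetch it.
                locations.put(fields[0], fields[1]);
            }
        }
        return locations;
    }
}
```

Looking up `example-0001.warc.gz` in the resulting map would give back the URL, which is what the web server needs to answer for.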

Lauren Ko
UNT Libraries

--
You received this message because you are subscribed to the Google Groups "openwayback-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openwayback-dev+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ben O'Brien

Sep 11, 2016, 7:15:56 PM
to openwayback-dev
Hi Lauren,

Thanks for your reply.

Not exactly, I want to handle that 'path-index' functionality separately from OW. 
I was hoping I could write a servlet to act as the remote resource store to OW, which will look up the warc location on the fly. I see your point about serving the warcs via a webserver and using the path-index file with URLs. But it seemed nicer (in my head) if I could just serve the warc location via an external service, removing the path-index flat file step altogether.

The context is that we are trying to use OW as a viewer from our preservation system, which has a growing web archive. For a growing collection the remote resource store seemed more of a fit than using a path-index file.


Cheers,
Ben



Alex Osborne

Sep 12, 2016, 8:20:36 PM
to openwayback-dev
Hi Ben,

There's an example in RemoteCollection.xml.

https://github.com/iipc/openwayback/blob/master/wayback-webapp/src/main/webapp/WEB-INF/RemoteCollection.xml#L33

Note that you can configure the resourceStore independently of the resourceIndex. So if you want to use a local CDX resourceIndex with a remote resourceStore, just put the appropriate stanzas from the example CDXCollection.xml and RemoteCollection.xml into a single WaybackCollection.

Note also that the server for the resource store should support HTTP/1.1 range requests. This is so that Wayback can retrieve just the record it's interested in and not the whole WARC file. Most regular web servers like Apache and nginx will do this out of the box, but if you implement your own servlet it's something you'll need to take care of. A common scenario is a servlet proxying to multiple backend servers that have the actual files. In that case just make sure to also proxy the request and response headers and status code. If your servlet is to serve the files directly off disk, or via, say, calls to a preservation system API, you might need to take care of the range headers yourself.
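
As a rough illustration of "taking care of the range headers yourself", here is a minimal sketch of the single-range case using only the JDK. The class and method names (`SingleRangeSketch`, `parse`, `copy`) are hypothetical and not taken from OpenWayback or bamboo; the servlet wiring is left out, and suffix and multi-part ranges are deliberately ignored.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;

// Hypothetical helper: handles only "bytes=START-" and "bytes=START-END".
final class SingleRangeSketch {

    /** First and last byte (inclusive) to send back. */
    static final class ByteRange {
        final long start;
        final long end;

        ByteRange(long start, long end) {
            this.start = start;
            this.end = end;
        }

        long length() {
            return end - start + 1;
        }

        /** Value for the Content-Range response header. */
        String contentRange(long fileLength) {
            return "bytes " + start + "-" + end + "/" + fileLength;
        }
    }

    /**
     * Returns null when the header is missing or in a form this sketch does
     * not handle (the caller can then send the whole file with status 200).
     * Throws IllegalArgumentException when the range starts past the end of
     * the file (the caller would answer 416).
     */
    static ByteRange parse(String header, long fileLength) {
        if (header == null || !header.startsWith("bytes=")) return null;
        String spec = header.substring("bytes=".length()).trim();
        int dash = spec.indexOf('-');
        if (spec.contains(",") || dash <= 0) return null; // multi-range or suffix range
        long start;
        long end;
        try {
            start = Long.parseLong(spec.substring(0, dash));
            String endText = spec.substring(dash + 1);
            end = endText.isEmpty()
                    ? fileLength - 1 // open-ended: read to the end of the file
                    : Math.min(Long.parseLong(endText), fileLength - 1);
        } catch (NumberFormatException e) {
            return null;
        }
        if (start >= fileLength) throw new IllegalArgumentException("unsatisfiable range");
        if (end < start) return null; // syntactically invalid, ignore the header
        return new ByteRange(start, end);
    }

    /** Copies exactly the requested bytes from the file to the response stream. */
    static void copy(RandomAccessFile file, ByteRange range, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long remaining = range.length();
        file.seek(range.start);
        while (remaining > 0) {
            int n = file.read(buffer, 0, (int) Math.min(buffer.length, remaining));
            if (n < 0) break; // file shorter than expected
            out.write(buffer, 0, n);
            remaining -= n;
        }
    }
}
```

In a servlet, a non-null result would typically translate into a 206 status with `Content-Range` and `Content-Length` set from the range, followed by `copy` into the response output stream.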

Here's the relevant RFC for range requests:

https://tools.ietf.org/html/rfc7233

My implementation, which currently looks up the path in a database and serves from disk, is here:

https://github.com/nla/bamboo/blob/32d7f2e/ui/src/bamboo/crawl/WarcsController.java#L132

Cheers,

Alex

Ben O'Brien

Sep 15, 2016, 4:28:09 AM
to openwayback-dev
Hi Alex,

Thanks, it's starting to make a bit more sense now.

I notice your implementation supports multiple range requests. Does OpenWayback send multi-range requests?


Cheers,
Ben

Alex Osborne

Sep 15, 2016, 9:24:57 PM
to openway...@googlegroups.com
Hi Ben,

Wayback only makes single open-ended range requests, i.e. one start offset with no end:

https://github.com/iipc/openwayback/blob/829693b6d43f40b1b045f08611a4fa5e27395e29/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/TimeoutArchiveReaderFactory.java#L68

So you can skip implementing multiple ranges if you like.
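
For reference, a single open-ended range request looks roughly like the sketch below on the client side. This is not the code behind the link above, just a JDK-only illustration; `OpenRangeRequestSketch` and `openAt` are hypothetical names.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical illustration of a client asking for "everything from this offset on".
final class OpenRangeRequestSketch {

    /** Prepares (but does not yet send) a request for the bytes from offset onwards. */
    static URLConnection openAt(URL warcUrl, long offset) throws IOException {
        URLConnection connection = warcUrl.openConnection();
        // One range, a start offset and no end: "bytes=<offset>-"
        connection.setRequestProperty("Range", "bytes=" + offset + "-");
        return connection;
    }
}
```

A server that honours the header answers with status 206 and a body that begins at the requested offset, so a resource store only has to understand this one form of the `Range` header to be usable here.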

Cheers,

Alex
