New public web interface to the Memento aggregator

Andy J

Dec 18, 2012, 11:32:28 AM

to memen...@googlegroups.com

Hi,

Just wanted to let you know that here at the UK Web Archive we've just launched a new service that leverages the Memento protocol and aggregator to make it easy for anyone to explore the archival history of any URL. For example, here is the result for the British Library website:

http://www.webarchive.org.uk/mementos/search/http://www.bl.uk

A bookmarklet is also provided, to make it easy to perform a look-up from your browser when you visit a site of interest, or a 404.

The implementation is fairly straightforward. The URL is looked up via the LANL TimeGate and the Link information is pulled from the TimeMap (this is done using a slightly cleaned up version of the code behind the Memento Android application, as hosted here https://github.com/ukwa/mementoweb-client-java). This information is then composed into graphs and tables, and combined with thumbnails of rendered versions of some of the mementos (via PhantomJS).
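As a rough illustration of the TimeMap step described above, here is a minimal Python sketch that parses a link-format TimeMap and counts mementos per year, the kind of summary a graph could be built from. It is not the code from mementoweb-client-java; the function names, the `archive.example` URIs, and the sample entries are assumptions made for the example, and it presumes the TimeMap uses the Memento link format with `rel` and `datetime` attributes.

```python
import re
from email.utils import parsedate_to_datetime

# One link entry: <uri> followed by zero or more ;key="value" attributes.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*"[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')

# Hypothetical sample data, not taken from a real archive.
SAMPLE = (
    '<http://www.bl.uk>; rel="original",\n'
    '<http://archive.example/19970101000000/http://www.bl.uk>; '
    'rel="first memento"; datetime="Wed, 01 Jan 1997 00:00:00 GMT",\n'
    '<http://archive.example/20120601000000/http://www.bl.uk>; '
    'rel="last memento"; datetime="Fri, 01 Jun 2012 00:00:00 GMT"'
)


def parse_timemap(text):
    """Return one dict per link entry: uri, list of rel tokens, datetime string."""
    links = []
    for uri, raw_params in LINK_RE.findall(text):
        params = dict(PARAM_RE.findall(raw_params))
        links.append({
            "uri": uri,
            "rel": params.get("rel", "").split(),
            "datetime": params.get("datetime"),
        })
    return links


def mementos_per_year(links):
    """Count entries whose rel includes 'memento', grouped by capture year."""
    counts = {}
    for link in links:
        if "memento" in link["rel"] and link["datetime"]:
            year = parsedate_to_datetime(link["datetime"]).year
            counts[year] = counts.get(year, 0) + 1
    return counts
```

Calling `mementos_per_year(parse_timemap(SAMPLE))` gives a year-to-count mapping that could feed a chart or table.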

You can find a full description here http://www.webarchive.org.uk/ukwa/info/mementos and the source code here https://github.com/ukwa/mementoweb-webclient

We hope that this will help drive interest in web archives in general and the Memento protocol in particular. Feedback welcome!

Thanks,

Andy Jackson

Ed Summers

Dec 18, 2012, 11:44:18 AM

to memen...@googlegroups.com

Just curious, what's the best source of information about how the
Memento TimeGate at LANL was assembled and how it is kept up to date?

//Ed

Robert Sanderson

Dec 18, 2012, 11:50:38 AM

to memento-dev

The technical details haven't been published formally as, frankly, they're currently not very good :)

We're working on a more robust and stable implementation that can be shared.

Some of the code that we've produced is available at:

http://code.google.com/p/memento-server/

Which includes Berkeley DB and Cassandra based back ends, plus the WSGI/web side handlers.

In terms of being kept up to date, we periodically clear the local cache. In the not too distant future, we'll have a local copy of several of the archives' indexes and can hence do the lookups from disk rather than as a distributed search.

Hope that helps,

Rob

Ed Summers

Dec 18, 2012, 12:03:11 PM

to memen...@googlegroups.com

Is having a local copy really a feasible way going forward? I would've
thought a very simple known URL search API that web archives could
implement, in combination with a (public) list of known endpoints
would've been easier to manage going forward. SPOFs and all.

//Ed

Robert Sanderson

Dec 18, 2012, 12:08:52 PM

to memento-dev

The issue is one of user experience rather than best practice, unfortunately. As modern web pages are generated from hundreds of resources, if each URI has to be searched in 10+ remote systems, and the slowest system is, say, 2 seconds to respond, it could be five minutes before the page is assembled. No user is going to find that acceptable.
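To make the arithmetic behind that estimate concrete, here is a back-of-the-envelope sketch. It assumes the archives are queried in parallel for each URI, so each lookup costs as much as the slowest archive, and that resources are looked up in batches. The figures of 150 resources and 6 concurrent lookups are illustrative assumptions, not measurements.

```python
def page_assembly_seconds(n_resources, slowest_archive_seconds, concurrent_lookups=1):
    """Estimate the wall-clock time to look up every resource on a page.

    Each lookup is assumed to wait for the slowest archive to answer;
    lookups run in batches of `concurrent_lookups`.
    """
    batches = -(-n_resources // concurrent_lookups)  # ceiling division
    return batches * slowest_archive_seconds


# One URI at a time: 150 lookups * 2 s = 300 s, i.e. five minutes.
sequential = page_assembly_seconds(150, 2.0)

# Six lookups in flight at once: 25 batches * 2 s = 50 s.
batched = page_assembly_seconds(150, 2.0, concurrent_lookups=6)
```

Even with several lookups in flight, the total remains far slower than loading a live page under these assumptions.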

Rob

Ed Summers

Dec 18, 2012, 12:34:50 PM

to memen...@googlegroups.com

Yes, but I could imagine some researchers who would be willing to wait
for a more comprehensive result. Also, there could be some heuristics
for speeding things up. For example, if a hit for an HTML page is
found in a particular archive, it's probably worthwhile to first check
that archive for a PNG that is referenced in that HTML, rather than
waiting for 10+ remote systems to respond.
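A minimal sketch of that heuristic, assuming each archive can be wrapped as a lookup function that returns a memento URI or None. The names `find_memento` and `make_lookup`, the archive names, and the in-memory holdings are all hypothetical stand-ins for real remote queries.

```python
calls = []  # records which archives were queried, for illustration only


def make_lookup(name, holdings):
    """Build a stand-in for a remote archive query backed by a dict."""
    def lookup(uri):
        calls.append(name)
        return holdings.get(uri)
    return lookup


def find_memento(uri, archives, preferred=None):
    """Query archives one by one, trying `preferred` first if it is known.

    `archives` maps an archive name to a lookup callable. Returns
    (archive_name, memento_uri), or (None, None) if no archive has a copy.
    """
    order = list(archives)
    if preferred in archives:
        order.remove(preferred)
        order.insert(0, preferred)
    for name in order:
        hit = archives[name](uri)
        if hit is not None:
            return name, hit
    return None, None


PNG = "http://example.org/logo.png"
archives = {
    "archive-a": make_lookup("archive-a", {}),
    "archive-b": make_lookup(
        "archive-b", {PNG: "http://archive-b.example/2012/" + PNG}
    ),
}
```

If the HTML page was found in `archive-b`, calling `find_memento(PNG, archives, preferred="archive-b")` queries only that archive before returning, while omitting `preferred` falls back to trying the archives in their listed order.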

I think there is a sweet spot for a documented, easy to implement way
to publish and share the inventories of web archives so that anyone
could aggregate them and then create services like the one you have
built at LANL. But perhaps this is not the venue for that discussion.

//Ed

John Erickson

Dec 18, 2012, 1:01:22 PM

to memen...@googlegroups.com

This is a fascinating discussion!

I would argue that as we "look back" in time using infrastructure
"like" Memento, the ability to reliably recover pages accurately (or
at all) should outweigh the user experience. Yes, it would be nice for
historical pages to render up with the same level of performance as
current pages, but it should not be essential.

An analogy might be found in BitTorrent, which we can (usually) rely
on to retrieve esp. large files from multiple sources *eventually*,
esp. when single-source downloads using conventional protocols on
wobbly telecom infrastructure can't be trusted to complete (try living
in rural America).

John

--
John S. Erickson, Ph.D.
Director, Web Science Operations
Tetherless World Constellation (RPI)
<http://tw.rpi.edu> <olyer...@gmail.com>
Twitter & Skype: olyerickson

Robert Sanderson

Dec 18, 2012, 1:06:45 PM

to memento-dev

Hi John, Ed,

I didn't mean to imply that the fast but potentially lossy method was the only or best method for everyone, just that our choice for implementation was focused on the general end user who wants to see something rather than the researcher who is willing to wait for a potentially higher quality representation.

We would be very keen to see other implementations that made different choices! :)

Rob
