Getting links to mementos of 404 pages etc


Barry Hunter

Feb 17, 2017, 3:40:39 AM
to Memento Development
Not sure where this limitation exists, in the Time Travel Service or the Archive.org API, or whether I am just missing something, but I notice that for some URLs I'm getting a 'closest' link to an archive of a 404 page.

For example:
gives
But if I visit http://web.archive.org/web/20160303230846/http://www.devizesheritage.org.uk/railway_devizes.html, it's actually a 404 page.

When I query the archive.org available API directly
I get
{"archived_snapshots":{"closest":{"available":true,"url":"http://web.archive.org/web/20150322113432/http://www.devizesheritage.org.uk:80/railway_devizes.html","timestamp":"20150322113432","status":"200"}}}

I.e. it's a link to an archive of a 200 OK (and appears to be a good snapshot).

This memento is listed in the response to the original request under 'prev', but with no way to know that I should use it (other than, I guess, requesting each memento URL to check its status).
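A minimal sketch of the check described above, using the availability API response quoted in this post as input (no network access). The function name `closest_ok_snapshot` is made up for illustration, and treating only status "200" as acceptable is an assumption about what counts as a 'valid' snapshot.

```python
import json

# Response body copied from the availability API output quoted above.
RESPONSE = (
    '{"archived_snapshots":{"closest":{"available":true,'
    '"url":"http://web.archive.org/web/20150322113432/'
    'http://www.devizesheritage.org.uk:80/railway_devizes.html",'
    '"timestamp":"20150322113432","status":"200"}}}'
)


def closest_ok_snapshot(body):
    """Return the 'closest' snapshot URL if it is available with status 200, else None."""
    closest = json.loads(body).get("archived_snapshots", {}).get("closest")
    if not closest:
        return None  # the API returns an empty object when nothing is archived
    if closest.get("available") and closest.get("status") == "200":
        return closest["url"]
    return None


snapshot_url = closest_ok_snapshot(RESPONSE)
print(snapshot_url)
```

The status is compared as a string because that is how it appears in the quoted response.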

I did try constructing an archive.org timemap link (which I think the Time Travel Service uses internally, based on http://labs.mementoweb.org/aggregator_config/archivelist.xml),
which shows ... 
<http://web.archive.org/web/20160303230846/http://www.devizesheritage.org.uk/railway_devizes.html>; rel="last memento"; datetime="Thu, 03 Mar 2016 23:08:46 GMT"

... with no real indication that the 'last memento' is not actually a snapshot of a 200 OK page. So I guess it's a limitation of this protocol that it can't transport this info.
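To make the limitation concrete, here is a rough sketch that parses the link-format entry quoted above. The regular expressions are a simplification for this one entry and not a full application/link-format parser; `parse_link_entry` is a hypothetical helper name.

```python
import re
from email.utils import parsedate_to_datetime

# The TimeMap entry quoted above, split across lines only for readability.
ENTRY = (
    '<http://web.archive.org/web/20160303230846/'
    'http://www.devizesheritage.org.uk/railway_devizes.html>; '
    'rel="last memento"; datetime="Thu, 03 Mar 2016 23:08:46 GMT"'
)

URI_RE = re.compile(r'^\s*<([^>]*)>')
ATTR_RE = re.compile(r';\s*([\w-]+)\s*=\s*"([^"]*)"')


def parse_link_entry(entry):
    """Split one link-format entry into its URI and its quoted attributes."""
    uri_match = URI_RE.match(entry)
    if uri_match is None:
        raise ValueError("not a link-format entry: %r" % entry)
    attrs = dict(ATTR_RE.findall(entry[uri_match.end():]))
    when = attrs.get("datetime")
    return {
        "uri": uri_match.group(1),
        "rel": attrs.get("rel", "").split(),
        "datetime": parsedate_to_datetime(when) if when else None,
        "attrs": attrs,
    }


parsed = parse_link_entry(ENTRY)
print(sorted(parsed["attrs"]))  # only 'datetime' and 'rel' are present
```

Nothing in the parsed attributes says whether the capture was a 200 or a 404, so a client has to dereference the memento to find out.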

I did happen across
which does note the HTTP status, but I'm not sure of the specification for that API.


I'm trying to implement something not unlike http://robustlinks.mementoweb.org/ but making sure that I link to a 'valid' snapshot (i.e. I don't want to link to the archive of a 404 page :)




Thanks for any pointers. 




Barry Hunter

Feb 17, 2017, 11:37:24 AM
to Memento Development
Digging some more into this, it seems that there are indeed different timemap formats
All work, but give the results in different formats. The 'link' format does not include the HTTP status of the archived page.


Back to 
seems to suggest that archive.org is explicitly configured to use the "link" format
<link id="ia" longname="Internet Archive">
   <timemap uri="http://web.archive.org/web/timemap/link/" paging-status="2" redirect="no"/>

But others don't explicitly list a format.
<timemap uri="http://webarchive.proni.gov.uk/timemap/" paging-status="2" redirect="no"/>


So presumably the aggregator can choose a format? And hence maybe it could use the fuller JSON format (so it can skip non-200 mementos where possible)?


It doesn't look like the Time Travel aggregator (Find) itself is open source (so I can't try my own version with a modified archivelist.xml), but I could perhaps use the Prediction API (or just archivelist.xml itself), and then know to use http://web.archive.org/web/timemap/json/ rather than http://web.archive.org/web/timemap/link/ (I would have to check other timemaps to see if this conversion is worth doing for them too).
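A sketch of that conversion plus the filtering step. Loud assumption: the JSON timemap is taken to be an array of rows whose first row names the columns, including `timestamp`, `original` and `statuscode`; I have not found a specification for it. The sample body is a reduced, hand-written illustration built from the two captures discussed in this thread, not real API output, and the function names are made up.

```python
import json


def json_timemap_uri(link_timemap_uri):
    """Swap the /timemap/link/ prefix for /timemap/json/ (Internet Archive URIs only)."""
    return link_timemap_uri.replace("/timemap/link/", "/timemap/json/", 1)


# Hand-written sample in the ASSUMED layout: header row first, then one row per capture.
SAMPLE_BODY = json.dumps([
    ["timestamp", "original", "statuscode"],
    ["20150322113432", "http://www.devizesheritage.org.uk:80/railway_devizes.html", "200"],
    ["20160303230846", "http://www.devizesheritage.org.uk/railway_devizes.html", "404"],
])


def ok_mementos(body, wanted="200"):
    """Return memento URLs for the rows whose statuscode equals `wanted`."""
    rows = json.loads(body)
    if not rows:
        return []
    header, *records = rows
    urls = []
    for record in records:
        row = dict(zip(header, record))
        if row.get("statuscode") == wanted:
            urls.append("http://web.archive.org/web/%s/%s" % (row["timestamp"], row["original"]))
    return urls


print(json_timemap_uri("http://web.archive.org/web/timemap/link/http://www.devizesheritage.org.uk/railway_devizes.html"))
print(ok_mementos(SAMPLE_BODY))
```

Reading columns by header name rather than by position keeps the filter working if the archive adds or reorders columns.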

Herbert Van de Sompel

Feb 19, 2017, 4:13:12 AM
to memen...@googlegroups.com, Herbert Van de Sompel
hi Barry,

Thanks for your mails. Here's some feedback:

(*) We weren't even aware that the Internet Archive had started providing TimeMaps in JSON and we regret that they chose to use a format different than the one we introduced already in January 2015 (see http://mementoarchive.lanl.gov/twa/memento/20150126232327/http://mementoweb.org/guide/timemap-json/). We will need to determine how to proceed on this. Note that the JSON format is not specified in the Memento protocol (RFC 7089); that only specifies the application/link-format serialization.

(*) You are correct that the Memento protocol provides no functionality regarding status code of archived pages. When compiling the protocol, this need never came up but I agree that it is rather relevant for use of the protocol with web archives; probably less so for resource versioning systems. In order to support status code, an extension would need to be specified and the Memento team would definitely be interested to do so, hopefully with input from the web archive community at large:

- Status code in TimeMaps: This would be pretty straightforward. We would just need to add an attribute for each Memento entry in a TimeMap, e.g. "status". I assume that supporting this would not be too hard for most web archives. But this would need to be verified. This attribute could obviously be provided in both the application/link-format and JSON formats. Maybe this could even be taken up rather quickly by the developers of the main web archiving software, OpenWayback and pywb? If they are listening in on this conversation, it would be great to hear from them.

- Status code for datetime negotiation: When negotiating with a TimeGate, a client should be able to specify the preferred status code of the Memento, in addition to the preferred archival datetime. This could probably be done using the Prefer request header specified in RFC 7240. A question that comes up is whether this should only support requesting "200" Mementos or whether more expressiveness would be required, e.g. expressing preference for any code that's not a 4xx, any 2xx code, any 2xx or 3xx code, etc. And if so, the question becomes how hard this might be to implement on the archive side. Again, I would be very interested in feedback from developers of web archiving software.

- The LANL Memento Aggregator is indeed not open source, let's just say for historical reasons. But Old Dominion's MemGator is, see https://github.com/oduwsdl/memgator

- The Time Travel Archive Registry only lists URIs for application/link-format TimeMaps because that's the format all Memento-compliant archives support. The JSON format was defined for the Aggregator to make consumption by certain clients easier.
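On the expressiveness question raised under datetime negotiation above, the following sketch shows one way the listed preferences ("200" only, any 2xx, any 2xx or 3xx, anything that's not a 4xx) could be evaluated on the archive side. The pattern syntax ("x" as a digit wildcard, "!" for negation, commas for alternatives) is purely hypothetical: it is not defined by RFC 7240, RFC 7089, or any Memento extension.

```python
def status_matches(status, preference):
    """Check a status code string against a HYPOTHETICAL preference expression.

    Illustrative syntax only: "200" is an exact code, "2xx" uses "x" as a
    digit wildcard, a leading "!" negates, and commas separate alternatives.
    """
    def matches_one(pattern):
        negate = pattern.startswith("!")
        if negate:
            pattern = pattern[1:]
        hit = len(pattern) == len(status) and all(
            p in ("x", s) for p, s in zip(pattern.lower(), status)
        )
        return hit != negate

    return any(matches_one(p.strip()) for p in preference.split(",") if p.strip())


print(status_matches("200", "200"))      # exact match
print(status_matches("301", "2xx,3xx"))  # any 2xx or 3xx
print(status_matches("404", "!4xx"))     # anything that's not a 4xx
```

Even this small grammar needs decisions (for example, how negation combines with alternatives), which hints at the implementation cost of going beyond a plain "200" preference.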

Cheers

Herbert



--

---
You received this message because you are subscribed to the Google Groups "Memento Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to memento-dev+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
Herbert Van de Sompel
Digital Library Research & Prototyping
Los Alamos National Laboratory, Research Library
http://public.lanl.gov/herbertv/
http://orcid.org/0000-0002-0715-6126

==

Ainsworth, Scott G.

Feb 19, 2017, 8:42:38 AM
to <memento-dev@googlegroups.com>
Herbert and Barry,

Internet Archive’s JSON contains some potentially useful attributes: statuscode and digest.  Do we know what these actually mean (I have not found a definition on the IA site)?  Take statuscode: is it the status from the origin server response (URI-R status), or is it the status that will be returned by a request for the memento (URI-M)?  Although these seem like they should be the same, they are not.

Assume two URI-Ms for the same URI-R captured at times T1 and T2, both status 200 and with identical response bodies.  It is possible that a request for URI-M(T1) will be redirected (302) to URI-M(T2).  (I think this might be for storage efficiency purposes, but I don’t really know.)  Will the timemap statuses be 200/200 or 302/200?  If 302/200, how does the timemap consumer know whether the 302 is an original 302 or a redirect from URI-M(T1) to URI-M(T2) without dereferencing URI-M(T1)?  So although adding the status to timemaps has the potential to be useful, unless it is clearly defined and matches the response that will be returned when the URI-M is dereferenced, any memento selection heuristic using this status will need to fall back to less efficient methods in order to produce deterministic results.

Digest might also be useful, but simple digests have limited usefulness because they allow determination of equality, not equivalence.  For example, if two images look exactly the same when rendered by a browser (say a GIF and a PNG), the MD5s will be different.  Casual users probably don’t care, but an intellectual property lawyer might.  Thus, there are different definitions of equivalence based on use case.  Which, if any, should be captured in a timemap?
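A small illustration of the equality-versus-equivalence point, using text instead of images to keep it self-contained. The two payloads are made up, SHA-1 is used only as an example digest, and collapsing line endings is just one possible, use-case-specific definition of equivalence.

```python
import hashlib

# Two made-up payloads that a browser would render identically:
# they differ only in their line endings.
capture_t1 = b"<p>Railway history</p>\n"
capture_t2 = b"<p>Railway history</p>\r\n"


def digest(payload):
    """Digest of the raw bytes: detects byte-for-byte equality only."""
    return hashlib.sha1(payload).hexdigest()


def normalised_digest(payload):
    """Digest after collapsing line endings: ONE possible notion of equivalence."""
    return hashlib.sha1(payload.replace(b"\r\n", b"\n")).hexdigest()


same_bytes = digest(capture_t1) == digest(capture_t2)
same_after_normalising = normalised_digest(capture_t1) == normalised_digest(capture_t2)
print(same_bytes, same_after_normalising)
```

A different use case would need a different normalisation step, which is exactly why it is unclear which definition a timemap should carry.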

Thanks,
Scott G. Ainsworth


Andy J

Apr 28, 2017, 11:09:45 AM
to Memento Development
I've opened up an issue on OpenWayback to look at this: https://github.com/iipc/openwayback/issues/345

We could easily publish the status code of the URI-R, but as you say Scott, this may not always be entirely consistent with what happens when you visit the URI-M. However, it may still be useful to publish the information under some conditions, e.g. for most records but not for revisits.

IMO publishing digests would still be potentially useful, even though it only addresses binary-identical resources (broader notions of equivalence are very context-dependent so I'm tempted to avoid that challenge for now). However, it would be a bit clumsy because the type of hash and its encoding are not very well standardised - it's mostly Base32-encoded SHA-1 (IIRC) but that's not universal.
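For reference, a sketch of the two encodings of the same SHA-1 hash and of comparing digests across them. The helper names are invented for illustration, and the rule that a 32-character value is Base32 while a 40-character value is hex is an assumption that only holds for SHA-1.

```python
import base64
import hashlib


def sha1_base32(payload):
    """SHA-1 of the payload, Base32-encoded (32 characters, no padding needed)."""
    return base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")


def sha1_hex(payload):
    """The same hash, hex-encoded (40 characters)."""
    return hashlib.sha1(payload).hexdigest()


def raw_sha1(encoded):
    """Decode a Base32 or hex SHA-1 string to raw bytes so the two can be compared."""
    if len(encoded) == 32:
        return base64.b32decode(encoded.upper())
    if len(encoded) == 40:
        return bytes.fromhex(encoded)
    raise ValueError("unrecognised SHA-1 encoding: %r" % encoded)


payload = b"example payload"
print(sha1_base32(payload))
print(sha1_hex(payload))
print(raw_sha1(sha1_base32(payload)) == raw_sha1(sha1_hex(payload)))
```

A consumer that wants to compare digests from different archives would need to know both the hash algorithm and the encoding, which the bare value does not convey.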

Best wishes,
Andy

Barry Hunter

May 2, 2017, 5:08:35 AM
to memen...@googlegroups.com
Just realised I never replied to this. Thanks for all the replies so far, I certainly understand the issue a lot better now.

 
We could easily publish the status code of the URI-R, but as you say Scott, this may not always be entirely consistent with what happens when you visit the URI-M.

FWIW, I don't think that matters. The status of the URI-R is what is important (to me at least): is the Memento likely to be 'working' (mainly, is it a 200 OK, or could it be a 30x, which I then need to dereference to see if the final one is OK)? The fact that accessing the URI-M may end up redirecting (to a different memento, or to a different page if the URI-R was a redirect), or being denied access or whatever, is a different issue.





I did end up using a local MemGator myself, but the underlying issue is unresolved. I have recorded some Memento URLs that are ultimately of non-functional pages (some redirects, some 404s, etc). If archives were updated to include the status code, I could have the aggregator cherry-pick better mementos :)
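A sketch of that cherry-picking, assuming each memento record already carries a status code, which archives do not generally expose today. The record layout and the name `pick_memento` are hypothetical; the two timestamps and their statuses are the ones discussed earlier in this thread.

```python
from datetime import datetime, timezone


def parse_timestamp(ts):
    """Parse a 14-digit archive timestamp (YYYYMMDDhhmmss) as UTC."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def pick_memento(mementos, target, acceptable=("200",)):
    """Return the record closest to `target` whose status is acceptable, or None."""
    candidates = [m for m in mementos if m["status"] in acceptable]
    if not candidates:
        return None
    return min(candidates, key=lambda m: abs(parse_timestamp(m["timestamp"]) - target))


# Hypothetical records built from the two captures mentioned in this thread.
MEMENTOS = [
    {"timestamp": "20150322113432", "status": "200"},
    {"timestamp": "20160303230846", "status": "404"},
]

target = datetime(2017, 2, 17, tzinfo=timezone.utc)
best = pick_memento(MEMENTOS, target)
print(best["timestamp"])  # the 2016 capture is nearer in time but is skipped as a 404
```

Passing `acceptable=("200", "301", "302")` would keep redirects as candidates, which then still need dereferencing to see where they end up.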


