Mementos, dark archives, and HTTP Status 451


Andy J

Apr 28, 2017, 11:40:34 AM
to Memento Development
Hello,

We recently made a fairly significant change to our web archive OpenWayback service, and I wanted to raise it here in case it causes any problems for Memento API users or aggregators.

We operate a large 'dark' archive of content collected under Legal Deposit legislation, only a small fraction of which can be made available over the open web. In an attempt to be more transparent about our holdings, our OpenWayback endpoint has been modified to emit an HTTP 451 status code when users attempt to access content that we hold but that is not available to them unless they visit us on site. For example:

https://www.webarchive.org.uk/wayback/archive/*/http://www.example.org

In our implementation, all Mementos and the TimeMaps and TimeGate for these resources return HTTP 451 only.

I'd like to know if this causes anyone any problems. I'd also like to know if there is a better way we should be doing this.

Thanks for your time.

Best wishes,
Andy Jackson

=-=-=-=-=-=-=-=
Dr Andrew N. Jackson
Web Archiving Technical Lead
01937 546602
@UKWebArchive
@anjacks0n
Blog: http://britishlibrary.typepad.co.uk/webarchive/

Martin Klein

May 1, 2017, 4:27:10 PM
to memen...@googlegroups.com
Hi Andy,

Thanks a lot for letting us know about the changes to the BL Web Archive. We are sad to see the open nature of the archive go but understand that institutional/political decisions are being made.
You mention that only a small fraction of the archive can be made openly available - do you have a sense of the ratio of dark vs. open content/URIs? 
Since requests for a dark URI against your TimeGate currently return a 451 and no Memento headers, our Memento-based services have to treat them like a 404. This means URI-Ms of dark resources will disappear from all of our services (aggregator, Memento for Chrome, TimeTravel, etc.).

Another option would be to have your Memento infrastructure operate as before but make the URI-Ms of dark resources return (or redirect to) the 451 and show the generic page stating "Available in Legal Deposit Library Reading Rooms only". Since in this scenario the TimeGate would do the conneg, return the headers, and the TimeMap would list all URI-Ms, the BL content would continue to be surfaced in all Memento services.
IMHO, this sort of hybrid model (you show what Mementos you have but don't make them accessible unless on-site) could pioneer other implementations of dark archives e.g., at the BNF, KB etc.
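
A rough sketch of what this hybrid model could look like on the archive side, assuming two hypothetical handlers, `timegate_response` and `memento_response`, that each return a (status, headers, body) tuple. None of these names come from OpenWayback, and the header set is deliberately incomplete. Whether the 451 response should carry a Memento-Datetime header is left open here; the sketch simply omits it.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Wording of the generic page as quoted in this thread.
READING_ROOM_NOTICE = "Available in Legal Deposit Library Reading Rooms only"


def timegate_response(uri_m):
    """Redirect to the selected URI-M as usual, whether or not it is dark."""
    # A real TimeGate would also send a Link header; omitted in this sketch.
    headers = {"Location": uri_m, "Vary": "accept-datetime"}
    return 302, headers, ""


def memento_response(captured_at, is_dark, archived_body):
    """Serve the capture if it is open, otherwise the generic 451 page."""
    if is_dark:
        # Assumption: the generic page carries no Memento headers.
        return 451, {}, READING_ROOM_NOTICE
    # captured_at must be a timezone-aware UTC datetime.
    headers = {"Memento-Datetime": format_datetime(captured_at, usegmt=True)}
    return 200, headers, archived_body


captured = datetime(2017, 4, 28, 11, 40, 34, tzinfo=timezone.utc)
print(memento_response(captured, True, "<html>...</html>")[0])  # 451
```

The point of the split is that only the last hop changes: the TimeGate and TimeMap keep answering as before, so aggregators still see that the capture exists.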

Out of curiosity, was the decision about which resources are dark made based on URI, time, or CDX file granularity? In other words, are there cases where, for one and the same URI-R, some URI-Ms are dark and others are not? If so, what does the TimeGate currently do with these? Does it give preference to the 451 or to the datetime (and the same for TimeMaps)? I assume the legal restrictions trump everything.


cheers
M



--

---
You received this message because you are subscribed to the Google Groups "Memento Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to memento-dev+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Andy J

May 1, 2017, 5:39:35 PM
to Memento Development
Hi Martin,

I see I managed to completely mess up explaining what has happened!

No content that was previously open has been hidden! All previously open access content is still there, and the volume of open content is likely to grow in the future. The problem was that the non-open-access stuff was entirely invisible, which I didn't like for a number of reasons.

This change effectively adds billions of 451s to an existing collection.

I had assumed the Memento aggregators would NOT want us to publish the TimeMaps of these resources because Memento aggregator users would then be directed to copies they cannot access easily 99.99999% of the time. If this is not the case, and/or if we can surface the status codes (see other thread) and make the limited-access stuff avoidable, I'm all for it (assuming we have the resources to implement it).

The access is URI-prefix based, at the moment. We would like to enable access at a fine-grained (URI+DateTime) level in the future, so having a clean way to do that would be good.
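
To make the two access models concrete, here is a minimal sketch, assuming a hypothetical `is_dark` check. The prefix list, the 14-digit timestamp strings and the `open_captures` set are all invented for illustration and do not reflect how the UK Web Archive actually stores its access rules.

```python
def is_dark(uri_r, timestamp, dark_prefixes, open_captures=frozenset()):
    """Return True if the capture of uri_r at timestamp is restricted.

    dark_prefixes -- URI prefixes whose captures are restricted (current model)
    open_captures -- (uri_r, timestamp) pairs that are explicitly opened up,
                     a possible shape for fine-grained URI+DateTime access
    """
    # A fine-grained entry overrides the prefix rule.
    if (uri_r, timestamp) in open_captures:
        return False
    # Otherwise fall back to the prefix-based rule.
    return any(uri_r.startswith(prefix) for prefix in dark_prefixes)


prefixes = ["http://www.example.org/"]
print(is_dark("http://www.example.org/page", "20170428114034", prefixes))  # True
```

With an empty `open_captures` set this behaves exactly like the prefix-only model, so the fine-grained layer could be introduced without changing existing decisions.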

Cheers,
Andy

Sawood Alam

May 1, 2017, 6:45:10 PM
to memen...@googlegroups.com
Hi Martin,

If this approach is taken, we should think about how we would distinguish between recorded and reported 451s. I mean, we would want to know whether the archive is returning a 451 status code with a custom generic payload or whether the archive has captured a 451 status code from the origin. This situation is the same as with 404s, where we distinguish a recorded 404 Memento from a non-archived resource by checking for the presence or absence of the Memento-Datetime header. Should the same technique be used here as well?
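
If the same technique were applied, the client-side check might look like the sketch below. The function name `classify_451` and its return labels are hypothetical; the only thing taken from the Memento protocol is the Memento-Datetime header name.

```python
def classify_451(status, headers):
    """Label a response as a recorded 451, a reported 451, or neither.

    status  -- integer HTTP status code of the URI-M response
    headers -- dict of response headers
    """
    if status != 451:
        return "not-451"
    # Header names are case-insensitive, so compare in lower case.
    names = {name.lower() for name in headers}
    if "memento-datetime" in names:
        # The archive captured a 451 from the origin and is replaying it.
        return "recorded"
    # The archive itself is refusing access to something it holds.
    return "reported"


print(classify_451(451, {"Memento-Datetime": "Fri, 28 Apr 2017 11:40:34 GMT"}))  # recorded
print(classify_451(451, {"Content-Type": "text/html"}))  # reported
```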

Best,

--
Sawood Alam



Another option would be to have your Memento infrastructure operate as before but make the URI-Ms of dark resources return (or redirect to) the 451 and show the generic page stating "Available in Legal Deposit Library Reading Rooms only". Since in this scenario the TimeGate would do the conneg, return the headers, and the TimeMap would list all URI-Ms, the BL content would continue to be surfaced in all Memento services.
--

--
Sawood Alam
Department of Computer Science
Old Dominion University
Norfolk VA 23529

Barry Hunter

May 2, 2017, 5:08:35 AM
to memen...@googlegroups.com

In our implementation, all Mementos and the TimeMaps and TimeGate for these resources return HTTP 451 only.

What about sites/pages that have 'dark' content as well as publicly accessible content?

So the whole TimeMap would be 451, even if there are some accessible Mementos? Or do the TimeMap/TimeGate only return 451 if ALL Mementos are 451?

As an example
returns a list, but I know there are some dark copies accessible under Legal Deposit (I have seen them in a reading room), and it doesn't look like they are returned here.


Similarly, will the TimeGate function to find an accessible Memento, or will it always just say 451?

seems to work. 


I'd like to know if this causes anyone any problems.

Well, as a developer, I would want a way to 'exclude' these dark archives (they are normally of no use) unless I was specifically looking for them.

i.e. when using a TimeGate to get a Memento, I usually want a working/accessible Memento :)



FWIW, this seems OK if it's only doing this for sites that are completely unavailable. The 451 could be a useful clue, but I can ignore it.

... but if it somehow hinders getting to accessible archives, that's not so good. Either the TimeMaps can't tell which Mementos are unavailable (I would have to test the individual Mementos), or the TimeGate makes it harder to get at accessible content.
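
For what it's worth, testing the individual Mementos could be sketched roughly as below. This assumes a link-format TimeMap with quoted parameter values and a caller-supplied `fetch_status` function that returns the HTTP status for a URI-M (for example a thin wrapper around a HEAD request). Both function names are hypothetical, and the regular expressions are a simplification rather than a full link-format parser.

```python
import re

# One link-format entry: <uri> followed by ;name="value" parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def memento_uris(timemap_text):
    """Return the URI-Ms listed in a link-format TimeMap, in document order."""
    uris = []
    for uri, params in LINK_RE.findall(timemap_text):
        attrs = dict(PARAM_RE.findall(params))
        # rel may hold several values, e.g. "first memento".
        if "memento" in attrs.get("rel", "").split():
            uris.append(uri)
    return uris


def accessible_mementos(timemap_text, fetch_status):
    """Keep only the URI-Ms that do not answer with a 451."""
    return [u for u in memento_uris(timemap_text) if fetch_status(u) != 451]
```

The obvious cost is one extra request per Memento, which is why a status hint inside the TimeMap itself would be preferable for large TimeMaps.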

Andy J

May 2, 2017, 11:48:03 AM
to Memento Development


On Tuesday, 2 May 2017 10:08:35 UTC+1, Barry Hunter wrote:

In our implementation, all Mementos and the TimeMaps and TimeGate for these resources return HTTP 451 only.

What about sites/pages that have 'dark' content as well as publicly accessible content?

Currently, whole sites are either dark or open access. We'd like to support a finer-grained access model but we're not sure how to do this cleanly via Memento.
 

So the whole TimeMap would be 451, even if there are some accessible Mementos? Or do the TimeMap/TimeGate only return 451 if ALL Mementos are 451?

As an example
returns a list, but I know there are some dark copies accessible under Legal Deposit (I have seen them in a reading room), and it doesn't look like they are returned here.


Oh dear, you have found a gap in our new index! Everything available in the reading rooms should be declared publicly now, so we must have missed a chunk of WARCs when we built the new index. I'll check it out.
 
Similarly, will the TimeGate function to find an accessible Memento, or will it always just say 451?

seems to work. 


I'd like to know if this causes anyone any problems.

Well, as a developer, I would want a way to 'exclude' these dark archives (they are normally of no use) unless I was specifically looking for them.

i.e. when using a TimeGate to get a Memento, I usually want a working/accessible Memento :)



FWIW, this seems OK if it's only doing this for sites that are completely unavailable. The 451 could be a useful clue, but I can ignore it.

... but if it somehow hinders getting to accessible archives, that's not so good. Either the TimeMaps can't tell which Mementos are unavailable (I would have to test the individual Mementos), or the TimeGate makes it harder to get at accessible content.


My understanding is that making whole TimeMaps (per URL) go 451 instead of just 404 should not cause significant problems, as any 4xx code on a TimeMap should be counted as a 'miss'.
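
As a sketch of that assumption, an aggregator-style merge might treat per-archive TimeMap responses as below. The function `merge_timemaps` and the shape of its input are hypothetical and are not taken from any existing aggregator.

```python
def merge_timemaps(responses):
    """Merge per-archive TimeMap results, counting any 4xx as a miss.

    responses -- iterable of (archive_name, status, memento_uris) tuples
    Returns (merged_uris, misses), where misses maps archive name to status.
    """
    merged = []
    misses = {}
    for archive, status, uris in responses:
        if 400 <= status < 500:
            # 404 and 451 both end up here; the code is kept so a caller
            # can still tell "not held" from "held but restricted".
            misses[archive] = status
            continue
        if status == 200:
            merged.extend(uris)
        # Other statuses (for example 5xx) are ignored in this sketch.
    return merged, misses


merged, misses = merge_timemaps([
    ("open-archive", 200, ["http://archive.example/20170428114034/http://www.example.org"]),
    ("dark-archive", 451, []),
])
print(misses)  # {'dark-archive': 451}
```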

Best wishes,
Andy
 

Martin Klein

May 10, 2017, 12:43:31 PM
to memen...@googlegroups.com
Hi Andy,

Thanks a lot for the clarification, I am glad to hear that I was mistaken with my interpretation of your email. 
The 451 behavior you propose is interesting and could, in our opinion, be the basis of a behavior for all web archives regarding inaccessible yet existing Mementos. We are very much interested in discussing this topic further and we are preparing a "forum" to kick-start the conversation very soon.

We did some tests and our results indicate that something is awry with your implementation. We saw a number of Memento URIs that were previously 200 but are now 451, for example:


Maybe that is another case of missed WARCs when building the new index?

cheers
M
