Embedding archiving metadata inside a document

Gerben

Apr 10, 2018, 11:59:09 AM
to memen...@googlegroups.com

Hi all, maybe you have an answer to my question: How could an HTML document declare itself to be a snapshot of some resource?

Memento is all about specifying such provenance information, but it always expresses this information in HTTP headers. If I save a web page, in my case using freeze-dry, I would like to add this information to the file itself. Do you know of standards for expressing this?

As a possible solution, would it make sense to move the Memento HTTP headers into link and meta tags?

<meta http-equiv="Memento-Datetime" content="Wed, 30 May 2007 18:47:52 GMT">
<link rel="original" href="https://example.org/">

Thinking further in this direction: would it make sense to allow more/all of the defined link relation types to be given in a <link> element, instead of an HTTP Link header? The spec seems not to mention any such practice.
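
The proposal above can be sketched in a few lines. This is only an illustration of the idea as written in this message, not an established practice: the function name `embed_snapshot_metadata` and its parameters are hypothetical, and whether these tag values are valid HTML is exactly the open question.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime
from html import escape


def embed_snapshot_metadata(html_text, original_url, captured_at):
    """Return html_text with the proposed <meta>/<link> tags added inside <head>.

    captured_at must be a timezone-aware datetime (hypothetical helper, not
    part of freeze-dry or of any Memento tooling).
    """
    # Memento-Datetime uses the HTTP-date format, always expressed in GMT.
    http_date = format_datetime(captured_at.astimezone(timezone.utc), usegmt=True)
    tags = (
        f'\n<meta http-equiv="Memento-Datetime" content="{escape(http_date)}">'
        f'\n<link rel="original" href="{escape(original_url)}">'
    )
    # Insert right after the opening <head> tag; if there is none, the
    # document is returned unchanged.
    return re.sub(
        r"(<head\b[^>]*>)",
        lambda match: match.group(1) + tags,
        html_text,
        count=1,
        flags=re.IGNORECASE,
    )
```

Calling it with `datetime(2007, 5, 30, 18, 47, 52, tzinfo=timezone.utc)` and `https://example.org/` yields the two tags shown above.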

Gerben

Herbert Van de Sompel

Apr 10, 2018, 5:19:26 PM
to memen...@googlegroups.com, Herbert Van de Sompel
On Tue, Apr 10, 2018 at 8:20 AM, 'Gerben' via Memento Development <memen...@googlegroups.com> wrote:

Hi all, maybe you have an answer to my question: How could an HTML document declare itself to be a snapshot of some resource?

Memento is all about specifying such provenance information, but it always expresses this information in HTTP headers. If I save a web page, in my case using freeze-dry, I would like to add this information to the file itself. Do you know of standards for expressing this?


I don't think this question has come up before, since - as you indicate - the Memento protocol works at the level of HTTP headers, among others because that makes it applicable for resources of any MIME type.

As a possible solution, would it make sense to move the Memento HTTP headers into link and meta tags?


In principle (see below), what you describe could be done, though obviously only for HTML documents. But:
- Current Memento clients would not be able to use the info because they rely on HTTP headers. As a matter of fact, for quite some Memento protocol interactions, HTTP HEAD (not GET) is typically used.
- The approach entails permanently modifying the archived content, which is typically considered to be "not a good thing" (tm). So, please read the below comments subject to these considerations.
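
For comparison, the header-based form that clients read today can be picked apart with a small amount of code. The sketch below is illustrative only: it is a simplified regular-expression reader, not a complete RFC 8288 parser, and the names `parse_link_header` and `targets_with_rel` as well as the sample header value are assumptions made for this example.

```python
import re

# One link-value: <target> followed by zero or more ;name=value parameters.
_LINK_RE = re.compile(
    r'<([^>]*)>((?:\s*;\s*[\w-]+(?:\s*=\s*(?:"[^"]*"|[^",;\s]+))?)*)'
)
_PARAM_RE = re.compile(r';\s*([\w-]+)(?:\s*=\s*(?:"([^"]*)"|([^",;\s]+)))?')


def parse_link_header(value):
    """Split an HTTP Link header value into (target, params) pairs."""
    links = []
    for target, raw_params in _LINK_RE.findall(value):
        params = {}
        for name, quoted, bare in _PARAM_RE.findall(raw_params):
            params[name.lower()] = quoted or bare
        links.append((target, params))
    return links


def targets_with_rel(links, rel):
    """Return targets whose space-separated rel value contains the given type."""
    return [
        target
        for target, params in links
        if rel in params.get("rel", "").lower().split()
    ]


# Assumed sample value, modelled on the URLs used in this thread.
SAMPLE = (
    '<https://example.org/>; rel="original", '
    '<https://an.archive.org/2016/https://example.org/>; rel="memento"; '
    'datetime="Wed, 30 May 2007 18:47:52 GMT"'
)
```

Note that the quoted `datetime` value contains a comma, which is why the header cannot simply be split on commas.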

<meta http-equiv="Memento-Datetime" content="Wed, 30 May 2007 18:47:52 GMT">


This would be do-able were it not that HTML5 defines http-equiv (https://www.w3.org/TR/html5/document-metadata.html#pragma-directives) as an enumerated attribute (https://www.w3.org/TR/html5/infrastructure.html#enumerated-attributes), meaning it has a finite, pre-defined list of keywords and Memento-Datetime is not one of them.
 

<link rel="original" href="https://example.org/">


One would intuitively think that this can be done, on the assumption that HTTP Links (RFC8288) and HTML <link>s (https://www.w3.org/TR/html5/links.html#allowed-keywords-and-their-meanings) are equivalent/interchangeable. There is a problem, however, with the allowable link relation types. For some reason, the HTML5 editors did not use the long-existing IANA link relation registry as the registry of allowable link types in HTML5. This leads to the regrettable (IMO) situation that:
- For HTTP Links, link relation types are registered at https://www.iana.org/assignments/link-relations/link-relations.xhtml
- For HTML5 <link>s, link relation types are registered in two places: within the HTML5 spec and in the community-managed Microformats wiki registry at http://microformats.org/wiki/existing-rel-values#HTML5_link_type_extensions .
An inspection will show that there is some overlap but also a lot of divergence. None of the Memento relation types are in the HTML5 link relation registry. I assume they could be added. Interestingly, I discovered that there is an (unspecified) "archived" link relation type in the wiki registry that maybe is intended to mean something similar to the "memento" relation type.

Thinking further in this direction: would it make sense to allow more/all of the defined link relation types to be given in a <link> element, instead of an HTTP Link header? The spec seems not to mention any such practice.


Taking into account that:
1. Current Memento clients would not be able to leverage the <link>s
2. Adding <link>s means tampering with archived content
3. The Memento link relation types are not registered for use in HTML5

One could imagine doing the following:


<link rel="original" href="https://example.org/">
<link rel="memento self" href="https://an.archive.org/2016/https://example.org/">
 
whereby https://an.archive.org/2016/https://example.org/ is the URI of the document itself, i.e. the archived page.

We would still be missing the "datetime" attribute that is used in HTTP Link with the "memento" relation type. And, from a quick reading, I was not able to determine whether one is allowed to put additional (self-defined) attributes on a <link>.
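
A reader for such embedded <link>s could look roughly like the sketch below. It uses only the Python standard library; the class and function names are hypothetical, and it relies on rel being a space-separated list of tokens, so that "memento self" carries both relation types.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (rel tokens, href) pairs from every <link> element."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel is a space-separated token list; compare case-insensitively.
        rels = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if href:
            self.links.append((rels, href))


def find_links(html_text, rel):
    """Return the href of every <link> whose rel contains the given type."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return [href for rels, href in collector.links if rel in rels]
```

Nothing in this sketch recovers the capture datetime, which matches the gap noted above.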

Cheers

Herbert

--
Herbert Van de Sompel
Digital Library Research & Prototyping
Los Alamos National Laboratory, Research Library
http://public.lanl.gov/herbertv/
http://orcid.org/0000-0002-0715-6126

==

Sawood Alam

Apr 12, 2018, 9:29:38 PM
to memento-dev
Adding to Herbert's reply: even if we think about extending HTML somehow to be able to add such link and meta elements, the fact that we often replay historical captures would make it very difficult to backport our extension. Earlier HTML versions were of the SGML family, where the DOCTYPE required a DTD. In HTML5, however, they departed from XML namespacing, so one cannot define another DTD and include it in an HTML5 document. When replaying an HTML document, we generally leave the DOCTYPE declaration untouched, be it an older HTML version or HTML5, because changing it might break the replay and many assumptions that were made when the document was originally created.

I understand that it would be desirable and useful to have some way to expose more metadata in the markup of the root HTML document of a composite memento. For example, IA includes some such data in the form of HTML comments at the end of the document.

<!--
     FILE ARCHIVED ON 11:31:35 Mar 20, 2018 AND RETRIEVED FROM THE
     INTERNET ARCHIVE ON 23:47:54 Apr 12, 2018.
     JAVASCRIPT APPENDED BY WAYBACK MACHINE, COPYRIGHT INTERNET ARCHIVE.

     ALL OTHER CONTENT MAY ALSO BE PROTECTED BY COPYRIGHT (17 U.S.C.
     SECTION 108(a)(3)).
-->
<!--
playback timings (ms):
  LoadShardBlock: 3679.992 (36)
  esindex: 0.139 (11)
  CDXLines.iter: 239.402 (21)
  PetaboxLoader3.datanode: 2635.152 (37)
  exclusion.robots.fetch: 40.064 (22)
  exclusion.robots: 47.679 (11)
  exclusion.robots.policy: 3.681 (11)
  RedisCDXSource: 77.357 (11)
  PetaboxLoader3.resolve: 192.919 (3)
  load_resource: 84.645
-->

While this is useful for manual inspection, it does not have any well-defined, machine-readable structure, and it affects the fixity of the capture.
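
To illustrate how ad hoc any machine reading of such a comment has to be, here is a minimal sketch that scrapes the capture time from the text shown above. The pattern is inferred from this single sample only; it is not a documented interface, and the function name is hypothetical. The comment does not state a time zone, so the result is a naive datetime.

```python
import re
from datetime import datetime

# Pattern guessed from the one sample comment quoted in this thread.
_ARCHIVED_RE = re.compile(
    r"FILE ARCHIVED ON (\d{2}:\d{2}:\d{2} \w{3} \d{1,2}, \d{4})"
)


def scrape_capture_time(html_text):
    """Return the capture time found in the comment, or None if absent."""
    match = _ARCHIVED_RE.search(html_text)
    if match is None:
        return None
    # %b assumes English month abbreviations (the default C locale).
    return datetime.strptime(match.group(1), "%H:%M:%S %b %d, %Y")
```

Any change to the wording of the comment would silently break this, which is the fragility being pointed out.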

I can think of another, cleaner way to add such metadata: introducing some CustomElements (https://html.spec.whatwg.org/multipage/custom-elements.html#custom-elements) with custom properties, which would be HTML5-compliant. Suppose we introduce a custom HTML element called "<memento-meta>" that has the necessary attributes.

<script src="https://components.example.net/custom-elements/memento-meta.js"></script>
<memento-meta urir="https://example.com/"
              memento-datetime="Wed, 30 May 2007 18:47:52 GMT">
</memento-meta>

As long as necessary JS is included that defines this newly introduced element and the name of the element contains at least one hyphen "-" in it (and is not among a few reserved hyphenated elements), an HTML5 parser should not complain about it. We can even have more elements defined that are used as child elements under an umbrella node to avoid too many things in a single node as attributes, because some values might better fit as the content of an element rather than a property. CustomElements give everyone the power to introduce their own elements and maintain any name conflicts locally. These are generally used for making an extensible and reusable component set, hence, carry no global semantics. However, if the archiving community agrees on something like this, we can define a few elements and host the JS file somewhere that we all can use. This information can remain in the markup without any visual representation or rendered in the form of a banner or badge. JavaScript can parse and allow interaction with these elements like any native element.
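
The naming rule mentioned above can be approximated with a small check. This is a simplified, ASCII-only sketch of the rule as I understand it from the HTML Standard at the time of writing (the real grammar also allows many non-ASCII characters); the function name is hypothetical.

```python
import re

# Hyphenated names that the HTML Standard reserves and that therefore
# cannot be used for custom elements.
RESERVED_NAMES = frozenset({
    "annotation-xml",
    "color-profile",
    "font-face",
    "font-face-src",
    "font-face-uri",
    "font-face-format",
    "font-face-name",
    "missing-glyph",
})

# ASCII-only approximation: lowercase letter first, at least one hyphen.
_NAME_RE = re.compile(r"[a-z][a-z0-9._]*-[a-z0-9._-]*")


def is_plausible_custom_element_name(name):
    """Return True if name looks usable as a custom element name."""
    return _NAME_RE.fullmatch(name) is not None and name not in RESERVED_NAMES
```

Under this check "memento-meta" passes, while "mementometa" (no hyphen) and "font-face" (reserved) do not.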

Recently, we introduced Reconstructive (http://ws-dl.blogspot.com/2018/01/2018-01-08-introducing-reconstructive.html), a ServiceWorker-based replay system module that requires very little or no rewriting on the server side. The module has a rudimentary implementation of an archival replay banner that leverages CustomElements (the code is available at https://github.com/oduwsdl/Reconstructive). A more extended form of the banner implementation will be presented in JCDL18 (and potentially in WADL18 as well) as a poster, but we can share the two-pager if anyone is interested.

There is another potential way to expose such metadata in the payload (as opposed to HTTP headers) using a separate parallel endpoint, for example:


In case of a separate description document one can serve XML, JSON or other formats while utilizing the newly introduced Memento ontology in RDF, TTL and JSON-LD formats at https://mementoweb.org/ns.

Best,

--
Sawood Alam
Department of Computer Science
Old Dominion University
Norfolk VA 23529

Gerben

Apr 17, 2018, 7:23:35 AM
to memen...@googlegroups.com

Thanks for thinking along, Herbert & Sawood. Some follow-up thoughts:

I agree with you that it is preferable to leave pages unchanged by saving metadata separately. To clarify/justify my intended approach, the situation in question is one without any archiving-specific infrastructure: it's just snapshotting and saving an HTML page, to keep it locally or host it anywhere. So the alternative is not storing any information at all, and then I suppose it is worth making some changes to add some metadata (I wish MHTML were more widely supported; it allows headers). In my scenario large changes are made anyhow, e.g. subresources are inlined as data URLs (I guess the best possible outcome would be if those changes are mostly reversible, though I haven't thought that through).

I think I have not fully understood the worry about using unspec'd types for link and meta tags; I thought the web standards were designed to be extensible, allowing for such unknown values by simply ignoring them. I would be very surprised if an unknown link relation type or unknown meta http-equiv header name would upset any browser or other platform, especially as, for both of them, the HTML 5.2 spec refers to a wiki as the authority for their name registration (though the wiki for http-equiv values now appears defunct). I would be glad to hear if I may be overlooking some potential problem here.

In any case, using link and meta tags in order to stick with the Memento vocabulary was just a suggestion. I hoped to discover some existing practice of adding provenance metadata to pages, but if that seems not to exist, perhaps I will just experiment with this approach for now. Sawood's suggestion to use a custom element is another interesting approach, though I do not understand its pros and cons well enough; I fear it would lack the possibility to standardise the tag name.

Gerben

PS I like the idea of Reconstructive; I did not know this was even possible with ServiceWorkers. :)
