Hi all,
The following is a brief summary of the discussion during the last OWB call.
1. Review of the proposed CDXJ spec
The concerns raised by Ilya and Sawood were discussed.
It was agreed that the content digest should only be in the JSON block as suggested by Ilya. [Already updated]
We will consider having short and long forms for the JSON keys, to accommodate implementations that value readability over compactness.
It is important that the JSON keys be well defined. We welcome any suggestions for clarification on the existing keys as well as recommendations for additional keys that we haven't accounted for.
There was a lengthy discussion about the header lines.
The argument for amending the current spec essentially boiled down to treating these header lines in the same manner as any other line in the CDXJ file. This would also allow more advanced metadata and possibly a file that is self-describing with respect to the contents of its JSON blocks. [I hope I'm accurately representing Sawood's point here].
The counter argument is that these are header lines and aren't part of the 'contents' as such. They are meant to serve a narrow purpose (identifying which spec the current CDXJ file corresponds to) and should be compatible with simple sort/merge tools (such as the bash sort command). This requires that they sort first, bitwise, and that they may be repeated.
While there are valid arguments on both sides, the current spec is ultimately simpler and is clearly sufficient for the intended purpose (of serving as an index). Barring code contributions that implement a richer variant, we will proceed with the current spec.
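To illustrate the sort/merge requirement, here is a minimal sketch in Python. The header text, the "!" prefix, and the layout of the index lines are assumptions made for the example and are not quoted from the spec. It merges two already-sorted files by bytewise comparison, roughly what a sort/merge tool would do in the C locale, and shows that the repeated header lines end up together at the top.

```python
import heapq

# Hypothetical header line; the exact text and the "!" prefix are
# assumptions for this example, not quoted from the spec.
HEADER = "!OpenWayback-CDXJ 1.0"

# Two hypothetical, already-sorted CDXJ files with made-up index lines.
file_a = [
    HEADER,
    "(com,example,)/ 20150801120000 response {}",
    "(org,example,)/about 20150802130000 response {}",
]
file_b = [
    HEADER,
    "(com,example,)/news 20150803140000 response {}",
]


def bytewise(line: str) -> bytes:
    """Sort key that compares lines byte by byte."""
    return line.encode("utf-8")


# Merge the two sorted inputs without re-sorting them.
merged = list(heapq.merge(file_a, file_b, key=bytewise))

for line in merged:
    print(line)
```

Because "!" (0x21) compares lower than the first byte of the index lines used here, both copies of the header land at the start of the merged output, where a reader can treat the repeats as harmless.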
2. Status of CDX server rewrite
John Erik expects a minimally functional version to be available in the next month or so.
He raised one concern about using UTF-8 encoded URIs for sorting. Action: Kristinn to look into this. [I have since done so; it is not an issue.]
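The specifics of the concern are not recorded in this summary. One property that may be relevant, offered here as an assumption about what was at stake, is that sorting UTF-8 encoded strings byte by byte gives the same order as sorting by Unicode code point. A small Python check with made-up URIs:

```python
# Made-up URIs containing 1-, 2-, 3- and 4-byte UTF-8 sequences.
uris = [
    "http://example.com/z",
    "http://example.com/é",
    "http://example.com/日本",
    "http://example.com/a",
    "http://example.com/😀",
]

# Python compares str values by code point.
by_code_point = sorted(uris)

# Compare the encoded form byte by byte instead.
by_utf8_bytes = sorted(uris, key=lambda u: u.encode("utf-8"))

same_order = by_code_point == by_utf8_bytes
print(same_order)
```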
3. AOB
A question was raised about accessing non-Response WARC records, notably Request records but also Metadata records.
It was explained that the new CDXJ spec accommodates a WARC Type, allowing all the records in a WARC to be indexed. It is then possible to use the WARC-Record-ID, WARC-Refers-To and WARC-Concurrent-To fields to properly match up Request, Response and Metadata records that correspond to the same harvesting event. These fields are already accounted for in the JSON block (and are optional). Using them properly will be up to the clients accessing the CDX server.
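As a rough sketch of what such client-side matching might look like, the Python below groups index entries by the record they point at. The key names ("type", "id", "concurrent_to", "refers_to") and the record IDs are placeholders invented for the example, not the keys defined in the spec, and the grouping only follows a single hop.

```python
from collections import defaultdict

# Hypothetical, already-parsed index entries. The key names stand in for
# the WARC Type, WARC-Record-ID, WARC-Concurrent-To and WARC-Refers-To
# values; they are placeholders, not the spec's JSON keys.
records = [
    {"type": "response", "id": "urn:uuid:aaa"},
    {"type": "request", "id": "urn:uuid:bbb", "concurrent_to": "urn:uuid:aaa"},
    {"type": "metadata", "id": "urn:uuid:ccc", "refers_to": "urn:uuid:aaa"},
    {"type": "response", "id": "urn:uuid:ddd"},
]


def group_by_event(entries):
    """Group entries under the record ID they point at (one hop only).

    An entry that points at nothing is grouped under its own ID.
    """
    groups = defaultdict(list)
    for rec in entries:
        anchor = rec.get("concurrent_to") or rec.get("refers_to") or rec["id"]
        groups[anchor].append(rec)
    return dict(groups)


events = group_by_event(records)
for anchor, members in events.items():
    print(anchor, [m["type"] for m in members])
```

A real client would also need to decide how to handle entries whose optional fields are absent, and chains where a record points at another record that itself points elsewhere.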
The next OWB call is scheduled for August 17, 15:00 UTC.
Best,
Kris
-------------------------------------------------------------------------
Landsbókasafn Íslands - Háskólabókasafn | Arngrímsgötu 3 - 107 Reykjavík
Sími/Tel: +354 5255600 | www.landsbokasafn.is
-------------------------------------------------------------------------
fyrirvari/disclaimer - http://fyrirvari.landsbokasafn.is