Summary of OWB call 22/07/2016

Kristinn Sigurðsson

Aug 2, 2016, 6:55:30 AM
to openway...@googlegroups.com
Hi all,

The following is a brief summary of the discussion during the last OWB call.

1. Review of the proposed CDXJ spec

The concerns raised by Ilya and Sawood were discussed.

It was agreed that the content digest should only be in the JSON block as suggested by Ilya. [Already updated]

We will consider having short and long forms for the JSON keys to allow implementations that value readability over compactness.

It is important that the JSON keys be well defined. We welcome any suggestions for clarification on the existing keys as well as recommendations for additional keys that we haven't accounted for.

There was a lengthy discussion about the header lines.

The argument for amending the current spec essentially boiled down to treating these header lines in the same manner as any other lines in the CDXJ file. This also allows more advanced metadata and possibly a self-describing file in terms of the JSON block contents. [I hope I'm accurately representing Sawood's point here].

The counterargument is that these are header lines and aren't part of the 'contents' as such. They are meant to serve a narrow purpose (identifying which spec the current CDXJ file corresponds to) and should be compatible with simple sort/merge tools (such as the Unix sort command), thus requiring that they sort first, bitwise, and allowing them to be repeated.
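To illustrate the sort/merge point, here is a minimal sketch in Python. It assumes the header line begins with "!" (byte 0x21), which sorts before digits and letters in bytewise order. The header text, URI keys and empty JSON blocks are illustrative only and are not taken from the spec.

```python
import heapq

# Assumption: a header line starts with "!" (0x21), so it sorts ahead of
# any index line that begins with a letter, digit or "(" in bytewise order.
HEADER = b"!OpenWayback-CDXJ 1.0\n"

# Two already-sorted CDXJ files, each carrying its own header line.
# The index lines are made up for illustration.
file_a = [HEADER, b"com,example)/ 20160722000000 {}\n"]
file_b = [HEADER, b"is,landsbokasafn)/ 20160801000000 {}\n"]

# heapq.merge compares bytes objects bytewise, much like a sort/merge tool
# running in the C locale. No special handling of headers is needed.
merged = list(heapq.merge(file_a, file_b))

for line in merged:
    print(line.decode("utf-8"), end="")
```

Both header lines end up at the top of the merged output, which is why the header needs to sort first and be allowed to repeat.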

There are valid arguments on both sides. Ultimately, though, the current spec is simpler and is clearly sufficient for the intended purpose (of serving as an index). Barring code contributions that implement a richer variant, we will proceed with the current spec.

2. Status of CDX server rewrite

John Erik expects a minimally functional version to be available in the next month or so.

He raised one concern about using UTF-8 encoded URIs for sorting. Action: Kristinn to look into this [which I have, it is not an issue]

3. AOB

A question was raised about accessing non-Response WARC records. Notably Request records, but also Metadata.

It was explained that the new CDXJ spec accommodates a WARC Type allowing all the records in a WARC to be indexed. It is then possible to use WARC-Record-ID, WARC-Refers-To and WARC-Concurrent-To fields to properly match up Request, Response and Metadata records that correspond to the same harvesting event. These fields are already accounted for in the JSON block (and are optional). Using them properly will be up to the clients accessing the CDX server.
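As a rough sketch of what such client-side matching might look like, the Python below groups records that point at the same record ID. The key names (`type`, `record_id`, `concurrent_to`, `refers_to`) and the UUID values are hypothetical stand-ins, not the keys defined in the spec. Treating WARC-Refers-To the same way as WARC-Concurrent-To is a simplification.

```python
from collections import defaultdict

# Hypothetical, already-parsed JSON blocks. Key names are illustrative only.
records = [
    {"type": "response", "record_id": "<urn:uuid:1>"},
    {"type": "request", "record_id": "<urn:uuid:2>", "concurrent_to": "<urn:uuid:1>"},
    {"type": "metadata", "record_id": "<urn:uuid:3>", "concurrent_to": "<urn:uuid:1>"},
    {"type": "response", "record_id": "<urn:uuid:4>"},
]


def group_by_event(records):
    """Group record types under the record ID they are linked to.

    A record that links to another record is filed under that record's ID;
    a record with no link is filed under its own ID.
    """
    groups = defaultdict(list)
    for rec in records:
        anchor = rec.get("concurrent_to") or rec.get("refers_to") or rec["record_id"]
        groups[anchor].append(rec["type"])
    return dict(groups)


events = group_by_event(records)
```

A real client would also need to handle links that point at records missing from the index.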


The next OWB call is scheduled for August 17, 15:00 UTC.


Best,
Kris
-------------------------------------------------------------------------
Landsbókasafn Íslands - Háskólabókasafn | Arngrímsgötu 3 - 107 Reykjavík
Sími/Tel: +354 5255600 | www.landsbokasafn.is
-------------------------------------------------------------------------
fyrirvari/disclaimer - http://fyrirvari.landsbokasafn.is

Ilya Kreymer

Oct 5, 2016, 8:19:04 PM
to openway...@googlegroups.com
Hi,

It has caught my attention that the updated proposed spec is still incompatible with the CDXJ formats already in use. In addition to removing the digest, the record type should also be placed in the JSON dictionary. I think there should only be 3 parts:

<Searchable URI> <Timestamp> <JSON block>

The searchable URI + timestamp are needed for binary search, while all the other fields, including record type, digest, etc., can easily be accessed by reading the JSON block.
This will keep this format much more compatible with existing uses of CDXJ and tools.
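
A minimal sketch of reading such a three-part line in Python, assuming the parts are separated by single spaces and that the searchable URI itself contains no spaces. The sample line, the key names `type` and `digest`, and the digest value are made up for illustration.

```python
import json


def parse_cdxj_line(line):
    """Split a line into (searchable URI, timestamp, parsed JSON block).

    maxsplit=2 keeps any spaces inside the JSON block intact.
    """
    uri, timestamp, json_block = line.rstrip("\n").split(" ", 2)
    return uri, timestamp, json.loads(json_block)


# Illustrative line only; the keys and values are not taken from any spec.
line = 'com,example)/ 20160722150000 {"type": "response", "digest": "sha1:EXAMPLE"}\n'

uri, timestamp, fields = parse_cdxj_line(line)
print(uri, timestamp, fields["type"])
```

Only the first two parts take part in sorting and lookup, so the JSON block can gain fields without affecting binary search.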

Ilya


