OWB call 15/06 - Summary

Kristinn Sigurðsson

Jun 15, 2016, 12:04:25 PM
to openway...@googlegroups.com
Hi all,

Following is a brief summary of the OWB call today, 15/06 2016.

1. Review ACTION items from last time.

Still need to contact submitter of PR 294. Other action items were completed. PR 277 will likely be abandoned. Other actions were covered under the next two items on the agenda.

2. CDXJ specification for OpenWayback 3.0.0.

General discussion. No suggestions for changes. JEH is working on implementing this. Additional feedback is welcome here or via issues on the GitHub warc-specifications project. https://iipc.github.io/warc-specifications/specifications/cdx-format/openwayback-cdxj/

OWB will provide support for indexing WARC and ARC files. May also provide support for converting existing CDX files.

Need to evaluate sorting order when using UTF-8. ACTION: KS to investigate.

3. Possible merging of IA Wayback changes into OpenWayback

Item was prompted by PR 316. Requires substantial effort. Unless we have a volunteer for this, it will not proceed. The PR has been closed.

4. webarchive-commons 1.1.7

Will be released this week. Includes a number of bugfixes.

AOB: PR 317 makes significant UI changes to the bubble calendar to accommodate additional years. Feedback is welcome!

https://github.com/iipc/openwayback/pull/317


Next meeting: July 27th at the usual time (15:00 UTC).

Note that we will be testing a new online meeting tool (VSee). Details will be sent out with meeting invites. As usual, if you do not get an invite at least 1 week prior, contact me for one.

Best,
Kris



-------------------------------------------------------------------------
Landsbókasafn Íslands - Háskólabókasafn | Arngrímsgötu 3 - 107 Reykjavík
Sími/Tel: +354 5255600 | www.landsbokasafn.is
-------------------------------------------------------------------------
fyrirvari/disclaimer - http://fyrirvari.landsbokasafn.is

Sawood Alam

Jun 17, 2016, 9:15:24 AM
to openway...@googlegroups.com
I have some concerns about the CDXJ specification that I wanted to discuss on the phone call, but I could not join the call because I was travelling. I will perhaps open a ticket to discuss my concerns, but I will be travelling tomorrow and will be out for a week, so I am not sure when I can spare some time.

Best,

--
Sawood Alam
Department of Computer Science
Old Dominion University
Norfolk VA 23529




Ilya Kreymer

Jun 17, 2016, 9:49:14 AM
to openway...@googlegroups.com
Hi,

I have some comments about the CDXJ spec as well.

1) The digest should not be in the url key prefix, but in the JSON dictionary, for the following reasons:

 - Not all records have a digest, and the spec calls for it to be a '-' in that case. This is precisely why it should not be in the key but in the JSON dictionary: anything that is optional belongs in the JSON.

 - It is not needed for binary lookup, only the url + timestamp is used for binary search. If filtering by additional properties is needed, that can be done by filtering the json dictionary.

 - It will (unnecessarily) break compatibility with existing uses of cdxj, including in pywb.
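
To illustrate the second bullet, here is a minimal sketch of a lookup that searches only on the "url key + timestamp" prefix and then filters on the JSON dictionary. The index lines, the `lookup` helper and the field values are hypothetical, made up for illustration; this is not code from OpenWayback or pywb.

```python
import bisect
import json

# Hypothetical index lines in the form "<url key> <timestamp> <JSON block>".
# They are assumed to be sorted already.
lines = [
    'org,commoncrawl)/ 20160524172816 {"status": "200"}',
    'org,example)/ 20150101000000 {"status": "200"}',
    'org,example)/ 20160101000000 {"status": "404", "digest": "ABC"}',
    'org,example)/about 20160101000000 {"status": "200"}',
]


def lookup(sorted_lines, url_key, timestamp=""):
    """Return every line whose prefix starts with 'url_key timestamp'."""
    prefix = f"{url_key} {timestamp}"
    # Binary search only needs the leading url key and timestamp.
    start = bisect.bisect_left(sorted_lines, prefix)
    matches = []
    for line in sorted_lines[start:]:
        if not line.startswith(prefix):
            break
        matches.append(line)
    return matches


# Any further filtering (status, digest, ...) happens on the JSON block.
captures = lookup(lines, "org,example)/")
ok_only = [c for c in captures if json.loads(c.split(" ", 2)[2]).get("status") == "200"]
```

Optional values such as the digest never take part in the search here; a record without one simply omits the key from its JSON block.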

2) The three letter field names.

For example, the CommonCrawl index cdxj looks like this: 
org,commoncrawl)/ 20160524172816 {"url": "http://commoncrawl.org/", "mime": "text/html", "status": "200", "digest": "TBVWS22XFGRRUVAZUEUMBY6IQD4QFGLU", "length": "4989", "offset": "60991167", "filename": "crawl-data/CC-MAIN-2016-22/segments/1464049272823.52/warc/CC-MAIN-20160524002112-00154-ip-10-185-217-139.ec2.internal.warc.gz"}
When someone sees this line, most of the fields are self-explanatory. Since the cdx is gzip compressed (ZipNum), the effect of the longer field names is pretty negligible on the overall file size.

Some of these could be made shorter, and I would be interested in collaborating to standardize on useful, short names that are also user friendly.
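
As a rough sketch of how little code is needed to read such a line: split off the url key and timestamp, then hand the rest to a JSON parser. The helper name `parse_cdxj_line` is an assumption for illustration; the sample line is the CommonCrawl one quoted above.

```python
import json


def parse_cdxj_line(line):
    """Split a CDXJ line into (url key, timestamp, dict of JSON fields)."""
    # Split on the first two spaces only; the JSON block itself contains spaces.
    url_key, timestamp, json_block = line.rstrip("\n").split(" ", 2)
    return url_key, timestamp, json.loads(json_block)


sample = (
    'org,commoncrawl)/ 20160524172816 {"url": "http://commoncrawl.org/", '
    '"mime": "text/html", "status": "200", '
    '"digest": "TBVWS22XFGRRUVAZUEUMBY6IQD4QFGLU", "length": "4989", '
    '"offset": "60991167", "filename": "crawl-data/CC-MAIN-2016-22/segments/'
    '1464049272823.52/warc/CC-MAIN-20160524002112-00154-ip-10-185-217-139.ec2.internal.warc.gz"}'
)

url_key, timestamp, fields = parse_cdxj_line(sample)
print(url_key, timestamp, fields["mime"], fields["status"])
```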

3) The first line of '!OpenWayback-CDXJ 1.0' should support additional metadata in the JSON dictionary. Also, the '1.0' is not exactly compatible with the timestamp parser and may require special handling, so dropping the '.' may be a safer approach.
I think Sawood's idea of using keys prefixed with @ had this intention in his examples: https://github.com/oduwsdl/ORS/wiki/CDXJ

4) As shown above, the CDXJ format is already in use, such as by CommonCrawl, and for research at ODU. 

Please consider if there is a need to create a specifically incompatible version for OpenWayback, as outlined in the spec, or if it would be better to work together to have a more common CDXJ format.




Sawood Alam

Jun 17, 2016, 10:16:36 AM
to openway...@googlegroups.com, Michael L. Nelson
Adding to Ilya's third point, a JSON block (dict or array) must be present in every line, including the proposed doctype line. This will enable parsing each line with a single parser without exceptions. Also, the initial idea was to have just one character, "@", which when used as a prefix makes the corresponding entry a special entry (part of the metadata). Multiple occurrences of a particular key should be merged according to their semantics (overwrite/discard/append as suited). Keys of the data section (those that appear before the JSON block) will be advertised in the meta section under "@keys" as an array. Then there is the "@context" key, which points to a resource where all the keywords can be defined and all the restrictions are applied. This is a good place to associate a version without introducing a new entry for the doctype. All this is part of the underlying ORS syntax, which allows adding semantics and restrictions that suit certain applications. Please refer to the initial blog post for a refresher: http://ws-dl.blogspot.com/2015/09/2015-09-10-cdxj-object-resource-stream.html

Additionally, on the version numbering in the doctype, I was not sure about the minor version number. When compatible files are merged, there will be no way to identify which data record belongs to which version, so what is the need for a minor version? Perhaps an example would help me understand the purpose of compatible sub-versions.

Best,

--
Sawood Alam
Department of Computer Science
Old Dominion University
Norfolk VA 23529


John Erik Halse

Jun 17, 2016, 11:13:04 AM
to openwayback-dev, m...@cs.odu.edu
I just want to throw in a few points, though I'm about to go on vacation and will not be able to participate in the discussion for a while.

The reason to not use '@' as the prefix for metadata is that it will not sort to the beginning of the file if using ordinary sort tools. The '!' is the first visible character in ASCII and will sort correctly. Space would be a candidate (as used in legacy CDX), but it is easily overlooked when showing parts of CDX files in e-mails, code etc.
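
A small sketch of the ordering argument, assuming plain byte-wise sorting (a locale-aware sort may order things differently). All sample keys below are hypothetical and only chosen to show where each prefix character lands.

```python
# Hypothetical first fields of index lines, one per candidate prefix character.
keys = [
    "org,example)/ 20160101000000",
    "@meta",
    "!OpenWayback-CDXJ 1.0",
    "127.0.0.1)/ 20160101000000",  # assumed: a url key that starts with a digit
    " legacy-header",
]

# Sort by raw ASCII bytes: space is 0x20, '!' is 0x21, digits are 0x30-0x39,
# '@' is 0x40 and lowercase letters are 0x61-0x7A.
ordered = sorted(keys, key=lambda k: k.encode("ascii"))

for k in ordered:
    print(repr(k))
```

With this ordering, only the space and '!' lines are guaranteed to come before every url key; an '@' line ends up after any key that starts with a digit.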

To Ilya's second point: Size matters. It is wrong to assume that all implementations would be gzipped. For example, a DB is already in use as the storage backend by the tinyCdxServer.

Best,

John Erik

Ilya Kreymer

Jun 17, 2016, 11:36:01 AM
to openway...@googlegroups.com, Michael Nelson
On Fri, Jun 17, 2016 at 11:13 AM, John Erik Halse <johner...@gmail.com> wrote:
I just want to throw in a few points, though I'm about to go on vacation and will not be able to participate in the discussion for a while.

The reason to not use '@' as the prefix for metadata is that it will not sort to the beginning of the file if using ordinary sort tools. The '!' is the first visible character in ASCII and will sort correctly. Space would be a candidate (as used in legacy CDX), but it is easily overlooked when showing parts of CDX files in e-mails, code etc.

No strong preference here, though I think '!' is fine and is also consistent with shell file heading.
 

To Ilya's second point: Size matters. It is wrong to assume that all implementations would be gzipped. For example, a DB is already in use as the storage backend by the tinyCdxServer.


Of course, but it would also be wrong to assume that DB storage does not support compression. I would assume that any production-quality DB provides some sort of compression, including RocksDB (https://github.com/facebook/rocksdb/wiki/RocksDB-Basics).

Ilya