Hi,
I have some comments about the CDXJ spec as well.
1) The digest should not be in the url key prefix, but in the JSON dictionary, for the following reasons:
- Not all records have a digest, and the spec calls for a '-' in that case. This is precisely why it belongs in the JSON dictionary rather than in the key: anything that is optional belongs in the JSON.
- It is not needed for lookup: only the url + timestamp is used for binary search. If filtering by additional properties is needed, that can be done by filtering on the JSON dictionary.
- It will (unnecessarily) break compatibility with existing uses of CDXJ, including in pywb.
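To illustrate the lookup argument, here is a minimal sketch in Python. It assumes a small in-memory list of sorted index lines; the sample lines, `lookup` and `with_digest` are made up for illustration and are not taken from pywb or OpenWayback. The binary search only touches the url key prefix, and the digest is filtered afterwards from the JSON dictionary.

```python
import bisect
import json

# Hypothetical sample index: sorted lines of "<urlkey> <timestamp> <json>".
LINES = sorted([
    'org,commoncrawl)/ 20160524172816 {"url": "http://commoncrawl.org/", "status": "200", "digest": "TBVWS22XFGRRUVAZUEUMBY6IQD4QFGLU"}',
    'org,commoncrawl)/ 20160601000000 {"url": "http://commoncrawl.org/", "status": "301"}',
    'org,example)/ 20160101000000 {"url": "http://example.org/", "status": "200", "digest": "AAAA"}',
])


def lookup(lines, urlkey):
    """Return (timestamp, fields) pairs for urlkey via binary search on the key prefix."""
    prefix = urlkey + " "
    start = bisect.bisect_left(lines, prefix)
    results = []
    for line in lines[start:]:
        if not line.startswith(prefix):
            break
        _, timestamp, raw = line.split(" ", 2)
        results.append((timestamp, json.loads(raw)))
    return results


def with_digest(records, digest):
    """Filter on the optional 'digest' field; records without one are skipped."""
    return [(ts, fields) for ts, fields in records if fields.get("digest") == digest]


records = lookup(LINES, "org,commoncrawl)/")
matches = with_digest(records, "TBVWS22XFGRRUVAZUEUMBY6IQD4QFGLU")
```

The record without a digest needs no '-' placeholder here; the field is simply absent from the dictionary.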
2) The three-letter field names are hard to read.
For example, a line from the CommonCrawl CDXJ index looks like this:
org,commoncrawl)/ 20160524172816 {"url": "http://commoncrawl.org/", "mime": "text/html", "status": "200", "digest": "TBVWS22XFGRRUVAZUEUMBY6IQD4QFGLU", "length": "4989", "offset": "60991167", "filename": "crawl-data/CC-MAIN-2016-22/segments/1464049272823.52/warc/CC-MAIN-20160524002112-00154-ip-10-185-217-139.ec2.internal.warc.gz"}
When someone sees this line, most of the fields are self-explanatory. Since the cdx is gzip compressed (ZipNum), the effect of the longer field names is pretty negligible on the overall file size.
Some of these could be made shorter, and I would be interested in collaborating to standardize on useful, short names that are also user friendly.
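As a rough sketch of why the descriptive names are convenient, the line above can be read with a few lines of Python. The helper name `parse_cdxj_line` is made up for illustration; it assumes the key, timestamp and JSON are separated by single spaces, as in the example line.

```python
import json

LINE = (
    'org,commoncrawl)/ 20160524172816 {"url": "http://commoncrawl.org/", '
    '"mime": "text/html", "status": "200", '
    '"digest": "TBVWS22XFGRRUVAZUEUMBY6IQD4QFGLU", "length": "4989", '
    '"offset": "60991167", "filename": "crawl-data/CC-MAIN-2016-22/segments/'
    '1464049272823.52/warc/CC-MAIN-20160524002112-00154-ip-10-185-217-139.'
    'ec2.internal.warc.gz"}'
)


def parse_cdxj_line(line):
    """Split one CDXJ line into (urlkey, timestamp, fields)."""
    # maxsplit=2 keeps any spaces inside the JSON part intact.
    urlkey, timestamp, raw = line.rstrip("\n").split(" ", 2)
    return urlkey, timestamp, json.loads(raw)


urlkey, timestamp, fields = parse_cdxj_line(LINE)
# Field access reads naturally with the long names.
mime = fields["mime"]
length = int(fields["length"])
```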
3) The first line, '!OpenWayback-CDXJ 1.0', should support additional metadata in the JSON dictionary. Also, the '1.0' is not exactly compatible with the timestamp parser and may require special handling, so dropping the '.' may be a safer approach.
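A small sketch of the concern, under stated assumptions: it supposes a line parser that treats the second field as a digits-only timestamp, which is an assumption for illustration and not the actual pywb or OpenWayback code. The `created` metadata key in the second header is hypothetical as well.

```python
import json


def parse_line(line):
    """Split a line into (key, second_field, fields); fields is {} when no JSON follows."""
    parts = line.split(" ", 2)
    key, second = parts[0], parts[1]
    fields = json.loads(parts[2]) if len(parts) == 3 else {}
    return key, second, fields


def looks_like_timestamp(value):
    # Assumption: the timestamp parser accepts only ASCII digits.
    return value.isascii() and value.isdigit()


# Header as written in the spec: the '.' makes the second field non-numeric.
_, dotted_version, _ = parse_line("!OpenWayback-CDXJ 1.0")

# Header without the '.', carrying hypothetical metadata in a JSON dictionary.
_, plain_version, meta = parse_line('!OpenWayback-CDXJ 10 {"created": "example"}')

dotted_ok = looks_like_timestamp(dotted_version)
plain_ok = looks_like_timestamp(plain_version)
```

With this kind of parser the header without the '.' goes through the same code path as any other line, and the metadata comes out of the JSON dictionary like any other field.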
4) As shown above, the CDXJ format is already in use, for example by CommonCrawl and for research at ODU.
Please consider whether there is a need to create a specifically incompatible version for OpenWayback, as outlined in the spec, or whether it would be better to work together toward a more common CDXJ format.