Summary of OWB call, December 14, 2016

Lauren Ko

Dec 15, 2016, 12:47:27 PM
to openwayback-dev
Hi all,

The following is a summary of the discussion from the OpenWayback call on December 14, 2016. Please feel free to comment here on anything discussed.

1. State of OpenWayback and webarchive-commons

John Erik has done work on URI classes for webarchive-commons (on his private GitHub account). He is confident that they will be useful to other web archiving Java projects. The classes should better mimic how browsers parse and resolve URIs. The changes are configurable and improve on the existing canonicalization, which has primarily consisted of lower-casing the URI. The code passes the Heritrix unit tests but is not yet compatible. He plans to make a PR soon.
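
To illustrate the difference between lower-casing alone and a more browser-like, configurable canonicalization, here is a minimal Python sketch. It is not the webarchive-commons code: the function names, the `lowercase_path` option, and the chosen rules (lower-case scheme and host, drop the default port, resolve dot segments, drop the fragment) are assumptions made for this example only.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the meaning of the URI.
DEFAULT_PORTS = {"http": 80, "https": 443}


def remove_dot_segments(path):
    """Resolve "." and ".." path segments, roughly as a browser would."""
    segments = path.split("/")
    out = []
    for seg in segments:
        if seg == ".":
            continue
        if seg == "..":
            if len(out) > 1:  # never pop the leading empty segment (the root)
                out.pop()
            continue
        out.append(seg)
    if segments[-1] in (".", ".."):
        out.append("")  # a trailing dot segment leaves a trailing slash
    return "/".join(out)


def canonicalize(uri, lowercase_path=False):
    """Hypothetical canonicalizer; userinfo and IPv6 literals are not handled."""
    parts = urlsplit(uri.strip())
    scheme = parts.scheme.lower()
    host = parts.hostname or ""  # urlsplit already lower-cases the hostname
    netloc = host
    if parts.port is not None and DEFAULT_PORTS.get(scheme) != parts.port:
        netloc = f"{host}:{parts.port}"
    path = remove_dot_segments(parts.path or "/")
    if lowercase_path:  # optional, since paths can be case-sensitive
        path = path.lower()
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


print(canonicalize("HTTP://Example.COM:80/a/./b/../Index.html?Q=1#top"))
print(canonicalize("HTTP://Example.COM:80/a/./b/../Index.html?Q=1#top", lowercase_path=True))
```

Making each rule a parameter, as with `lowercase_path` here, is one way such a canonicalizer could stay configurable for archives that need different matching behavior.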

We are expecting a webarchive-commons 1.2.0 release early in 2017. Currently there are 9 open PRs, of which 5 are labeled with the 1.2.0 release. Kris has looked at all of them and thinks they are okay but hasn't tested them. John Erik will further review the PRs in January. #34 requires a minor change in Heritrix if Heritrix updates its webarchive-commons dependency. Mohamed volunteered to review #63. Anyone else is invited to review any of the open PRs and comment.

There are also some webarchive-commons issues marked with a 2.0.0 release. At this point, the breaking change that requires a major release is a module name change: John Erik has begun splitting the project into separate poms, with what is currently in webarchive-commons 1.7 renamed to webarchive-commons-core. The API itself has not changed.

There has not been much work on OpenWayback or the Resource Resolver since John Erik messaged the OpenWayback Google Group about having an unpolished version of the Resource Resolver ready for trial (September 14). There was some question about whether to proceed with this work given the issues raised about the CDXJ format: short vs. long names in the JSON block, and whether the record type should move inside the JSON block.

Sawood suggested the use of aliases with shorter names. It was decided this would add undesirable complexity.

John Erik and Kris support not focusing on defining a CDXJ file format. Kris sees the files on disk as particular to whoever implements the indexing. He supports moving forward with what we have, arguing that the format is only used for indexing and should be kept separate from interchange.


2. Set an official CDX Server Protocol (carried on from above)
 
We should focus on a well-defined and documented Resource Resolver response format for interoperability. The response should have specified, documented field names, but the CDXJ on disk can be left up to the implementer of the indexing.
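
As a rough illustration of this separation, the Python sketch below converts one implementer's on-disk index line into a response record with fixed field names. Everything specific here is hypothetical: the line layout (key, timestamp, record type, JSON block), the short on-disk names, and the long response names are invented for the example and are not taken from any CDXJ specification or from the Resource Resolver.

```python
import json

# Hypothetical mapping from one implementer's short on-disk names
# to the (equally hypothetical) documented response field names.
DISK_TO_RESPONSE = {"u": "uri", "s": "status", "m": "mime", "d": "digest"}


def disk_line_to_response(line):
    """Turn an on-disk index line into a response record with long names."""
    # Assumed layout: <key> <timestamp> <record type> <JSON block>.
    # maxsplit=3 keeps the JSON block intact even though it contains spaces.
    key, timestamp, record_type, block = line.rstrip("\n").split(" ", 3)
    record = {"key": key, "timestamp": timestamp, "type": record_type}
    for name, value in json.loads(block).items():
        # Unknown names pass through unchanged.
        record[DISK_TO_RESPONSE.get(name, name)] = value
    return record


line = '(com,example,)/ 20161214120000 response {"u": "http://example.com/", "s": 200, "m": "text/html"}'
print(json.dumps(disk_line_to_response(line), sort_keys=True))
```

With this split, an implementer could change the on-disk names or move the record type into the JSON block by adjusting only the conversion function, while clients of the response format would see no difference.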


3. Alternative configuration format for OpenWayback

There is no desire to re-architect the project without Spring configuration, and we would like to retain the fine-grained control that Spring gives experienced users. Even so, exposing an additional, simpler configuration format is welcome. It would give users who just want a basic setup an easier way to do things like set the indexes and resource locations for collections.

Kris explained how the National and University Library of Iceland achieves this sort of thing with an overlay and a properties file, so Spring is not touched unless doing code work.

https://github.com/ato/wayback-easy is another route that could allow configuration via YAML and would allow reusing some of pywb's config format.

Please let us know if you want to work on the implementation for this.


4. AOB

Sawood asked about the possibility of OpenWayback having an embedded web server instead of having to be deployed in Tomcat. There was agreement that this is a trend that makes sense for OpenWayback to follow.


Thank you,
Lauren Ko
UNT Libraries
