Resource Resolver ready for testing


John Erik Halse

Sep 14, 2016, 10:20:08 AM
to openwayback-dev
Hi all,

A very early version of the Resource Resolver (aka CDX server) is ready for testing and feedback.

Since the Resource Resolver also supports the current CDX file format, you can test it right away, but if you want to use the new format, a tool is available here:

Best,

John Erik Halse

Lauren Ko

Sep 16, 2016, 3:18:58 PM
to openway...@googlegroups.com
Hi John Erik,
Thanks for all your work on the Resource Resolver and the cdx-cli. I tried them both successfully. I noticed a few things, but nothing major.

For the Resource Resolver I basically just did what was documented in the README: queried both /resource and /resourcelist, used old-style CDX and CDXJ, tried the various parameters listed, sent request headers for the different Accept values. Here are the issues I encountered (all were easily overcome).

When first trying to start up with openwayback-resource-resolver-3.0.0-SNAPSHOT/bin/warr I got:
 Exception in thread "main" java.lang.UnsupportedClassVersionError: org/netpreserve/resource/resolver/Main : Unsupported major.minor version 52.0
 
- I set $JAVA_HOME to Java 8 instead of the Java 7 that I had set by default on my machine.
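
For anyone else hitting this error: "major.minor version 52.0" refers to the class file format version, and 52 corresponds to Java 8 (51 is Java 7). Below is a minimal sketch of how that number can be read from a compiled class and mapped to a release. The class name ClassFileVersion and its methods are made up for illustration and are not part of the Resource Resolver.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper (not part of the Resource Resolver) for inspecting class file versions. */
final class ClassFileVersion {

    private static final int MAGIC = 0xCAFEBABE;

    /** Reads the major version from the first 8 bytes of a .class file. */
    static int majorVersion(byte[] header) {
        if (header.length < 8) {
            throw new IllegalArgumentException("Need at least 8 bytes of class file header");
        }
        // ByteBuffer is big-endian by default, which matches the class file format.
        ByteBuffer buf = ByteBuffer.wrap(header);
        if (buf.getInt() != MAGIC) {
            throw new IllegalArgumentException("Not a class file: bad magic number");
        }
        buf.getShort();                 // minor version, not needed here
        return buf.getShort() & 0xFFFF; // major version as an unsigned 16-bit value
    }

    /** Maps a major version to a Java release number; valid for Java 5 (major 49) and later. */
    static int javaRelease(int major) {
        if (major < 49) {
            throw new IllegalArgumentException("Mapping only covers major version 49 and later");
        }
        return major - 44;
    }
}
```

Feeding it the first 8 bytes of a class from the Resource Resolver jar should report major version 52 when the build targets Java 8, which is why a Java 7 runtime refuses to load it.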


Then trying to start again I got:
 java.lang.IllegalArgumentException: /tmp/warr/openwayback-resource-resolver-3.0.0-SNAPSHOT/cdx/index.cdx is not a recognized CDX format

 - I remembered OpenWayback 3 requires SURT-formatted CDX files, so I grabbed a SURT-formatted file.


Tried to start again:
 10:51:48.707 [main] INFO  org.netpreserve.commons.cdx.CdxSourceFactory - Loaded CDX Source Factory for scheme 'cdxfile'
 10:51:48.712 [main] INFO  org.netpreserve.commons.cdx.cdxsource.CdxFileSourceFactory - Adding all files in '/tmp/warr/openwayback-resource-resolver-3.0.0-SNAPSHOT/cdx' as cdx sources
 10:51:48.713 [main] INFO  org.netpreserve.commons.cdx.cdxsource.CdxFileSourceFactory - Adding file '/tmp/warr/openwayback-resource-resolver-3.0.0-SNAPSHOT/cdx/index_IA_surt.cdx' as a cdx source
 Exception in thread "main" java.lang.IllegalArgumentException: Negative position
    at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:670)
    at org.netpreserve.commons.cdx.cdxsource.CdxFileDescriptor.<init>(CdxFileDescriptor.java:70)
    at org.netpreserve.commons.cdx.cdxsource.CdxFileDescriptor.<init>(CdxFileDescriptor.java:55)
    at org.netpreserve.commons.cdx.cdxsource.CdxFileSourceFactory.createCdxSource(CdxFileSourceFactory.java:71)
    at org.netpreserve.commons.cdx.CdxSourceFactory.getCdxSource(CdxSourceFactory.java:62)
    at org.netpreserve.resource.resolver.settings.SettingsUtil.lambda$createCdxSource$0(SettingsUtil.java:38)
    at org.netpreserve.resource.resolver.settings.SettingsUtil$$Lambda$1/1279271200.apply(Unknown Source)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1359)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:512)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:502)
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
    at org.netpreserve.resource.resolver.settings.SettingsUtil.createCdxSource(SettingsUtil.java:40)
    at org.netpreserve.resource.resolver.ResourceResolverServer.<init>(ResourceResolverServer.java:69)
    at org.netpreserve.resource.resolver.Main.main(Main.java:33)

 - Turns out the first SURT-formatted CDX file I grabbed was 30GB and seemed to be too big to handle. I fed the first 1,000,000 lines to a new CDX file (359MB) and then it worked: Resource Resolver (v. 3.0.0-SNAPSHOT) started.
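
The thread does not say why the 30GB file triggered "Negative position". One common cause of this symptom in Java, offered here only as a guess, is a file offset that passes through a 32-bit int somewhere, so that anything past 2GB can wrap around to a negative number. The sketch below shows the effect; FileOffsets is a hypothetical class, not code from the Resource Resolver or commons.

```java
/** Hypothetical illustration of 32-bit offset overflow; not taken from the Resource Resolver. */
final class FileOffsets {

    /** Unsafe narrowing: silently wraps for offsets larger than Integer.MAX_VALUE. */
    static int narrow(long offset) {
        return (int) offset;
    }

    /** Safe narrowing: throws ArithmeticException instead of wrapping. */
    static int narrowChecked(long offset) {
        return Math.toIntExact(offset);
    }
}
```

With a 30GB offset (30L << 30) narrow() yields a negative value, while a 359MB offset (359L << 20) fits comfortably. Keeping offsets as long all the way to FileChannel.read avoids the problem entirely.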


I tried doing some searches, stopped the Resource Resolver, and upon trying to restart it I got:
 Exception in thread "main" java.lang.IllegalArgumentException: /tmp/warr/openwayback-resource-resolver-3.0.0-SNAPSHOT/cdx/.out.cdx.swp is not a recognized CDX format

 - At some point my system had created a .out.cdx.swp (my cdx file was called out.cdx). Not sure if Resource Resolver should ignore dot files or if it should just be up to the user to handle this sort of issue.
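
If ignoring dot files turns out to be the preferred behaviour, a filter along these lines could be applied when scanning the cdx directory. This is only a sketch of the idea; CdxDirectoryFilter and its methods are hypothetical names and do not reflect how CdxFileSourceFactory is actually written.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical directory scan that ignores dot files such as editor swap files. */
final class CdxDirectoryFilter {

    /** True for files whose name does not start with a dot. */
    static boolean isCandidate(Path file) {
        Path name = file.getFileName();
        return name != null && !name.toString().startsWith(".");
    }

    /** Lists regular, non-hidden files in a directory, sorted by path. */
    static List<Path> listCandidates(Path dir) throws IOException {
        // Files.list keeps a directory handle open, so close the stream when done.
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(CdxDirectoryFilter::isCandidate)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

With this in place, .out.cdx.swp would be left out of the list while out.cdx would still be picked up.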

 
- For a date range query, Resource Resolver did not include the exact start time match (README says start date is inclusive) when precision is down to the second. For example:
 
- Does not give me the entry in my CDX file with exact timestamp 2012-10-14T03:18:37.
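
To make the expected behaviour concrete, here is a small sketch of an inclusive-start range check. The exclusive end bound is an assumption on my part (the thread only states that the start is inclusive), and DateRangeCheck is a hypothetical name. A typical way to lose the exact match is to test with isAfter(from) instead of !isBefore(from).

```java
import java.time.LocalDateTime;

/** Hypothetical range check: start inclusive, end exclusive (the end bound is an assumption). */
final class DateRangeCheck {

    /** Returns true when from <= ts < to. */
    static boolean inRange(LocalDateTime ts, LocalDateTime from, LocalDateTime to) {
        // !isBefore(from) keeps a timestamp that is exactly equal to the start.
        return !ts.isBefore(from) && ts.isBefore(to);
    }
}
```

Under this definition an entry stamped 2012-10-14T03:18:37 is matched by a range that starts at exactly 2012-10-14T03:18:37.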


- Also relating to timestamps, though maybe not a problem with the application itself: the README says "The time stamp can be in either WARC-format (e.g. 2016-02-05T45:42:00Z)...". In my initial testing I copied and pasted that timestamp into my request URL without thinking and got a 500 error before realizing that the example is not a valid time (the hour is 45). My mistake, but perhaps the example timestamps in the README should be changed. Also, should an invalid time be handled so that it doesn't produce a 500?
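
One way the invalid time could be turned into a client error rather than a 500 is to validate the parameter up front. The sketch below uses java.time from the JDK; TimestampValidation and statusFor are hypothetical names, and it only covers the WARC-style format, not the other formats the README allows.

```java
import java.time.Instant;
import java.time.format.DateTimeParseException;

/** Hypothetical request validation; not the Resource Resolver's actual handler. */
final class TimestampValidation {

    /** Returns 200 for a parseable WARC-style timestamp and 400 otherwise. */
    static int statusFor(String timestamp) {
        if (timestamp == null) {
            return 400;
        }
        try {
            Instant.parse(timestamp); // rejects out-of-range fields such as hour 45
            return 200;
        } catch (DateTimeParseException e) {
            return 400;
        }
    }
}
```

Here the README example 2016-02-05T45:42:00Z maps to 400, while 2016-02-05T15:42:00Z is accepted.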


I also tried out cdx-cli to get a CDXJ formatted index. I used both the reformat and extract commands. I very much appreciate the thorough usage instructions that will print at the command line. I did have one issue in trying to convert an existing CDX file:

When the CDX file I was converting contained a | (pipe character) in a URL path (not in the query string; the CDX status codes for these URLs were 404s), cdx-cli raised an error and the reformatting stopped.

 $ cdxcli-1.0.0-SNAPSHOT/bin/cdxcli reformat -o ../openwayback-resource-resolver-3.0.0-SNAPSHOT/cdx/ -f cdxj -s -i out.cdx
Reformatting: out.cdx into: ../openwayback-resource-resolver-3.0.0-SNAPSHOT/cdx/out.cdxj
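
The error output is not shown above, so the cause is not known from this thread. As a general illustration of how a pipe can trip up URL handling in Java, java.net.URI rejects an unescaped | as an illegal character, whereas the percent-encoded form %7C parses. PipeInUrl below is a hypothetical class and is not how cdx-cli handles URLs.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical demonstration of strict URI parsing with pipe characters. */
final class PipeInUrl {

    /** True when java.net.URI accepts the string as written. */
    static boolean parses(String url) {
        try {
            new URI(url);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    /** Percent-encodes pipe characters so strict parsers accept the URL. */
    static String escapePipes(String url) {
        return url.replace("|", "%7C");
    }
}
```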

That is what I found in initial testing. Overall it worked well. Thanks again!

Lauren Ko
UNT Libraries

--
You received this message because you are subscribed to the Google Groups "openwayback-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openwayback-dev+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

John Erik Halse

Sep 22, 2016, 8:58:39 AM
to openwayback-dev
Hi Lauren,

Thanks for testing and finding bugs. I'm sure there are a lot more :-)

For the cdx-cli tool: I found the bug with the pipe symbol, it should work now.

I think I have also fixed the bugs in the Resource Resolver. I also updated the README to reflect that Java 8 is needed.

I totally agree that the Resource Resolver needs better error handling and error messages for incorrect input. It's on my todo list.

For the problem with invalid cdx-files, it will now log a warning and skip the invalid file instead of aborting.
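
The log-and-skip behaviour described here can be sketched roughly as follows. This is not the actual change: TolerantLoader is a hypothetical class, the opener function stands in for whatever creates a CDX source, and java.util.logging is used only to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.logging.Logger;

/** Hypothetical loader that skips unreadable files instead of aborting startup. */
final class TolerantLoader {

    private static final Logger LOG = Logger.getLogger(TolerantLoader.class.getName());

    /** Opens every file with the given opener, logging and skipping those it rejects. */
    static <T> List<T> loadAll(List<String> files, Function<String, T> opener) {
        List<T> sources = new ArrayList<>();
        for (String file : files) {
            try {
                sources.add(opener.apply(file));
            } catch (IllegalArgumentException e) {
                // An unrecognized format is reported, but the remaining files are still loaded.
                LOG.warning("Skipping " + file + ": " + e.getMessage());
            }
        }
        return sources;
    }
}
```

With a directory containing out.cdx and .out.cdx.swp, only out.cdx ends up in the returned list and the swap file produces a single warning.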

Thanks,

John Erik

Lauren Ko

Sep 26, 2016, 3:47:54 PM
to openway...@googlegroups.com
Hi John Erik,
I tried things out with the changes you made, and it looks like the issues I mentioned are all addressed. I did also notice a small logging issue when starting up WARR.


Thanks again!
Lauren
