Thanks again to Ilya Kreymer who wrote the initial version of PyWB and
Common Crawl's URL and WARC index (and the corresponding indexer) and
to the webrecorder project as the maintainer of PyWB.
Best,
Sebastian
Greg Lindahl
Mar 31, 2021, 5:39:41 PM
to common...@googlegroups.com
Sebastian,
The updated pywb changed the JSON object key name used when no captures
are found: it is now named "message" instead of "error". cdx_toolkit was
broken by the change. The just-released version 0.9.31 of cdx_toolkit
tolerates this change.
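For anyone with their own client code, here is a minimal sketch of how a client might tolerate both key names. It is not the actual cdx_toolkit fix; the function name `no_captures_notice` and the sample message text are hypothetical, and the assumption is that a "no captures" response is a single JSON object carrying one of the two keys.

```python
import json


def no_captures_notice(body: str):
    """Return the notice text if body is a single JSON object with a
    "message" (newer pywb) or "error" (older pywb) key, else None.

    Assumption: ordinary result bodies do not carry either key.
    """
    try:
        obj = json.loads(body)
    except json.JSONDecodeError:
        # Not a single JSON document (e.g. several JSON lines): no notice.
        return None
    if not isinstance(obj, dict):
        return None
    # Check the new key first, then fall back to the old one.
    for key in ("message", "error"):
        if key in obj:
            return str(obj[key])
    return None


# Hypothetical response bodies, for illustration only.
print(no_captures_notice('{"message": "No Captures found"}'))
print(no_captures_notice('{"error": "No Captures found"}'))
```

Checking both keys keeps the same client working against the old and the new index server.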
I did look at test-index when you announced it; however, a bug in my
content-download tests that didn't tolerate revisit records for
example.com distracted me from noticing this API change!
to Common Crawl
Hi Greg,
Thanks for the fix, and sorry for the issue. I sampled a few thousand queries from the logs and tried to cover most user-agents while sampling. A couple of regressions were detected when comparing the responses of the old and new systems, most of them about proper HTTP status codes, especially sending HTTP 400 "Bad Request" when the query is invalid (e.g. page out of range) rather than HTTP 500 "Internal Server Error". The behavior is now even more consistent on this point. See the issues and PRs pushed upstream to webrecorder/pywb.
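The 400 versus 500 distinction matters for clients because it tells them whether the request or the server is at fault. A minimal sketch of how a client might act on it follows; the function names and the retry policy are assumptions for illustration, not something pywb or the index server prescribes.

```python
def classify_status(status: int) -> str:
    """Map an HTTP status code to a coarse handling category."""
    if 200 <= status <= 299:
        return "ok"
    if 400 <= status <= 499:
        # e.g. 400 "Bad Request" for an invalid query (page out of range):
        # the request itself has to be fixed.
        return "client_error"
    if 500 <= status <= 599:
        # e.g. 500 "Internal Server Error": the fault is on the server side.
        return "server_error"
    return "other"


def should_retry(status: int) -> bool:
    """Assumed policy: only server-side errors are worth retrying unchanged."""
    return classify_status(status) == "server_error"
```

With such a policy, an invalid query that now returns 400 is reported to the user at once instead of being retried.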