Planned upgrade of the URL index server


Sebastian Nagel

Mar 23, 2021, 3:27:09 PM
to Common Crawl
Hi everybody,

we plan to upgrade the URL index server to be based on PyWB 2.5
next week. A test system is available at
https://test-index.commoncrawl.org/

If no regressions are found, we will switch index.commoncrawl.org
to the new version on Tuesday, March 30, at 12:00 UTC.

We tried to fix any incompatibilities in the new system and make it
behave almost the same as the old system. However, there are a few changes:

- filters now behave as documented in the API docs
  https://pywb.readthedocs.io/en/latest/manual/cdxserver_api.html#filter
  https://github.com/webrecorder/pywb/wiki/CDX-Server-API#filter
  In detail, regex and "contains" filters are now applied correctly:
    filter=~field:pattern - regex filter
    filter=field:string - "contains" filter

- the "Content-Type" HTTP header of successful results is now
  "text/x-ndjson"

- while the parameter "fl" is still supported you may now use the
  param "fields" instead
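
The changes listed above could be exercised with a small client like the
following. This is only a sketch: the collection name in the endpoint, the
helper names, and the example field names ("mime", "status") are assumptions
chosen for illustration and are not taken from the announcement. It builds a
query URL that uses a regex filter, a "contains" filter and the "fields"
parameter, and it parses a newline-delimited JSON body (one JSON object per
line) as announced by "text/x-ndjson".

```python
import json
from urllib.parse import urlencode

# Assumed endpoint: the collection name is only an example.
INDEX_ENDPOINT = "https://index.commoncrawl.org/CC-MAIN-2021-10-index"


def build_query_url(url_pattern, filters=(), fields=(), endpoint=INDEX_ENDPOINT):
    """Build a CDX query URL (hypothetical helper, not part of PyWB)."""
    params = [("url", url_pattern), ("output", "json")]
    # The "filter" parameter may be repeated, once per filter expression.
    params += [("filter", expr) for expr in filters]
    if fields:
        # "fields" is the new name; "fl" is still accepted by the server.
        params.append(("fields", ",".join(fields)))
    return endpoint + "?" + urlencode(params)


def parse_ndjson(body):
    """Parse newline-delimited JSON: one object per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


query = build_query_url(
    "commoncrawl.org/*",
    filters=[
        "~mime:text/.*",  # regex filter: filter=~field:pattern
        "status:200",     # "contains" filter: filter=field:string
    ],
    fields=["url", "status", "mime"],
)
```

The resulting URL can be fetched with any HTTP client and the response body
passed to parse_ndjson().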


The configuration of the new server setup is found in
https://github.com/commoncrawl/cc-index-server/tree/pywb2
There are a couple of changes to make PyWB more compatible with the
behavior of the older server - these are found in
https://github.com/commoncrawl/pywb/tree/common-crawl-cdx-index

Thanks again to Ilya Kreymer who wrote the initial version of PyWB and
Common Crawl's URL and WARC index (and the corresponding indexer) and
to the webrecorder project as the maintainer of PyWB.

Best,
Sebastian

Greg Lindahl

Mar 31, 2021, 5:39:41 PM
to common...@googlegroups.com
Sebastian,

The updated pywb changed the json object key name when no captures are
found: it is now named "message" instead of "error". cdx_toolkit was
broken by the change. The just-released version 0.9.31 of cdx_toolkit
tolerates this change.
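
A client that has to work against both the old and the new server could
accept either key name. The sketch below is not the code used in cdx_toolkit;
the function name and the sample payloads in the usage note are made up for
illustration, and it assumes that regular capture records contain neither a
"message" nor an "error" field.

```python
import json


def no_capture_message(body):
    """Return the server's explanation if `body` is a "no captures"
    object, otherwise None (hypothetical helper)."""
    stripped = body.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    try:
        obj = json.loads(first_line)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    if not isinstance(obj, dict):
        return None
    # New server: "message"; old server: "error".
    for key in ("message", "error"):
        if key in obj:
            return obj[key]
    return None
```

For example, no_capture_message('{"message": "nothing found"}') and
no_capture_message('{"error": "nothing found"}') both return the text,
while a line describing a capture returns None.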

I did look at test-index when you announced it; however, a bug in my
content-download tests, which didn't tolerate revisit records for
example.com, distracted me from noticing this API change!

-- greg

Sebastian Nagel

Apr 1, 2021, 1:23:23 AM
to Common Crawl
Hi Greg,

thanks for the fix. And sorry for the issue. I sampled a few thousand queries from the logs and tried to cover most user-agents while sampling. A couple of regressions were detected when comparing the responses of the old and the new system, most of them about proper HTTP status codes, especially sending HTTP 400 "Bad Request" when the query is invalid (e.g. page out of range) instead of HTTP 500 "Internal Server Error". The behavior is now even more consistent on this point. See the issues and PRs pushed upstream to webrecorder/pywb.
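
For client authors, the distinction between the two status classes matters
mainly for retry logic. The following is a rough sketch under the assumption
that a 4xx response means the request will not succeed if repeated unchanged,
while a 5xx response may be worth retrying later; the function name and the
category labels are hypothetical.

```python
from http import HTTPStatus


def classify_status(status_code):
    """Map an HTTP status code to a coarse category (hypothetical helper)."""
    if 200 <= status_code < 300:
        return "ok"
    if HTTPStatus.BAD_REQUEST <= status_code < HTTPStatus.INTERNAL_SERVER_ERROR:
        # 4xx, e.g. 400 "Bad Request" for a page number out of range:
        # change the query instead of retrying it.
        return "client-error"
    if HTTPStatus.INTERNAL_SERVER_ERROR <= status_code < 600:
        # 5xx, e.g. 500 "Internal Server Error": a retry may succeed later.
        return "server-error"
    return "other"
```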

Thanks again,
Sebastian

