Planned upgrade of the URL index server

Sebastian Nagel

Mar 23, 2021, 3:27:09 PM
to Common Crawl
Hi everybody,

we plan to upgrade the URL index server to be based on PyWB 2.5
next week. A test system is available at

If no regressions are found, we will switch
to the new version on Tuesday, March 30 12:00 UTC.

We tried to fix any incompatibilities of the new system and make it
behave almost the same as the old system. However, there are a few changes:

- filters now behave as documented in the API docs
In detail, regex and "contains" filters are now applied correctly:
filter=~field:pattern - regex filter
filter=field:string - "contains" filter

- the "Content-Type" HTTP header of successful results is now

- while the parameter "fl" is still supported you may now use the
param "fields" instead
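
As a rough illustration of the filter and "fields" syntax listed above, a query URL could be assembled as in the following sketch. The endpoint, the "url" parameter, the field names "status" and "mime", and the comma-separated form of "fields" are assumptions made for the example and are not taken from this announcement.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
INDEX_ENDPOINT = "https://index.example.org/EXAMPLE-index"


def build_query(url_pattern, regex_filters=(), contains_filters=(), fields=()):
    """Build a query URL using the filter syntax described above.

    regex_filters and contains_filters are sequences of (field, value) pairs.
    """
    # Assumption: the lookup pattern is passed in a parameter named "url".
    params = [("url", url_pattern)]
    for field, pattern in regex_filters:
        # filter=~field:pattern is the regex filter
        params.append(("filter", f"~{field}:{pattern}"))
    for field, text in contains_filters:
        # filter=field:string is the "contains" filter
        params.append(("filter", f"{field}:{text}"))
    if fields:
        # Assumption: "fields" takes a comma-separated list, like "fl".
        params.append(("fields", ",".join(fields)))
    # urlencode percent-encodes the values and repeats "filter" once per entry.
    return INDEX_ENDPOINT + "?" + urlencode(params)


query = build_query(
    "example.com/*",
    regex_filters=[("status", "^2")],
    contains_filters=[("mime", "html")],
    fields=["url", "status"],
)
print(query)
```

Passing the parameters as a list of pairs keeps repeated "filter" entries in order, which a plain dict would not allow.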

The configuration of the new server setup is found in
There are a couple of changes to make PyWB more compatible with the
behavior of the older server - these are found in

Thanks again to Ilya Kreymer who wrote the initial version of PyWB and
Common Crawl's URL and WARC index (and the corresponding indexer) and
to the webrecorder project as the maintainer of PyWB.


Greg Lindahl

Mar 31, 2021, 5:39:41 PM

The updated pywb changed the json object key name when no captures are
found: it is now named "message" instead of "error". cdx_toolkit was
broken by the change. The just-released version 0.9.31 of cdx_toolkit
tolerates this change.
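
A client can stay tolerant of this kind of rename by checking both key names. The following is only a minimal sketch of that idea, not the actual cdx_toolkit code; the function name is hypothetical, and it assumes the "no captures" response body is a single JSON object.

```python
import json


def no_captures_reason(body):
    """Return the server's explanation if body is a "no captures" object.

    Accepts both the new key ("message") and the old key ("error").
    Returns None if the body does not look like such an object.
    """
    try:
        obj = json.loads(body)
    except json.JSONDecodeError:
        # Not a single JSON document, e.g. several result lines.
        return None
    if not isinstance(obj, dict):
        return None
    for key in ("message", "error"):
        if key in obj:
            return str(obj[key])
    return None
```

Checking "message" first and falling back to "error" lets the same client code talk to both the old and the new server.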

I did look at test-index when you announced it; however, a bug in my
content-download tests that didn't tolerate revisit records for distracted me from noticing this API change!

-- greg

Sebastian Nagel

Apr 1, 2021, 1:23:23 AM
to Common Crawl
Hi Greg,

thanks for the fix, and sorry for the issue. I've sampled a few thousand queries from the logs and tried to cover most user-agents while sampling. A couple of regressions were detected when comparing responses of the old and new systems, most of them about proper HTTP codes, esp. sending HTTP 400 "Bad Request" when the query is invalid (e.g. page out of range) and not HTTP 500 "Internal Server Error". The behavior is now even more consistent on this point. See issues and PRs pushed upstream to webrecorder/pywb.
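
On the client side, the distinction between the two status codes could be used roughly as in the sketch below. This is an illustration under assumptions, not part of the server or of any existing client; the function name and the retry policy are hypothetical.

```python
from http import HTTPStatus


def should_retry(status_code):
    """Decide whether repeating the same request is worthwhile.

    A 4xx code such as 400 "Bad Request" means the query itself is invalid
    (e.g. page out of range), so sending it again unchanged will not help.
    A 5xx code such as 500 "Internal Server Error" points at the server,
    so a later retry may succeed.
    """
    if 400 <= status_code < 500:
        return False
    if 500 <= status_code < 600:
        return True
    # Success and redirect codes need no retry.
    return False


for code in (HTTPStatus.OK, HTTPStatus.BAD_REQUEST, HTTPStatus.INTERNAL_SERVER_ERROR):
    print(int(code), code.phrase, "retry" if should_retry(code) else "no retry")
```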

Thanks again,
