Sort algorithm for cdx files?

95 views
Skip to first unread message

Henry S. Thompson

unread,
Aug 11, 2023, 12:50:29 PM8/11/23
to common-crawl
I've tried to reproduce the sorting algorithm used to order the index
files, without success. For authority URL segments with all-ascii
components, it's reasonably obvious, but add in IPV4 and IDNA
addresses and even the occasional oddities such as:

x50b85467:2021)
0x6b,0x21,0x61,0xb4)

and I can't find any unix command-line sort parameters such that e.g.

gunzip -c ..../cdx-00000.gz | cut -f 1 -d \| uniq | sort ....

doesn't change anything, i.e. is the same as

gunzip -c ..../cdx-00000.gz | cut -f 1 -d \| uniq

I guess there are two parts to this question:
1) What parameters to what algorithm are used to produce the SURT
(sort key version of the authority segment) from the WARC-Target-URI?
2) How are the resulting values ordered?

Thanks,

ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: h...@inf.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]

Sebastian Nagel

unread,
Aug 12, 2023, 12:51:23 PM8/12/23
to common...@googlegroups.com
Hi Henry,

what does the following command return?

zcat ..../cdx-00000.gz | cut -d' ' -f1 | LC_ALL=C sort -c

Explanations:
- decompress the cdx-00000.gz to stdout
- extract the first field (separator is U+0020
- let sort verify the sort order using the C locale, i.e.
sorting the raw bytes (and not by Unicode characters or
any special character semantics). If properly sorted
no error should be reported.


> 1) What parameters to what algorithm are used to produce the SURT
> (sort key version of the authority segment) from the WARC-Target-
> URI?

See the "surt" Python package [1] used from [2] via PyWb [3].
Resp. the WaybackURLKeyMaker [4] for Java programs.


> 2) How are the resulting values ordered?

A MapReduce job sorts the output by key (the SURT URL)
in ascending ASCII order. That's the default.

The only magic is the TotalOrderPartitioner, see the
zipnumclusterjob.py in [2]. This guarantees that the
concatenation of cdx-00000.gz ... cdx-00299.gz is
"globally" sorted.


Best,
Sebastian


[1] https://pypi.org/project/surt/
[2] https://github.com/commoncrawl/webarchive-indexing
[3] https://pywb.readthedocs.io/en/latest/
[4]
https://github.com/iipc/webarchive-commons/blob/master/src/main/java/org/archive/url/WaybackURLKeyMaker.java

Ed Summers

unread,
Aug 12, 2023, 1:25:42 PM8/12/23
to common...@googlegroups.com
On Aug 12, 2023 11:51 AM, Sebastian Nagel <seba...@commoncrawl.org> wrote:

 > 1) What parameters to what algorithm are used to produce the SURT
>   (sort key version of the authority segment) from the WARC-Target-
>    URI?

See the "surt" Python package [1] used from [2] via PyWb [3].
Resp. the WaybackURLKeyMaker [4] for Java programs.


Does section 4 in the CDXJ Specification help at all? I believe the definition applies for CDX too.


//Ed

Henry S. Thompson

unread,
Aug 13, 2023, 5:13:38 PM8/13/23
to common...@googlegroups.com
Sebastian Nagel writes:

> Hi Henry,
>
> what does the following command return?
>
> zcat ..../cdx-00000.gz | cut -d' ' -f1 | LC_ALL=C sort -c

That works, i.e. no output, returns 0.

Thanks, I think that will enable me to do what I need to.

>> 1) What parameters to what algorithm are used to produce the SURT
>> (sort key version of the authority segment) from the WARC-Target-
>> URI?
>
> See the "surt" Python package [1] used from [2] via PyWb [3].
> Resp. the WaybackURLKeyMaker [4] for Java programs.

Will look in to that tomorrow, thanks!

>> 2) How are the resulting values ordered?
>
> A MapReduce job sorts the output by key (the SURT URL)
> in ascending ASCII order. That's the default.
>
> The only magic is the TotalOrderPartitioner, see the
> zipnumclusterjob.py in [2]. This guarantees that the
> concatenation of cdx-00000.gz ... cdx-00299.gz is
> "globally" sorted.

Likewise.

Thanks!

Henry S. Thompson

unread,
Aug 22, 2023, 12:04:47 PM8/22/23
to common...@googlegroups.com
Sebastian Nagel writes:
> HST wrote:
>> 1) What parameters to what algorithm are used to produce the SURT
>> (sort key version of the authority segment) from the WARC-Target-
>> URI?
>
> See the "surt" Python package [1] used from [2] via PyWb [3].
> Resp. the WaybackURLKeyMaker [4] for Java programs.

OK, lots of progress as a result of that, _but_, after some success in
my task which involves regenerating the sort key from the
WARC-Target-URI, I hit the following problem:

WARC file 2019-35...17...00251.warc has
WARC-Target-URI: https://www.insbase.ac/xoops2/modules/xpwiki/?%A4%D5%A4%AF%A4%AA%A4%AB%B8%A9%A4%AA%A4%AA%A4%CE%A4%B8%A4%E7%A4%A6%BB%D4

cdx-00000 has same value for "url", but key is
ac,insbase)/xoops2/modules/xpwiki?%25a4%25a2%25a4%25a4%25a4%25c1%25b8%25a9%25a4%25a2%25a4%f3%a4%b8%a4%e7%a4%a6%25bb%25d4

That is, for every % in the URI's path, the key contains a
percent-encoded %, i.e. %25.

Whereas my code using surt.surt, and both surt.surt and pywb.utils.canonicalize.canonicalize by themselves, produce
ac,insbase)/xoops2/modules/xpwiki?%a4%d5%a4%af%a4%aa%a4%ab%b8%a9%a4%aa%a4%aa%a4%ce%a4%b8%a4%e7%a4%a6%bb%d4

Working through "used from [2] via PyWb [3]", I conclude this amounts
to using
pywb.indexer.archiveindexer.DefaultRecordParser to build the index
entries, which in turn uses

entry['urlkey'] = canonicalize(entry['url'], surt_ordered)

to produce the key. I've tried and failed to find anywhere in
pywb.indexer.cdxindexer where the urlkey is modified as part of the
index _writing_ process. Any clues?

Sebastian Nagel

unread,
Aug 22, 2023, 12:41:22 PM8/22/23
to common...@googlegroups.com
Hi Henry,

there is an unresolved issue [1] in the Java SURT implementation
which is likely the cause (I haven't verified it yet).

The URL uses percent-encoded characters not in UTF-8, could be
big5 instead.

There was a similar issue in the Python SURT package if used with
Python 3 [2]. However, it looks like it's fixed (maybe by [3]).

I need to have a close look into the current state.

> to produce the key. I've tried and failed to find anywhere in
> pywb.indexer.cdxindexer where the urlkey is modified as part of the
> index _writing_ process. Any clues?

The CDX files are written from Java, see [4].

Best,
Sebastian


[1] https://github.com/commoncrawl/ia-web-commons/issues/6
[2] https://github.com/internetarchive/surt/issues/19
[3]
https://github.com/internetarchive/surt/commit/6b8e656d7f916155dab79186beae4d31821a6c50
[4]
https://github.com/commoncrawl/nutch/blob/cc/src/java/org/commoncrawl/util/WarcCdxWriter.java

Tom Morris

unread,
Aug 23, 2023, 7:17:32 PM8/23/23
to common...@googlegroups.com
On Tue, Aug 22, 2023 at 12:41 PM Sebastian Nagel <seba...@commoncrawl.org> wrote:

The URL uses percent-encoded characters not in UTF-8, could be
big5 instead.

It looks like the URL encoding matches that declared in the HTML head - EUC-JP.
The UTF-8 equivalent is ふくおか県おおのじょう市 (ie the name of the wiki page).

Tom
 

Henry S. Thompson

unread,
Aug 24, 2023, 5:01:19 AM8/24/23
to common...@googlegroups.com
Tom Morris writes:

> It looks like the URL encoding matches that declared in the HTML
> head - EUC-JP. The UTF-8 equivalent is ふくおか県おおのじょう市 (ie
> the name of the wiki page).
> https://www.insbase.ac/xoops2/modules/xpwiki/?%A4%D5%A4%AF%A4%AA%A4%AB%B8%A9%A4%AA%A4%AA%A4%CE%A4%B8%A4%E7%A4%A6%BB%D4

Interesting, and relatively common for query strings. However it's
worth noting that it isn't conformant with RFC 3987 [1], which covers
the mapping from non-ASCII URI-like strings to URIs. This requires
conversion to Unicode and then UTF-8 before %-encoding.

Further to Sebastian's earlier message, it's worth noting that the
Java code which produces the SURT in CC's index files is failing to
follow advice in RFC 3986 [2] (but is not strictly speaking
non-conformant):

"For consistency, URI producers and normalizers should use
uppercase hexadecimal digits for all percent- encodings."

Python's SURT implementation has the same property.

ht

[1] https://datatracker.ietf.org/doc/html/rfc3987#section-2
[2] https://datatracker.ietf.org/doc/html/rfc3986#section-2.1

Henry S. Thompson

unread,
Aug 24, 2023, 8:53:50 AM8/24/23
to common...@googlegroups.com
I believe the following python3 definition of cdx_key works to
correctly* convert a WARC-Target-URI to a cdx index key:

from surt import surt
from urllib.parse import quote, unquote

import codecs

def percent_encode(ude):
return (''.join('%%%X'%c for c in ude.object[ude.start:ude.end]),
ude.end)

codecs.register_error('percent',percent_encode)

def cdx_key(uristring):
return quote(unquote(surt(uristring),
errors='percent'),
safe='/,:)?=').lower()

Further testing may reveal that more characters need to be added to
the safe argument to quote.

ht

*correct, as in produces the same key created by the Java code
used to produce the 2019-35 index files. See my previous message for
why the ".lower()" is IMO not 'correct'.

Tom Morris

unread,
Aug 27, 2023, 1:29:04 PM8/27/23
to common...@googlegroups.com
On Thu, Aug 24, 2023 at 8:53 AM 'Henry S. Thompson' via Common Crawl
<common...@googlegroups.com> wrote:
>
> I believe the following python3 definition of cdx_key works to
> correctly* convert a WARC-Target-URI to a cdx index key:
...
> *correct, as in produces the same key created by the Java code
> used to produce the 2019-35 index files.

I think Sebastian was proposing fixing the other end of things i.e.
the Java code that generates the CDX files.

I've put up a PR which addresses that bug and matches the output of
the Python SURT module:
https://github.com/commoncrawl/ia-web-commons/pull/28

Of course, this doesn't fix any of the (many) CDX files which have
already been written.

Tom

Henry S. Thompson

unread,
Aug 28, 2023, 4:12:06 AM8/28/23
to common...@googlegroups.com
Tom Morris writes:

> On Thu, Aug 24, 2023 at 8:53?AM 'Henry S. Thompson' via Common Crawl
> <common...@googlegroups.com> wrote:
>>
>> I believe the following python3 definition of cdx_key works to
>> correctly* convert a WARC-Target-URI to a cdx index key:
> ...
>> *correct, as in produces the same key created by the Java code
>> used to produce the 2019-35 index files.
>
> I think Sebastian was proposing fixing the other end of things i.e.
> the Java code that generates the CDX files.

Indeed, I understand and agree.

> I've put up a PR which addresses that bug and matches the output of
> the Python SURT module:
> https://github.com/commoncrawl/ia-web-commons/pull/28

Thanks!

> Of course, this doesn't fix any of the (many) CDX files which have
> already been written.

It was for use with those that I posted (and am about to update) my
code snippet.

ht

Tom Morris

unread,
Aug 28, 2023, 1:55:16 PM8/28/23
to common...@googlegroups.com
On Tue, Aug 22, 2023 at 12:41 PM Sebastian Nagel
<seba...@commoncrawl.org> wrote:

> There was a similar issue in the Python SURT package if used with
> Python 3 [2]. However, it looks like it's fixed (maybe by [3]).
...
I confirmed that the issue in the Python SURT package is fixed and put
Sebastien's test from the issue into a PR along with a couple of the
tests from the Java package.
https://github.com/internetarchive/surt/pull/30

Tom
Reply all
Reply to author
Forward
0 new messages