Sebastian Nagel writes:
> HST wrote:
>> 1) What parameters to what algorithm are used to produce the SURT
>> (sort key version of the authority segment) from the WARC-Target-
>> URI?
>
> See the "surt" Python package [1] used from [2] via PyWb [3].
> Resp. the WaybackURLKeyMaker [4] for Java programs.
OK, lots of progress as a result of that, _but_ after some success in
my task, which involves regenerating the sort key from the
WARC-Target-URI, I hit the following problem:
WARC file 2019-35...17...00251.warc has
WARC-Target-URI:
https://www.insbase.ac/xoops2/modules/xpwiki/?%A4%D5%A4%AF%A4%AA%A4%AB%B8%A9%A4%AA%A4%AA%A4%CE%A4%B8%A4%E7%A4%A6%BB%D4
cdx-00000 has the same value for "url", but the key is
ac,insbase)/xoops2/modules/xpwiki?%25a4%25a2%25a4%25a4%25a4%25c1%25b8%25a9%25a4%25a2%25a4%f3%a4%b8%a4%e7%a4%a6%25bb%25d4
That is, for every % in the URI's query string, the key contains a
percent-encoded %, i.e. %25.
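
To be concrete about what I mean by a percent-encoded %, here is a
minimal stdlib-only sketch. The helper name double_encode_percent and
the two-byte sample string are made up for illustration; this is not a
claim about what pywb or the index writer actually does, only a
picture of the effect.

```python
from urllib.parse import quote, unquote


def double_encode_percent(s: str) -> str:
    """Hypothetical helper: escape every literal '%' as '%25'."""
    return s.replace("%", "%25")


once = "%a4%d5"                      # an already percent-encoded byte pair
twice = double_encode_percent(once)  # each '%' has become '%25'

# urllib's quote() does the same thing to '%', since '%' is never "safe".
assert quote(once, safe="") == twice

# A single unquote() strips exactly one layer of encoding again.
assert unquote(twice) == once
```

So a key of this shape looks like the result of quoting a string that
was already quoted once.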
By contrast, my code (which uses surt.surt), as well as surt.surt and
pywb.utils.canonicalize.canonicalize called directly, all produce
ac,insbase)/xoops2/modules/xpwiki?%a4%d5%a4%af%a4%aa%a4%ab%b8%a9%a4%aa%a4%aa%a4%ce%a4%b8%a4%e7%a4%a6%bb%d4
Working through "used from [2] via PyWb [3]", I conclude this amounts
to using
pywb.indexer.archiveindexer.DefaultRecordParser to build the index
entries, which in turn uses
entry['urlkey'] = canonicalize(entry['url'], surt_ordered)
to produce the key. I've tried and failed to find anywhere in
pywb.indexer.cdxindexer where the urlkey is modified as part of the
index _writing_ process. Any clues?