In the index for the 2023-40 crawl, there are over 100,000 entries
with URIs containing very long strings of %-escaped unicode FFFD
(Replacement Character), some as long as 50 consecutive instances of
%EF%BF%BD.
Such strings are very rare in e.g. the index for 2019-35, numbering
only around 1500.
Given the indices I have downloaded, I can narrow down the change to
somewhere between 2021-25 and 2021-31:
>: fgrep -c '%ef%bf%bd' CC-MAIN-2021-25/cdx/cluster.idx
128
>: fgrep -c '%ef%bf%bd' CC-MAIN-2021-31/cdx/cluster.idx
7074
[Stop reading now unless this is of relevance/interest to you: what
follows is just a report of my efforts to find out more about what's
happened.]
And, perhaps more interestingly:
>: fgrep -c '%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd' CC-MAIN-2021-25/cdx/cluster.idx
20
>: fgrep -c '%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd' CC-MAIN-2021-31/cdx/cluster.idx
3724
Of course only the most common domains exhibiting this make it into
the secondary index, for example on line 350018 of the index for 2023-40:
com,hket,invest)/article/3532920/%ef%bf%bd%ef%bf%bd%ef%bf%bd...[repeats for
6 more lines]?mtc=40001&srkw=%e7%be%8e%e5%9c%8b%e5%85%b1%e5%92%8c%e9%bb%a8 20230925231423
Following that entry to the block at offset 737595399, length 199309
in cdx-00078.gz, the first line of the block in turn points to the
response at offset 358674722, length 1468 in
segments/1695233510100.47/warc/CC-MAIN-20230925215547-20230926005547-00202.warc.gz,
where we find
WARC-Target-URI: https://invest.hket.com/article/3532920/%EF%BF%BD%EF%BF%BD...
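For anyone wanting to repeat that lookup, here is a minimal Python sketch of pulling one record out of a .warc.gz (or a block out of a cdx-*.gz) by offset and length using an HTTP Range request. The function names are made up for illustration, and the full URL of the file is left as a parameter because the path quoted above is relative to the crawl's own directory.

```python
import gzip
import urllib.request


def byte_range(offset: int, length: int) -> str:
    """Range header value covering `length` bytes starting at `offset`."""
    # HTTP byte ranges are inclusive at both ends, hence the -1.
    return f"bytes={offset}-{offset + length - 1}"


def fetch_record(url: str, offset: int, length: int) -> bytes:
    """Fetch one gzip member from a remote file and return it decompressed.

    Assumes the server honours Range requests and that offset/length
    delimit exactly one complete gzip member, as the index entries do.
    """
    req = urllib.request.Request(url, headers={"Range": byte_range(offset, length)})
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read())


# The two lookups described above, as Range header values:
print(byte_range(737595399, 199309))  # block in cdx-00078.gz
print(byte_range(358674722, 1468))    # response record in the WARC file
```

Only byte_range is exercised here; fetch_record needs network access and the complete URL of the file.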
So the index is consistent with the WARC file entry. And indeed,
somewhat surprisingly, wget for that URI produces the same result
we find in the WARC file. So the long string of FFFD code points is
irrelevant to the response: deleting some or all of the FFFDs,
and even the query string, doesn't affect the result.
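To make the "deleting some or all of the FFFDs" step concrete, a small Python sketch that confirms %EF%BF%BD decodes to U+FFFD, measures the longest run in a URI, and strips the runs out. The helper names and the sample URI are hypothetical, not taken from the index.

```python
import re
from urllib.parse import unquote

# One %-escaped U+FFFD is 9 characters; match runs of them in either case.
FFFD_RUN = re.compile(r"(?:%ef%bf%bd)+", re.IGNORECASE)


def longest_fffd_run(uri: str) -> int:
    """Number of consecutive escaped U+FFFD in the longest run found in uri."""
    return max((len(m.group(0)) // 9 for m in FFFD_RUN.finditer(uri)), default=0)


def strip_fffd(uri: str) -> str:
    """Return uri with every run of escaped U+FFFD removed."""
    return FFFD_RUN.sub("", uri)


# Hypothetical example URI, for illustration only.
sample = "https://example.org/article/1/%EF%BF%BD%ef%bf%bd%EF%BF%BD?x=1"
assert unquote("%EF%BF%BD") == "\ufffd"
print(longest_fffd_run(sample))
print(strip_fffd(sample))
```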
There does seem to be something systematic changing, maybe just in the
distribution of domains wrt languages using non-ASCII charsets:
>: fgrep '%ef%bf%bd' CC-MAIN-2021-25/cdx/cluster.idx | cut -f 1 -d \) | uniq | wc -l
99
>: fgrep '%ef%bf%bd' CC-MAIN-2021-31/cdx/cluster.idx | cut -f 1 -d \) | uniq | wc -l
5814
>: fgrep '%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd' CC-MAIN-2021-25/cdx/cluster.idx | cut -f 1 -d \) | uniq | wc -l
21
>: fgrep '%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd' CC-MAIN-2021-31/cdx/cluster.idx | cut -f 1 -d \) | uniq | wc -l
3223
So the increase in the overall FFFD count is basically down to an
increase in the number of distinct domains exhibiting the phenomenon.
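For reference, a rough Python equivalent of the fgrep | cut | uniq | wc pipeline used for the domain counts. It collects hosts in a set, which gives the same answer as uniq here only because cluster.idx is sorted by key. The function name and the sample lines are invented for illustration.

```python
from typing import Iterable


def distinct_hosts(lines: Iterable[str], needle: str = "%ef%bf%bd") -> int:
    """Count distinct SURT host parts among lines that contain needle."""
    hosts = set()
    for line in lines:
        if needle in line:
            # Same cut as `cut -f 1 -d \)`: everything before the first ')'.
            hosts.add(line.split(")", 1)[0])
    return len(hosts)


# Made-up lines in the shape of cluster.idx keys, for illustration only.
sample = [
    "com,example)/a/%ef%bf%bd%ef%bf%bd 20210612000000",
    "com,example)/b/%ef%bf%bd 20210612000001",
    "org,example)/c/%ef%bf%bd 20210612000002",
    "org,example)/d/plain 20210612000003",
    "net,example)/e/plain 20210612000004",
]
print(distinct_hosts(sample))
```

To run it over a real index, pass an open file: distinct_hosts(open("CC-MAIN-2021-31/cdx/cluster.idx", encoding="utf-8")).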
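There is a change in the relative proportion of non-200 responses
involved (the crawldiagnostics captures below), from about 30% to
about 50% in this sample of ten index files: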
>: uz CC-MAIN-2021-25/cdx/warc/cdx-0015[0-9].gz | fgrep '%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd' | egrep -o '"filename": "[^"]*' |cut -f 5 -d / | sus
2631 warc
1113 crawldiagnostics
>: uz CC-MAIN-2021-31/cdx/warc/cdx-0015[0-9].gz | fgrep '%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd' | egrep -o '"filename": "[^"]*' |cut -f 5 -d / | sus
157960 crawldiagnostics
157160 warc
('sus' is an alias for 'sort "$@" | uniq -c | sort -k1nr,1', uz is an
alias for 'igzip -dc "$@"')
The distribution of status codes for the crawldiagnostics cases has one
big difference:
>: uz CC-MAIN-2021-25/cdx/warc/cdx-0015[0-9].gz | fgrep '%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd' | egrep '"filename": "[^"]*diagnostics' |egrep -o '"status": "..."'|cut -f 4 -d \" | sus
609 301
264 403
182 302
46 404
4 500
4 502
3 503
1 400
>: uz CC-MAIN-2021-31/cdx/warc/cdx-0015[0-9].gz | fgrep '%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd' | egrep '"filename": "[^"]*diagnostics' |egrep -o '"status": "..."'|cut -f 4 -d \" | sus
85056 404
35580 301
22931 302
5994 400
2791 403
2476 500
417 503
410 414
370 307
361 308
301 429
404 has gone from ~4% to ~54%.
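The arithmetic behind those two percentages, as a short Python check. The denominators are the crawldiagnostics totals from the earlier listing (1113 and 157960); the variable and function names are just for illustration.

```python
def pct(part: int, whole: int) -> float:
    """part as a percentage of whole."""
    return 100.0 * part / whole


# 2021-25: the status listing above is complete and sums to the
# crawldiagnostics total reported earlier.
status_2021_25 = {"301": 609, "403": 264, "302": 182, "404": 46,
                  "500": 4, "502": 4, "503": 3, "400": 1}
total_2021_25 = sum(status_2021_25.values())

# 2021-31: the codes shown above add up to less than the crawldiagnostics
# total, so that total (157960) is used as the denominator instead.
count_404_2021_31 = 85056
total_2021_31 = 157960

print(f"2021-25: {pct(status_2021_25['404'], total_2021_25):.1f}%")
print(f"2021-31: {pct(count_404_2021_31, total_2021_31):.1f}%")
```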
That's as far as I've gotten.
The obvious question to ask is whether anything significant changed in
either the seeding or the 'crawling' between 2021-25 and 2021-31.
Thanks for your patience if you've read this far.
ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND
e-mail: h...@inf.ed.ac.uk
URL: https://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]