Encoding bug in 2023-40 ?


Henry S. Thompson

Feb 8, 2024, 1:44:13 PM
to Common Crawl
In the index for the 2023-40 crawl, there are over 100,000 entries
with URIs containing very long strings of %-escaped unicode FFFD
(Replacement Character), some as long as 50 consecutive instances of
%EF%BF%BD.
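
(For reference, %EF%BF%BD is the percent-encoding of the three UTF-8
bytes of U+FFFD; a minimal Python check:)

```python
from urllib.parse import quote, unquote

# U+FFFD (the replacement character) encodes to three UTF-8 bytes,
# each of which is percent-escaped in a URI.
assert '\ufffd'.encode('utf-8') == b'\xef\xbf\xbd'
assert quote(b'\xef\xbf\xbd') == '%EF%BF%BD'
assert unquote('%EF%BF%BD') == '\ufffd'
```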

Such strings are very rare in e.g. the index for 2019-35, numbering
only around 1500.

Given the indices I have downloaded, I can narrow down the change to
somewhere between 2021-25 and 2021-31:

>: fgrep -c '%ef%bf%bd' CC-MAIN-2021-25/cdx/cluster.idx
128
>: fgrep -c '%ef%bf%bd' CC-MAIN-2021-31/cdx/cluster.idx
7074

[Stop reading now unless this is of relevance/interest to you, what
follows is just a report of my efforts to find out more about what's
happened.]

And, perhaps more interestingly:

>: fgrep -c '%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd' CC-MAIN-2021-25/cdx/cluster.idx
20
>: fgrep -c '%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd' CC-MAIN-2021-31/cdx/cluster.idx
3724

Of course only the most common domains exhibiting this make it into
the secondary index, for example on line 350018 of the index for 2023-40:

com,hket,invest)/article/3532920/%ef%bf%bd%ef%bf%bd%ef%bf%bd...[repeats for
6 more lines]?mtc=40001&srkw=%e7%be%8e%e5%9c%8b%e5%85%b1%e5%92%8c%e9%bb%a8 20230925231423

Following the relevant entry, the first line in the block at offset
737595399, length 199309 in cdx-00078.gz in turn points to the
response at offset 358674722, length 1468 in
segments/1695233510100.47/warc/CC-MAIN-20230925215547-20230926005547-00202.warc.gz,
where we find

WARC-Target-URI: https://invest.hket.com/article/3532920/%EF%BF%BD%EF%BF%BD...

So the index is consistent with the WARC file entry. And indeed,
somewhat surprisingly, a wget of that URI does produce the same result
we find in the WARC file. So the long string of FFFD code points is
irrelevant to the response: deleting some or all of the FFFDs, and
even the query string, doesn't affect the result.

There does seem to be something systematic changing, maybe just in the
distribution of domains wrt languages using non-ascii charsets:

>: fgrep '%ef%bf%bd' CC-MAIN-2021-25/cdx/cluster.idx | cut -f 1 -d \) | uniq | wc -l
99
>: fgrep '%ef%bf%bd' CC-MAIN-2021-31/cdx/cluster.idx | cut -f 1 -d \) | uniq | wc -l
5814
>: fgrep '%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd' CC-MAIN-2021-25/cdx/cluster.idx | cut -f 1 -d \) | uniq | wc -l
21
>: fgrep '%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd' CC-MAIN-2021-31/cdx/cluster.idx | cut -f 1 -d \) | uniq | wc -l
3223

So the increase in the overall FFFD count is basically down to an
increase in the number of distinct domains exhibiting the phenomenon.
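
(The cut/uniq pipeline above can be mirrored in Python; `distinct_domains`
is a hypothetical name, and it relies on cluster.idx being sorted by SURT
key so that uniq-style adjacent deduplication is enough:)

```python
def distinct_domains(lines, needle='%ef%bf%bd'):
    # Mirror: fgrep '%ef%bf%bd' cluster.idx | cut -f 1 -d \) | uniq | wc -l
    seen = None
    count = 0
    for line in lines:
        if needle not in line:
            continue
        key = line.split(')', 1)[0]   # SURT domain prefix, as cut -f 1 -d \)
        if key != seen:               # uniq collapses adjacent duplicates only,
            count += 1                # which suffices on a sorted index
            seen = key
    return count
```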

There is a change in the relative proportion of non-200 responses
involved:

>: uz CC-MAIN-2021-25/cdx/warc/cdx-0015[0-9].gz | fgrep '%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd' | egrep -o '"filename": "[^"]*' |cut -f 5 -d / | sus
2631 warc
1113 crawldiagnostics
>: uz CC-MAIN-2021-31/cdx/warc/cdx-0015[0-9].gz | fgrep '%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd' | egrep -o '"filename": "[^"]*' |cut -f 5 -d / | sus
157960 crawldiagnostics
157160 warc

('sus' is an alias for 'sort "$@" | uniq -c | sort -k1nr,1', uz is an
alias for 'igzip -dc "$@"')

The distribution of status codes for the crawldiagnostics cases has one big
difference:

>: uz CC-MAIN-2021-25/cdx/warc/cdx-0015[0-9].gz | fgrep '%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd' | egrep '"filename": "[^"]*diagnostics' |egrep -o '"status": "..."'|cut -f 4 -d \" | sus
609 301
264 403
182 302
46 404
4 500
4 502
3 503
1 400
>: uz CC-MAIN-2021-31/cdx/warc/cdx-0015[0-9].gz | fgrep '%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd%ef%bf%bd' | egrep '"filename": "[^"]*diagnostics' |egrep -o '"status": "..."'|cut -f 4 -d \" | sus
85056 404
35580 301
22931 302
5994 400
2791 403
2476 500
417 503
410 414
370 307
361 308
301 429

404 has gone from ~4% to ~54%.
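
(A quick sanity check of those percentages from the tabulated counts:)

```python
# Recompute the 404 share from the status tallies above.
counts_2021_25 = {301: 609, 403: 264, 302: 182, 404: 46,
                  500: 4, 502: 4, 503: 3, 400: 1}
counts_2021_31 = {404: 85056, 301: 35580, 302: 22931, 400: 5994,
                  403: 2791, 500: 2476, 503: 417, 414: 410,
                  307: 370, 308: 361, 429: 301}
for name, c in (('2021-25', counts_2021_25), ('2021-31', counts_2021_31)):
    print(f"{name}: 404 share = {c[404] / sum(c.values()):.0%}")
# 2021-25: 404 share = 4%
# 2021-31: 404 share = 54%
```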

That's as far as I've gotten.

The obvious question to ask is whether anything significant changed in
either the seeding or the 'crawling' between 2021-25 and 2021-31.

Thanks for your patience if you've read this far.

ht
--
Henry S. Thompson, School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND
e-mail: h...@inf.ed.ac.uk
URL: https://www.ltg.ed.ac.uk/~ht/
[mail from me _always_ has a .sig like this -- mail without it is forged spam]


Greg Lindahl

Feb 11, 2024, 1:11:54 AM
to common...@googlegroups.com
Henry,

Thank you for doing a deep dive on this issue. I went and looked at
our internal crawl notes and nothing changed significantly between
2021-25 and 2021-31.

Given the smallish number of urls involved -- 100k urls looks like a
lot until you realize we crawl 3.5 billion urls in a typical crawl --
I suspect that what's going on is some kind of CGI arg used for
tracking that we're mangling because we think it's not valid utf8, and
then the website keeps on appending it over and over.

If I was going to learn a lesson from this, I think it would be to
count how often the replacement character appears in urls or content,
so that we could react quickly if somehow we started getting abnormal
numbers of FFFD.
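
(A sketch of such a counter; `fffd_stats` is a hypothetical name and the
bucketing scheme is just one option:)

```python
import re
from collections import Counter
from urllib.parse import unquote

def fffd_stats(urls):
    """Bucket urls containing U+FFFD by their longest consecutive run,
    so a per-crawl tally could flag an abnormal jump."""
    buckets = Counter()
    for u in urls:
        runs = re.findall('\ufffd+', unquote(u))
        if runs:
            buckets[max(len(r) for r in runs)] += 1
    return buckets
```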

-- greg
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/common-crawl/f5bmssaeqyw.fsf%40ecclerig.inf.ed.ac.uk.

Henry S. Thompson

Feb 12, 2024, 12:56:19 PM
to common...@googlegroups.com
Greg Lindahl writes:

> Thank you for doing a deep dive on this issue. I went and looked at
> our internal crawl notes and nothing changed significantly between
> 2021-25 and 2021-31.

Thanks for following up.

> Given the smallish number of urls involved -- 100k urls looks like a
> lot until you realize we crawl 3.5 billion urls in a typical crawl --

Apologies for not being clearer: 100k was the incidence of the
phenomenon in a sample of only 10 index files from 2021-31. I've
now done the tabulation for the whole of the two datasets:

2021-25: 67,351
2021-31: 11,708,228

The latter is perhaps large enough to be worth understanding a bit
better, and the scale of the change a bit more surprising.

> I suspect that what's going on is some kind of CGI arg used for
> tracking that we're mangling because we think it's not valid utf8, and
> then the website keeps on appending it over and over.

That brings up a question I've meant to ask before: is the set of
seeds for each crawl available somewhere? I did some work several
years ago now to try to build the redirection chains for a single
crawl, working backwards from each 200 and 404 response to look for a
matching Location header in a 301/302 response, and then again if
necessary, until I got to a request which wasn't anywhere in the
crawldiagnostics data, but that was _very_ time consuming and in
various ways unsatisfactory. It would be much easier to start from
the seeds and track them going forward! Wrt the question at hand, by
looking at the seeds we could tell if the problem was actually there,
or somewhere in the crawl process...

In any case I am now looking in a bit more detail at the relationship
of this change to a change in the ranking of non-ascii-centric
languages.

Tom Morris

Feb 12, 2024, 1:14:05 PM
to common...@googlegroups.com
On Sun, Feb 11, 2024 at 1:11 AM Greg Lindahl <gr...@commoncrawl.org> wrote:
>
> I suspect that what's going on is some kind of CGI arg used for
> tracking that we're mangling because we think it's not valid utf8, and
> then the website keeps on appending it over and over.

My initial suspicion, which I haven't investigated, is similar,
namely that they are non-UTF encoded URLs (probably Big5
encoded in this case), being incorrectly decoded. I don't think
it necessarily has to be the query string. It could be a non-critical
part of the path, perhaps a redundant slug line in Henry's case.

Although UTF-8 encoding for URLs is the standard now, I think
there are still legacy servers in the wild using other encodings.

As Henry implies, it's probably the pipeline generating/collecting
the seeds which needs investigating.

Tom

Greg Lindahl

Feb 12, 2024, 1:46:10 PM
to common...@googlegroups.com
Looking at the seed for the crawl in progress, about 0.1% of the seed
urls have %ef%bf%bd. The one I looked at was a "soft 404", i.e. it
was a 200. It came from the previous crawl.

The seed computation is complicated, and there's basically no way to
go back and find out how an URL was initially discovered.

Tom's theory is a good one, it could easily have started as mojibake,
either on a webpage or in a sitelist.

Many of them have a lot of repetitions: in just one set of urls, 2/3
had 5 sets of those 3 bytes in a row, and 95% had at least 2 sets of
those 3 bytes in a row.
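
(Those run-length figures can be reproduced with a small helper;
`max_fffd_run` is a hypothetical name:)

```python
import re

def max_fffd_run(url):
    # Longest run of consecutive percent-encoded U+FFFD sequences;
    # each '%EF%BF%BD' is 9 characters, matched case-insensitively.
    runs = re.finditer(r'(?:%EF%BF%BD)+', url, re.IGNORECASE)
    return max((len(m.group(0)) // 9 for m in runs), default=0)
```
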

Tom Morris

Feb 12, 2024, 4:04:00 PM
to common...@googlegroups.com
OK, I've confirmed that this is at least one failure mode, using
the example website from Henry's original email. I had a look through
the Wayback Machine's archive of pages from that site [1], looking for
URLs which were rendered with percent encoding (indicating they
couldn't be decoded to UTF-8). Small sampling of broken URLs that I
saw all seemed to date from early 2016, but interestingly there seem
to be a mix of Big5 and UTF-8 URLs from the same time period, so it
would require more investigation upstream to see if they are coming
from different sources or what the difference was caused by. As an
example, the 4 captures of this one page [2] have 3 different URLs,
two with UTF-8 encoding and one with Big5 encoding. The Python snippet
below shows that the Big5 URL is equivalent to the UTF-8 one:

$ python
>>> from urllib.parse import unquote
>>> u='http://invest.hket.com:80/article/1122212/%A4%AC%C1p%BA%F4-%A8%FC%B4f%AA%D1%20%A9%FA%A6~%B5J%C2I%A4%A7%A4@'
>>> unquote(u)
'http://invest.hket.com:80/article/1122212/���p��-���f�� ���~�J�I���@'
>>> unquote(u,encoding='big5')
'http://invest.hket.com:80/article/1122212/互聯網-受惠股 明年焦點之一'

Because the slug is ignored on this web site, this URL will resolve
despite replacement characters (or anything else) [3] in the slug.

However, the big uptick in 404s indicates that there are plenty of
websites where that isn't the case. Pretty much any occurrence of the
replacement character in a URL is a red flag, so perhaps the seed list
should be scanned for these.

Tom

p.s. Henry - did you do a tally of the total number of 404s in the
crawl that you were looking at?

[1] https://web.archive.org/web/*/https://invest.hket.com/article/*
[2] https://web.archive.org/web/*/https://invest.hket.com/article/1122212/*
[3] https://invest.hket.com/article/1122212/anything

Tom Morris

Feb 12, 2024, 4:19:07 PM
to common...@googlegroups.com
I meant to mention that although the Wayback Machine thinks it has
two captures for the Big5-encoded URL according to the index listing
[1], it can't seem to find/render them [2].

Tom

[1] https://web.archive.org/web/*/https://invest.hket.com/article/1122212/*
[2] https://web.archive.org/web/20160202165953*/http://invest.hket.com:80/article/1122212/%A4%AC%C1p%BA%F4-%A8%FC%B4f%AA%D1%20%A9%FA%A6~%B5J%C2I%A4%A7%A4@

Henry S. Thompson

Feb 13, 2024, 11:22:24 AM
to common...@googlegroups.com
Tom Morris writes:

> ...
> p.s. Henry - did you do a tally of the total number of 404s in the
> crawl that you were looking at?

With the same filter I reported on previously, that is, at least 6
hex-encoded FFFDs in a row, the relevant numbers for the whole of
2021-31 are as follows:

Totals:

6,582,114 crawldiagnostics
5,126,113 warc
1 robotstxt
----------
11,708,228

Breakdown of status codes for the above 6.5m cd (top 20):

3665495 404
1626032 301
716720 302
265936 400
102461 403
62380 500
38583 307
29547 303
20060 414
16394 503
7407 410
6515 502
6030 401
4849 308
2154 406
1777 522
1575 520
1246 504
1184 430
943 429

I.e. more than 30% of these 6+ FFFD cases are 404s.

From 10% (365,870,427 in total) of the whole 2021-31 index, we get

317033824 200
16670631 301
12717097 404
9547747 302
2108786 403
2102629 430
2075950 304
846859 500
842736 503
355676 303
310749 406
263733 410
211306 307
128605 429
125830 502
112720 400
74850 522
74686 401
63894 308
48376 520

Note that the 200s will include most of the robots.txt entries. I
don't have a breakdown of cd/robots/warc numbers for this crawl, but
for 2019-35, where I do have a count, it was approximately
robots:cd:warc :: 1:6:30

On that basis we can estimate

robots 10,226,898
cd 48,836,603
warc 306,806,926

which is approx: 1:5:31

About 3.5% of the total and 26% of the cds are 404s in this 10%
sample, whereas in the FFFD-filtered subset of the whole dataset,
those numbers are 31% and 56%.
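
(Checking those two percentages against the figures above:)

```python
total_sample = 365_870_427   # 10% of the whole index
cd_estimate  = 48_836_603    # estimated crawldiagnostics share of the sample
n404         = 12_717_097    # 404s in the sample
print(f"{n404 / total_sample:.1%} of the total, {n404 / cd_estimate:.0%} of the cds")
# 3.5% of the total, 26% of the cds
```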

QED. That is, the change in the seeding, whatever it is, is creating
a subset of the data which shows a significant shift in the relative
occurrences of 404 and 301.

Greg Lindahl

Feb 13, 2024, 2:21:36 PM
to common...@googlegroups.com
I used to have a detector for these slug-ignoring sites based on
observing an integer or hex number with a fixed number of digits.
Haven't used it for Common Crawl yet. Almost all of these sites are
news websites that change headlines.

For this next crawl, I'm going to filter all urls with a path or query
that has 2 or more replacement characters in a row.
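
(A sketch of such a filter, assuming lower- or upper-case escapes may
occur; `should_drop` is a hypothetical name, not the crawler's actual
code:)

```python
from urllib.parse import urlsplit

def should_drop(url):
    # Drop urls whose path or query contains two (or more) consecutive
    # percent-encoded replacement characters.
    parts = urlsplit(url)
    haystack = (parts.path + '?' + parts.query).upper()
    return '%EF%BF%BD%EF%BF%BD' in haystack
```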

-- greg

Tom Morris

Feb 14, 2024, 1:22:54 PM
to common...@googlegroups.com
On Tue, Feb 13, 2024 at 2:21 PM Greg Lindahl <gr...@commoncrawl.org> wrote:

> For this next crawl, I'm going to filter all urls with a path or query
> that has 2 or more replacement characters in a row.

That'll get rid of the 404s, but it will also skip all the sites which
ignore their slugs, where the URLs would have returned results even
though nominally "broken."

Is there any way to trace back upstream to the original Big5 (or
whatever non-UTF8) encoded URLs?

On Wed, Feb 14, 2024 at 5:49 AM Henry S. Thompson <h...@inf.ed.ac.uk> wrote:

> I'm working on getting the distribution of
> multiple replacement characters versus detected natural language
> and/or CC-TLD.

That would be interesting.

> I'd still like to get more clarity on what actually
> happened between 2021-25 and 2021-31 in this respect.

I'd be curious too, but it could just be something as simple as the
crawl finally got to a stash of previously saved Big5 (or whatever)
encoded URLs in the seed list.

Depending on how prevalent the problem is and whether or not the URLs
with their original encodings are available, it should be possible for
the canonicalization algorithms to guess the correct encoding. I
wouldn't be surprised if the most common case here is that
historically non-UTF8 URLs were used by a website, but they've since
upgraded to UTF8 compliance (probably with the same URLs just encoded
differently).
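
(One way such guessing might look, using the Big5 example from earlier
in the thread; the fallback list is an assumption and
`reencode_url_path` is a hypothetical name:)

```python
from urllib.parse import quote, unquote_to_bytes

def reencode_url_path(path, fallbacks=('big5', 'gbk', 'euc-kr')):
    """If the percent-decoded bytes aren't valid UTF-8, try legacy
    encodings in order and re-encode the first successful decode
    as percent-escaped UTF-8."""
    raw = unquote_to_bytes(path)
    try:
        raw.decode('utf-8')
        return path  # already valid UTF-8, leave it alone
    except UnicodeDecodeError:
        for enc in fallbacks:
            try:
                return quote(raw.decode(enc).encode('utf-8'))
            except UnicodeDecodeError:
                continue
    return path  # no luck; pass through unchanged
```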

Tom

Henry S. Thompson

Feb 15, 2024, 5:57:22 AM
to common...@googlegroups.com
Tom Morris writes:

> On Wed, Feb 14, 2024 at 5:49 AM Henry S. Thompson <h...@inf.ed.ac.uk> wrote:
> ...
>> I'd still like to get more clarity on what actually
>> happened between 2021-25 and 2021-31 in this respect.
>
> I'd be curious too, but it could just be something as simple as the
> crawl finally got to a stash of previously saved Big5 (or whatever)
> encoded URLs in the seed list.
>
> Depending on how prevalent the problem is and whether or not the URLs
> with their original encodings are available, it should be possible for
> the canonicalization algorithms to guess the correct encoding. I
> wouldn't be surprised if the most common case here is that
> historically non-UTF8 URLs were used by a website, but they've since
> upgraded to UTF8 compliance (probably with the same URLs just encoded
> differently).

Ah, but perhaps I wasn't clear, and I didn't have as much evidence,
but the thing is that _since_ 2021-31 the frequency of multi-FFFD
strings has jumped up and _stayed_ up. Here's a quick tabulation of
the counts from the crawls for which I have cluster.idx (which is much
faster to search than the whole index):

2013-20 3
2014-35 0
2015-35 0
2016-30 0
2017-30 4
2018-30 654
2018-34 668
2019-18 204
2019-35 55
2020-34 16
2021-25 23
2021-31 3946
2021-49 4511
2022-21 3506
2022-33 3627
2022-40 2989
2022-49 3723
2023-40 3723
2023-50 3094

Henry S. Thompson

Feb 19, 2024, 3:51:15 PM
to common...@googlegroups.com
I've narrowed things down a bit more. Drilling down on a Russian
domain which exhibits the big jump in FFFD escapes between 2021-25 and
2021-31, namely dom2.clan.su, we see the following:

            Total index entries   6+ consecutive FFFD   6+ consecutive Cyrillic UTF-8
2021-25     1874                  0                     1591
2021-31     2012                  1863                  33

No entries in either crawl have both FFFDs and good Cyrillic UTF-8.

In the 2021-31 case, the blip of 33 correctly encoded cases occur in
one 4-hour period of one day of the crawl, namely 1300--1700 on
2021-07-27.

It's a bit messier, because Hangul code points aren't as simple to
grep for, but looking at a Korean domain where the big jump in FFFD
happens, research.unist.ac.kr:

            Total index entries   6+ consecutive FFFD   6+ consecutive Hangul UTF-8
2021-25     36656                 0                     22085
2021-31     36967                 30266                 6053
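
(Since Hangul is awkward to grep for in its percent-encoded form, one
option is to decode first and test code-point ranges; `has_run` is a
hypothetical name, and the ranges cover the basic Cyrillic block and
the Hangul-syllables block:)

```python
from urllib.parse import unquote

CYRILLIC = (0x0400, 0x04FF)   # basic Cyrillic block
HANGUL   = (0xAC00, 0xD7A3)   # Hangul syllables

def has_run(url, lo, hi, n=6):
    """True if the decoded url contains n+ consecutive code points
    in the inclusive range [lo, hi]."""
    count = 0
    for ch in unquote(url):
        count = count + 1 if lo <= ord(ch) <= hi else 0
        if count >= n:
            return True
    return False
```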

Greg Lindahl

Feb 21, 2024, 3:30:31 PM
to common...@googlegroups.com
Henry, it takes us about 4 hours to fetch a segment. Were the
correctly encoded urls all in the same segment?

Other than that possibility, it sure seems like it was on the other end.

By the way, CC-MAIN-2024-10 is fetching, and I am dropping all urls
with 2 replacement characters in a row. crawldiagnostics still has the
ones that 404/301/302, and warc has none.

-- greg

Tom Morris

Feb 21, 2024, 8:37:47 PM
to common...@googlegroups.com
I looked at this using the Athena index, and it appears to me that the
issue (or at least these two specific examples) is to do with the
index construction rather than with the crawl itself.

I first looked at the counts to make sure they were comparable using
the query [1]
# crawl url_host_name subset count
1 CC-MAIN-2021-25 dom2.clan.su warc 1861
2 CC-MAIN-2021-31 dom2.clan.su warc 1940
1 CC-MAIN-2021-25 research.unist.ac.kr warc 24421
2 CC-MAIN-2021-31 research.unist.ac.kr warc 23203

The counts are slightly lower because I ignored the crawldiagnostics
and robots.txt subsets and just looked at the warc subset.

I then looked [2] for any URLs containing %FF escapes and found none.

I'll go back and look at the earlier example where I saw the Big5
encoding and see what's going on there.

Tom

[1]
SELECT crawl, url_host_name, subset, COUNT(*) AS count
FROM "ccindex"."ccindex"
WHERE (crawl = 'CC-MAIN-2021-25' OR crawl = 'CC-MAIN-2021-31')
AND url_host_tld = 'kr' -- 'su'
AND url_host_name = 'research.unist.ac.kr' -- 'dom2.clan.su'
AND subset = 'warc'
GROUP BY crawl, subset, url_host_name
ORDER BY crawl, count

[2]
SELECT crawl, fetch_status, content_charset, url
FROM "ccindex"."ccindex"
WHERE (crawl = 'CC-MAIN-2021-25' OR crawl = 'CC-MAIN-2021-31')
AND url_host_tld = 'kr' -- 'su'
AND url_host_name = 'research.unist.ac.kr' -- 'dom2.clan.su'
AND url LIKE '%\%FF%' ESCAPE '\'
AND subset = 'warc'
ORDER BY crawl, url

Tom Morris

Feb 21, 2024, 8:41:33 PM
to common...@googlegroups.com
Oops! I was using the raw code point, not its percent-encoded UTF-8
form %EF%BF%BD. With the correct string there are lots of matches.

Tom

Tom Morris

Feb 21, 2024, 10:40:39 PM
to common...@googlegroups.com
Revised analysis [1]: no replacement characters in the 2021-25 crawl
for those two sites, and the 2021-31 crawl errors all seem to be
search URLs with replacement characters in the search string.

# crawl url_host_name subset count
CC-MAIN-2021-31 dom2.clan.su warc 1864
CC-MAIN-2021-31 research.unist.ac.kr warc 17850

1864 URLs which look like https://dom2.clan.su/search/%EF%BF%BD...
with the replacement character in the URL path [1]

17850 URLs for the Korean site with a correct UTF-8 encoded path, but
query strings containing b_s=%EF%BF%BD...

Without insight into the upstream pipeline providing the seeds, I'm
not sure it's possible to conclude too much more.

Looking at the smaller .su site, the problem persists to this day,
and while the absolute number of occurrences has peaked, that's only
because the site is being crawled less.

# crawl url_host_name subset count
# zero replacement characters in 2021 crawls before 2021-31 or any
2020 crawl for this host
CC-MAIN-2021-31 dom2.clan.su warc 1864 (of 1940 total URLs)
CC-MAIN-2021-39 dom2.clan.su warc 3422 (3483)
CC-MAIN-2021-43 dom2.clan.su warc 6849 (6977)
CC-MAIN-2021-49 dom2.clan.su warc 7581 (7873)
CC-MAIN-2022-05 dom2.clan.su warc 9482
CC-MAIN-2022-21 dom2.clan.su warc 10138
CC-MAIN-2022-27 dom2.clan.su warc 12432
CC-MAIN-2022-33 dom2.clan.su warc 8769
CC-MAIN-2022-40 dom2.clan.su warc 13354
CC-MAIN-2022-49 dom2.clan.su warc 16263
CC-MAIN-2023-06 dom2.clan.su warc 16103 (16406)
CC-MAIN-2023-14 dom2.clan.su warc 12558 (98% of 12757)
CC-MAIN-2023-23 dom2.clan.su warc 7025 (7077)
CC-MAIN-2023-40 dom2.clan.su warc 2992 (3004)
CC-MAIN-2023-50 dom2.clan.su warc 7571 (of 7776 total URLs or 97%)

[1]
SELECT crawl, url_host_name, subset, COUNT(*) AS count
FROM "ccindex"."ccindex"
WHERE crawl LIKE 'CC-MAIN-2021-%'
AND url_host_tld = 'su' -- 'kr'
AND url_host_name = 'dom2.clan.su' -- 'research.unist.ac.kr'
AND url LIKE '%\%EF%BF%BD%' ESCAPE '\'
AND subset = 'warc'
GROUP BY crawl, subset, fetch_status, url_host_name
ORDER BY crawl, count DESC

Greg Lindahl

Feb 21, 2024, 11:09:36 PM
to common...@googlegroups.com
I wonder if the bad characters came in via dom2.clan.su's sitemaps?

It's not unusual for search pages on a website to return 200 if
nothing is found.

irene

Feb 22, 2024, 3:00:54 AM
to common...@googlegroups.com

Henry S. Thompson

Feb 22, 2024, 6:00:43 AM
to common...@googlegroups.com
Greg Lindahl writes:

> Henry, it takes us about 4 hours to fetch a segment. Were the
> correctly encoded urls all in the same segment?

Yes. All in /CC-MAIN-2021-31/segments/1627046153392.43/.

Here's where all the FFFD ones are:

180 1627046153971.20
179 1627046154126.73
178 1627046150067.51
178 1627046155268.80
177 1627046152168.38
177 1627046153816.3
177 1627046154459.22
175 1627046151972.40
175 1627046152085.13
174 1627046155458.35
94 1627046153392.43

All the 33 good ones were crawled in a four-hour window:

2021072713
2021072714
2021072715
2021072716

which is the 4-hour window in which the whole of .43 was crawled,
consistent with your "4 hours to crawl a segment" above.

Henry S. Thompson

Feb 22, 2024, 6:23:46 AM
to common...@googlegroups.com
Greg Lindahl writes:

> I wonder if the bad characters came in via dom2.clan.su's sitemaps?
>
> It's not unusual for search pages on a website to return 200 if
> nothing is found.

Well, it is certainly true that if you actually try these URIs today,
the Cyrillic ones actually find real pages, but the FFFD ones, even
ones with some ascii bits, give a 200 for a page that says, e.g.

Результаты 0-0 из 0 по запросу ������ 2 ������������ ���������� love dom 2

That is, "results 0-0 of 0 for the query ...".

Greg Lindahl

Mar 16, 2024, 1:43:50 AM
to common...@googlegroups.com
Henry, Tom,

I tried to ban all URL paths and queries that had the replacement
string twice in a row in the new 2024-10 crawl. It's a big hammer and
doesn't seem to have hurt anything.

The host dom2.clan.su, which Henry mentioned in his analysis, went
from 7776 captures in 2023-50 to 109 in 2024-10. About 1/3 of the 109
were non-search webpages and 2/3 were search webpages (which sends
"200 no results found" and thus we'll continue to crawl them in the
future.)

But looking in 2 shards of the CDX index, just 1/150th of it, I still
see 150 urls out of ~20 million with '%EF%BF%BD%EF%BF%BD'. That might
be good enough.

Henry S. Thompson

Mar 16, 2024, 6:16:59 AM
to common...@googlegroups.com
Greg Lindahl writes:

> I tried to ban all URL paths and queries that had the replacement
> string twice in a row in the new 2024-10 crawl. It's a big hammer and
> doesn't seem to have hurt anything.

Thanks for the update.

> The host dom2.clan.su, which Henry mentioned in his analysis, went
> from 7776 captures in 2023-50 to 109 in 2024-10.

That counts as a win, I think. Shame we can't do the same to Putin's
votes today :-).

> About 1/3 of the 109 were non-search webpages and 2/3 were search
> webpages (which sends "200 no results found" and thus we'll continue
> to crawl them in the future.)

Sounds good to me.

> But looking in 2 shards of the CDX index, just 1/150th of it, I see
> still 150 urls out of ~ 20 million with '%EF%BF%BD%EF%BF%BD'. Might be
> good enough.

I agree that's pretty good. I'll do some further analysis when I can
to see if I see _any_ plausible uses of FFFD. In particular, I guess
I'd like to graph the impact of varying 2...10 in a row as the
trigger.
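
(That graph could start from a tally like this over a stream of index
urls; `threshold_counts` is a hypothetical name:)

```python
import re

FFFD = re.compile(r'(?:%EF%BF%BD)+', re.IGNORECASE)

def threshold_counts(urls, lo=2, hi=10):
    # counts[n] = how many urls would be caught with "n in a row" as trigger
    counts = {n: 0 for n in range(lo, hi + 1)}
    for u in urls:
        longest = max((len(m.group(0)) // 9 for m in FFFD.finditer(u)),
                      default=0)
        for n in range(lo, min(longest, hi) + 1):
            counts[n] += 1
    return counts
```
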