Language information for earlier crawls.


tor...@gmail.com

Dec 8, 2023, 2:38:17 AM
to Common Crawl
Hi

I'm currently using the "languages" field in the metadata to download only non-English WARC records. It speeds up my downloads significantly, since the majority of the crawled pages are in English.
However, I just found that crawls from before 2018 don't have a "languages" field. Is it stored somewhere else, or do I just have to download all the WARC files and do the filtering with a language model?

Thanks

tor...@gmail.com

Dec 8, 2023, 2:44:38 AM
to Common Crawl
By "languages" field in the metadata, I actually meant the "languages" field in the index.

Sorry for the confusion.

C.L. Liu

Dec 8, 2023, 3:37:32 AM
to Common Crawl
I was wondering how you speed up downloading with the "languages" field in the index. In my experience, filtering the index by language means sending more requests for the wanted segments than downloading the whole WET/WARC files would take. If I'm wrong, could you show me how you do this?

Thanks
On Friday, December 8, 2023 at 3:38:17 PM UTC+8, tor...@gmail.com wrote:

Greg Lindahl

Dec 8, 2023, 3:30:56 PM
to common...@googlegroups.com
Replying to 2 things at once:

Around 45% of the crawl is identified as English, see
https://commoncrawl.github.io/cc-crawl-statistics/plots/languages

If you want all of the warcs with some non-English webpages, that's
all of the warcs. If you want all of the content in a particular
language, yes, the most efficient way to do this currently is to
download all of the individual warc records for just that language.
This is many small transactions, and it's done quite a lot by our
users. We have a project on our list to provide language-specific
warcs someday. I'm not sure when that will actually happen.
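
For readers following along, a minimal sketch of what one of those small transactions might look like, assuming the record's location comes from the index columns `warc_filename`, `warc_record_offset` and `warc_record_length`, and that the data is fetched over HTTPS from `data.commoncrawl.org`. The helper names `record_request` and `fetch_record` are made up for illustration, not part of any Common Crawl tool.

```python
import gzip
from urllib.request import Request, urlopen

# Assumption: the public HTTPS endpoint for Common Crawl data.
BASE_URL = "https://data.commoncrawl.org/"


def record_request(warc_filename, offset, length):
    """Build a GET request for a single WARC record (hypothetical helper).

    offset and length are the record's byte offset and compressed length
    as listed in the index.
    """
    last_byte = offset + length - 1  # HTTP byte ranges are inclusive
    headers = {"Range": f"bytes={offset}-{last_byte}"}
    return Request(BASE_URL + warc_filename, headers=headers)


def fetch_record(warc_filename, offset, length):
    """Download one record and return its decompressed bytes.

    Each record in a Common Crawl WARC file is its own gzip member,
    so the byte slice can be decompressed without the rest of the file.
    """
    with urlopen(record_request(warc_filename, offset, length)) as resp:
        return gzip.decompress(resp.read())
```

Calling `fetch_record` once per index row that matches the wanted language is the "many small transactions" pattern; it makes no attempt at retries or rate limiting.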

This "languages" field in the columnar index does not go back to the
early crawls. We hope to extend this CLD2-based language
identification to earlier crawls someday.

-- greg
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/common-crawl/4bf33436-fc1a-4043-9a33-6849d2be4f41n%40googlegroups.com.

tor...@gmail.com

Dec 11, 2023, 10:54:21 PM
to Common Crawl
I put hundreds of ranges in a single HTTP request, so the responses aren't that small. I even considered thousands of ranges per request, but gave up on the idea because of the limit on HTTP header length.
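
A rough sketch of how the ranges for one WARC file could be packed into as few `Range` headers as possible while staying under a header-size cap. The function name `batch_range_headers` and the 8000-byte default are assumptions for illustration; the real limit depends on the server and any proxies in between.

```python
def batch_range_headers(records, max_header_bytes=8000):
    """Yield Range header values, each at most max_header_bytes long.

    records is an iterable of (offset, length) pairs for records that
    all live in the same WARC file. The size cap is an assumed value.
    """
    prefix = "bytes="
    current = []        # range specs collected for the header being built
    size = len(prefix)  # length of the header value so far
    for offset, length in records:
        spec = f"{offset}-{offset + length - 1}"  # inclusive byte range
        extra = len(spec) + (1 if current else 0)  # +1 for the comma
        if current and size + extra > max_header_bytes:
            # Adding this spec would exceed the cap: emit and start over.
            yield prefix + ",".join(current)
            current, size = [], len(prefix)
            extra = len(spec)
        current.append(spec)
        size += extra
    if current:
        yield prefix + ",".join(current)
```

Each yielded value goes into the `Range` header of one request. A server that honors a multi-range request answers with a `multipart/byteranges` body, which then has to be split back into the individual records before decompressing them.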

Greg Lindahl

Dec 12, 2023, 2:19:07 AM
to common...@googlegroups.com
Torshie,

100s of range requests in a single HTTP request is 100s of small
requests as far as AWS performance is concerned. That's what I have to
worry about, and have no way to limit.

Putting them in a single HTTP request will fool CloudFront's rate
limit into thinking it's just one request, for rate-limit purposes.

We can see a bunch of users in our logs making these kinds of requests
-- if you want to download a language that has a few hundred pages in
each WARC, that's an effective way to do it. Of course, you have to
worry about http header lengths.

-- greg