Replying to 2 things at once:
Around 45% of the crawl is identified as English, see
https://commoncrawl.github.io/cc-crawl-statistics/plots/languages
If you want all of the WARC files with some non-English webpages, that's
all of the WARC files. If you want all of the content in a particular
language, yes, the most efficient way to do this currently is to
download the individual WARC records for just that language.
That means many small transactions, and it's done quite a lot by our
users. We have a project on our list to provide language-specific
WARC files someday. I'm not sure when that will actually happen.
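
For illustration, here is a minimal sketch of what one of those small
transactions can look like: an HTTP range request for a single WARC record.
It assumes you already have the file name, byte offset, and record length
from the index, that the data is served from https://data.commoncrawl.org/,
and that each record is its own gzip member. The function names are
hypothetical, not part of any Common Crawl tooling.

```python
import gzip
import urllib.request

# Assumption: public HTTPS endpoint for Common Crawl data.
BASE_URL = "https://data.commoncrawl.org/"


def range_header(offset: int, length: int) -> str:
    """Return the Range header value for one record (end byte is inclusive)."""
    return f"bytes={offset}-{offset + length - 1}"


def build_request(warc_filename: str, offset: int, length: int) -> urllib.request.Request:
    """Build a GET request covering exactly one WARC record."""
    return urllib.request.Request(
        BASE_URL + warc_filename,
        headers={"Range": range_header(offset, length)},
    )


def fetch_record(warc_filename: str, offset: int, length: int) -> bytes:
    """Download one record and decompress it (assumes one gzip member per record)."""
    with urllib.request.urlopen(build_request(warc_filename, offset, length)) as resp:
        return gzip.decompress(resp.read())
```

You would call fetch_record once per index row for the language you care
about, which is why the number of requests adds up quickly.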
This "languages" field in the columnar index does not go back to the
early crawls. We hope to extend this CLD2-based language
identification to earlier crawls someday.
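
As a rough sketch of how the language field can be used to find those
records, the helper below builds a SQL query against the columnar index.
The table name (ccindex.ccindex), the column names (content_languages,
warc_filename, warc_record_offset, warc_record_length, crawl, subset), and
the use of three-letter language codes are assumptions based on the public
columnar index schema; the helper itself and the crawl label in the usage
note are only examples. An equality match selects pages identified as being
in that one language only.

```python
def language_query(lang: str, crawl: str) -> str:
    """Build a SQL query listing WARC record locations for one language.

    lang  -- three-letter lowercase language code (assumed format)
    crawl -- crawl label, for example "CC-MAIN-2023-50"
    """
    if not (len(lang) == 3 and lang.isascii() and lang.isalpha() and lang.islower()):
        raise ValueError(f"expected a three-letter lowercase code, got {lang!r}")
    if not crawl.replace("-", "").isalnum():
        raise ValueError(f"unexpected crawl label: {crawl!r}")
    return (
        "SELECT url, warc_filename, warc_record_offset, warc_record_length\n"
        "FROM ccindex.ccindex\n"
        f"WHERE crawl = '{crawl}'\n"
        "  AND subset = 'warc'\n"
        f"  AND content_languages = '{lang}'"
    )
```

For example, language_query("isl", "CC-MAIN-2023-50") produces a query for
Icelandic-only pages in that crawl. For early crawls without the field, such
a query would simply match nothing.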
-- greg