> for example Danish or Farsi or other?
> I said English only for an example.
The point is that right now about 45% of the crawled pages are English
while Danish or Farsi contribute about 0.5%, see
This imbalance requires a different retrieval strategy:
- the CDX index can be queried by first choosing a promising country-code
  top-level domain (*.dk or *.ir) and then filtering by content language
  for Danish or Farsi content, respectively
  - this is efficient because there are few records for these TLDs and a large portion
    of the records are in the desired content language
  - of course, you will miss all Danish or Farsi content not hosted in
    a .dk or .ir domain, respectively
  - but filtering the .com TLD (45% of all records) for Danish or Farsi pages
    would take very long
- the columnar index allows look-ups primarily by content language,
  independent of the TLD. That's why you should use it for your use case.
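
To make the two strategies concrete, here is a minimal sketch that only builds the queries and sends nothing. It rests on several assumptions, so please check them against your setup: the CDX server accepts a regex filter of the form `~field:regex` on a `languages` field, the columnar index is registered in Athena as `"ccindex"."ccindex"`, and languages are given as ISO-639-3 codes (`dan` for Danish, `fas` for Farsi). The helper names `cdx_query_url` and `columnar_query` are made up for illustration.

```python
from urllib.parse import urlencode

# Endpoint as quoted from the cdx_toolkit snippet in this thread.
CDX_ENDPOINT = "https://index.commoncrawl.org/CC-MAIN-2020-40-index"


def _check_language(language: str) -> str:
    """Accept only three-letter codes so they are safe to embed in a query."""
    if len(language) != 3 or not language.isalpha():
        raise ValueError(f"expected a three-letter language code, got {language!r}")
    return language.lower()


def cdx_query_url(tld: str, language: str) -> str:
    """Strategy 1: select by TLD first, filter by content language second."""
    lang = _check_language(language)
    params = {
        "url": f"*.{tld}",                   # wildcard over one country-code TLD
        "output": "json",
        "filter": f"~languages:.*{lang}.*",  # assumed regex filter syntax
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def columnar_query(language: str, crawl: str = "CC-MAIN-2020-40", limit: int = 100) -> str:
    """Strategy 2: select by content language first, independent of the TLD."""
    lang = _check_language(language)
    if "'" in crawl:
        raise ValueError("crawl identifier must not contain quotes")
    return (
        "SELECT url, warc_filename, warc_record_offset, warc_record_length\n"
        'FROM "ccindex"."ccindex"\n'         # assumed database and table name
        f"WHERE crawl = '{crawl}'\n"
        "  AND subset = 'warc'\n"
        f"  AND content_languages LIKE '%{lang}%'\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    print(cdx_query_url("dk", "dan"))
    print(columnar_query("fas"))
```

The `LIKE '%...%'` condition is used because `content_languages` may list more than one language per page; use `=` instead if you only want pages identified as monolingual.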
On 2/16/21 3:22 PM, mili lali wrote:
> Thanks for your precise reply.
> mmm, Can you explain in other languages? for example Danish or Farsi or other? I said English only for an example.
> On Tuesday, February 16, 2021 at 5:32:03 PM UTC+3:30 Sebastian Nagel wrote:
> the CDX index always needs a part of the URL. This could be a wild-card pattern using
> only the TLD suffix (e.g. `*.us`). However, because filtering by content language is
> applied only as a secondary step, such queries tend to run very long. The columnar index is definitely
> more efficient for selecting records by language.
> However, just in case you are interested only in English content, which makes up about half
> of the content in the Common Crawl archives, language annotations are also included:
> - in WARC files, in a "metadata" record following
> the response record
> - in WET files (since May 2020), in the header `WARC-Identified-Content-Language`
> You could just use these annotations and skip over non-English records.
> On 2/16/21 2:41 PM, mili lali wrote:
> > Dear Sebastian,
> > Many Thanks for your quick reply.
> > Sorry about any inconvenience.
> > I am using cdx_toolkit <https://github.com/cocrawler/cdx_toolkit> to retrieve the index
> > file. I set only content_languages but it returns
> > nothing.
> > I use it like this:
> > api_url = 'https://index.commoncrawl.org/CC-MAIN-2020-40-index
> > example, AWS Athena <https://aws.amazon.com/athena/>
> > best regards