Hi,
> for example Danish or Farsi or other?
> I said English only for an example.
The point is that right now about 45% of the crawled pages are English
while Danish or Farsi contribute about 0.5%, see
https://commoncrawl.github.io/cc-crawl-statistics/plots/languages
This imbalance requires a different retrieval strategy:
- the CDX index can be queried primarily by a promising country-code
  top-level domain (*.dk or *.ir), with filtering by content language
  (Danish or Farsi, respectively) as a second step
  - this is efficient because there are few records for these TLDs and a large
    portion of them are in the desired content language
  - of course, you will miss all Danish or Farsi content not hosted under
    a .dk or .ir domain
  - but filtering the .com TLD (45% of all records) for Danish or Farsi pages
    would take very long
- the columnar index allows look-ups primarily by content language,
  independent of the TLD. That's why you should use it for your use case.
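
To make the two strategies concrete, here is a minimal sketch in Python. It only builds the query strings and sends nothing over the network. Treat the details as assumptions to verify: the crawl ID is just an example, the `~languages:` "contains" filter syntax is what I would expect the CDX server to accept, the table name "ccindex"."ccindex" is whatever you chose when setting up Athena, and the language codes are assumed to be ISO-639-3 (`dan`, `fas`).

```python
from urllib.parse import urlencode

# Assumed CDX endpoint for a single crawl; the crawl ID is only an example.
CDX_API = "https://index.commoncrawl.org/CC-MAIN-2020-40-index"


def cdx_query_url(tld: str, lang: str) -> str:
    """Strategy 1: select by ccTLD first, then filter by content language."""
    params = {
        "url": f"*.{tld}",               # wild-card pattern on the TLD suffix
        "filter": f"~languages:{lang}",  # assumed "contains" filter syntax
        "output": "json",
    }
    return f"{CDX_API}?{urlencode(params)}"


def athena_query(lang: str, crawl: str = "CC-MAIN-2020-40") -> str:
    """Strategy 2: select by content language in the columnar index."""
    # Accept only three-letter codes so nothing unexpected ends up in the SQL.
    if not (len(lang) == 3 and lang.isalpha()):
        raise ValueError(f"expected a three-letter language code, got {lang!r}")
    return (
        "SELECT url, warc_filename, warc_record_offset, warc_record_length\n"
        'FROM "ccindex"."ccindex"\n'      # assumed database and table name
        f"WHERE crawl = '{crawl}'\n"
        "  AND subset = 'warc'\n"
        # Equality matches pages annotated with this language only;
        # a LIKE pattern would also catch multilingual pages.
        f"  AND content_languages = '{lang}'"
    )


if __name__ == "__main__":
    print(cdx_query_url("dk", "dan"))
    print(athena_query("fas"))
```

The first function corresponds to the TLD-first look-up and the second to the language-first look-up; only the second finds Danish or Farsi pages hosted under .com or other TLDs.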
Best,
Sebastian
On 2/16/21 3:22 PM, mili lali wrote:
> Thanks for your precise reply.
> mmm, Can you explain in other languages? for example Danish or Farsi or other? I said English only for an example.
>
> On Tuesday, February 16, 2021 at 5:32:03 PM UTC+3:30 Sebastian Nagel wrote:
>
> Hi,
>
> the CDX index always needs a part of the URL. This could be a wild-card pattern using
> only the TLD suffix (e.g. `*.us`). However, because filtering by content language is
> done as a second step, such queries tend to run very long. The columnar index is definitely
> more efficient to select records by language.
>
> However, and just in case you are interested only in English content which makes about half
> of the content in the Common Crawl archives:
> - language annotations are also included in WARC files in a "metadata" record following
> the response record
> - in WET files (since May 2020) in the header `WARC-Identified-Content-Language`
> You could just use these annotations and skip over non-English records.
>
> Best,
> Sebastian
>
>
> On 2/16/21 2:41 PM, mili lali wrote:
> > Dear Sebastian,
> > Many Thanks for your quick reply.
> > Sorry about any inconvenience.
> > I am using cdx_toolkit <https://github.com/cocrawler/cdx_toolkit> [1] to retrieve the index
> > file. I set only content_languages but it returns
> > nothing.
> > I use it like this:
> >
> > api_url = 'https://index.commoncrawl.org/CC-MAIN-2020-40-index'
> > example, AWS Athena <https://aws.amazon.com/athena/>?
> >
> > best regards
> >
> >
> >
> >
> >
> > [1] https://github.com/cocrawler/cdx_toolkit