retrieve CDX Toolkit with content_languages

mili lali

Feb 16, 2021, 8:41:25 AM
to Common Crawl
Dear Sebastian,
Many thanks for your quick reply.
Sorry for any inconvenience.
I am using cdx_toolkit [1] to query the index. I set only content_languages, but it returns nothing.
I use it like this:

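import json
import requests

api_url = 'https://index.commoncrawl.org/CC-MAIN-2020-40-index'
# Only content_languages is set; no url parameter is passed.
r = requests.get(api_url,
                 params={
                     'content_languages': 'eng',
                     'limit': 10,
                     'output': 'json'
                 })
# Parse one JSON object per non-empty line of the response.
records = [json.loads(line) for line in r.text.split('\n') if line]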

Could you help me access the index for a specific content_languages value without downloading the index files? Is there a solution? As you said in [2], do I have to use, for example, AWS Athena?

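Best regards

[1] https://github.com/cocrawler/cdx_toolkit
[2] https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/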

Sebastian Nagel

Feb 16, 2021, 9:02:03 AM
to common...@googlegroups.com
Hi,

the CDX index always needs at least part of the URL. This could be a wildcard pattern using
only the TLD suffix (e.g. `*.us`). However, because filtering by content language is
applied only as a secondary step, such queries tend to run very long. The columnar index is definitely
more efficient for selecting records by language.
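
A minimal sketch of such a TLD-restricted CDX query, for illustration only. It assumes the index server accepts a `url` wildcard together with a regex `filter` parameter, and that the language field in the CDX records is named `languages`; neither is confirmed in this thread. `build_query` and `parse_records` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode

# Index endpoint taken from the question in this thread.
API_URL = "https://index.commoncrawl.org/CC-MAIN-2020-40-index"


def build_query(tld, language, limit=10):
    """Return a CDX query URL for all hosts under `tld`, filtered by language.

    Assumption: the server supports `filter=~field:regex` and names the
    language field `languages`.
    """
    params = {
        "url": "*." + tld,                             # URL part is mandatory
        "filter": "~languages:.*" + language + ".*",   # secondary filter
        "output": "json",
        "limit": limit,
    }
    return API_URL + "?" + urlencode(params)


def parse_records(text):
    """Parse a response body that holds one JSON object per line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

The resulting URL could be fetched with `requests.get(build_query("us", "eng")).text` and passed to `parse_records`. Because the language filter is only applied after the URL match, a query over a large TLD may still run for a long time.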

However, in case you are interested only in English content, which makes up about half
of the content in the Common Crawl archives, language annotations are also included:
- in WARC files, in a "metadata" record following the response record
- in WET files (since May 2020), in the header `WARC-Identified-Content-Language`
You could just use these annotations and skip over non-English records.
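
Skipping records by this header could look roughly like the sketch below. It uses a small hand-written reader for an uncompressed WARC/WET byte stream rather than a WARC library, and it assumes the header value is a comma-separated list of language codes. `iter_warc_records` and `has_language` are hypothetical helper names.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) pairs from an uncompressed WARC/WET stream."""
    while True:
        line = stream.readline()
        if not line:
            return                      # end of stream
        if not line.startswith(b"WARC/"):
            continue                    # blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break                   # empty line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        length = int(headers.get("Content-Length", 0))
        yield headers, stream.read(length)


def has_language(headers, wanted):
    """True if `wanted` is among the identified content languages.

    Assumption: the header holds comma-separated codes such as "eng,dan".
    """
    value = headers.get("WARC-Identified-Content-Language", "")
    return wanted in [code.strip() for code in value.split(",")]
```

For a downloaded WET file, something like `gzip.open(path, "rb")` could serve as the stream, keeping only records where `has_language(headers, "eng")` is true.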

Best,
Sebastian


On 2/16/21 2:41 PM, mili lali wrote:
> Dear Sebastian,
> Many thanks for your quick reply.
> Sorry for any inconvenience.
> I am using cdx_toolkit <https://github.com/cocrawler/cdx_toolkit> [1] to query the index. I set only content_languages, but it returns nothing.
> I use it like this:
>
> api_url = 'https://index.commoncrawl.org/CC-MAIN-2020-40-index'
> r = requests.get(api_url,
>                  params = {
>                      'content_languages' : 'eng',
>                      'limit': 10,
>                      'output': 'json'
>                  })
> records = [json.loads(line) for line in r.text.split('\n') if line]
>
> Could you help me access the index for a specific content_languages value without downloading the index files? Is there
> a solution? As you said in [2] <https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/>, do I have to use, for
> example, AWS Athena <https://aws.amazon.com/athena/>?

mili lali

Feb 16, 2021, 9:22:13 AM
to Common Crawl
Thanks for your precise reply.
Hmm, can you explain how this works for other languages, for example Danish or Farsi? I mentioned English only as an example.

Sebastian Nagel

Feb 16, 2021, 9:52:15 AM
to common...@googlegroups.com
Hi,

> for example Danish or Farsi or other?
> I said English only for an example.

The point is that right now about 45% of the crawled pages are English
while Danish or Farsi contribute about 0.5%, see
https://commoncrawl.github.io/cc-crawl-statistics/plots/languages

This imbalance requires a different retrieval strategy:
- the CDX index can be queried by first choosing a promising
country-code top-level domain (*.dk or *.ir) and then filtering by content language
for Danish or Farsi content, respectively
- this is efficient because there are few records for these TLDs and a large portion
of those records are in the desired content language
- of course, you will miss all Danish or Farsi content not hosted under
a .dk or .ir domain, respectively
- but filtering the .com TLD (45% of all records) for Danish or Farsi pages
would take very long
- the columnar index allows look-ups primarily by content language,
independent of the TLD. That's why you should use it for your use case.
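
As a rough sketch of a language-first look-up on the columnar index, the helper below only builds the SQL text for a service such as AWS Athena. The table name `"ccindex"."ccindex"` and the column names are assumptions based on the columnar index announcement [2] and should be checked against the actual table schema; `build_language_query` is a hypothetical helper name.

```python
import re


def build_language_query(crawl, language, limit=10):
    """Build SQL selecting WARC record locations for pages in one language.

    Assumptions: table "ccindex"."ccindex" with columns crawl, subset,
    content_languages, url, warc_filename, warc_record_offset and
    warc_record_length.
    """
    # Validate inputs because they are interpolated into the SQL text.
    if not re.fullmatch(r"[A-Za-z0-9-]+", crawl):
        raise ValueError("unexpected crawl name: %r" % crawl)
    if not re.fullmatch(r"[a-z]{3}", language):
        raise ValueError("expected a three-letter language code: %r" % language)
    return (
        "SELECT url, warc_filename, warc_record_offset, warc_record_length\n"
        'FROM "ccindex"."ccindex"\n'
        "WHERE crawl = '{crawl}'\n"
        "  AND subset = 'warc'\n"
        # Equality matches pages with exactly this one identified language.
        "  AND content_languages = '{language}'\n"
        "LIMIT {limit}"
    ).format(crawl=crawl, language=language, limit=int(limit))


print(build_language_query("CC-MAIN-2020-40", "dan"))
```

The equality test is deliberately strict; pages annotated with several languages would need a pattern match on the column instead.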

Best,
Sebastian

