A question about detected languages

59 views
Skip to first unread message

Tom Alby

unread,
Jul 22, 2021, 4:25:37 PM7/22/21
to Common Crawl
Hi,

I am looking for hosts/domains with documents in German and used the following query in Athena:

SELECT DISTINCT url_host_name, url_host_registered_domain, url_host_tld, content_languages
FROM "ccindex"."ccindex"
WHERE subset = 'warc'
        AND fetch_status = 200
        AND content_mime_detected = 'text/html'
        AND content_languages LIKE '%deu%'

I was expecting one line for each host but I got more than a dozen lines in this example.

# Source: spark<?> [?? x 4]
   url_host_name      url_host_registered_domain url_host_tld content_languages
   <chr>              <chr>                      <chr>        <chr>            
 1 agnes.hu-berlin.de hu-berlin.de               de           deu,spa          
 2 agnes.hu-berlin.de hu-berlin.de               de           deu,isl,eng      
 3 agnes.hu-berlin.de hu-berlin.de               de           deu,eng,rus      
 4 agnes.hu-berlin.de hu-berlin.de               de           deu,ces          
 5 agnes.hu-berlin.de hu-berlin.de               de           deu,lat          
 6 agnes.hu-berlin.de hu-berlin.de               de           deu,eng,por      
 7 agnes.hu-berlin.de hu-berlin.de               de           deu,eng          
 8 agnes.hu-berlin.de hu-berlin.de               de           deu,ita,eng      
 9 agnes.hu-berlin.de hu-berlin.de               de           fra,deu,eng      
10 agnes.hu-berlin.de hu-berlin.de               de           deu,spa,eng  

I understand that up to 3 languages are detected per document, and I am assuming that my SQL query resulted in host data points that share the same order of languages. Is that correct?

Also, does the order in the list imply a confidence with respect to the language detected? In some cases, I don't see that language on that host, and I am wondering whether it would be safer to use only those hosts that have 'deu' either alone or as the 1st value in the list.

Best

Tom

Sebastian Nagel

unread,
Jul 26, 2021, 5:47:04 AM7/26/21
to common...@googlegroups.com
Hi Tom,

great question!

If you want one output row per host, you'd need to add a "GROUP BY" clause
together with aggregation functions for all columns not in the "GROUP BY"
clause. For example:

SELECT COUNT(*) as page_count,
url_host_name,
url_host_registered_domain,
url_host_tld,
histogram(content_languages)
FROM "ccindex"."ccindex"
WHERE subset = 'warc'
AND content_languages LIKE '%deu%'
AND crawl = 'CC-MAIN-2021-25'
GROUP BY url_host_name, url_host_registered_domain, url_host_tld
ORDER BY page_count DESC;


> I understand that up to 3 languages are detected per document,
> and I am assuming that my SQL query resulted in host data points
> that share the same order of languages. Is that correct?

Yes.

> Also, does the order in the list imply a confidence with respect
> to the language detected?

The languages are ordered by text coverage. From the documentation of
CDL2 which is used for language detection:

For mixed-language input, CLD2 returns the top three languages found
and their approximate percentages of the total text bytes (e.g. 80%
English and 20% French out of 1000 bytes of text means about 800
bytes of English and 200 bytes of French).
[https://github.com/CLD2Owners/cld2#readme]

> In some cases, I don't see that language on that host, and I am
> wondering whether it would be safer to use only those hosts that
> have 'deu' either alone or as the 1st value in the list.

This might reduce the noise indeed. However, all text content on the
HTML page is passed to the language detector, including boilerplate
content (header, footer, navigation, etc.) which might sometimes
exceed the amount of the "main" content.

The WARC metadata record (just after the response record) includes
more details about the detected languages. One example:

{
"reliable": true,
"text-bytes": 2940,
"languages": [
{
"code": "de",
"code-iso-639-3": "deu",
"text-covered": 0.97,
"score": 1199,
"name": "GERMAN"
},
{
"code": "en",
"code-iso-639-3": "eng",
"text-covered": 0.02,
"score": 1039,
"name": "ENGLISH"
}
]
}


Finally, a few remarks:

- instead of
content_mime_detected = 'text/html'
better write
contains(ARRAY ['text/html', 'application/xhtml+xml'], content_mime_detected)
to catch all HTML documents. Or leave it away. For now, the language
detection isn't performed on document types other than HTML.

- the "warc" subset only includes successfully fetched pages,
fetch_status = 200
would have any effect if combined with
subset = 'warc'

- for development it's recommended to add a filter limiting
the query to a small part of the table. This saves money
and development time - results will show up within seconds.
For example:

WHERE ...
AND crawl = 'CC-MAIN-2021-25'
AND url_host_tld = 'at'

In doubt, restrict the query to one monthly crawl or a subset, eg.
crawl LIKE 'CC-MAIN-2021-%'
Querying all crawls since 2013 might time out.


If it's ok for you, I'd add your query to the list of example queries on
https://github.com/commoncrawl/cc-index-table#query-the-table-in-amazon-athena


Best,
Sebastian
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com
> <mailto:common-crawl...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/common-crawl/656b6811-9220-42c1-8b07-cd43eb5d4e04n%40googlegroups.com
> <https://groups.google.com/d/msgid/common-crawl/656b6811-9220-42c1-8b07-cd43eb5d4e04n%40googlegroups.com?utm_medium=email&utm_source=footer>.

Tom Alby

unread,
Jul 27, 2021, 5:01:19 PM7/27/21
to Common Crawl

Thank you so much for your answer and your code, Sebastian, highly appreciated! I tried the GROUP BY as well in one of my attempts but put it at the wrong place. And yes, absolutely ok to add my query!
Best
Tom
Reply all
Reply to author
Forward
0 new messages