Extracting a Specific Language

Ralf F

Apr 19, 2018, 2:55:15 PM
to Common Crawl
All in all, what I'm interested in is just extracting a list of URLs of sites in a specific language, to use as seeds for my own web crawling...


How do I go about it?

Thanx!

Sebastian Nagel

Apr 20, 2018, 8:13:04 AM
to common...@googlegroups.com
Hi Ralf,

Unfortunately, the language of a crawled page is not detected. You would need to run over the
entire content (or a larger sample of it) and check each WARC/WAT/WET record yourself.

The only approximation is to rely on top-level domains. URLs and also the WARC records can
be obtained using the URL index (http://index.commoncrawl.org/).
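
As a rough illustration of the top-level-domain approach, here is a minimal Python sketch that builds a URL index query for one TLD and reads the URLs out of the response lines. The crawl ID in the usage note, the "*.tld" wildcard form, and the "output" and "limit" parameters are assumptions about the index server's query interface rather than something stated in this thread, and both function names are made up for the example.

```python
import json
from urllib.parse import urlencode

# Base address of the URL index mentioned above.
INDEX_SERVER = "http://index.commoncrawl.org"


def build_tld_query(crawl_id: str, tld: str, limit: int = 100) -> str:
    """Build an index query for captures under one top-level domain.

    Assumption: each crawl has an endpoint named "<crawl_id>-index" that
    accepts "url", "output" and "limit" query parameters.
    """
    params = {
        "url": f"*.{tld}",  # wildcard for every host below the TLD
        "output": "json",   # ask for one JSON object per line
        "limit": limit,     # keep the sample small
    }
    return f"{INDEX_SERVER}/{crawl_id}-index?{urlencode(params)}"


def parse_index_lines(lines):
    """Yield the "url" field from each non-empty JSON line of a response."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)["url"]
```

For example, build_tld_query("CC-MAIN-2018-13", "fr", 10) gives a query URL whose response body can be passed line by line to parse_index_lines. Keep in mind that a TLD is only an approximation of language.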

Best,
Sebastian
> --
> You received this message because you are subscribed to the Google Groups "Common Crawl" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to
> common-crawl...@googlegroups.com <mailto:common-crawl...@googlegroups.com>.
> To post to this group, send email to common...@googlegroups.com
> <mailto:common...@googlegroups.com>.
> Visit this group at https://groups.google.com/group/common-crawl.
> For more options, visit https://groups.google.com/d/optout.

BlackIce

Apr 20, 2018, 8:37:19 AM
to common...@googlegroups.com
So at least I could get the country-specific TLDs?
That would be a start.

But the WARC/WAT/WET files do not contain language information?


Hi-Di-Ho Neighbor ;)

Sebastian Nagel

Apr 20, 2018, 9:21:16 AM
to common...@googlegroups.com
> But the WARC/WAT/WET files do not contain language information?

No. At least, nothing more than what was sent by the responding server
(e.g., HTTP Content-Language, HTML metadata).

BlackIce

Apr 20, 2018, 9:48:18 AM
to common...@googlegroups.com
So, could I filter on the HTTP Content-Language and extract the corresponding URLs?

Sebastian Nagel

Apr 20, 2018, 11:10:28 AM
to common...@googlegroups.com
Hi,

> So, could the HTTP Content-Language be filtered out? and extract the corresponding URL's?

Below is just one way to do this, using the WAT files:
- WATs are smaller than WARCs by 1/3
- page metadata and links are provided as JSON => easy to query
- every WAT file contains a random sample of pages. Unless you are after a small
  language, you'll quickly get a collection of URLs.

I've used zgrep to extract only the JSON lines and jq (https://stedolan.github.io/jq/)
to process the JSON:

% zgrep '^{' CC-MAIN-20170629154125-20170629174125-00719.warc.wat.gz | jq -f language.jq | ...


% cat language.jq
# For each WAT record: the URL plus any language hints sent by the server.
.Envelope
| [."WARC-Header-Metadata"."WARC-Target-URI", ."Payload-Metadata"."HTTP-Response-Metadata"]
| {"url": .[0],
   "http-content-language": .[1]."Headers"."Content-Language",
   # "content" of <meta> elements whose "http-equiv" attribute mentions "lang"
   "html-http-equiv": [
     .[1]."HTML-Metadata"."Head"."Metas"[]?
     | select(."http-equiv" != null)
     | select(."http-equiv" | test("(?i)lang"))
     | ."content"?],
   # "content" of <meta> elements whose "name" attribute mentions "lang"
   "html-language": [
     .[1]."HTML-Metadata"."Head"."Metas"[]?
     | select(."name" != null)
     | select(."name" | test("(?i)lang"))
     | ."content"?]
  }


And two results of pages tagged as French:


{
  "url": "http://290364.canalblog.com/tag/fl%C3%A8ches/p30-0.html",
  "http-content-language": null,
  "html-http-equiv": [
    "fr"
  ],
  "html-language": []
}

{
  "url": "http://apu.univ-artois.fr/Revues-et-collections/Histoire/Le-Jardin-dans-les-anciens-Pays-Bas",
  "http-content-language": "fr-FR",
  "html-http-equiv": [
    "fr-FR"
  ],
  "html-language": []
}
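
For readers who would rather stay in Python than use jq, the following is a minimal sketch of the same extraction. It assumes the WAT JSON layout implied by the language.jq filter above (Envelope, WARC-Header-Metadata, Payload-Metadata and so on) and that nested values are objects rather than null. The function names language_hints and iter_wat_hints are made up for this example.

```python
import gzip
import json
import re

# Same check as test("(?i)lang") in the jq filter: contains "lang", any case.
LANG_RE = re.compile(r"lang", re.IGNORECASE)


def language_hints(record: dict) -> dict:
    """Collect the URL and server-declared language hints from one WAT record."""
    envelope = record.get("Envelope", {})
    url = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
    response = envelope.get("Payload-Metadata", {}).get("HTTP-Response-Metadata", {})
    metas = response.get("HTML-Metadata", {}).get("Head", {}).get("Metas", [])

    def contents(attribute):
        # "content" of every <meta> whose given attribute mentions "lang"
        return [
            meta.get("content")
            for meta in metas
            if isinstance(meta.get(attribute), str) and LANG_RE.search(meta[attribute])
        ]

    return {
        "url": url,
        "http-content-language": response.get("Headers", {}).get("Content-Language"),
        "html-http-equiv": contents("http-equiv"),
        "html-language": contents("name"),
    }


def iter_wat_hints(path):
    """Yield hints for every JSON line of a gzipped WAT file (like zgrep '^{')."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as wat:
        for line in wat:
            if line.startswith("{"):
                yield language_hints(json.loads(line))
```

Calling iter_wat_hints on a downloaded WAT file yields dictionaries with the same four keys as the jq output shown above. Records without an HTTP response simply come back with empty values.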



Of course, to do this over thousands of WAT files, it's better to use Hadoop, Spark, etc.

(Python, mrjob)
https://github.com/commoncrawl/cc-mrjob/blob/master/server_analysis.py
(Python, Spark)
https://github.com/commoncrawl/cc-pyspark/blob/master/server_count.py
(Java, MapReduce)
https://github.com/commoncrawl/cc-warc-examples/blob/master/src/org/commoncrawl/examples/mapreduce/WATServerType.java


Best,
Sebastian

BlackIce

Apr 21, 2018, 6:27:00 AM
to common...@googlegroups.com
Thank you very much for your insight!

Have a great weekend!

BlackIce

Apr 21, 2018, 6:33:03 AM
to common...@googlegroups.com
I take it that I replace "test" with the language code?

test("(?i)lang
to
fr("(?i)lang

Or is it the question/exclamation mark?

test("(fr)lang

Or am I completely wrong?

Sebastian Nagel

Apr 21, 2018, 10:22:23 AM
to common...@googlegroups.com
The script checks whether each <meta> element has an attribute "http-equiv" (or "name",
respectively) whose value contains "lang" (case-insensitive). The function test(...) takes
a regular expression as its argument, and "(?i)" makes the match case-insensitive.

To select a specific language, you need to check the value of the attribute "content".
But this could also be done in a second step.
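
Such a second step might look like the following Python sketch. It takes one extracted result (a dictionary with the keys "http-content-language", "html-http-equiv" and "html-language", as in the jq output earlier in the thread) and checks whether any value starts with the wanted language code. The helper names and the rule of comparing only the leading letters of a value such as "fr-FR" are assumptions made for this illustration.

```python
import re


def mentions_lang(attribute_value: str) -> bool:
    """True if the value contains "lang"; "(?i)" makes it case-insensitive."""
    return re.search("(?i)lang", attribute_value) is not None


def primary_subtag(value):
    """Return the leading letters of a language value, lowercased ("fr-FR" -> "fr")."""
    if not isinstance(value, str):
        return None
    match = re.match(r"\s*([A-Za-z]+)", value)
    return match.group(1).lower() if match else None


def is_tagged(hints: dict, language: str) -> bool:
    """True if any language hint of one extracted result names the given language."""
    values = [hints.get("http-content-language")]
    values += hints.get("html-http-equiv", [])
    values += hints.get("html-language", [])
    return any(primary_subtag(value) == language.lower() for value in values)
```

With this, is_tagged(result, "fr") keeps both French examples shown earlier, while a page declaring only "en-US" is dropped. Values spelled out as words, such as "french", are not matched by this simple rule.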
