Hi,
> So, could the HTTP Content-Language be filtered out? and extract the corresponding URL's?
Below is just one way to do this, using the WAT files:
- WATs are smaller than WARCs by 1/3
- page metadata and links are provided as JSON => easy to query
- every WAT file contains a random sample of pages. Unless you are looking
for a rare language, you'll quickly get a collection of URLs.
I've used grep to extract only the JSON lines and jq (https://stedolan.github.io/jq/)
to process the JSON:
% zgrep '^{' CC-MAIN-20170629154125-20170629174125-00719.warc.wat.gz | jq -f language.jq | ...
% cat language.jq
.Envelope
| [."WARC-Header-Metadata"."WARC-Target-URI",."Payload-Metadata"."HTTP-Response-Metadata"]
| {"url": .[0],
"http-content-language": .[1]."Headers"."Content-Language",
"html-http-equiv" : [
.[1]."HTML-Metadata"."Head"."Metas"[]?
| select(."http-equiv" != null)
| select(."http-equiv" | test("(?i)lang"))
| ."content"?],
"html-language" : [
.[1]."HTML-Metadata"."Head"."Metas"[]?
| select(."name" != null)
| select(."name" | test("(?i)lang"))
| ."content"?]
}
And here are two results for pages tagged as French:
{
"url": "http://290364.canalblog.com/tag/fl%C3%A8ches/p30-0.html",
"http-content-language": null,
"html-http-equiv": [
"fr"
],
"html-language": []
}
{
"url": "http://apu.univ-artois.fr/Revues-et-collections/Histoire/Le-Jardin-dans-les-anciens-Pays-Bas",
"http-content-language": "fr-FR",
"html-http-equiv": [
"fr-FR"
],
"html-language": []
}
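If you prefer Python over jq, the same extraction could look roughly like the sketch below. It is not tested against a full crawl and only mirrors the jq filter above: it reads the same JSON fields and keeps the "content" of every meta element whose "http-equiv" or "name" contains "lang". The function names (language_info, is_tagged, read_wat) are made up for this example, and is_tagged is an extra step that is not part of the jq script: it does a simple prefix match to keep only pages tagged with a given language.

```python
import gzip
import json
import re

LANG = re.compile(r"lang", re.IGNORECASE)  # same test as "(?i)lang" in language.jq


def meta_contents(metas, key):
    """Return the "content" of each meta element whose `key` attribute contains "lang"."""
    return [m.get("content") for m in metas
            if isinstance(m, dict)
            and isinstance(m.get(key), str)
            and LANG.search(m[key])]


def language_info(record):
    """Build the same object as language.jq from one parsed WAT JSON record."""
    envelope = record.get("Envelope") or {}
    header = envelope.get("WARC-Header-Metadata") or {}
    payload = envelope.get("Payload-Metadata") or {}
    response = payload.get("HTTP-Response-Metadata") or {}
    headers = response.get("Headers") or {}
    head = (response.get("HTML-Metadata") or {}).get("Head") or {}
    metas = head.get("Metas") or []
    return {
        "url": header.get("WARC-Target-URI"),
        "http-content-language": headers.get("Content-Language"),
        "html-http-equiv": meta_contents(metas, "http-equiv"),
        "html-language": meta_contents(metas, "name"),
    }


def is_tagged(info, prefix):
    """Hypothetical helper: True if any extracted value starts with `prefix`, e.g. "fr"."""
    values = ([info["http-content-language"]]
              + info["html-http-equiv"] + info["html-language"])
    return any(isinstance(v, str) and v.strip().lower().startswith(prefix)
               for v in values)


def read_wat(path):
    """Yield language_info() for every JSON line of a gzipped WAT file (like zgrep '^{')."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as wat:
        for line in wat:
            if line.startswith("{"):
                yield language_info(json.loads(line))
```

To list the URLs tagged as French in one file, something like this should do:

    for info in read_wat("CC-MAIN-20170629154125-20170629174125-00719.warc.wat.gz"):
        if is_tagged(info, "fr"):
            print(info["url"])
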
Of course, to do this over thousands of WAT files, it's better to use Hadoop, Spark, etc.
The following examples also process WAT files and could be adapted:
(Python, mrjob)
https://github.com/commoncrawl/cc-mrjob/blob/master/server_analysis.py
(Python, Spark)
https://github.com/commoncrawl/cc-pyspark/blob/master/server_count.py
(Java, MapReduce)
https://github.com/commoncrawl/cc-warc-examples/blob/master/src/org/commoncrawl/examples/mapreduce/WATServerType.java
Best,
Sebastian