pages with specific language


mucahid kutlu

Oct 20, 2015, 9:54:08 AM
to Common Crawl


Hi,

We need English and Arabic web pages for our research project. Is it possible to get only Arabic (or English) pages? Or do we have to download all the data and run a language detection tool on it?

Thanks,
Mucahid Kutlu

 

Ken Krugler

Oct 20, 2015, 10:08:47 AM
to common...@googlegroups.com
Hi Mucahid,

There's no way to know in advance that a page definitely uses a specific language.

You can set the Accept-Language field in the request header so that multilingual sites automatically return the content you want, e.g. "Accept-Language: ar;q=0.8, en;q=0.2". See http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
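
In Python with the requests library, that might look like this (a rough, untested sketch; the URL is just a placeholder):

    import requests

    # ask multilingual sites to prefer Arabic, with English as a fallback
    headers = {"Accept-Language": "ar;q=0.8, en;q=0.2"}
    response = requests.get("http://example.com/", headers=headers)
    print(response.headers.get("Content-Language"))  # many servers omit this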

And more importantly, you can optimize your crawl by helping it focus on sites (and sub-domains) that are more likely to have pages in your target language.

Typically you'd do this by keeping language stats on domains (and sub-domains), and using those stats to weight the expected value (the probability of getting a page in your target language) of unfetched pages. Then you don't bother fetching pages that have a really low probability of being ones you want.
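
A minimal sketch of that bookkeeping (the function names, prior, and threshold are all illustrative, not from any particular crawler):

    from collections import defaultdict
    from urllib.parse import urlparse

    # per-host counts: pages fetched in the target language vs. total
    stats = defaultdict(lambda: {"target": 0, "total": 0})

    def record_page(url, in_target_language):
        host = urlparse(url).netloc
        stats[host]["total"] += 1
        if in_target_language:
            stats[host]["target"] += 1

    def expected_value(url, prior=0.5):
        # probability that an unfetched page on this host is in the
        # target language; unseen hosts fall back to the prior
        s = stats[urlparse(url).netloc]
        return s["target"] / s["total"] if s["total"] else prior

    # then skip any frontier URL whose expected_value() falls below
    # whatever threshold you pick, e.g. 0.1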

A more complex approach is to do the same, but on a per-link basis. For each page, after you've fetched it you determine its language and assign an expected language (based on this) to its outbound links. For each unfetched link, you sum these weights and then proceed as above. The premise is that Arabic pages are more likely to link to other Arabic pages, and so on.
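
And a sketch of the per-link variant (again illustrative; the score could be a probability from your language detector rather than 0 or 1):

    from collections import defaultdict

    # accumulated language evidence for each unfetched URL
    link_weight = defaultdict(float)
    link_count = defaultdict(int)

    def record_outlinks(page_lang_score, outlinks):
        # page_lang_score: 1.0 if the fetched page was in the target
        # language, 0.0 otherwise (or a detector probability)
        for url in outlinks:
            link_weight[url] += page_lang_score
            link_count[url] += 1

    def link_expected_value(url, prior=0.5):
        # average score of the pages linking here; unseen URLs get the prior
        if link_count[url] == 0:
            return prior
        return link_weight[url] / link_count[url]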

HTH,

-- Ken





--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr





Tom Morris

Oct 20, 2015, 10:31:12 AM
to common...@googlegroups.com
On Tue, Oct 20, 2015 at 9:54 AM, mucahid kutlu <mucahi...@gmail.com> wrote:

We need English and Arabic web pages for our research project. Is it possible to get only Arabic (or English) pages? Or do we have to download all data and run a language detection tool on it?

Ken's suggestions will help if you're planning to do your own crawl, but if you want to use the CommonCrawl data (my assumption), you'll need to do your own language detection (or use someone else's), because the CC data isn't tagged by language.
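
A minimal sketch of such a detection pass over the WET (extracted-text) files, assuming Python with the warcio and langdetect packages (just one possible choice of tools; the filename is a placeholder):

    from warcio.archiveiterator import ArchiveIterator
    from langdetect import detect

    with open("example.warc.wet.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "conversion":  # WET text records
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", errors="replace")
            try:
                lang = detect(text)
            except Exception:  # detection fails on empty/very short text
                continue
            if lang in ("ar", "en"):
                print(lang, url)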

If you're open to using someone else's tagging, there is a bunch of Arabic and English data derived from the CommonCrawl data here: http://data.statmt.org/ngrams/raw/
You can read their paper to see if it was processed in a way which is useful for your purposes: https://kheafield.com/professional/stanford/crawl_paper.pdf

Tom

Julien Nioche

Oct 20, 2015, 10:34:19 AM
to common...@googlegroups.com
Hi Mucahid,

You could check whether the WAT files contain a language meta tag and then retrieve the corresponding WARC entries, but (a) that will work only for HTML pages, not other formats, and (b) it requires processing both WAT and WARC files across the whole dataset anyway.
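
If you did want to try that, it might look roughly like this (a sketch using the warcio package; the JSON path follows the usual WAT record layout, but treat it as an assumption and adjust for the records you actually have):

    import json
    from warcio.archiveiterator import ArchiveIterator

    with open("example.warc.wat.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "metadata":
                continue
            data = json.loads(record.content_stream().read())
            envelope = data.get("Envelope", {})
            metas = (envelope.get("Payload-Metadata", {})
                             .get("HTTP-Response-Metadata", {})
                             .get("HTML-Metadata", {})
                             .get("Head", {})
                             .get("Metas", []))
            for meta in metas:
                if meta.get("name", "").lower() in ("content-language", "language"):
                    url = envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI")
                    print(url, meta.get("content"))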

It's better to do language detection on the content of the WARC files. You can do that easily with Behemoth (https://github.com/DigitalPebble/behemoth); see for instance http://digitalpebble.blogspot.co.uk/2012/09/using-behemoth-on-commoncrawl-dataset.html

HTH

Julien



mucahid kutlu

Oct 30, 2015, 7:24:12 AM
to Common Crawl


Thank you all for your valuable answers! I appreciate it!