Only Spanish


Alejandro Moleiro

Apr 12, 2014, 12:07:28 PM
to common...@googlegroups.com
Hello,

I wonder if someone can point me to a method for identifying the language of the sites crawled.
The approach I have planned is to compare the words of the crawled site with a stop-word list (Spanish example: http://snowball.tartarus.org/algorithms/spanish/stop.txt)
and then infer the language from the matches. I am doing it in Java.
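A minimal sketch of that stop-word idea in plain Java; the tiny word lists below are illustrative stand-ins, not the full Snowball lists, which you would load from disk in practice:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class StopWordDetector {
    // Illustrative samples only; in practice, load the full Snowball
    // lists (e.g. spanish/stop.txt) for each candidate language.
    private static final Map<String, Set<String>> STOP_WORDS = new LinkedHashMap<>();
    static {
        STOP_WORDS.put("es", new HashSet<>(Arrays.asList(
            "de", "la", "que", "el", "en", "los", "una", "por", "con")));
        STOP_WORDS.put("en", new HashSet<>(Arrays.asList(
            "the", "of", "and", "to", "in", "that", "is", "for", "with")));
    }

    /** Returns the language whose stop-word list matches the most tokens. */
    public static String detect(String text) {
        String[] tokens = text.toLowerCase().split("\\W+");
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : STOP_WORDS.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) hits++;
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

Counting raw hits works for longer pages; for short texts you would want to normalize by token count so languages with larger stop-word lists don't dominate.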

As you can see I am an absolute newbie, but I love your project and would like to get involved as soon as possible. Crawling is fun and I used to do it in my spare time. Now I have decided to go deeper.


Thank you from Barcelona, Spain.


Alex Moleiro



Mat Kelcey

Apr 12, 2014, 12:21:06 PM
to common...@googlegroups.com
I used apache tika with some success, but it was years ago so might not be the best java has to offer any more...

http://tika.apache.org/1.5/detection.html#Language_Detection
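For reference, the Tika 1.x call is only a couple of lines. This sketch assumes tika-core is on the classpath and uses the `LanguageIdentifier` class as documented for 1.5:

```java
import org.apache.tika.language.LanguageIdentifier;

public class TikaLangDemo {
    /** Detect the language of a text, or "unknown" if Tika is unsure. */
    public static String detect(String text) {
        LanguageIdentifier identifier = new LanguageIdentifier(text);
        // getLanguage() returns an ISO 639 code such as "es";
        // isReasonablyCertain() guards against short or ambiguous input.
        return identifier.isReasonablyCertain()
                ? identifier.getLanguage()
                : "unknown";
    }
}
```

Note that `isReasonablyCertain()` tends to be conservative on short snippets, so feed it a decent chunk of extracted page text.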

--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl+unsubscribe@googlegroups.com.
To post to this group, send email to common-crawl@googlegroups.com.
Visit this group at http://groups.google.com/group/common-crawl.
For more options, visit https://groups.google.com/d/optout.


Alejandro Moleiro

Apr 12, 2014, 12:46:45 PM
to common...@googlegroups.com
Hello again:

Thank you Mat, I'll test it as soon as I can.

A friend of mine has also recommended this https://code.google.com/p/chromium-compact-language-detector/ which is supposedly the one that Chrome uses. Can someone verify this?

While googling I found this article, which is quite interesting for continuing my investigation.
It is a comparison test of three language-detection methods. Quoting its quick conclusion:
"The language-detection library gets the best accuracy, at 99.22%, followed by CLD, at 98.82%, followed by Tika at 97.12%."

 Thank you all again,

Alex





Mat Kelcey

Apr 12, 2014, 6:35:49 PM
to common...@googlegroups.com
given how close those numbers are i'd pick whatever is easiest to implement and try it. i don't know what your overall task is but i'm sure you'll find bigger fish to fry soon enough! ( you can always evaluate later using a different detector once you have one working )

keep in mind too that these numbers are, i'm guessing, for the task of detecting the correct language out of many, and you're talking about, arguably, the simpler one-vs-all problem of "is it spanish?" i.e. these numbers aren't going to be totally representative of your specific problem.
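To make that point concrete, the one-vs-all case can be scored directly on "Spanish or not", which may give quite different numbers than overall multi-class accuracy. A small hypothetical sketch:

```java
import java.util.List;

public class OneVsAllEval {
    /**
     * Fraction of documents where the prediction and the gold label
     * agree on the binary question "is it the target language?".
     * A wrong guess among non-target languages (e.g. "fr" vs "en")
     * still counts as correct here.
     */
    public static double oneVsAllAccuracy(List<String> predicted,
                                          List<String> actual,
                                          String target) {
        int correct = 0;
        for (int i = 0; i < predicted.size(); i++) {
            boolean predIsTarget = predicted.get(i).equals(target);
            boolean goldIsTarget = actual.get(i).equals(target);
            if (predIsTarget == goldIsTarget) correct++;
        }
        return (double) correct / predicted.size();
    }
}
```

With predictions ["es", "en", "fr", "es"] against gold labels ["es", "en", "en", "pt"], multi-class accuracy is 2/4, but one-vs-all accuracy for "es" is 3/4: the "fr" vs "en" confusion doesn't matter when you only care about Spanish.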

Greg Lindahl

Apr 12, 2014, 6:48:21 PM
to common...@googlegroups.com
On Sat, Apr 12, 2014 at 06:46:45PM +0200, Alejandro Moleiro wrote:

> A friend of mine has also recommended me this
> https://code.google.com/p/chromium-compact-language-detector/ that it is
> supposed to be the one who Chrome use.

That looks good to me.

If you ask a linguist they'll tell you that this is a solved problem!
But the algorithms they recommend don't work on the web, where many
documents have multiple languages in them. I had not seen this Google
library before, but it looks similar to (and is probably much better
debugged than) what we use internally at blekko.

-- greg


David Parks

Apr 13, 2014, 12:43:32 PM
to common...@googlegroups.com
Have you considered using the HTTP header Content-Language?  I can't say I have any experience as to whether it's used consistently enough to be a good benchmark for you, but I noted that it's a valid HTTP header, and it would sure cut down on your processing.


Though I imagine Tika's or similar algorithms will provide better overall results. Perhaps it would work to only use the language detection algorithm in cases where the Content-Language header isn't present.
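That fallback strategy could look something like this; the `detectLanguage` hook is a placeholder for whatever detector you end up choosing:

```java
public class LanguageResolver {
    /**
     * Prefer the Content-Language header when present; otherwise fall
     * back to content-based detection. Header values like "es-ES, en"
     * are reduced to the primary subtag of the first entry ("es").
     */
    public static String resolve(String contentLanguageHeader, String body) {
        if (contentLanguageHeader != null && !contentLanguageHeader.trim().isEmpty()) {
            String first = contentLanguageHeader.split(",")[0].trim();
            return first.split("-")[0].toLowerCase();
        }
        return detectLanguage(body);
    }

    // Placeholder: plug in Tika, CLD, or any other detector here.
    static String detectLanguage(String body) {
        return "unknown";
    }
}
```

One caveat (echoed later in the thread): servers often omit or misreport this header, so the detector fallback will be the common path in practice.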

Tom Morris

Apr 13, 2014, 1:04:22 PM
to common...@googlegroups.com
On Sat, Apr 12, 2014 at 12:46 PM, Alejandro Moleiro <amol...@caucana.com> wrote:

> Googling I've found this article that is quite interesting to follow my investigation.
> It is just a comparison test between three methods of language detection. Just copying the quick conclusion of it:
> "The language-detection library gets the best accuracy, at 99.22%, followed by CLD, at 98.82%, followed by Tika at 97.12%."

The "language-detection" library mentioned is this one:

There are more interesting things to be gleaned from the McCandless blog post than just the raw numbers.  Some of the things that caught my eye when I read it last year include:

- The corpus was built by the authors of the language-detection project, which may give that library an advantage in the measurements.
- The corpus is plain text, not web documents.
- Performance ranges over two orders of magnitude, with CLD the fastest and Tika 250X slower. This may have implications for using them at scale on the crawl.
- The various detectors screw up in diverse ways. This may provide the opportunity to build a voting uber-detector from the three (if accuracy is more important than performance).
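That voting idea can be sketched in a few lines: take a majority vote over three detectors (represented here as plain functions), and fall back to the single most trusted detector when all three disagree:

```java
import java.util.function.Function;

public class VotingDetector {
    /**
     * Majority vote over three detectors. If at least two agree, take
     * that answer; if all three disagree, fall back to d1 (presumed
     * the most accurate individual detector).
     */
    public static String detect(String text,
                                Function<String, String> d1,
                                Function<String, String> d2,
                                Function<String, String> d3) {
        String a = d1.apply(text);
        String b = d2.apply(text);
        String c = d3.apply(text);
        if (b.equals(c)) return b;  // b and c outvote a (or all agree)
        return a;                   // a matches b or c, or no majority
    }
}
```

The functional interface makes it easy to plug in any mix of detectors, e.g. `VotingDetector.detect(text, cld::detect, langDetect::detect, tika::detect)`, at the cost of running all three on every document.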

For your project, it sounds like the overall accuracy numbers matter less than the accuracy on Spanish, and there Tika struggles, achieving less than 90% accuracy.  The combination of that and its lower performance probably makes it a poor candidate for your project.

The other thing to note about the McCandless study is that it was done with the original Chromium Language Detector, which has since been revised.  The new CLD2 project is here: 
and McCandless wrote about CLD2 here:

Some nice things about CLD2 include:
 - it can handle raw HTML/XHTML
 - you can give it hints from the HTTP header to bias the predictor
 - it was trained on a web corpus, specifically filtered to remove low information words like "click", "link" and character sequences like ".jpg"
 - it's fast and small

Folks who are interested in the distribution of languages on the web should read:
which points to this Google Doc:
It's got lots of interesting data, including an analysis of changes over time.

Anyone know of any similar analysis for the Common Crawl corpus?

Tom



Ken Krugler

Apr 13, 2014, 4:32:17 PM
to common...@googlegroups.com
The fundamental problem with using anything reported by web servers is that they all lie.

Well, almost all of them lie about something, some of the time. And some of them lie all of the time, about almost everything.

Plus there aren't that many servers which actually report the content language.

So unfortunately you're going to have to do the detection yourself.

As part of Common Crawl's processing, it would be handy to have the pages tagged with language, charset, etc., so this processing only has to be done once for the common use case.

-- Ken
--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr





Julien Nioche

Apr 14, 2014, 4:58:51 AM
to common...@googlegroups.com
Hi,

One quick way of doing this would be to use Behemoth (https://github.com/DigitalPebble/behemoth), which can ingest data from the CC corpus (https://github.com/DigitalPebble/behemoth-commoncrawl), combined with the Tika wrapper to extract the text and then the language-id module, which uses the com.cybozu.labs.langdetect API. The language detection in Tika is slow and inaccurate, and the API above is a better choice. You can then filter on the language metadata to keep only the documents in Spanish.


HTH

Julien 


