Language detection with NLTK

Toddy Mladenov

Dec 2, 2011, 12:56:53 PM
to nltk-...@googlegroups.com
Hi All,

We just started using NLTK to mine Twitter data and would like to filter the tweets by language. We used the n-gram methods described here http://misja.posterous.com/language-detection-with-python-nltk and here http://code.activestate.com/recipes/326576/ to determine the message language, but we still end up with quite a few messages that are not in the desired language.

We would prefer to use a dictionary method instead of the n-gram one, and I was wondering whether somebody has a good example of how to do that.
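
For concreteness, the kind of dictionary check we have in mind is roughly the following (just a sketch built on NLTK's stopwords corpus; the tokenization and the "most stopword hits wins" rule are our own assumptions, not tested code):

from nltk import wordpunct_tokenize
from nltk.corpus import stopwords

def guess_language(text):
    # Score each candidate language by how many of the message's tokens
    # appear in that language's stopword list; the highest score wins.
    tokens = set(word.lower() for word in wordpunct_tokenize(text))
    scores = dict((lang, len(tokens & set(stopwords.words(lang))))
                  for lang in stopwords.fileids())
    return max(scores, key=scores.get)

Is something like that a reasonable starting point, or is there a better-established dictionary-based recipe?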

Appreciate your help.

Thanks!
--
ToddySM
Blog http://blog.toddysm.com
Twitter @toddysm
LinkedIn http://www.linkedin.com/in/toddysm

Correa Denzil

Dec 2, 2011, 1:17:04 PM
to nltk-...@googlegroups.com
I believe N-gram models are the state of the art in language detection. Dictionary-based approaches (my hypothesis) may not work well due to the colloquial language used on Twitter. Twitter is more difficult for language ID as it contains pithy text (assuming you are identifying the language per tweet). Moreover, in some cases a tweet has far fewer than 140 characters, which makes it difficult to detect the language. A workaround could be using the "locale" information given by the Twitter API to get a sense of which language the user is using. It may be a fairly good indicator.
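
For example, something along these lines (a sketch; I am assuming each status has been parsed from the JSON that the REST API returns, with the author's self-declared interface language under user.lang):

import json

def user_locale(raw_status):
    # Fall back on the author's self-declared UI language ('en', 'de', ...)
    # instead of guessing from the tweet text itself.
    status = json.loads(raw_status)
    return status.get('user', {}).get('lang')

It will mislabel people who tweet in a language other than the one their account is set to, but as a coarse filter it is cheap.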

If you need a language model corpus, check out the Leipzig corpus. TextCat and LexTex also use n-gram-based approaches and support a wide variety of languages. Hit the links to check out the online demos.





--Regards,
Denzil

Lars Buitinck

Dec 2, 2011, 1:33:59 PM
to nltk-...@googlegroups.com
2011/12/2 Correa Denzil <mce...@gmail.com>:

> I believe N-gram models are the state of the art in language detection.

Actually, for Twitter probabilistic approaches are the state of the
art. My colleagues at U. Amsterdam developed an algorithm for this
last year, see http://staff.science.uva.nl/~tsagias/?p=241

> Dictionary-based approaches (my hypothesis) may not work well due to the
> colloquial language used on Twitter. Twitter is more difficult for language
> ID as it contains pithy text (assuming you are identifying the language per
> tweet). Moreover, in some cases a tweet has far fewer than 140 characters,
> which makes it difficult to detect the language. A workaround could be using
> the "locale" information given by the Twitter API to get a sense of which
> language the user is using. It may be a fairly good indicator.

Dictionary methods are really the poor man's approach. They were
promoted in the 90s solely on the grounds that they were faster than
n-gram approaches; the speed difference shouldn't matter on today's
hardware.

If you really want to do a dictionary approach, then you should
probably collect a corpus of annotated tweets because, as Denzil
notes, the actual text is sparse and often (deliberately) transformed
to avoid common words like "the", "he", "is", etc. If you have that
annotated collection, then you might as well use the Cavnar & Trenkle
n-gram-based algorithm or something smarter :)
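
The core of Cavnar & Trenkle is just a rank-order ("out-of-place")
comparison of character n-gram frequency profiles, roughly like the
sketch below (the profile size and the penalty for unseen n-grams are
the usual defaults, but treat the details as assumptions):

from collections import Counter

def profile(text, n=3, size=300):
    # Most frequent character n-grams, most frequent first.
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(size)]

def out_of_place(doc_profile, lang_profile):
    # Sum of rank differences; n-grams missing from the language
    # profile get a fixed maximum penalty.
    ranks = dict((g, r) for r, g in enumerate(lang_profile))
    penalty = len(lang_profile)
    return sum(abs(r - ranks[g]) if g in ranks else penalty
               for r, g in enumerate(doc_profile))

def detect(text, lang_profiles):
    # lang_profiles: {'en': profile(english_training_text), ...}
    doc = profile(text)
    return min(lang_profiles,
               key=lambda lang: out_of_place(doc, lang_profiles[lang]))

Train the per-language profiles on the annotated tweets rather than
on newswire text and the short-message problem becomes a lot less
painful.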

Alexis Dimitriadis

Dec 3, 2011, 9:06:09 AM
to nltk-...@googlegroups.com
On 02/12/2011 18:56, Toddy Mladenov wrote:
Hi All,

We just started using NLTK to mine Twitter data and would like to filter the tweets by language. We used the n-gram methods described here http://misja.posterous.com/language-detection-with-python-nltk and here http://code.activestate.com/recipes/326576/ to determine the message language, but we still end up with quite a few messages that are not in the desired language.

On 02/12/2011 19:17, Correa Denzil wrote:
I believe N-gram models are the state of the art in language detection.

I also tried the code from posterous and got poor results. I think the problem is not the use of n-grams, but the very naive similarity metric (vector distance?) that the code calculates... plus some errors in the code. I'd suggest looking into the literature for a more appropriate n-gram-based statistic. (I can't help you there, since the noisy results were sufficient for my purposes).

The trigram files indicate word boundaries with angle brackets: "one" -> "<on", "one", "ne>". But the posterous code doesn't support this, so it loses a lot of information.
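
Generating query trigrams that match that convention is easy enough, e.g. (a sketch; adjust the padding to whatever convention your trigram files actually use):

from nltk import wordpunct_tokenize

def boundary_trigrams(text):
    # Character trigrams with '<' and '>' marking word boundaries,
    # so "one" yields "<on", "one", "ne>".
    for word in wordpunct_tokenize(text.lower()):
        padded = '<' + word + '>'
        for i in range(len(padded) - 2):
            yield padded[i:i + 3]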

Best,

Alexis

Michael McCandless

Dec 3, 2011, 9:17:58 AM
to nltk-...@googlegroups.com
You could also try Google's compact language detector (CLD)?

I factored it out from Chromium's source code, to make it accessible via Python.

It's incredibly fast and seems to get good results on Europarl. Of
course, those documents are not nearly as short as tweets...

I described it at
http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
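
From Python it is basically a one-liner; from memory it looks roughly
like this (treat the exact shape of the return value as an assumption
and check the blog post / README for the real signature):

import cld

# detect() expects UTF-8 bytes and returns the top language plus a
# reliability flag and per-language details.
name, code, is_reliable, num_bytes, details = cld.detect(
    u"Ceci n'est pas un tweet anglais".encode('utf-8'))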

Also, CLD was used to generate these fun Twitter language maps:

http://www.flickr.com/photos/walkingsf/6277163176/in/photostream

So it at least sort of seems to work on tweets!

Mike McCandless

http://blog.mikemccandless.com

Toddy Mladenov

Dec 5, 2011, 12:13:03 PM
to nltk-...@googlegroups.com
Thank you, everyone, for the good suggestions. I've looked at some of the options over the weekend, and they certainly look much more promising than the blog posts.

@Denzil: The TextCat and LexTex demos seem to be broken. I tried to send the error to the email addresses posted, but I received a bounce-back - do you know somebody who can fix those?

@Mike: The language map is pretty cool (although the big image is not very readable). I'm wondering how much the language detection was influenced by the location information (as Denzil suggests above).

Toddy

Correa Denzil

Dec 5, 2011, 12:19:35 PM
to nltk-...@googlegroups.com
My bad! If you are OK with using Java, have a look at the LingPipe Language Identification tutorial. It uses the Leipzig corpus, as I mentioned earlier.

Did you try the probabilistic approach suggested by Lars?

--Regards,
Denzil

Olivier Grisel

Dec 5, 2011, 12:27:37 PM
to nltk-...@googlegroups.com
2011/12/5 Correa Denzil <mce...@gmail.com>:

> My bad! If you are OK with using Java, have a look at the LingPipe
> Language Identification tutorial. It uses the Leipzig corpus, as I
> mentioned earlier.
>
> Did you try the probabilistic approach suggested by Lars?

I wrote a similar tutorial using scikit-learn in python / numpy / scipy:

http://scikit-learn.github.com/scikit-learn-tutorial/working_with_text_data.html
http://scikit-learn.github.com/scikit-learn-tutorial/exercises.html#exercise-1-language-identification
https://github.com/scikit-learn/scikit-learn-tutorial/blob/master/solutions/exercise_01_language_train_model.py

It uses char n-grams as features on a single Wikipedia page. Use a
larger training set to get better results.
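
The gist of the exercise solution is a short pipeline, roughly like
this sketch (toy training data; the class names follow the current
scikit-learn text feature-extraction API, so adapt them to whatever
version you have installed):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Perceptron
from sklearn.pipeline import Pipeline

# Toy training set: a few sentences per language; in practice use the
# Wikipedia paragraphs from the tutorial or your own corpus.
docs = [u"the cat sat on the mat", u"the dog barked at the cat",
        u"le chat est sur le tapis", u"le chien aboie sur le chat",
        u"el gato esta sobre la alfombra", u"el perro ladra al gato"]
labels = ['en', 'en', 'fr', 'fr', 'es', 'es']

clf = Pipeline([
    ('vect', TfidfVectorizer(analyzer='char_wb', ngram_range=(1, 3),
                             use_idf=False)),
    ('clf', Perceptron()),
])
clf.fit(docs, labels)
print(clf.predict([u"le chat dort sur le tapis"]))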

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

Toddy Mladenov

Dec 5, 2011, 12:51:24 PM
to nltk-...@googlegroups.com
Denzil,

Java is perfect :) I've used it for years, while I am pretty new to Python. I will definitely look at the tutorial.

I haven't tried Lars' approach yet - I will go over all the options in detail this week and post back my findings.

Thanks!
Toddy

Michael McCandless

Dec 5, 2011, 12:54:58 PM
to nltk-...@googlegroups.com
There's also this Java library:

http://code.google.com/p/language-detection

It's one of the ones I had tested in my blog post (on Europarl docs)...

But just a week ago they added 17 language profiles for short messages,
trained on a Twitter corpus. The author describes them here:

http://shuyo.wordpress.com/2011/11/28/language-detection-supported-17-language-profiles-for-short-messages/

Mike McCandless

http://blog.mikemccandless.com

Toddy Mladenov

Dec 9, 2011, 4:57:11 PM
to nltk-...@googlegroups.com
Mike,

When I try to build the Python bindings for your CLD library, I get the following:

running build
running build_ext
building 'cld' extension
error: Unable to find vcvarsall.bat

Do you have any instructions on how to build them on Windows?

Thanks!
Toddy

Michael McCandless

Dec 9, 2011, 5:34:16 PM
to nltk-...@googlegroups.com
You should just edit the first line of build.win.cmd to point to the
full path of vcvarsall.bat for your Visual Studio installation.

Mike McCandless

http://blog.mikemccandless.com

Toddy Mladenov

Dec 9, 2011, 8:30:32 PM
to nltk-...@googlegroups.com
I may be doing something wrong, but I am getting the same result after changing the path in build.win.cmd.

Here is the path to the vcvarsall.bat that I am using:
C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\vcvarsall.bat

Could this path be causing the error?

I rebuilt the C++ library after changing the path.

I am attaching the output if you want to take a look.

Thanks!
Toddy
[Attachment: cmd-output.txt]

Michael McCandless

Dec 10, 2011, 10:40:44 AM
to nltk-...@googlegroups.com
Oh, I see: you succeeded in compiling the C code, and running the
(C-only) example test.

So, it's "python setup.py build" that's failing to find vcvarsall.bat...

Hmm, one problem is that you have VS 2010 installed (right?), but I
believe Python 2.6 requires VS 2008 to compile extensions (apparently
the runtime libraries change incompatibly with each VS release...).
Alternatively, you can use the MinGW compiler. Sorry this is such a
hassle... I wish there were an easier way to compile Python extensions
on Windows.

Mike McCandless

http://blog.mikemccandless.com
