Actually, for Twitter, probabilistic approaches are the state of the
art. My colleagues at U. Amsterdam developed an algorithm for this
last year; see http://staff.science.uva.nl/~tsagias/?p=241
> Dictionary-based approaches (my hypothesis) may not work well due to
> the colloquial language used on Twitter. Twitter is harder for language
> id because the text is pithy (assuming you are identifying the language
> per tweet). Moreover, in some cases a tweet will have far fewer than 140
> characters, which makes it difficult to detect the language. A workaround
> could be to use the "locale" information provided by the Twitter API to
> get a sense of which language the user is using. It may be a fairly good
> indicator.
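(On the locale workaround quoted above: if your tweets arrive as JSON
from the Twitter API, a rough pre-filter can simply read the user-level
"lang" field. This is only a sketch and assumes your payloads actually
carry that field; it is not a guarantee about the API.)

    import json

    def tweet_locale(raw_json):
        """Return the user-level language/locale code, or "unknown"."""
        tweet = json.loads(raw_json)
        # Older REST/streaming payloads expose the account language under
        # user -> lang; treat a missing field as unknown.
        return tweet.get("user", {}).get("lang") or "unknown"

    def keep_english(raw_json):
        return tweet_locale(raw_json) == "en"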
Dictionary methods are really the poor man's approach. They were
promoted in the 90s solely on the grounds that they were faster than
n-gram approaches; the speed difference shouldn't matter on today's
hardware.
If you really want to do a dictionary approach, then you should
probably collect a corpus of annotated tweets because, as Denzil
notes, the actual text is sparse and often (deliberately) transformed
to avoid common words like "the", "he", "is", etc. If you have that
annotated collection, then you might as well use the Cavnar & Trenkle
n-gram-based algorithm or something smarter :)
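For reference, here is a bare-bones sketch of the Cavnar & Trenkle idea
(character n-gram frequency profiles compared with the out-of-place rank
distance). The names are illustrative; you would train the per-language
profiles on your annotated tweets:

    from collections import Counter

    def char_ngrams(text, n_max=3):
        text = " " + text.lower() + " "
        for n in range(1, n_max + 1):
            for i in range(len(text) - n + 1):
                yield text[i:i + n]

    def profile(text, size=300):
        # Rank the most frequent n-grams (0 = most frequent).
        ranked = [g for g, _ in Counter(char_ngrams(text)).most_common(size)]
        return {g: rank for rank, g in enumerate(ranked)}

    def out_of_place(doc_profile, lang_profile):
        # Sum of rank differences; unseen n-grams get the maximum penalty.
        missing = len(lang_profile)
        return sum(abs(rank - lang_profile[g]) if g in lang_profile else missing
                   for g, rank in doc_profile.items())

    def guess_language(text, lang_profiles):
        doc = profile(text)
        return min(lang_profiles,
                   key=lambda lang: out_of_place(doc, lang_profiles[lang]))

    # e.g. lang_profiles = {"en": profile(english_sample), "nl": profile(dutch_sample)}
    #      guess_language("which language is this tweet in", lang_profiles)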
Hi All,
We just started using NLTK to mine Twitter data and would like to filter the tweets by language. We used the n-gram methods as described here http://misja.posterous.com/language-detection-with-python-nltk and here http://code.activestate.com/recipes/326576/ to determine the message language, but we still end up with quite a few messages that are not in the desired language.
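Roughly, the flavor of the approach we followed looks like this (a
simplified sketch along the lines of those recipes, not our exact code):
build normalized character trigram weights per language and score each
tweet against each language model.

    import math
    from collections import Counter

    def trigram_weights(text):
        text = text.lower()
        counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
        norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
        return {g: c / norm for g, c in counts.items()}

    def detect(tweet, lang_models):
        # lang_models: language code -> trigram_weights(sample text in that language)
        scores = {lang: sum(w * model.get(g, 0.0)
                            for g, w in trigram_weights(tweet).items())
                  for lang, model in lang_models.items()}
        return max(scores, key=scores.get)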
I believe n-gram models are the state of the art in language detection.
I factored Chromium's compact language detector (CLD) out of Chromium's
source code to make it accessible via Python. It's incredibly fast and
seems to get good results on Europarl. Of course, those documents are
not nearly as short as tweets...
I described it at
http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
Also, CLD was used to generate these fun Twitter language maps:
http://www.flickr.com/photos/walkingsf/6277163176/in/photostream
So it at least sort of seems to work on tweets!
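Usage from Python is along these lines (a rough sketch; check the
project README for the exact function name and return values):

    import cld  # the chromium-compact-language-detector Python binding

    text = u"Bonjour tout le monde, il fait beau aujourd'hui"
    # CLD works on UTF-8 bytes; the return tuple shown here is a sketch
    # and the exact names/order may differ by version.
    name, code, is_reliable, bytes_found, details = cld.detect(text.encode("utf-8"))
    print(name, code, is_reliable)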
Mike McCandless
http://blog.mikemccandless.com
I wrote a similar tutorial using scikit-learn in Python / NumPy / SciPy:
http://scikit-learn.github.com/scikit-learn-tutorial/working_with_text_data.html
http://scikit-learn.github.com/scikit-learn-tutorial/exercises.html#exercise-1-language-identification
https://github.com/scikit-learn/scikit-learn-tutorial/blob/master/solutions/exercise_01_language_train_model.py
It uses char n-grams as features on a single Wikipedia page. Use a
larger training set to get better results.
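A condensed sketch of that kind of model (character n-gram TF-IDF
features feeding a linear classifier; the import paths assume a current
scikit-learn install, and the parameters are illustrative rather than the
tutorial's exact solution):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import Perceptron
    from sklearn.pipeline import Pipeline

    # Toy training data; in practice use one or more Wikipedia pages per language.
    docs = [u"this is a short english sentence",
            u"ceci est une courte phrase en francais"]
    labels = ["en", "fr"]

    clf = Pipeline([
        ("vec", TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))),
        ("clf", Perceptron()),
    ])
    clf.fit(docs, labels)
    print(clf.predict([u"encore une phrase en francais"]))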
--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
http://code.google.com/p/language-detection
It's one of the ones I had tested in my blog post (on Europarl docs)...
But just a week ago they added support for detecting 17 languages,
trained on a corpus of tweets. The author describes it here:
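(Side note: a Python port of this library is published on PyPI as
"langdetect"; whether it ships the new Twitter-trained profiles is
something to verify. Basic usage of the port looks like this:)

    from langdetect import DetectorFactory, detect, detect_langs

    DetectorFactory.seed = 0  # make the probabilistic results deterministic

    print(detect(u"Dit is een kort Nederlands zinnetje"))      # e.g. 'nl'
    print(detect_langs(u"short tweets are hard to classify"))  # ranked guesses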
Mike McCandless
http://blog.mikemccandless.com
So, it's "python setup.py build" that's failing to find vcvarsall.bat...
Hmm, one problem is that you have VS 2010 installed (right?), but I
believe Python 2.6 requires VS 2008 for compiling extensions
(apparently the runtime libraries change incompatibly with each VS
release...). Alternatively, you can use Cygwin's MinGW compiler. Sorry
this is such a hassle... I wish there were an easier way to compile
Python extensions on Windows.
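One thing that sometimes helps if you go the MinGW route (assuming
MinGW's gcc is on your PATH) is to tell distutils which compiler to use
explicitly:

    python setup.py build --compiler=mingw32

or, equivalently, put this in a setup.cfg next to setup.py so a plain
"python setup.py build" picks it up:

    [build]
    compiler = mingw32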
Mike McCandless
http://blog.mikemccandless.com