[nltk-users] Language detection with nltk?


Ashwathy Narayan

unread,
May 17, 2010, 5:16:47 PM5/17/10
to nltk-users
Hi all,
This could be a really dumb question, but is there a way of filtering
words based on language?
For example, let's say I have a sentence (or a set of words) with words from
multiple languages, like "the quick brown fox jumped over the lazy
sleeping dog le rapide goupil brun sauta par dessus le chien paresseux
sommeil el zorro marrón rápido saltó sobre el perro que duerme
perezoso" - it's just the first sentence in English followed by the
corresponding translations in French and Spanish (courtesy of Google
Translate).
Is there a way to pull out just the English words? I tried loading the
Brown corpus and getting out only the words which appear in it but I
wanted to know if there's something more elegant out there.

Thanks in advance.

A


Richard Careaga

unread,
May 17, 2010, 6:10:14 PM5/17/10
to nltk-...@googlegroups.com
import nltk
# `text` is assumed to be a list of word tokens; anything not in NLTK's English word list ends up in `unusual`
english_vocab = set(w.lower() for w in nltk.corpus.words.words())
text_vocab = set(w.lower() for w in text if w.lower().isalpha())
unusual = text_vocab.difference(english_vocab)
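
For the original question of pulling out just the English words, the same idea works with an intersection instead of a difference. A minimal sketch, with the example sentence from the first message pasted in and a plain split() used for tokenization (both just illustrative assumptions):

import nltk
# nltk.download('words')  # run once if the corpus is not installed yet

sentence = ("the quick brown fox jumped over the lazy sleeping dog "
            "le rapide goupil brun sauta par dessus le chien paresseux sommeil "
            "el zorro marrón rápido saltó sobre el perro que duerme perezoso")

english_vocab = set(w.lower() for w in nltk.corpus.words.words())
tokens = [w for w in sentence.split() if w.isalpha()]
english_words = [w for w in tokens if w.lower() in english_vocab]
print(english_words)

Words from the other languages that also happen to exist in English (for example "par") will still slip through, so this is a rough vocabulary filter rather than real language detection.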

JAGANADH G

unread,
May 18, 2010, 1:19:45 AM5/18/10
to nltk-...@googlegroups.com


On Tue, May 18, 2010 at 3:40 AM, Richard Careaga <leu...@gmail.com> wrote:
english_vocab = set(w.lower() for w in nltk.corpus.words.words())
text_vocab = set(w.lower() for w in text if w.lower().isalpha())
unusual = text_vocab.difference(english_vocab)



Ashwathy,
See the discussion

http://groups.google.co.in/group/nltk-users/browse_thread/thread/8e60c05cb9a7d1f8?hl=en

--
**********************************
JAGANADH G
ജഗനാഥ്.ജി
http://jaganadhg.freeflux.net/blog

Ashwathy Narayan

unread,
May 18, 2010, 3:27:22 AM5/18/10
to nltk-...@googlegroups.com
Thanks for the quick response, Richard. It worked like a charm.
 
On an unrelated note (should I have started a new thread for this?), is there a corpus of brand names? I'm trying to find out how many times different brand names are mentioned in a corpus of articles about consumer appliances, so words like Samsung, Sony, etc. obviously would not be present in the NLTK words corpus. I could always go by first-letter capitalization, but I'm scraping a bunch of blogs for opinion mining and most commenters really don't bother with capitals.

Ashwathy Narayan

unread,
May 18, 2010, 3:33:09 AM5/18/10
to nltk-...@googlegroups.com
Thanks for the link, Jaganadh; I'll go through it. It definitely looks like something I should check out.

JAGANADH G

unread,
May 18, 2010, 4:15:14 AM5/18/10
to nltk-...@googlegroups.com
On Tue, May 18, 2010 at 12:57 PM, Ashwathy Narayan <alilda...@gmail.com> wrote:
Thanks for the quick response, Richard. It worked like a charm.
 
On an unrelated note (should I have started a new thread for this?), is there a corpus of brand names?
I think you may have to create one yourself.
 
I'm trying to find out how many times different brand names are mentioned in a corpus of articles about consumer appliances, so words like Samsung, Sony, etc. obviously would not be present in the NLTK words corpus. I could always go by first-letter capitalization, but I'm scraping a bunch of blogs for opinion mining and most commenters really don't bother with capitals.

Normally in blogs and other forms of social media we can't expect standard orthographic patterns like proper capitalization.
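
Since NLTK does not ship a brand-name corpus, a minimal sketch of the create-your-own approach suggested above, matching case-insensitively so that uncapitalized blog comments are still counted (the brand list and the articles variable are only placeholders):

from collections import Counter

brands = {"samsung", "sony", "lg", "whirlpool"}  # hand-built list, extend as needed
articles = ["my samsung fridge just died", "Sony beats LG on price, imo"]  # placeholder texts

counts = Counter()
for article in articles:
    for token in article.lower().split():
        token = token.strip(".,!?;:\"'()")  # crude punctuation stripping
        if token in brands:
            counts[token] += 1

print(counts)

For a fixed product domain a hand-maintained set like this is usually enough to start with; a named-entity tagger or a gazetteer scraped from retailer sites would scale better.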

--
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog