Character n-grams in NLTK


Correa Denzil

Jan 6, 2011, 10:34:06 AM
to nltk-...@googlegroups.com
Hi,

Does NLTK have a provision to extract character n-grams from text? I would like to extract character n-grams (instead of traditional word unigrams and bigrams) as features to aid my text classification task.

--Regards,
Denzil

JAGANADH G

Jan 6, 2011, 2:13:40 PM
to nltk-...@googlegroups.com


On Thu, Jan 6, 2011 at 9:04 PM, Correa Denzil <mce...@gmail.com> wrote:
> Hi,
>
> Does NLTK have a provision to extract character n-grams from text? I would like to extract character n-grams (instead of traditional unigrams,bigrams) as features to aid my text classification task.

Writing a character n-gram package is straightforward in Python. You can try that!
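
For illustration, a minimal sketch of what that could look like in plain Python, with no NLTK involved. The helper name `char_ngrams` is hypothetical, and the choice to return each n-gram as a string (rather than a tuple of characters) is an assumption made here for readability.

```python
def char_ngrams(text, n):
    """Return every contiguous character n-gram of text as a string.

    Hypothetical helper for illustration; not part of NLTK.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n over the string. When the text is shorter
    # than n, the range is empty and an empty list is returned.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("NLTK", 2))  # ['NL', 'LT', 'TK']
```

Spaces and punctuation are kept as ordinary characters here; whether to strip or normalize them first depends on the classification task.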
--
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog

Steven Bird

Jan 6, 2011, 11:54:14 PM
to nltk-...@googlegroups.com
On 7 January 2011 06:13, JAGANADH G <jaga...@gmail.com> wrote:
>
> Writing a character n-gram package is straight forward and easy in Python.
> You can try that !!

True, though we've also provided it:

http://nltk.googlecode.com/svn/trunk/doc/api/nltk.util-module.html#ngrams

And if anyone is wondering whether NLTK supports certain
functionality, you might try consulting the documentation index here:

http://nltk.googlecode.com/svn/trunk/doc/api/identifier-index.html

-Steven

JAGANADH G

Jan 7, 2011, 12:13:55 AM
to nltk-...@googlegroups.com
With nltk.util you can create character n-grams this way:

>>> from nltk.util import ngrams
>>> t = "Does NLTK have a provision to extract character n-grams from text? I would like to extract character n-grams (instead of traditional unigrams,bigrams) as features to aid my text classification task."
>>> chrs = list(t)  # split the string into a list of single characters
>>> list(ngrams(chrs, 3))  # list() is needed in NLTK 3, where ngrams returns an iterator

It will give you trigrams like:

[('D', 'o', 'e'),
 ('o', 'e', 's'),
 ('e', 's', ' '),
 ('s', ' ', 'N'),
 (' ', 'N', 'L'),
 ('N', 'L', 'T'),
 ('L', 'T', 'K'),
 ('T', 'K', ' '),
 ('K', ' ', 'h'),
 (' ', 'h', 'a'),
 ('h', 'a', 'v'),
 ('a', 'v', 'e'),
 ('v', 'e', ' '),
 ('e', ' ', 'a'),
 (' ', 'a', ' '),
 ('a', ' ', 'p'),
...]
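
To use tuples like these as classification features, one option is to join each tuple into a string and count occurrences. The sketch below is only one way to do it and makes a few assumptions: the helper name `char_ngram_features` is hypothetical, raw counts are used as feature values, and the tuples are built with `zip` so the block runs without NLTK installed (it yields the same tuples as an unpadded `ngrams` call).

```python
from collections import Counter


def char_ngram_features(text, n=3):
    """Map each character n-gram in text to its count.

    Hypothetical helper for illustration; not part of NLTK.
    """
    # zip over n shifted views of the string yields the same tuples as
    # nltk.util.ngrams(text, n) without padding.
    grams = zip(*(text[i:] for i in range(n)))
    # Join each tuple of characters into a string so it can serve as a
    # feature name, then count how often each one occurs.
    return dict(Counter("".join(g) for g in grams))


features = char_ngram_features("Does NLTK have a provision", 3)
print(features["Doe"])  # 1
```

A dictionary of this shape can be paired with a label and passed to an NLTK classifier as a featureset, for example in the training list given to `nltk.NaiveBayesClassifier.train`.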