Character n-grams in NLTK

Correa Denzil

Jan 6, 2011, 10:34:06 AM

to nltk-...@googlegroups.com

Hi,

Does NLTK have a provision to extract character n-grams from text? I would like to extract character n-grams (instead of the traditional word unigrams and bigrams) as features for my text classification task.

--Regards,
Denzil

JAGANADH G

Jan 6, 2011, 2:13:40 PM

to nltk-...@googlegroups.com

On Thu, Jan 6, 2011 at 9:04 PM, Correa Denzil <mce...@gmail.com> wrote:

> Hi,
>
> Does NLTK have a provision to extract character n-grams from text? I would like to extract character n-grams (instead of the traditional word unigrams and bigrams) as features for my text classification task.

Writing a character n-gram package is straightforward in Python. You can try that!
--
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog

Steven Bird

Jan 6, 2011, 11:54:14 PM

to nltk-...@googlegroups.com

On 7 January 2011 06:13, JAGANADH G <jaga...@gmail.com> wrote:
>
> Writing a character n-gram package is straightforward in Python.
> You can try that!

True, though we've also provided it:

http://nltk.googlecode.com/svn/trunk/doc/api/nltk.util-module.html#ngrams

And if anyone is wondering whether NLTK supports certain
functionality, you might try consulting the documentation index here:

http://nltk.googlecode.com/svn/trunk/doc/api/identifier-index.html

-Steven

JAGANADH G

Jan 7, 2011, 12:13:55 AM

to nltk-...@googlegroups.com

With nltk.util you can create character n-grams this way:

>>> from nltk.util import ngrams
>>> t = "Does NLTK have a provision to extract character n-grams from text? I would like to extract character n-grams (instead of traditional unigrams,bigrams) as features to aid my text classification task."
>>> chrs = list(t)  # split the string into individual characters
>>> list(ngrams(chrs, 3))  # list() is needed on NLTK versions where ngrams returns a generator

It will give you trigrams like:

[('D', 'o', 'e'),
('o', 'e', 's'),
('e', 's', ' '),
('s', ' ', 'N'),
(' ', 'N', 'L'),
('N', 'L', 'T'),
('L', 'T', 'K'),
('T', 'K', ' '),
('K', ' ', 'h'),
(' ', 'h', 'a'),
('h', 'a', 'v'),
('a', 'v', 'e'),
('v', 'e', ' '),
('e', ' ', 'a'),
(' ', 'a', ' '),
('a', ' ', 'p'),
...]
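
For anyone who wants to do this without NLTK, as suggested earlier in the thread, here is a minimal sketch in plain Python. The function names char_ngrams and char_ngram_features are made up for this illustration and are not part of NLTK. It returns the n-grams as strings rather than tuples, and counts them into a dict, which is one common shape for classifier features; whether counts or simple presence flags work better for your task is something you would have to try.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all character n-grams of `text` as a list of strings.

    Hypothetical helper for illustration; not an NLTK function.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Slide a window of width n over the string, one character at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def char_ngram_features(text, n):
    """Map each character n-gram in `text` to the number of times it occurs."""
    return dict(Counter(char_ngrams(text, n)))


if __name__ == "__main__":
    print(char_ngrams("Does NLTK", 3))
    print(char_ngram_features("banana", 2))
```

A text shorter than n simply yields an empty list, so no special handling is needed for short inputs.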
