Custom tokenization

Onyenwe Ik

Jul 8, 2013, 7:16:36 AM
to nltk-...@googlegroups.com
Hi everyone,

I want to tokenize my corpus with NLTK. There are issues with that, like breaking up non-breaking prefixes, not breaking hyphenated words, and the fact that my corpus is UTF-8 encoded.

My question is: is there a way to define your own rules for an NLTK tokenizer? Does NLTK allow that kind of customization to suit your intended tokenization?

I want to use NLTK to tokenize my corpus so that I can easily move on to its other tools, like POS tagging.

Thanks.

Jacob Perkins

Jul 8, 2013, 4:21:58 PM
to nltk-...@googlegroups.com
Hi,

You can define your own tokenization rule using the RegexpTokenizer: http://nltk.org/api/nltk.tokenize.html#nltk.tokenize.regexp.RegexpTokenizer
There are also quite a few built-in tokenizers beyond the default word tokenizer; you can see a demo of some of them at http://text-processing.com/demo/tokenize/
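For example, here's a minimal sketch that keeps hyphenated words together while splitting off punctuation (the pattern is just an illustration; you'd tune it for your own corpus):

from nltk.tokenize import RegexpTokenizer

# One alternative per token type, tried left to right:
# hyphenated words first, then plain words, then any single
# non-space, non-word character (punctuation).
tokenizer = RegexpTokenizer(r'\w+(?:-\w+)+|\w+|[^\w\s]')
print tokenizer.tokenize(u'A well-known, state-of-the-art tokenizer.')
# [u'A', u'well-known', u',', u'state-of-the-art', u'tokenizer', u'.']

For the non-breaking prefixes, the sentence splitter is the usual culprit: PunktSentenceTokenizer in nltk.tokenize.punkt can be constructed with a PunktParameters object whose abbrev_types set lists the abbreviations it should never split after.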

Jacob
---

Mirko Otto

Jul 8, 2013, 6:40:53 PM
to nltk-...@googlegroups.com
Hi,

Attached is an example.

The to_unicode_or_bust function converts UTF-8 byte strings to Unicode, and uString is an example string in German. The script imports the tokenizers from NLTK, splits the text into sentences, tokenizes the sentences into words, and writes the result back to a file.
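In case the attachment doesn't come through, here is a minimal sketch of the same idea (Python 2, which NLTK targeted at the time; the German example sentence is just a placeholder):

# -*- coding: utf-8 -*-
import codecs
import nltk

def to_unicode_or_bust(obj, encoding='utf-8'):
    # Decode byte strings to unicode; leave unicode objects alone.
    if isinstance(obj, basestring) and not isinstance(obj, unicode):
        obj = unicode(obj, encoding)
    return obj

# Example German text, decoded from UTF-8 bytes to unicode.
uString = to_unicode_or_bust('Das ist ein Beispiel. Es enthält zwei Sätze.')

# Split into sentences, then tokenize each sentence into words.
sentences = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(uString)]

# Write the tokenized text back out as a UTF-8 encoded file.
out = codecs.open('simpleTok.result', 'w', encoding='utf-8')
for sent in sentences:
    out.write(u' '.join(sent) + u'\n')
out.close()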

Enjoy,
Mirko
Attachments: simpleTok.py, simpleTok.result

Onyenwe Ik

Jul 8, 2013, 6:50:37 PM
to nltk-...@googlegroups.com
Thanks, @Mirko and Jacob.
I will give you feedback...