Custom tokenization

Onyenwe Ik

Jul 8, 2013, 7:16:36 AM
to nltk-...@googlegroups.com
Hi everyone,

I want to tokenize my corpus with NLTK. There are issues with that, like breaking up non-breaking prefixes, not breaking hyphenated words, and the fact that my corpus is UTF-8 encoded.

My question is: is there a way to define your own rules for an NLTK tokenizer? Does NLTK allow that kind of customization to suit your intended tokenization?

I want to use NLTK to tokenize my corpus so that I can easily move on to its other tools, like POS tagging.

Thanks.

Jacob Perkins

Jul 8, 2013, 4:21:58 PM
to nltk-...@googlegroups.com
Hi,

You can define your own tokenization rule using the RegexpTokenizer: http://nltk.org/api/nltk.tokenize.html#nltk.tokenize.regexp.RegexpTokenizer
There are also quite a few built-in tokenizers beyond the default word tokenizer; you can see a demo of some of them at http://text-processing.com/demo/tokenize/
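For example, here's a minimal sketch that keeps hyphenated words together while splitting off punctuation (the pattern is just an illustration; you'd tune it for your own corpus):

from nltk.tokenize import RegexpTokenizer

# One alternative per token type, tried left to right:
# hyphenated words first, then plain words, then any single
# non-space, non-word character (punctuation).
tokenizer = RegexpTokenizer(r'\w+(?:-\w+)+|\w+|[^\w\s]')
print tokenizer.tokenize(u'A well-known, state-of-the-art tokenizer.')
# [u'A', u'well-known', u',', u'state-of-the-art', u'tokenizer', u'.']

For the non-breaking prefixes, the sentence splitter is the usual culprit: PunktSentenceTokenizer in nltk.tokenize.punkt can be constructed with a PunktParameters object whose abbrev_types set lists the abbreviations it should never split after.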

Jacob
---

Mirko Otto

Jul 8, 2013, 6:40:53 PM
to nltk-...@googlegroups.com
Hi,

Attached is an example.

The to_unicode_or_bust function converts UTF-8 byte strings to Unicode, and uString is an example string in German. The script imports the tokenizers from NLTK, splits the text into sentences, tokenizes the sentences into words, and writes the result back to a file.
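In case the attachment doesn't come through, here is a minimal sketch of the same idea (Python 2, which NLTK targeted at the time; the German example sentence is just a placeholder):

# -*- coding: utf-8 -*-
import codecs
import nltk

def to_unicode_or_bust(obj, encoding='utf-8'):
    # Decode byte strings to unicode; leave unicode objects alone.
    if isinstance(obj, basestring) and not isinstance(obj, unicode):
        obj = unicode(obj, encoding)
    return obj

# Example German text, decoded from UTF-8 bytes to unicode.
uString = to_unicode_or_bust('Das ist ein Beispiel. Es enthält zwei Sätze.')

# Split into sentences, then tokenize each sentence into words.
sentences = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(uString)]

# Write the tokenized text back out as a UTF-8 encoded file.
out = codecs.open('simpleTok.result', 'w', encoding='utf-8')
for sent in sentences:
    out.write(u' '.join(sent) + u'\n')
out.close()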

Enjoy,
Mirko
Attachments: simpleTok.py, simpleTok.result

Onyenwe Ik

Jul 8, 2013, 6:50:37 PM
to nltk-...@googlegroups.com
Thanks, @Mirko and Jacob.
I will give you feedback...