Word Tokenizer and apostrophes


Ben

Feb 24, 2016, 11:56:12 AM
to nltk-users
Hi there, 

silly question: how can I get the nltk word tokenizer to ignore apostrophes? It works perfectly for me except that it breaks up words like "4'th" and "don't" into '4', 'th' and 'don', 't'. Is there a simple way to modify the tokenizer so it just ignores these kinds of cases?

Cheers. 

DKing

Mar 3, 2016, 4:08:00 PM
to nltk-users
Not silly. There are a number of prior posts that discuss alternative solutions. Short answer: there is no simple way (e.g. a parameter setting) to modify the behaviour of the default nltk tokenizer (i.e. the nltk.word_tokenize function). There are a number of other tokenizer modules, each with its own quirks. You may be able to use regular expressions instead. For instance, the following might work for your purposes:

import re

text = "O'Leary, that's my hat!"
# First alternative keeps word-internal apostrophes inside the token;
# the remaining alternatives pick up stray quotes and punctuation.
tokens = re.findall(r"\w+(?:[']\w+)*|'|[-.(]+|\S\w*", text)
# yields ["O'Leary", ',', "that's", 'my', 'hat', '!']
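To check it against the cases from the original question, here is the same pattern applied to a made-up sentence containing "Don't" and "4'th" (a quick sketch, not part of the original reply):

```python
import re

# Same pattern as above: word-internal apostrophes stay inside the token.
pattern = r"\w+(?:[']\w+)*|'|[-.(]+|\S\w*"
print(re.findall(pattern, "Don't split 4'th, please."))
# ["Don't", 'split', "4'th", ',', 'please', '.']
```

Both apostrophe-containing words come through as single tokens, while the comma and period are still split off.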