Word Tokenizer and apostrophes


Ben

Feb 24, 2016, 11:56:12 AM
to nltk-users
Hi there, 

silly question: how can I get the nltk word tokenizer to ignore apostrophes? It works perfectly for me except that it breaks up words like "4'th" and "don't" into '4', 'th' and 'don', 't'. Is there a simple way to modify the tokenizer so it just ignores these kinds of cases?

Cheers. 

DKing

Mar 3, 2016, 4:08:00 PM
to nltk-users
Not silly. There are a number of prior posts that discuss alternative solutions. Short answer: there is no simple way (e.g. a parameter setting) to modify the behaviour of the default nltk tokenizer (i.e. the nltk.word_tokenize function). There are a number of other tokenizer modules, each with its own quirks. You may be able to use regular expressions instead. For instance, the following might work for your purposes:

import re

text = "O'Leary, that's my hat!"
# First alternative keeps word-internal apostrophes inside the token;
# the remaining alternatives pick up stray quotes and punctuation.
tokens = re.findall(r"\w+(?:[']\w+)*|'|[-.(]+|\S\w*", text)
# yields ["O'Leary", ',', "that's", 'my', 'hat', '!']
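To check it against the cases from the original question, here is the same pattern applied to a made-up sentence containing "Don't" and "4'th" (a quick sketch, not part of the original reply):

```python
import re

# Same pattern as above: word-internal apostrophes stay inside the token.
pattern = r"\w+(?:[']\w+)*|'|[-.(]+|\S\w*"
print(re.findall(pattern, "Don't split 4'th, please."))
# ["Don't", 'split', "4'th", ',', 'please', '.']
```

Both apostrophe-containing words come through as single tokens, while the comma and period are still split off.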