How to maintain things like $1 in tokenize()?

peng...@gmail.com

May 8, 2016, 5:07:49 PM
to nltk-users
Hi, I get the following result when I run main.sh. I want to keep $1 as a single token, but I don't want tokens such as "go," and "yet." that still have punctuation attached. Does anybody know how to keep things like "$1" intact while splitting off ".", ",", "!", etc. as separate tokens?

~$ ./main.sh
['Eighty-seven', 'miles', 'to', 'go,', 'yet.', '$1', 'Onward!']

==> main.py <==
#!/usr/bin/env python
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:

from nltk.tokenize import RegexpTokenizer
import sys

tokenizer = RegexpTokenizer(r'\S+')
print(tokenizer.tokenize(sys.stdin.read()))


==> main.sh <==
#!/usr/bin/env bash
# vim: set noexpandtab tabstop=2:

echo 'Eighty-seven miles to go, yet. $1 Onward!' | ./main.py
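
One possible direction, offered only as a sketch: replace the catch-all r'\S+' with an alternation that tries a dollar amount first, then a word, then a single punctuation character. The pattern below and the names TOKEN_PATTERN and tokenize are illustrative assumptions, not something from the script above, and the rule that hyphenated words such as "Eighty-seven" stay whole is a guess at the intended behavior. It uses only the standard re module so it can be tried without NLTK.

```python
import re

# Assumed token rules (hypothetical, not taken from the original script):
#   1. a number with an optional leading "$", e.g. $1 or $3.50
#   2. a word, optionally hyphenated, e.g. Eighty-seven
#   3. any single remaining non-space, non-word character, e.g. "," "." "!"
# Alternatives are tried left to right, so "$1" is matched by rule 1
# before rule 3 has a chance to split off the "$".
TOKEN_PATTERN = r"\$?\d+(?:\.\d+)?|\w+(?:-\w+)*|[^\w\s]"


def tokenize(text):
    """Return all non-overlapping matches of TOKEN_PATTERN, left to right."""
    return re.findall(TOKEN_PATTERN, text)


if __name__ == "__main__":
    print(tokenize("Eighty-seven miles to go, yet. $1 Onward!"))
```

Because the pattern contains only non-capturing groups, the same string should also work when passed to RegexpTokenizer in place of r'\S+', though that has not been checked against a specific NLTK version here.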
