import re
rgx = re.compile(r'(\w+?)(?:ing|ed|es|s)')

def get_base(word):
    m = rgx.match(word)
    if m:
        return m.group(1)
    else:
        return word

words = ['hello', 'taxes', 'thoughts', 'walked', 'rakes']
for word in words:
    print word, get_base(word)
This produces the following output:
> python get_baseword.py
hello hello
taxes tax
thoughts thought
walked walk
rakes rak
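One caveat with the pattern above: rgx.match() anchors only at the start of the word, so the suffix alternation can match in the middle of a word ('singer' would come back as 's'). A minimal sketch of one possible refinement, written for Python 3, anchors the suffix at the end with $. The names SUFFIX_RE and get_base_anchored are made up for illustration. Note that 'rakes' still comes out as 'rak', which is the kind of case a pure suffix rule cannot settle.

```python
import re

# Same suffix list as the original pattern, but anchored with $ so the
# suffix must sit at the very end of the word. This anchoring is an
# illustrative assumption, not part of the original post.
SUFFIX_RE = re.compile(r'(\w+?)(?:ing|ed|es|s)$')

def get_base_anchored(word):
    """Strip one trailing suffix, or return the word unchanged."""
    m = SUFFIX_RE.match(word)
    return m.group(1) if m else word

for word in ['hello', 'taxes', 'thoughts', 'walked', 'rakes', 'singer']:
    print(word, get_base_anchored(word))
```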
I can think of a few things to do to refine this, but before I forge
ahead, I wanted to solicit advice.
Thanks,
John Hunter
: I would like to be able to get the root/base of a word by stripping
: off plurals, gerund endings, participle endings etc...
Hi John,
PyWordNet has a function in it called wntools.morphy() which tries to
take the morphological root of a word. For example:
###
>>> import wntools
>>> words = ['hello', 'taxes', 'thoughts', 'walked', 'rakes']
>>> for w in words: print wntools.morphy(w)
...
hello
tax
thought
None
rake
###
morphy() doesn't quite work out of the box, but only because it
doesn't guess the part of speech: morphy() assumes that we want the
noun root by default. But then, we can do something like this:
###
>>> def root(w):
...     return wntools.morphy(w, "noun") or wntools.morphy(w, "verb")
...
>>> root("walked")
'walk'
>>> root("sent")
'send'
>>> root("woke")
'wake'
###
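The "or" chain in root() relies on morphy() returning None when it finds no root for the requested part of speech, as the 'walked' example shows. For readers without PyWordNet installed, here is a small self-contained sketch of the same fallback idea in Python 3. TOY_ROOTS, toy_morphy and first_root are hypothetical stand-ins with a hand-written table; they are not PyWordNet's data or API.

```python
# Hypothetical stand-in for wntools.morphy(): a tiny hand-written table,
# NOT PyWordNet's real data or implementation.
TOY_ROOTS = {
    ("thoughts", "noun"): "thought",
    ("walked", "verb"): "walk",
    ("sent", "verb"): "send",
}

def toy_morphy(word, pos="noun"):
    """Return the root for (word, pos), or None if the table has no entry."""
    return TOY_ROOTS.get((word, pos))

def first_root(word, parts_of_speech=("noun", "verb")):
    """Try each part of speech in order and return the first root found."""
    for pos in parts_of_speech:
        root = toy_morphy(word, pos)
        if root is not None:
            return root
    return None  # no part of speech produced a root

print(first_root("thoughts"))
print(first_root("walked"))
print(first_root("xyzzy"))
```

Writing the loop out makes the order of the parts of speech explicit and easy to extend, at the cost of a few more lines than the "or" chain.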
So PyWordNet is a really useful tool if you're doing this sort of
stuff. PyWordNet and WordNet can be found here:
http://pywordnet.sourceforge.net/
http://www.cogsci.princeton.edu/~wn/
Best of wishes to you!
Google for python stemmer ;-)
Regards,
Bengt Richter
I have installed both packages and put them into a head-to-head
competition. It appears to me, on a VERY LIMITED sample, that wn does
better than stemmer. The only error (i.e., not what I expected) in wn
was with 'walking', which stemmer got right. So I merged them together
for the mother of all morphological root finders.
import Stemmer, wntools
st = Stemmer.Stemmer('english')

def wnroot(w):
    # Try each part of speech in turn; morphy() returns None on a miss.
    return wntools.morphy(w, "noun") or \
           wntools.morphy(w, "verb") or \
           wntools.morphy(w, "adjective") or \
           wntools.morphy(w, "adverb")

def stemmer_or_wn(word):
    root = wnroot(word)
    # Use the WordNet root only if one was found and it differs from the
    # input; otherwise fall back to the stemmer.
    if root and root != word:
        return root
    return st.stem(word)

words = ['sent', 'walking', 'thoughts', 'rakes', 'eaten', 'tried']
print 'words : ', words
print 'stemmer : ', st.stem( words )
print 'wntools : ', [wnroot(w) for w in words]
print 'combo : ', [stemmer_or_wn(w) for w in words]
~/python/examples $ python wordnet_demo.py
words : ['sent', 'walking', 'thoughts', 'rakes', 'eaten', 'tried']
stemmer : ['sent', 'walk', 'thought', 'rake', 'eaten', 'tri']
wntools : ['send', 'walking', 'thought', 'rake', 'eat', 'try']
combo : ['send', 'walk', 'thought', 'rake', 'eat', 'try']
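The combination rule behind the "combo" row can be stated on its own: take the dictionary root when one exists and differs from the input, otherwise fall back to the suffix stemmer. Below is a minimal Python 3 sketch of that rule that runs without either package. combine, toy_lookup and toy_stem are hypothetical names, and the toy data is invented for illustration; it is not output from wntools or Stemmer.

```python
def combine(lookup, stem):
    """Build a root finder that prefers lookup() and falls back to stem()."""
    def find_root(word):
        root = lookup(word)
        # Fall back when the lookup has no entry (None) or returns the
        # word unchanged.
        if root is None or root == word:
            return stem(word)
        return root
    return find_root

# Hypothetical stand-ins for the dictionary lookup and the stemmer.
toy_lookup = {"sent": "send", "walking": "walking", "tried": "try"}.get

def toy_stem(word):
    """Crude stemmer: strip a trailing 'ing', leave everything else alone."""
    return word[:-3] if word.endswith("ing") else word

find_root = combine(toy_lookup, toy_stem)
print([find_root(w) for w in ["sent", "walking", "tried", "jumping"]])
```

Handling the None case explicitly matters for words the dictionary does not know at all: they go to the stemmer rather than coming back as None.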
Thanks for the help,
John Hunter