get word base

John Hunter

Jun 28, 2002, 4:14:50 PM

I would like to be able to get the root/base of a word by stripping
off plurals, gerund endings, participle endings etc... Here is a
totally naive first attempt that gets it right sometimes:

import re

rgx = re.compile(r'(\w+?)(?:ing|ed|es|s)')

def get_base(word):
    m = rgx.match(word)
    if m:
        return m.group(1)
    else:
        return word

words = ['hello', 'taxes', 'thoughts', 'walked', 'rakes']

for word in words:
    print word, get_base(word)

This produces the following output:
> python get_baseword.py
hello hello
taxes tax
thoughts thought
walked walk
rakes rak

I can think of a few things to do to refine this, but before I forge
ahead, I wanted to solicit advice.
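
One possible refinement, sketched in present-day Python 3 rather than the Python of this thread: the pattern above is not anchored at the end of the word, so a word such as 'singing' matches at the first 'ing' and comes back as 's'. The sketch below anchors the suffix with $ and requires at least two characters of stem. The names NAIVE and SUFFIX and the two-character minimum are assumptions made for illustration, not part of the original attempt.

```python
import re

# The unanchored pattern from the post, kept here for comparison.
NAIVE = re.compile(r'(\w+?)(?:ing|ed|es|s)')

# Assumed refinement: the suffix must end the word, and the stem
# must keep at least two characters.
SUFFIX = re.compile(r'^(\w{2,}?)(?:ing|ed|es|s)$')

def get_base(word):
    """Strip one trailing suffix if the anchored pattern matches."""
    m = SUFFIX.match(word)
    return m.group(1) if m else word

if __name__ == '__main__':
    for word in ['hello', 'taxes', 'thoughts', 'walked', 'rakes', 'singing']:
        print(word, NAIVE.match(word) and NAIVE.match(word).group(1), get_base(word))
```

Anchoring does not help with 'rakes', which still comes back as 'rak': suffix stripping alone cannot tell that the 'e' belongs to the stem.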

Thanks,
John Hunter

Daniel Yoo

Jun 28, 2002, 5:02:17 PM

John Hunter <jdhu...@nitace.bsd.uchicago.edu> wrote:

: I would like to be able to get the root/base of a word by stripping
: off plurals, gerund endings, participle endings etc...

Hi John,

PyWordNet has a function in it called wntools.morphy() which tries to
take the morphological root of a word. For example:

###
>>> import wntools
>>> words = ['hello', 'taxes', 'thoughts', 'walked', 'rakes']
>>> for w in words: print wntools.morphy(w)
...
hello
tax
thought
None
rake
###

morphy() doesn't quite work out of the box, but only because it
doesn't guess the part of speech: it assumes that we want the noun
root by default, which is why 'walked' came back as None. But then,
we can do something like this:

###
>>> def root(w):
...     return wntools.morphy(w, "noun") or wntools.morphy(w, "verb")
...
>>> root("walked")
'walk'
>>> root("sent")
'send'
>>> root("woke")
'wake'
###
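
The "try one part of speech, then the next" idea does not depend on WordNet itself. Here is a minimal sketch in present-day Python 3 that runs without PyWordNet installed. TOY_LEXICON and toy_morphy are hypothetical stand-ins invented for this example; they only mimic the way morphy() returns None on a miss and are not PyWordNet's API or data.

```python
# Hypothetical stand-in for a WordNet-style lookup:
# (word, part of speech) -> root. Not real WordNet data.
TOY_LEXICON = {
    ("taxes", "noun"): "tax",
    ("thoughts", "noun"): "thought",
    ("walked", "verb"): "walk",
    ("sent", "verb"): "send",
}

def toy_morphy(word, pos="noun"):
    """Return the root of word for one part of speech, or None on a miss."""
    return TOY_LEXICON.get((word, pos))

def root(word, parts=("noun", "verb")):
    """Try each part of speech in order and return the first root found."""
    for pos in parts:
        found = toy_morphy(word, pos)
        if found:
            return found
    return None

if __name__ == "__main__":
    for w in ["taxes", "walked", "sent", "xyzzy"]:
        print(w, toy_morphy(w), root(w))
```

The order of parts matters for words that are both a noun and a verb: whichever part of speech is tried first wins.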

So PyWordNet is a really useful tool if you're doing this sort of
stuff. PyWordNet and Wordnet can be found here:

http://pywordnet.sourceforge.net/
http://www.cogsci.princeton.edu/~wn/

Best of wishes to you!

Bengt Richter

Jun 28, 2002, 5:12:58 PM

Google for python stemmer ;-)

Regards,
Bengt Richter

John Hunter

Jun 29, 2002, 10:48:51 AM

Thanks Daniel and Bengt; excellent suggestions both.

I have installed both packages and put them into a head-to-head
competition. It appears to me, on a VERY LIMITED sample, that wn does
better than stemmer. The only error (i.e. not what I expected) in wn
was with 'walking', which stemmer got right. So I merged them together
for the mother of all morphological root finders.

import Stemmer, wntools
st = Stemmer.Stemmer('english')

def wnroot(w):
    # try each part of speech in turn; morphy returns None on a miss
    return wntools.morphy(w, "noun") or \
           wntools.morphy(w, "verb") or \
           wntools.morphy(w, "adjective") or \
           wntools.morphy(w, "adverb")

def stemmer_or_wn(word):
    root = wnroot(word)
    # use the wn root only if one was found and it differs from the
    # input; otherwise (including a None result) fall back to stemmer
    if root and root != word:
        return root
    return st.stem(word)

words = ['sent', 'walking', 'thoughts', 'rakes', 'eaten', 'tried']

print 'words : ', words
print 'stemmer : ', st.stem( words )
print 'wntools : ', [wnroot(w) for w in words]
print 'combo : ', [stemmer_or_wn(w) for w in words]

~/python/examples $ python wordnet_demo.py
words : ['sent', 'walking', 'thoughts', 'rakes', 'eaten', 'tried']
stemmer : ['sent', 'walk', 'thought', 'rake', 'eaten', 'tri']
wntools : ['send', 'walking', 'thought', 'rake', 'eat', 'try']
combo : ['send', 'walk', 'thought', 'rake', 'eat', 'try']
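
The decision rule in the combo (trust the dictionary lookup when it changes the word, otherwise fall back to the stemmer) can be sketched on its own in present-day Python 3. This is only an illustration under stated assumptions: make_root_finder, toy_lookup and toy_stem are hypothetical names made up here, and the toy functions stand in for wntools and Stemmer rather than reproducing them.

```python
def make_root_finder(lookup, stem):
    """Build a root finder that prefers lookup(word) and falls back to stem(word).

    lookup may return a root, the word unchanged, or None;
    stem is expected to always return a string.
    """
    def find_root(word):
        root = lookup(word)
        # Trust the lookup only when it found something different from the input.
        if root is not None and root != word:
            return root
        return stem(word)
    return find_root

# Hypothetical stand-ins for the dictionary lookup and the stemmer.
toy_lookup = {"sent": "send", "walking": "walking", "tried": "try"}.get

def toy_stem(word):
    """Strip a trailing 'ing' and nothing else."""
    return word[:-3] if word.endswith("ing") else word

find_root = make_root_finder(toy_lookup, toy_stem)

if __name__ == "__main__":
    for w in ["sent", "walking", "tried", "hello"]:
        print(w, find_root(w))
```

Checking for None explicitly means a word the lookup does not know at all still reaches the stemmer instead of coming back as None.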

Thanks for the help,
John Hunter