[nltk-users] Nominalization of adjectives


Abhishek Ghose

May 14, 2010, 4:51:21 PM
to nltk-users
Hi,

Is it possible to obtain nominalizations of adjectives from WordNet?
For example, for the adjective 'happy', I would like to get
'happiness' (or a synset containing the lemma happiness) as the
desired output.


Thanks!


Sandra Derbring

May 14, 2010, 6:32:12 PM
to nltk-...@googlegroups.com
Hi Abhishek,

If you look at WordNet's online search function, you can see that if you search for 'happy', then press the 'S' before one of the search results, you can choose something called derivationally related form. There's the word happiness.

It seems that, to obtain this from NLTK, you can use 'happy.derivationally_related_forms()' or 'happy.lemmas[0].derivationally_related_forms()'. See more about this here: http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html. Go to section 2, the last example there. I haven't tried it out myself, but it seems doable.

I'm guessing this could render more forms than just the nominalizations for some words, but if you want, try it out, and I'd be interested in hearing about your results with this.


Best,
Sandra


Maciej Pastuszka

May 15, 2010, 12:17:04 PM
to nltk-users
Hello,

I tried the following code first:

from nltk.corpus import wordnet as wn

# Print every derivationally related lemma of every adjective sense.
for i in wn.synsets('happy', wn.ADJ):
    for j in i.lemmas:
        for k in j.derivationally_related_forms():
            print k.name

The problem is that, apart from the noun 'happiness', it also returns
other lemma names from the synsets like 'felicity'.
Introducing stemming in the procedure seems to help. Here is the
rewritten code:

from nltk.corpus import wordnet as wn
from nltk.stem.porter import PorterStemmer

adjective = 'happy'
adjectiveStem = PorterStemmer().stem_word(adjective)
for i in wn.synsets(adjective, wn.ADJ):
    for j in i.lemmas:
        for k in j.derivationally_related_forms():
            # Keep only related forms that contain the adjective's stem.
            if k.name.count(adjectiveStem) > 0:
                print k.name

You might also want to remove the repeated occurrences of the same
nouns (the example above returns the word 'happiness' three times). I
decided to use a list for that purpose:

from nltk.corpus import wordnet as wn
from nltk.stem.porter import PorterStemmer

adjective = 'happy'
adjectiveStem = PorterStemmer().stem_word(adjective)
nouns = []  # names already printed, used to skip repeats
for i in wn.synsets(adjective, wn.ADJ):
    for j in i.lemmas:
        for k in j.derivationally_related_forms():
            # Keep forms that contain the stem and have not been seen yet.
            if (k.name.count(adjectiveStem) > 0) and (k.name not in nouns):
                nouns.append(k.name)
                print k.name
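
For readers on NLTK 3 or later, where `lemmas` and `name` are methods rather than attributes and the Porter stemmer is called through `stem`, the last snippet might be ported roughly as follows. This is only a sketch: the function name `stem_filtered_nouns` and its `wn` and `stem` parameters are illustrative and not part of NLTK. The WordNet reader and the stemmer are passed in as arguments so that the filtering logic does not depend on how they are obtained.

```python
def stem_filtered_nouns(adjective, wn, stem):
    """Collect derivationally related lemma names that contain the adjective's stem.

    `wn` is assumed to behave like nltk.corpus.wordnet (NLTK 3 API) and
    `stem` like PorterStemmer().stem; both are supplied by the caller.
    """
    adjective_stem = stem(adjective)
    found = []  # a list keeps first-seen order; the membership test drops repeats
    for synset in wn.synsets(adjective, wn.ADJ):
        for lemma in synset.lemmas():
            for related in lemma.derivationally_related_forms():
                name = related.name()
                if adjective_stem in name and name not in found:
                    found.append(name)
    return found
```

With NLTK and the WordNet corpus installed, it would be called as `stem_filtered_nouns('happy', wn, PorterStemmer().stem)` after `from nltk.corpus import wordnet as wn` and `from nltk.stem.porter import PorterStemmer`.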

Maciej.

Maciej Pastuszka

May 15, 2010, 1:17:23 PM
to nltk-users
Be careful with the stemmer, though. If you try the adjective 'funny',
the resulting nouns should be 'fun' and 'funniness'. The problem is
that the stemmer reduces 'funny' to 'funni', which eliminates the noun
'fun' because it does not contain the full adjectival stem.
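
One possible way around this, which is not proposed anywhere in the thread and is offered only as a heuristic sketch, is to compare the adjective and the candidate noun by their shared prefix instead of requiring the noun to contain the stem. The function names and the threshold of three characters are arbitrary assumptions chosen for illustration.

```python
def shared_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def looks_related(adjective, candidate, min_prefix=3):
    """Heuristic: accept the candidate if it shares a long enough prefix."""
    return shared_prefix_len(adjective.lower(), candidate.lower()) >= min_prefix
```

Under this test 'funny' accepts both 'fun' and 'funniness', while 'happy' still rejects 'felicity'. Because it is meant to be applied only to lemmas already returned by derivationally_related_forms(), unrelated words that merely share a prefix should rarely come up, but short adjectives may need a different threshold.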

I hope you find a solution that will suit your purposes, Abhishek.

Abhishek Ghose

May 15, 2010, 4:17:03 PM
to nltk-users
Hi,


Sandra, Maciej - thanks for all the input!
derivationally_related_forms() is definitely what I was looking for.

I did note the thing about multiple lemmas.

What I have eventually resorted to doing is:
(1) Use morphy(my_adj, 'a') to "clean" my adjective first (I will call
the 'cleaned' adjective 'my_adj' as well).
(2) For all adjective synsets returned by synsets(my_adj, 'a'),
collect the lemmas with their full names, e.g. 'pure' becomes
'pure.a.01.pure'.
(3) Use derivationally_related_forms() on these lemmas, and retain the
ones that are nouns. So if derv_lemma is a lemma returned by
derivationally_related_forms(), I check whether derv_lemma.synset.pos
== 'n'.
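
The three steps above might be sketched as follows under the NLTK 3 API, where attribute-style access such as `derv_lemma.synset.pos` is written as method calls. This is an illustration under stated assumptions, not the poster's actual code: the function name `noun_derivations` is made up, `wn` is assumed to behave like `nltk.corpus.wordnet`, and the sketch walks the lemma objects directly (matching the adjective case-insensitively) instead of building full lemma names as strings.

```python
def noun_derivations(adjective, wn):
    """Return noun lemmas derivationally related to `adjective`.

    `wn` is assumed to behave like nltk.corpus.wordnet under NLTK 3.
    """
    # Step 1: normalise the adjective; fall back to the input if morphy
    # returns nothing for it.
    base = wn.morphy(adjective, 'a') or adjective
    nouns = []
    # Step 2: in each adjective synset, pick the lemma for the adjective itself.
    for synset in wn.synsets(base, 'a'):
        for lemma in synset.lemmas():
            if lemma.name().lower() != base.lower():
                continue
            # Step 3: keep only the derivationally related lemmas that are nouns.
            for related in lemma.derivationally_related_forms():
                if related.synset().pos() == 'n':
                    nouns.append(related)
    return nouns
```

The returned list can contain the same noun more than once when several adjective senses point at it, so a de-duplication step is still needed afterwards.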

As Maciej pointed out, you often end up with multiple noun lemmas,
and in quite a few cases these noun lemmas come from different noun
synsets too. For example:

Adjective: happy

Adjective synsets:
happy.a.01
felicitous.s.02
glad.s.02
happy.s.04

Lemma(s) found: 1
happiness (Lemma('happiness.n.01.happiness'))

Sense(s) found: 2
happiness.n.01 : state of well-being characterized by emotions ranging
from contentment to intense joy
happiness.n.02 : emotions experienced when in a state of well-being



In the above example, the reason there are 2 synsets but only one lemma
is that I call set() on the list of lemmas to enumerate only distinct
lemmas.
And quite surprisingly, it turns out that:

>>> set([wn.lemma('happiness.n.01.happiness'), wn.lemma('happiness.n.01.happiness')])
set([Lemma('happiness.n.01.happiness')])







Then there are adjectives (these are tagged in my corpus) that have
no nominalizations, e.g. 'five', 'most', 'sparkling'.
Some missing nominalizations are surprising. One example my dataset
turned up was 'intimate', which didn't have a derivationally related
form that could be linked to 'intimacy' (or any noun).


Step (2) ran into a slight problem too, as there seems to be some
inconsistency with respect to case (or maybe it's my understanding) in
how lemmas are named.
While the full name of a lemma 'x' is supposed to be
full_synset_name.x
(http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.wordnet.Lemma-class.html),
I found the following exception:
synset('union.a.01') in synsets('federal','a') ---> True
synset('union.a.01').lemmas : [Lemma('union.s.01.Union'),
Lemma('union.s.01.Federal')]
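
A way to sidestep this kind of mismatch, sketched here under the NLTK 3 method-style API, is to avoid assembling the full name from the query strings and instead find the lemma inside the synset with a case-insensitive comparison, then read the name parts back from the objects. Both helper names below are hypothetical and not part of NLTK.

```python
def lemmas_for_word(synset, word):
    """Lemmas of `synset` whose name matches `word`, ignoring case."""
    wanted = word.lower()
    return [lemma for lemma in synset.lemmas() if lemma.name().lower() == wanted]


def lemma_full_name(lemma):
    """Build 'synset_name.lemma_name' from the lemma's own synset and name."""
    return "%s.%s" % (lemma.synset().name(), lemma.name())
```

For the example above, this would pick the 'Federal' lemma when asked for 'federal' and report its full name with the synset's own 's' tag and the stored capitalisation, whatever string was used to look the synset up.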

In case of multiple choices, I choose the first one. Seems to work out
well - here's a list of 10 adjectives and corresponding nouns:

commercial: commerce
new: newness
indistinguishable: indistinguishability
genuine: genuineness
absolute: absoluteness
necessary: necessity
young: young
distinctive: distinctiveness
important: importance
technical: technicality

Thanks, again!

Abhishek Ghose

May 15, 2010, 9:23:34 PM
to nltk-users
Spotted a typo:

>>> set([wn.lemma('happiness.n.01.happiness'), wn.lemma('happiness.n.01.happiness')])

should be:

>>> set([wn.lemma('happiness.n.01.happiness'), wn.lemma('happiness.n.02.happiness')])
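
The collapse of two lemmas from different synsets into one set element suggests that, in this NLTK version, lemma equality and hashing look only at the lemma's name; that reading is an inference from the output, not something confirmed in the thread. A sketch that makes the de-duplication key explicit, so the result does not depend on how the Lemma class compares objects, could look like this (`distinct_by` is an illustrative helper, not an NLTK function):

```python
def distinct_by(items, key):
    """Return items with duplicate keys removed, keeping first-seen order."""
    seen = set()
    result = []
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            result.append(item)
    return result
```

With NLTK 3 lemma objects, `distinct_by(lemmas, key=lambda l: l.name())` would keep one entry per word, while `key=lambda l: (l.synset().name(), l.name())` would keep one entry per sense.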

Drush D'Costa

May 15, 2010, 11:56:45 PM
to nltk-users
I think it is also possible to do this without using WordNet.
Let's say you are trying to get 'happiness' from 'happy'.

Given a word w:

1) Stem w (Porter, Lancaster, or whatever).
2) Search an English wordlist for words that start with the stem of w.
3) Of the words found in step 2, keep those with the part of speech
you want (verbs/adverbs/nouns).
4) Make a dictionary entry with dict['happy'] = the list from step 3.
Save this dictionary in some file or database for future use.
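
The four steps above could be sketched roughly as below. Everything here is an assumption made for illustration: the function name `related_by_stem`, the idea of passing the stemmer and the part-of-speech check in as callables, and the use of a plain dict as the cache. In practice `stem` might be something like `PorterStemmer().stem` and `is_wanted_pos` a lookup against a tagged lexicon.

```python
def related_by_stem(word, wordlist, stem, is_wanted_pos, cache=None):
    """Find words in `wordlist` that start with the stem of `word`.

    `stem` maps a word to its stem and `is_wanted_pos` decides whether a
    candidate has an acceptable part of speech; both are caller-supplied.
    `cache` is an optional dict used to remember earlier answers (step 4).
    """
    if cache is not None and word in cache:
        return cache[word]
    prefix = stem(word)                      # step 1
    matches = [
        w for w in wordlist
        if w.startswith(prefix)              # step 2
        and w != word
        and is_wanted_pos(w)                 # step 3
    ]
    if cache is not None:
        cache[word] = matches                # step 4
    return matches
```

Scanning the whole wordlist for every new word is the slow part; once a word is in the cache the answer is a single dictionary lookup.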

Pros and cons.
WordNet is great, but whenever you loop over the synsets of a given
word, you can get very different words. Moreover, WordNet is
semantically oriented (isn't it?), and your problem doesn't require
the program to work with similar words; it just wants syntactically
related words.
The method I have described above will be slower than WordNet at
first, but as your dictionary grows you can always check whether a
word is already in your dictionary before trying the technique, and
that lookup will be much faster than WordNet.

Though I think that using WordNet with a Naive Bayes classifier (to
filter out the words you don't want among the lemmas of the synset)
will give somewhat improved results.

Regards


Abhishek Ghose

May 16, 2010, 6:19:25 AM
to nltk-users
Semantic relatedness is a good thing here... otherwise it would be
difficult to choose between connecting 'distinct' to 'distinction' or
'distinctness' (the latter is correct).