Issue 132 in uby: Use MorphaLemmatizer instead of StanfordLemmatizer in WNConvUtil


u...@googlecode.com

Feb 18, 2015, 4:06:17 PM
to uby-dev...@googlegroups.com
Status: Accepted
Owner: richard.eckart
Labels: Type-Enhancement Priority-Medium Milestone-0.7.0
Module-integration-wordnet

New issue 132 by richard.eckart: Use MorphaLemmatizer instead of
StanfordLemmatizer in WNConvUtil
https://code.google.com/p/uby/issues/detail?id=132

The StanfordLemmatizer is based on Morpha. How about switching to the
MorphaLemmatizer to reduce the dependency footprint of the uby wordnet
module? I think that would also allow switching from GPL to ASL for that
module.


u...@googlecode.com

Feb 20, 2015, 5:09:11 AM
to uby-dev...@googlegroups.com

Comment #1 on issue 132 by chmeyer.de: Use MorphaLemmatizer instead of
StanfordLemmatizer in WNConvUtil
https://code.google.com/p/uby/issues/detail?id=132

Might be a candidate for UBY hackday. I'm not convinced that we need decent
lemmatization at all. AFAIK, the question is: given an example sentence for
a synset, find the synonym's lemma whose form is used in the sentence. As
synsets typically consist of a highly limited number of synonyms, simple
prefix matching COULD already help and might even make this strange synset
mapping file obsolete.
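
A minimal sketch of the prefix-matching idea, purely for illustration and not
taken from the UBY code base. The class and method names are hypothetical, and
the tokenization is an assumption: it returns every synonym lemma that is a
prefix of some lowercased token in the example sentence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class PrefixMatchSketch {
    /**
     * Returns the lemmas that are a prefix of at least one token of the
     * example sentence. Hypothetical helper, not part of UBY or extJWNL.
     */
    static List<String> matchByPrefix(List<String> lemmas, String example) {
        // Lowercase, replace everything except letters and spaces, split on whitespace
        String[] tokens = example.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z ]", " ").trim().split("\\s+");
        List<String> hits = new ArrayList<>();
        for (String lemma : lemmas) {
            String prefix = lemma.toLowerCase(Locale.ROOT);
            for (String token : tokens) {
                if (token.startsWith(prefix)) {
                    hits.add(lemma);
                    break; // one matching token is enough for this lemma
                }
            }
        }
        return hits;
    }
}
```

For the synonyms "walk" and "stroll" and the sentence "She strolled along the
beach", only "stroll" is returned. Irregular forms (e.g., "ran" for "run") and
multi-word lemmas are not covered by this sketch.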

u...@googlecode.com

Feb 27, 2015, 4:39:45 AM
to uby-dev...@googlegroups.com

Comment #2 on issue 132 by tristan...@nothingisreal.com: Use
MorphaLemmatizer instead of StanfordLemmatizer in WNConvUtil
https://code.google.com/p/uby/issues/detail?id=132

Why use a separate lemmatizer at all? The UBY WordNet module already
depends on the extJWNL library, which contains a morphological analyzer
that can recover lemmas from an inflected form. Here's a short program
which prints the example sentence for every synonym of every synset:

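    import java.net.URL;
    import java.util.Iterator;
    import net.sf.extjwnl.data.POS;
    import net.sf.extjwnl.data.Synset;
    import net.sf.extjwnl.data.Word;
    import net.sf.extjwnl.dictionary.Dictionary;
    import net.sf.extjwnl.dictionary.MorphologicalProcessor;

    public class PrintSynonymExamples {
        public static void main(String[] args) throws Exception {
            Dictionary wn = Dictionary.getInstance(
                    new URL("file:///path/to/WordNet31/properties_file.xml").openStream());
            MorphologicalProcessor mp = wn.getMorphologicalProcessor();
            for (POS pos : POS.values()) {
                Iterator<Synset> synsetIterator = wn.getSynsetIterator(pos);
                while (synsetIterator.hasNext()) {
                    Synset synset = synsetIterator.next();
                    // Treat everything after the first ';' of the gloss as example sentences
                    String[] examples = synset.getGloss().split(";");
                    for (Word word : synset.getWords()) {
                        for (int i = 1; i < examples.length; i++) {
                            for (String exampleWord : examples[i].toLowerCase()
                                    .replaceAll("[^a-zA-Z ]", " ").split("\\s+")) {
                                // Dummy lookup to work around Issue 6
                                mp.lookupAllBaseForms(pos, exampleWord).contains(word.getLemma());
                                // Report the example if one of its words reduces to this lemma
                                if (mp.lookupAllBaseForms(pos, exampleWord).contains(word.getLemma())) {
                                    System.out.println("Synonym " + word.getLemma()
                                            + " of synset " + synset.getOffset()
                                            + pos.getKey() + " has example" + examples[i]);
                                }
                            }
                        }
                    }
                }
            }
        }
    }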
u...@googlecode.com

Apr 3, 2015, 12:48:20 PM
to uby-dev...@googlegroups.com
Updates:
Status: Fixed
Owner: chmeyer.de
Labels: -Module-integration-wordnet Module-integration.wordnet

Comment #3 on issue 132 by chmeyer.de: Use MorphaLemmatizer instead of
StanfordLemmatizer in WNConvUtil
https://code.google.com/p/uby/issues/detail?id=132

Although it is, unfortunately, not as easy as in Tristan's comment above,
we now have a new disambiguation approach without bloated dependencies. The
disambiguation involves several steps:

1a. Exact full word matching
1b. Exact full prefix matching
1c. Continuous prefix matching
2a. Full base form matching
2b. Partial base form matching
3. Partial prefix matching

There are still a number of examples (fewer than 500) that cannot be
disambiguated. The vast majority of them are not actual sense examples
(e.g., http://wordnetweb.princeton.edu/perl/webwn?s=Chandi). This kind of
information is now stored as a Statement of type "usageNote". The new
method Synset.getUsageExamples() allows reconstructing the original WordNet
examples.
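
The general shape of such a stepwise approach can be sketched as a cascade that
tries strict matchers before lenient ones and accepts the first step that
leaves exactly one candidate. This is only an illustration: the two strategies
below (exact word match, then prefix match) are assumptions and do not
reproduce the actual steps 1a to 3, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.function.BiPredicate;

class CascadeSketch {
    // Hypothetical strategies (lemma, token), ordered from strict to lenient.
    // These are NOT the actual UBY disambiguation steps.
    static final List<BiPredicate<String, String>> STRATEGIES =
            Arrays.<BiPredicate<String, String>>asList(
                    (lemma, token) -> token.equals(lemma),       // exact full word
                    (lemma, token) -> token.startsWith(lemma));  // prefix

    /** Returns the single lemma selected by the first strategy that is unambiguous. */
    static Optional<String> disambiguate(List<String> lemmas, String example) {
        String[] tokens = example.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z ]", " ").trim().split("\\s+");
        for (BiPredicate<String, String> strategy : STRATEGIES) {
            List<String> hits = new ArrayList<>();
            for (String lemma : lemmas) {
                String lower = lemma.toLowerCase(Locale.ROOT);
                if (Arrays.stream(tokens).anyMatch(t -> strategy.test(lower, t))) {
                    hits.add(lemma);
                }
            }
            if (hits.size() == 1) {
                return Optional.of(hits.get(0)); // unambiguous at this step
            }
        }
        return Optional.empty(); // no step produced exactly one candidate
    }
}
```

With the synonyms "run" and "running" and the sentence "He went running
today", the exact step already selects "running", although prefix matching
alone would match both lemmas.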

u...@googlecode.com

Apr 3, 2015, 3:05:43 PM
to uby-dev...@googlegroups.com

Comment #4 on issue 132 by richard.eckart: Use MorphaLemmatizer instead of
StanfordLemmatizer in WNConvUtil
https://code.google.com/p/uby/issues/detail?id=132

So extJWNL doesn't contain a morphological analyzer?

u...@googlecode.com

Apr 7, 2015, 4:57:42 AM
to uby-dev...@googlegroups.com

Comment #5 on issue 132 by chmeyer.de: Use MorphaLemmatizer instead of
StanfordLemmatizer in WNConvUtil
https://code.google.com/p/uby/issues/detail?id=132

extJWNL does contain a morphological analyzer, and we use it for the
disambiguation (in step 2a + b).

Actually, Tristan's code works like a charm. My comment that it is not easy
refers to the low quality of some WordNet example sentences, which require
more elaborate processing/detection. This is why I came up with the slightly
more complex approach outlined above.