Adding a French Wordnet to NLTK?


wgw

Mar 25, 2010, 10:04:41 PM
to nltk-users
Any tips on this problem?

Here is where I am so far.

I tracked down a free French wordnet at http://alpage.inria.fr/~sagot/wolf.html

It is a large XML file, with a few caveats (the SENSE elements record
sources, not sense numbers). Any tips on how to interface with or
transform an XML wordnet so that the standard NLTK functions can be used?

I'm still a bit fuzzy about how NLTK hooks into the zipped (?) English
database, and of course about how WOLF is structured.

Here are some remarks from the site:

============
The WOLF is in the XML format used in the BalkaNet project. For now,
SENSE elements are filled with information on the sources thanks to
which the lexeme was found, and not with sense numbers.

For now, the WOLF and the Lefff are not mapped. In the following
months, Lefff entries should receive WOLF (i.e. PWN) synset ids.
============

Example of the contents of the database. First in English (search on
emphysema, not just as headword), then in French (emphysème):

<SYNSET><ID>ENG20-00006000-v</ID><POS>v</POS><SYNONYM></SYNONYM><ILR><TYPE>hypernym</TYPE>ENG20-00005679-v</ILR><DEF>cough spasmodically</DEF><USAGE>The patient with emphysema is hacking all day</USAGE><DOMAIN>biology</DOMAIN><SUMO>Breathing<TYPE>+</TYPE></SUMO></SYNSET>
<SYNSET><ID>ENG20-02603316-n</ID><POS>n</POS><SYNONYM></SYNONYM><ILR><TYPE>hypernym</TYPE>ENG20-02802248-n</ILR><ILR><TYPE>usage_domain</TYPE>ENG20-06425540-n</ILR><DEF>a bronchodilator (trade names Ventolin or Proventil) used for asthma and emphysema and other lung conditions; available in oral or inhalant forms; side effects are tachycardia and shakiness</DEF><DOMAIN>pharmacy</DOMAIN><SUMO>BiologicallyActiveSubstance<TYPE>+</TYPE></SUMO></SYNSET>
<SYNSET><ID>ENG20-02856490-a</ID><POS>a</POS><SYNONYM></SYNONYM><ILR><TYPE>derived</TYPE>ENG20-13341586-n</ILR><DEF>relating to or resembling or being emphysema</DEF><DOMAIN>medicine</DOMAIN><SUMO>DiseaseOrSyndrome<TYPE>+</TYPE></SUMO></SYNSET>
<SYNSET><ID>ENG20-03612030-n</ID><POS>n</POS><SYNONYM></SYNONYM><ILR><TYPE>hypernym</TYPE>ENG20-02802248-n</ILR><ILR><TYPE>usage_domain</TYPE>ENG20-06425540-n</ILR><DEF>a bronchodilator (trade name Alupent) used to treat asthma and emphysema and other lung conditions; available in oral or inhalant forms; side effects include tachycardia and shakiness</DEF><DOMAIN>pharmacy</DOMAIN><SUMO>BiologicallyActiveSubstance<TYPE>+</TYPE></SUMO></SYNSET>
<SYNSET><ID>ENG20-13455243-n</ID><POS>n</POS><SYNONYM></SYNONYM><ILR><TYPE>hypernym</TYPE>ENG20-13448422-n</ILR><DEF>a chronic emphysema of the horse that causes difficult expiration and heaving of the flanks</DEF><DOMAIN>medicine</DOMAIN><SUMO>DiseaseOrSyndrome<TYPE>+</TYPE></SUMO></SYNSET>
<SYNSET><ID>ENG20-13556463-n</ID><POS>n</POS><SYNONYM></SYNONYM><ILR><TYPE>hypernym</TYPE>ENG20-13556330-n</ILR><DEF>form of dyspnea in which the person can breathe comfortably only when standing or sitting erect; associated with asthma and emphysema and angina pectoris</DEF></SYNSET>

The entries above are in English; the following is a set of French entries:

<SYNSET><ID>ENG20-02802248-n</ID><POS>n</POS><SYNONYM><LITERAL>bronchodilatateur<SENSE>0/1:enwikipedia</SENSE></LITERAL></SYNONYM><ILR><TYPE>hypernym</TYPE>ENG20-03600430-n</ILR><DEF>médicament destiné à traiter ou à prévenir la bronchoconstriction ou bronchospasme, dans des maladies telles que l'asthme, mais aussi l'emphysème, la pneumonie et les bronchites</DEF><DOMAIN>pharmacy</DOMAIN><SUMO>BiologicallyActiveSubstance<TYPE>+</TYPE></SUMO></SYNSET>
<SYNSET><ID>ENG20-13341586-n</ID><POS>n</POS><SYNONYM><LITERAL>emphysème<SENSE>0/2:enwikipedia,frwiktionary</SENSE></LITERAL></SYNONYM><ILR><TYPE>hypernym</TYPE>ENG20-13339337-n</ILR><DEF>Au sens propre, l'emphysème est un terme d'anatomopathologie désignant la destruction des voies aériennes distales</DEF><DOMAIN>medicine</DOMAIN><SUMO>DiseaseOrSyndrome<TYPE>+</TYPE></SUMO></SYNSET>

Jordan Boyd-Graber

Mar 26, 2010, 2:43:07 PM
to nltk-...@googlegroups.com
Unfortunately, the NLTK WordNet implementation is very much geared
toward the English distribution. New functions would have to be
written to extract the needed information from the XML. This would
likely require reworking the wordnet package in nltk. It would be
nice if the various functions working with the structure of wordnet
(similarity, relations, etc.) could be abstracted to different
languages with different distribution formats, but this is not the
case at present.
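
As a rough illustration of what "extracting the needed information from the XML" might look like, here is a minimal sketch using only Python's standard `xml.etree.ElementTree`. It is not part of NLTK and does not plug into `nltk.corpus.wordnet`. The function names (`parse_synset`, `load_synsets`, `build_lemma_index`) and the `<WN>` wrapper element are hypothetical, chosen for this example; the two sample entries are copied from the original post.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Two entries from the post, wrapped in a hypothetical <WN> root element so
# that the fragment is a well-formed XML document.
SAMPLE = """<WN>
<SYNSET><ID>ENG20-13341586-n</ID><POS>n</POS><SYNONYM><LITERAL>emphysème<SENSE>0/2:enwikipedia,frwiktionary</SENSE></LITERAL></SYNONYM><ILR><TYPE>hypernym</TYPE>ENG20-13339337-n</ILR><DEF>Au sens propre, l'emphysème est un terme d'anatomopathologie désignant la destruction des voies aériennes distales</DEF><DOMAIN>medicine</DOMAIN><SUMO>DiseaseOrSyndrome<TYPE>+</TYPE></SUMO></SYNSET>
<SYNSET><ID>ENG20-00006000-v</ID><POS>v</POS><SYNONYM></SYNONYM><ILR><TYPE>hypernym</TYPE>ENG20-00005679-v</ILR><DEF>cough spasmodically</DEF><DOMAIN>biology</DOMAIN><SUMO>Breathing<TYPE>+</TYPE></SUMO></SYNSET>
</WN>"""


def parse_synset(elem):
    """Turn one <SYNSET> element into a plain dict."""
    # The LITERAL text is the lemma; its SENSE child (a source list in WOLF)
    # is kept alongside it.
    literals = []
    for lit in elem.iter("LITERAL"):
        lemma = (lit.text or "").strip()
        if lemma:
            literals.append((lemma, (lit.findtext("SENSE") or "").strip()))
    # In <ILR><TYPE>hypernym</TYPE>ID</ILR> the target ID follows the TYPE
    # child, so ElementTree exposes it as the tail of TYPE.
    relations = []
    for ilr in elem.findall("ILR"):
        rel = ilr.find("TYPE")
        if rel is not None:
            relations.append(((rel.text or "").strip(), (rel.tail or "").strip()))
    return {
        "id": elem.findtext("ID"),
        "pos": elem.findtext("POS"),
        "literals": literals,
        "relations": relations,
        "definition": elem.findtext("DEF"),
    }


def load_synsets(xml_text):
    """Parse all <SYNSET> elements into a dict keyed by synset ID."""
    root = ET.fromstring(xml_text)
    return {s["id"]: s for s in map(parse_synset, root.iter("SYNSET"))}


def build_lemma_index(synsets):
    """Map each lemma to the IDs of the synsets that contain it."""
    index = defaultdict(list)
    for synset_id, synset in synsets.items():
        for lemma, _source in synset["literals"]:
            index[lemma].append(synset_id)
    return dict(index)


synsets = load_synsets(SAMPLE)
lemma_index = build_lemma_index(synsets)
```

With the sample loaded, `lemma_index["emphysème"]` gives the matching synset IDs, and `synsets[some_id]["relations"]` lists `(type, target_id)` pairs that can be followed to walk hypernyms. For the full WOLF file, `ET.iterparse` would probably be a better fit than reading everything into memory at once.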

Cheers,

Jordan


--
--------------------
Jordan Boyd-Graber
3155 AV Williams
University of Maryland
College Park, MD 20742

Voice: 920.524.9464
j...@umiacs.umd.edu
http://umiacs.umd.edu/~jbg
--------------------

"In theory, there is no difference between theory and practice. But,
in practice, there is."
- Jan L.A. van de Snepscheut

wgw

Mar 26, 2010, 9:38:48 PM
to nltk-users
Thanks! I will dig into the code a bit, and see whether I can adapt it
to the French data. This is in any case a very rough version of
Wordnet, so access will be idiosyncratic.

B

Pedro Marcal

Mar 27, 2010, 2:12:09 AM
to nltk-...@googlegroups.com
In their access to the Japanese Wordnet, they have developed a front end (Python-based, among others in Perl, Java, etc.) that puts the two wordnets in an sqlite3 database. It might be possible to use this as a framework and substitute other language wordnets into the bilingual framework. I plan to work with this within the next six months and might be able to talk more intelligently then.
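
To make the idea of two wordnets in one sqlite3 database concrete, here is a minimal sketch using Python's standard `sqlite3` module. The schema (tables `synset`, `sense`, `relation`) and the function names are hypothetical, made up for this example; they are not the Japanese Wordnet's actual schema. The sketch assumes both languages share Princeton-style synset IDs, as the WOLF entries quoted earlier in the thread do, and the English lemma `emphysema` is filled in by hand for illustration.

```python
import sqlite3


def create_db(path=":memory:"):
    """Create an empty bilingual wordnet database (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE synset (id TEXT PRIMARY KEY, pos TEXT);
        CREATE TABLE sense (synset_id TEXT, lang TEXT, lemma TEXT);
        CREATE TABLE relation (source_id TEXT, rel_type TEXT, target_id TEXT);
        CREATE INDEX sense_by_lemma ON sense (lang, lemma);
        """
    )
    return conn


def add_synset(conn, synset_id, pos, lemmas_by_lang, relations=()):
    """Insert one synset with its lemmas per language and its relations."""
    conn.execute("INSERT OR IGNORE INTO synset VALUES (?, ?)", (synset_id, pos))
    for lang, lemmas in lemmas_by_lang.items():
        conn.executemany(
            "INSERT INTO sense VALUES (?, ?, ?)",
            [(synset_id, lang, lemma) for lemma in lemmas],
        )
    conn.executemany(
        "INSERT INTO relation VALUES (?, ?, ?)",
        [(synset_id, rel_type, target) for rel_type, target in relations],
    )
    conn.commit()


def translate(conn, lemma, src_lang, dst_lang):
    """Return lemmas in dst_lang that share a synset with lemma in src_lang."""
    rows = conn.execute(
        """
        SELECT DISTINCT b.lemma
        FROM sense AS a JOIN sense AS b ON a.synset_id = b.synset_id
        WHERE a.lang = ? AND a.lemma = ? AND b.lang = ?
        ORDER BY b.lemma
        """,
        (src_lang, lemma, dst_lang),
    )
    return [row[0] for row in rows]


conn = create_db()
add_synset(
    conn,
    "ENG20-13341586-n",
    "n",
    {"fra": ["emphysème"], "eng": ["emphysema"]},  # English lemma added by hand
    relations=[("hypernym", "ENG20-13339337-n")],
)
french_to_english = translate(conn, "emphysème", "fra", "eng")
```

Because lemmas from each language hang off the same synset row, swapping in another language would mostly mean loading its lemmas under a different `lang` value, provided its synset IDs line up with the same Princeton WordNet version.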
Regards,
Pedro

wgw

Mar 27, 2010, 2:07:58 PM
to nltk-users
Oops! Anaphora resolution problem! Who is "their" in "In their access
to the Japanese Wordnet"?

I assume it is http://nlpwww.nict.go.jp/wn-ja/index.en.html (which has an
interesting article on bootstrapping new wordnets from other language
wordnets). I found nothing about how they designed their SQLite
database, though I did find their Python interface: http://gist.github.com/79057

Perhaps they simply used the format listed on the main wn site:
http://wnsql.sourceforge.net/ ?

All that does give me more grist for my mill than I can probably
handle!

Thanks!

B

Pedro Marcal

Mar 28, 2010, 4:54:27 PM
to nltk-...@googlegroups.com
Hi B,
Thanks for your email. I knew that if you were interested you would Google away the anaphora.
I will write to Francis Bond of the Japanese Wordnet project and ask him what they did to load their SQLite database. I have to load a Chinese Wordnet in the same way. It would be easier if they used the wnsql code that you refer to.
I agree with you: "Too much grist and not enough mill!"
Regards,
Pedro
