French Wordnet, NLTK and wordnet formats


Kan

Jul 15, 2010, 11:07:31 AM
to nltk-users
Hello

I am new to natural language parsing but I have a good IT background.
I want to build a quick and dirty demo of what the technology can do,
but I need to do it in French.

I stumbled on this post: http://groups.google.com/group/nltk-users/browse_thread/thread/37ab9e36cfa4d3ce?tvc=2
but unfortunately the Google Groups interface is so crappy that I was
not able to respond to the thread (maybe it is closed?), and I ended
up sending my message either to the original author (apologies to him
for the spam) or into the great void beyond.

Anyway, I want to work on getting the French wordnet into NLTK, and I
tried to understand the format used at
http://alpage.inria.fr/~sagot/wolf.html.

Has anybody worked with it? Figured out the format?

First, the available file does not seem to be valid XML: it is a long
sequence of <SYNSET> elements with no header, no root element, and
improperly escaped entities ...
So I was wondering if anybody familiar with this data source could
tell me whether the file is broken or whether this is the proper
(strange) pseudo-XML.
Since the post at http://groups.google.com/group/nltk-users/browse_thread/thread/37ab9e36cfa4d3ce?tvc=2
mentions a root element, and there is none in my download, I suspect
something is fishy there.
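
If it really is just a headerless stream of <SYNSET> elements, the
workaround I plan to try is wrapping the stream in a dummy root
element and handing it to ElementTree. A minimal sketch (the file
name is a placeholder, and the broken entities would still need
fixing first):

import xml.etree.ElementTree as ET

# Assumption: wolf.xml is the raw WOLF dump, a bare sequence of
# <SYNSET> elements with no XML declaration and no root element.
with open('wolf.xml', encoding='utf-8') as f:
    fragment = f.read()

# Wrapping the fragment in a dummy root makes it well-formed XML.
root = ET.fromstring('<WOLF>' + fragment + '</WOLF>')

for synset in root.iter('SYNSET'):
    print(synset.findtext('ID'), synset.findtext('DEF'))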

Even if the file is fine, I can't figure out the format. Entries look
like this:
<SYNSET><ID>ENG20-00006000-v</ID><POS>v</POS><SYNONYM></SYNONYM>
<ILR><TYPE>hypernym</TYPE>ENG20-00005679-v</ILR>
<DEF>cough spasmodically</DEF>
<USAGE>The patient with emphysema is hacking all day</USAGE>
<DOMAIN>biology</DOMAIN><SUMO>Breathing<TYPE>+</TYPE></SUMO></SYNSET>

My problem is that I can't figure out WHAT this is a synset FOR.
I supposed that the <ID>ENG20-00006000-v</ID> would point into another
source wordnet, maybe the Princeton one, but I was not able to figure
it out.

Does anybody know enough about the wordnet format, and maybe this
wordnet in particular, to point me in the right direction?

Or maybe somebody knows of a "better" French wordnet ;)

I would really love to work on this and see if I can abstract some
concepts and integrate them into NLTK, but I feel a bit stuck here.

Moreover, lots of Internet resources seem dead, for instance the
BalkaNet project that supposedly originated the XML format used above,
so there is no way to get back to the source and try to figure it
out.
The thing is, many resources seem to be that way, and on the other
hand I have seen lots of efforts focused on "enhancing"/"enriching"
this or that wordnet, nearly each with its own format (EuroWordNet,
the weird Princeton format of the official original WordNet 3.0, some
parts of the SUMO project ...).

Given the seemingly scarce resources, it seems wasteful to work that
way. Maybe it is time to try to build a common format with room for
extensions, and it seems to me that the NLTK project could be
essential to that, because it is a healthy project that can mature
and nurture such a format.

Lots of topics here, I know, but I want to see if I can get the ball
rolling and start some work in that direction.

If the natural language parsing community wants to attract the
attention of the "money guys" and infuse many other fields of work,
as it seems destined to do, it should be easier to build on its work
to make concrete examples for people with their own material and
topics of interest. I think NLTK is a BIG step in that direction and
very close to achieving that.

Plus it is in Python ;)

Thanks for reading, and thanks for the great work on this open source
project.

Cheers

K.


Jordan Boyd-Graber

Jul 15, 2010, 10:55:57 PM
to nltk-...@googlegroups.com
> Has anybody worked with it? Figured out the format?

I looked at WOLF a while ago, but I never put it to use.

> Even if the file is fine, I can't figure out the format. Entries look
> like this:
> <SYNSET><ID>ENG20-00006000-v</ID><POS>v</POS><SYNONYM></SYNONYM>
> <ILR><TYPE>hypernym</TYPE>ENG20-00005679-v</ILR>
> <DEF>cough spasmodically</DEF>
> <USAGE>The patient with emphysema is hacking all day</USAGE>
> <DOMAIN>biology</DOMAIN><SUMO>Breathing<TYPE>+</TYPE></SUMO></SYNSET>

Here is how I would interpret the markup:
<SYNSET><ID>ENG20-00001740-n</ID>

This synset corresponds to the synset at offset 1740 in the noun
hierarchy of English WordNet version 2.0. Note that this is not the
version distributed with NLTK, but it is possible to call a corpus
reader directly to load a different version, if you have the data
downloaded:

from nltk.corpus.reader.wordnet import WordNetCorpusReader
# path_to_wordnet = the directory holding the WN 2.0 database files
# (index.noun, data.noun, and so on)
wn = WordNetCorpusReader(path_to_wordnet)

<POS>n</POS>
The part of speech, again in this case a noun.

<SYNONYM><LITERAL>entite<SENSE>96/4:fr.csen,fr.rocsen,fr.roen,enwikipedia</SENSE></LITERAL></SYNONYM>

One of the members of this synonym set: its literal form, plus sense
information and references to the external sources it came from.


<DEF>concept formulant la categorisation et l'identique des choses de notre environnement</DEF>

Definition.

<BCS>2</BCS>

No idea.

<DOMAIN>factotum</DOMAIN>

The domain of the concept (i.e. its general subject category).

<SUMO>Physical<TYPE>=</TYPE></SUMO>

Where it fits in the SUMO ontology.

</SYNSET>


> My problem is that I can't figure out WHAT this is a synset FOR.
> I supposed that the <ID>ENG20-00006000-v</ID> would point into another
> source wordnet, maybe the Princeton one, but I was not able to figure
> it out.

Exactly. Once you've loaded WN 2.0 (as above), you would get the
corresponding English synset using:

syn = wn._synset_from_pos_and_offset(pos, int(offset))

(e.g. 'n' and 1740 for the example above)
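
As a sketch, assuming the WOLF IDs consistently follow the
ENG20-<offset>-<pos> pattern, you could wrap that in a small helper
(wolf_id_to_synset is just a name I'm making up):

def wolf_id_to_synset(wn, wolf_id):
    # e.g. 'ENG20-00006000-v' -> pos 'v', offset 6000
    _, offset, pos = wolf_id.split('-')
    return wn._synset_from_pos_and_offset(pos, int(offset))

print(wolf_id_to_synset(wn, 'ENG20-00006000-v'))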

> Or maybe somebody knows of a "better" French wordnet ;)

I suspect there is one within the EuroWordNet project, but their
licensing is usually painful.

> The thing is, many resources seem to be that way, and on the other
> hand I have seen lots of efforts focused on "enhancing"/"enriching"
> this or that wordnet, nearly each with its own format (EuroWordNet,
> the weird Princeton format of the official original WordNet 3.0, some
> parts of the SUMO project ...).
>
> Given the seemingly scarce resources, it seems wasteful to work that
> way. Maybe it is time to try to build a common format with room for
> extensions, and it seems to me that the NLTK project could be
> essential to that, because it is a healthy project that can mature
> and nurture such a format.

Agreed. I think one of the things that has kept multilingual wordnets
out of NLTK (apart from licensing) is the diversity of formats. It
would be nice if there were a standard that worked for all of them,
but even within EuroWordNet that doesn't seem to be the case.

Cheers,

Jordan


--
--------------------
Jordan Boyd-Graber
3155 AV Williams
University of Maryland
College Park, MD 20742

Voice: 920.524.9464
j...@umiacs.umd.edu
http://umiacs.umd.edu/~jbg
--------------------

"In theory, there is no difference between theory and practice. But,
in practice, there is."
- Jan L.A. van de Snepscheut

Kan

Jul 16, 2010, 7:10:35 AM
to nltk-users
Thanks a LOT

I was too stupid to look at the Princeton WordNet 2.0 after checking
the indexes of the 3.0 :(. I thought that since WOLF is more recent,
it would use the newer version ...

Thanks also for the explanation of the attributes and elements. I had
figured most of them out from the NLTK book and surrounding
information, but having an authoritative answer is so much better.

The only remaining question for me is this: if the synsets reference
the Princeton wordnet, how do they handle the French words that are
obviously not in the Princeton one? ;)

I need to play with Python to see what such synsets bring out and
where the matching between French and English is represented.

But now I have a starting point thanks to you.

And talking about this and looking at the format makes me cringe
about unique IDs ... if I got it right, the Princeton format is
position dependent!!!! Crazy :)
Simply giving a GUID to the words and synsets, and taking a cue from
semantic web practices and RDF (sameAs links), would be a huge step.
Of course everything seems simple at first and the devil is in the
details, but still. (As your signature says: "In theory, there is no
difference between theory and practice. But, in practice, there is.")
;)

I'll see what I can come up with (and will probably come back with my
tail between my legs, as we say here ;) )

Thanks again.

I am still and always open to any information, cues, tools, or guidance ;)

Cheers


Pedro Marcal

Jul 16, 2010, 12:08:32 PM
to nltk-...@googlegroups.com
Hi Kan,
I have been through two projects using NLTK-based techniques to parse
Japanese and Chinese, respectively. I started from a similar XML-based
Japanese wordnet with the same weird-looking data; it comes from a
dump of some open-source hierarchical database. Instead of wrestling
with that format, I used it to build an English-Japanese dictionary
(and similarly an English-Chinese one). This in any case required
adding words that are not in WordNet.

In order to parse a target language, one needs a good tagged corpus
of circa 2.5 million words. This exists for English in the NLTK
corpora. Japanese has no similar corpus, so I only have a translator
from Japanese to English. Chinese, however, has a very complete
open-source corpus of about 2.5 million words, and I was able to
build a two-way English<=>Chinese translator.
In order to do semantic work you need at least a wordnet in the
target language, so I have built an equivalent Chinese wordnet in
Python dictionary format. I would like to see wordnets in every
language use a well-defined Python dictionary format. It seems to me
that that is the goal you are after too.
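
To make that concrete, here is roughly the shape I have in mind,
reusing the synset from earlier in the thread; a sketch only, the
keys (and the French literal) are illustrative, not the exact format
I used:

# One synset per key; cross-wordnet links stay plain ID strings.
synsets = {
    'ENG20-00006000-v': {
        'pos': 'v',
        'literals': ['tousser'],            # target-language lemmas
        'hypernyms': ['ENG20-00005679-v'],
        'def': 'cough spasmodically',
        'domain': 'biology',
    },
}

# Lookup is then plain dictionary access:
print(synsets['ENG20-00006000-v']['def'])
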
Regards,
Pedro





Jordan Boyd-Graber

Jul 16, 2010, 6:21:02 PM
to nltk-users
> I was too stupid to look at the Princeton WordNet 2.0 after checking
> the indexes of the 3.0 :(. I thought that since WOLF is more recent,
> it would use the newer version ...

It is, but there has been a movement to standardize on 2.0; it's often
used for interlingual mappings (e.g. EuroWordNet). I don't know why
that is.

> The only remaining question for me is this: if the synsets reference
> the Princeton wordnet, how do they handle the French words that are
> obviously not in the Princeton one? ;)

I don't know WOLF well enough to say, but GermaNet follows the
convention of attaching such concepts to an existing parent concept.
For example, "Beinbruch" (broken leg) isn't lexicalized in English, so
they say that it is a hyponym of "fracture".

> I need to play with Python to see what such synsets bring out and
> where the matching between French and English is represented.

I'd like to get a better sense of WOLF, so please let me know how you find it.

> And talking about this and looking at the format makes me cringe
> about unique IDs ... if I got it right, the Princeton format is
> position dependent!!!! Crazy :)

Sigh, yes. But this seems to be the trend. Princeton WordNet does have
unique, stable IDs (the sense keys), but people don't seem to use them
(they'd rather use these fickle integer offsets).
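
For instance, once WN 2.0 is loaded as above, you can get at the sense
keys through a synset's lemmas. A sketch (in older NLTK versions,
lemmas and key are attributes rather than methods):

syn = wn._synset_from_pos_and_offset('v', 6000)  # the WOLF example above
for lemma in syn.lemmas():
    # Sense keys are stable across releases, unlike the offsets.
    print(lemma.key())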

> Simply giving a GUID to the words and synsets, and taking a cue from
> semantic web practices and RDF (sameAs links), would be a huge step.

I've heard it rumored that something like this is in the works for a
future version of WordNet.

--
--------------------
Jordan Boyd-Graber
3155 AV Williams
University of Maryland
College Park, MD 20742

"In theory, there is no difference between theory and practice. But,
in practice, there is."
