I looked at Wolf a while ago, but I never put it into use.
> If it is not the case, I can't figure out the format, entries are
> something like that
> <SYNSET><ID>ENG20-00006000-v</ID><POS>v</POS><SYNONYM></
> SYNONYM><ILR><TYPE>hypernym</TYPE>ENG20-00005679-v</ILR><DEF>cough
> spasmodically</DEF><USAGE>The patient with emphysema is hacking all
> day</USAGE><DOMAIN>biology</DOMAIN><SUMO>Breathing<TYPE>+</TYPE></
> SUMO></SYNSET>
Here is how I would interpret the markup:
<SYNSET><ID>ENG20-00001740-n</ID>
This synset corresponds to the synset with offset 1740 in the noun
hierarchy of the English WordNet, version 2.0. Note that this is not
the version distributed with NLTK, but if you have the data downloaded
you can call a corpus reader directly to load a different version:
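from nltk.corpus.reader.wordnet import WordNetCorpusReader
# path_to_wordnet: the directory holding the WordNet 2.0 data files.
# Newer NLTK releases expect a second omw_reader argument here
# (passing None should do if you don't need the multilingual data).
wn = WordNetCorpusReader(path_to_wordnet)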
<POS>n</POS>
The part of speech, again in this case a noun.
<SYNONYM><LITERAL>entite<SENSE>96/4:fr.csen,fr.rocsen,fr.roen,enwikipedia</SENSE></LITERAL></SYNONYM>
One member of this synonym set: the literal (the word itself), plus a
SENSE element that looks like a reference to the external sources it
was derived from.
<DEF>concept formulant la categorisation et l'identique des choses de notre environnement</DEF>
Definition.
<BCS>2</BCS>
No idea.
<DOMAIN>factotum</DOMAIN>
The domain of the concept (e.g. general category)
<SUMO>Physical<TYPE>=</TYPE></SUMO>
Where it fits in the SUMO ontology.
</SYNSET>
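For what it's worth, the reading above can be sketched in a few lines of Python using only the standard library. Treat it as a rough illustration under assumptions: the entry is a trimmed copy of the example (SENSE and BCS left out), and describe_synset is a name I made up, not something that ships with Wolf or NLTK.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the entry discussed above (illustration only).
ENTRY = (
    "<SYNSET><ID>ENG20-00001740-n</ID><POS>n</POS>"
    "<SYNONYM><LITERAL>entite</LITERAL></SYNONYM>"
    "<DOMAIN>factotum</DOMAIN>"
    "<SUMO>Physical<TYPE>=</TYPE></SUMO></SYNSET>"
)

def describe_synset(xml_text):
    """Hypothetical helper: pull the fields discussed above out of one entry."""
    node = ET.fromstring(xml_text)
    # ID layout as interpreted above: <source wordnet>-<offset>-<pos>
    source, offset, pos = node.findtext("ID").split("-")
    return {
        "source": source,
        "offset": int(offset),
        "pos": pos,
        # .text is only the text before any child element such as <SENSE>
        "literals": [(lit.text or "").strip() for lit in node.iter("LITERAL")],
        "domain": node.findtext("DOMAIN"),
        "sumo": (node.findtext("SUMO") or "").strip(),
    }

info = describe_synset(ENTRY)
print(info)
```

The offset is converted to an int because that is what the English lookup needs, and the leading zeros in the ID carry no meaning of their own.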
> My problem is I can't figure WHAT this is a synset FOR.
> I supposed that the <ID>ENG20-00006000-v</ID> would point to another
> source wordnet, maybe the princeton one but was not able to figure it
> out.
Exactly. Once you've loaded WN 2.0 (as above), you would get the
corresponding English synset using:
syn = wn._synset_from_pos_and_offset(pos, int(offset))
(e.g. pos 'n' and offset 1740 for the example above; I believe newer
NLTK releases also expose this publicly as synset_from_pos_and_offset)
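To go from a Wolf ID straight to that call, something along these lines might do. It is a sketch, not tested against the real data: split_wolf_id and english_synset are hypothetical names, and wn is assumed to be a WordNet 2.0 reader that you have already loaded.

```python
def split_wolf_id(wolf_id):
    """Hypothetical helper: 'ENG20-00001740-n' -> ('n', 1740)."""
    prefix, offset, pos = wolf_id.split("-")
    if prefix != "ENG20":
        # Assumption: only WordNet 2.0 style ids are expected here.
        raise ValueError("not a WordNet 2.0 style id: %r" % wolf_id)
    return pos, int(offset)

def english_synset(wn, wolf_id):
    """Look the id up in an already-loaded WordNet 2.0 reader `wn`."""
    pos, offset = split_wolf_id(wolf_id)
    # Same (private) call as in the message above.
    return wn._synset_from_pos_and_offset(pos, offset)

print(split_wolf_id("ENG20-00001740-n"))  # ('n', 1740)
```

Keeping the string handling separate from the lookup means the first half can be checked without any WordNet data installed.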
> Or maybe somebody has knowledge of a "better" french wordnet ;)
I suspect there is one within the EuroWordNet project, but their
licensing is usually painful.
> The thing is many resources seem that way and on the other hand, I
> have seen lots of efforts focused on "enhancing"/"enriching" such or
> such wordnet, nearly each with its own format (Eurowordnet, the weird
> princeton format of the official original wordnet 3.0, some parts of
> the SUMO projects ...).
>
> Given the seemingly scarce resources it seems a waste to do it like
> that and maybe it is time to try to build a common format with room
> for extensions and it seems to me that the nltk project can be
> essential to that because it is a healthy project that can mature and
> nurture such a format.
Agreed. I think one of the things that has kept multilingual WordNets
out of NLTK (apart from licensing) is the diversity of formats. It would
be nice if there were a standard that would work for all, but even
within EuroWordNet, that doesn't seem to be the case.
Cheers,
Jordan
--
--------------------
Jordan Boyd-Graber
3155 AV Williams
University of Maryland
College Park, MD 20742
Voice: 920.524.9464
j...@umiacs.umd.edu
http://umiacs.umd.edu/~jbg
--------------------
"In theory, there is no difference between theory and practice. But,
in practice, there is."
- Jan L.A. van de Snepscheut
It is, but there's been a movement to standardize to 2.0. It's often
used for interlingual mappings (e.g. EuroWordNet). I don't know why
that is.
> The only remaining thing for me is this: if the synsets are
> referencing the Princeton wordnet, how do they deal with the French
> words that are obviously not in the Princeton one ;)
I don't know Wolf well enough, but GermaNet follows the convention of
attaching it to an existing parent concept. For example, "Beinbruch"
(broken leg) isn't lexicalized in English, so they say that it's a
hyponym of "fracture".
> I need to play with python to see what such synsets bring out and where
> the matching between French and English is represented.
I'd like to get a better sense of Wolf; so please let me know how you find it.
> And talking about this and looking at the format makes me cringe for
> unique ids ... if I got it right the Princeton format is position
> dependent !!!! crazy :)
Sigh, yes. But this seems to be a trend. Princeton WordNet does have
unique, stable ids (the sense keys), but people don't seem to use them
(they'd rather use these fickle integer offsets).
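To make the contrast concrete: as I read the WordNet sense index documentation, a sense key has the form lemma%ss_type:lex_filenum:lex_id:head_word:head_id, so it can be taken apart without knowing any file offsets. A small sketch follows; parse_sense_key is a hypothetical helper, and the key used is only there for illustration.

```python
# ss_type codes used in sense keys.
SS_TYPES = {1: "noun", 2: "verb", 3: "adjective",
            4: "adverb", 5: "adjective satellite"}

def parse_sense_key(key):
    """Hypothetical helper: split lemma%ss_type:lex_filenum:lex_id:head_word:head_id."""
    lemma, rest = key.split("%", 1)
    ss_type, lex_filenum, lex_id, head_word, head_id = rest.split(":")
    return {
        "lemma": lemma,
        "pos": SS_TYPES[int(ss_type)],
        "lex_filenum": int(lex_filenum),
        "lex_id": int(lex_id),
        # The last two fields are only filled in for adjective satellites.
        "head_word": head_word or None,
        "head_id": int(head_id) if head_id else None,
    }

parsed = parse_sense_key("dog%1:05:00::")
print(parsed)
```

Nothing in the key depends on where the synset sits in the data file, which is why it can stay the same when the offsets shift between releases.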
> The simple fact of giving a GUID to the words and synsets and taking a
> cue from "semantic practices" and RDF (same_as links) would be a huge
> step.
I've heard rumors that something like this is in the works for a
future version of WordNet.