Simple WSD project example


Marcel Wlotzka

Jun 11, 2014, 10:40:22 AM
to dkpro-w...@googlegroups.com
Hi,

I want to use the DKPro WSD project to annotate word senses in quite short texts. I have already used basic features of DKPro and UIMA in some smaller projects, but I am not familiar with everything yet.

Is there a simpler example project that reads plain text instead of using the Semeval readers?


I have already downloaded and run the DKPro WSD GPL example project and changed the sense inventory to UBY. When I run it, the console output shows that word senses are being annotated, but I have not yet found out where these senses are stored. If I use a simple writer to output all annotations, I cannot see any that give me the word senses.


My second problem is that I wanted to create my own reader for simple text documents. For that I created a new project and imported everything I needed to use UBY and the WSD features.

My reader currently annotates every token with LexicalItemConstituent and WSDItem.

The LexicalItemConstituent gets a unique numeric ID and is declared as head.

The WSDItem is currently set to noun for every token, and the SubjectOfDisambiguation is the token itself. Of course this is not ideal, but I wanted to create a simple pipeline first and add lemmatization and correct part-of-speech tags later. Even if this information is wrong for most words, DKPro WSD should already be able to disambiguate a few tokens.

But: I do not get any output from DKPro in the WSD pipeline step. Am I missing anything?

My pipeline does these steps: Reader => BreakIteratorSegmenter => Annotator (LexicalItemConstituent and WSDItem annotations are added here) => simplifiedLesk => writer


Kind regards
Marcel Wlotzka

Tristan Miller

Jun 11, 2014, 12:30:32 PM
to dkpro-w...@googlegroups.com
Dear Marcel,

On 11/06/14 04:40 PM, Marcel Wlotzka wrote:
> Is there a simpler example project that reads plain text instead of
> using the Semeval readers?

I'm afraid not -- we'd love to provide some more examples but
unfortunately haven't found the time to do so yet. Though I agree a
plain text reader would be a good thing for us to implement.

> I have already downloaded and run the DKPro WSD GPL example project
> and changed the sense inventory to UBY. When I run it, the console
> output shows that word senses are being annotated, but I have not yet
> found out where these senses are stored. If I use a simple writer to
> output all annotations, I cannot see any that give me the word
> senses.

All word sense–related annotations are stored as UIMA annotations. A
DKPro WSD reader tags all words to be annotated in the input with a
WSDItem annotation. The disambiguation annotator iterates through all
of these WSDItem annotations and, if it can successfully disambiguate
the word, creates a new WSDResult annotation. The WSDResult contains
five features of interest:

1. wsdItem - this points to the WSDItem annotation the disambiguation
result applies to.
2. senseInventory - the name of the sense inventory used (e.g.,
"WordNet", "UBY")
3. disambiguationMethod - the name of the technique used to disambiguate
the word (e.g., "Lesk", "Random")
4. senses - an array of Sense annotations corresponding to the word
senses the disambiguator chose for this word
5. comment - an optional comment string

A Sense annotation consists of Strings representing the word sense ID
and (optionally) a sense description, and a Double representing the
disambiguator's confidence that this word sense is the correct one.
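
The structure described above can be sketched in plain Java. This is only an illustration under stated assumptions: the record types below are stand-ins that mirror the feature list, not the actual DKPro WSD annotation classes, and the sense IDs and confidence values are made up.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Plain-Java stand-ins for the annotation types described above.
// These are NOT the DKPro WSD classes; field names follow the feature list.
class WsdResultSketch {

    // A candidate sense: ID, optional description, disambiguator confidence.
    record Sense(String id, String description, double confidence) {}

    // The word to be disambiguated, with its character offsets in the text.
    record WsdItem(int begin, int end, String subjectOfDisambiguation) {}

    // One disambiguation result, pointing back at the item it applies to.
    record WsdResult(WsdItem wsdItem, String senseInventory,
                     String disambiguationMethod, List<Sense> senses,
                     String comment) {}

    // Returns the sense with the highest confidence, if any sense was assigned.
    static Optional<Sense> bestSense(WsdResult result) {
        return result.senses().stream()
                .max(Comparator.comparingDouble(Sense::confidence));
    }

    // Renders one result as a single human-readable line.
    static String describe(WsdResult result) {
        String best = bestSense(result).map(Sense::id).orElse("(no sense)");
        return result.wsdItem().subjectOfDisambiguation()
                + " -> " + best
                + " [" + result.senseInventory()
                + ", " + result.disambiguationMethod() + "]";
    }

    public static void main(String[] args) {
        // Hypothetical item and senses, for illustration only.
        WsdItem item = new WsdItem(4, 8, "bank");
        WsdResult result = new WsdResult(item, "UBY", "Lesk",
                List.of(new Sense("sense-1", "financial institution", 0.7),
                        new Sense("sense-2", "river side", 0.3)),
                null);
        System.out.println(describe(result));
    }
}
```

In a real pipeline the same walk (result, then its wsdItem, then its senses) would go through the generated UIMA annotation getters instead of these records.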

The WSDWriter also loops over all the WSDItem annotations in the CAS,
and for each one it displays a human-readable version of all the
WSDResults associated with it. If it's not working for you perhaps you
could post a minimal example and the output?

> My second problem is that I wanted to create my own reader for
> simple text documents. For that I created a new project and imported
> everything I needed to use UBY and the WSD features.
>
> My reader currently annotates every token with LexicalItemConstituent
> and WSDItem.
>
> The LexicalItemConstituent gets a unique numeric ID and is declared as head.
>
> The WSDItem is currently set to noun for every token, and
> the SubjectOfDisambiguation is the token itself. Of course this is not
> ideal, but I wanted to create a simple pipeline first and add
> lemmatization and correct part-of-speech tags later. Even if this
> information is wrong for most words, DKPro WSD should already be able
> to disambiguate a few tokens.
>
> But: I do not get any output from DKPro in the WSD pipeline step. Am I
> missing anything?
>
> My pipeline does these steps: Reader => BreakIteratorSegmenter =>
> Annotator (LexicalItemConstituent and WSDItem annotations are added
> here) => simplifiedLesk => writer

Again, seems like it should work. Can you post a minimal example? If
it's too large then you can send it to me off-list and I'll have a look.

Regards,
Tristan

--
Tristan Miller, Research Scientist
Ubiquitous Knowledge Processing Lab (UKP-TUDA)
Department of Computer Science, Technische Universität Darmstadt
Tel: +49 6151 16 6166 | Web: http://www.ukp.tu-darmstadt.de/


Marcel Wlotzka

Jun 12, 2014, 4:44:23 AM
to dkpro-w...@googlegroups.com
Hi Tristan,

Thank you for your reply. I have looked at my results again, and it seems the WSD annotation does work in my new project.

I was very confused because it is still not working for me in the GPL example, and that is the example I started with to understand the framework.

In the GPL example I get a lot of console output during disambiguation, but afterwards there are no WSDResult annotations in my CAS.

In my own project I do not get any output during disambiguation, but at the end I have a few WSDResult annotations which do have correct values. I had just not seen them because I expected them to be attached to the position of the original token. They all have position 0 and length 0, but since each one links to its WSDItem this seems to be correct.
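
Since the offsets live on the WSDItem rather than on the WSDResult, the token position has to be recovered through that link. A minimal sketch of this lookup, using plain Java stand-in types (hypothetical, not the real DKPro WSD classes) and made-up sense IDs:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in types only; the real annotations are UIMA types with getters.
class ResultsByItemSketch {

    // The item carries the token's real character offsets.
    record WsdItem(int begin, int end, String text) {}

    // The result's own offsets are (0, 0); its position comes from wsdItem.
    record WsdResult(int begin, int end, WsdItem wsdItem, String senseId) {}

    // Groups results by the item they point to, keeping first-seen item order.
    static Map<WsdItem, List<WsdResult>> groupByItem(List<WsdResult> results) {
        Map<WsdItem, List<WsdResult>> byItem = new LinkedHashMap<>();
        for (WsdResult r : results) {
            byItem.computeIfAbsent(r.wsdItem(), k -> new ArrayList<>()).add(r);
        }
        return byItem;
    }

    // Describes a result using the offsets of the linked item, not its own.
    static String locate(WsdResult r) {
        WsdItem item = r.wsdItem();
        return item.text() + "@" + item.begin() + "-" + item.end()
                + " -> " + r.senseId();
    }

    public static void main(String[] args) {
        WsdItem bank = new WsdItem(4, 8, "bank");
        WsdItem river = new WsdItem(16, 21, "river");
        List<WsdResult> results = List.of(
                new WsdResult(0, 0, bank, "sense-1"),
                new WsdResult(0, 0, river, "sense-7"));
        for (WsdResult r : results) {
            System.out.println(locate(r));
        }
    }
}
```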


I still don't know what the problem is with my changes to the GPL example, but since it works in my own project this is no longer an issue for me.


Thank you for your help.

Kind regards
Marcel Wlotzka