Tag probabilities?


Bernd Moos

Apr 16, 2012, 8:20:42 AM
to tt4j-...@googlegroups.com
This is a question from a very content user of TT4J :-) I'm using the wrapper to tag transcriptions of spoken language inside a Java application. My question is: is it possible to get the tag probabilities out of TT4J? I.e. can I run TreeTagger with the option "-prob" and then somehow use the token handler (or whatever other class) to read out the tag probability assigned by tree tagger for a specific token?

Thanks for your help...

Richard Eckart de Castilho

Apr 16, 2012, 9:57:54 AM
to tt4j-...@googlegroups.com
Hello Bernd,

> This is a question from a very content user of TT4J :-) I'm using the wrapper to tag transcriptions of spoken language inside a Java application. My question is: is it possible to get the tag probabilities out of TT4J? I.e. can I run TreeTagger with the option "-prob" and then somehow use the token handler (or whatever other class) to read out the tag probability assigned by tree tagger for a specific token?

I tried it on the command line using

./tree-tagger -quiet -no-unknown -sgml -token -lemma -prob -threshold 0.1 ../lib/english-par-linux-3.2.bin test.txt
This DT this 1.000000
is VBZ be 1.000000
a DT a 1.000000
simple NN simple 0.733482 JJ simple 0.266518
can MD can 0.968350
test VV test 0.966410
. SENT . 1.000000

You could add the "-prob", "-threshold" and "0.1" (or whatever threshold you prefer) arguments using the setArguments() method. If you do that, there are two possibilities:

a) TT4J crashes
b) TT4J will report "NN" as the POS and a complex lemma like "simple 0.733482 JJ simple 0.266518" to the token handler.

I didn't test it, but it's quite likely that b) is going to happen. Then you could parse the purported lemma reported by TT4J as you need it in your TokenHandler.
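Option b) could be handled roughly as follows. This is a minimal sketch in plain Java that does not depend on TT4J. It assumes the combined field has the layout lemma, probability, followed by repeated (tag, lemma, probability) groups, as in the command-line output above, and that lemmas contain no whitespace. The class names TagReading and ProbLemmaParser are hypothetical and not part of TT4J.

```java
import java.util.ArrayList;
import java.util.List;

/** One candidate analysis of a token: tag, lemma and probability. */
final class TagReading {
    final String tag;
    final String lemma;
    final double probability;

    TagReading(String tag, String lemma, double probability) {
        this.tag = tag;
        this.lemma = lemma;
        this.probability = probability;
    }

    @Override
    public String toString() {
        return tag + " " + lemma + " " + probability;
    }
}

/** Hypothetical helper: splits the "purported lemma" into separate readings. */
final class ProbLemmaParser {

    /**
     * @param firstTag      the POS reported separately for the token, e.g. "NN"
     * @param combinedLemma e.g. "simple 0.733482 JJ simple 0.266518"
     */
    static List<TagReading> parse(String firstTag, String combinedLemma) {
        String[] parts = combinedLemma.trim().split("\\s+");
        // Assumed layout: lemma prob (tag lemma prob)*, so the length is 2 + 3k.
        if (parts.length < 2 || (parts.length - 2) % 3 != 0) {
            throw new IllegalArgumentException("Unexpected layout: " + combinedLemma);
        }
        List<TagReading> readings = new ArrayList<>();
        // The first reading borrows its tag from the separately reported POS.
        readings.add(new TagReading(firstTag, parts[0], Double.parseDouble(parts[1])));
        // Every further reading is a complete (tag, lemma, probability) group.
        for (int i = 2; i < parts.length; i += 3) {
            readings.add(new TagReading(parts[i], parts[i + 1],
                    Double.parseDouble(parts[i + 2])));
        }
        return readings;
    }

    public static void main(String[] args) {
        for (TagReading r : parse("NN", "simple 0.733482 JJ simple 0.266518")) {
            System.out.println(r);
        }
    }
}
```

In a real handler, parse() would presumably be called from the token callback with the POS and lemma strings it receives. Splitting on any whitespace means the sketch does not care whether TreeTagger separates the fields with tabs or spaces.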

I also think it would be interesting to extend TT4J to properly support probabilities. It should not take too much effort.

Best,

-- Richard

Richard Eckart de Castilho

Apr 17, 2012, 3:49:42 PM
to tt4j-...@googlegroups.com
>> This is a question from a very content user of TT4J :-) I'm using the wrapper to tag transcriptions of spoken language inside a Java application. My question is: is it possible to get the tag probabilities out of TT4J? I.e. can I run TreeTagger with the option "-prob" and then somehow use the token handler (or whatever other class) to read out the tag probability assigned by tree tagger for a specific token?
>
> I also think it would be interesting to extend TT4J to properly support probabilities. It should not take too much effort.

I tried running tree-tagger with the -prob and -threshold flags, but there seems to be a problem with TreeTagger itself. Normally, TT starts outputting results after a couple of tokens have been passed to it, so TT4J can run TreeTagger in a streaming mode. However, this does not work when probabilities are enabled. Not running TT in this streaming mode would slow down processing terribly, and it would mean that input would have to be provided sentence by sentence.

-- Richard

Bernd Moos

Apr 18, 2012, 3:02:42 AM
to tt4j-...@googlegroups.com
Thanks a lot. I tried passing the appropriate parameters via TT4J and, as you say, the process seems to hang or at least takes ages to process the very first tag. I'll try to find another way of achieving this then...

Richard Eckart de Castilho

Apr 18, 2012, 3:27:50 AM
to tt4j-...@googlegroups.com
On 18.04.2012 at 09:02, Bernd Moos wrote:

> Thanks a lot. I tried passing the appropriate parameters via TT4J and, as you say, the process seems to hang or at least takes ages to process the very first tag. I'll try to find another way of achieving this then...

I am trying to contact Helmut Schmid to see if that is a problem that can be fixed. If so, I'd be happy to add the functionality to TT4J.

-- Richard

Richard Eckart de Castilho

Apr 19, 2012, 5:02:21 PM
to tt4j-...@googlegroups.com
Hello Bernd,

Helmut Schmid has released a new version of TreeTagger which resolves the problem. I have added support for the probabilities to the latest SVN version of TT4J now. Can you please test and comment if that works for you? You can find an illustration of how it works here:

http://code.google.com/p/tt4j/source/browse/tt4j/trunk/org.annolab.tt4j/src/test/java/org/annolab/tt4j/TreeTaggerWrapperTest.java

As far as I know, only the TreeTagger binary for Linux has been fixed so far. Is that enough for you to test?

If you have comments or problems, please report them to http://code.google.com/p/tt4j/issues/detail?id=13

Best,

-- Richard


Bernd Moos

Apr 20, 2012, 3:37:48 AM
to tt4j-...@googlegroups.com
Thanks so much. I will test this and get back here. I work on Windows, so it might take a while until I have set up the environment for Linux. Is there a chance that the Windows binary will also be updated?

Richard Eckart de Castilho

Apr 20, 2012, 3:44:24 AM
to tt4j-...@googlegroups.com
I'll run a couple more tests myself on Linux and if I do not run into problems will ask Helmut Schmid if the other binaries can be updated as well. So far I did only some small tests (just added a few more this morning), but no tests with large documents or a stream of multiple documents yet.

-- Richard

Richard Eckart de Castilho

Apr 25, 2012, 5:22:01 PM
to tt4j-...@googlegroups.com
Hello Bernd,

> I'll run a couple more tests myself on Linux and if I do not run into problems will ask Helmut Schmid if the other binaries can be updated as well. So far I did only some small tests (just added a few more this morning), but no tests with large documents or a stream of multiple documents yet.

The binaries for OS X and Windows have now been updated as well. My tests have gone well on Linux and OS X so far; I have not tested on Windows yet. Once that is tested, I'll do a new release of TT4J with probabilities support.

It'd be great if you could save me some time and check if it works for you on Windows ;)

I tried to implement the probabilities support without losing binary backward-compatibility of TT4J. So instead of adding additional parameters to the token() call, there is a new interface that your TokenHandler can implement to get the probabilities. I think that should work out.
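The design choice described here, an optional extra interface instead of a changed token() signature, can be illustrated with a small self-contained sketch. All types below (SketchTokenHandler, SketchProbabilityHandler, SketchTagger, CollectingHandler) are hypothetical stand-ins written for this illustration, and the probability(pos, probability) signature is an assumption; the thread only names the token() call and, in a later message, the ProbabilityHandler interface.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the existing callback; its signature stays
// unchanged, so handlers compiled against it keep working.
interface SketchTokenHandler<T> {
    void token(T token, String pos, String lemma);
}

// Hypothetical opt-in extension; the method signature is an assumption.
interface SketchProbabilityHandler<T> extends SketchTokenHandler<T> {
    void probability(String pos, double probability);
}

// Hypothetical stand-in for the wrapper's reporting step.
final class SketchTagger<T> {
    private final SketchTokenHandler<T> handler;

    SketchTagger(SketchTokenHandler<T> handler) {
        this.handler = handler;
    }

    /** Reports one token; probabilities go only to handlers that opted in. */
    void report(T token, String[] tags, String[] lemmas, double[] probs) {
        handler.token(token, tags[0], lemmas[0]);
        if (handler instanceof SketchProbabilityHandler) {
            SketchProbabilityHandler<T> ph = (SketchProbabilityHandler<T>) handler;
            for (int i = 0; i < tags.length; i++) {
                ph.probability(tags[i], probs[i]);
            }
        }
    }
}

// A handler that opts in and records everything it is told.
final class CollectingHandler implements SketchProbabilityHandler<String> {
    final List<String> log = new ArrayList<>();

    @Override
    public void token(String token, String pos, String lemma) {
        log.add(token + " " + pos + " " + lemma);
    }

    @Override
    public void probability(String pos, double probability) {
        log.add("  " + pos + " " + probability);
    }
}

final class OptInDemo {
    public static void main(String[] args) {
        CollectingHandler handler = new CollectingHandler();
        new SketchTagger<String>(handler).report("simple",
                new String[] {"NN", "JJ"},
                new String[] {"simple", "simple"},
                new double[] {0.733482, 0.266518});
        handler.log.forEach(System.out::println);
    }
}
```

A handler that implements only the original interface still receives its token() calls and simply never sees probabilities. Adding a parameter to token() instead would break every handler already compiled against the old signature.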

Cheers,

-- Richard

Bernd Moos

Apr 26, 2012, 8:29:43 AM
to tt4j-...@googlegroups.com
> It'd be great if you could save me some time and check if it works for you on Windows ;)

Works like a charm so far. I tested on Windows XP and Windows 7, using the new binaries, implementing the ProbabilityHandler interface and using an almost-real-world test-case (small corpus with 22 documents and roughly 100000 tokens). I can read and use the probability tags, and there are no error messages. :-))))

Richard Eckart de Castilho

Apr 26, 2012, 10:09:09 AM
to tt4j-...@googlegroups.com
On 26.04.2012 at 14:29, Bernd Moos wrote:

> > It'd be great if you could save me some time and check if it works for you on Windows ;)
>
> Works like a charm so far. I tested on Windows XP and Windows 7, using the new binaries, implementing the ProbabilityHandler interface and using an almost-real-world test-case (small corpus with 22 documents and roughly 100000 tokens). I can read and use the probability tags, and there are no error messages. :-))))

This morning I ran into an ArrayIndexOutOfBoundsException on a large corpus (400 GB compressed XML data), which I have fixed now. I'll do another release once I am able to tag that corpus completely.

Cheers,

Richard