Documentation of tagset used in english-par-linux-3.2.bin


jp

Mar 7, 2014, 3:55:35 PM
to tt4j-...@googlegroups.com
Hello!

I'm successfully using tt4j with the above-mentioned model for tagging English data. I'm a bit puzzled about the tagset that is used for English, though. 
The official TreeTagger page states that the Penn Treebank tags are used. So far so good.
After some googling, I found an info page from the University of Washington that documents the tagset as having 58 tags: http://courses.washington.edu/hypertxt/csar-v02/penntable.html

After using the Groovy script to inspect the model file that I use, I got 59 tags, and they are not the same as those listed at the above link. Can you clear this up for me? Did I miss some documentation where the tagset (and the meaning of each tag) is specified?

Thanks in advance!

Best,

Jonathan

Richard Eckart de Castilho

Mar 7, 2014, 4:36:39 PM
to tt4j-...@googlegroups.com
Hi,

well… this kind of question is why I implemented the code that inspects the parameter file for what's actually inside.

So the differences between the current parameter file and the tags documented on the website you refer to [1] are these:

The website lists these tags, which are not in the parameter file:

VD, VDD, VDG, VDN, VDP, VDZ

Old versions of the TreeTagger homepage (e.g. [2]) do mention this:

> The tagset used by the TreeTagger is a refinement of this tagset where the second letter of the verb part-of-speech tags distinguishes between "be" verbs (B), "have" verbs (H) and other verbs (V).

Later they state this [3]:

> The tagset used by the TreeTagger is a refinement of this tagset: The second letter of the verb part-of-speech tags is used to distinguish between forms of the verb "to be" (B), the verb "to do" (D), the verb "to have" (H), and all the other verbs (V). So, "VDD" is the POS tag for the past tense form of the verb "to do", i.e. for the word "did".
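
As a rough illustration of the scheme quoted above, the second letter of a verb tag can be decoded mechanically. This is only a sketch in plain Java (it does not use tt4j); the class and method names are made up for the example. The step that maps a refined tag back to a plain Penn Treebank verb tag by writing "B" in the second position is an assumption derived from the quoted description, not something the TreeTagger page states in this form. Note that the "D" code only appears in the later of the two quoted versions.

```java
import java.util.Map;

/** Sketch: decode the verb-class letter of a TreeTagger-style verb tag. */
final class VerbTagDecoder {
    // Second-letter codes as listed in the later homepage text quoted above.
    private static final Map<Character, String> VERB_CLASS = Map.of(
            'B', "be", 'D', "do", 'H', "have", 'V', "other");

    /** Returns the verb class, or null if the tag is not a refined verb tag. */
    static String verbClass(String tag) {
        if (tag.length() < 2 || tag.charAt(0) != 'V') {
            return null;
        }
        return VERB_CLASS.get(tag.charAt(1));
    }

    /** Assumption: putting 'B' in second position yields the plain Penn Treebank verb tag. */
    static String toPennTag(String tag) {
        return verbClass(tag) == null ? tag : "VB" + tag.substring(2);
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"VDD", "VVP", "VHZ", "NNS"}) {
            System.out.println(tag + " -> class=" + verbClass(tag) + ", penn=" + toPennTag(tag));
        }
    }
}
```

With this reading, "VDD" decodes to the verb class "do" and collapses to the Penn tag "VBD", while non-verb tags such as "NNS" pass through unchanged.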

So apparently the tagset has changed over time. The current parameter file no longer seems to distinguish the forms of "do"; it just tags them as regular verbs, e.g.:

I/PP do/VVP not/RB get/VV it/PP ./SENT
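
To check which tags a model actually produces, output in the word/TAG form shown above can be tallied. The following is a minimal sketch in plain Java, assuming whitespace-separated word/TAG tokens exactly as in the example line; the class name is hypothetical, and real tagger output would first need to be brought into this format.

```java
import java.util.Map;
import java.util.TreeMap;

/** Sketch: count how often each tag occurs in word/TAG formatted text. */
final class TagCounter {
    /** Counts tags in a line of whitespace-separated word/TAG tokens. */
    static Map<String, Integer> countTags(String taggedLine) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by tag for stable output
        for (String token : taggedLine.trim().split("\\s+")) {
            // Use the last slash so that words containing '/' are still split correctly.
            int slash = token.lastIndexOf('/');
            if (slash < 0) {
                continue; // not a word/TAG token
            }
            counts.merge(token.substring(slash + 1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints {PP=2, RB=1, SENT=1, VV=1, VVP=1}
        System.out.println(countTags("I/PP do/VVP not/RB get/VV it/PP ./SENT"));
    }
}
```

Run over a larger tagged corpus, the same tally shows whether a given tag ever occurs; in the example line, no VD* tag appears even though the sentence contains "do".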

The parameter file contains these tags, which are not on the website:

# - "#" character
'' - PTB quotes [4]
( - opening braces "(", "{" (not "<" or "[" which are tagged as SYM)
) - closing braces ")", "}" (not ">" or "]" which are tagged as SYM)
, - "," character
`` - PTB quotes [4]
NS - It may be the same as "NNS". Might be a typo in the training corpus? But really, no idea. Try tagging a large corpus to see when it actually is produced ;)

I tried the braces, comma, and # just by passing some test input to tree-tagger to see how it reacts. There might be additional characters with those tags.
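
The comparison itself is just a two-way set difference. Here is a small sketch in plain Java; the class name is made up, and the two sets are illustrative subsets only (the tags named in this message plus a few shared ones taken from the example output), not the full inventories of the website or the parameter file.

```java
import java.util.Set;
import java.util.TreeSet;

/** Sketch: compare two tag inventories via set difference. */
final class TagsetDiff {
    /** Returns the tags that are in a but not in b, sorted. */
    static Set<String> onlyIn(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        // Illustrative subsets, not the complete 58- and 59-tag lists.
        Set<String> website = Set.of("PP", "RB", "VV", "VVP", "SENT",
                "VD", "VDD", "VDG", "VDN", "VDP", "VDZ");
        Set<String> model = Set.of("PP", "RB", "VV", "VVP", "SENT",
                "#", "''", "(", ")", ",", "``", "NS");
        System.out.println("website only: " + onlyIn(website, model));
        System.out.println("model only:   " + onlyIn(model, website));
    }
}
```

Feeding in the complete tag list extracted from the parameter file and the complete list from the website would reproduce the two lists given above.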

So the question is: why are the tags in the model different from those on that website? I don't build those models, so I don't know the details. What I do know is that the models and binaries on the TreeTagger homepage are updated from time to time, and the filename doesn't change in the process. We can only guess what happens when they are updated. But with the help of the Internet Archive, we can see that at some point in the past, the model on the TreeTagger website apparently had the same tags as [1]. Some time later it was updated, and the tagset changed.

Cheers,

-- Richard

[1] http://courses.washington.edu/hypertxt/csar-v02/penntable.html
[2] https://web.archive.org/web/20060519085937/http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html
[3] https://web.archive.org/web/20060830180243/http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html
[4] http://www.cis.upenn.edu/~treebank/tokenization.html