"Universal tagset" in NLTK 3?


Rodger Kibble

Sep 17, 2014, 8:57:45 AM
to nltk-...@googlegroups.com
Where is the "universal tagset" documented? The link http://www.nltk.org/book/ch05.html#tab-universal-tagset doesn't seem to lead anywhere useful. Also, how does it differ from the "simplified tagset" in NLTK 2?

thanks

Rodger Kibble
Goldsmiths UoL

Alex Rudnick

Sep 17, 2014, 5:17:00 PM
to nltk-...@googlegroups.com
Hey Rodger,

The universal tagset is this one:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/274_Paper.pdf
https://code.google.com/p/universal-pos-tags/

It's a very small coarse-grained tagset, with just 12 categories:
- NOUN (nouns)
- VERB (verbs)
- ADJ (adjectives)
- ADV (adverbs)
- PRON (pronouns)
- DET (determiners and articles)
- ADP (prepositions and postpositions)
- NUM (numerals)
- CONJ (conjunctions)
- PRT (particles)
- . (punctuation marks)
- X (a catch-all for other categories such as abbreviations or foreign words)

The book should probably give a better reference for it, though, good point!
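
For quick reference in code, the twelve categories listed above can be written down as a plain Python dictionary. This is only a sketch: the names `UNIVERSAL_TAGS` and `is_universal_tag` are made up for illustration and are not part of NLTK.

```python
# The 12 coarse-grained categories listed above, keyed by tag.
# UNIVERSAL_TAGS and is_universal_tag are illustrative names, not NLTK API.
UNIVERSAL_TAGS = {
    "NOUN": "nouns",
    "VERB": "verbs",
    "ADJ": "adjectives",
    "ADV": "adverbs",
    "PRON": "pronouns",
    "DET": "determiners and articles",
    "ADP": "prepositions and postpositions",
    "NUM": "numerals",
    "CONJ": "conjunctions",
    "PRT": "particles",
    ".": "punctuation marks",
    "X": "catch-all for other categories (abbreviations, foreign words)",
}


def is_universal_tag(tag):
    """Return True if `tag` is one of the 12 universal categories."""
    return tag in UNIVERSAL_TAGS


# Fine-grained Brown tags such as "NN" are not members of the tagset.
for tag in ("NOUN", ".", "NN"):
    print(tag, is_universal_tag(tag))
```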
--
-- alexr

Rodger Kibble

Sep 19, 2014, 9:43:09 AM
to nltk-...@googlegroups.com
Thanks Alex!

This is interesting: I get a different result from the example in the book. This is what the book shows:

>>> import nltk
>>> from nltk.corpus import brown
>>> brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
>>> tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
>>> tag_fd.most_common()
[('NOUN', 30640), ('VERB', 14399), ('ADP', 12355), ('.', 11928), ('DET', 11389),
 ('ADJ', 6706), ('ADV', 3349), ('CONJ', 2717), ('PRON', 2535), ('PRT', 2264),
 ('NUM', 2166), ('X', 106)]

I get 92 'X' and 14 'UNK'. The Xes are indeed all foreign words. The UNKs are 'West', 'East' and 'North'. Digging a bit more, these seem to be the tokens that are originally tagged NR-TL in Brown.
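
One way to do that kind of digging is to line up the fine-grained Brown tags with the universal ones, token by token, and collect the original tags behind each 'UNK'. The following is a sketch rather than the exact commands used: `trace_tags` and `trace_brown_news` are hypothetical helper names, and the second function assumes NLTK plus the `brown` and `universal_tagset` data packages are installed. Note that 92 + 14 = 106, the same total as the 'X' count printed in the book, which suggests the book's run counted these 14 tokens under 'X'.

```python
from collections import Counter, defaultdict


def trace_tags(fine_tagged, coarse_tagged, target="UNK"):
    """Group words whose coarse tag equals `target` by their fine-grained tag.

    Both arguments are sequences of (word, tag) pairs over the same tokens,
    in the same order.
    """
    sources = defaultdict(Counter)
    for (word, fine), (_, coarse) in zip(fine_tagged, coarse_tagged):
        if coarse == target:
            sources[fine][word] += 1
    return sources


def trace_brown_news(target="UNK"):
    """Run trace_tags over the Brown 'news' category (hypothetical helper).

    Assumes NLTK is installed and the 'brown' and 'universal_tagset'
    data packages have been downloaded.
    """
    from nltk.corpus import brown

    fine = brown.tagged_words(categories="news")
    coarse = brown.tagged_words(categories="news", tagset="universal")
    return trace_tags(fine, coarse, target)


# Toy data standing in for the corpus, so the sketch runs without NLTK.
toy_fine = [("West", "NR-TL"), ("The", "AT"), ("East", "NR-TL")]
toy_coarse = [("West", "UNK"), ("The", "DET"), ("East", "UNK")]
for fine_tag, words in trace_tags(toy_fine, toy_coarse).items():
    print(fine_tag, dict(words))
```

Calling `trace_brown_news()` against a data collection whose mapping already covers every Brown tag may simply return an empty result.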

Alex Rudnick

Sep 19, 2014, 6:57:15 PM
to nltk-...@googlegroups.com
Hey Rodger,

Ooh, that's interesting! That's a bug, then! NR-TL means "adverbial
noun in a title" [0], so it should be mapped to NOUN, like NR is.

[0] http://en.wikipedia.org/wiki/Brown_Corpus#Part-of-speech_tags_used

In fact, this is a bug in the mappings provided by Slav Petrov:
https://code.google.com/p/universal-pos-tags/source/browse/trunk/en-brown.map

We (or I) can poke them about adding NR-TL :D

Good find!
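
To illustrate why a missing line in the map file surfaces as 'UNK', here is a small sketch. It assumes the map file is a list of tab-separated pairs (fine-grained tag, then universal tag) and that any tag without an entry falls back to 'UNK', which is consistent with what Rodger saw. The helper names `load_mapping` and `to_universal` and the sample entries are illustrative only; they are not copied from en-brown.map or from NLTK.

```python
def load_mapping(lines):
    """Parse 'FINE<TAB>COARSE' lines into a dict (assumed file format)."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fine, coarse = line.split("\t")
        mapping[fine] = coarse
    return mapping


def to_universal(mapping, tag, unknown="UNK"):
    """Look up a fine-grained tag; unlisted tags fall back to `unknown`."""
    return mapping.get(tag, unknown)


# Illustrative entries only. NR -> NOUN is the mapping mentioned above.
sample_lines = ["NR\tNOUN", "NN\tNOUN", "AT\tDET"]

mapping = load_mapping(sample_lines)
print(to_universal(mapping, "NR"))     # NOUN
print(to_universal(mapping, "NR-TL"))  # UNK, because there is no entry

# The proposed fix amounts to adding one line to the map file.
patched = load_mapping(sample_lines + ["NR-TL\tNOUN"])
print(to_universal(patched, "NR-TL"))  # NOUN
```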


--
-- alexr

Steven Bird

Sep 20, 2014, 11:13:56 AM
to nltk-users
I've fixed the broken table in Chapter 5, sorry about that. The further reading section gives a reference for the Universal Tagset (though the link to our bibliography is currently broken).

The NLTK homepage also has a search field and if you search for "universal" you quickly find this, which includes the reference:
http://www.nltk.org/api/nltk.tag.html#module-nltk.tag.mapping

I don't understand why Rodger's output disagrees with what we have in the book!

-Steven



Alex Rudnick

Oct 28, 2014, 1:46:13 PM
to nltk-...@googlegroups.com
That took a while, but now the mappings for Brown corpus tags to
universal POS tags should be complete, upstream.

https://code.google.com/p/universal-pos-tags/
--
-- alexr

Steven Bird

Oct 28, 2014, 2:26:35 PM
to nltk-users
Thanks. I will update the version in our NLTK data collection. You can monitor progress at https://github.com/nltk/nltk_data/issues/15
