Unique id for token with specific POS


gkl1...@gmail.com

Dec 12, 2017, 4:59:45 PM
to nltk-users
I've searched through the forum here, so apologies up front if this has already been answered:

Is there a data set where a unique integer or universal/unique id has been assigned to a given token with a given POS in the English language?

Has someone already created such a list? Is it updated when new words are added to a given language?

>>> [(token, POS, universal_id)]

For example ... 

[('read', 'VB', 245), ('read', 'VBD', 246), ('read', 'NN', 247), ('well', 'RB', 1124), ('well', 'JJ', 1125), ('well', 'UH', 1126), ('well', 'IN', 1127)]

From a naive, uninformed standpoint, it would seem that once a POS tagger has identified the POS of a given token, a universal ID or unique integer could be assigned to that token with that specific POS in a given language. Such a universal ID/integer would be useful in many other types of software applications.
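To make the idea concrete, here is a minimal sketch of what I have in mind, assuming the IDs are simply handed out in the order the (token, POS) pairs are first seen. The class name `TaggedTokenIndex` and its methods are hypothetical, made up for illustration; they are not part of NLTK or any other library.

```python
from typing import Dict, List, Tuple

TaggedToken = Tuple[str, str]  # (token, POS tag)


class TaggedTokenIndex:
    """Hypothetical helper: gives each distinct (token, POS) pair a consecutive integer id."""

    def __init__(self) -> None:
        self._ids: Dict[TaggedToken, int] = {}   # pair -> id
        self._pairs: List[TaggedToken] = []      # id -> pair

    def get_id(self, token: str, pos: str) -> int:
        """Return the id for (token, pos), assigning the next free integer if it is new."""
        key = (token, pos)
        if key not in self._ids:
            self._ids[key] = len(self._pairs)
            self._pairs.append(key)
        return self._ids[key]

    def lookup(self, pair_id: int) -> TaggedToken:
        """Return the (token, POS) pair that was assigned this id."""
        return self._pairs[pair_id]


index = TaggedTokenIndex()
tagged = [("read", "VB"), ("read", "VBD"), ("well", "RB"), ("read", "VB")]
ids = [index.get_id(tok, pos) for tok, pos in tagged]
print(ids)  # [0, 1, 2, 0]
```

IDs assigned this way are only unique within one index; two programs that see the pairs in a different order would number them differently, which is why I am asking about a shared, universal list.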

This is commonly done in custom information retrieval engines (e.g. index, inverted index), but I am just wondering whether someone has already done this. I believe WordNet did this for synsets ... Thanks,

Gerry


Dimitriadis, A. (Alexis)

Dec 14, 2017, 9:49:19 AM
to nltk-...@googlegroups.com
There’s no such thing, in NLTK or elsewhere, since there is no exhaustive word list for any language and no definitive tagset.

In Python, the tuple `(word, POS)` is the unique identifier you are looking for. 
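As a rough illustration, tuples are hashable, so a `(word, POS)` pair can be used directly as a dictionary key or counted with `collections.Counter`. If you really need a compact ID that comes out the same in every program without a shared registry, one option is to derive it from a hash of the pair. The `stable_id` function below is a hypothetical helper, not an NLTK function, and truncating the digest to 12 hex characters is an arbitrary choice that makes collisions unlikely but not impossible.

```python
import hashlib
from collections import Counter

# Tagged tokens as (word, POS) tuples; the tags here are Penn Treebank style.
tagged = [("well", "UH"), ("read", "VBD"), ("well", "RB"), ("read", "VBD")]

# The tuple itself works as a key: same word with a different tag is a different key.
counts = Counter(tagged)
print(counts[("read", "VBD")])  # 2
print(counts[("well", "UH")])   # 1


def stable_id(word: str, pos: str) -> str:
    """Hypothetical helper: derive a reproducible id from the (word, POS) pair."""
    digest = hashlib.sha1(f"{word}\t{pos}".encode("utf-8")).hexdigest()
    return digest[:12]  # truncated for readability; an assumption, not a standard


print(stable_id("read", "VB") != stable_id("read", "VBD"))  # True
```

Either way, the ID is only meaningful relative to the tagset that produced the POS labels.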

Alexis

