Tagging two consecutive words

Romain Cosson

8 Nov 2017, 17:37:36

to nltk-users

Dear all,

I am having a problem and I'm sure there is a very simple way to solve it. I have a dictionary of states, and some of them consist of two words.

states = [["United","States"],["Spain"]...]

I am working with the IOB tags

tags = ["B-State","I-State","o"]

And I would like to implement a very simple tagger which would do this:

>>> tagger.tag(["The", "United", "Nations", "condemned", "the", "United", "States"])
[("The", "o"), ("United", "o"), ("Nations", "o"), ("condemned", "o"), ("the", "o"), ("United", "B-State"), ("States", "I-State")]

Thank you very much for your help !!!

Romain

Alex Rudnick

8 Nov 2017, 18:15:02

to nltk-...@googlegroups.com

Hey Romain,

This could be really simple, or actually quite complicated, depending
on how you want to view it.

If your problem is "how do I look at two words at a time in a list of
words", then that's not so hard. Maybe do something like:

    for index in range(len(words)):
        oneword = words[index]
        # check to see if it's in the dictionary
        if index < len(words) - 1:
            twowords = [words[index], words[index+1]]
            # check to see if they're in the dictionary as a sequence

... something like that.
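
A minimal sketch of that loop filled in, assuming the dictionary is stored as a set of tuples so that membership tests are cheap. The names `STATES` and `find_matches` are made up for illustration and are not part of NLTK:

```python
# Hypothetical dictionary: each entry is a tuple of one or two words.
STATES = {("United", "States"), ("Spain",)}


def find_matches(words):
    """Return (start_index, entry) for every one- or two-word dictionary hit."""
    matches = []
    for index in range(len(words)):
        # Single-word lookup.
        one_word = (words[index],)
        if one_word in STATES:
            matches.append((index, one_word))
        # Two-word lookup, only if there is a next word.
        if index < len(words) - 1:
            two_words = (words[index], words[index + 1])
            if two_words in STATES:
                matches.append((index, two_words))
    return matches


sentence = ["The", "United", "Nations", "condemned", "the", "United", "States"]
print(find_matches(sentence))  # [(5, ('United', 'States'))]
```

This only reports where the dictionary entries occur; turning the hits into tags is a separate step.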

The harder problem is that you've got to decide whether, in this
context, an instance of the string in your dictionary is actually a
reference to the sovereign nation. As a trivial example, what if you
find the word "Turkey" in your input?

Like almost all NLP, this becomes a problem of ambiguity resolution
really quickly.

Hope this helps!

--
-- alexr

Romain Cosson

9 Nov 2017, 05:00:45

to nltk-users

Thank you for your very quick answer Alex !

Indeed, it does not seem complicated to look at two consecutive terms, and I am not even trying to disambiguate "United States". My question is how to do that with the taggers that are in the NLTK library. nltk.BigramTagger is not perfectly appropriate, but it is the kind of function I am looking for. Maybe I just have to create my own class, but I wouldn't want to do that until I am sure it is not already implemented.

Thank you again ! :)

Romain

Dimitriadis, A. (Alexis)

9 Nov 2017, 05:24:03

to nltk-...@googlegroups.com

There’s no “very simple way” to use an NLTK tagger for this, because this is not how taggers work. NLTK’s taggers are trained on annotated corpora, not on dictionaries, and use contextual information (e.g. the word and the POS tag of the previous word) to decide on a tag. Dictionary-based tagging (i.e. word look-up) corresponds to a unigram tagger, which you can’t use since you have multi-word expressions (MWEs). The only obvious way to build a tagger from your data is to somehow add annotations to a substantial corpus of text (with your own code), then use that to train a tagger. The NLTK book shows how to write a chunker for named entity recognition.

My recommendation is to put the taggers aside and hand-code a solution to the problem. I’d start by building an index of the first word of each key in your dictionary, so that you know when you may be at the beginning of an MWE, and take it from there.
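
One way such a hand-coded solution might look is sketched below. It is only an illustration under some assumptions: matching is exact and case-sensitive, the longest dictionary entry wins when several start with the same word, and no disambiguation is attempted. The helper names `build_index` and `tag_states` are hypothetical, not NLTK functions.

```python
from collections import defaultdict

# Hypothetical dictionary, in the same shape as in the original question.
STATES = [["United", "States"], ["Spain"]]


def build_index(entries):
    """Map the first word of each entry to all entries starting with it."""
    index = defaultdict(list)
    for entry in entries:
        index[entry[0]].append(tuple(entry))
    # Try longer entries first so that the longest match wins.
    for candidates in index.values():
        candidates.sort(key=len, reverse=True)
    return index


def tag_states(words, index, label="State", outside="o"):
    """Assign IOB tags by exact, case-sensitive dictionary lookup."""
    tagged = []
    i = 0
    while i < len(words):
        for entry in index.get(words[i], ()):
            # Compare the next len(entry) words against the entry.
            if tuple(words[i:i + len(entry)]) == entry:
                tagged.append((words[i], "B-" + label))
                tagged.extend((w, "I-" + label) for w in entry[1:])
                i += len(entry)
                break
        else:
            # No entry matched at this position.
            tagged.append((words[i], outside))
            i += 1
    return tagged


index = build_index(STATES)
sentence = ["The", "United", "Nations", "condemned", "the", "United", "States"]
print(tag_states(sentence, index))
```

On the example sentence from the question, "United" in "United Nations" stays tagged "o" because the following word does not complete a dictionary entry, while the final "United States" gets "B-State" and "I-State". An ambiguous word such as "Turkey" would be tagged as a state whenever it appears, whatever the context.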

Alexis

Romain Cosson

9 Nov 2017, 07:31:47

to nltk-users

Thank you very much for your answer Alexis !

It is particularly clear, and I will implement what you are suggesting. I will write my own class inheriting from ContextTagger and have another look at the chapter on chunkers in the NLTK book.

Thank you again,

Romain