Filtering bigrams by POS-tag


Louis Corbel

Apr 26, 2016, 7:05:34 AM
to nltk-users
Hi guys,

I'm trying to filter the bigrams in my output by the POS tags of the words they contain. To be clear, I would like to keep only bigrams composed of a noun ('NN') and an adjective ('JJ').

Here is the relevant part of the code:

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_ngram_filter(lambda w1, w2: ???)

Here, tokens is a list of POS-tagged and lemmatized words. I don't know what to write in the '???' part to achieve what I described above.

If you need a more detailed explanation or other parts of my code to understand the problem, don't hesitate to ask.

Thanks,
Louis

Alexis

May 1, 2016, 10:10:08 AM
to nltk-...@googlegroups.com
If your "words" are actually (word, tag) tuples, as in the output of `pos_tag()`, your filter function can look like this:

def filterNNJJ(pair1, pair2):
    (w1, t1), (w2, t2) = pair1, pair2
    return not ((t1 == "NN") and (t2 == "JJ"))

Or less readably as a lambda function:

lambda w1, w2: not ((w1[1] == "NN") and (w2[1] == "JJ"))
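
To check the negated test before handing it to the finder, here is a minimal sketch that applies the same function by hand. The sample tokens and their tags are made up for illustration, and the function is renamed `filter_nn_jj` here; no finder is involved, the list comprehension only imitates "drop every pair for which the filter returns True".

```python
def filter_nn_jj(pair1, pair2):
    # Each argument is a (word, tag) tuple; unpack both.
    (w1, t1), (w2, t2) = pair1, pair2
    # True means "remove this bigram", so negate the pattern we want to keep.
    return not (t1 == "NN" and t2 == "JJ")

# Hand-tagged sample data (an assumption, not real pos_tag() output).
tagged = [("court", "NN"), ("martial", "JJ"), ("was", "VBD"), ("held", "VBN")]

# Adjacent pairs, as a bigram finder would see them.
bigrams = list(zip(tagged, tagged[1:]))

# Keep only the pairs the filter does NOT reject.
kept = [bg for bg in bigrams if not filter_nn_jj(*bg)]
print(kept)
```

Only the noun-then-adjective pair should survive; an adjective followed by a noun would be rejected, because this test is order-sensitive.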

According to the documentation, the filter must return True for the pairs that you want to remove, hence the test is negated. In any event, it should take as many arguments as your ngram size (2 by default), and it can unpack them as necessary. If this doesn't work right away, have your function print out "w1" and "w2", and take it from there.
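
Putting it together, a sketch of the whole pipeline might look like the following. It assumes the tokens are already (word, tag) tuples; the sample sentence is hand-tagged for illustration. The helper `not_noun_adj` is a hypothetical variant that accepts the noun and the adjective in either order, in case that is what "composed of a noun and an adjective" means for your data.

```python
from nltk.collocations import BigramCollocationFinder

# Hand-tagged sample (an assumption); in practice this would come from pos_tag().
tagged = [
    ("the", "DT"), ("quick", "JJ"), ("fox", "NN"), ("saw", "VBD"),
    ("a", "DT"), ("court", "NN"), ("martial", "JJ"),
]

def not_noun_adj(pair1, pair2):
    # Hypothetical helper: returns True (= remove) unless the two tags are
    # exactly one NN and one JJ, in either order.
    return {pair1[1], pair2[1]} != {"NN", "JJ"}

# The finder treats each (word, tag) tuple as one "word".
finder = BigramCollocationFinder.from_words(tagged)
finder.apply_ngram_filter(not_noun_adj)

# ngram_fd now holds only the bigrams that passed the filter.
for (w1, t1), (w2, t2) in sorted(finder.ngram_fd):
    print(w1, t1, w2, t2)
```

Note that comparing against "NN" and "JJ" exactly skips the other Penn Treebank noun and adjective tags (NNS, NNP, NNPS, JJR, JJS). If those should count too, a test such as `t1.startswith("NN")` may be closer to what you want.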

Alexis

Dr. Alexis Dimitriadis | Assistant Professor and Senior Research Fellow | Utrecht Institute of Linguistics OTS | Utrecht University | Trans 10, 3512 JK Utrecht, room 2.33 | +31 30 253 65 68 | a.dimi...@uu.nl | www.hum.uu.nl/medewerkers/a.dimitriadis
