pos tagging not too good with nltk

Akshay Bhatt

Apr 4, 2013, 3:05:20 PM

to nltk-...@googlegroups.com

Hello,
I am trying to build an application that makes heavy use of POS tagging, but the POS tagging capabilities of NLTK do not seem up to the mark to me. For instance:
import nltk

text = "Obama delivers his first speech."

# Split the text into sentences, then tokenize and tag each one.
for s in nltk.sent_tokenize(text):
    tokens = nltk.word_tokenize(s)
    print(nltk.pos_tag(tokens))

Result:

akshayy@ubuntu:~/summ$ python nn1.py
[('Obama', 'NNP'), ('delivers', 'NNS'), ('his', 'PRP$'), ('first', 'JJ'), ('speech', 'NN'), ('.', '.')]

This is not as good as Stanford NLP ("delivers" is tagged as a plural noun, NNS, rather than a verb). I have already invested a lot of time and effort in the Python environment, so moving to Java does not seem practical to me. I am also more inclined towards C than Java, so Python is the closer fit for future use as well. Hence my first question is:
1) Do you really think Java is better for NLP tasks, given the large set of tools available for it? I have done a fair amount of research and found that Java has a bigger community dedicated to NLP tasks.

2) Alternatively, is there a good approach by which I can create my own sentence parser and then a POS tagger, so that I have the flexibility to modify them for any language I use in the future?

Leon Derczynski

Apr 4, 2013, 3:29:06 PM

to nltk-...@googlegroups.com

Hi Akshay,

The Stanford Tagger represents man-decades of research and work on PoS tagging, and is certainly the best for many genres. Of course, part of the performance depends on how you use and train it. In contrast, effort in NLTK has gone into creating an intuitive, flexible, easy-to-learn NLP framework connected to powerful tools. One of those tools is the Stanford tagger:

nltk.tag.stanford.StanfordTagger

1) Many languages have different ranges of tools available. This reduces the time it takes for developers of any discipline to get going. I don't think there's any way of comparing languages in this regard. If you're more comfortable with Java, go with Java. Bear in mind, though, that it will always be the case that there is some tool for some task which is in a different language to the one that you choose.

2) NLTK also comes with a selection of trainable taggers, and you might (if you have access to the data) like to train one on a big dataset, which may give you a noticeable difference in performance. Outside of NLTK but still in Python, Anders Søgaard's SCNN performs well for newswire PoS tagging and should be reasonably easy to integrate, though it relies on an external package (Orange).
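As a rough illustration of the trainable taggers mentioned in point 2, here is a minimal sketch that chains NLTK's DefaultTagger, UnigramTagger and BigramTagger through the backoff argument. The two training sentences are hand-written toy data (an assumption for illustration only); in practice you would pass in a large tagged corpus. The helper name build_backoff_tagger is hypothetical, not part of NLTK.

```python
from nltk.tag import BigramTagger, DefaultTagger, UnigramTagger

# Toy training data, hand-tagged for illustration only.
# A real tagger needs thousands of tagged sentences.
TOY_TRAIN = [
    [("Obama", "NNP"), ("delivers", "VBZ"), ("his", "PRP$"),
     ("first", "JJ"), ("speech", "NN"), (".", ".")],
    [("She", "PRP"), ("delivers", "VBZ"), ("the", "DT"),
     ("mail", "NN"), (".", ".")],
]


def build_backoff_tagger(tagged_sents, default_tag="NN"):
    """Bigram tagger that backs off to unigram, then to a fixed default tag."""
    t0 = DefaultTagger(default_tag)
    t1 = UnigramTagger(tagged_sents, backoff=t0)
    return BigramTagger(tagged_sents, backoff=t1)


tagger = build_backoff_tagger(TOY_TRAIN)
print(tagger.tag(["Obama", "delivers", "a", "lecture", "."]))
```

Words seen in training keep their learned tag, and unseen words fall through to the default tag, so accuracy depends almost entirely on the size and genre of the training data.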

All the best,

Leon

--
Leon R A Derczynski
Research Associate, NLP Group

Department of Computer Science
University of Sheffield
Regent Court, 211 Portobello
Sheffield S1 4DP, UK

+45 5157 4948
http://www.dcs.shef.ac.uk/~leon/

Akshay Bhatt

Apr 4, 2013, 3:38:22 PM

to nltk-...@googlegroups.com

Hi Leon,

Thanks for the reply. The reason I am fond of Python is its amazing simplicity: it lets me focus entirely on the algorithm rather than the data flow. But that comes at a price. I think we have not seen any big data tools around for Python (I do have to do something with MongoDB, but still). As for POS tagging, considering future needs and the number of spoken languages, I think it would be a good idea for me to start off with something of my own. Could you please give me some pointers on where to begin with POS tagging (maybe from the sentence tree)?

I will definitely try SCNN, but I have to work on a large data set, which makes it difficult to use Orange. I believe there should be something in Python comparable to big data tools like Hadoop, HBase, etc. for data analytics.

Leon Derczynski

Apr 4, 2013, 5:14:27 PM

to nltk-...@googlegroups.com

Hi,

I think if you want high tagging accuracy, you'll have a slow tagger wherever you go (esp. w/ the Stanford tagger). If you want top speed, you might try a tagger using cuHMM and a CUDA Python wrapper, which is a computationally cheap way of getting through TB/PB of text in reasonable amounts of time; but accuracy's lower.

The NLTK book has a chapter dedicated to the basics (and not-so-basics) of PoS tagging, which should enable you to construct your own tagger: http://nltk.org/book/ch05.html . Typically parsing is easier once you have a correctly PoS-tagged sentence, rather than the other way round.
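To give a feel for what constructing your own tagger can look like, here is a minimal pure-Python sketch of a most-frequent-tag baseline with a suffix fallback for unseen words. It is not how NLTK or the Stanford tagger work internally; the training sentence, the suffix rules and the function names are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical suffix rules for unseen words, checked in order.
SUFFIX_RULES = [("ing", "VBG"), ("ed", "VBD"), ("ly", "RB"), ("s", "NNS")]


def train_unigram(tagged_sents):
    """Map each lowercased word to the tag it carries most often."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_tokens(tokens, model, default="NN"):
    """Tag known words from the model; guess unseen ones from their suffix."""
    tagged = []
    for tok in tokens:
        guess = model.get(tok.lower())
        if guess is None:
            guess = next(
                (t for suffix, t in SUFFIX_RULES if tok.lower().endswith(suffix)),
                default,
            )
        tagged.append((tok, guess))
    return tagged


# Toy training data, hand-tagged for illustration only.
toy_train = [
    [("Obama", "NNP"), ("delivers", "VBZ"), ("his", "PRP$"),
     ("first", "JJ"), ("speech", "NN"), (".", ".")],
]
model = train_unigram(toy_train)
print(tag_tokens(["Obama", "delivers", "speeches", "quickly", "."], model))
```

Note that with an empty model the "s" rule would label "delivers" as NNS, which is the kind of mistake a tagger makes when it has no evidence from context or training data. Adding features such as the previous tag is the usual next step.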

There's probably a Python interface (or two) to Hadoop, Storm et al. somewhere, once you have solved your problem for the single-document case.