Problems with POS-tagging


Edward Grefenstette

Jun 3, 2009, 11:28:02 AM
to nltk-users
I'm writing a module which needs, at one point, to make use of a basic
POS tagger. I obviously went straight to chapter 5 of the book and
looked at the examples there for some inspiration, but discovered I
was having the following problem.

The first example in chapter 5 says I should expect the following
output for this input:
========================================================
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]
========================================================

However what I get, with a warning, is:
========================================================
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', '$'), ('now', 'VBP'), ('for', 'VBN'), ('something', 'PDT'),
('completely', '.'), ('different', 'CD')]
========================================================
which is obviously wrong.

I also get the weird warning:
========================================================
/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/nltk-0.9.9-py2.6.egg/nltk/data.py:152: DeprecationWarning: object.__init__() takes no parameters
  str.__init__(self, path)
========================================================
the first time I run the above. Anyone know what this is, and how to
turn it off?
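
One way to silence such a warning is to install a filter before the offending code runs. This is a sketch only, and it assumes the warning is raised through Python's standard `warnings` machinery, as a `DeprecationWarning` normally is. The helper `noisy_call` is hypothetical and merely stands in for the NLTK call that triggers the warning.

```python
import warnings


def noisy_call():
    """Hypothetical stand-in for a library call that emits a DeprecationWarning."""
    warnings.warn("object.__init__() takes no parameters", DeprecationWarning)
    return "done"


def call_quietly(func):
    """Run func with DeprecationWarning suppressed, restoring the filters afterwards."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        return func()


result = call_quietly(noisy_call)
```

Starting the interpreter with `python -W ignore::DeprecationWarning` has a similar effect for a whole session. Either way, this only hides the message; it does not change the behaviour the warning is about.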

Additionally, I get similar mis-taggings when I run the next example,
which SHOULD produce:
========================================================
>>> text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
>>> nltk.pos_tag(text)
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'),
('us', 'PRP'),
('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'),
('permit', 'NN')]
========================================================
but instead produces, when run on my machine:
========================================================
>>> text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
>>> nltk.pos_tag(text)
[('They', '$'), ('refuse', 'NNP'), ('to', 'JJ'), ('permit', 'CD'),
('us', '``'), ('to', 'JJ'), ('obtain', 'CD'), ('the', '``'),
('refuse', '-RRB-'), ('permit', 'CD')]
========================================================

Does anyone have any idea what's going on, and how to troubleshoot
this problem?

Best,
Edward

manas kashyop

Jun 3, 2009, 11:47:00 AM
to nltk-...@googlegroups.com
Can you please give me detailed steps for installing NLTK 0.9.9 on
Fedora 10? I have Python 2.5.2 installed. Please also tell me the
steps to install the NLTK data. If you have any ebook or material on
how to use the toolkit, please mail it to me. I need it urgently for
my summer project at IITG. With hope.

Edward Loper

Jun 3, 2009, 1:09:14 PM
to nltk-...@googlegroups.com
On Wed, Jun 3, 2009 at 9:28 AM, Edward Grefenstette <egr...@gmail.com> wrote:
> I'm writing a module which needs, at one point, to make use of a basic
> POS tagger. I obviously went straight to chapter 5 of the book and
> looked at the examples there for some inspiration, but discovered I
> was having the following problem. [...]

Check to make sure that you have the current versions of the model
file that's used by the pos tagger, using the following code:

>>> import nltk
>>> nltk.download('maxent_treebank_pos_tagger')
[nltk_data] Downloading package 'maxent_treebank_pos_tagger' to
[nltk_data] /Users/edloper/nltk_data...
[nltk_data] Package maxent_treebank_pos_tagger is already up-to-
[nltk_data] date!

(You can make sure all corpora & models are up-to-date by running
"nltk.downloader.update()")

If this model file is up-to-date, then please let me know what version
of nltk you're using, and I'll try to figure out what's going on.

-Edward

Edward Loper

Jun 3, 2009, 1:13:22 PM
to nltk-...@googlegroups.com
On Wed, Jun 3, 2009 at 9:47 AM, manas kashyop <kashyo...@gmail.com> wrote:
> can you please give me the details steps for installing  nltk0.9.9 [...]

See the nltk webpage (http://www.nltk.org/), in particular the linux
install instructions at:

http://www.nltk.org/download

Once nltk is installed, you can install corpora & models using the
downloader tool:

>>> nltk.download()

The book that describes the toolkit is at:

http://www.nltk.org/book

-Edward

Edward Grefenstette

Jun 3, 2009, 1:16:20 PM
to nltk-users
I've just checked. It's all up to date.
I got the same weird error again while checking:
========================================================
/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/nltk-0.9.9-py2.6.egg/nltk/__init__.py:588: DeprecationWarning: object.__new__() takes no parameters
========================================================
I ran nltk.downloader.update() as well, and reloaded everything, and
still get the same mis-tagged output:
========================================================
[('And', '$'), ('now', 'VBP'), ('for', 'VBN'), ('something', 'PDT'),
('completely', '.'), ('different', 'CD')]
========================================================

I'm using nltk 0.9.9 for python 2.6, and running python 2.6.2. Let me
know if you need any more info.

Best,
Edward


Edward Loper

Jun 3, 2009, 1:39:39 PM
to nltk-...@googlegroups.com
On Wed, Jun 3, 2009 at 11:16 AM, Edward Grefenstette <egr...@gmail.com> wrote:
>
> I've just checked. It's all up to date.
> I got the same weird error again while checking:
> ========================================================
> /Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-
> packages/nltk-0.9.9-py2.6.egg/nltk/__init__.py:588:
> DeprecationWarning: object.__new__() takes no parameters
> ========================================================

This warning seems very odd, since nltk/__init__.py should only be 144
lines long. Can you take a look at the __init__.py file inside the
egg, to see if it looks right? (If I remember correctly, eggs are
just zip files, so if you copy nltk-0.9.9-py2.6.egg to
/tmp/nltk-egg.zip then you should be able to unzip that and look
inside.)
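
The same inspection can be sketched without unzipping by hand, using Python's standard `zipfile` module. This assumes the egg really is a zipped file rather than an unpacked directory. The function name `count_lines_in_archive` is made up for illustration, and the small in-memory archive below only stands in for the real egg.

```python
import io
import zipfile


def count_lines_in_archive(archive, member):
    """Return the number of lines in `member` inside a zip archive.

    `archive` may be a filesystem path or an open binary file object.
    """
    with zipfile.ZipFile(archive) as zf:
        data = zf.read(member)
    return len(data.splitlines())


# Build a tiny in-memory archive so the sketch runs without a real egg.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("nltk/__init__.py", "import os\nimport sys\n")
demo.seek(0)

line_count = count_lines_in_archive(demo, "nltk/__init__.py")
```

Against a real installation one would pass the path of the `.egg` file instead of `demo`.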

Here's a few more things to try:

>>> tagger = nltk.data.load(nltk.tag._POS_TAGGER)
>>> print tagger
<ClassifierBasedTagger: <ConditionalExponentialClassifier: 46 labels,
203123 features>>

>>> tagger.feature_detector('This is a Test'.split(), 3, 'DT VBZ DT'.split())
{'word': 'Test', 'prevprevtag+word': 'VBZ+test', 'prevtag+word':
'DT+test', 'prevword+word': 'a+test', 'prevtag': 'DT', 'prevword':
'a', 'shape': 'upcase', 'prevprevtag': 'VBZ', 'word.lower': 'test',
'prevprevword': 'is', 'suffix2': 'st', 'suffix3': 'est', 'suffix1':
't'}

Let me know if you get the same or different values from running
these commands.

-Edward

Edward Grefenstette

Jun 3, 2009, 1:51:49 PM
to nltk-users
The file
'/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/nltk-0.9.9-py2.6.egg/nltk/__init__.py'
has only 144 lines, as you say. Nothing weird there, as far as I can
tell. Tell me if you want me to post it. The MD5 signature for my
version of the file is f2fb47c852064022900394411814aef6.
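
For anyone wanting to compare their own copy, an MD5 signature like the one above can be computed with the standard `hashlib` module. This is a minimal sketch; the function name `md5_of_file` and the chunk size are illustrative choices, not anything taken from NLTK.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hexadecimal MD5 digest of the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two copies of a file with the same digest are, for this kind of troubleshooting, safe to treat as identical.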

Upon running 'tagger = nltk.data.load(nltk.tag._POS_TAGGER)' I get the
warning:
========================================================
/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/nltk-0.9.9-py2.6.egg/nltk/data.py:152: DeprecationWarning: object.__init__() takes no parameters
  str.__init__(self, path)
========================================================

Following that, running 'print tagger' I get same as you:
========================================================
<ClassifierBasedTagger: <ConditionalExponentialClassifier: 46 labels,
203123 features>>
========================================================

Ditto when I run 'tagger.feature_detector('This is a Test'.split(), 3,
'DT VBZ DT'.split())':
========================================================
{'prevprevword': 'is', 'prevword+word': 'a+test', 'prevtag': 'DT',
'prevword': 'a', 'shape': 'upcase', 'prevprevtag+word': 'VBZ+test',
'word.lower': 'test', 'prevtag+word': 'DT+test', 'word': 'Test',
'prevprevtag': 'VBZ', 'suffix2': 'st', 'suffix3': 'est', 'suffix1':
't'}
========================================================

Aside from the weird error messages, it's all the same as your output.

Best,
Edward


Edward Loper

Jun 3, 2009, 2:16:31 PM
to nltk-...@googlegroups.com
On Wed, Jun 3, 2009 at 11:51 AM, Edward Grefenstette <egr...@gmail.com> wrote:
> File '/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/
> site-packages/nltk-0.9.9-py2.6.egg/nltk/__init__.py:588:' has only 144
> lines, as you say. Nothing weird there, as far as I can tell. Tell me
> if you want me to post it. The MD5 signature for my version of the
> file is f2fb47c852064022900394411814aef6.

Ok, it looks like the warning about object.__new__ is unrelated. The
error actually occurs on line 588 of internals.py, but shouldn't break
anything. The warning is generated in py2.6 but not in py2.5, which
is why I hadn't noticed it before. (I just got around to installing
py2.6 now.)

When I ran the test case on my machine with py2.6, I got the warning
you described, but pos_tag() returns the correct tags for me.

So it looks like your feature detector is giving back the right thing.
Let's next check if the feature encoding and weights appear to agree
with what I have (ignoring the object.__new__ warning):

>>> import nltk
>>> tagger = nltk.data.load(nltk.tag._POS_TAGGER)
>>> features = tagger.feature_detector('This is a Test'.split(), 3, 'DT VBZ DT'.split())
>>> featvec = tagger.classifier()._encoding.encode(features, 'DT')
>>> print featvec
[(4466, 1), (13809, 1), (483, 1), (4465, 1), (1948, 1), (203089, 1)]

>>> for (featid, val) in featvec:
...     print featid, tagger.classifier().weights()[featid]
4466 -0.15993160079
13809 -0.902055195151
483 1.58545078808
4465 0.651274565212
1948 -0.999881850667
203089 4.80474088694

-Edward

peter ljunglöf

Jun 3, 2009, 3:22:34 PM
to nltk-...@googlegroups.com
The problem is in data.py, class FileSystemPathPointer, line 152. I've
submitted a new bug report, issue #390:

http://code.google.com/p/nltk/issues/detail?id=390

/Peter
--------- ----- --- -- -- - - - - - - - - -
peter ljunglöf (http://www.ling.gu.se/~peb)



Edward Loper

Jun 3, 2009, 3:30:17 PM
to nltk-...@googlegroups.com
On Wed, Jun 3, 2009 at 1:22 PM, peter ljunglöf
<peter.l...@heatherleaf.se> wrote:
> The problem is in data.py, class FileSystemPathPointer, line 152. I've
> submitted a new bug report, issue #390:
>
>        http://code.google.com/p/nltk/issues/detail?id=390

Thanks. This bug (the one that causes the warning) actually shows up
in at least 3 places in the nltk code base -- searching for
"object.__new__" shows up some of them.

Note that this bug is (I believe) unrelated to whatever's causing
Edward Grefenstette to get strange results from nltk.pos_tag().

-Edward

Edward Grefenstette

Jun 3, 2009, 3:34:02 PM
to nltk-users
Anyone have a suggested fix? Should I simply reinstall nltk, or should
we attempt to figure out where this is coming from and submit a bug
report?
For reference, I'm running Mac OS 10.5.7 and installed nltk using
python's setuptools (i.e. 'sudo easy_install nltk').

Best,
Edward


Edward Grefenstette

Jun 3, 2009, 3:41:04 PM
to nltk-users
Sorry Ed, missed one of your emails.

When I print featvec, I get the same as you ([(4466, 1), (13809, 1),
(483, 1), (4465, 1), (1948, 1), (203089, 1)] )

however when I run the for loop, I get different output for the
weights:
=========================
4466 3.9462196265e+243
13809 -4.29874583705e+190
483 -7.89884975148e-142
4465 -8.68414277771e-277
1948 -2.10156554808e+141
203089 -2.85989234153e-69
=========================

Best,
Edward


Edward Loper

Jun 3, 2009, 4:25:22 PM
to nltk-...@googlegroups.com
On Wed, Jun 3, 2009 at 1:41 PM, Edward Grefenstette <egr...@gmail.com> wrote:
> however when I run the for loop, I get different output for the
> weights:
> =========================
> 4466 3.9462196265e+243
> 13809 -4.29874583705e+190
> 483 -7.89884975148e-142
> 4465 -8.68414277771e-277
> 1948 -2.10156554808e+141
> 203089 -2.85989234153e-69
> =========================

Ok, well that's the culprit then. The weights are saved as a pickled
numpy array -- I wonder if there's some incompatibility in how
different versions of numpy store pickled arrays. What version of
numpy are you using?

>>> import numpy
>>> numpy.__version__
'1.3.0'

Also, just to reconfirm that this is the issue, let's introspect the
weight vector:

>>> import nltk
>>> tagger = nltk.data.load(nltk.tag._POS_TAGGER)

>>> print tagger.classifier().weights()
[ 2.95951185 2.95949328 2.959452 ..., 0.25529419 0.29965264
3.9481696 ]
>>> print type(tagger.classifier().weights())
<type 'numpy.ndarray'>
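
A rough plausibility check on the loaded weights can be sketched in plain Python. The threshold `max_magnitude` is an illustrative guess and not a value taken from NLTK, and the function name is made up. The sample values are copied from the outputs posted in this thread.

```python
import math


def weights_look_plausible(weights, max_magnitude=1e3):
    """Return True if every weight is finite and at most max_magnitude in absolute value.

    max_magnitude is an illustrative threshold, not a value taken from NLTK.
    """
    return all(
        math.isfinite(w) and abs(w) <= max_magnitude
        for w in weights
    )


# Values as printed on a working installation in this thread.
good = [-0.15993160079, -0.902055195151, 1.58545078808]
# Values as printed on the installation that mis-tagged.
bad = [3.9462196265e+243, -4.29874583705e+190, -7.89884975148e-142]

good_ok = weights_look_plausible(good)
bad_ok = weights_look_plausible(bad)
```

The check only catches absurdly large or non-finite values. Corrupted weights that happen to be tiny, such as the `e-142` entry, would pass on their own, so a failed check is a strong hint while a passed check proves little.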

-Edward

Edward Grefenstette

Jun 3, 2009, 8:47:57 PM
to nltk-users
I'm running numpy 1.2.1. Can I get 1.3.0 for python 2.6? Also, would
this be what's causing the fault?

Running print tagger.classifier().weights() gets me substantially
different results from yours:
[ -9.14654063e+055 6.75500959e-167 9.98469618e+021 ...,
-3.63519046e-225 2.97669861e-118 -3.38238123e+170]

The type of the weights is still <type 'numpy.ndarray'>, though.

Thanks for taking so much time on this issue, by the way.

Best,
Edward



Edward Grefenstette

Jun 4, 2009, 7:13:08 AM
to nltk-users
SOLUTION: If anyone stumbles across the same problem as discussed in
this thread, check your numpy version. The POS-taggers worked fine
after upgrading numpy from 1.2.1 to 1.3.0. Thanks to Edward Loper for
helping me out.
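
A version check along these lines can be sketched as below. The helper names are hypothetical, and the parsing only handles purely numeric versions such as '1.2.1'; a string like '1.3.0rc1' would raise a ValueError. The thread only establishes that 1.2.1 failed and 1.3.0 worked with this model file, so the minimum used here is an assumption drawn from that single report.

```python
def version_tuple(version):
    """Turn a dotted version string such as '1.2.1' into a tuple of ints."""
    return tuple(int(part) for part in version.split(".")[:3])


def is_at_least(version, minimum):
    """Return True if `version` is the same as or newer than `minimum`."""
    return version_tuple(version) >= version_tuple(minimum)


# The two numpy versions mentioned in the thread.
old_ok = is_at_least("1.2.1", "1.3.0")
new_ok = is_at_least("1.3.0", "1.3.0")
```

In a real session the first argument would be `numpy.__version__`.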

Best,
Edward

manas kashyop

Jun 12, 2009, 7:22:52 AM
to nltk-...@googlegroups.com
I am using nltk-0.9.9 and I am trying to execute the following section of code:

>>> import nltk
>>> text=nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)

but I am getting this error:
----------------------------------------------------------------------------------------------------------------------
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.5/site-packages/nltk/tag/__init__.py", line
62, in pos_tag
tagger = nltk.data.load(_POS_TAGGER)
File "/usr/lib/python2.5/site-packages/nltk/data.py", line 587, in load
resource_val = pickle.load(_open(resource_url))
File "/usr/lib/python2.5/site-packages/nltk/data.py", line 666, in _open
return find(path).open()
File "/usr/lib/python2.5/site-packages/nltk/data.py", line 448, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource 'taggers/maxent_treebank_pos_tagger/english.pickle' not
found. Please use the NLTK Downloader to obtain the resource:
>>> nltk.download().
Searched in:
- '/home/manas/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
**********************************************************************
---------------------------------------------------------------------------------------------------------------------------------------
I access the net through a proxy, so I cannot use nltk.download().
Is there any solution to the above problem? Hoping for a reply, and
many thanks in advance.

manas kashyop

Jun 13, 2009, 9:21:37 AM
to nltk-...@googlegroups.com
I am trying to execute the following section of code in nltk:


>>> import nltk
>>> from nltk.corpus import indian
>>> nltk.corpus.indian.tagged_words()


and the output I am getting is:
------------------------------------------------------------------------------------------------------------------------------------------
[('\xe0\xa6\xae\xe0\xa6\xb9\xe0\xa6\xbf\xe0\xa6\xb7\xe0\xa7\x87\xe0\xa6\xb0',
'NN'), ('\xe0\xa6\xb8\xe0\xa6\xa8\xe0\xa7\x8d\xe0\xa6\xa4\xe0\xa6\xbe\xe0\xa6\xa8',
'NN'), ...]
---------------------------------------------------------------------------------------------------------------------------------------------
which is in hexadecimal. How can I get the output in human-readable
form? I am using NLTK 0.9.9 on Fedora 10. Please reply. Many thanks in
advance.

Edward Grefenstette

Jun 14, 2009, 8:22:13 AM
to nltk-users
On Jun 12, 12:22 pm, manas kashyop <kashyopma...@gmail.com> wrote:
> i am using nltk-0.9.9 and i am trying to execute following section of code
As far as I know, the nltk.download function uses standard ports (e.g.
80?). If you can post to this board and see web pages, you can most
likely use it.
If that fails, download the corpora manually
(http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml) and extract
them to ~/nltk_data
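
After a manual extraction, it may help to confirm that the file landed where NLTK looks for it. The sketch below uses only the standard library. The directory list and resource path are copied from the LookupError quoted earlier in the thread, with the home directory generalised to `~`, and the function name `find_resource` is made up for illustration.

```python
import os


def find_resource(relative_path, search_dirs):
    """Return the first existing path for relative_path under search_dirs, or None."""
    for base in search_dirs:
        candidate = os.path.join(os.path.expanduser(base), relative_path)
        if os.path.exists(candidate):
            return candidate
    return None


# Directories taken from the LookupError message in this thread.
SEARCH_DIRS = [
    "~/nltk_data",
    "/usr/share/nltk_data",
    "/usr/local/share/nltk_data",
    "/usr/lib/nltk_data",
    "/usr/local/lib/nltk_data",
]
RESOURCE = "taggers/maxent_treebank_pos_tagger/english.pickle"

# None means the file is not in any of the searched locations.
location = find_resource(RESOURCE, SEARCH_DIRS)
```

If `location` is None, the archive was probably extracted one directory level too high or too low.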
As for the escaped output from the Indian corpus: use print. Non-ASCII
characters are shown as escaped hex when Python displays a string, but
appear as readable characters when you use print or write to a file.
For example, try:
>>> print '\xe0\xa6\xae\xe0\xa6\xb9\xe0\xa6\xbf\xe0\xa6\xb7\xe0\xa7\x87\xe0\xa6\xb0'
If Indian characters don't show up, make sure your terminal has
support for Indian fonts.
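
In Python 3, where byte strings and text are separate types, the same idea can be sketched as an explicit decode. This assumes the corpus bytes are UTF-8 encoded, which the escape sequences above are consistent with; the variable names are illustrative.

```python
# The first token from the output above, written as a bytes literal.
raw = b'\xe0\xa6\xae\xe0\xa6\xb9\xe0\xa6\xbf\xe0\xa6\xb7\xe0\xa7\x87\xe0\xa6\xb0'

# Assumption: the corpus text is UTF-8 encoded.
word = raw.decode('utf-8')

# Code points of the decoded characters, handy when the terminal
# cannot render the script itself.
code_points = ['U+%04X' % ord(ch) for ch in word]
```

Calling `print(word)` then shows the readable text, provided the terminal encoding and fonts support Bengali script.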

Good luck!

Best,
Edward

manas kashyop

Jun 15, 2009, 1:05:41 AM
to nltk-...@googlegroups.com
Thanks a lot, sir, your solution is working perfectly.