Check that you have the current version of the model file
used by the POS tagger, using the following code:
>>> import nltk
>>> nltk.download('maxent_treebank_pos_tagger')
[nltk_data] Downloading package 'maxent_treebank_pos_tagger' to
[nltk_data] /Users/edloper/nltk_data...
[nltk_data] Package maxent_treebank_pos_tagger is already up-to-
[nltk_data] date!
(You can make sure all corpora & models are up-to-date by running
"nltk.downloader.update()")
If this model file is up-to-date, then please let me know what version
of nltk you're using, and I'll try to figure out what's going on.
-Edward
See the nltk webpage (http://www.nltk.org/), in particular the linux
install instructions at:
Once nltk is installed, you can install corpora & models using the
downloader tool:
>>> nltk.download()
The book that describes the toolkit is at:
-Edward
This warning seems very odd, since nltk/__init__.py should only be 144
lines long. Can you take a look at the __init__.py file inside the
egg, to see if it looks right? (If I remember correctly, eggs are
just zip files, so if you copy nltk-0.9.9-py2.6.egg to
/tmp/nltk-egg.zip then you should be able to unzip that and look
inside.)
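If unzipping by hand is awkward, the same check can be done from Python.
This is only a sketch: it assumes the egg has been copied to
/tmp/nltk-egg.zip as suggested above, and the member name
'nltk/__init__.py' is my guess at the layout inside the egg.

```python
import os
import zipfile

def count_lines_in_egg(egg_path, member='nltk/__init__.py'):
    """Return the number of lines in `member` inside the egg (a zip archive).

    The default member name is an assumption about the egg's layout.
    """
    archive = zipfile.ZipFile(egg_path)
    try:
        data = archive.read(member)
    finally:
        archive.close()
    return len(data.splitlines())

# Path of the copy suggested above; skipped quietly if it does not exist.
egg_copy = '/tmp/nltk-egg.zip'
if os.path.exists(egg_copy):
    print(count_lines_in_egg(egg_copy))
```

If the count is far from 144, the file inside the egg is probably not the
one that was shipped.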
Here are a few more things to try:
>>> tagger = nltk.data.load(nltk.tag._POS_TAGGER)
>>> print tagger
<ClassifierBasedTagger: <ConditionalExponentialClassifier: 46 labels,
203123 features>>
>>> tagger.feature_detector('This is a Test'.split(), 3, 'DT VBZ DT'.split())
{'word': 'Test', 'prevprevtag+word': 'VBZ+test', 'prevtag+word':
'DT+test', 'prevword+word': 'a+test', 'prevtag': 'DT', 'prevword':
'a', 'shape': 'upcase', 'prevprevtag': 'VBZ', 'word.lower': 'test',
'prevprevword': 'is', 'suffix2': 'st', 'suffix3': 'est', 'suffix1':
't'}
Let me know if you get the same or different values from running
these commands.
-Edward
Ok, it looks like the warning about object.__new__ is unrelated. The
error actually occurs on line 588 of internals.py, but shouldn't break
anything. The warning is generated in py2.6 but not in py2.5, which
is why I hadn't noticed it before. (I just got around to installing
py2.6 now.)
When I ran the test case on my machine with py2.6, I got the warning
you described, but pos_tag() returns the correct tags for me.
So it looks like your feature detector is giving back the right thing.
Let's next check if the feature encoding and weights appear to agree
with what I have (ignoring the object.__new__ warning):
>>> import nltk
>>> tagger = nltk.data.load(nltk.tag._POS_TAGGER)
>>> features = tagger.feature_detector('This is a Test'.split(), 3, 'DT VBZ DT'.split())
>>> featvec = tagger.classifier()._encoding.encode(features, 'DT')
>>> print featvec
[(4466, 1), (13809, 1), (483, 1), (4465, 1), (1948, 1), (203089, 1)]
>>> for (featid, val) in featvec:
...     print featid, tagger.classifier().weights()[featid]
4466 -0.15993160079
13809 -0.902055195151
483 1.58545078808
4465 0.651274565212
1948 -0.999881850667
203089 4.80474088694
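To make the comparison concrete: as far as I understand it, a maxent
classifier scores a label by summing weight * value over the encoded
features. The snippet below is only a sketch of that sum, using the
numbers printed above. It does not call NLTK, and the helper name
label_score is made up for illustration; whatever the classifier does
with the total afterwards is not shown here.

```python
# Encoded feature vector and weights copied from the session above.
featvec = [(4466, 1), (13809, 1), (483, 1), (4465, 1), (1948, 1), (203089, 1)]
weights = {
    4466: -0.15993160079,
    13809: -0.902055195151,
    483: 1.58545078808,
    4465: 0.651274565212,
    1948: -0.999881850667,
    203089: 4.80474088694,
}

def label_score(featvec, weights):
    """Sum weight * value over the active (feature id, value) pairs."""
    return sum(weights[featid] * val for (featid, val) in featvec)

score = label_score(featvec, weights)
print(score)
```

If any of your six weights differ from mine, this total will differ too,
which would explain different tags.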
-Edward
Thanks. This bug (the one that causes the warning) actually shows up
in at least 3 places in the nltk code base -- searching for
"object.__new__" turns up some of them.
Note that this bug is (I believe) unrelated to whatever's causing
Edward Grefenstette to get strange results from nltk.pos_tag().
-Edward
Ok, well that's the culprit then. The weights are saved as a pickled
NumPy array -- I wonder if there's some incompatibility in how
different versions of NumPy store arrays. What version of NumPy are
you using?
>>> import numpy
>>> numpy.__version__
'1.3.0'
Also, just to reconfirm that this is the issue, let's introspect the
weight vector:
>>> import nltk
>>> tagger = nltk.data.load(nltk.tag._POS_TAGGER)
>>> print tagger.classifier().weights()
[ 2.95951185 2.95949328 2.959452 ..., 0.25529419 0.29965264
3.9481696 ]
>>> print type(tagger.classifier().weights())
<type 'numpy.ndarray'>
-Edward
>>> import nltk
>>> text=nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
but I am getting the following error:
----------------------------------------------------------------------------------------------------------------------
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.5/site-packages/nltk/tag/__init__.py", line
62, in pos_tag
tagger = nltk.data.load(_POS_TAGGER)
File "/usr/lib/python2.5/site-packages/nltk/data.py", line 587, in load
resource_val = pickle.load(_open(resource_url))
File "/usr/lib/python2.5/site-packages/nltk/data.py", line 666, in _open
return find(path).open()
File "/usr/lib/python2.5/site-packages/nltk/data.py", line 448, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource 'taggers/maxent_treebank_pos_tagger/english.pickle' not
found. Please use the NLTK Downloader to obtain the resource:
>>> nltk.download().
Searched in:
- '/home/manas/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
**********************************************************************
---------------------------------------------------------------------------------------------------------------------------------------
I am connecting to the net through a proxy, so I cannot use the NLTK
downloader. Is there any solution to the above problem? Hoping for a
reply, and many thanks in advance.
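One thing that may help here: the LookupError lists the directories that
were searched, so if the package can be fetched on another machine and
unpacked by hand, the pickle only has to end up under one of them. The
sketch below just builds the expected file paths and reports the first
one that exists. The helper names are made up for illustration, and how
the package is obtained is not covered.

```python
import os

# Directories copied from the "Searched in:" list of the LookupError above.
SEARCH_DIRS = [
    '/home/manas/nltk_data',
    '/usr/share/nltk_data',
    '/usr/local/share/nltk_data',
    '/usr/lib/nltk_data',
    '/usr/local/lib/nltk_data',
]

# Resource name copied from the same error message.
RESOURCE = 'taggers/maxent_treebank_pos_tagger/english.pickle'

def candidate_paths(search_dirs, resource):
    """Return the full path the resource would have under each directory."""
    return [os.path.join(d, *resource.split('/')) for d in search_dirs]

def find_installed(search_dirs, resource):
    """Return the first candidate path that exists as a file, or None."""
    for path in candidate_paths(search_dirs, resource):
        if os.path.isfile(path):
            return path
    return None

print(find_installed(SEARCH_DIRS, RESOURCE))
```

If this prints None, the file is not in any of the places that were
searched.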
>>> import nltk
>>> from nltk.corpus import indian
>>> nltk.corpus.indian.tagged_words()
and the output I am getting is:
------------------------------------------------------------------------------------------------------------------------------------------
[('\xe0\xa6\xae\xe0\xa6\xb9\xe0\xa6\xbf\xe0\xa6\xb7\xe0\xa7\x87\xe0\xa6\xb0',
'NN'), ('\xe0\xa6\xb8\xe0\xa6\xa8\xe0\xa7\x8d\xe0\xa6\xa4\xe0\xa6\xbe\xe0\xa6\xa8',
'NN'), ...]
---------------------------------------------------------------------------------------------------------------------------------------------
which is in hexadecimal. How can I get the output in human-readable
form?
I am using nltk 0.9.9 on Fedora 10. Please reply. Many thanks in
advance.
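For what it's worth, those \x escapes look like the UTF-8 bytes of
Bengali text, shown in escaped form because the interpreter prints the
repr of each string. A minimal sketch of decoding them follows. It
assumes the words really are UTF-8; the helper name decode_tagged is
made up for illustration, and the sample data is copied from the output
above.

```python
def decode_tagged(tagged_words, encoding='utf-8'):
    """Decode each byte-string word to unicode text, leaving tags unchanged."""
    return [(word.decode(encoding), tag) for (word, tag) in tagged_words]

# The first two (word, tag) pairs from the output above, as byte strings.
sample = [
    (b'\xe0\xa6\xae\xe0\xa6\xb9\xe0\xa6\xbf\xe0\xa6\xb7\xe0\xa7\x87\xe0\xa6\xb0', 'NN'),
    (b'\xe0\xa6\xb8\xe0\xa6\xa8\xe0\xa7\x8d\xe0\xa6\xa4\xe0\xa6\xbe\xe0\xa6\xa8', 'NN'),
]

decoded = decode_tagged(sample)
```

Printing each decoded word on its own (rather than the whole list) should
show the Bengali characters, provided the terminal can display UTF-8.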