Dear All,
I have some problems loading the standard POS tagger in nltk 3.2 on Windows 10. This happened after upgrading to nltk 3.2 via conda today; everything worked a week ago with nltk 3.1. I think I have a small pointer to what is happening, but maybe I’m off: something might be mishandled in or before the split_resource_url function in nltk.data when looking for model paths on Windows.
I saw the old post "Tokenizer = nltk.data.load('C:\nltk_data\tokenizers\punkt\english.pickle') failed" from February 2015, but I think my problem is different.
Here is a small example to illustrate (the "mae" typo in the text is intentional):
text = "This is a text. And I think, I might have mae a little typo somewhere in here."
nltk.pos_tag([token for token in nltk.word_tokenize(text)])
---------------------------------------------------------------------------
URLError Traceback (most recent call last)
<ipython-input-43-97c95a83aed3> in <module>()
1 text = "This is a text. And I think, I might have mae a little typo somewhere in here."
----> 2 nltk.pos_tag((token for token in nltk.word_tokenize(text)))
C:\Anaconda3\lib\site-packages\nltk\tag\__init__.py in pos_tag(tokens, tagset)
108 :rtype: list(tuple(str, str))
109 """
--> 110 tagger = PerceptronTagger()
111 return _pos_tag(tokens, tagset, tagger)
112
C:\Anaconda3\lib\site-packages\nltk\tag\perceptron.py in __init__(self, load)
139 if load:
140 AP_MODEL_LOC = str(find('taggers/averaged_perceptron_tagger/'+PICKLE))
--> 141 self.load(AP_MODEL_LOC)
142
143 def tag(self, tokens):
C:\Anaconda3\lib\site-packages\nltk\tag\perceptron.py in load(self, loc)
207 '''
208
--> 209 self.model.weights, self.tagdict, self.classes = load(loc)
210 self.model.classes = self.classes
211
C:\Anaconda3\lib\site-packages\nltk\data.py in load(resource_url, format, cache, verbose, logic_parser, fstruct_reader, encoding)
799
800 # Load the resource.
--> 801 opened_resource = _open(resource_url)
802
803 if format == 'raw':
C:\Anaconda3\lib\site-packages\nltk\data.py in _open(resource_url)
922 return find(path_, ['']).open()
923 else:
--> 924 return urlopen(resource_url)
925
926 ######################################################################
C:\Anaconda3\lib\urllib\request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
160 else:
161 opener = _opener
--> 162 return opener.open(url, data, timeout)
163
164 def install_opener(opener):
C:\Anaconda3\lib\urllib\request.py in open(self, fullurl, data, timeout)
463 req = meth(req)
464
--> 465 response = self._open(req, data)
466
467 # post-process response
C:\Anaconda3\lib\urllib\request.py in _open(self, req, data)
486
487 return self._call_chain(self.handle_open, 'unknown',
--> 488 'unknown_open', req)
489
490 def error(self, proto, *args):
C:\Anaconda3\lib\urllib\request.py in _call_chain(self, chain, kind, meth_name, *args)
441 for handler in handlers:
442 func = getattr(handler, meth_name)
--> 443 result = func(*args)
444 if result is not None:
445 return result
C:\Anaconda3\lib\urllib\request.py in unknown_open(self, req)
1308 def unknown_open(self, req):
1309 type = req.type
-> 1310 raise URLError('unknown url type: %s' % type)
1311
1312 def parse_keqv_list(l):
URLError: <urlopen error unknown url type: c>
I copy/pasted parts of the source code of nltk.data into a notebook to figure out what is going on.
When I do:
from nltk.data import find as nfind
PICKLE = "averaged_perceptron_tagger.pickle"
AP_MODEL_LOC = str(nfind('taggers/averaged_perceptron_tagger/'+PICKLE))
then AP_MODEL_LOC equals
'C:\\Users\\schuett.BWL\\AppData\\Roaming\\nltk_data\\taggers\\averaged_perceptron_tagger\\averaged_perceptron_tagger.pickle'
which is the correct location of the file.
I searched my way through the load function in data.py until I found this part in the normalize_resource_url(resource_url) function (starting at line 170):
try:
    protocol, name = split_resource_url(resource_url)
except ValueError:
    # the resource url has no protocol, use the nltk protocol by default
    protocol = 'nltk'
    name = resource_url
If I set resource_url = AP_MODEL_LOC and print the results of the above code with print(protocol, ' | ', name), I get:
C | \Users\schuett.BWL\AppData\Roaming\nltk_data\taggers\averaged_perceptron_tagger\averaged_perceptron_tagger.pickle
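The behaviour is easy to reproduce in isolation. This is a minimal sketch of what I suspect is going on, using a simplified stand-in for split_resource_url (the real function may do more than split on the first colon):

```python
# Simplified stand-in for nltk.data.split_resource_url: splitting a
# resource URL on the first ':' treats a Windows drive letter as a
# protocol/scheme.
def split_on_first_colon(resource_url):
    protocol, _, name = resource_url.partition(':')
    return protocol, name

path = r'C:\Users\me\AppData\Roaming\nltk_data\taggers\x.pickle'
protocol, name = split_on_first_colon(path)
print(protocol)  # 'C' -- later handed to urlopen as an unknown URL scheme
```

That stray 'C' protocol would explain the URLError: <urlopen error unknown url type: c> above.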
Obviously, it splits the C away from the rest. If I shoehorn an ugly

if resource_url[:2] == "C:":
    resource_url = "nltk:" + resource_url

into the split_resource_url function in data.py at line 126, the first split at “:” sets the protocol to “nltk” and everything works again:
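My hard-coded "C:" check is obviously too narrow; a proper fix would need to recognise any Windows drive-letter path. As a sketch, a hypothetical helper (not NLTK code) could distinguish drive letters from URL schemes like this:

```python
import re

# Hypothetical helper: a single-letter prefix followed by ':' and a
# path separator is a Windows path, not a URL scheme. URL schemes like
# 'nltk:' or 'file:' are longer than one character.
def looks_like_windows_path(resource_url):
    return re.match(r'^[A-Za-z]:[\\/]', resource_url) is not None

print(looks_like_windows_path(r'C:\nltk_data\x.pickle'))  # True
print(looks_like_windows_path('nltk:taggers/x.pickle'))   # False
```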
text = "This is a text. And I think, I might have mae a little typo somewhere in here."
tokens = [token for token in nltk.word_tokenize(text)]
nltk.pos_tag(tokens)
[('This', 'DT'),
('is', 'VBZ'),
('a', 'DT'),
('text', 'NN'),
('.', '.'),
('And', 'CC'),
('I', 'PRP'),
('think', 'VBP'),
(',', ','),
('I', 'PRP'),
('might', 'MD'),
('have', 'VB'),
('mae', 'VBN'),
('a', 'DT'),
('little', 'JJ'),
('typo', 'NN'),
('somewhere', 'RB'),
('in', 'IN'),
('here', 'RB'),
('.', '.')]
I don’t know whether the protocol should have been set somewhere earlier, whether a normalization step should have stripped the C: before this point, or what the real root cause is, because I don’t have time to investigate further right now; I have a conference deadline approaching fast. But hopefully this is helpful.
Best regards
Harm