Dear All,
I have some problems loading the standard POS tagger in nltk 3.2 on Windows 10. This happened after upgrading to nltk 3.2 via conda today; everything worked a week ago with nltk 3.1. I think I have a small pointer to what is happening, but maybe I’m off: something might be mishandled in or before the split_resource_url function in nltk.data when looking for model paths on Windows.
I saw the old post "Tokenizer = nltk.data.load('C:\nltk_data\tokenizers\punkt\english.pickle') failed" from February 2015, but I think my problem is different.
Here is a small example to illustrate (the "mae" typo in the text is intentional):
text = "This is a text. And I think, I might have mae a little typo somewhere in here."
nltk.pos_tag([token for token in nltk.word_tokenize(text)])
---------------------------------------------------------------------------
URLError Traceback (most recent call last)
<ipython-input-43-97c95a83aed3> in <module>()
1 text = "This is a text. And I think, I might have mae a little typo somewhere in here."
----> 2 nltk.pos_tag((token for token in nltk.word_tokenize(text)))
C:\Anaconda3\lib\site-packages\nltk\tag\__init__.py in pos_tag(tokens, tagset)
108 :rtype: list(tuple(str, str))
109 """
--> 110 tagger = PerceptronTagger()
111 return _pos_tag(tokens, tagset, tagger)
112
C:\Anaconda3\lib\site-packages\nltk\tag\perceptron.py in __init__(self, load)
139 if load:
140 AP_MODEL_LOC = str(find('taggers/averaged_perceptron_tagger/'+PICKLE))
--> 141 self.load(AP_MODEL_LOC)
142
143 def tag(self, tokens):
C:\Anaconda3\lib\site-packages\nltk\tag\perceptron.py in load(self, loc)
207 '''
208
--> 209 self.model.weights, self.tagdict, self.classes = load(loc)
210 self.model.classes = self.classes
211
C:\Anaconda3\lib\site-packages\nltk\data.py in load(resource_url, format, cache, verbose, logic_parser, fstruct_reader, encoding)
799
800 # Load the resource.
--> 801 opened_resource = _open(resource_url)
802
803 if format == 'raw':
C:\Anaconda3\lib\site-packages\nltk\data.py in _open(resource_url)
922 return find(path_, ['']).open()
923 else:
--> 924 return urlopen(resource_url)
925
926 ######################################################################
C:\Anaconda3\lib\urllib\request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
160 else:
161 opener = _opener
--> 162 return opener.open(url, data, timeout)
163
164 def install_opener(opener):
C:\Anaconda3\lib\urllib\request.py in open(self, fullurl, data, timeout)
463 req = meth(req)
464
--> 465 response = self._open(req, data)
466
467 # post-process response
C:\Anaconda3\lib\urllib\request.py in _open(self, req, data)
486
487 return self._call_chain(self.handle_open, 'unknown',
--> 488 'unknown_open', req)
489
490 def error(self, proto, *args):
C:\Anaconda3\lib\urllib\request.py in _call_chain(self, chain, kind, meth_name, *args)
441 for handler in handlers:
442 func = getattr(handler, meth_name)
--> 443 result = func(*args)
444 if result is not None:
445 return result
C:\Anaconda3\lib\urllib\request.py in unknown_open(self, req)
1308 def unknown_open(self, req):
1309 type = req.type
-> 1310 raise URLError('unknown url type: %s' % type)
1311
1312 def parse_keqv_list(l):
URLError: <urlopen error unknown url type: c>
I copy/pasted parts of the source code of nltk.data into a notebook to figure out what is going on.
When I do:
from nltk.data import find as nfind
PICKLE = "averaged_perceptron_tagger.pickle"
AP_MODEL_LOC = str(nfind('taggers/averaged_perceptron_tagger/'+PICKLE))
then AP_MODEL_LOC equals
'C:\\Users\\schuett.BWL\\AppData\\Roaming\\nltk_data\\taggers\\averaged_perceptron_tagger\\averaged_perceptron_tagger.pickle'
which is the correct location of the file.
I searched my way through the load function in data.py until I found this part in the normalize_resource_url(resource_url) function (starting at line 170):
try:
    protocol, name = split_resource_url(resource_url)
except ValueError:
    # the resource url has no protocol, use the nltk protocol by default
    protocol = 'nltk'
    name = resource_url
If I set resource_url = AP_MODEL_LOC and print the results of the above code with print(protocol, ' | ', name), I get:
C | \Users\schuett.BWL\AppData\Roaming\nltk_data\taggers\averaged_perceptron_tagger\averaged_perceptron_tagger.pickle
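The behaviour is easy to reproduce in isolation. This is a minimal sketch of what I suspect is going on, using a simplified stand-in for split_resource_url (the real function may do more than split on the first colon):

```python
# Simplified stand-in for nltk.data.split_resource_url: splitting a
# resource URL on the first ':' treats a Windows drive letter as a
# protocol/scheme.
def split_on_first_colon(resource_url):
    protocol, _, name = resource_url.partition(':')
    return protocol, name

path = r'C:\Users\me\AppData\Roaming\nltk_data\taggers\x.pickle'
protocol, name = split_on_first_colon(path)
print(protocol)  # 'C' -- later handed to urlopen as an unknown URL scheme
```

That stray 'C' protocol would explain the URLError: <urlopen error unknown url type: c> above.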
Obviously, it splits the C away from the rest. If I shoehorn an ugly

if resource_url[:2] == "C:":
    resource_url = "nltk:" + resource_url

into the split_resource_url function in data.py at line 126, the first split at “:” sets the protocol to “nltk” and everything works again:
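My hard-coded "C:" check is obviously too narrow; a proper fix would need to recognise any Windows drive-letter path. As a sketch, a hypothetical helper (not NLTK code) could distinguish drive letters from URL schemes like this:

```python
import re

# Hypothetical helper: a single-letter prefix followed by ':' and a
# path separator is a Windows path, not a URL scheme. URL schemes like
# 'nltk:' or 'file:' are longer than one character.
def looks_like_windows_path(resource_url):
    return re.match(r'^[A-Za-z]:[\\/]', resource_url) is not None

print(looks_like_windows_path(r'C:\nltk_data\x.pickle'))  # True
print(looks_like_windows_path('nltk:taggers/x.pickle'))   # False
```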
text = "This is a text. And I think, I might have mae a little typo somewhere in here."
tokens = [token for token in nltk.word_tokenize(text)]
nltk.pos_tag(tokens)
[('This', 'DT'),
('is', 'VBZ'),
('a', 'DT'),
('text', 'NN'),
('.', '.'),
('And', 'CC'),
('I', 'PRP'),
('think', 'VBP'),
(',', ','),
('I', 'PRP'),
('might', 'MD'),
('have', 'VB'),
('mae', 'VBN'),
('a', 'DT'),
('little', 'JJ'),
('typo', 'NN'),
('somewhere', 'RB'),
('in', 'IN'),
('here', 'RB'),
('.', '.')]
I don’t know whether the protocol should have been set somewhere earlier, whether a normalization step should have stripped the C: before this point, or what the real root cause is, because I don’t have time to investigate further right now; I have a conference deadline approaching fast. But hopefully this is helpful.
Best regards
Harm