Problem on Windows with loading tagger and split_resource_url?


Harm Schütt

Mar 5, 2016, 5:29:01 PM
to nltk-users

Dear All, 


I have some problems loading the standard pos_tagger in NLTK 3.2 on Windows 10. This started after upgrading to NLTK 3.2 via conda today; everything worked a week ago with NLTK 3.1. I think I have a small pointer to what is happening, but maybe I'm off: something might be mishandled in or before the split_resource_url function in nltk.data when it looks for model paths on Windows.


I saw the old post "Tokenizer = nltk.data.load('C:\nltk_data\tokenizers\punkt\english.pickle') failed" from February 2015, but I think my problem might be different.


This is a small example to illustrate:


text = "This is a text. And I think, I might have mae a little typo somewhere in here."

nltk.pos_tag([token for token in nltk.word_tokenize(text)])

---------------------------------------------------------------------------
URLError                                  Traceback (most recent call last)
<ipython-input-43-97c95a83aed3> in <module>()
      1 text = "This is a text. And I think, I might have mae a little typo somewhere in here."
----> 2 nltk.pos_tag((token for token in nltk.word_tokenize(text)))
 
C:\Anaconda3\lib\site-packages\nltk\tag\__init__.py in pos_tag(tokens, tagset)
    108     :rtype: list(tuple(str, str))
    109     """
--> 110     tagger = PerceptronTagger()
    111     return _pos_tag(tokens, tagset, tagger)
    112 
 
C:\Anaconda3\lib\site-packages\nltk\tag\perceptron.py in __init__(self, load)
    139         if load:
    140             AP_MODEL_LOC = str(find('taggers/averaged_perceptron_tagger/'+PICKLE))
--> 141             self.load(AP_MODEL_LOC)
    142 
    143     def tag(self, tokens):
 
C:\Anaconda3\lib\site-packages\nltk\tag\perceptron.py in load(self, loc)
    207         '''
    208 
--> 209         self.model.weights, self.tagdict, self.classes = load(loc)
    210         self.model.classes = self.classes
    211 
 
C:\Anaconda3\lib\site-packages\nltk\data.py in load(resource_url, format, cache, verbose, logic_parser, fstruct_reader, encoding)
    799 
    800     # Load the resource.
--> 801     opened_resource = _open(resource_url)
    802 
    803     if format == 'raw':
 
C:\Anaconda3\lib\site-packages\nltk\data.py in _open(resource_url)
    922         return find(path_, ['']).open()
    923     else:
--> 924         return urlopen(resource_url)
    925 
    926 ######################################################################
 
C:\Anaconda3\lib\urllib\request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    160     else:
    161         opener = _opener
--> 162     return opener.open(url, data, timeout)
    163 
    164 def install_opener(opener):
 
C:\Anaconda3\lib\urllib\request.py in open(self, fullurl, data, timeout)
    463             req = meth(req)
    464 
--> 465         response = self._open(req, data)
    466 
    467         # post-process response
 
C:\Anaconda3\lib\urllib\request.py in _open(self, req, data)
    486 
    487         return self._call_chain(self.handle_open, 'unknown',
--> 488                                 'unknown_open', req)
    489 
    490     def error(self, proto, *args):
 
C:\Anaconda3\lib\urllib\request.py in _call_chain(self, chain, kind, meth_name, *args)
    441         for handler in handlers:
    442             func = getattr(handler, meth_name)
--> 443             result = func(*args)
    444             if result is not None:
    445                 return result
 
C:\Anaconda3\lib\urllib\request.py in unknown_open(self, req)
   1308     def unknown_open(self, req):
   1309         type = req.type
-> 1310         raise URLError('unknown url type: %s' % type)
   1311 
   1312 def parse_keqv_list(l):
 
URLError: <urlopen error unknown url type: c>

 

I copy/pasted parts of the source code of nltk.data into a notebook to figure out what is going on.

 

When I do:

from nltk.data import find as nfind

PICKLE = "averaged_perceptron_tagger.pickle"
AP_MODEL_LOC = str(nfind('taggers/averaged_perceptron_tagger/' + PICKLE))

then AP_MODEL_LOC equals

'C:\\Users\\schuett.BWL\\AppData\\Roaming\\nltk_data\\taggers\\averaged_perceptron_tagger\\averaged_perceptron_tagger.pickle'

which is the correct location of the file.
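For reference, the whole traceback above can be reproduced without going through the tagger at all: feeding that absolute path straight into nltk.data.load() fails the same way. A minimal sketch, assuming the pickle is installed at the location shown above:

import nltk.data

# Passing the absolute Windows path directly: the drive letter "C" gets
# treated as a URL scheme, so the loader falls through to urlopen() and
# raises URLError: <urlopen error unknown url type: c>
nltk.data.load(AP_MODEL_LOC)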

I searched my way through the load function in data.py until I found this part in the normalize_resource_url(resource_url) function (starting at line 170):


try:
    protocol, name = split_resource_url(resource_url)
except ValueError:
    # the resource url has no protocol, use the nltk protocol by default
    protocol = 'nltk'
    name = resource_url

If I set resource_url = AP_MODEL_LOC and print(protocol, ' | ', name) after running the above code, I get:

C  |  \Users\schuett.BWL\AppData\Roaming\nltk_data\taggers\averaged_perceptron_tagger\averaged_perceptron_tagger.pickle
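As far as I can tell, the split amounts to something like this (a sketch of the effective behavior, not the literal nltk source):

# Splitting on the first ':' mistakes the Windows drive letter for a
# URL protocol.
url = 'C:\\Users\\schuett.BWL\\AppData\\Roaming\\nltk_data\\taggers\\averaged_perceptron_tagger\\averaged_perceptron_tagger.pickle'
protocol, name = url.split(':', 1)
print(protocol, ' | ', name)
# prints: C  |  \Users\...\averaged_perceptron_tagger.pickle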

 

Obviously, it splits the C away from the rest. If I wedge an ugly

if resource_url[:2] == "C:":
    resource_url = "nltk:" + resource_url

into the split_resource_url function in data.py at line 126, the first split at ":" sets the protocol to "nltk" and everything works again:


text = "This is a text. And I think, I might have mae a little typo somewhere in here."

tokens = [token for token in nltk.word_tokenize(text)]

nltk.pos_tag(tokens)

[('This', 'DT'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('text', 'NN'),
 ('.', '.'),
 ('And', 'CC'),
 ('I', 'PRP'),
 ('think', 'VBP'),
 (',', ','),
 ('I', 'PRP'),
 ('might', 'MD'),
 ('have', 'VB'),
 ('mae', 'VBN'),
 ('a', 'DT'),
 ('little', 'JJ'),
 ('typo', 'NN'),
 ('somewhere', 'RB'),
 ('in', 'IN'),
 ('here', 'RB'),
 ('.', '.')]
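For what it's worth, a less hacky version of the same guard might require a protocol to be at least two characters long, so that a bare drive letter falls into the ValueError branch and gets the default nltk protocol. Just a sketch of the idea (split_resource_url_patched is a name I made up, not a tested patch):

import re

def split_resource_url_patched(resource_url):
    # Require at least two characters before the ':'; 'C:\...' then no
    # longer matches, the ValueError is raised, and the caller in
    # normalize_resource_url() falls back to the 'nltk' protocol.
    m = re.match(r'([A-Za-z][A-Za-z0-9+.-]+):(.*)$', resource_url)
    if m is None:
        raise ValueError('no protocol found in %r' % resource_url)
    return m.group(1), m.group(2)

print(split_resource_url_patched('nltk:taggers/averaged_perceptron_tagger'))
# ('nltk', 'taggers/averaged_perceptron_tagger')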

 

I don't know if the protocol has to be set somewhere earlier, if a normalization step should have stripped the C: before this point, or what the root cause really is, because I don't have time to investigate further right now; I have a conference deadline approaching fast. But hopefully this is helpful.

 

Best regards

Harm 
