tokenizer = nltk.data.load('C:\nltk_data\tokenizers\punkt\english.pickle') failed


wine lover

Feb 28, 2015, 12:42:33 AM
to nltk-...@googlegroups.com
Dear All,

I am trying to experiment with NLTK's support for sentence tokenization. The code is:
import nltk.data
tokenizer = nltk.data.load('C:\nltk_data\tokenizers\punkt\english.pickle')

However, running it fails with the following traceback. I could not figure out the reason. Thank you very much for the help.



Traceback (most recent call last):
  File "C:/Users/ugwz/PycharmProjects/project-2/nltk-demo.py", line 4, in <module>
    tokenizer = nltk.data.load('C:\nltk_data\tokenizers\punkt\english.pickle')
  File "C:\Users\ugwz\AppData\Roaming\Python\Python27\site-packages\nltk\data.py", line 774, in load
    opened_resource = _open(resource_url)
  File "C:\Users\ugwz\AppData\Roaming\Python\Python27\site-packages\nltk\data.py", line 893, in _open
    return urlopen(resource_url)
  File "C:\Python27\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "C:\Python27\lib\urllib2.py", line 454, in _open
    'unknown_open', req)
  File "C:\Python27\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 1265, in unknown_open
    raise URLError('unknown url type: %s' % type)
urllib2.URLError: <urlopen error unknown url type: c>

Denzil Correa

Feb 28, 2015, 1:13:18 AM
to nltk-...@googlegroups.com
Well, I am able to get it to work. Try the following:

>>> import nltk
>>> from nltk.tokenize import sent_tokenize
>>> text = "I am trying to tokenize a sentence here. What are you up to?"
>>> sentences = sent_tokenize(text)
>>> print sentences
['I am trying to tokenize a sentence here.', 'What are you up to?']


--Regards,
Denzil



Alexis Dimitriadis

Feb 28, 2015, 11:16:30 AM
to nltk-...@googlegroups.com
The problem is that you did not use a "raw" string to specify the path to the tokenizer; your syntax highlighter even points out (in blue) where Python interprets \n and \t as newline and tab! Always write Windows filesystem paths as raw strings: r'C:\nltk_data...'
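
A minimal sketch of the escape behaviour described above, written in Python 3 syntax. The path is shortened from the one in the thread so that it contains only the two recognized escapes, \n and \t; the variable names are illustrative only.

```python
# Without the r prefix, Python reads \n and \t as escape sequences.
plain = 'C:\nltk_data\tokenizers'
# With the r prefix, every backslash is kept as a literal character.
raw = r'C:\nltk_data\tokenizers'

print('\n' in plain, '\t' in plain)  # True True
print('\n' in raw, '\t' in raw)      # False False

# Each escape collapses two characters into one, so the plain
# string is two characters shorter than the raw one.
print(len(plain), len(raw))          # 21 23
```

Note that this only shows why the non-raw path is mangled; it does not by itself show that a raw string is enough for nltk.data.load() to accept the path.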

Alexis 

Dr. Alexis Dimitriadis | Assistant Professor and Senior Research Fellow | Utrecht Institute of Linguistics OTS | Utrecht University | Trans 10, 3512 JK Utrecht, room 2.33 | +31 30 253 65 68 | a.dimi...@uu.nl | www.hum.uu.nl/medewerkers/a.dimitriadis

wine lover

Feb 28, 2015, 7:16:29 PM
to nltk-...@googlegroups.com
Hi Alexis,

Thank you for the reply.

I tried tokenizer = nltk.data.load(r'C:\nltk_data\tokenizers\punkt\english.pickle')

and tokenizer = nltk.data.load(ur'C:\nltk_data\tokenizers\punkt\english.pickle')

But both of them give me the same error message.

Thanks.

Fred Mailhot

Feb 28, 2015, 7:31:37 PM
to nltk-...@googlegroups.com
It looks like the nltk.data.load() call is trying to grab a resource from the web, so it's misparsing the file resource you're pointing it at. As Alexis pointed out, that should in principle be resolved by passing a raw string.

Note that if your nltk_data packages were correctly installed, then NLTK knows where they are on your system, and you don't need to pass an absolute path...

 $ ipython

In [1]: from nltk.data import load

In [2]: f = load("tokenizers/punkt/english.pickle")

In [3]: f.tokenize("this is a test!\nThis, too?")
Out[3]: ['this is a test!', 'This, too?']
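
A rough illustration of the misparsing, assuming Python 3's urllib.parse.urlsplit splits off a scheme the same way the Python 2.7 urllib2 call in the traceback does (that equivalence is an assumption, not something the thread confirms). It shows why the drive letter ends up reported as "unknown url type: c" even when the path is a raw string, and why a relative resource name avoids the problem.

```python
from urllib.parse import urlsplit

windows_path = r"C:\nltk_data\tokenizers\punkt\english.pickle"
resource_name = "tokenizers/punkt/english.pickle"

# Everything before the first colon is read as the URL scheme,
# so the drive letter "C" becomes the (unknown) scheme "c".
print(urlsplit(windows_path).scheme)         # c

# A relative resource name contains no colon, so no scheme is found.
print(repr(urlsplit(resource_name).scheme))  # ''
```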

Alexis Dimitriadis

Mar 1, 2015, 11:52:38 AM
to nltk-...@googlegroups.com
> I tried tokenizer = nltk.data.load(r'C:\nltk_data\tokenizers\punkt\english.pickle')
>
> and tokenizer = nltk.data.load(ur'C:\nltk_data\tokenizers\punkt\english.pickle')
>
> But both of them gives me the same error message.

Apparently the file english.pickle does not exist at the specified location. Can you tokenize a sentence using Denzil Correa's code? If not, you need to run nltk.download() and download the "book" collection; see ch. 1 of the NLTK book.

Alexis


Snijesh VP

May 20, 2019, 4:36:43 PM
to nltk-users
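Remove the "C:\nltk_data\" prefix from your path. The leading "C:" makes it look like a URL. I think you are using the Windows version.

import nltk.data
# Resource name relative to the nltk_data directory, with forward slashes
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')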
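
If an absolute path really is needed, one possible workaround is to hand NLTK an explicit "file:" URL instead of a bare drive-letter path. The sketch below is in Python 3 syntax, and windows_path_to_file_url and load_tokenizer are hypothetical helper names, not part of NLTK. That nltk.data.load() accepts this URL form on the installed version is an assumption to verify locally.

```python
from pathlib import PureWindowsPath


def windows_path_to_file_url(path):
    """Hypothetical helper: build a file: URL from an absolute Windows path.

    Only swaps backslashes for forward slashes; it does not
    percent-encode spaces or other special characters.
    """
    return "file:///" + PureWindowsPath(path).as_posix()


url = windows_path_to_file_url(r"C:\nltk_data\tokenizers\punkt\english.pickle")
print(url)  # file:///C:/nltk_data/tokenizers/punkt/english.pickle


def load_tokenizer(path):
    # Assumption: nltk.data.load() accepts a "file:" URL; requires NLTK
    # and the punkt data to be installed at the given location.
    import nltk.data
    return nltk.data.load(windows_path_to_file_url(path))
```

The relative resource name shown above remains the simpler option when the data lives in a directory NLTK already searches.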
