tokenizer = nltk.data.load('C:\nltk_data\tokenizers\punkt\english.pickle') failed

wine lover

Feb 28, 2015, 12:42:33 AM

to nltk-...@googlegroups.com

Dear All,

I am trying to experiment with NLTK's support for sentence tokenization. The code is:

import nltk.data
tokenizer = nltk.data.load('C:\nltk_data\tokenizers\punkt\english.pickle')

However, the script fails with the following traceback, and I cannot figure out the reason. Thank you very much for the help.

Traceback (most recent call last):
  File "C:/Users/ugwz/PycharmProjects/project-2/nltk-demo.py", line 4, in <module>
    tokenizer = nltk.data.load('C:\nltk_data\tokenizers\punkt\english.pickle')
  File "C:\Users\ugwz\AppData\Roaming\Python\Python27\site-packages\nltk\data.py", line 774, in load
    opened_resource = _open(resource_url)
  File "C:\Users\ugwz\AppData\Roaming\Python\Python27\site-packages\nltk\data.py", line 893, in _open
    return urlopen(resource_url)
  File "C:\Python27\lib\urllib2.py", line 154, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "C:\Python27\lib\urllib2.py", line 454, in _open
    'unknown_open', req)
  File "C:\Python27\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 1265, in unknown_open
    raise URLError('unknown url type: %s' % type)
urllib2.URLError: <urlopen error unknown url type: c>

Denzil Correa

Feb 28, 2015, 1:13:18 AM

to nltk-...@googlegroups.com

Well, I am able to get it to work. Try the following:

>>> import nltk
>>> from nltk.tokenize import sent_tokenize
>>> text = "I am trying to tokenize a sentence here. What are you up to?"
>>> sentences = sent_tokenize(text)
>>> print sentences
['I am trying to tokenize a sentence here.', 'What are you up to?']

--Regards,
Denzil

http://correa.in

Alexis Dimitriadis

Feb 28, 2015, 11:16:30 AM

to nltk-...@googlegroups.com

The problem is that you did not use a "raw" string to specify the path to the tokenizer; your syntax highlighter even points out (in blue) where Python interprets \n and \t as newline and tab! Always write Windows filesystem paths as raw strings: r'C:\nltk_data...'
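A minimal sketch of the difference, written in Python 3 syntax and using a shortened version of the path (the variable names `plain` and `raw` are only for illustration):

```python
# Ordinary literal: "\n" becomes a newline and "\t" becomes a tab,
# so two backslash+letter pairs collapse into single control characters.
plain = 'C:\nltk_data\tokenizers'

# Raw literal: backslashes are kept exactly as typed.
raw = r'C:\nltk_data\tokenizers'

print(len(plain))     # 21
print(len(raw))       # 23
print('\n' in plain)  # True: the path now contains a newline
print('\\' in plain)  # False: no backslash survived
```

The same rule applies in Python 2, which is what the traceback in this thread comes from.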

Alexis

wine lover

Feb 28, 2015, 7:16:29 PM

to nltk-...@googlegroups.com

Hi Alexis,

Thank you for the reply.

I tried tokenizer = nltk.data.load(r'C:\nltk_data\tokenizers\punkt\english.pickle')

and tokenizer = nltk.data.load(ur'C:\nltk_data\tokenizers\punkt\english.pickle')

But both of them give me the same error message.

Thanks.

Fred Mailhot

Feb 28, 2015, 7:31:37 PM

to nltk-...@googlegroups.com

It looks like the nltk.data.load() call is trying to grab a resource from the web, so it's misparsing the file resource you're pointing it at. As Alexis pointed out, that should in principle be resolved by passing a raw string.
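The "unknown url type: c" at the end of the traceback suggests that the drive letter is being read as a URL scheme. As a rough illustration only, the snippet below uses the standard library's URL parser in Python 3 (`urllib.parse`), not NLTK's own internal code, and the variable names are made up for the example:

```python
from urllib.parse import urlsplit

windows_path = r'C:\nltk_data\tokenizers\punkt\english.pickle'
resource_name = 'tokenizers/punkt/english.pickle'

# Everything before the first colon is taken as the URL scheme,
# so the drive letter "C" ends up as the scheme "c".
print(repr(urlsplit(windows_path).scheme))   # 'c'

# A relative resource name has no colon, hence no scheme.
print(repr(urlsplit(resource_name).scheme))  # ''
```

This is why even a correctly escaped absolute Windows path can still be mistaken for a URL, while a relative resource name cannot.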

Note that if your nltk_data packages were correctly installed, then NLTK knows where they are on your system, and you don't need to pass an absolute path...

$ ipython
In [1]: from nltk.data import load
In [2]: f = load("tokenizers/punkt/english.pickle")
In [3]: f.tokenize("this is a test!\nThis, too?")
Out[3]: ['this is a test!', 'This, too?']

Alexis Dimitriadis

Mar 1, 2015, 11:52:38 AM

to nltk-...@googlegroups.com

> I tried tokenizer = nltk.data.load(r'C:\nltk_data\tokenizers\punkt\english.pickle')
>
> and tokenizer = nltk.data.load(ur'C:\nltk_data\tokenizers\punkt\english.pickle')
>
> But both of them give me the same error message.

Apparently the file english.pickle does not exist at the specified location. Can you tokenize a sentence using Denzil Correa's code? If not, you need to run nltk.download() and download the "book" collection (see ch. 1 of the NLTK book).
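One way to check whether the file is really there, sketched with the standard library only (Python 3 syntax). The helper `locate_resource` and the directory list are hypothetical and just for illustration; NLTK keeps its own list of search directories in `nltk.data.path`, which you could pass in instead:

```python
import os

def locate_resource(relative_name, search_dirs):
    """Return the first existing file for relative_name under search_dirs, else None."""
    parts = relative_name.split('/')
    for base in search_dirs:
        candidate = os.path.join(base, *parts)
        if os.path.isfile(candidate):
            return candidate
    return None

# Assumed locations; adjust these to your own installation.
search_dirs = [r'C:\nltk_data', os.path.join(os.path.expanduser('~'), 'nltk_data')]

# Prints the full path if the file is found, or None if it is missing.
print(locate_resource('tokenizers/punkt/english.pickle', search_dirs))
```

If this prints None for every directory you expect, the tokenizer data was probably never downloaded.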

Alexis

Snijesh VP

May 20, 2019, 4:36:43 PM

to nltk-users

Remove "C:\" from your path. The drive-letter prefix makes it look like a URL. I think you are using the Windows version.

import nltk.data
# Resource name relative to the nltk_data directory, written with forward slashes
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
