Corpus loading

Mazhar Dootio

Feb 13, 2017, 4:54:30 PM

to nltk-users

Hello everyone,

I am facing a problem loading a Sindhi corpus. The corpus is Unicode-based. I got the following error when reading the file. Please help me fix it.

import unicodedata

import nltk

contents = open("D:\SindhiCorpus.txt").read()

len(contents)

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-3-62b88b16c0e0> in <module>()
      1 import unicodedata
      2 import nltk
----> 3 contents = open("D:\SindhiCorpus.txt").read()
      4 len(contents)

C:\Users\mazhar\Anaconda3\lib\encodings\cp1252.py in decode(self, input, final)
     21 class IncrementalDecoder(codecs.IncrementalDecoder):
     22     def decode(self, input, final=False):
---> 23         return codecs.charmap_decode(input,self.errors,decoding_table)[0]
     24 
     25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 30: character maps to <undefined>

Constantin Orăsan

Feb 13, 2017, 6:05:10 PM

to nltk-...@googlegroups.com

Hello,

Make sure you specify the encoding of the file when you open it. As the traceback shows, Python is decoding the file with cp1252, but your file is very likely encoded in UTF-8:

open(filename, encoding='utf8')
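
As a minimal sketch of the difference, the snippet below writes a small UTF-8 file and reads it back twice, once as cp1252 and once as UTF-8. The sample text, the temporary directory, and the variable names are assumptions for illustration only; they stand in for the real corpus, and the same failure would show up with any byte that cp1252 leaves undefined (0x90 in the traceback, 0x81 here).

```python
import os
import tempfile

# Hypothetical sample standing in for the corpus: four Sindhi letters, a space,
# and the letter feh (U+0641), whose UTF-8 encoding is the bytes D9 81.
sample = "\u0633\u0646\u068c\u064a \u0641"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "SindhiCorpus.txt")

    # Write the sample as UTF-8, the encoding the real file is assumed to use.
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Decoding as cp1252 fails: byte 0x81 has no character assigned in cp1252.
    try:
        with open(path, encoding="cp1252") as f:
            f.read()
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

    # Naming the encoding explicitly decodes the text correctly.
    with open(path, encoding="utf-8") as f:
        contents = f.read()

print(cp1252_failed, len(contents))
```

For the real file, a raw string such as r"D:\SindhiCorpus.txt" keeps the backslash in the Windows path from being read as the start of an escape sequence.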

Regards,

Constantin


Mazhar Dootio

Feb 14, 2017, 3:47:54 AM

to nltk-users

Thank you for the help.

My problem is solved.
