Corpus loading


Mazhar Dootio

Feb 13, 2017, 4:54:30 PM
to nltk-users
Hello everyone,
I am having a problem loading a Sindhi corpus. The corpus is Unicode text. I get the following error. Please help me fix it.

import unicodedata
import nltk
contents = open("D:\SindhiCorpus.txt").read() 
len(contents) 

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-3-62b88b16c0e0> in <module>()
      1 import unicodedata
      2 import nltk
----> 3 contents = open("D:\SindhiCorpus.txt").read()
      4 len(contents)

C:\Users\mazhar\Anaconda3\lib\encodings\cp1252.py in decode(self, input, final)
     21 class IncrementalDecoder(codecs.IncrementalDecoder):
     22     def decode(self, input, final=False):
---> 23         return codecs.charmap_decode(input,self.errors,decoding_table)[0]
     24 
     25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 30: character maps to <undefined>

Constantin Orăsan

Feb 13, 2017, 6:05:10 PM
to nltk-...@googlegroups.com
Hello,

Make sure you specify the encoding of the file you open. Your traceback shows Python falling back to the Windows default codec (cp1252), while the file is very likely encoded in UTF-8:

open(filename, encoding='utf8')
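
A slightly fuller sketch of the same idea, in case it helps. The helper name read_corpus and the throwaway sample file are only illustrative, and it assumes the corpus really is UTF-8; if it was saved in another encoding, pass that name instead.

```python
import os
import tempfile


def read_corpus(path, encoding="utf-8"):
    """Read a whole text file, decoding it with an explicit encoding."""
    # Passing encoding= stops Python from falling back to the platform
    # default (cp1252 in the traceback), which fails on some UTF-8 bytes.
    with open(path, encoding=encoding) as f:
        return f.read()


# Demo on a temporary file so the sketch runs anywhere. For the real corpus
# the call would be read_corpus(r"D:\SindhiCorpus.txt").
sample = "سنڌي ٻولي"
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write(sample)
    demo_path = tmp.name

contents = read_corpus(demo_path)
os.remove(demo_path)
print(len(contents))  # number of characters, not bytes
```

Writing the Windows path as a raw string (the r prefix) also keeps the backslash from being read as an escape sequence.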
Regards,

Constantin

Mazhar Dootio

Feb 14, 2017, 3:47:54 AM
to nltk-users
Thank you for your help.
My problem is solved.