Help with error UnicodeDecodeError: 'charmap'


Renato Sant Anna

Mar 29, 2021, 4:34:31 PM
to Gensim
Hi all, 

I need help to solve the following error: 

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2438: character maps to <undefined>

Does anyone have any idea how to solve it?

Best regards, 

Renato 

Renato Sant Anna

Mar 29, 2021, 5:21:08 PM
to Gensim
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-16-b19291d690e0> in <module>
      4 
      5 # Load data
----> 6 data = load_data(input_file)
      7 
      8 # Create a preprocessor object

<ipython-input-14-29d3533014b2> in load_data(input_file)
      3     data = []
      4     with open(input_file, 'r') as f:
----> 5         for line in f.readlines():
      6             data.append(line[:-1])
      7 

~\anaconda3\lib\encodings\cp1252.py in decode(self, input, final)
     21 class IncrementalDecoder(codecs.IncrementalDecoder):
     22     def decode(self, input, final=False):
---> 23         return codecs.charmap_decode(input,self.errors,decoding_table)[0]
     24 
     25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2438: character maps to <undefined>

Ben Reaves

Mar 29, 2021, 10:08:20 PM
to gen...@googlegroups.com
This is a file encoding problem. I got the same error when my input files had some Chinese characters and some emoji in them. I think gensim is expecting UTF-8 encoding but your file has some other encoding.
Here is a video showing an example in which the presenter discovered the encoding using Notepad++ on Windows 10,
and fixed the problem by adding an argument to a file open() statement.
I guess your load_data has an open() statement somewhere deep within it.

The best option is probably to convert the file to UTF-8.
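
As a minimal sketch of the "add an argument to open()" idea: the traceback earlier in this thread ends in cp1252.py, which suggests the open() call in load_data fell back to the Windows default codec. Byte 0x9d has no mapping in cp1252, but it is the last byte of the UTF-8 encoding of the right double quotation mark (U+201D). The sample file and its contents below are made up for illustration, and rstrip("\n") is used in place of the original line[:-1] so that a final line without a newline is not truncated.

```python
import os
import tempfile


def load_data(input_file, encoding="utf-8"):
    """Read a text file into a list of lines without trailing newlines."""
    data = []
    # Passing encoding explicitly avoids the platform default codec
    # (cp1252 on many Windows installations).
    with open(input_file, "r", encoding=encoding) as f:
        for line in f:
            data.append(line.rstrip("\n"))
    return data


# Hypothetical sample file: UTF-8 text containing curly quotes.
# U+201D is encoded in UTF-8 as the bytes E2 80 9D.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("He said \u201chello\u201d\n")

# Decoding the same bytes as cp1252 reproduces the 'charmap' error,
# because 0x9d is undefined in that code page.
try:
    load_data(path, encoding="cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

# Decoding as UTF-8 works.
lines = load_data(path)
```

If the file really is UTF-8, passing encoding="utf-8" to the open() call that appears in the traceback should be enough; no conversion of the file is needed in that case.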


Renato Sant Anna

Mar 29, 2021, 10:43:02 PM
to gen...@googlegroups.com
Hi, the file is already in UTF-8. What worked, but damaged the file contents with wrong characters, was using engine = 'python' in pd.read_csv.

Renato Sant Anna

Mar 29, 2021, 10:49:27 PM
to Gensim
If I use data = pd.read_csv('pathname', engine = 'python'), it works, but the file ends up with wrong characters and my NLP becomes messy, so it is not a very good solution.
But if I use data = pd.read_csv('pathname', encoding = 'utf-8', engine = 'python'), the error persists with the message: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2438: character maps to <undefined>
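
One way to narrow this down is to check the raw bytes directly. A failed UTF-8 decode reports the 'utf-8' codec, not 'charmap', so a 'charmap' message that persists after passing encoding = 'utf-8' to read_csv probably comes from a different open() call, such as the one inside load_data in the traceback. The sketch below uses only the standard library; the helper names first_invalid_utf8_offset and check_file are hypothetical.

```python
def first_invalid_utf8_offset(raw):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


def check_file(path):
    """Read a file as raw bytes and test whether it is valid UTF-8."""
    with open(path, "rb") as f:
        return first_invalid_utf8_offset(f.read())


# Valid UTF-8 input: no offending offset is reported.
valid_result = first_invalid_utf8_offset("caf\u00e9 \u201d".encode("utf-8"))

# cp1252-encoded input (b'caf\xe9') is not valid UTF-8; the offset of the
# offending byte is reported.
invalid_result = first_invalid_utf8_offset("caf\u00e9".encode("cp1252"))
```

If check_file returns None for the CSV, the file decodes cleanly as UTF-8 and the remaining task is to find which open() call is still using the default codec.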

Renato Sant Anna

Mar 29, 2021, 11:11:42 PM
to Gensim
I found a better solution using the engine = 'python' option in read_csv, and, in to_csv, a special encoding together with the errors option as well. Thanks everyone!!!
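
The exact arguments were not posted, so the following is only a guess at what such a round trip might look like. It assumes pandas 1.1 or later (where DataFrame.to_csv accepts an errors argument). The function name rewrite_csv, the encodings, the errors="replace" policy, and the sample data are all assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd


def rewrite_csv(src, dst, read_encoding="utf-8",
                write_encoding="utf-8", errors="replace"):
    """Read a CSV with the Python parser and write it back out.

    Characters that cannot be represented in write_encoding are handled
    according to the errors policy (assumed here: "replace").
    """
    df = pd.read_csv(src, engine="python", encoding=read_encoding)
    df.to_csv(dst, index=False, encoding=write_encoding, errors=errors)
    return df


# Hypothetical sample data: one column with curly quotes.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "in.csv")
dst = os.path.join(workdir, "out.csv")
with open(src, "w", encoding="utf-8", newline="") as f:
    f.write("text\nsay \u201chi\u201d\n")

# Writing as ASCII with errors="replace" turns unencodable characters into "?".
df = rewrite_csv(src, dst, write_encoding="ascii")
with open(dst, encoding="ascii") as f:
    written = f.read()
```

Note that a replace policy loses the original characters in the output file, so keeping UTF-8 on both sides is usually preferable when the downstream NLP steps can read it.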