UnicodeDecodeError: 'utf8' codec can't decode byte in position: invalid start byte

Scott Solomon

Dec 12, 2014, 10:32:50 AM

to gen...@googlegroups.com

I have a corpus of a couple thousand .txt files.

For a small number of these files, I get error messages like these:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 1623: invalid start byte

UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 2920: invalid start byte

UnicodeDecodeError: 'utf8' codec can't decode byte 0xbf in position 1570: invalid start byte

Is there a way to make gensim ignore bytes it cannot decode, so that instead of raising an error that stops similarity scoring partway through, it skips the offending byte and continues processing?

I am printing the file names to the console, so when I get an error like the ones above, I can find the file and delete it. That solves the problem until gensim reaches another file with a similar error. Eventually, after I have deleted all the files with this issue, gensim gets through the remaining files and scores them successfully.

The first line of my code is:

# -*- coding: utf-8 -*-

Thanks,

Scott

Radim Řehůřek

Dec 13, 2014, 9:23:53 AM

to gen...@googlegroups.com

Hello Scott,

gensim expects unicode on input.

So I'd suggest you convert your input texts into unicode -- ignoring/replacing invalid characters as you please, specify your encoding etc. -- and use that consistently.

Best,

Radim

Scott Solomon

Dec 15, 2014, 11:25:38 AM

to gen...@googlegroups.com

"So I'd suggest you convert your input texts into unicode -- ignoring/replacing invalid characters as you please, specify your encoding etc."

I do not know how to do this. Any help is much appreciated! I have thousands of separate .txt files.

- Scott

Radim Řehůřek

Dec 15, 2014, 11:35:13 AM

to gen...@googlegroups.com

Ah ok, sorry.

In Python 2, you'd read the content of a file like this: `content = open(filename).read()`. This gives you a binary (byte) string representation of the file contents.

You would convert this string (or any other binary string) to unicode using `content_unicode = unicode(content, encoding=WHATEVER, errors='replace')`, where WHATEVER is the encoding your files actually use. Then pass these unicode strings to gensim, not the original binary strings.
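A minimal sketch of that conversion, written for Python 3, where the Python 2 call `unicode(content, encoding=..., errors='replace')` corresponds to `content.decode(encoding, errors='replace')`. The helper name `to_unicode` and the sample bytes are made up for illustration; only the byte 0x96 comes from the error messages earlier in the thread.

```python
def to_unicode(raw, encoding="utf-8", errors="replace"):
    """Decode a byte string to text without raising on undecodable bytes."""
    return raw.decode(encoding, errors=errors)


# Hypothetical sample: 0x96 is not a valid UTF-8 start byte,
# so a strict decode of these bytes raises UnicodeDecodeError.
raw = b"pages 10\x9620"

replaced = to_unicode(raw)                  # bad byte becomes U+FFFD
ignored = to_unicode(raw, errors="ignore")  # bad byte is dropped

print(repr(replaced))
print(repr(ignored))
```

With `errors='replace'` each undecodable byte is swapped for the Unicode replacement character, so you can still see where the damage was; `errors='ignore'` silently drops it, which is closer to the "just skip that byte" behaviour asked about.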

See for example https://docs.python.org/2/library/functions.html#unicode . Or just google around, unicode in Python is a popular topic :)
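For a folder of .txt files, the same idea can be applied when the files are read. The sketch below assumes Python 3 and UTF-8 as the intended encoding; `read_text_lenient` and `iter_corpus` are hypothetical names, not part of gensim. If the files were actually saved in another encoding (0x92 and 0x96 are punctuation characters in Windows-1252, which is one possibility, not something the thread confirms), passing that encoding would keep the characters instead of replacing them.

```python
import os


def read_text_lenient(path, encoding="utf-8"):
    """Read a file as text; undecodable bytes become U+FFFD instead of raising."""
    with open(path, "rb") as f:
        return f.read().decode(encoding, errors="replace")


def iter_corpus(dirname, encoding="utf-8"):
    """Yield (filename, text) for every .txt file in dirname, in sorted order."""
    for name in sorted(os.listdir(dirname)):
        if name.endswith(".txt"):
            path = os.path.join(dirname, name)
            yield name, read_text_lenient(path, encoding)
```

Each yielded text is an ordinary unicode string, so it can be tokenized and handed to gensim without deleting the problem files.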

HTH,

Radim

Roni Nemat

May 23, 2017, 10:23:05 AM

to gensim

Hi Radim,

How would the above work if I read a CSV file into pandas and then convert it back to a list?

Thanks in advance
