I have a corpus of a couple thousand .txt files.
For a small number of these files, I get error messages like these:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 1623: invalid start byte
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 2920: invalid start byte
UnicodeDecodeError: 'utf8' codec can't decode byte 0xbf in position 1570: invalid start byte
Is there a way to make gensim ignore bytes it cannot decode? Right now each of these errors stops the program before similarity scoring has finished. I would like it to skip the offending byte and continue processing.
I am printing the file names to the console, so when I get an error like the ones above, I can find the file and delete it. That solves the problem until gensim reaches another file with a similar error. Eventually, after I have deleted all the files with this issue, gensim runs through the remaining files and scores them successfully.
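To illustrate the behavior I am after, here is a minimal sketch in plain Python, not gensim. The helper names read_text_lenient and decode_lenient are hypothetical, and I do not know whether gensim exposes an equivalent option. It relies only on the standard errors argument of io.open and bytes.decode, where "ignore" drops undecodable bytes and "replace" substitutes U+FFFD:

```python
import io


def decode_lenient(raw, encoding="utf-8", errors="ignore"):
    """Decode raw bytes, dropping (or replacing) bytes invalid in `encoding`.

    Hypothetical helper for illustration; not part of gensim.
    """
    return raw.decode(encoding, errors)


def read_text_lenient(path, encoding="utf-8", errors="ignore"):
    """Read a whole file as text without raising UnicodeDecodeError.

    Hypothetical helper for illustration; not part of gensim.
    """
    # io.open accepts encoding/errors on both Python 2 and Python 3.
    with io.open(path, "r", encoding=encoding, errors=errors) as handle:
        return handle.read()


# 0x96 and 0x92 are the kind of stray bytes reported in the errors above.
sample = b"caf\x96 bar\x92s"
skipped = decode_lenient(sample)                     # bytes dropped
marked = decode_lenient(sample, errors="replace")    # bytes become U+FFFD
```

If something like this is the right approach, I could read each file with read_text_lenient myself and hand the resulting text to gensim instead of deleting the file.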
The first line of my code is:
# -*- coding: utf-8 -*-
Thanks,
Scott