Fixed the Unicode issues and sent a pull request.
By using |tee, you’re relying on the shell to handle the encoding issues. That’s bound to end in disaster, unless you understand how your shell handles Unicode (I don’t). So I’ve changed your program to write to the outfile directly.
It was almost working anyway: all you needed to do was use unicodecsv.reader instead of csv.reader, and encode to UTF-8 before printing to the file. (codecs.open(..,..,'utf-8') handles the encoding automatically.)
Confusingly, it looks like unicodecsv.reader expects the file to be opened in the default encoding, not UTF-8.
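For anyone reading this later on Python 3, here is a minimal sketch of the same idea: decode the CSV as UTF-8 while reading, and write to the output file directly instead of piping through tee. It is not the code from the pull request. It assumes Python 3, where the standard csv module works on text and unicodecsv is not needed, and the function name copy_rows and the tab-joined output format are made up for illustration.

```python
import csv
import io


def copy_rows(in_path, out_path):
    """Read a UTF-8 CSV file and write one tab-joined line per row to out_path.

    Hypothetical helper for illustration; returns the number of rows copied.
    """
    count = 0
    # newline="" is what the csv module documentation recommends for
    # files handed to csv.reader.
    with io.open(in_path, "r", encoding="utf-8", newline="") as src, \
            io.open(out_path, "w", encoding="utf-8") as dst:
        for row in csv.reader(src):
            # The file object does the UTF-8 encoding; the shell is not involved.
            dst.write("\t".join(row) + "\n")
            count += 1
    return count
```

Because both files are opened with an explicit encoding, the result should not depend on the terminal or locale settings.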
To understand Unicode encodings and handling them in Python, see
(By the way, 250 MiB is an unreasonable size for a single repository, IMO. Consider breaking it up, or moving the dictionaries out of the repo.)
--
You received this message because you are subscribed to the Google Groups "sanskrit-programmers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sanskrit-program...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
I am trying to find common prefixes between two devanAgarI strings I read from two columns of a utf-8 csv file. (Where I am headed: get common suffixes of pAda-s of ardhasama-vRtta-s to facilitate memorization and appreciation.)
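A rough sketch of the prefix/suffix comparison itself, assuming Python 3 (where str is already Unicode) and assuming that comparing code point by code point is acceptable. The function names common_prefix and common_suffix are made up for illustration.

```python
def common_prefix(a, b):
    """Return the longest common prefix of a and b, compared code point by code point."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def common_suffix(a, b):
    """Return the longest common suffix: reverse both, take the prefix, reverse back."""
    return common_prefix(a[::-1], b[::-1])[::-1]


print(common_prefix("राम", "राज"))  # prints रा
print(common_suffix("गमन", "नमन"))  # prints मन
```

One caveat: in devanAgarI a vowel sign is a separate code point from its consonant, so a code-point match can end in the middle of a syllable (for example, कि and का share the prefix क). If whole syllables matter, the strings would need to be split into akShara-s first.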
def Print(u):
    assert isinstance(u, unicode)
    print(u.encode('utf-8'))

import codecs
import locale
import sys
sys.stdin = codecs.getreader('utf-8')(sys.stdin)
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)