Need help with unicode handling in python

विश्वासो वासुकिजः (Vishvas Vasuki)

Dec 7, 2013, 7:58:30 PM

to sanskrit-p...@googlegroups.com

+bcc: वाचस्पतिः, श्रीरमणः

I am trying to find common prefixes between two devanAgarI strings I read from two columns of a utf-8 csv file. (Where I am headed: get common suffixes of pAda-s of ardhasama-vRtta-s to facilitate memorization and appreciation.)

I tried running this program with this command:

python ardhasama_common_suffix.py |tee data/ardhasama_prefix.csv

but while I see somewhat recognizable output on the console, I get gibberish in the output file.

I would appreciate insights, or better yet, elegant Python code that works as intended, or even better, both.

--
Vishvas /विश्वासः

Anubhav Chattoraj

Dec 8, 2013, 12:24:27 AM

to sanskrit-p...@googlegroups.com

Fixed the Unicode issues and sent pull request.

By using |tee, you’re relying on the shell to handle the encoding issues. That’s bound to end in disaster, unless you understand how your shell handles Unicode (I don’t). So I’ve changed your program to write to the outfile directly.

It was almost working anyway. All you needed to do was use unicodecsv.reader instead of csv.reader, and print to the file after encoding in UTF-8. (codecs.open(..,..,'utf-8') handles the encoding automatically.)

Confusingly, it looks like unicodecsv.reader expects the file to be opened in the default encoding, not UTF-8.
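For readers on Python 3, the same idea (read the CSV as text, compare the two columns, write the result straight to a UTF-8 file instead of piping through tee) might look roughly like the sketch below. This is not the code from the pull request: the function names are made up for illustration, it uses the standard csv module rather than unicodecsv, and it assumes each row has the two strings in its first two columns.

```python
import csv
import io


def common_prefix(a, b):
    """Return the longest common prefix of two str objects, code point by code point."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def common_suffix(a, b):
    """Return the longest common suffix: reverse, take the prefix, reverse back."""
    return common_prefix(a[::-1], b[::-1])[::-1]


def write_common_prefixes(src, dst):
    """Read two-column rows from text stream src; write each pair plus its common prefix to dst."""
    writer = csv.writer(dst)
    for row in csv.reader(src):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        writer.writerow([row[0], row[1], common_prefix(row[0], row[1])])


# In-memory demonstration. Real files would be opened in text mode with an
# explicit encoding, e.g. open("in.csv", encoding="utf-8", newline="") and
# open("out.csv", "w", encoding="utf-8", newline=""); both names are placeholders.
demo_in = io.StringIO("रामः,रामौ\n")
demo_out = io.StringIO()
write_common_prefixes(demo_in, demo_out)
```

Note that the comparison is per code point, so a common prefix can end in the middle of an akshara (for example, just before a vowel sign); whether that matters depends on what the output is used for.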

To understand Unicode encodings and handling them in Python, see

(By the way, 250 MiB is an unreasonable size for a single repository, IMO. Consider breaking it up, or moving the dictionaries out of the repo.)

विश्वासो वासुकिजः (Vishvas Vasuki)

Dec 8, 2013, 1:47:44 AM

to sanskrit-p...@googlegroups.com

On Sat, Dec 7, 2013 at 9:24 PM, Anubhav Chattoraj <anubhav....@gmail.com> wrote:

Fixed the Unicode issues and sent pull request.

By using |tee, you’re relying on the shell to handle the encoding issues. That’s bound to end in disaster, unless you understand how your shell handles Unicode (I don’t). So I’ve changed your program to write to the outfile directly.

Thanks very much Anubhav!

(By the way, 250 MiB is an unreasonable size for a single repository, IMO. Consider breaking it up, or moving the dictionaries out of the repo.)

I had not realized that - thanks for letting me know!

Shreevatsa R

Dec 8, 2013, 9:55:26 AM

to sanskrit-programmers

What Anubhav said.

To summarize a few guidelines here in short:

1. When programming in Python, always remain aware of whether a particular object is "unicode" (code points) or "str" (bytes).

[General info: This is like Java's "String" and "byte[]" types. http://stackoverflow.com/a/4385653/4958

Basically, a Unicode code point logically represents a character (like "092E: DEVANAGARI LETTER MA"), independent of any encoding. Python has both "unicode" objects, which are sequences of such code points, and "str" objects, which are the C-like representation of the actual bytes used to represent the characters **in some encoding**. You can read http://www.joelonsoftware.com/articles/Unicode.html or the introduction section at http://docs.python.org/2/howto/unicode.html to get a rough understanding of the issues involved, as I mentioned before on this mailing list here: https://groups.google.com/d/msg/sanskrit-programmers/ggIxk_R88Es/E0S6NklVZtYJ ]

2. One recommendation I've seen is to always use the "unicode" type internally. So:

2a. As soon as you see some input from the external world, decode it immediately [e.g. for a file, the stream of bytes it contains may represent a stream of characters in the 'utf-8' encoding, so decode from the file into 'unicode' characters whenever you read from it], and

2b. whenever you write something to output (even "standard output"), always encode it and write out the actual stream of bytes to the output, so that there can be no confusion.

3. To this end, I've taken to putting

"from __future__ import unicode_literals"

at the top of my Python programs, so that whenever I write a line of code like

s = 'hello world'

it is equivalent to writing

s = u'hello world'

That is, so that all literals are interpreted as Unicode by default. This is the default in Python 3.
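To make the bytes-versus-code-points distinction concrete, here is a small sketch in Python 3, where the two types are called "bytes" and "str" (rather than "str" and "unicode" as in Python 2). The helper name describe is made up for illustration; the example simply decodes on input and encodes on output, as in guidelines 2a and 2b.

```python
import unicodedata


def describe(raw_bytes, encoding="utf-8"):
    """Decode bytes at the boundary, then work only with code points (str)."""
    text = raw_bytes.decode(encoding)  # guideline 2a: decode as soon as input arrives
    return ["%04X: %s" % (ord(ch), unicodedata.name(ch)) for ch in text]


raw = b"\xe0\xa4\xae"                 # three bytes: the UTF-8 encoding of one character
text = raw.decode("utf-8")            # one code point, U+092E
encoded_again = text.encode("utf-8")  # guideline 2b: encode only when writing out
```

Here len(raw) is 3 while len(text) is 1, which is exactly the kind of difference that causes confusion when the two types get mixed.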


विश्वासो वासुकिजः (Vishvas Vasuki)

Dec 8, 2013, 2:31:21 PM

to sanskrit-p...@googlegroups.com

On Sat, Dec 7, 2013 at 4:58 PM, विश्वासो वासुकिजः (Vishvas Vasuki) <vishvas...@gmail.com> wrote:

I am trying to find common prefixes between two devanAgarI strings I read from two columns of a utf-8 csv file. (Where I am headed: get common suffixes of pAda-s of ardhasama-vRtta-s to facilitate memorization and appreciation.)

Here is the result, for the curious.

Here is a problem I want to solve next:

For a given sama-vRtta, find other sama-vRtta-s that are close to it.

Closeness between two such vRtta strings could be measured by the combined length of their maximal common substrings, chosen so that they don't overlap in either string.
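One rough way to approximate this measure with the standard library is difflib.SequenceMatcher, whose matching blocks never overlap in either string. This is only a sketch under assumptions: the function name closeness is made up, and the blocks difflib returns always appear in the same order in both strings and are found greedily, so the score can be lower than the measure described above when common pieces occur in a different order.

```python
from difflib import SequenceMatcher


def closeness(a, b):
    """Sum of the lengths of the non-overlapping matching blocks difflib finds in a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # get_matching_blocks() ends with a zero-length sentinel, which adds nothing to the sum.
    return sum(block.size for block in matcher.get_matching_blocks())
```

For instance, with a hypothetical encoding of a vRtta as a string of "G" (guru) and "L" (laghu), closeness("GLGGL", "GLGGL") is 5 and closeness("GGG", "LLL") is 0. The same function works unchanged on devanAgarI strings, since it compares code points.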

विश्वासो वासुकिजः (Vishvas Vasuki)

Dec 8, 2013, 2:32:18 PM

to sanskrit-p...@googlegroups.com

I am grateful to you both, Shreevatsa and Anubhav!

I have collected your suggestions here: https://sites.google.com/site/sanskritcode/tutorials/python-unicode

JAGANADH G

Dec 8, 2013, 2:58:09 PM

to sanskrit-p...@googlegroups.com

Hi

Refer http://css.dzone.com/articles/python-27-csv-files-unicode

Best regards

Jagan

**********************************
JAGANADH G
http://jaganadhg.in
ILUGCBE
http://ilugcbe.org.in

Mārcis Gasūns

Dec 9, 2013, 1:02:03 AM

to sanskrit-p...@googlegroups.com

Namaste,

दा दा द द दा द दा द दा
दा दा द द दा द दा द दा
दा द दा दा द दा दा द दा

It's not how I'm used to writing suffixes, but anyway.

1815 vi 8210 viṃśa
1199 sa 6582 sa
2 ā 5970 ā
1010 pra 4544 pra
1523 su 3823 suūti
1202 saṃ 3306 saṃkakṣa
941 pari 1941 pari
789 ni 1908 ni
149 an 1766 anakṣ
1727 upa 1716 upa
261 anu 1696 anu
1065 prati 1567 prati
1011 prā 1516 prāṃśu
628 ava 1490 ava

Here is how we count Sanskrit prefixes, and in the same manner Sanskrit suffixes, in .xls:

https://www.dropbox.com/s/rdknwcqige4nk9c/Woerterbuch-Praefix_250000_071213_v2.xlsx

See the formula, though I am not sure whether it will be of any help; I am not good at either Python or Perl.

On Sunday, 8 December 2013 23:58:09 UTC+4, ജഗന്നാഥ്/जगन्नाथ् जि wrote:

Refer http://css.dzone.com/articles/python-27-csv-files-unicode

Shreevatsa R

Dec 15, 2013, 2:28:31 AM

to sanskrit-programmers

More information on the Unicode handling (multiple answers are good, not just the top ones):

http://stackoverflow.com/questions/4545661/unicodedecodeerror-when-redirecting-to-file

http://stackoverflow.com/questions/492483/setting-the-correct-encoding-when-piping-stdout-in-python

https://wiki.python.org/moin/PrintFails

If you're using 'unicode' internally everywhere (which I find best), then

* You can write a wrapper function and always use it instead of print:

def Print(u):
    assert isinstance(u, unicode)
    print(u.encode('utf-8'))

* Or, if you're sure that you'll only ever need to read UTF-8 encoded input, and will only ever need to write UTF-8 encoded output (this may not be a safe assumption, if your module/library is used by other programs), then you can even do this:

import codecs
import locale
import sys

sys.stdin = codecs.getreader('utf-8')(sys.stdin)
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

and just use print directly to sys.stdout.

(Module "locale" was included above because some people like to use 'locale.getpreferredencoding()' in the above instead of 'utf-8'.)

In Python 3, you don't need to do this, as sys.stdout is in text-mode by default, and you can directly print unicode strings to it.

http://stackoverflow.com/questions/4374455/how-to-set-sys-stdout-encoding-in-python-3

(You could use sys.stdout.detach() on the right-hand-side, to force sys.stdout to be in binary mode, wrapped in a codec that encodes unicode to utf-8, but that doesn't seem preferable.)
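As a sketch of that last option in Python 3: a binary stream can be wrapped in io.TextIOWrapper so that everything written to it is encoded as UTF-8 regardless of the locale. The helper name utf8_writer is made up for illustration, and io.BytesIO stands in here for sys.stdout.buffer (or the result of sys.stdout.detach()) so that the example can be run anywhere.

```python
import io


def utf8_writer(binary_stream):
    """Wrap a binary stream so that str written to it is encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", newline="")


# In-memory demonstration; with a real program one would pass sys.stdout.buffer instead.
sink = io.BytesIO()
out = utf8_writer(sink)
out.write("\u092e\n")  # U+092E DEVANAGARI LETTER MA
out.flush()
```

On Python 3.7 and later, calling sys.stdout.reconfigure(encoding="utf-8") should achieve the same effect for standard output without any wrapping.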

Mārcis Gasūns

Jan 20, 2014, 1:14:41 PM

to sanskrit-p...@googlegroups.com

This may be of interest: "Top Seven Tips for Processing 'Foreign' Text in Python (2.7)", http://quantifyingmemory.blogspot.co.uk/2013/11/top-seven-tips-for-processing-foreign.html
