Need help with unicode handling in python


विश्वासो वासुकिजः (Vishvas Vasuki)

Dec 7, 2013, 7:58:30 PM
to sanskrit-p...@googlegroups.com
+bcc: वाचस्पतिः, श्रीरमणः

I am trying to find common prefixes between two devanAgarI strings I read from two columns of a utf-8 csv file. (Where I am headed: get common suffixes of pAda-s of ardhasama-vRtta-s to facilitate memorization and appreciation.)

I tried running this program with this command:
python ardhasama_common_suffix.py |tee data/ardhasama_prefix.csv
but while I see somewhat recognizable output on the console, I get gibberish in the output file.

I'd appreciate insights, or better yet, elegant Python code that works as intended, or even better, both.

--
Vishvas /विश्वासः

Anubhav Chattoraj

Dec 8, 2013, 12:24:27 AM
to sanskrit-p...@googlegroups.com

Fixed the Unicode issues and sent pull request.

By using |tee, you’re relying on the shell to handle the encoding issues. That’s bound to end in disaster, unless you understand how your shell handles Unicode (I don’t). So I’ve changed your program to write to the outfile directly.

It was almost working anyway: all you needed to do was use unicodecsv.reader instead of csv.reader, and print to the file after encoding in UTF-8. (codecs.open(..,..,'utf-8') handles the encoding automatically.)

Confusingly, it looks like unicodecsv.reader expects the file to be opened in the default way (as plain bytes), not as UTF-8 through codecs.open; it appears to do the decoding itself.
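For anyone reading this later, here is a minimal sketch of the same idea (write to the output file directly, with an explicit encoding) in Python 3, where the standard csv module works on text and unicodecsv is not needed. It is not the code from the pull request. The function names, the file paths, and the assumption that each row has two columns are all mine.

```python
import csv


def common_prefix(a, b):
    """Return the longest common prefix of two text strings, code point by code point."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def write_common_prefixes(in_path, out_path):
    """Read a two-column UTF-8 CSV and write each pair together with its common prefix."""
    # Decode on the way in and encode on the way out, so the shell never
    # has to guess an encoding. Both paths are hypothetical.
    with open(in_path, encoding="utf-8", newline="") as src, \
            open(out_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if len(row) < 2:
                continue  # assumption: rows with fewer than two columns are skipped
            first, second = row[0], row[1]
            writer.writerow([first, second, common_prefix(first, second)])
```

Note that this compares code points, so a prefix can end in the middle of an akshara (for example after a consonant but before its vowel sign). Whether that matters depends on what you want to show.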

To understand Unicode encodings and handling them in Python, see

(By the way, 250 MiB is an unreasonable size for a single repository, IMO. Consider breaking it up, or moving the dictionaries out of the repo.)

विश्वासो वासुकिजः (Vishvas Vasuki)

Dec 8, 2013, 1:47:44 AM
to sanskrit-p...@googlegroups.com
On Sat, Dec 7, 2013 at 9:24 PM, Anubhav Chattoraj <anubhav....@gmail.com> wrote:

> Fixed the Unicode issues and sent pull request.
>
> By using |tee, you’re relying on the shell to handle the encoding issues. That’s bound to end in disaster, unless you understand how your shell handles Unicode (I don’t). So I’ve changed your program to write to the outfile directly.

Thanks very much, Anubhav!

> (By the way, 250 MiB is an unreasonable size for a single repository, IMO. Consider breaking it up, or moving the dictionaries out of the repo.)

I had not realized that - thanks for letting me know!

Shreevatsa R

Dec 8, 2013, 9:55:26 AM
to sanskrit-programmers
What Anubhav said.

To summarize a few guidelines here in short:

1. When programming in Python, always remain aware of whether a particular object is "unicode" (code points) or "str" (bytes).
[General info: This is like Java's "String" and "byte[]" types. http://stackoverflow.com/a/4385653/4958
Basically, a Unicode code point logically represents a character (like "092E: DEVANAGARI LETTER MA"), independent of encoding. Python has both "unicode" objects, which are sequences of such code points, and "str" objects, which are the C-like representation of the actual bytes used to represent the characters **in some encoding**. You can read http://www.joelonsoftware.com/articles/Unicode.html or the introduction section at http://docs.python.org/2/howto/unicode.html to get a rough understanding of the issues involved, as I mentioned before on this mailing list here: https://groups.google.com/d/msg/sanskrit-programmers/ggIxk_R88Es/E0S6NklVZtYJ ]

2. One recommendation I've seen is to always use the "unicode" type internally. So:
2a. As soon as you see some input from the external world, decode it immediately [e.g. for a file, the stream of bytes it contains may represent a stream of characters in the 'utf-8' encoding, so decode from the file into 'unicode' characters whenever you read from it], and 
2b. whenever you write something to output (even "standard output"), always encode it and write out the actual stream of bytes to the output, so that there can be no confusion.

3. To this end, I've taken to putting 
"from __future__ import unicode_literals"
at the top of my Python programs, so that whenever I write a line of code like 
    s = 'hello world'
it is equivalent to writing 
    s = u'hello world'
That is, all string literals are interpreted as Unicode by default. This is already the default in Python 3.
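Guidelines 2a and 2b can be sketched roughly as follows. This is only an illustration, written for Python 3, where string literals are already text, so the unicode_literals import is not needed. The helper names read_text and write_text are made up for this example.

```python
def read_text(path):
    """2a: decode bytes into text as soon as they enter the program."""
    with open(path, "rb") as f:
        return f.read().decode("utf-8")  # assumes the file really is UTF-8


def write_text(path, text):
    """2b: encode explicitly and write the actual bytes on the way out."""
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))


# Guideline 1 in miniature: the same character as text and as bytes.
ma = "म"                        # one code point, U+092E DEVANAGARI LETTER MA
ma_bytes = ma.encode("utf-8")   # its representation in one particular encoding
```

Everything between read_text and write_text then deals only with text, and the encoding is named in exactly two places.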


विश्वासो वासुकिजः (Vishvas Vasuki)

Dec 8, 2013, 2:31:21 PM
to sanskrit-p...@googlegroups.com

On Sat, Dec 7, 2013 at 4:58 PM, विश्वासो वासुकिजः (Vishvas Vasuki) <vishvas...@gmail.com> wrote:

> I am trying to find common prefixes between two devanAgarI strings I read from two columns of a utf-8 csv file. (Where I am headed: get common suffixes of pAda-s of ardhasama-vRtta-s to facilitate memorization and appreciation.)

Here is the result, for the curious.

Here is a problem I want to solve next:
For a given sama-vRtta, find other sama-vRttas that are close to it.

Closeness between two such vRtta strings could be measured by the combined length of maximal common substrings that don't overlap in either vRtta string.
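One rough way to try this measure out is Python's difflib, sketched below. The caveat is that SequenceMatcher picks blocks greedily (longest first) and keeps them in the same order in both strings, so its total can be lower than the best possible under the definition above. The function names, the parameter k, and the idea of writing each vRtta as a plain string of syllable marks are assumptions for illustration.

```python
from difflib import SequenceMatcher


def closeness(a, b):
    """Total length of the non-overlapping common blocks difflib finds in a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # get_matching_blocks() ends with a zero-length sentinel, which adds nothing.
    return sum(block.size for block in matcher.get_matching_blocks())


def closest(target, candidates, k=3):
    """Return the k candidates with the highest closeness to target."""
    ranked = sorted(candidates, key=lambda c: closeness(target, c), reverse=True)
    return ranked[:k]
```

For example, closeness("abcde", "abXde") counts the blocks "ab" and "de" and gives 4.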

विश्वासो वासुकिजः (Vishvas Vasuki)

Dec 8, 2013, 2:32:18 PM
to sanskrit-p...@googlegroups.com
I am grateful, Shreevatsa and Anubhav!

I have collected your suggestions here - https://sites.google.com/site/sanskritcode/tutorials/python-unicode

JAGANADH G

Dec 8, 2013, 2:58:09 PM
to sanskrit-p...@googlegroups.com
**********************************
JAGANADH G
http://jaganadhg.in
ILUGCBE
http://ilugcbe.org.in

Mārcis Gasūns

Dec 9, 2013, 1:02:03 AM
to sanskrit-p...@googlegroups.com
Namaste,

  • दा दा द द दा द दा द दा
  • दा दा द द दा द दा द दा
  • दा द दा दा द दा दा द दा
It's not how I'm used to writing suffixes, but anyway.

  • 1815 vi 8210 viṃśa
  • 1199 sa 6582 sa
  • 2 ā 5970 ā
  • 1010 pra 4544 pra
  • 1523 su 3823 suūti
  • 1202 saṃ 3306 saṃkakṣa
  • 941 pari 1941 pari
  • 789 ni 1908 ni
  • 149 an 1766 anakṣ
  • 1727 upa 1716 upa
  • 261 anu 1696 anu
  • 1065 prati 1567 prati
  • 1011 prā 1516 prāṃśu
  • 628 ava 1490 ava
Here is how we count Sanskrit prefixes, and in the same manner Sanskrit suffixes, in .xls.
See the formula, though I'm not sure it will be of any help; I'm not good at either Python or Perl.

On Sunday, 8 December 2013 23:58:09 UTC+4, ജഗന്നാഥ്/जगन्नाथ् जि wrote:

Shreevatsa R

Dec 15, 2013, 2:28:31 AM
to sanskrit-programmers
More information on the Unicode handling (multiple answers are good, not just the top ones):

If you're using 'unicode' internally everywhere (which I find best), then 

* You can write a wrapper function and always use it instead of print:

def Print(u):
  # Python 2: insist on a unicode object, then write explicit UTF-8 bytes.
  assert isinstance(u, unicode)
  print(u.encode('utf-8'))

* Or, if you're sure that you'll only ever need to read UTF-8 encoded input, and will only ever need to write UTF-8 encoded output (this may not be a safe assumption, if your module/library is used by other programs), then you can even do this:

import codecs
import locale
import sys
# Python 2: wrap the byte streams so reads decode from UTF-8 and writes encode to UTF-8.
sys.stdin = codecs.getreader('utf-8')(sys.stdin)
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

and just use print directly to sys.stdout.
(Module "locale" was included above because some people like to use 'locale.getpreferredencoding()' in the above instead of 'utf-8'.)

In Python 3, you don't need to do this, as sys.stdout is in text-mode by default, and you can directly print unicode strings to it.
(You could use sys.stdout.detach() on the right-hand-side, to force sys.stdout to be in binary mode, wrapped in a codec that encodes unicode to utf-8, but that doesn't seem preferable.)
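To see what such a codec wrapper does without touching the real sys.stdout, here is a small Python 3 sketch that uses in-memory byte streams as stand-ins (that substitution is my assumption, made so the example has no side effects). With the real stream, the right-hand side would be sys.stdout.detach() instead of the BytesIO object, as described above.

```python
import codecs
import io

raw_out = io.BytesIO()                       # stands in for a binary-mode stdout
out = codecs.getwriter("utf-8")(raw_out)     # accepts text, writes UTF-8 bytes
out.write("म\n")
encoded = raw_out.getvalue()                 # the bytes that actually went out

raw_in = io.BytesIO(encoded)                 # stands in for a binary-mode stdin
decoded = codecs.getreader("utf-8")(raw_in).read()   # bytes in, text out
```

Here encoded holds the three UTF-8 bytes of the letter plus the newline, and decoded is the original text again.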


Mārcis Gasūns

Jan 20, 2014, 1:14:41 PM
to sanskrit-p...@googlegroups.com

Top Seven Tips for Processing 'Foreign' Text in Python (2.7)
