Question regarding handling of Unicode data in Devnagari

joy99

Sep 12, 2009, 1:00:31 PM

Dear Group,

As per the Unicode standard for the Devnagari script, used for Hindi
and some other languages of India, we have a standard set in the
range 0900-097F, with a number for each character,
like 0904 for Devnagari letter short a.
Now, if I write a program,

where
ch="0904"
and I would like to see the Devnagari letter short a as output, then how
should I proceed? Can codecs help me, or should I use unicodedata?

If you can kindly help me.

Best Regards,
Subhabrata.

MRAB

Sep 12, 2009, 2:17:40 PM

to pytho...@python.org

That number is hexadecimal, so the character/codepoint is unichr(int(ch,
16)) in Python 2.x.
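
For anyone reading this on Python 3, where unichr() is gone and chr() handles every code point, the same conversion could be sketched roughly as below. The helper name hex_to_char is made up for illustration and is not part of any library.

```python
import unicodedata


def hex_to_char(hex_code):
    """Turn a hexadecimal code point string such as "0904" into its character.

    hex_to_char is a hypothetical helper name used only for this sketch.
    """
    return chr(int(hex_code, 16))


ch = "0904"
c = hex_to_char(ch)
# Print the code point and its official name (both plain ASCII).
print("U+%04X" % ord(c), unicodedata.name(c))
```

Calling print(c) shows the letter itself, provided the terminal's encoding and font can handle Devanagari.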

Mark Tolonen

Sep 12, 2009, 5:23:11 PM

to pytho...@python.org

"joy99" <subhakol...@gmail.com> wrote in message
news:fade868b-6a69-4b74...@p10g2000prm.googlegroups.com...

> Dear Group,
>
> As per the Unicode standard for the Devnagari script, used for Hindi
> and some other languages of India, we have a standard set in the
> range 0900-097F, with a number for each character,
> like 0904 for Devnagari letter short a.
> Now, if I write a program,
>
> where
> ch="0904"
> and I would like to see the Devnagari letter short a as output, then how
> should I proceed? Can codecs help me, or should I use unicodedata?

Here are a number of ways to generate a Unicode character. Displaying them
is another matter. My newsreader program could display them properly, but
the interactive window in my Python editor could not.

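import unicodedata  # needed for unicodedata.name()

c = unichr(0x904)  # from the integer code point
print c, unicodedata.name(c)
print u'\N{DEVANAGARI LETTER SHORT A}'  # by character name
print u'\u0904'  # by \u escape
print u''.join(unichr(c) for c in range(0x900, 0x980))  # the whole block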

OUTPUT
ऄ DEVANAGARI LETTER SHORT A
ऄ
ऄ
ऀँंःऄअआइईउऊऋऌऍऎएऐऑऒओऔकखगघङचछजझञटठडढणतथदधनऩपफबभमयरऱलळऴवशषसहऺऻ़ऽािीॉॊोौ्ॎॏॐक़ख़ग़ज़ड़ढ़फ़य़ॠॡॢॣ।॥०१२३४५६७८९॰ॱॲॳॴॵॶॷॸॹॺॻॼॽॾॿ
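
If it helps, the block listing above can also be paired with character names. The following is only a rough Python 3 sketch (chr() replaces unichr() there); named_chars is a hypothetical helper name, and how many entries it returns depends on the Unicode version bundled with the interpreter.

```python
import unicodedata


def named_chars(start, stop):
    """Map each assigned code point in [start, stop) to its Unicode name.

    named_chars is a hypothetical helper name used only for this sketch.
    """
    table = {}
    for cp in range(start, stop):
        # unicodedata.name() returns the default (None) for unassigned code points.
        name = unicodedata.name(chr(cp), None)
        if name is not None:
            table[cp] = name
    return table


devanagari = named_chars(0x900, 0x980)
short_a = "\N{DEVANAGARI LETTER SHORT A}"  # same character as "\u0904"
print("U+%04X %s" % (ord(short_a), devanagari[ord(short_a)]))
```

The last line prints only ASCII, so it works even where the terminal cannot render Devanagari.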

If you use an editor that can write Devnagari and save in an encoding such
as UTF-8, you can write Devnagari directly in the editor. You only need to
tell Python what encoding the source code is in. To actually see the correct
characters, you'll also need a terminal that can display them, and you need
to know which encoding it uses. For example, below is a program written using
Pythonwin from the pywin32 extensions (version 214). It can save programs
in most encodings and its interactive window supports UTF-8.

I can type Chinese and my fonts support it so I'll use that in this example.
This message is sent in UTF-8 so hopefully it displays properly for you.

# coding: gbk
encoded_text = '你好！你在干什么？'
Unicode_text = u'你好！你在干什么？'
print encoded_text
print encoded_text.decode('gbk')
print Unicode_text
print Unicode_text.encode('utf-8')

OUTPUT:
ţۃáţ՚ىʲôÿ
你好！你在干什么？
你好！你在干什么？
你好！你在干什么？

'encoded_text' is a byte string encoded in the encoding the file is saved in
(*not* what the #coding line declares... *you* have to make sure they agree!).
Since my terminal is UTF-8, the gbk-encoded line is garbage.

The 2nd line should be correct because it decoded the byte string to
Unicode. 'print' will automatically encode Unicode text in the terminal's
encoding. As long as the terminal's encoding and font support the Unicode
characters used (which in Pythonwin they do), the line will be correct.

The 3rd line works for the same reason the 2nd line does: the string is
already Unicode.

The 4th line works because it was explicitly encoded into UTF-8, and the
terminal supports it.
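
For comparison, here is a rough sketch of the same idea in Python 3, where every string literal is already Unicode and bytes are a separate type. It assumes the source file is saved as UTF-8 (the Python 3 default), and the variable names are made up for the example.

```python
# Python 3 sketch: str is Unicode text, bytes hold encoded data.
text = "你好！你在干什么？"

gbk_bytes = text.encode("gbk")      # 2 bytes per character here
utf8_bytes = text.encode("utf-8")   # 3 bytes per character here

# Decoding with the codec that produced the bytes recovers the text.
round_trip = gbk_bytes.decode("gbk")

# Decoding with the wrong codec gives garbage, like the first output line above.
garbled = gbk_bytes.decode("utf-8", errors="replace")

print(len(text), len(gbk_bytes), len(utf8_bytes))
print(round_trip == text, garbled == text)
```

Printing text itself still relies on the terminal's encoding and font, exactly as described for the 2nd and 3rd lines.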

I hope this is useful to you.
-Mark