Help wanted: displaying Japanese characters in a GUI using Qt and Python

prats

Apr 19, 2006, 1:43:23 PM

I want to write a GUI application in Python using Qt, with PyQt as the
wrapper. The application has to accept Japanese characters. I am able to
take input in Japanese, but I am unable to display it back in the GUI:
it shows junk characters. Can anyone suggest how to debug the issue?

The code used for transferring data from the view to the document is:

"
codec = QTextCodec.codecForName('ISO-2022-JP')
encoded_string = codec.fromUnicode( string )
return str(encoded_string)
"

Here string is a QString object containing the data from the view.
I think encoded_string is a QCString object and contains the
Unicode-coded characters of the Japanese string entered in the GUI?

How do I display the data from the document back in the view?

I would be really grateful if somebody could help me with this.

Regards,
Pratik

David Boddie

Apr 19, 2006, 3:33:33 PM

[Posting via Google's web interface again and hoping that double
newlines will prevent insane concatenation of lines...]

prats wrote:

> I want to write a GUI application in PYTHON using QT. This application
> is supposed to take in Japanese characters. I am using PyQt as the
> wrapper for using QT from python. I am able to take input in japanese.
> But I am unable to display them back to GUI. It displays some junk
> characters Can anyone suggest me some way how to debug the issue.

> The code used for tranferring data from view to document is:

> "
> codec = QTextCodec.codecForName('ISO-2022-JP')
> encoded_string = codec.fromUnicode( string )
> return str(encoded_string)
> "

> here string is QString object containing the data from the view.
> I think the encoded_string is a QCString object and contains the
> unicode coded characters of the japanese string given in the GUI?

Actually, it contains the original text in the ISO-2022-JP encoding and
not a unicode representation. You're just storing an anonymous sequence
of characters in your encoded_string variable which you then return. Any
user interface element that receives these later on has to guess which
encoding is used to represent the text, and it sounds like it can't do
that.
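
A minimal sketch of that distinction, assuming Python 3 and its standard
codecs rather than the PyQt QTextCodec call from the original post (the
sample string is hypothetical, chosen only for illustration): encoding
produces raw ISO-2022-JP bytes, and those bytes only become text again
when they are decoded with the same codec.

```python
# Sketch only: Python 3's standard codecs stand in for QTextCodec here.
# Hypothetical sample text: three Japanese characters ("nihongo").
text = "\u65e5\u672c\u8a9e"

# Encoding yields bytes, not text. ISO-2022-JP is a 7-bit encoding that
# switches character sets with escape sequences.
encoded = text.encode("iso2022_jp")
print(type(encoded).__name__)
print(encoded)

# Only decoding with the matching codec recovers the original text.
decoded = encoded.decode("iso2022_jp")
print(decoded == text)
```

A widget that is handed `encoded` without being told the codec is in the
position described above: it has to guess.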

> how am I going to display the data back to the view from document.

If you're using the text in the GUI, you shouldn't need to pass it
through the codec at all. It should be possible to display the original
string in any widget that can display text. Keep the text in a QString
and it should just work.

David

prats

Apr 20, 2006, 4:06:25 AM

No, I need to replace the text given by the user in the GUI with new
text that is already in the ISO-2022-JP encoding, and then redisplay
this new text. Let me explain in detail. I have a text file (say) whose
contents are base64 encoded and use the ISO-2022-JP charset. I want to
display this data in the GUI.
What I did first was to read in the text and then decode it using the
'decodestring' function of Python's base64 module.
"
import base64
decoded_string = base64.decodestring(encoded_string)
"
Here encoded_string is the text that was read from the file.

How do I display this decoded_string in the GUI?

~pratik

prats

Apr 20, 2006, 6:52:09 AM

Hi all,
this is a continuation of my previous post.
The text I want to display is (in base64 encoding):
"
SW4gdGhpcyBzYW1wbGUsIGUtbWFpbCB0aXRsZSBhbmQgdGV4dCBhcmUgd3JpdHRlbiBpbiBKYXBh
bmVzZS4gDQpPdXIgbGFuZ3VhZ2UgaGFzIHRocmVlIHR5cGVzIGNhbGxlZCCBZ0thdGFrYW5hgWgs
IIFnSGlyYWdhbmGBaCBhbmQgDQqBZ0thbmppgWguIA0KVGhpcyBlLW1haWwgY29udGFpbnMgYWxs
IHRoZSB0eXBlcy4gDQqDQ4NOg1aDQYLNgmiCd4Jrgm6CYIJjglKBRIJQgk+CyYLEg4GBW4OLi0CU
XILwIA0Kk/qWe4zqkc6JnoKzgrmC6YLngrWCooLFgreC5oFCIA0KgqCCooKkgqaCqIFBg0GDQ4NF
g0eDSSANCoKpgquCrYKvgrGBQYNKg0yDToNQg1IgDQqCs4K1greCuYK7gUGDVINWg1iDWoNcIA0K
gr2Cv4LCgsSCxoFBg16DYINjg2WDZyANCoLIgsmCyoLLgsyBQYNpg2qDa4Nsg20gDQqCzYLQgtOC
1oLZgUGDboNxg3SDd4N6IA0KgtyC3YLegt+C4IFBg32DfoOAg4GDgiANCoLiguSC5oFBg4aDhoOI
IA0KgueC6ILpguqC64FBg4mDioOLg4yDjSANCoLtgvCC8YFBg4+DSYOTIA0KgmCCYYJigmOCZIJl
gmaCZ4JogmmCaoJrgmyCbYJugm+CcIJxgnKCc4J0gnWCdoJ3gniCeSANCoKBgoKCg4KEgoWChoKH
goiCiYKKgouCjIKNgo6Cj4KQgpGCkoKTgpSClYKWgpeCmIKZgpogDQqBQoFBgWmBaoGBgXuBW4GW
gY+BdYF2gUmBlIGQgZOBlYFggYSBg4GbgX6BooGggZmB9CANCpOMi56Tc49hkkqL5pHjgViW2IJS
gXyCUYJUgXyCUiANCoKggqKCqIKikbmV25BWj2iDcoOLglCCVYpLIA0Kg0ODToNWg0GKlI6uie+O
0CANCg==
"

This text contains both English and Japanese characters, i.e. a few
English characters first, followed by some Japanese characters.

The decoded_string variable contains the first few English characters
and then all junk characters, which I guess are the "ISO-2022-JP"
encoded characters. But how do I get those Japanese characters back in
a format in which they are properly displayed in the GUI? Do I need to
change any system settings for that purpose? I am using Windows XP and
I have Japanese fonts installed on my PC. I have also set the default
font to a Japanese one.

Please help me with this.
~pratik

Serge Orlov

Apr 20, 2006, 7:43:44 AM

prats wrote:
> Hi all,
> this is in continuation to my previous post.
> The text I want to display is (in base64 encoding):

> This text contains both english and japanese characters i.e first few

> english characters followed by some japanese characters.
>
> the decoded_string variable contains the first few english characters
> and then all junk characters, which I guess are the "ISO-2022-JP"
> encoded characters.

Your guess is wrong. Save your data in a file:

"
import base64
bytes = base64.decodestring(encoded_string)
f = open("jp.txt","wb")
f.write(bytes)
f.close()
"
Start Firefox, set View->Encoding->Auto-detect->Japanese and open
jp.txt. Now open the View->Encoding menu and see that your data is
encoded in the shift-jis encoding. To work with non-ASCII characters you
need to convert your text to unicode:

text = bytes.decode("shift-jis")

That's it. As David already said, you need to keep your text in
unicode.
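
As a rough sketch of that step in Python 3 (an assumption; the thread
uses Python 2, where base64.decodestring and the print statement were
current), the following decodes the first three lines of the base64 text
posted above and converts the bytes to text with the Shift-JIS codec.
The names ENCODED and decode_sample are made up for this illustration.

```python
import base64

# First three lines of the base64 text quoted earlier in this thread.
ENCODED = (
    "SW4gdGhpcyBzYW1wbGUsIGUtbWFpbCB0aXRsZSBhbmQgdGV4dCBhcmUgd3JpdHRlbiBpbiBKYXBh"
    "bmVzZS4gDQpPdXIgbGFuZ3VhZ2UgaGFzIHRocmVlIHR5cGVzIGNhbGxlZCCBZ0thdGFrYW5hgWgs"
    "IIFnSGlyYWdhbmGBaCBhbmQgDQqBZ0thbmppgWguIA0KVGhpcyBlLW1haWwgY29udGFpbnMgYWxs"
)


def decode_sample(encoded=ENCODED, encoding="shift_jis"):
    """Undo the base64 transfer encoding, then decode the bytes to text."""
    raw = base64.b64decode(encoded)   # bytes in some 8-bit charset
    return raw.decode(encoding)       # text (str), safe to hand to a GUI


text = decode_sample()
print(text)
```

Calling decode_sample(encoding="iso2022_jp") on the same data raises
UnicodeDecodeError, which matches the observation that the declared
charset did not work.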

> Do I need to change any system settings for that purpose.
> I am using windows XP and I have Japanese fonts installed
> in my PC. I have also set the default font as japanese.

AFAIK you _only_ need to turn on "Install files for Asian languages" in
regional settings. You don't need to mess with the default font. The
following code works perfectly in IDLE on Windows XP English edition:

"
import base64
bytes = base64.decodestring(encoded_string)
print bytes.decode("shift-jis")
"

prats

Apr 20, 2006, 9:20:01 AM

I think I could not make myself clear. I have a GUI written in Python
and Qt, with PyQt as the Python wrapper for Qt. Now I have a string
which is base64 encoded. This string contains both Japanese and English
characters. I need to decode them and display them properly in the GUI,
i.e. with both English and Japanese characters.
I need a way to display them. The Qt documentation says that QStrings
are capable of displaying all characters. So I need a way to get a
QString from the base64 encoded string.
~pratik

Serge Orlov

Apr 20, 2006, 9:57:00 AM

prats wrote:
> I think I could not make myself clear.

On the contrary. You've given enough information for me to do what you
want: decoding your text and displaying it in a GUI. The fact that I
used another GUI is not important; read on to see why.

> I have a GUI written in Python
> and Qt and PyQt as the python wrappper fro QT. Now I have a string
> which is base64 encoded. This string contains both japanese and english
> charaters. I need to decode them and display them properly in the GUI
> ie. with both english and japanese characters.

> I need a way to display them. Qt doc says that QStrings are capable of
> displaying all characters.

(nitpick: not displaying but holding) And so is a Python unicode
string. It was introduced more than 5 years ago, if my memory serves me
right. It is the recommended way to hold non-ASCII characters in Python,
and all toolkits are expected to play nicely with it. I would be really
surprised if PyQt didn't work with it.

> So I need a way to get a QString from the
> base64 encoded string.

Why don't you try to use unicode?

prats

Apr 20, 2006, 10:15:49 AM

Sorry, I did not read your point correctly. It works fine. Thanks for
your help.
I have one more query. I was told that the text I was supposed to show
was written using the "ISO-2022-JP" charset, but decoding it with that
charset didn't work. It worked fine with the "shift-jis" encoding. Is
that the default charset used by Python, i.e. would bytes be
"shift-jis" by default?
~pratik

John Machin

Apr 20, 2006, 11:13:01 AM

to prats

On 20/04/2006 8:15 PM, prats wrote:
> sorry I did not correctly read your point. I works fine. Thanks for
> your help.
> I have one more query. It was said that the text I was supposed to show
> was written using "ISO-2022-JP" charset.

Where more than one encoding is in use for a language, some people just
guess. I've seen this with ASCII/EBCDIC and GB[K]/Big5.

> But It didn't when I decoded
> it using that charset. But it worked fine with the "shift-jis"
> encoding. Is it the default charset used by python i.e. I mean to say
> bytes would be by default "shift-jis"?

That may be Ruby's default, although I doubt it. Python was originally
written in Old High Dutch, but PEP 0.0001 did away with the ij ligature
so that Python could be expressed in ASCII, which has been the default
encoding ever since.

Serge Orlov

Apr 20, 2006, 11:27:25 AM

No, the default charset in Python is ASCII. There is no absolutely
reliable way to find out the encoding of arbitrary bytes. But if you
have more than ten bytes and you know some properties of the text (for
example, you're sure your text contains only English and Japanese), then
the first thing you can do is to rule out invalid encodings:

def valid_en_jp_encodings(bytes):
    try:
        bytes.decode("ascii")
        return ["ascii"]
    except UnicodeDecodeError:
        pass
    encodings = "utf-8", "shift-jis", "iso-2022-jp", "euc-jp"
    valid = []
    for encoding in encodings:
        try:
            bytes.decode(encoding)
            valid.append(encoding)
        except UnicodeDecodeError:
            pass
    return valid

If this function returns a list with only one item, you're lucky. If it
returns more than one item, things get more complicated. You can
try to use http://chardet.feedparser.org/ to guess the encoding, or you
can present the list of valid encodings to the user and let him/her make
a choice. There is also a possibility that this function returns an
empty list; you will need to display an error message in that case.
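
To make the "one item" and "more than one item" cases concrete, here is
a small sketch in Python 3 (an assumption; the thread itself is Python
2). It is a variant of the idea above, not the function as posted: the
name decodable_encodings and the CANDIDATES tuple are hypothetical, the
codecs are spelled with Python's own names, and ASCII is treated as just
another candidate instead of an early return.

```python
import base64

# Candidate codecs, spelled with Python's own codec names.
CANDIDATES = ("ascii", "utf_8", "shift_jis", "iso2022_jp", "euc_jp")


def decodable_encodings(data, candidates=CANDIDATES):
    """Return every candidate codec that decodes `data` without an error."""
    valid = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        valid.append(name)
    return valid


# Second line of the base64 text quoted earlier in this thread.
sample = base64.b64decode(
    "bmVzZS4gDQpPdXIgbGFuZ3VhZ2UgaGFzIHRocmVlIHR5cGVzIGNhbGxlZCCBZ0thdGFrYW5hgWgs"
)
print(decodable_encodings(sample))  # ['shift_jis']

# Hypothetical counter-example: genuine ISO-2022-JP bytes are all 7-bit,
# so several codecs, plain ASCII included, accept them without error.
seven_bit = "\u65e5\u672c\u8a9e".encode("iso2022_jp")
print(decodable_encodings(seven_bit))
```

The second call shows why a successful decode is only evidence, not
proof: data that really is ISO-2022-JP also passes an ASCII check, so
the result is best read as a list of candidates to narrow down further.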

David Boddie

Apr 20, 2006, 12:36:49 PM

Out of interest, I've written some code to show your example text and
added it to the PyQt Wiki:

http://www.diotavelli.net/PyQtWiki/Decoding_Japanese_Text

I used the codec for Shift-JIS to obtain a unicode representation of
the string, as Serge suggested.

David

prats

Apr 20, 2006, 12:37:00 PM

The text I got was from an Outlook message. The relevant snippet of the
mail message is:
"
Content-Type: text/plain;
charset="iso-2022-jp"
Content-Transfer-Encoding: base64

SW4gdGhpcyBzYW1wbGUsIGUtbWFpbCB0aXRsZSBhbmQgdGV4dCBhcmUgd3JpdHRlbiBpbiBKYXBh
bmVzZS4gDQpPdXIgbGFuZ3VhZ2UgaGFzIHRocmVlIHR5cGVzIGNhbGxlZCCBZ0thdGFrYW5hgWgs
IIFnSGlyYWdhbmGBaCBhbmQgDQqBZ0thbmppgWguIA0KVGhpcyBlLW1haWwgY29udGFpbnMgYWxs
IHRoZSB0eXBlcy4gDQqDQ4NOg1aDQYLNgmiCd4Jrgm6CYIJjglKBRIJQgk+CyYLEg4GBW4OLi0CU
XILwIA0Kk/qWe4zqkc6JnoKzgrmC6YLngrWCooLFgreC5oFCIA0KgqCCooKkgqaCqIFBg0GDQ4NF
g0eDSSANCoKpgquCrYKvgrGBQYNKg0yDToNQg1IgDQqCs4K1greCuYK7gUGDVINWg1iDWoNcIA0K
gr2Cv4LCgsSCxoFBg16DYINjg2WDZyANCoLIgsmCyoLLgsyBQYNpg2qDa4Nsg20gDQqCzYLQgtOC
1oLZgUGDboNxg3SDd4N6IA0KgtyC3YLegt+C4IFBg32DfoOAg4GDgiANCoLiguSC5oFBg4aDhoOI
IA0KgueC6ILpguqC64FBg4mDioOLg4yDjSANCoLtgvCC8YFBg4+DSYOTIA0KgmCCYYJigmOCZIJl
gmaCZ4JogmmCaoJrgmyCbYJugm+CcIJxgnKCc4J0gnWCdoJ3gniCeSANCoKBgoKCg4KEgoWChoKH
goiCiYKKgouCjIKNgo6Cj4KQgpGCkoKTgpSClYKWgpeCmIKZgpogDQqBQoFBgWmBaoGBgXuBW4GW
gY+BdYF2gUmBlIGQgZOBlYFggYSBg4GbgX6BooGggZmB9CANCpOMi56Tc49hkkqL5pHjgViW2IJS
gXyCUYJUgXyCUiANCoKggqKCqIKikbmV25BWj2iDcoOLglCCVYpLIA0Kg0ODToNWg0GKlI6uie+O
0CANCg==
"

Outlook could show the message properly. Doesn't this mean the encoding
is "iso-2022-jp", or else Outlook could not have decoded it? I am unable
to explain this behaviour.

~Pratik