base64 and unicode

EuGeNe Van den Bulke

May 4, 2007, 3:06:01 AM
Hi,

I am trying to convert the attached file hebrew.b64 into hebrew.lang
(a text file usable by Inline Search <http://www.ieforge.com/InlineSearch>)
for localization purposes.

>>> import base64
>>> base64.decode(file("hebrew.b64","r"),file("hebrew.lang","w"))

It runs but the result is not correct: some of the lines in hebrew.lang
are correct but not all of them (hebrew.expected.lang is the correct
file). I guess it is a unicode problem but can't seem to find out how to
fix it.

---- hebrew.b64 = file to convert ----

//4jACAARQBuAGcAbABpAHMAaAAgAHYAIAAxAC4ANAANAAoAMQA6AOIF0QXoBdkF6gUNAAoA
DQAKADEAMAAxADoA0QXZBdgF1QXZBSAA3AXQBSAA4AXeBeYF0AUNAAoAMQAwADIAOgDUBdIF
2QXiBSAA3AXhBdUF4wUgANQF0wXjBSwAIADeBd4F6QXZBdoFIADeBegF0AXpBSAA1AXTBeMF
DQAKADEAMAAzADoA0gXoBeEF1AUgANcF0wXpBdQFIADpBdwFIABJAG4AbABpAG4AZQAgAFMA
ZQBhAHIAYwBoACAA4AXeBeYF0AXUBS4AIADcBdcF5QUgAOIF3AUgACIA2wXfBSIAIADbBdMF
2QUgANwF4gXRBdUF6AUgANwF0wXjBSAA1AXUBdUF6AXTBdQFLgANAAoAMQAwADQAOgDZBekF
IADcBdoFIADQBeoFIADUBdIF6AXhBdQFIADUBdAF1wXoBdUF4AXUBSAA6QXcBSAASQBuAGwA
aQBuAGUAIABTAGUAYQByAGMAaAAuAA0ACgAxADAANQA6AN4F5gXQBSAAOgANAAoAMQAwADYA
OgDUBeoF0AXdBSAA6AXZBekF2QXVBeoFDQAKADEAMQAxADoA3gXmBdAFIADQBeoFIADUBdEF
0AUNAAoAMQAxADIAOgDeBeYF0AUgANAF6gUgANQF5wXVBdMF3QUNAAoAMQAxADMAOgDUBdMF
0gXpBSAA1AXbBdwFDQAKAA0ACgAjACAATQBlAG4AdQANAAoAMwAyADcANgA4ADoA0AXVBdMF
1QXqBQ0ACgAzADIANwA2ADkAOgDRBdMF1QXnBSAA0AXdBSAA5wXZBdkF3QUgAOIF0wXbBdUF
3wUNAAoAMwAyADcANwAwADoA1AXqBdAF3QUgANAF2QXpBdkF6gUuAC4ALgANAAoADQAKACMA
IABPAHAAdABpAG8AbgAgAGQAaQBhAGwAbwBnAA0ACgAxADAANwA6ANQF6gXQBd0FIADQBdkF
6QXZBeoFIADQBeoFIABJAG4AbABpAG4AZQAgAFMAZQBhAHIAYwBoAA0ACgAxADAAOAA6AOkF
5AXUBQ0ACgAxADAAOQA6ANEF1wXoBSAA0AXqBSAA1AXpBeQF1AUgANQF3gXVBeIF0wXkBeoF
IADiBdwF2QXaBSAAOgANAAoAMQAxADAAOgDpBdkF4AXVBdkF2QXdBSAA0QXpBeQF1AUgANkF
1QXkBdkF4gXVBSAA0QXUBeQF4gXcBdQFIADUBdEF0AXUBSAA6QXcBSAASQBuAHQAZQByAG4A
ZQB0ACAARQB4AHAAbABvAHIAZQByAA0ACgA=

---- hebrew.expected.lang = expected output ----
# English v 1.4
1:עברית

101:ביטוי לא נמצא
102:הגיע לסוף הדף, ממשיך מראש הדף
103:גרסה חדשה של Inline Search נמצאה. לחץ על "כן" כדי לעבור לדף ההורדה.
104:יש לך את הגרסה האחרונה של Inline Search.
105:מצא :
106:התאם רישיות
111:מצא את הבא
112:מצא את הקודם
113:הדגש הכל

# Menu
32768:אודות
32769:בדוק אם קיים עדכון
32770:התאם אישית...

# Option dialog
107:התאם אישית את Inline Search
108:שפה
109:בחר את השפה המועדפת עליך :
110:שינויים בשפה יופיעו בהפעלה הבאה של Internet Explorer

Could someone enlighten me on how to go from hebrew.b64 to
hebrew.expected.lang?

Thanks a lot,

EuGeNe -- http://www.3kwa.com

Duncan Booth

May 4, 2007, 3:36:31 AM
EuGeNe Van den Bulke <eugene.va...@gmail.com> wrote:

> >>> import base64
> >>> base64.decode(file("hebrew.b64","r"),file("hebrew.lang","w"))
>
> It runs but the result is not correct: some of the lines in hebrew.lang
> are correct but not all of them (hebrew.expected.lang is the correct
> file). I guess it is a unicode problem but can't seem to find out how to
> fix it.

My guess would be that your problem is that you wrote the file in text
mode, so (assuming you are on Windows) all newline characters in the output
are converted to carriage return/linefeed pairs. However, the decoded text
looks as though it is utf16 encoded, so it should be written as binary,
i.e. the output mode should be "wb".
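
To make the text-mode problem concrete, here is a minimal sketch in Python 3 (the thread itself uses Python 2.5). It simulates the Windows newline translation with a plain byte replacement, so it behaves the same on any platform; the three-line sample text is made up for illustration.

```python
# Simulate what Windows text mode does on write: every "\n" byte becomes
# "\r\n", regardless of what the surrounding bytes mean.
text = "A\r\nB\r\nC\r\n"                  # made-up sample, not the real file
data = text.encode("utf-16-le")          # two bytes per character
damaged = data.replace(b"\n", b"\r\n")   # one stray byte per newline

# errors="replace" keeps the demo from raising on the odd byte count.
garbled = damaged.decode("utf-16-le", errors="replace")
print(repr(garbled))
```

Each stray byte shifts the two-byte alignment by one, and the next stray byte shifts it back, so in this sketch the line "B" is lost while "A" and "C" survive. That pattern would be consistent with only some lines of hebrew.lang looking right, though that is an inference rather than something the thread confirms.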

Simpler than using the base64 module you can just use the base64 codec.
This will decode a string to a byte sequence and you can then decode that
to get the unicode string:

from __future__ import with_statement  # required on Python 2.5
with file("hebrew.b64","r") as f:
    text = f.read().decode('base64').decode('utf16')

You can then write the text to a file through any desired codec or process
it first.
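
For readers on Python 3, where the `file` builtin and the `'base64'` string codec used above are no longer available, a rough equivalent of the two decoding steps might look like the sketch below. The sample string is the first 48 characters of hebrew.b64 from the original post; the variable names are only illustrative.

```python
import base64

# First 48 characters of hebrew.b64 as posted at the top of the thread.
sample_b64 = "//4jACAARQBuAGcAbABpAHMAaAAgAHYAIAAxAC4ANAANAAoA"

raw = base64.b64decode(sample_b64)   # base64 text -> raw bytes
text = raw.decode("utf-16")          # raw bytes -> str; the BOM selects the byte order

print(repr(text))                    # '# English v 1.4\r\n'
```

The BOM is consumed by the "utf-16" codec, so `text` starts directly with the "#" of the first line.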

BTW, you may just have shortened your example too much, but depending on
python to close files for you is risky behaviour. If you get an exception
thrown before the file goes out of scope it may not get closed when you
expect and that can lead to some fairly hard to track problems. It is much
better to either call the close method explicitly or to use Python 2.5's
'with' statement.
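
As an illustration of the two options just mentioned, here is a small Python 3 sketch. The scratch file and the simulated failure are assumptions made for the demo and have nothing to do with the files in the thread.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")  # hypothetical scratch file

# Option 1: call close() explicitly, in a finally clause so it always runs.
f = open(path, "w")
try:
    f.write("first\n")
finally:
    f.close()

# Option 2: the with statement closes the file even if the body raises.
try:
    with open(path, "a") as g:
        g.write("second\n")
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

print(f.closed, g.closed)
```

Both handles report `closed` as true afterwards, and the data written before the exception is still flushed to disk.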

EuGeNe Van den Bulke

May 4, 2007, 5:47:40 AM
Duncan Booth wrote:
> However, the decoded text looks as though it is utf16 encoded so it
> should be written as binary. i.e. the output mode should be "wb".

Thanks for the "wb" tip, that works (see below). I guess it is
experience-based, but how could you tell that it was utf16 encoded?

> Simpler than using the base64 module you can just use the base64 codec.
> This will decode a string to a byte sequence and you can then decode that
> to get the unicode string:
>
> with file("hebrew.b64","r") as f:
>     text = f.read().decode('base64').decode('utf16')
>
> You can then write the text to a file through any desired codec or process
> it first.

>>> with file("hebrew.lang","wb") as f:
...     f.write(text.encode('utf16'))

Done ... superb!
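
One detail worth knowing about the write step: the plain 'utf16' codec emits a BOM in the machine's native byte order, so the output bytes depend on the platform. The Python 3 sketch below pins the byte order to little-endian by writing the BOM explicitly. The sample text is the first two lines of hebrew.expected.lang, and the scratch directory is an assumption for the demo.

```python
import codecs
import os
import tempfile

# First two lines of hebrew.expected.lang, with the Hebrew word written as escapes.
text = "# English v 1.4\r\n1:\u05e2\u05d1\u05e8\u05d9\u05ea\r\n"

# An explicit BOM plus "utf-16-le" yields the same bytes on every platform.
payload = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

out_path = os.path.join(tempfile.mkdtemp(), "hebrew.lang")  # scratch location
with open(out_path, "wb") as out:   # binary mode: no newline translation
    out.write(payload)

# Read the file back to confirm that the bytes survived unchanged.
with open(out_path, "rb") as check:
    round_trip = check.read().decode("utf-16")
```

If the target application accepts either byte order, `text.encode('utf16')` as used above is enough; the explicit form only matters when the consumer expects little-endian specifically.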

> BTW, you may just have shortened your example too much, but depending on
> python to close files for you is risky behaviour. If you get an exception
> thrown before the file goes out of scope it may not get closed when you
> expect and that can lead to some fairly hard to track problems. It is much
> better to either call the close method explicitly or to use Python 2.5's
> 'with' statement.

Yes I had shortened my example but thanks for the 'with' statement tip
... I never think about using it and I should ;)

Thanks,

EuGeNe -- http://www.3kwa.com

Duncan Booth

May 4, 2007, 9:03:55 AM
EuGeNe Van den Bulke <eugene.va...@gmail.com> wrote:

> Duncan Booth wrote:
>> However, the decoded text looks as though it is utf16 encoded so it
>> should be written as binary. i.e. the output mode should be "wb".
>
> Thanks for the "wb" tip, that works (see below). I guess it is
> experience-based, but how could you tell that it was utf16 encoded?

I pasted the encoded form into IDLE and base64-decoded it. The result ends
with \r\x00\n\x00, and the nulls instantly suggest a 16-bit encoding.
Scrolling to the beginning, it starts with \xff\xfe, which is the BOM for
little-endian utf16.
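
The same inspection can be scripted. The Python 3 sketch below checks the leading bytes against the BOM constants in the standard `codecs` module; `guess_encoding_from_bom` is a hypothetical helper written for this illustration, and the sample is again the first 48 characters of hebrew.b64.

```python
import base64
import codecs

def guess_encoding_from_bom(data):
    """Return a codec name if data starts with a known BOM, else None.

    Hypothetical helper for illustration. UTF-32 is checked before UTF-16
    because the UTF-32-LE BOM begins with the same two bytes as UTF-16-LE.
    """
    candidates = [
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, name in candidates:
        if data.startswith(bom):
            return name
    return None

raw = base64.b64decode("//4jACAARQBuAGcAbABpAHMAaAAgAHYAIAAxAC4ANAANAAoA")
print(raw[:2], raw[-4:])              # the BOM and the trailing CR/LF pair
print(guess_encoding_from_bom(raw))
```

The returned names are codecs that consume the BOM while decoding, so the result can be passed straight to `raw.decode(...)`. Data without a BOM returns None and would need another heuristic, such as the pattern of null bytes described above.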
