encoding to and from UTF-8

Chun

24 Jun 2009, 4:42:02

Hello,

I encode a unicode character as utf-8 but how do I convert back to
unicode?

% set a \u0160 << character is Š
Š
% set n [encoding convertto utf-8 $a]
Å
% binary scan $n H* hex
1
% puts $hex
c5a0 << correct

set n [encoding convertto unicode $n]
Å
% binary scan $n H* hex
1
% puts $hex
00c500a0 << Not what I was expecting

The other problem is:

% set a \u0160
Š
% binary scan $a H* hex
1
% puts $hex
60 << Not what was expecting 0160

Thanks
Michael

Pat Thoyts

24 Jun 2009, 7:27:04

Chun <chun...@gmail.com> writes:

>Hello,
>
>I encode a unicode character as utf-8 but how do I convert back to
>unicode?
>
>% set a \u0160 << character is Š
>Š
>% set n [encoding convertto utf-8 $a]
>Å
>% binary scan $n H* hex
>1
>% puts $hex
>c5a0 << correct
>
>set n [encoding convertto unicode $n]

The variable 'n' contains the utf-8 representation of your original
input. To convert it back into a Tcl-internal representation you
should be doing:
  set n [encoding convertfrom utf-8 $n]

Then if you really want the unicode encoding of this you can do:
  set unicode_n [encoding convertto unicode $n]

>Å
>% binary scan $n H* hex
>1
>% puts $hex
>00c500a0 << Not what I was expecting
>
>The other problem is:
>
>% set a \u0160
>Š
>% binary scan $a H* hex
>1
>% puts $hex
>60 << Not what was expecting 0160
>
>Thanks
>Michael

--
Pat Thoyts http://www.patthoyts.tk/
To reply, rot13 the return address or read the X-Address header.
PGP fingerprint 2C 6E 98 07 2C 59 C8 97 10 CE 11 E6 04 E0 B9 DD

Chun

24 Jun 2009, 7:32:43

On Jun 24, 12:27 pm, Pat Thoyts <cnggub...@hfref.fbheprsbetr.arg>
wrote:

Thanks Pat

this works.

% set n [encoding convertfrom utf-8 $n]
Å
% set unicode_n [encoding convertto unicode $n]
`
% binary scan $unicode_n H* hex
1
% puts $hex
0160

Michael

Kevin Kenny

24 Jun 2009, 7:51:27

Chun wrote:
> Hello,
>
> I encode a unicode character as utf-8 but how do I convert back to
> unicode?
>
> % set a \u0160 << character is Š
> Š
> % set n [encoding convertto utf-8 $a]
> Å
> % binary scan $n H* hex
> 1
> % puts $hex
> c5a0 << correct

Here you've taken Tcl's internal representation of the string
and produced a byte array containing the UTF-8 representation.
(That array's string representation is a sequence of ISO8859-1
characters corresponding to the bytes). [binary scan] is happy
to convert that to hexadecimal.

> set n [encoding convertto unicode $n]
> Å
> % binary scan $n H* hex
> 1
> % puts $hex
> 00c500a0 << Not what I was expecting

Now you've taken the byte array from the previous step
and interpreted it as a string, asking Tcl to convert it
to Unicode. The result is that each of the two bytes
becomes an ISO8859-1 character, and gets encoded as
its Unicode counterpart, so you get two 16-bit characters.
What you probably wanted to do was either to encode
the original string, or decode the byte array back to
a string and then encode it:

% set n [encoding convertto unicode $a]
`
% binary scan $n H* hex
1
% puts $hex
6001

Here you see a single 16-bit character. The bytes are
swapped because I'm on a little-endian machine.
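If the byte order matters, one way to sidestep the machine's native order is to extract the code point and lay it out explicitly. This is only a sketch, and it assumes the character is in the 16-bit range (no surrogate handling); the variable names are illustrative.

```tcl
set a \u0160

# Character -> integer code point.
scan $a %c cp

# Lay the code point out as a 16-bit value in an explicit byte order.
# "S" is 16-bit big-endian, "s" is 16-bit little-endian.
binary scan [binary format S $cp] H* be
binary scan [binary format s $cp] H* le

puts "big-endian:    $be"
puts "little-endian: $le"
```

On any machine this should print 0160 for the big-endian form and 6001 for the little-endian one, which matches the two results seen in this thread.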

% set n [encoding convertto utf-8 $a]
Å
% set n2 [encoding convertto unicode [encoding convertfrom utf-8 $n]]
`
% binary scan $n2 H* hex
1
% puts $hex
6001

> The other problem is:
>
> % set a \u0160
> Š
> % binary scan $a H* hex
> 1
> % puts $hex
> 60 << Not what was expecting 0160

Now you're trying to apply [binary scan] to a string that isn't
a byte array. What [binary scan] does in that case is to interpret
each character as a byte and discard the most significant bits.
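A small sketch of the difference, for comparison. The first call reproduces the truncation described above under the Tcl 8.x behaviour; it is wrapped in [catch] on the assumption that other Tcl releases may refuse a non-byte string outright rather than truncate it. The second call goes through an explicit encoding first, which is well defined either way.

```tcl
set a \u0160

# Feeding a character string straight to [binary scan].
# Assumption: Tcl 8.x keeps only the low byte of each character;
# a release that rejects such input will land in the error branch.
if {[catch {binary scan $a H* hex} err]} {
    puts "rejected: $err"
} else {
    puts "truncated: $hex"
}

# Converting to a byte array in a named encoding first.
binary scan [encoding convertto utf-8 $a] H* utf8hex
puts "utf-8 bytes: $utf8hex"
```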

I suspect that you're working way too hard.

The [encoding convertto] and [encoding convertfrom] commands are
chiefly useful for dealing with strings that need to be embedded
in binary data. If that isn't what you have, you don't need to
use them. For day-to-day use, you just configure channels to have
the needed encoding, and read and write strings on those channels.
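As a rough illustration of the channel approach, the sketch below writes the string through a channel configured for UTF-8, then reads the file back twice: once as raw bytes and once as text. The file name is a made-up scratch path, not anything from the original question.

```tcl
# Hypothetical scratch file; any writable path will do.
set path [file join [pwd] encoding-demo.txt]
set a \u0160

# Write the string; the channel does the UTF-8 encoding.
set out [open $path w]
fconfigure $out -encoding utf-8 -translation lf
puts -nonewline $out $a
close $out

# Read the raw bytes back to see what landed on disk.
set in [open $path r]
fconfigure $in -translation binary
set bytes [read $in]
close $in
binary scan $bytes H* hex
puts $hex

# Read it again as text; the channel does the UTF-8 decoding.
set in [open $path r]
fconfigure $in -encoding utf-8
set text [read $in]
close $in
puts [string equal $text $a]

file delete $path
```

Note that no [encoding convertto] or [encoding convertfrom] call appears anywhere: the script only ever handles ordinary Tcl strings.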

If you're simply trying to extract the information of 'what
Unicode code point is this character' or 'what character is this
Unicode code point', it's easier to use [scan] and [format]:

% foreach c [split $a {}] {
    scan $c %c n
    puts [format %#06x $n]
}
0x0160
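The loop above covers the 'what code point is this character' direction. The other direction can be sketched with [format %c], shown here with a round-trip check; the variable names are only illustrative.

```tcl
# Code point -> character.
set c [format %c 0x0160]
puts [string equal $c \u0160]

# Character -> code point again, as a round-trip check.
scan $c %c n
puts [format %#06x $n]
```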

The subject line suggests that you are trying to encode data in
the Windows code page 1252. cp1252 IS NOT UTF-8. It's ISO8859-1,
with a number of characters in the range \x80-\x9f replaced by
Windows-specific things. Tcl will happily encode things in that
code page; use 'cp1252' in place of 'utf-8' or 'unicode'.
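For instance, a minimal sketch of encoding the same character in cp1252, assuming the cp1252 encoding file is available in your Tcl installation (it is part of the standard distribution, but a stripped-down build might omit it):

```tcl
set a \u0160

# String -> cp1252 bytes. In that code page the character is a single byte.
set bytes [encoding convertto cp1252 $a]
binary scan $bytes H* hex
puts $hex

# cp1252 bytes -> string again.
set back [encoding convertfrom cp1252 $bytes]
puts [string equal $back $a]
```

The single byte here (8a) falls in the \x80-\x9f range mentioned above, which is exactly where cp1252 and ISO8859-1 differ.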

You may also be laboring under the misconception that, because
your script was encoded in CP1252, the strings at run time
will be CP1252. That's not true. Tcl converted your script to
its internal representation (which happens to be UTF-8, but that's
none of your business unless you're writing C code to deal with
Tcl strings). That leaves you with the simpler problem of
"how do I convert strings to/from a given encoding, given
Tcl's internal representation". That's what [encoding
convert*] does, and that's why there's a 'convertfrom' in
addition to a 'convertto'.
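Put together, re-encoding bytes from one encoding to another always passes through a Tcl string in the middle. The helper below is a hypothetical convenience wrapper written for this illustration, not a built-in command:

```tcl
# Hypothetical helper: decode bytes from one encoding into a Tcl string,
# then encode that string into another encoding.
proc transcode {from to bytes} {
    set str [encoding convertfrom $from $bytes]
    return [encoding convertto $to $str]
}

# One cp1252 byte (0x8a), built explicitly as binary data.
set cp1252_bytes [binary format H2 8a]

set utf8_bytes [transcode cp1252 utf-8 $cp1252_bytes]
binary scan $utf8_bytes H* hex
puts $hex
```

This should print c5a0, the same UTF-8 bytes the original question started from.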

--
73 de ke9tv/2, Kevin