
How to detect if an encoding conversion will fail?


Georgios Petasis

Aug 17, 2019, 9:07:04 AM
Hi all,

I want to write some Greek characters to a file in the iso8859-7
encoding. However, some words contain characters that cannot be mapped
into this encoding, and I end up writing "?" in the file.
How can I detect this and avoid writing the word at all?

George

Christian Gollwitzer

Aug 17, 2019, 9:42:57 AM
On 17.08.19 at 15:07, Georgios Petasis wrote:
There is no way to ask the encoding system to throw an error on unknown
characters instead of converting them to the "unknown character"
character. The only way seems to be to convert the result back and check
that it is identical to the input, like this:

set roundtrip [encoding convertfrom iso8859-7 [encoding convertto iso8859-7 $input]]

if {$roundtrip ne $input} { error Loss! }
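
The round-trip check can be wrapped in a small helper so that unencodable words are simply skipped. This is only a sketch: the proc names `encodable` and `encodableWords` are made up for illustration, and the `catch` is an assumption to cover interpreters that raise an error on unmappable characters instead of substituting "?".

```tcl
# Hypothetical helper: returns 1 if $text survives a round trip through $enc.
proc encodable {text {enc iso8859-7}} {
    # Assumption: if the interpreter raises an error instead of
    # substituting "?", treat that as "not encodable" as well.
    if {[catch {
        encoding convertfrom $enc [encoding convertto $enc $text]
    } roundtrip]} {
        return 0
    }
    return [expr {$roundtrip eq $text}]
}

# Hypothetical helper: keep only the words that can be written without loss.
proc encodableWords {words {enc iso8859-7}} {
    set keep {}
    foreach w $words {
        if {[encodable $w $enc]} {
            lappend keep $w
        }
    }
    return $keep
}
```

For example, `encodableWords [list "\u03bb\u03cc\u03b3\u03bf\u03c2" "\u03bb\u0436"]` should keep only the first (purely Greek) word, because the Cyrillic letter in the second one has no mapping in iso8859-7.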

Unfortunately, this check can still report a loss even though the
character exists in the target encoding, because Unicode lets you write
an accented letter either as a base character plus a combining accent or
as a single precomposed letter (one of the stupidest decisions in
Unicode). Such cases can only be caught by Unicode normalization, which
is not part of the Tcl built-in commands.

Christian

stefan

Aug 18, 2019, 3:48:35 PM
> This can only be checked by Unicode
> normalizations, which is not part of the Tcl built-in commands.

In that case, especially if you cannot control the input or the input source is known to use a non-normalized form (e.g. filesystems), tcllib comes to the rescue:

% package req unicode
1.0.0
% set input1 \u00e9
é
% set input2 \u0065\u0301
é
% expr {$input1 eq $input2}; # Ouch!
0
% expr {$input1 eq [::unicode::normalizeS C $input2]}
1

So, before doing the roundtrip, normalize your input.
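
Putting the two steps together might look like the sketch below. It assumes tcllib is installed (for the `unicode` package used above), and the proc name `normalizeForEncoding` is hypothetical. The `catch` is again an assumption for interpreters that raise an error rather than substituting "?".

```tcl
package require unicode   ;# from tcllib

# Hypothetical helper: normalize to form C, then apply the round-trip test.
# Returns the normalized string, or raises an error if it cannot be
# represented in $enc without loss.
proc normalizeForEncoding {text {enc iso8859-7}} {
    set nfc [::unicode::normalizeS C $text]
    if {[catch {
        encoding convertfrom $enc [encoding convertto $enc $nfc]
    } roundtrip] || $roundtrip ne $nfc} {
        error "cannot represent \"$text\" in $enc"
    }
    return $nfc
}
```

With this, a decomposed Greek alpha plus combining acute (`\u03b1\u0301`) is first composed to the single character `\u03ac`, which iso8859-7 can represent, so the word is no longer rejected just because of the form it arrived in.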

HTH, Stefan

Harald Oehlmann

Aug 19, 2019, 5:43:14 AM
I would really appreciate it if the encoding commands were enhanced to
provide this and the following information:

-> information about invalid data in UTF-8 input
-> information about lead bytes of a multi-byte sequence that have no
continuation bytes (only the first byte of a UTF-8 sequence is present)
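
Until something like that exists in the encoding commands, both kinds of information can be approximated at script level by scanning the raw bytes. The following is only a sketch, not a feature of the `encoding` command: the proc name `utf8Check` and its return format are invented for illustration.

```tcl
# Hypothetical helper: scan raw bytes and report the first UTF-8 problem.
# Returns {ok -1}, {invalid index} or {truncated index}, where index is the
# byte offset of the offending lead (or stray) byte.
proc utf8Check {bytes} {
    set b {}
    binary scan $bytes cu* b
    set n [llength $b]
    set i 0
    while {$i < $n} {
        set c [lindex $b $i]
        if {$c < 0x80} {
            incr i                      ;# plain ASCII byte
            continue
        } elseif {$c >= 0xC2 && $c <= 0xDF} {
            set need 1
        } elseif {$c >= 0xE0 && $c <= 0xEF} {
            set need 2
        } elseif {$c >= 0xF0 && $c <= 0xF4} {
            set need 3
        } else {
            return [list invalid $i]    ;# stray continuation or illegal lead byte
        }
        for {set k 1} {$k <= $need} {incr k} {
            if {$i + $k >= $n} {
                return [list truncated $i]   ;# sequence started, input ended
            }
            set t [lindex $b [expr {$i + $k}]]
            if {$t < 0x80 || $t > 0xBF} {
                return [list invalid $i]     ;# not a continuation byte
            }
            # Reject overlong forms, surrogates and values above U+10FFFF.
            if {$k == 1 && (($c == 0xE0 && $t < 0xA0) || ($c == 0xED && $t > 0x9F)
                    || ($c == 0xF0 && $t < 0x90) || ($c == 0xF4 && $t > 0x8F))} {
                return [list invalid $i]
            }
        }
        incr i [expr {$need + 1}]
    }
    return [list ok -1]
}
```

A caller reading a file in chunks could use the `truncated` result to hold back the incomplete tail and prepend it to the next chunk, while treating `invalid` as a real data error.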

So much to do....

Thank you,
Harald