Recognising valid/complete UTF-8

Mike Griffiths

May 14, 2018, 4:00:59 PM

Hi all.

I have a Tcl Telnet client which negotiates character encoding. The channel has -encoding binary (*), and I'm using [encoding convertfrom] on the received strings. However, right now I'm having to buffer until I get an end-of-line and then convert, because if I convert the bytes as I receive them, and my [gets] returns half a codepoint, the output isn't correct. So, my question: Is there a way to tell if the bytes in a variable represent a complete and valid string in a given encoding? And/or get the point in a string where this is no longer true? (I was hoping for either a useful [string is] sub-command, or something in [encoding], but there doesn't seem to be anything helpful.)

(*) I cannot change the encoding of the channel from binary; there are several ways to encode y-umlaut, all of which show up as byte 255 when I read from the channel if I use -encoding utf-8, but byte 255 is special to the telnet protocol, and I need to be able to differentiate between that and another form (from the server) of sending y-umlaut.
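A small sketch of why the distinction is only visible at the byte level. It assumes Tcl 8.6 or later (for [binary encode hex]), and the helper name wireHex is made up for illustration. It prints the bytes that U+00FF (y with diaeresis) becomes in two encodings, next to the Telnet IAC byte:

```tcl
# Hypothetical helper: hex dump of the bytes that character $ch becomes in $enc.
proc wireHex {enc ch} {
    return [binary encode hex [encoding convertto $enc $ch]]
}

puts "iso8859-1: [wireHex iso8859-1 \u00FF]"
puts "utf-8:     [wireHex utf-8 \u00FF]"
# The Telnet IAC command byte, for comparison
puts "IAC:       [binary encode hex [binary format c 255]]"
```

In ISO 8859-1 the character is the single byte ff, the same value as IAC, while in UTF-8 it is the pair c3 bf. Once the channel has decoded its input, both can end up as the same character, which is the ambiguity described above.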

Mike Griffiths

May 14, 2018, 4:03:20 PM

Having thought on it some more as I was typing up the question, one solution (of sorts) occurred to me - it may be more viable for me to read from my binary-encoded channel, take out the telnet control strings, and then shunt everything through another channel which does have -encoding utf-8 on it and use the result of that for my actual output. That way I can use Tcl's built-in conversion to read whole characters without having to try and do it myself...
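A rough sketch of that second-channel idea, assuming Tcl 8.6 or later for [chan pipe]. The names decodeIn, decodeOut and decodeChunk are hypothetical, and the bytes passed in are assumed to have had the Telnet control sequences removed already. The write side of the pipe is binary, the read side decodes UTF-8 and is non-blocking, so a character whose bytes have not all arrived should stay in the channel buffer until the rest is written:

```tcl
# Pipe used purely as a decoder: raw bytes go in, decoded text comes out.
lassign [chan pipe] decodeOut decodeIn
chan configure $decodeIn  -translation binary -buffering none
chan configure $decodeOut -translation lf -encoding utf-8 -blocking 0

# Hypothetical helper: push cleaned bytes through the pipe and return
# whatever whole characters are available so far.
proc decodeChunk {bytes} {
    global decodeIn decodeOut
    puts -nonewline $decodeIn $bytes
    flush $decodeIn
    return [read $decodeOut]
}

# "ab" plus the first byte of a two-byte character, then the rest.
set first  [decodeChunk [binary format H* 6162c3]]
set second [decodeChunk [binary format H* a963]]
```

One caveat with this design: the same thread writes and reads, so a chunk larger than the operating system's pipe buffer could block on the write. Feeding it in modest pieces avoids that.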

Harald Oehlmann

May 15, 2018, 2:18:27 AM

On 14.05.2018 at 22:03, Mike Griffiths wrote:
> On Monday, 14 May 2018 21:00:59 UTC+1, Mike Griffiths wrote:

Mike, sorry, did not read your long post.

On this wiki page, there is a discussion about receiving UTF-8 over
sockets and checking for complete characters:

http://wiki.tcl.tk/515

Maybe helpful,
Harald

Mike Griffiths

May 15, 2018, 4:49:32 PM

Hi Harald,

Many thanks for your reply, I'll take a look at that. My title is slightly misleading, though - although utf-8 is the one that raised the issue for me, it could potentially be any encoding I have to deal with. While UTF-8 is my current (and likely most common by far) problem, something that will work across the board (ideally utilising the built-in encoding support Tcl already has, rather than having to hardcode my own checks for each) would be very helpful.

Regards,
Mike

Rich

May 15, 2018, 5:13:58 PM

Well, you are dipping your toes there into a very hard problem. For
the various byte encodings, there really is little beyond wizardry
heuristics to use to deduce which encoding is being used.

At least for UTF-8 one can say that the stream of bytes either is valid
as a UTF-8 encoding or is invalid as a UTF-8 encoding.

The invalid case is simple: whatever the byte stream is, it is not UTF-8.

The 'valid' one is trickier. It is possible, although highly unlikely,
that some random stream of bytes just happens to be structured such
that it is also a valid UTF-8 encoding. So even 'validity' for UTF-8
isn't an absolute. Although I'll admit that for this one, the chance
of this happening is remote enough you can simply ignore it in the
general sense.

But with the multitude of byte encodings, if you get a byte stream
containing a character 0xBF, how can you deduce that it was meant to be
"INVERTED QUESTION MARK" from ISO-8859-1 vs "LATIN SMALL LETTER Z WITH
DOT ABOVE" from ISO-8859-3 vs. a top right corner line draw character
from the old IBM code page 437 from the DOS days? That's where it gets
really really tricky....
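The ambiguity is easy to see from a Tcl prompt. This is a minimal sketch, assuming the iso8859-1, iso8859-3 and cp437 encodings from a standard Tcl installation are available; the helper name codePointOf is made up for illustration. It decodes the same byte 0xBF three ways and prints the resulting code points:

```tcl
# Hypothetical helper: decode one byte in $enc and return its code point.
proc codePointOf {enc byte} {
    return [scan [encoding convertfrom $enc $byte] %c]
}

set byte [binary format c 0xBF]
foreach enc {iso8859-1 iso8859-3 cp437} {
    puts [format "%-10s U+%04X" $enc [codePointOf $enc $byte]]
}
# iso8859-1  U+00BF  (inverted question mark)
# iso8859-3  U+017C  (latin small letter z with dot above)
# cp437      U+2510  (box drawing, top right corner)
```

All three conversions succeed, so nothing in the byte itself says which one was intended.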

Harald Oehlmann

May 15, 2018, 5:14:23 PM

Mike,

the "encoding convertfrom" command should be able to additionally
return any trailing fragment bytes:

% encoding convertfrom utf-8 -fragment ABC\xC0
ABC \xc0

Return value is a list of transformed bytes and remaining fragment bytes.

I would be glad to help with a corresponding TIP.

I fear I could not be too helpful with the implementation but anyway...

What do you think ?

Harald

Mike Griffiths

May 15, 2018, 8:10:38 PM

Rich,

I know what the encoding should be (it's negotiated via Telnet). My issue is that I need to receive it with -encoding binary on the socket, not -encoding $whatever (as I need to know specifically what bytes are received for the Telnet parsing, not what characters they correspond to after encoding translation), and I then need to convert the bytes from the specified encoding, character by character, myself. Tcl doesn't expose any method I've found, at the script level, for recognising whether a string of bytes contains whole or partial characters in the given encoding, and [encoding convertfrom] "helpfully" doesn't balk at incomplete multi-byte characters, which avoids it throwing errors but does cause it to butcher output if you've only received half the bytes for a character from the stream when you try to convert. (I hope that makes sense and isn't too rambling.)

Mike Griffiths

May 15, 2018, 8:16:17 PM

Harald,

That's more or less what I was thinking of/hoping for initially, yes. That way, you could do something like:

set return ""
set buffer ""
while { [gets $socket data] >= 0 } {
    append buffer $data
    # -fragment is the option proposed above, not an existing one
    set ret [encoding convertfrom $encoding -fragment $buffer]
    append return [lindex $ret 0]
    set buffer [lindex $ret 1]
}

I'm not sure what would happen in this case with an invalid byte sequence, however (and my knowledge of utf-8, or multi-byte encodings in general, is not particularly strong) - would there be a point where it could detect (and drop) the invalid bytes, or would it get stuck on them forever?
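For UTF-8 specifically, the same {converted fragment} split can be approximated in script today. The following is only a sketch under that assumption: utf8SplitIncomplete is a made-up name, it looks at the lead byte of a trailing sequence and nothing else, and it does not validate anything. On the "stuck forever" worry: the held-back fragment is never longer than three bytes, and anything that is not a plausible unfinished sequence is handed to [encoding convertfrom] as usual, so invalid bytes pass through rather than pile up.

```tcl
# Hypothetical helper: split $bytes into {complete pending}, where pending is
# a trailing UTF-8 sequence whose lead byte announces more bytes than have
# arrived so far (at most 3 bytes), or the empty string.
proc utf8SplitIncomplete {bytes} {
    set len [string length $bytes]
    # Walk back over at most 3 trailing bytes looking for a lead byte.
    for {set back 1} {$back <= 3 && $back <= $len} {incr back} {
        binary scan [string index $bytes end-[expr {$back - 1}]] cu b
        if {($b & 0xC0) == 0x80} continue   ;# continuation byte, keep looking
        if {($b & 0xE0) == 0xC0} {
            set need 2
        } elseif {($b & 0xF0) == 0xE0} {
            set need 3
        } elseif {($b & 0xF8) == 0xF0} {
            set need 4
        } else {
            set need 1                      ;# ASCII or a byte that is no lead
        }
        if {$need > $back} {
            set cut [expr {$len - $back}]
            return [list [string range $bytes 0 [expr {$cut - 1}]] \
                         [string range $bytes $cut end]]
        }
        break
    }
    return [list $bytes ""]
}

# Usage: the two bytes of U+00E9 arrive in separate chunks.
set pending ""
set text ""
foreach chunk [list [binary format H* 6162c3] [binary format H* a963]] {
    lassign [utf8SplitIncomplete $pending$chunk] complete pending
    append text [encoding convertfrom utf-8 $complete]
}
# text is now "ab\u00e9c" and pending is empty
```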

Rich

May 15, 2018, 8:47:15 PM

It does. Your original paragraph led me to believe you were planning
on receiving a random byte string and then identifying what character
encoding was used for that random byte string.

Given that you've already negotiated what the encoding /should/ be, then
deciding yes/no is a far easier problem. You may end up having to
handle the partial characters yourself re. utf-8. For the single byte
encodings, there is no confusion. And for the 16-bit and 32-bit
encodings things get troublesome again fast (due to needing to also
handle endian issues on top as well).
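To illustrate the 16-bit case, here is a sketch that assumes the byte order is already known to be little-endian and ignores byte-order marks; the name utf16leSplitIncomplete is made up. It only decides how many bytes form whole characters: an odd trailing byte is held back, and so is a final high surrogate (0xD800 to 0xDBFF) whose partner has not arrived yet.

```tcl
# Hypothetical helper: split little-endian UTF-16 bytes into {complete pending}.
proc utf16leSplitIncomplete {bytes} {
    set len [string length $bytes]
    set keep [expr {$len - ($len % 2)}]      ;# whole 16-bit units only
    if {$keep >= 2} {
        # Read the last whole unit as an unsigned little-endian 16-bit value.
        binary scan [string range $bytes [expr {$keep - 2}] [expr {$keep - 1}]] su unit
        if {$unit >= 0xD800 && $unit <= 0xDBFF} {
            incr keep -2                     ;# high surrogate still waiting for its pair
        }
    }
    return [list [string range $bytes 0 [expr {$keep - 1}]] \
                 [string range $bytes $keep end]]
}
```

For big-endian data the scan format would be Su instead of su, which is the extra decision the endian issue forces on the caller.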

Harald Oehlmann

May 16, 2018, 3:09:52 AM

Dear Mike,

a) incomplete:

Yes, you got my ideas

b) errors:

Currently, any error is replaced by a "?" character.

That is also a field for a TIP and is more complicated.

Something like

encoding convertfrom $encoding -fragment -errorlist $buffer

Will return a 3 item list. The 3rd item is the error list and contains a
list for each error. Each error is described as the position (in the
output) plus the raw bytes which could not be parsed.
If "-fragment" is not given, an error is reported with the index = end
and the fragment bytes as 2nd list element.

Or we only allow the -errorlist and the fragment is a special case of an
error.

Any thoughts ?

Harald

heinrichmartin

May 16, 2018, 4:32:25 AM

On Wednesday, May 16, 2018 at 9:09:52 AM UTC+2, Harald Oehlmann wrote:
> >
> > set return ""
> > set buffer ""
> > while { [gets $socket data] >= 0 } {
> > append buffer $data
> > set ret [encoding convertfrom $encoding -fragment $buffer]
> > append return [lindex $ret 0]
> > set buffer [lindex $ret 1]
> > }
> >
>

> encoding convertfrom $encoding -fragment -errorlist $buffer
>
> Will return a 3 item list. The 3rd item is the error list and contains a
> list for each error. Each error is described as the position (in the
> output) plus the raw bytes which could not be parsed.
> If "-fragment" is not given, an error is reported with the index = end
> and the fragment bytes as 2nd list element.
>
> Or we only allow the -errorlist and the fragment is a special case of an
> error.
>
> Any thoughts ?

I kind of dislike the idea of many independent flags messing with the representation of the result. Also index = end sounds like a nightmare as it prevents using lassign straight away.

I know that your suggested pattern is used in regexp, but I prefer output variables, i.e.

encoding convertfrom $encoding -fragmentvar buffer -errorvar errors $buffer

or in context

set return ""
set buffer ""
while { [gets $socket data] >= 0 } {
    append buffer $data
    append return [encoding convertfrom $encoding -fragmentvar buffer $buffer]
}

In this way, the return value of convertfrom is always just the result.
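The output-variable shape can be tried out with a script-level wrapper. This is a sketch only: convertfromFragment is a made-up name, not part of Tcl, and it detects an unfinished trailing sequence for UTF-8 alone (any other encoding is converted as-is with an empty fragment). The variable is written through upvar, so the return value stays just the converted text.

```tcl
# Hypothetical wrapper imitating the proposed -fragmentvar behaviour.
proc convertfromFragment {encoding fragVar bytes} {
    upvar 1 $fragVar frag
    set frag ""
    if {$encoding eq "utf-8"} {
        # A lead byte at the very end that still lacks continuation bytes.
        set re {(?:[\xC0-\xDF]|[\xE0-\xEF][\x80-\xBF]?|[\xF0-\xF7][\x80-\xBF]{0,2})$}
        if {[regexp -indices $re $bytes span]} {
            set cut [lindex $span 0]
            set frag [string range $bytes $cut end]
            set bytes [string range $bytes 0 [expr {$cut - 1}]]
        }
    }
    return [encoding convertfrom $encoding $bytes]
}

# Usage, mirroring the loop body above with canned chunks instead of a socket.
set return ""
set buffer ""
foreach data [list [binary format H* 6162c3] [binary format H* a963]] {
    append buffer $data
    append return [convertfromFragment utf-8 buffer $buffer]
}
```

Because $buffer is substituted before the call, the same variable can serve as both the input and the place where the leftover bytes are stored.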