UTF8 discontinuities


luser droog

Jul 5, 2016, 1:31:09 AM
I've been reworking my utf8<->ucs4 conversion code in
preparation for overlaying my own codes over the unused
portions of the first byte (thus, enabling my apl-like
golfing language to count 64 extra symbols as 1-byte
each). But upon reviewing the spec
http://www.ietf.org/rfc/rfc3629.txt
I noticed the "grammar" on p 4-5

UTF8-octets = *( UTF8-char )
UTF8-char = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
UTF8-1 = %x00-7F
UTF8-2 = %xC2-DF UTF8-tail
[...page break...]
UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
%xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4 = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
%xF4 %x80-8F 2( UTF8-tail )
UTF8-tail = %x80-BF

And reviewing the old reviews of my code:
https://groups.google.com/d/topic/comp.lang.c/ki2ar-qVLAQ/discussion
http://codereview.stackexchange.com/questions/98838/utf-8-encoding-decoding

I don't think the bizarrities of this grammar ever
quite came to light. What's with the A0-BF after E0
in UTF8-3? Or the 90-BF after F0? Or C0-C1? where did
they go?

I'm not sure my treatment as prefix|payload for
either encoding or decoding is sufficient to cope
with what's going on in there (because I'm not so
sure anymore).

Ben Bacarisse

Jul 5, 2016, 6:38:39 AM
I'm not sure what you mean by "where did they go" (and the grammar is
messy enough that I've not checked it) so I'll just add a little
explanation. UTF-8 sequences look like this (originally extended to up
to six bytes):

0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

where the value of the code point is simply the concatenation of the xs.

The encoding has these useful properties: (a) You can always tell if you
are looking at a single byte sequence or a multi-byte sequence by
looking at the top bit. (b) The length of a multi-byte sequence is
encoded in the number of leading 1s in the first byte. (c) All
follow-on bytes have 10 as their top two bits so you can tell them apart
from the lead byte of a multi-byte sequence.
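
Properties (a) to (c) can be read straight off the bit patterns. The following is a minimal sketch, not code from this thread; the function names are made up for illustration, and it looks at the bit layout only, without the extra range restrictions of the RFC 3629 grammar.

```c
#include <stddef.h>

/* Property (c): a follow-on byte has 10 as its top two bits. */
static int utf8_is_tail(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Properties (a) and (b): the sequence length implied by the leading
 * 1 bits of the first byte. Returns 0 for a byte that cannot start a
 * sequence in the four-byte scheme (a follow-on byte, or five or more
 * leading 1s). Hypothetical helper, bit pattern only. */
static size_t utf8_len_from_lead(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;   /* 0xxxxxxx */
    if ((b & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;
}
```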

Does that tie up with the grammar and your understanding of the
encoding?

<snip>
--
Ben.

Eric Sosman

Jul 5, 2016, 7:01:26 AM
On 7/5/2016 1:30 AM, luser droog wrote:
> I've been reworking my utf8<->ucs4 conversion code in
> preparation for overlaying my own codes over the unused
> portions of the first byte [...]

comp.lang.tactile.basic is down the hall to your left.

--
eso...@comcast-dot-net.invalid
"Don't be afraid of work. Make work afraid of you." -- TLM

Richard Bos

Jul 5, 2016, 9:18:50 AM
Eric Sosman <eso...@comcast-dot-net.invalid> wrote:

> On 7/5/2016 1:30 AM, luser droog wrote:
> > I've been reworking my utf8<->ucs4 conversion code in
> > preparation for overlaying my own codes over the unused
> > portions of the first byte [...]
>
> comp.lang.tactile.basic is down the hall to your left.

Not on my server, it isn't. Now I'm curious. A quick websearch reveals
nothing.

Richard

Richard Heathfield

Jul 5, 2016, 9:25:06 AM
Perhaps he's being soscastic, and actually means VB (which has been
well-described as 'stickle-brick programming' - very tactile in every
sense (except, it must be admitted, in the sense of the sense of touch)).

--
Richard Heathfield
Email: rjh at cpax dot org dot uk
"Usenet is a strange place" - dmr 29 July 1999
Sig line 4 vacant - apply within

Eric Sosman

Jul 5, 2016, 9:47:30 AM
On 7/5/2016 9:24 AM, Richard Heathfield wrote:
> On 05/07/16 14:18, Richard Bos wrote:
>> Eric Sosman <eso...@comcast-dot-net.invalid> wrote:
>>
>>> On 7/5/2016 1:30 AM, luser droog wrote:
>>>> I've been reworking my utf8<->ucs4 conversion code in
>>>> preparation for overlaying my own codes over the unused
>>>> portions of the first byte [...]
>>>
>>> comp.lang.tactile.basic is down the hall to your left.
>>
>> Not on my server, it isn't. Now I'm curious. A quick websearch reveals
>> nothing.
>
> Perhaps he's being soscastic, and actually means VB (which has been
> well-described as 'stickle-brick programming' - very tactile in every
> sense (except, it must be admitted, in the sense of the sense of touch)).

I was merely trying to imagine a forum on which this thread's
subject might be topical. "BASIC for Clueless Gropers" came to mind.

supe...@casperkitty.com

Jul 5, 2016, 10:01:30 AM
On Tuesday, July 5, 2016 at 12:31:09 AM UTC-5, luser droog wrote:
> I don't think the bizarrities of this grammar ever
> quite came to light. What's with the A0-BF after E0
> in UTF8-3? Or the 90-BF after F0? Or C0-C1? where did
> they go?

The format for two-byte codes would allow all code points 0-2047 to be
encoded in two bytes, but in all well-formed strings values
0-127 are required to be encoded in a single byte. Likewise, the
format for three-byte codes would allow 0-65535 to be encoded in
three bytes, but values below 2048 are required to be encoded in
one or two bytes; the format for four-byte codes would allow all
code points to be expressed in four bytes, but only code points
above 65535 are allowed to use that format.
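
The shortest-form rule falls out naturally on the encoding side: pick the smallest format whose range contains the value. This is a minimal sketch under that assumption, with a made-up function name; it also refuses the surrogate range D800-DFFF and anything above 10FFFF, which is what the ED and F4 lines of the grammar express.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point into out[0..3] using the shortest form.
 * Returns the number of bytes written, or 0 if cp is not encodable.
 * Hypothetical helper, not the code from the earlier thread. */
static size_t ucs4_to_utf8(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 0-127: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 128-2047: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 2048-65535: three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are excluded */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* above 65535: four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* beyond the Unicode range */
}
```

Since the smallest two-byte value is 0x80, which encodes as C2 80, an encoder like this can never emit C0 or C1 as a lead byte. Likewise 0x800 becomes E0 A0 80 and 0x10000 becomes F0 90 80 80, which matches the A0-BF and 90-BF ranges in the grammar.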

I would not be surprised if at some point in the design process
there were an intention to recognize non-normalized string formats
where lower codes could be encoded using whatever format was most
convenient (and programs might benefit from using such formats
internally even though they would be considered non-standard if
exported as-is) but the authors of the Standard wanted to ensure
that code which tried to e.g. replace all less-than signs with
"&lt;" wouldn't let through less-than signs that were encoded using
alternate forms.
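
To make the less-than example concrete: the two bytes C0 BC carry the payload bits of 0x3C ('<'), so a decoder that only concatenates payload bits would produce a '<' that a byte-level filter never saw. The sketch below is illustrative only and both function names are invented.

```c
#include <stdint.h>

/* Naive two-byte decode: concatenates the payload bits and nothing else. */
static uint32_t naive_decode2(unsigned char b0, unsigned char b1)
{
    return ((uint32_t)(b0 & 0x1F) << 6) | (uint32_t)(b1 & 0x3F);
}

/* Strict two-byte decode: returns -1 unless the bytes have the
 * 110xxxxx 10xxxxxx shape AND the value really needs two bytes. */
static long strict_decode2(unsigned char b0, unsigned char b1)
{
    uint32_t cp;

    if ((b0 & 0xE0) != 0xC0 || (b1 & 0xC0) != 0x80)
        return -1;                 /* not a two-byte sequence */
    cp = naive_decode2(b0, b1);
    if (cp < 0x80)
        return -1;                 /* overlong: belongs in one byte */
    return (long)cp;
}
```

With these definitions, naive_decode2(0xC0, 0xBC) yields 0x3C, while strict_decode2(0xC0, 0xBC) rejects the sequence.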

luser droog

Jul 5, 2016, 1:42:55 PM
I see now. Thanks. So the grammar I quoted is essentially the
formatting step quoted in Ben's message combined with the range-
checks for the larger encodings.

So, David Brown's range-checking code from the earlier thread
is indeed sufficient to "cope with the discontinuities".
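
For reference, here is a minimal sketch of what "prefix|payload plus range checks" can look like on the decoding side. It is not the code from the earlier thread; the function name and the min_cp table are invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from s[0..n-1]. On success, stores it in *cp
 * and returns the number of bytes consumed; returns 0 if the input is
 * truncated or malformed. Hypothetical helper. */
static size_t utf8_decode_checked(const unsigned char *s, size_t n,
                                  uint32_t *cp)
{
    /* smallest value allowed for each sequence length */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t v;

    if (n == 0) return 0;

    /* Step 1: split the first byte into prefix and payload. */
    if (s[0] < 0x80)                { len = 1; v = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; v = s[0] & 0x07; }
    else return 0;                  /* stray tail byte or F8-FF */
    if (len > n) return 0;          /* truncated */

    /* Step 2: append six payload bits from each follow-on byte. */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        v = (v << 6) | (uint32_t)(s[i] & 0x3F);
    }

    /* Step 3: range checks, which account for the gaps in the grammar. */
    if (v < min_cp[len]) return 0;            /* C0-C1, E0 80-9F, F0 80-8F */
    if (v >= 0xD800 && v <= 0xDFFF) return 0; /* ED A0-BF (surrogates) */
    if (v > 0x10FFFF) return 0;               /* F4 90-BF and F5-F7 */

    *cp = v;
    return len;
}
```

Each discontinuity in the grammar corresponds to one of the three checks in step 3, so the bit-twiddling in steps 1 and 2 can stay uniform.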

I can imagine a more compact encoding where the payloads of the
larger codes are biased to their valid ranges.
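
A rough sketch of that idea, purely hypothetical and deliberately NOT UTF-8: each longer form stores the value minus the first value that the shorter forms cannot reach, so no bit pattern is wasted on overlong forms. The names and BIAS constants are invented, surrogates are not treated specially, and the decoder assumes well-formed input.

```c
#include <stddef.h>
#include <stdint.h>

/* First value covered by each longer form (hypothetical scheme). */
#define BIAS2 0x80UL                 /* after 2^7 one-byte values  */
#define BIAS3 (BIAS2 + 0x800UL)      /* after 2^11 two-byte values */
#define BIAS4 (BIAS3 + 0x10000UL)    /* after 2^16 three-byte values */

static size_t biased_encode(uint32_t cp, unsigned char out[4])
{
    uint32_t v;

    if (cp < BIAS2) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < BIAS3) {
        v = cp - BIAS2;                       /* 11-bit payload */
        out[0] = (unsigned char)(0xC0 | (v >> 6));
        out[1] = (unsigned char)(0x80 | (v & 0x3F));
        return 2;
    }
    if (cp < BIAS4) {
        v = cp - BIAS3;                       /* 16-bit payload */
        out[0] = (unsigned char)(0xE0 | (v >> 12));
        out[1] = (unsigned char)(0x80 | ((v >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (v & 0x3F));
        return 3;
    }
    v = cp - BIAS4;                           /* 21-bit payload */
    if (v >= 0x200000UL) return 0;
    out[0] = (unsigned char)(0xF0 | (v >> 18));
    out[1] = (unsigned char)(0x80 | ((v >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((v >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (v & 0x3F));
    return 4;
}

/* Inverse of biased_encode; assumes s holds a complete, well-formed
 * sequence. Returns bytes consumed, or 0 for an impossible first byte. */
static size_t biased_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0) {
        *cp = BIAS2 + (((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F));
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *cp = BIAS3 + (((uint32_t)(s[0] & 0x0F) << 12)
                     | ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F));
        return 3;
    }
    if ((s[0] & 0xF8) == 0xF0) {
        *cp = BIAS4 + (((uint32_t)(s[0] & 0x07) << 18)
                     | ((uint32_t)(s[1] & 0x3F) << 12)
                     | ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F));
        return 4;
    }
    return 0;
}
```

The gain is small: the two-byte form now reaches 0x87F instead of 0x7FF, and the three-byte form reaches 0x1087F instead of 0xFFFF. The cost is that the bytes no longer mean what every UTF-8 tool expects them to mean.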

luser droog

Jul 5, 2016, 2:07:26 PM
On Tuesday, July 5, 2016 at 8:25:06 AM UTC-5, Richard Heathfield wrote:
> On 05/07/16 14:18, Richard Bos wrote:
> > Eric Sosman <eso...@comcast-dot-net.invalid> wrote:
> >
> >> On 7/5/2016 1:30 AM, luser droog wrote:
> >>> I've been reworking my utf8<->ucs4 conversion code in
> >>> preparation for overlaying my own codes over the unused
> >>> portions of the first byte [...]
> >>
> >> comp.lang.tactile.basic is down the hall to your left.
> >
> > Not on my server, it isn't. Now I'm curious. A quick websearch reveals
> > nothing.
>
> Perhaps he's being soscastic, and actually means VB (which has been
> well-described as 'stickle-brick programming' - very tactile in every
> sense (except, it must be admitted, in the sense of the sense of touch)).
>

I am aware of the infelicity of creating a new thing that's
ever-so-slightly incompatible with everything else. And in
the case of single-byte encodings for APL languages, there
are a lot of existing options to choose from:
http://meta.codegolf.stackexchange.com/questions/9428/when-can-apl-characters-be-counted-as-1-byte-each

But none of the existing ones work quite the way I want.
In particular, none cooperates with UTF8-encoded characters
in the same stream. By overlaying my shortcut codes over the
80-BF range of the first byte, it enables a workflow where
a source file can be edited with normal UTF8 tools and then
passed through a compression program to yield the program-specific
encoded form. Similar programs could easily convert to/from other
popular APL encodings if desired.
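
As a rough illustration of the overlay (a sketch only: the table contents, the slot assignments and the function name are all made up here), a decoder can treat a byte in 80-BF that appears where a first byte is expected as an index into a 64-entry table, and handle everything else as ordinary UTF-8. Range checks on the ordinary path are omitted for brevity.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shortcut table: slot i belongs to the byte 0x80 + i.
 * Only three slots are filled in; 0 marks an unassigned slot. */
static const uint32_t shortcut[64] = {
    0x2190,   /* 0x80: LEFTWARDS ARROW */
    0x2373,   /* 0x81: APL FUNCTIONAL SYMBOL IOTA */
    0x2374    /* 0x82: APL FUNCTIONAL SYMBOL RHO */
};

/* Decode one symbol from s[0..n-1]. Returns bytes consumed, 0 on error. */
static size_t overlay_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t v;

    if (n == 0) return 0;

    /* A 10xxxxxx byte in first-byte position is a shortcut code. */
    if ((s[0] & 0xC0) == 0x80) {
        v = shortcut[s[0] - 0x80];
        if (v == 0) return 0;       /* unassigned slot */
        *cp = v;
        return 1;
    }

    /* Otherwise: ordinary UTF-8, judged by bit pattern only. */
    if (s[0] < 0x80)                { len = 1; v = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; v = s[0] & 0x07; }
    else return 0;
    if (len > n) return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        v = (v << 6) | (uint32_t)(s[i] & 0x3F);
    }
    *cp = v;
    return len;
}
```

One consequence worth noting: a shortcut byte is indistinguishable from a follow-on byte when seen in isolation, so a stream in this form can only be decoded reliably from a known character boundary, reading forward.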

I really am attempting to consider compatibility issues and
make the least-constraining choices to achieve the goal
(here: devising a single utf8-compatible encoding for all
input to the interpreter, in one swoop enabling script files,
pasting Unicode into the repl, and 1-byte-per-char counting
of common extended symbols for golfing metrics).

I am not making a new VB.

Eric Sosman

Jul 5, 2016, 2:22:29 PM
On 7/5/2016 2:07 PM, luser droog wrote:
> On Tuesday, July 5, 2016 at 8:25:06 AM UTC-5, Richard Heathfield wrote:
>> On 05/07/16 14:18, Richard Bos wrote:
>>> Eric Sosman <eso...@comcast-dot-net.invalid> wrote:
>>>
>>>> On 7/5/2016 1:30 AM, luser droog wrote:
>>>>> I've been reworking my utf8<->ucs4 conversion code in
>>>>> preparation for overlaying my own codes over the unused
>>>>> portions of the first byte [...]
>>>>
>>>> comp.lang.tactile.basic is down the hall to your left.
>>>
>
> I am not making a new VB.

No, but you've launched a thread that has bugger-all to
do with C.

comp.lang.i.have.no.clue.about.topicality is down the
hall to your left, comp.lang.i.have.no.concern.for.others
is to the right. Pick either direction, don't stop walking.

Richard Damon

Jul 5, 2016, 2:33:02 PM
Yes, it might well have been originally intended to allow the 'overlong'
encodings to exist, till someone realized the possible security flaws
that it could create. There is one variant 'Modified UTF-8' which allows
the null byte within a string to be encoded as C0 80 to allow the normal
null 00 to be used as a string terminator.
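
The C0 80 trick on its own is tiny. The sketch below covers only that substitution and assumes the input is otherwise already valid UTF-8; as far as I know, Java's Modified UTF-8 also treats supplementary characters differently, which is not shown here. The function name is invented.

```c
#include <stddef.h>

/* Copy in[0..n-1] to out, replacing each 00 byte with C0 80, then
 * append a terminating 00. out needs room for 2*n + 1 bytes.
 * Returns the length written, not counting the terminator. */
static size_t nul_to_c080(const unsigned char *in, size_t n,
                          unsigned char *out)
{
    size_t i, j = 0;

    for (i = 0; i < n; i++) {
        if (in[i] == 0x00) {
            out[j++] = 0xC0;       /* overlong two-byte form of U+0000 */
            out[j++] = 0x80;
        } else {
            out[j++] = in[i];
        }
    }
    out[j] = 0x00;                 /* the only real 00 is the terminator */
    return j;
}
```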

One big advantage of the number of invalid sequences in UTF-8 is that it
is easy to detect if a series of bytes is in fact UTF-8 encoded. Almost
any reasonable byte sequence in some other encoding (other than plain
ASCII) will fail to validate as UTF-8.
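
A validator that follows the RFC 3629 grammar line by line might look like the sketch below. The function name is made up; the only special cases are the restricted ranges for the second byte after E0, ED, F0 and F4.

```c
#include <stddef.h>

/* Returns 1 if s[0..n-1] matches UTF8-octets from RFC 3629, else 0. */
static int utf8_valid(const unsigned char *s, size_t n)
{
    size_t i = 0, k, tails;

    while (i < n) {
        unsigned char b = s[i];
        unsigned char lo = 0x80, hi = 0xBF;  /* range for the second byte */

        if (b <= 0x7F) {                     /* UTF8-1 */
            i++;
            continue;
        } else if (b >= 0xC2 && b <= 0xDF) { /* UTF8-2 */
            tails = 1;
        } else if (b >= 0xE0 && b <= 0xEF) { /* UTF8-3 */
            tails = 2;
            if (b == 0xE0) lo = 0xA0;        /* no overlong forms */
            if (b == 0xED) hi = 0x9F;        /* no surrogates */
        } else if (b >= 0xF0 && b <= 0xF4) { /* UTF8-4 */
            tails = 3;
            if (b == 0xF0) lo = 0x90;        /* no overlong forms */
            if (b == 0xF4) hi = 0x8F;        /* nothing above 10FFFF */
        } else {
            return 0;                        /* 80-BF, C0-C1, F5-FF */
        }

        if (n - i - 1 < tails) return 0;     /* sequence is cut short */
        if (s[i + 1] < lo || s[i + 1] > hi) return 0;
        for (k = 2; k <= tails; k++)
            if ((s[i + k] & 0xC0) != 0x80) return 0;
        i += tails + 1;
    }
    return 1;
}
```

For example, the Latin-1 bytes for "café" end in E9, which announces a three-byte sequence that never arrives, so the check fails immediately.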

Malcolm McLean

Jul 5, 2016, 3:21:49 PM
On Tuesday, July 5, 2016 at 6:42:55 PM UTC+1, luser droog wrote:
>
> I can imagine a more compact encoding where the payloads of the
> larger codes are biased to their valid ranges.
>
UTF-8 is designed to be backwards compatible with ascii, even
to the extent of being transparent to most ascii string handling
functions which don't treat character boundaries as special.

It's very easy to design a more compact Unicode encoding.
However in most circumstances, storage and memory are cheap,
and text isn't a significant contributor to data bulk. So
lack of compression isn't a big disadvantage.


supe...@casperkitty.com

Jul 5, 2016, 3:29:47 PM
On Tuesday, July 5, 2016 at 2:21:49 PM UTC-5, Malcolm McLean wrote:
> UTF-8 is designed to be backwards compatible with ascii, even
> to the extent of being transparent to most ascii string handling
> functions which don't treat character boundaries as special.

UTF-8 was also designed with the intention of making it possible to
clip a section of text without regard for character boundaries, recognize
and discard a partial character at the beginning or end, and be able
to properly interpret everything in the clip other than the partial
characters. Unfortunately, support for combining characters and
bidirectional text requires throwing those guarantees out the window.
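
At the level of code points (ignoring combining characters and bidi, as noted), trimming a clip can be sketched as follows. The function name is invented, and it assumes the clip was cut from well-formed UTF-8, so it judges by bit patterns alone.

```c
#include <stddef.h>

/* Given an arbitrary slice s[0..n-1] of UTF-8 text, set [*start, *end)
 * to the part that contains only whole sequences. */
static void utf8_trim_clip(const unsigned char *s, size_t n,
                           size_t *start, size_t *end)
{
    size_t b = 0, e = n, k, need;

    /* Front: skip follow-on bytes of a character cut off before the clip. */
    while (b < n && (s[b] & 0xC0) == 0x80)
        b++;

    /* Back: find the last first-byte and see if its sequence is complete. */
    k = e;
    while (k > b && (s[k - 1] & 0xC0) == 0x80)
        k--;
    if (k > b) {
        unsigned char lead = s[k - 1];
        need = lead < 0x80 ? 1
             : (lead & 0xE0) == 0xC0 ? 2
             : (lead & 0xF0) == 0xE0 ? 3 : 4;
        if (e - (k - 1) < need)
            e = k - 1;              /* drop the partial character */
    }

    *start = b;
    *end = e;
}
```

For the bytes 82 AC C3 A9 E2 82 (the middle of "€é€"), this keeps only C3 A9.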

luser droog

Jul 5, 2016, 4:24:24 PM
On Tuesday, July 5, 2016 at 12:42:55 PM UTC-5, luser droog wrote:

> I see now. Thanks. So the grammar I quoted is essentially the
> formatting step quoted in Ben's message combined with the range-
> checks for the larger encodings.
>
> So, David Brown's range-checking code from the earlier thread
> is indeed sufficient to "cope with the discontinuities".
>

Apologies. It was Richard Damon who wrote the
nice range-checking code in the previous thread.

luser droog

Jul 5, 2016, 4:38:34 PM
On Tuesday, July 5, 2016 at 1:22:29 PM UTC-5, Eric Sosman wrote:
> On 7/5/2016 2:07 PM, luser droog wrote:
> > On Tuesday, July 5, 2016 at 8:25:06 AM UTC-5, Richard Heathfield wrote:
> >> On 05/07/16 14:18, Richard Bos wrote:
> >>> Eric Sosman <eso...@comcast-dot-net.invalid> wrote:
> >>>
> >>>> On 7/5/2016 1:30 AM, luser droog wrote:
> >>>>> I've been reworking my utf8<->ucs4 conversion code in
> >>>>> preparation for overlaying my own codes over the unused
> >>>>> portions of the first byte [...]
> >>>>
> >>>> comp.lang.tactile.basic is down the hall to your left.
> >>>
> >
> > I am not making a new VB.
>
> No, but you've launched a thread that has bugger-all to
> do with C.
>

As I attempted to indicate by linking to a previous thread
(with C code in it), this is a follow-on question about implementing
a common internet standard in C code.

It's not the average message whining about some "mis-" feature of C,
but one about using C to write computer programs.

I do concede the point that my question does not directly
pertain to implementation but is entirely about interpretation
of the spec. But even this I feel, due to the ubiquity of UTF8
as a format and the complete lack of standard C facilities for
working with this format, is not such a strain to topicality
though I may have fared better on stackoverflow.

> comp.lang.i.have.no.clue.about.topicality is down the
> hall to your left, comp.lang.i.have.no.concern.for.others
> is to the right. Pick either direction, don't stop walking.

I have found your posts to be very helpful and informative
in the past. But this reaction seems aggressive beyond what
is warranted.

luser droog

Jul 9, 2016, 2:50:11 AM
On Tuesday, July 5, 2016 at 1:22:29 PM UTC-5, Eric Sosman wrote:
> On 7/5/2016 2:07 PM, luser droog wrote:
> > I am not making a new VB.
>
> No, but you've launched a thread that has bugger-all to
> do with C.
>

On third thought, you're right. I apologize. I had forgotten that
the privilege I had earned in comp.lang.postscript
(https://groups.google.com/d/topic/comp.lang.postscript/BvMSQeWjflQ/discussion)
does not extend throughout google's usenet window.

I will endeavor to apply more caution and selectivity before
posting new topics. Thank you for your diligence.

--
diligent dillinger

Siri Cruise

Jul 9, 2016, 10:13:53 AM
In article <eb346ee0-9fea-4f0f...@googlegroups.com>,
luser droog <luser...@gmail.com> wrote:

> I don't think the bizarrities of this grammar ever
> quite came to light. What's with the A0-BF after E0
> in UTF8-3? Or the 90-BF after F0? Or C0-C1? where did
> they go?

UTF8 is designed to always distinguish an initial octet from a trailing octet
and to determine the total number of octets from the first. That means given a
stream of bytes you can decide where the first character starts, you can be sure
a character is complete, and you don't have to read past the last octet of a
character to know where it ends.
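
One small payoff of that design, as a sketch with an invented function name: because every character has exactly one octet that is not a trailing octet, counting characters needs no decoding at all (assuming the input is well-formed).

```c
#include <stddef.h>

/* Count code points in well-formed UTF-8 by counting the octets that
 * are not trailing octets (i.e. do not look like 10xxxxxx). */
static size_t utf8_count(const unsigned char *s, size_t n)
{
    size_t i, count = 0;

    for (i = 0; i < n; i++)
        if ((s[i] & 0xC0) != 0x80)
            count++;
    return count;
}
```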

--
:-<> Siri Seal of Disavowal #000-001. Disavowed. Denied. Deleted.
'I desire mercy, not sacrifice.'
If you assume the final scene is a dying delusion as Tom Cruise drowns below
the Louvre, then Edge of Tomorrow has a happy ending. Kill Tom repeat..

Malcolm McLean

Jul 9, 2016, 11:54:43 AM
On Saturday, July 9, 2016 at 3:13:53 PM UTC+1, Siri Cruise wrote:
> In article <eb346ee0-9fea-4f0f...@googlegroups.com>,
> luser droog <luser...@gmail.com> wrote:
>
> > I don't think the bizarrities of this grammar ever
> > quite came to light. What's with the A0-BF after E0
> > in UTF8-3? Or the 90-BF after F0? Or C0-C1? where did
> > they go?
>
> UTF8 is designed to always distinguish an initial octet from a
> trailing octet and to determine the total number of octets from
> the first. That means given a stream of bytes you can decide where
> the first character starts, you can be sure a character is complete,
> and you don't have to read past the last octet of a character to
> know where it ends.
>
Yes, it's an extremely good system and very backward-compatible with
ascii. If a UTF-8-naive ascii process chops a UTF-8 stream mid
character, there's obviously no way of recovering from that, and you
couldn't do so except by adding a lot more redundancy to the encodings,
but at least you can detect what has occurred and retrieve all the
rest of the characters.

Rosario19

Jul 18, 2016, 3:04:59 AM
On Tue, 05 Jul 2016 11:38:29 +0100, Ben Bacarisse wrote:

>luser droog <luser...@gmail.com> writes:
>
>> I've been reworking my utf8<->ucs4 conversion code in

>I'm not sure what you mean by "where did they go" (and the grammar is
>messy enough that I've not checked it) so I'll just add a little
>explanation. UTF-8 sequences look like this (originally extended to up
>to six bytes):
>
> 0xxxxxxx
> 110xxxxx 10xxxxxx
> 1110xxxx 10xxxxxx 10xxxxxx
> 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
           ^^       ^^       ^^

If the first byte already gives the number of bytes that follow,
why is there a "10" in each following byte?
Why not something like:

0xxxxxxx
110xxxxx xxxxxxxx
1110xxxx xxxxxxxx xxxxxxxx
11110xxx xxxxxxxx xxxxxxxx xxxxxxxx
?

Is it important for detecting errors in writing/reading?

Malcolm McLean

Jul 18, 2016, 4:02:36 AM
On Monday, July 18, 2016 at 8:04:59 AM UTC+1, Rosario19 wrote:
> On Tue, 05 Jul 2016 11:38:29 +0100, Ben Bacarisse wrote:
>
> >luser droog <luser...@gmail.com> writes:
> >
> >> I've been reworking my utf8<->ucs4 conversion code in
>
> >I'm not sure what you mean by "where did they go" (and the grammar is
> >messy enough that I've not checked it) so I'll just add a little
> >explanation. UTF-8 sequences look like this (originally extended to up
> >to six bytes):
> >
> > 0xxxxxxx
> > 110xxxxx 10xxxxxx
> > 1110xxxx 10xxxxxx 10xxxxxx
> > 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
>            ^^       ^^       ^^
>
> If the first byte already gives the number of bytes that follow,
> why is there a "10" in each following byte?
> Why not something like:
>
> 0xxxxxxx
> 110xxxxx xxxxxxxx
> 1110xxxx xxxxxxxx xxxxxxxx
> 11110xxx xxxxxxxx xxxxxxxx xxxxxxxx
> ?
>
> Is it important for detecting errors in writing/reading?
>
Let's say a UTF-8 naive program chops a string at the 40 character
mark, maybe because it was written at a time when a screen had
40 characters per line and each char was always a character.
Now if the chop happens to be in the middle of a UTF-8 sequence,
obviously we're going to have difficulties recovering the chopped
character without putting the string back together again, and
there's no way round that other than by adding an awful lot of
redundancy to the encoding.
But at least we know that sequence one has its last character starting
with 11110xxx and has only one byte following, while sequence two starts
with a 10xxxxxx byte, then has another 10xxxxxx byte. So we only lose one
character, and the sequence is not corrupted (no spurious characters are added).
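
The same markers let a UTF-8-aware program avoid the problem in the first place. As a sketch (the function name is made up, and the input is assumed to be well-formed), a cut at a byte limit can be moved back to the nearest character boundary by looking at the byte just past the cut:

```c
#include <stddef.h>

/* Largest length <= limit at which s[0..n-1] can be cut without
 * splitting a multi-byte sequence. */
static size_t utf8_safe_cut(const unsigned char *s, size_t n, size_t limit)
{
    size_t cut = limit < n ? limit : n;

    /* If the byte after the cut is a 10xxxxxx byte, the cut falls
     * inside a character; back up until it does not. */
    while (cut > 0 && cut < n && (s[cut] & 0xC0) == 0x80)
        cut--;
    return cut;
}
```

With the layout Rosario proposes, a payload byte could have any value, so there would be no way to tell from s[cut] alone whether the cut is inside a character.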
