Why SOME chars nonASCII?

qhsg...@outlook.com

Jun 23, 2021, 7:17:24 PM

It seems absurd to me that a recent [few years] fad is to make some
chars 2-bytes, amongst existing one-byte-ASCII-strings.
What is the motive for this?
-- CRG

Eli the Bearded

Jun 23, 2021, 8:57:11 PM

Mostly it is because of all of the non-English languages that don't fit
in the seven bits of ASCII. Even the eight bit ISO-8859-x family doesn't
cover lots of well-used languages. UTF-8 gives you most living languages
and many dead ones. UTF-8 isn't strictly "2-bytes", it is a variable
width encoding with ASCII compatibility for ASCII characters. High bit
sequences can be two, three, or four octets.
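
To make the variable width concrete, here is a minimal sketch of the
encoding direction: one code point in, one to four octets out. The name
utf8_encode is made up for illustration and is not part of mailx.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out[] (room for 4 octets).
 * Returns the number of octets written, or 0 if cp cannot be encoded
 * (a UTF-16 surrogate, or a value above U+10FFFF).
 * utf8_encode is a hypothetical helper name. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                    /* 0bbbbbbb: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* 110bbbbb 10bbbbbb */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* surrogates are not characters */
    if (cp < 0x10000) {                 /* 1110bbbb 10bbbbbb 10bbbbbb */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {               /* 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                           /* beyond the Unicode range */
}
```

With this sketch, 'A' (U+0041) stays the single octet 41, e-acute
(U+00E9) becomes C3 A9, and the right single quote (U+2019) becomes
E2 80 99.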

C encoding verifier I wrote:

/* Bit patterns for legitimate UTF-8:
 *
 * non-highbit:
 *     0bbbbbbb
 * two octet highbit:
 *     110bbbbb 10bbbbbb
 * three octet highbit:
 *     1110bbbb 10bbbbbb 10bbbbbb
 * four octet highbit:
 *     11110bbb 10bbbbbb 10bbbbbb 10bbbbbb
 */

/* low bit (no highbit)
 *     0bbbbbbb
 * note that null is low bit
 */
#define UTF8_LOWBIT(oct) (0x00 == ((oct) & 0x80))

/* any continuation octet
 *     10bbbbbb
 */
#define UTF8_CONTINUATION(oct) (0x80 == ((oct) & 0xC0))

/* start of two octet
 *     110bbbbb
 */
#define UTF8_SEQUENCE_2(oct) (0xC0 == ((oct) & 0xE0))

/* start of three octet
 *     1110bbbb
 */
#define UTF8_SEQUENCE_3(oct) (0xE0 == ((oct) & 0xF0))

/* start of four octet
 *     11110bbb
 */
#define UTF8_SEQUENCE_4(oct) (0xF0 == ((oct) & 0xF8))

/* checks a string str of length len for legit UTF-8 bit patterns.
 * null will not terminate the string -- those are legit 7bit ASCII.
 * returns byte offset of first non-legit sequence or -1 if 100% okay.
 */
int
check_utf8(str, len)
    unsigned char* str;
    int len;
{
    int seq, pos, run, octet;
    run = 0;

    for(pos = 0; pos < len; pos++) {
        octet = str[pos];

        /* start of a sequence */
        if(run == 0) {
            seq = pos;
        }

        if(UTF8_LOWBIT(octet)) {
            if( run != 0 ) {
                /* whoops, wanted highbit there */
                return seq;
            }
            continue;
        }

        if(UTF8_CONTINUATION(octet)) {
            if( run ) {
                /* one of our expected run */
                run --;
                continue;
            }
            /* whoops, not the right spot for this */
            return seq;
        }

        if( run ) {
            /* whoops, should have had a continuation octet above */
            return seq;
        }

        if(UTF8_SEQUENCE_2(octet)) {
            run = 1; /* one more */
            continue;
        }

        if(UTF8_SEQUENCE_3(octet)) {
            run = 2; /* two more */
            continue;
        }

        if(UTF8_SEQUENCE_4(octet)) {
            run = 3; /* three more */
            continue;
        }

        /* yikes! fall through! */
        return seq;
    }

    return -1;
} /* check_utf8() */

https://github.com/Eli-the-Bearded/eli-mailx/blob/master/utf-8.c

Elijah
------
using K&R style to match the rest of mailx

Mike Spencer

Jun 23, 2021, 9:23:04 PM

Eli the Bearded <*@eli.users.panix.com> writes:

> In alt.os.linux.slackware, <qhsg...@outlook.com> wrote:
>
>> It seems absurd to me that a recent [few years] fad is to make some
>> chars 2-bytes, amongst existing one-byte-ASCII-strings.
>> What is the motive for this?
>
> Mostly it is because of all of the non-English languages that don't fit
> in the seven bits of ASCII. Even the eight bit ISO-8859-x family doesn't
> cover lots of well-used languages.

Several of my correspondents (using Mac or Windoes) writing in
English do this in their own text and in text/articles copied from the
net.

Oddly, the non-ASCII chars are almost all punctuation: left & right
double & single quotes, em dash, ellipses and the degree symbol. Very
occasionally, there are French or Spanish names with non-ASCII chars
but the big nuisance is the punctuation. And of course, they send it
as quoted-printable.

I have an Emacs macro that finds the QP strings for the punctuation
and reverts them to ASCII before rmail-decode-quoted-printable but
it's a PITA.

> UTF-8 gives you most living languages and many dead ones. UTF-8
> isn't strictly "2-bytes", it is a variable width encoding with ASCII
> compatibility for ASCII characters. High bit sequences can be two,
> three, or four octets.
>
> C encoding verifier I wrote:
>

> [snip]
--
Mike Spencer Nova Scotia, Canada

Eli the Bearded

Jun 23, 2021, 10:19:11 PM

In alt.os.linux.slackware, Mike Spencer <m...@bogus.nodomain.nowhere> wrote:
> Several of my correspondents (using Mac or Windoes) writing in
> English do this in their own text and in text/articles copied from the
> net.

Yes, a related problem. With UTF-8 comes a lot more punctuation options,
and a large number of programs silently "correct" things. Some people
believe very dearly the fancy punctuation is better, others believe
the opposite.

> Oddly, the non-ASCII chars are almost all punctuation: left & right
> double & single quotes, em dash, ellipses and the degree symbol. Very
> occasionally, there are French or Spanish names with non-ASCII chars
> but the big nuisance is the punctuation. And of course, they send it
> as quoted-printable.

Quoted printable is how to make UTF-8 seven bit safe and _mostly_
readable. That's the real goal of QP, making it _mostly_ readable if you
don't have software that can display it. Base64 is not readable and gets
used sometimes.
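
As a rough illustration of why QP stays mostly readable, here is a
simplified sketch of the encoding for one line of text. ASCII passes
through untouched and only the high-bit octets (plus "=" itself) turn
into =XX. The name qp_encode is made up, and the sketch leaves out the
76-column soft line breaks and the line-ending and trailing-whitespace
rules of RFC 2045.

```c
#include <stddef.h>

/* Quoted-printable encode len octets of in[] into out[] (size outsz),
 * NUL-terminating the result.
 * Printable ASCII other than '=' is copied as-is; every other octet,
 * including each octet of a UTF-8 sequence, becomes =XX (uppercase hex).
 * Simplified on purpose: meant for a single line of text, with no soft
 * line breaks and no line-ending or trailing-whitespace handling.
 * Returns the encoded length, or (size_t)-1 if out[] is too small.
 * qp_encode is a hypothetical name. */
size_t qp_encode(const unsigned char *in, size_t len, char *out, size_t outsz)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t i, o = 0;

    if (outsz == 0)
        return (size_t)-1;

    for (i = 0; i < len; i++) {
        unsigned char c = in[i];

        if (c >= 0x20 && c <= 0x7E && c != '=') {
            if (o + 1 >= outsz)         /* need room for c and the NUL */
                return (size_t)-1;
            out[o++] = (char)c;
        } else {
            if (o + 3 >= outsz)         /* need room for =XX and the NUL */
                return (size_t)-1;
            out[o++] = '=';
            out[o++] = hex[c >> 4];
            out[o++] = hex[c & 0x0F];
        }
    }
    out[o] = '\0';
    return o;
}
```

Fed the six octets of "It's" written with a right single quote
(49 74 E2 80 99 73), this produces It=E2=80=99s, which is still
recognizable on a display that cannot decode it.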

> I have an Emacs macro that finds the QP strings for the punctuation
> and reverts them to ASCII before rmail-decode-quoted-printable but
> it's a PITA.

I have vim settings for the same purpose, and a simple Perl script
for use outside of vim. The Perl script will look for my vim
configuration first, and if it doesn't find it use a built in set of
rules.

https://qaz.wtf/tmp/textify

I basically only try to fix punctuation issues I've encountered.
I do not try to replace accented vowels, for example.
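
For anyone who wants the same idea outside of vim or Perl, here is a
minimal table-driven sketch in C. The table entries and the name
asciify_punct are illustrative assumptions, not a copy of the rules in
the script linked above; it only handles a handful of three-octet
punctuation marks and leaves everything else, accented letters
included, alone.

```c
#include <stddef.h>
#include <string.h>

/* One UTF-8 punctuation sequence and its ASCII stand-in.
 * The choice of replacements is an assumption for illustration. */
struct punct_map {
    const char *utf8;
    const char *ascii;
};

static const struct punct_map punct_table[] = {
    { "\xE2\x80\x98", "'"   },  /* U+2018 left single quote */
    { "\xE2\x80\x99", "'"   },  /* U+2019 right single quote */
    { "\xE2\x80\x9C", "\""  },  /* U+201C left double quote */
    { "\xE2\x80\x9D", "\""  },  /* U+201D right double quote */
    { "\xE2\x80\x93", "--"  },  /* U+2013 en dash */
    { "\xE2\x80\x94", "--"  },  /* U+2014 em dash */
    { "\xE2\x80\xA6", "..." },  /* U+2026 ellipsis */
};

/* Rewrite the NUL-terminated string s in place and return its new length.
 * In-place is safe because no replacement is longer than the three
 * octets it replaces. asciify_punct is a hypothetical name. */
size_t asciify_punct(char *s)
{
    const size_t ntable = sizeof punct_table / sizeof punct_table[0];
    char *src = s, *dst = s;

    while (*src) {
        size_t i;

        for (i = 0; i < ntable; i++) {
            if (strncmp(src, punct_table[i].utf8, 3) == 0)
                break;
        }
        if (i < ntable) {
            size_t n = strlen(punct_table[i].ascii);
            memcpy(dst, punct_table[i].ascii, n);
            dst += n;
            src += 3;
        } else {
            *dst++ = *src++;            /* anything else is copied through */
        }
    }
    *dst = '\0';
    return (size_t)(dst - s);
}
```

Two-octet characters such as the degree symbol (C2 B0) would need a
table keyed on sequence length rather than the fixed 3 used here.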

Elijah
------
knows German rules for that, but not, say, French ones

Mike Spencer

Jun 24, 2021, 1:26:40 AM

Eli the Bearded <*@eli.users.panix.com> writes:

> I basically only try to fix punctuation issues I've encountered.
> I do not try to replace accented vowels, for example.

Same. After undoing QP, the UTF-8 punctuation appears in Emacs as 3
escaped octal digits making for hard reading.
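
A small sketch of that kind of display, assuming each high-bit octet is
shown as a backslash followed by three octal digits; the name
octal_escape is made up for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Copy len octets of in[] to out[] (size outsz), NUL-terminated,
 * rendering every octet above 0x7F as a backslash plus three octal
 * digits and leaving ASCII untouched.
 * Returns the output length, or (size_t)-1 if out[] is too small.
 * octal_escape is a hypothetical name. */
size_t octal_escape(const unsigned char *in, size_t len, char *out, size_t outsz)
{
    size_t i, o = 0;

    if (outsz == 0)
        return (size_t)-1;

    for (i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            if (o + 1 >= outsz)         /* room for the octet and the NUL */
                return (size_t)-1;
            out[o++] = (char)in[i];
        } else {
            if (o + 4 >= outsz)         /* room for \ooo and the NUL */
                return (size_t)-1;
            snprintf(out + o, outsz - o, "\\%03o", (unsigned)in[i]);
            o += 4;
        }
    }
    out[o] = '\0';
    return o;
}
```

A single right single quote (E2 80 99) comes out as \342\200\231, so
one curly apostrophe costs twelve characters on screen.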

Richard Kettlewell

Jun 24, 2021, 3:54:06 AM

Eli the Bearded <*@eli.users.panix.com> writes:

> In alt.os.linux.slackware, <qhsg...@outlook.com> wrote:
>> It seems absurd to me that a recent [few years] fad is to make some
>> chars 2-bytes, amongst existing one-byte-ASCII-strings.
>> What is the motive for this?
>
> Mostly it is because of all of the non-English languages that don't fit
> in the seven bits of ASCII. Even the eight bit ISO-8859-x family doesn't
> cover lots of well-used languages. UTF-8 gives you most living languages
> and many dead ones. UTF-8 isn't strictly "2-bytes", it is a variable
> width encoding with ASCII compatibility for ASCII characters. High bit
> sequences can be two, three, or four octets.
>
> C encoding verifier I wrote:

[...]
> https://github.com/Eli-the-Bearded/eli-mailx/blob/master/utf-8.c

That has several bugs...

1) It accepts non-minimal sequences such as F0808080.

2) It accepts sequences mapping to UTF-16 surrogates, such as EDA080.

3) It accepts sequences mapping outside the Unicode code point range,
such as F7808080.

See https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf D92 for the
specification.
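
For illustration, here is a sketch of a stricter check that follows the
table of well-formed UTF-8 byte sequences in that chapter (Table 3-7):
the lead octet decides how many continuation octets follow and which
range the second octet may fall in. The name check_utf8_strict is made
up; this is not the code in utf-8.c, just one way the three cases could
be closed.

```c
#include <stddef.h>

/* Returns the byte offset of the first ill-formed sequence in str[0..len),
 * or -1 if the whole buffer is well-formed UTF-8. NUL octets are ordinary
 * ASCII and do not end the scan.
 * check_utf8_strict is a hypothetical name. */
long check_utf8_strict(const unsigned char *str, size_t len)
{
    size_t pos = 0;

    while (pos < len) {
        unsigned char lead = str[pos];
        unsigned char lo = 0x80, hi = 0xBF; /* allowed range of 2nd octet */
        size_t need, i;

        if (lead <= 0x7F) {                 /* ASCII */
            pos++;
            continue;
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            need = 1;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            need = 2;
            if (lead == 0xE0) lo = 0xA0;    /* no non-minimal 3-octet forms */
            if (lead == 0xED) hi = 0x9F;    /* no UTF-16 surrogates */
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            need = 3;
            if (lead == 0xF0) lo = 0x90;    /* no non-minimal 4-octet forms */
            if (lead == 0xF4) hi = 0x8F;    /* nothing above U+10FFFF */
        } else {
            return (long)pos;               /* 80..C1 or F5..FF as lead */
        }

        if (len - pos <= need)
            return (long)pos;               /* sequence cut off by end */
        if (str[pos + 1] < lo || str[pos + 1] > hi)
            return (long)pos;
        for (i = 2; i <= need; i++) {
            if ((str[pos + i] & 0xC0) != 0x80)
                return (long)pos;
        }
        pos += need + 1;
    }
    return -1;
}
```

All three examples above (F0808080, EDA080, F7808080) are reported at
offset 0 by this sketch, as is the non-minimal two-octet form C080. It
also reports a sequence that is cut off by the end of the buffer.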

--
https://www.greenend.org.uk/rjk/

Eli the Bearded

Jul 1, 2021, 4:44:02 PM

In alt.os.linux.slackware, Richard Kettlewell <inv...@invalid.invalid> wrote:
> Eli the Bearded <*@eli.users.panix.com> writes:
>> C encoding verifier I wrote:
>> https://github.com/Eli-the-Bearded/eli-mailx/blob/master/utf-8.c
>
> That has several bugs...
>
> 1) It accepts non-minimal sequences such as F0808080.
> 2) It accepts sequences mapping to UTF-16 surrogates, such as EDA080.
> 3) It accepts sequences mapping outside the Unicode code point range,
> such as F7808080.

Interesting critique. I may fix those, but I'm not sure they'll ever be
relevant to the level of strictness I need. I'm looking to catch
mislabeled "charset"s not devious attacks.

> See https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf D92 for the
> specification.

The Unicode website was briefly down, so I had put off responding to
this until I could check that. Would have been nice if you had included
a page number since that document is large and doesn't include a TOC.
Page 123 as numbered in document, page 54 as numbered by my PDF reader.

Elijah
------
recalls now non-minimal UTF-8 being used to escape Apache document root once

Richard Kettlewell

Jul 2, 2021, 4:44:16 AM

Eli the Bearded <*@eli.users.panix.com> writes:
> Richard Kettlewell <inv...@invalid.invalid> wrote:
>> Eli the Bearded <*@eli.users.panix.com> writes:
>>> C encoding verifier I wrote:
>>> https://github.com/Eli-the-Bearded/eli-mailx/blob/master/utf-8.c
>>
>> That has several bugs...
>>
>> 1) It accepts non-minimal sequences such as F0808080.
>> 2) It accepts sequences mapping to UTF-16 surrogates, such as EDA080.
>> 3) It accepts sequences mapping outside the Unicode code point range,
>> such as F7808080.
>
> Interesting critique. I may fix those, but I'm not sure they'll ever be
> relevant to the level of strictness I need. I'm looking to catch
> mislabeled "charset"s not devious attacks.

It was advertized as checking for “legitimate UTF-8”, not “UTF-8 but
also some other stuff that is not UTF-8”.

>> See https://www.unicode.org/versions/Unicode13.0.0/ch03.pdf D92 for the
>> specification.
>
> The Unicode website was briefly down, so I had put off responding to
> this until I could check that. Would have been nice if you had
> included a page number since that document is large and doesn't
> include a TOC. Page 123 as numbered in document, page 54 as numbered
> by my PDF reader.

I didn’t think D92 would be hard to search for.

--
https://www.greenend.org.uk/rjk/

Sylvain Robitaille

Jul 7, 2021, 5:28:06 PM

On 2021-06-24, Eli the Bearded wrote:

>> I have an Emacs macro that finds the QP strings for the punctuation
>> and reverts them to ASCII before rmail-decode-quoted-printable but
>> it's a PITA.
>

> I have vim settings for the same purpose, ...

Care to share your vim settings? I see that your Perl script reads it
in, or defaults to its own, but I'm certainly curious about what you've
done in vim ...

--
----------------------------------------------------------------------
Sylvain Robitaille s...@encs.concordia.ca

Systems analyst / AITS Concordia University
Faculty of Engineering and Computer Science Montreal, Quebec, Canada
----------------------------------------------------------------------

Richmond

Jul 7, 2021, 5:45:26 PM

Mike Spencer <m...@bogus.nodomain.nowhere> writes:

> Several of my correspondents (using Mac or Windoes) writing in
> English do this in their own text and in text/articles copied from the
> net.
>
> Oddly, the non-ASCII chars are almost all punctuation: left & right
> double & single quotes, em dash, ellipses and the degree symbol. Very
> occasionally, there are French or Spanish names with non-ASCII chars
> but the big nuisance is the punctuation. And of course, they send it
> as quoted-printable.
>

Surely as most of the web is utf-8 it is good to use that as a standard.

There is no £ in seven bit ascii, there is in extended ascii, and in
iso, but it causes confusion when email programs do not state the
encoding used.
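
To put bytes on that: the pound sign is the single octet A3 in
ISO-8859-1, 9C in code page 437 (one common thing people mean by
"extended ASCII"), and the two octets C2 A3 in UTF-8. A rough sketch of
converting from ISO-8859-1, assuming the sender really did mean that
charset; the name latin1_to_utf8 is made up.

```c
#include <stddef.h>

/* Convert len octets of ISO-8859-1 text in in[] to UTF-8 in out[]
 * (size outsz). Every Latin-1 octet maps to the code point of the same
 * value, so high-bit octets become two-octet sequences.
 * Returns the number of octets written (no NUL is added), or
 * (size_t)-1 if out[] is too small.
 * latin1_to_utf8 is a hypothetical name. */
size_t latin1_to_utf8(const unsigned char *in, size_t len,
                      unsigned char *out, size_t outsz)
{
    size_t i, o = 0;

    for (i = 0; i < len; i++) {
        unsigned char c = in[i];

        if (c < 0x80) {
            if (o + 1 > outsz)
                return (size_t)-1;
            out[o++] = c;               /* ASCII is unchanged */
        } else {
            if (o + 2 > outsz)
                return (size_t)-1;
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return o;
}
```

The lone octet A3 is not valid UTF-8 by itself, which is why an unlabeled
or mislabeled message shows up as garbage on the receiving end.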

Eli the Bearded

Jul 8, 2021, 1:15:28 PM

In alt.os.linux.slackware, Sylvain Robitaille <s...@encs.concordia.ca> wrote:
> On 2021-06-24, Eli the Bearded wrote:
> > I have vim settings for the same purpose, ...
> Care to share your vim settings? I see that your Perl script reads it
> in, or defaults to its own, but I'm certainly curious about what you've
> done in vim ...

The complete vim settings are basically the same as in the perl script,
but here:

base64 -d <<_B64_VIMRC > highbit_vimrc
IiBzbWFydCBxdW90ZXMKbWFwISDigJkgJwptYXAhIOKAmCAnCm1hcCEg4oCcICIKbWFwISDi
gJ0gIgptYXAhIOKAsyAiCiIgYnVsbGV0Cm1hcCEg4pePICoKIiBlbGxpcHNpcwptYXAhIOKA
piAuLi4KIiBuLWRhc2gKbWFwISDigJMgLS0KIiBtLWRhc2gKbWFwISDigJQgLS0KIiBVKzIy
MTIgbWludXMKbWFwISDiiJIgLQoiIFUrMjAxMCBoeXBoZW4KbWFwISDigJAgLQoiIGx5bngg
YnJva2VuIFVURi04Cm1hcCEgw6LCgMKcICIKbWFwISDDosKAwp0gIgptYXAhIMOiwoDCmSAn
Cm1hcCEgw6LCgMKUIC0tCm1hcCEgw6LCgMKmIC4uLgoiCiIgZmluZCBub24tYXNjaWkKbWFw
IDxGNT4gL1teCSAtfl08Y3I+Cg==
_B64_VIMRC

Elijah
------
yay for multiple encodings raw in one file

Sylvain Robitaille

Jul 12, 2021, 6:59:03 PM

On 2021-07-08, Eli the Bearded wrote:

> The complete vim settings are basically the same as in the perl script,
> but here:
>
> base64 -d <<_B64_VIMRC > highbit_vimrc
> IiBzbWFydCBxdW90ZXMKbWFwISDigJkgJwptYXAhIOKAmCAnCm1hcCEg4oCcICIKbWFwISDi
> gJ0gIgptYXAhIOKAsyAiCiIgYnVsbGV0Cm1hcCEg4pePICoKIiBlbGxpcHNpcwptYXAhIOKA
> piAuLi4KIiBuLWRhc2gKbWFwISDigJMgLS0KIiBtLWRhc2gKbWFwISDigJQgLS0KIiBVKzIy
> MTIgbWludXMKbWFwISDiiJIgLQoiIFUrMjAxMCBoeXBoZW4KbWFwISDigJAgLQoiIGx5bngg
> YnJva2VuIFVURi04Cm1hcCEgw6LCgMKcICIKbWFwISDDosKAwp0gIgptYXAhIMOiwoDCmSAn
> Cm1hcCEgw6LCgMKUIC0tCm1hcCEgw6LCgMKmIC4uLgoiCiIgZmluZCBub24tYXNjaWkKbWFw
> IDxGNT4gL1teCSAtfl08Y3I+Cg==
> _B64_VIMRC

Beautiful. Thank you.