Mostly it is because of all of the non-English languages that don't fit
in the seven bits of ASCII. Even the eight-bit ISO-8859-x family doesn't
cover lots of well-used languages. UTF-8 gives you most living languages
and many dead ones. UTF-8 isn't strictly "2-bytes"; it is a variable-width
encoding in which ASCII characters keep their single-octet ASCII values.
High-bit sequences can be two, three, or four octets.
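To make the variable width concrete, here is a minimal sketch of an
encoder. It is not from mailx: the name utf8_encode and its signature
are made up for illustration, and it uses ANSI prototypes rather than
K&R style.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only.
 * Encodes code point cp into out (needs room for 4 octets).
 * Returns the number of octets written, or 0 if cp is a
 * surrogate (U+D800..U+DFFF) or above U+10FFFF.
 */
size_t
utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);

    if (cp < 0x80) {            /* 0bbbbbbb */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {           /* 110bbbbb 10bbbbbb */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return 0;               /* surrogates are not encodable */
    }
    if (cp < 0x10000) {         /* 1110bbbb 10bbbbbb 10bbbbbb */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {       /* 11110bbb 10bbbbbb 10bbbbbb 10bbbbbb */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                   /* beyond the Unicode range */
}
```

So "A" (U+0041) stays one octet, e-acute (U+00E9) takes two, the euro
sign (U+20AC) takes three, and anything past U+FFFF takes four.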
C encoding verifier I wrote:
/* Bit patterns for legitimate UTF-8:
 *
 * non-highbit:
 *     0bbbbbbb
 * two octet highbit:
 *     110bbbbb 10bbbbbb
 * three octet highbit:
 *     1110bbbb 10bbbbbb 10bbbbbb
 * four octet highbit:
 *     11110bbb 10bbbbbb 10bbbbbb 10bbbbbb
 */

/* low bit (no highbit)
 *     0bbbbbbb
 * note that null is low bit
 */
#define UTF8_LOWBIT(oct) (0x00 == ((oct) & 0x80))

/* any continuation octet
 *     10bbbbbb
 */
#define UTF8_CONTINUATION(oct) (0x80 == ((oct) & 0xC0))

/* start of two octet
 *     110bbbbb
 */
#define UTF8_SEQUENCE_2(oct) (0xC0 == ((oct) & 0xE0))

/* start of three octet
 *     1110bbbb
 */
#define UTF8_SEQUENCE_3(oct) (0xE0 == ((oct) & 0xF0))

/* start of four octet
 *     11110bbb
 */
#define UTF8_SEQUENCE_4(oct) (0xF0 == ((oct) & 0xF8))

/* checks a string str of length len for legit UTF-8 bit patterns.
 * null will not terminate the string -- those are legit 7bit ASCII.
 * returns byte offset of first non-legit sequence or -1 if 100% okay.
 */
int
check_utf8(str, len)
    unsigned char* str;
    int len;
{
    int seq, pos, run, octet;

    seq = 0;
    run = 0;
    for(pos = 0; pos < len; pos++) {
        octet = str[pos];

        /* start of a sequence */
        if(run == 0) {
            seq = pos;
        }

        if(UTF8_LOWBIT(octet)) {
            if( run != 0 ) {
                /* whoops, wanted highbit there */
                return seq;
            }
            continue;
        }

        if(UTF8_CONTINUATION(octet)) {
            if( run ) {
                /* one of our expected run */
                run --;
                continue;
            }
            /* whoops, not the right spot for this */
            return seq;
        }

        if( run ) {
            /* whoops, should have had a continuation octet above */
            return seq;
        }

        if(UTF8_SEQUENCE_2(octet)) {
            run = 1; /* one more */
            continue;
        }
        if(UTF8_SEQUENCE_3(octet)) {
            run = 2; /* two more */
            continue;
        }
        if(UTF8_SEQUENCE_4(octet)) {
            run = 3; /* three more */
            continue;
        }

        /* yikes! fall through! */
        return seq;
    }

    if( run != 0 ) {
        /* whoops, string ended in the middle of a sequence */
        return seq;
    }
    return -1;
} /* check_utf8() */
https://github.com/Eli-the-Bearded/eli-mailx/blob/master/utf-8.c
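Note that this only checks bit patterns. RFC 3629 also rules out
overlong forms, surrogates, and anything above U+10FFFF, all of which
can be caught by looking at the lead octet and the first continuation
octet. A rough sketch of that extra check follows; the helper name
utf8_strict_pair is hypothetical and not in the mailx source, and it is
written with an ANSI prototype rather than K&R.

```c
/* Hypothetical helper, for illustration only.
 * Given the lead octet of a multi-octet sequence and the octet that
 * follows it, returns 1 if the pair is allowed by RFC 3629, else 0.
 * Single-octet (ASCII) leads are not multi-octet leads, so they
 * return 0 here too.
 */
int
utf8_strict_pair(unsigned char lead, unsigned char second)
{
    unsigned char lo = 0x80;    /* normal continuation range */
    unsigned char hi = 0xBF;

    /* 00..7F ASCII, 80..BF continuation, C0/C1 always overlong,
     * F5..FF would encode values past U+10FFFF
     */
    if (lead < 0xC2 || lead > 0xF4) {
        return 0;
    }

    switch (lead) {
    case 0xE0: lo = 0xA0; break;    /* lower would be overlong */
    case 0xED: hi = 0x9F; break;    /* higher would be a surrogate */
    case 0xF0: lo = 0x90; break;    /* lower would be overlong */
    case 0xF4: hi = 0x8F; break;    /* higher is past U+10FFFF */
    default: break;
    }

    return second >= lo && second <= hi;
}
```

For example, C0 80 (an overlong NUL) and ED A0 80 (a surrogate) have
legit bit patterns but fail this check on their first two octets.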
Elijah
------
using K&R style to match the rest of mailx