code review utf-8 codex

123 views
Skip to first unread message

luser droog

unread,
Jul 18, 2015, 4:06:34 AM7/18/15
to
I tried to write this straight from the table in the rfc
(illustrating "my style"). What do y'all thinks?

$ cat io.c
//cf. http://www.ietf.org/rfc/rfc3629.txt p.3
#include<stdlib.h>

#define UCS_RANGE(_) \
_(0x00000000,0x0000007F,1, 0x80,0x00,0x7F) \
_(0x00000080,0x000007FF,2, 0xE0,0xC0,0x1F, 0xC0,0x80,0x3F) \
_(0x00000800,0x0000FFFF,3, 0xF0,0xE0,0x0F, 0xC0,0x80,0x3F, 0xC0,0x80,0x3F) \
_(0x00010000,0x0010FFFF,4, 0xF8,0xF0,0x07, 0xC0,0x80,0x3F, 0xC0,0x80,0x3F, 0xC0,0x80,0x3F)

#define MATCH1(msk,sig,val) \
if ((*p&msk)==sig) { *u++=*p&val; }
#define MATCH2(msk0,sig0,val0, msk1,sig1,val1) \
if ((*p&msk0)==sig0) { *u++=((*p&val0)<<6)|(p[1]&val1); ++p; }
#define MATCH3(msk0,sig0,val0, msk1,sig1,val1, msk2,sig2,val2) \
if ((*p&msk0)==sig0) { *u++=((*p&val0)<<12)|((p[1]&val1)<<6)|(p[2]&val2); ++p;++p; }
#define MATCH4(msk0,sig0,val0, msk1,sig1,val1, msk2,sig2,val2, msk3,sig3,val3) \
if ((*p&msk0)==sig0) { *u++=((*p*val0)<<18)|((p[1]&val1)<<12)|((p[2]&val2)<<6)|(p[3]&val3); ++p;++p;++p; }

#define UTF_CASE(a,z,n,...) \
else MATCH##n(__VA_ARGS__)

int *ucs4_utf8(char *str, int n){
char *p=str;
int *u,*buf=u=malloc(n*sizeof*u);
for (;*p;p++) {
if (0); UCS_RANGE(UTF_CASE)
}
return u;
}

#define OUTBYTE1(msk,sig,val) \
*p++=sig|(x&val);
#define OUTBYTE2(msk0,sig0,val0, msk1,sig1,val1) \
*p++=sig0|((x>>6)&val0); \
*p++=sig1|(x&val1);
#define OUTBYTE3(msk0,sig0,val0, msk1,sig1,val1, msk2,sig2,val2) \
*p++=sig0|((x>>12)&val0); \
*p++=sig1|((x>>6)&val1); \
*p++=sig2|(x&val2);
#define OUTBYTE4(msk0,sig0,val0, msk1,sig1,val1, msk2,sig2,val2, msk3,sig3,val3) \
*p++=sig0|((x>>18)&val0); \
*p++=sig1|((x>>12)&val1); \
*p++=sig2|((x>>6)&val2); \
*p++=sig3|(x&val3);

#define UCS_CASE(a,z,n,...) \
else if (ar[i]<z) { int x=ar[i]; OUTBYTE##n(__VA_ARGS__) }

char *utf8_ucs4(int *ar, int n, int *an){
char *p,*buf=p=malloc(n*4+1);
if (buf) {
int i;
for (i=0; i<n; i++) {
if (0); UCS_RANGE(UCS_CASE)
}
if (an)
*an=p-buf;
*p++=0;
}
return buf;
}


Rick C. Hodgin

unread,
Jul 18, 2015, 4:41:58 AM7/18/15
to
You have the worst coding style I've ever seen.
I would rewrite all of your code from scratch
rather than trying to maintain it. It reminds me
of those contests where people create code with
zero whitespaces to test out compiler parsers.

I think you must have an amazing mind to think
the way you do. But I think you need to place a
higher emphasis on readability, and on long-term
maintenance.

Best regards,
Rick C. Hodgin
Message has been deleted

Bartc

unread,
Jul 18, 2015, 5:44:43 AM7/18/15
to
This bit of code wasn't too bad.

But it seems he does prefer writing in macros rather than C!

--
Bartc

Rick C. Hodgin

unread,
Jul 18, 2015, 6:36:57 AM7/18/15
to
Bartc wrote:
> This bit of code wasn't too bad.
>
> But it seems he does prefer writing in macros
> rather than C!

This code is pretty short. It might be easy enough
to maintain. But it's borderline in that regard in
my book... and for anything longer...

There's definitely a method to his/her work. But,
the overall style is beyond reason. It looks like
code that's been made purposefully difficult to
understand. And for my dyslexic eyes, it would
take too long to sort out to be worth my effort
in understanding as I could likely rewrite it faster,
and claim "ownership" in the process (relating to
understanding, and effective long-term maintenance.

Richard Damon

unread,
Jul 18, 2015, 9:46:21 AM7/18/15
to
On 7/18/15 4:06 AM, luser droog wrote:
> I tried to write this straight from the table in the rfc
> (illustrating "my style"). What do y'all thinks?
>

I agree with the others that this seems more of an exercise to write
tricky macros rather than clean code. Considering the limited number of
cases, and that only macro really reused is the UCS_RANGE macro, it
would be much clearer to just write the expanded code.

Second, the code has a portability issue in that it assumes that an int
is big enough to store a UCS-4 character (better to use something like
int_least32_t). The code will fail if int only has 16 bit.

Third, the code has ZERO error detection. You should check that malloc
succeeded be going on,, and that you don't write more than n USC4
characters into the buffer in ucs4_utf8.

Lastly, you code does nothing to check that it is provided valid utf8,
you are only looking at the first byte and then assume that the
following bytes are correct, and silently discard bytes if you find a
leading byte with the value of a following byte.

Another part of this last error is that you don't detect improperly
encoded byte sequences, for example a UCS value of 0x00000001 which
should be encoded as 0x01 but instead encoded as 0xC0 0x81. Detecting
this is actually a REQUIRED part of the standard. (There are a few
variants to UTF-8 that allow CERTAIN of these illegal combinations for
programatic purposes, for instance allowing a NUL character to be
encoded as 0xC0 0x80 so that strings can contain NUL characters will
still being terminated with a 0x00 byte, but these still require that
most bad encoding be detected).

luser droog

unread,
Jul 18, 2015, 9:44:47 PM7/18/15
to
On Saturday, July 18, 2015 at 3:06:34 AM UTC-5, luser droog wrote:
> I tried to write this straight from the table in the rfc
> (illustrating "my style"). What do y'all thinks?
>

Here's a revised version with comment. Still working on the bulk
of Richard Damon's suggestions.

In a way, I think the addition of comments decreases readability
by increasing the overall size and separating things that are
closely related (like the values in the master table and the next
set of macros where each piece is given a name.

//cf. http://www.ietf.org/rfc/rfc3629.txt p.3
#include<stdlib.h>

/*
<-------- adapters ("apps-"hungarian naming scheme)
utf8_ucs4
ucs4_utf8
*/
int_least32_t *ucs4_utf8(char *str, int n);
char *utf8_ucs4(int_least32_t *ar, int n, int *an);

/* Master table macro describing the translation.
The arguments to the parameter macro are the lower bound of the UCS range for this case,
upper bound, number of corresponding encoded utf-8 bytes, followed by quadruples describing
each byte. The quadruple is the mask for the byte's header field, the value of the signature
matching the masked bits, and the mask for the value section (== logical NOT of the header
mask), and shift or number of lower bits allocated to later bytes.
*/
#define UCS_RANGE(_) \
_(0x00000000,0x0000007F,1, 0x80,0x00,0x7F,0) \
_(0x00000080,0x000007FF,2, 0xE0,0xC0,0x1F,6, 0xC0,0x80,0x3F,0) \
_(0x00000800,0x0000FFFF,3, 0xF0,0xE0,0x0F,12, 0xC0,0x80,0x3F,6, 0xC0,0x80,0x3F,0) \
_(0x00010000,0x0010FFFF,4, 0xF8,0xF0,0x07,18, 0xC0,0x80,0x3F,12, 0xC0,0x80,0x3F,6, 0xC0,0x80,0x3F,0)

/* Macros to match each of the 4 valid forms of utf-8 encoding.
If the first byte ANDed with the header mask is equal to the signature, then this
is the correct case. Decode the bytes, updating the buffer and u pointer, and
increment the p pointer so one more increment will point to the next byte.
*/
#define MATCH1(msk,sig,val,shf) \
if ((*p&msk)==sig) { *u++=*p&val; }
#define MATCH2(msk0,sig0,val0,shf0, msk1,sig1,val1,shf1) \
if ((*p&msk0)==sig0) { *u++=((*p&val0)<<shf0) | (p[1]&val1); ++p; }
#define MATCH3(msk0,sig0,val0,shf0, msk1,sig1,val1,shf1, msk2,sig2,val2,shf2) \
if ((*p&msk0)==sig0) { *u++=((*p&val0)<<shf0) | ((p[1]&val1)<<shf1) | (p[2]&val2); p+=2; }
#define MATCH4(msk0,sig0,val0,shf0, msk1,sig1,val1,shf1, msk2,sig2,val2,shf2, msk3,sig3,val3,shf3) \
if ((*p&msk0)==sig0) { *u++=((*p*val0)<<shf0) | ((p[1]&val1)<<shf1) | ((p[2]&val2)<<shf2) | (p[3]&val3); p+=3; }

/* Invoke the appropriate MATCH macro for each case.
*/
#define UTF_CASE(a,z,n,...) \
else MATCH##n(__VA_ARGS__)

/* Allocate buffer.
Iterate through input string,
decode each utf-8 sequence using the table
Return pointer to buffer.
*/
int_least32_t *ucs4_utf8(char *str, int n){
char *p=str;
int_least32_t *u,*buf=u=malloc(n*sizeof*u);
if (buf) {
for (;*p;p++) {
if (0); UCS_RANGE(UTF_CASE)
}
}
return buf;
}

/* Macros to output each of the 4 valid forms of utf-8 encoding.
Encode the appropriate signatures and masked+shifted portions of the UCS-4 value,
updating the p pointer.
*/
#define OUTBYTE1(msk,sig,val,shf) \
*p++=sig|(x&val);
#define OUTBYTE2(msk0,sig0,val0,shf0, msk1,sig1,val1,shf1) \
*p++=sig0|((x>>shf0)&val0); \
*p++=sig1|(x&val1);
#define OUTBYTE3(msk0,sig0,val0,shf0, msk1,sig1,val1,shf1, msk2,sig2,val2,shf2) \
*p++=sig0|((x>>shf0)&val0); \
*p++=sig1|((x>>shf1)&val1); \
*p++=sig2|(x&val2);
#define OUTBYTE4(msk0,sig0,val0,shf0, msk1,sig1,val1,shf1, msk2,sig2,val2,shf2, msk3,sig3,val3,shf3) \
*p++=sig0|((x>>shf0)&val0); \
*p++=sig1|((x>>shf1)&val1); \
*p++=sig2|((x>>shf2)&val2); \
*p++=sig3|(x&val3);

/* Invoke the appropriate OUTBYTE macro for each case
*/
#define UCS_CASE(a,z,n,...) \
else if (ar[i]<z) { int_least32_t x=ar[i]; OUTBYTE##n(__VA_ARGS__) }

/* Allocate buffer.
Iterate through input array,
encode each UCS-4 value using the table
Return pointer to buffer.
*/
char *utf8_ucs4(int_least32_t *ar, int n, int *an){
char *p,*buf=p=malloc(n*4+1);
if (buf) {
int i;
for (i=0; i<n; i++) {
if (0); UCS_RANGE(UCS_CASE)
}
if (an)
*an=p-buf;
*p++=0;
}
return buf;
}


And here's the same code processed with `cpp -P` and
`indent -gnu -i4 -br -ce -cdw -nbc -brf -brs -l100 -bbo`.

It's certainly smaller, but now it's chock-full of magic numbers.


int_least32_t *ucs4_utf8 (char *str, int n);
char *utf8_ucs4 (int_least32_t * ar, int n, int *an);
int_least32_t *
ucs4_utf8 (char *str, int n) {
char *p = str;
int_least32_t *u, *buf = u = malloc (n * sizeof *u);
if (buf) {
for (; *p; p++) {
if (0);
else if ((*p & 0x80) == 0x00) {
*u++ = *p & 0x7F;
} else if ((*p & 0xE0) == 0xC0) {
*u++ = ((*p & 0x1F) << 6) | (p[1] & 0x3F);
++p;
} else if ((*p & 0xF0) == 0xE0) {
*u++ = ((*p & 0x0F) << 12) | ((p[1] & 0x3F) << 6) | (p[2] & 0x3F);
p += 2;
} else if ((*p & 0xF8) == 0xF0) {
*u++ =
((*p *
0x07) << 18) | ((p[1] & 0x3F) << 12) | ((p[2] & 0x3F) << 6) | (p[3] & 0x3F);
p += 3;
}
}
}
return buf;
}

char *
utf8_ucs4 (int_least32_t * ar, int n, int *an) {
char *p, *buf = p = malloc (n * 4 + 1);
if (buf) {
int i;
for (i = 0; i < n; i++) {
if (0);
else if (ar[i] < 0x0000007F) {
int_least32_t x = ar[i];
*p++ = 0x00 | (x & 0x7F);
} else if (ar[i] < 0x000007FF) {
int_least32_t x = ar[i];
*p++ = 0xC0 | ((x >> 6) & 0x1F);
*p++ = 0x80 | (x & 0x3F);
} else if (ar[i] < 0x0000FFFF) {
int_least32_t x = ar[i];
*p++ = 0xE0 | ((x >> 12) & 0x0F);
*p++ = 0x80 | ((x >> 6) & 0x3F);
*p++ = 0x80 | (x & 0x3F);
} else if (ar[i] < 0x0010FFFF) {
int_least32_t x = ar[i];
*p++ = 0xF0 | ((x >> 18) & 0x07);
*p++ = 0x80 | ((x >> 12) & 0x3F);
*p++ = 0x80 | ((x >> 6) & 0x3F);
*p++ = 0x80 | (x & 0x3F);
}
}
if (an)
*an = p - buf;
*p++ = 0;
}
return buf;
}


Richard Damon

unread,
Jul 18, 2015, 10:44:30 PM7/18/15
to
On 7/18/15 9:44 PM, luser droog wrote:
> On Saturday, July 18, 2015 at 3:06:34 AM UTC-5, luser droog wrote:
>> I tried to write this straight from the table in the rfc
>> (illustrating "my style"). What do y'all thinks?
>>
>
> Here's a revised version with comment. Still working on the bulk
> of Richard Damon's suggestions.
>
> In a way, I think the addition of comments decreases readability
> by increasing the overall size and separating things that are
> closely related (like the values in the master table and the next
> set of macros where each piece is given a name.

Comments that say just what the code is doing that could be gotten by
just looking at the code are sort of useless, what comments should
express is WHY you are doing things and to express the 'API' of the
thing. If you have a bunch of interrelated things, you document that
cluster of things up front, with maybe a very short comment for each
item or group of related items.

> And here's the same code processed with `cpp -P` and
> `indent -gnu -i4 -br -ce -cdw -nbc -brf -brs -l100 -bbo`.
>
> It's certainly smaller, but now it's chock-full of magic numbers.
>
It has no more 'magic numbers' than the original, they may be a bit more
spread out but not by that much. Most of them actually aren't all that
'magic' but fall out of simple math based on 6 bits of character info
per extra byte added.

Even without the comments, I see this code as much more readable.

Note also, the redundancy becomes a bit clearer. Note that the routines
can be made a lot smaller by using some of the patterns in the structure
of the standard. For instance, converting a utf-8 to a ucs-4 can be done
simply as follows:


/* Given p points to the utf-8 buffer, ucs is the output UCS-4 value */

int len = 0;
int ucs;

if( (*p & 0x80) == 0x00) { len = 1; ucs = *p & 0x7F; }
else if((*p & 0xE0) == 0xC0) { len = 2; ucs = *p & 0x1F; }
else if((*p & 0xF0) == 0xE0) { len = 3; ucs = *p & 0x0F; }
else if((*p & 0xF8) == 0xF0) { len = 4; ucs = *p & 0x07; }
else { /* *p is a bad leading character */ }
p++;
int i;
for ( i = 1; i < len; i++){
if( (*p & 0xC0) != 0x80) { /* *p is a bad following character */ }
u = (u << 6) | (*p++ & 0x3F);
}

/* now check for over long encoding */
if((len == 2 && u < 0x00080) ||
(len == 3 && u < 0x00800) ||
(len == 4 && u < 0x10000)) {
/* over length encoding */
}

This is close to the size of your function, but has full error
detection. Drop the error detection and it is shorter than yours. Also
ALL the code is in one place to look at, as opposed to being spread out
between the point of invocation of the macro, and the definition of 3
GROUPS of macros.

Malcolm McLean

unread,
Jul 19, 2015, 3:16:25 AM7/19/15
to
This is mine.
It's nowhere near as elegant as yours and it doesn't have error detection.
I'm not sure how useful error detection is at this stage in real
programs. (In Baby X, there's a routine to check is a string is valid
UFT-8, but if someone want to display a button, you can't be
error-checking the string every time it's redrawn, the user just has
to see funny glyphs on screen)

static const unsigned int offsetsFromUTF8[6] =
{
0x00000000UL, 0x00003080UL, 0x000E2080UL,
0x03C82080UL, 0xFA082080UL, 0x82082080UL
};

static const unsigned char trailingBytesForUTF8[256] = {
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
};

int bbx_utf8_getch(const char *utf8)
{
int ch;
int nb;

nb = trailingBytesForUTF8[(unsigned char)*utf8];
ch = 0;
switch (nb)
{
/* these fall through deliberately */
case 3: ch += (unsigned char)*utf8++; ch <<= 6;
case 2: ch += (unsigned char)*utf8++; ch <<= 6;
case 1: ch += (unsigned char)*utf8++; ch <<= 6;
case 0: ch += (unsigned char)*utf8++;
}
ch -= offsetsFromUTF8[nb];

return ch;
}

Richard Damon

unread,
Jul 19, 2015, 4:04:26 PM7/19/15
to
On 7/19/15 3:16 AM, Malcolm McLean wrote:
>
> This is mine.
> It's nowhere near as elegant as yours and it doesn't have error detection.
> I'm not sure how useful error detection is at this stage in real
> programs. (In Baby X, there's a routine to check is a string is valid
> UFT-8, but if someone want to display a button, you can't be
> error-checking the string every time it's redrawn, the user just has
> to see funny glyphs on screen)
>

The need for error detection is that these sort of conversion routines are normally used taking input from something user controlled. Unless your app is one of the rare ones that is totally locked down and only accessible by fully trusted users, you application is apt to need to do some level of validation on the input. This becomes MUCH harder if characters have multiple encodings. It is much easier to start with good security practices then to later add them.

There are a few exceptions that might be worth allowing and clearly documenting in your comments. Things like allowing 0xC0, 0x80 as a NUL character if you are otherwise using NUL termination for strings (but beware that this will be converted to the UCS-4 NUL, so you can't use NUL termination in UCS-4 strings then). or maybe allow the old extended range of UTF-8 (up to 6 byte UTF-8 sequences to UCS-4 values 0x00000000 to 0x7FFFFFFF).


Note, the follow structure is very much 'magic numbers' and should be documented.
> static const unsigned int offsetsFromUTF8[6] =
> {
> 0x00000000UL, 0x00003080UL, 0x000E2080UL,
> 0x03C82080UL, 0xFA082080UL, 0x82082080UL
> };
>
> static const unsigned char trailingBytesForUTF8[256] = {
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
These next to lines shouldn't be 0, but something to flag an error.
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
> 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
The 4 and 5 are inconsistent with the code below.
> 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
> };
>
Note this routine converts a single character, but can't be used to convert a string.
> int bbx_utf8_getch(const char *utf8)
> {
int may not be big enough
> int ch;
> int nb;
>
> nb = trailingBytesForUTF8[(unsigned char)*utf8];
> ch = 0;
> switch (nb)
> {
You don't process values 4 or 5, thus inconsistent (and wrong magic number). Should also have something to handle the 0x80-0xBF first byte error code
> /* these fall through deliberately */
> case 3: ch += (unsigned char)*utf8++; ch <<= 6;
> case 2: ch += (unsigned char)*utf8++; ch <<= 6;
> case 1: ch += (unsigned char)*utf8++; ch <<= 6;
> case 0: ch += (unsigned char)*utf8++;
> }
> ch -= offsetsFromUTF8[nb];
>
> return ch;
> }
>

The big issue with the missing cases is that if you later convert this code to be inside a loop to convert strings, the error states will put you in an infinite loop as you never increment utf8.

Malcolm McLean

unread,
Jul 20, 2015, 12:04:36 AM7/20/15
to
The loop is written with

int bbx_utf8_skip(const char *utf8)
{
return trailingBytesForUTF8[(unsigned char) *utf8] + 1;
}

so you're right - it can break, if the last character in the string is
malformed. That's going to be a problem with any utf8 converter that is
character-wise - if you return an error than you force the user to
handle it, if you suppress the error then what are you meant to do
for the skip?

The skip function needs to be rewritten to check for NUL bytes within
the skip. It's important that caller can go

while ((ch = bbx_utf8_getch(str)))
{
/* process ch */
str += bbx_utf8_skip(str);
}

but it's not usually going to matter if ch is garbage on bad input,
because caller is going to have to consider that ch isn't a character
he can handle, in any case.
However now I'm aware of 0xFFFD, it's obviously sensible to return
that.

The main sentinel routine is this function here.
Again, I'm not handling the high codes -in fact the bbx_utf8_getch
function will return negative and they'll be thrown out.

int bbx_isutf8z(const char *str)
{
int len = 0;
int pos = 0;
int nb;
int i;
int ch;

while(str[len])
len++;
while(pos < len && *str)
{
nb = bbx_utf8_skip(str);
if(nb < 1 || nb > 4)
return 0;
if(pos + nb > len)
return 0;
for(i=1;i<nb;i++)
if( (str[i] & 0xC0) != 0x80 )
return 0;
ch = bbx_utf8_getch(str);
if(ch < 0x80)
{
if(nb != 1)
return 0;
}
else if(ch < 0x8000)
{
if(nb != 2)
return 0;
}
else if(ch < 0x10000)
{
if(nb != 3)
return 0;
}
else if(ch < 0x110000)
{
if(nb != 4)
return 0;
}
pos += nb;
str += nb;
}

return 1;
}


luser droog

unread,
Jul 20, 2015, 12:51:41 AM7/20/15
to
On Saturday, July 18, 2015 at 9:44:30 PM UTC-5, Richard Damon wrote:
> On 7/18/15 9:44 PM, luser droog wrote:
> > On Saturday, July 18, 2015 at 3:06:34 AM UTC-5, luser droog wrote:
> >> I tried to write this straight from the table in the rfc
> >> (illustrating "my style"). What do y'all thinks?
> >>
> >
> > Here's a revised version with comment. Still working on the bulk
> > of Richard Damon's suggestions.
> >
> >
<snip>
> It has no more 'magic numbers' than the original, they may be a bit more
> spread out but not by that much. Most of them actually aren't all that
> 'magic' but fall out of simple math based on 6 bits of character info
> per extra byte added.
>
> Even without the comments, I see this code as much more readable.
>
> Note also, the redundancy becomes a bit clearer. Note that the routines
> can be made a lot smaller by using some of the patterns in the structure
> of the standard. For instance, converting a utf-8 to a ucs-4 can be done
> simply as follows:
>

Please, sir, may I have another?
I've tried very much to embrace your critique as I hope this total
revision illustrates.

This function uses the "number of leading zeros" function (nlz) to detect
and parse the unary prefix. With additional assistance from Hacker's Delight.

enum errinfo {
invalid_encoding = 1,
invalid_extended_encoding = 2,
buffer_alloc_fail = 4,
bad_following_character = 8,
over_length_encoding = 16,
};

static int nlz(uint_least32_t x){ return 7 - (x? floor(log2((unsigned)x)): -1); }

int_least32_t *ucs4_utf8(char *str, int n, enum errinfo *errinfo){
unsigned char *p=str;
uint_least32_t x;
int i,pre,j;
if (errinfo) *errinfo=0;
int_least32_t *u,*buf=u=malloc(n*sizeof*u);
if (!buf) if (errinfo) *errinfo != buffer_alloc_fail;

if (buf)
for (i=0; i<n && *p; i++){
switch(pre=nlz(0xFF^(x=*p++))){
case 0: *u++=x;
break;
case 1: if (errinfo) *errinfo |= invalid_encoding;
break;
case 2:
case 3:
case 4:
x&=(1<<(8-pre))-1;
for (j=1; j<pre; j++){
if (nlz(0xFF^*p)!=1) if (errinfo) *errinfo |= bad_following_character;
x<<=6;
x|=*p++ & 0x3F;
}
break;
default: if (errinfo) *errinfo |= invalid_extended_encoding;
break;
}
if ((pre==2 && x<0x80)
|| (pre==3 && x<0x800)
|| (pre==4 && x<0x10000)) if (errinfo) *errinfo |= over_length_encoding;
}

return buf;
}

(much) better?

--
big words and flattery

Tim Rentsch

unread,
Jul 20, 2015, 2:15:40 PM7/20/15
to
luser droog <luser...@gmail.com> writes:

> I tried to write this straight from the table in the rfc
> (illustrating "my style"). What do y'all thinks?
>
> [snip]

Basically I agree with all of Richard Damon's comments.

(I see you have posted a followup intendind to act on
his suggestions, I may have further comments after looking
at that.)

Richard Damon

unread,
Jul 21, 2015, 10:44:15 PM7/21/15
to
On 7/20/15 12:51 AM, luser droog wrote:

> Please, sir, may I have another?
> I've tried very much to embrace your critique as I hope this total
> revision illustrates.

Your use of the nlz function seems to be an unnecessary complexity,
especially when used to check that the following bytes are of the form
10xxxxxx. I think you still can pass over the terminating NUL in the
string if it falls in the position of a following byte.

I also can't see where the result of a multi-byte code gets stored into
the output! This is one of the problems with writing tricky code, it
becomes easier to miss the basics.

While I don't quite agree with Richard Heathfield that readability
totally comes before correctness, starting from a readable and simply
expressed design is much easier to get to be correct. (and it is much
easier to take a readable program with a few problems and fix it then to
take an unreadable program and make it readable so it is maintainable,
or to fix the little bugs that pop up later).

You should only give up clarity and simplicity if you absolutely need
to. Only after getting it correct, and then finding that something isn't
fast enough (and profiling shows that THIS is were the slow down is) do
you even start to think about using tricks to speed stuff up (and you
still WANT to keep things readable, and well documented).

It would be a strange machine where a log and a switch is faster than a
short sequence of ifs with mask and compare.

Richard Heathfield

unread,
Jul 22, 2015, 2:58:34 AM7/22/15
to
On 22/07/15 03:44, Richard Damon wrote:

<snip>

> Your use of the nlz function seems to be an unnecessary complexity,
> especially when used to check that the following bytes are of the form
> 10xxxxxx. I think you still can pass over the terminating NUL in the
> string if it falls in the position of a following byte.
>
> I also can't see where the result of a multi-byte code gets stored into
> the output! This is one of the problems with writing tricky code, it
> becomes easier to miss the basics.
>
> While I don't quite agree with Richard Heathfield that readability
> totally comes before correctness, starting from a readable and simply
> expressed design is much easier to get to be correct.

So you not only agree with me that readability comes before correctness,
but you *also* agree with me that all generalisations (including this
one) are false. Good. ;-)

--
Richard Heathfield
Email: rjh at cpax dot org dot uk
"Usenet is a strange place" - dmr 29 July 1999
Sig line 4 vacant - apply within

Tim Rentsch

unread,
Jul 22, 2015, 10:06:18 AM7/22/15
to
Richard Damon <Ric...@Damon-Family.org> writes:

> [snip]

I just wanted to say I appreciate your commentary, for me it
filled in some gaps about what the specification requires.

luser droog

unread,
Jul 22, 2015, 2:25:28 PM7/22/15
to
On Tuesday, July 21, 2015 at 9:44:15 PM UTC-5, Richard Damon wrote:
> On 7/20/15 12:51 AM, luser droog wrote:
>
> > Please, sir, may I have another?
> > I've tried very much to embrace your critique as I hope this total
> > revision illustrates.
>
> Your use of the nlz function seems to be an unnecessary complexity,
> especially when used to check that the following bytes are of the form
> 10xxxxxx. I think you still can pass over the terminating NUL in the
> string if it falls in the position of a following byte.
>

Yes. I'm still not decided on what my application's policy should
be for this, and vetting the input more generally.

> I also can't see where the result of a multi-byte code gets stored into
> the output! This is one of the problems with writing tricky code, it
> becomes easier to miss the basics.
>

Good catch. I've factored that out of the switch now, so it can't be missed.


> While I don't quite agree with Richard Heathfield that readability
> totally comes before correctness, starting from a readable and simply
> expressed design is much easier to get to be correct. (and it is much
> easier to take a readable program with a few problems and fix it then to
> take an unreadable program and make it readable so it is maintainable,
> or to fix the little bugs that pop up later).
>
> You should only give up clarity and simplicity if you absolutely need
> to. Only after getting it correct, and then finding that something isn't
> fast enough (and profiling shows that THIS is were the slow down is) do
> you even start to think about using tricks to speed stuff up (and you
> still WANT to keep things readable, and well documented).
>
> It would be a strange machine where a log and a switch is faster than a
> short sequence of ifs with mask and compare.

I feel the last two paragraphs are incongruous wrt the nlz function.
It reduces the bitfield parsing to a simple powerful tool, abstracting it.
It can also be implemented for speed if needed. Here I went for size.

But it is a bit of a semantic mismatch to use it for leading ones.
So, I made it a macro--I know, I know, but wait! hold your judgement
'til you look at it. I think it helps simplify the way the whole algorithm
reads. And two other macros for generating masks with ones on the left or
right.

And I borrowed the simpler function naming from the contemporaneous thread.
And it uses the replacement character for all miscoded points.

$ cat io.c
//cf. http://www.ietf.org/rfc/rfc3629.txt p.3
#include<stdlib.h>
#include <stdio.h>
//#include <sys/bitops.h> // ilog2
#include <math.h> // log2

/*
<-------- adapters ("apps-"hungarian naming)
utf8 utf8(ucs4...)
ucs4 ucs4(utf8...)
*/

enum errinfo {
invalid_encoding = 1,
invalid_extended_encoding = 2,
buffer_alloc_fail = 4,
bad_following_character = 8,
over_length_encoding = 16,
code_point_out_of_range = 32,
};
int_least32_t *ucs4(char *str, int n, enum errinfo *errinfo);
char *utf8(int_least32_t *ar, int n, int *an, enum errinfo *errinfo);

/* number of leading zeros of byte-sized value */
static int nlz(uint_least32_t x){ return 7 - (x? floor(log2(x)): -1); }

/* number of leading ones of byte-sized value */
#define nlo(x) nlz(0xFF^(x))

/* generate unsigned mask of x lsb ones */
#define lsb(x) ((1U<<(x))-1)

/* generate byte mask of x msb ones */
#define msb(x) (0xFF^lsb(8-(x)))

int_least32_t *ucs4(char *str, int n, enum errinfo *errinfo){
unsigned char *p=str;
int_least32_t *u,*buf;
uint_least32_t x;
int pre;
int i,j;

if (errinfo) *errinfo=0;
buf=u=malloc(n*sizeof*u);
if (!buf)
if (errinfo) *errinfo |= buffer_alloc_fail;

if (buf)
for (i=0; i<n && *p; i++){
switch(pre=nlo(x=*p++)){
case 0: break;
case 1: if (errinfo) *errinfo |= invalid_encoding;
x=0xFFFD;
break;
case 2:
case 3:
case 4: x&=lsb(8-pre);
for (j=1; j<pre; j++){
if (nlo(*p)!=1)
if (errinfo) *errinfo |= bad_following_character;
x=(x<<6) | (*p++&lsb(6));
}
break;
default: if (errinfo) *errinfo |= invalid_extended_encoding;
x=0xFFFD;
break;
}
if (x < ((int[]){0,0,0x80,0x800,0x10000})[pre])
if (errinfo) *errinfo |= over_length_encoding;
*u++=x;
}

return buf;
}

char *utf8(int_least32_t *ar, int n, int *an, enum errinfo *errinfo){
int i;
int_least32_t x;
char *p,*buf=p=malloc((n+1)*4);
if (!buf) if (errinfo) *errinfo |= buffer_alloc_fail;

if (buf)
for (i=0; i<n; i++){
x=ar[i];
if (x <= lsb(7))
*p++=x;
else if (x <= lsb(11))
*p++=msb(2)| (x>>6),
*p++=msb(1)| (x & lsb(6));
else if (x <= lsb(16))
*p++=msb(3)| (x>>12),
*p++=msb(1)| ((x>>6) & lsb(6)),
*p++=msb(1)| (x & lsb(6));
else if (x <= 0x10FFFF)
*p++=msb(4)| (x>>18),
*p++=msb(1)| ((x>>12) & lsb(6)),
*p++=msb(1)| ((x>>6) & lsb(6),
*p++=msb(1)| (x & lsb(6));
else
if (errinfo) *errinfo |= code_point_out_of_range;
}

return buf;
}

luser droog

unread,
Aug 2, 2015, 4:16:44 PM8/2/15
to
On Saturday, July 18, 2015 at 3:06:34 AM UTC-5, luser droog wrote:
> I tried to write this straight from the table in the rfc
> (illustrating "my style"). What do y'all thinks?
>
> $ cat io.c
> //cf. http://www.ietf.org/rfc/rfc3629.txt p.3
<snip>

Note a revised version with a minimal unit-testing
framework (http://www.jera.com/techinfo/jtns/jtn002.html)
has been posted to
https://codereview.stackexchange.com/questions/98838/utf-8-encoding-decoding

--
bled this 0x31 dry
Reply all
Reply to author
Forward
0 new messages