UTF-8 conversion

Philip Lantz

Jul 19, 2015, 2:16:47 AM

The articles here on UTF-8 conversion inspired me to write one of my
own, and I wonder what you all think.

/*
* ucs4 converts from a UTF-8 string to a UCS-4 string.
* The first two parameters are the output buffer and its size.
* The third parameter is the input string, which must be null
* terminated.
* The return value is the number of characters in the input string,
* not counting the null terminator, even if that value exceeds the
* size of the output buffer.
* If an illegal UTF-8 byte sequence is encountered, the return value
* is -1, and the output string contains characters up to the point of
* the error.
* The output string is always null terminated, even if the buffer is
* too small or an error is encountered.
* If the buffer size is 0, the buffer pointer may be NULL.
*/
int ucs4(uint32_t *buf, int sz, const uint8_t *p)
{
    int i = 0;

    while (*p) {
        uint32_t c = *p++;

        if (c & 0x80) {
            int count;
            if ((c & 0xe0) == 0xc0)
                count = 2;
            else if ((c & 0xf0) == 0xe0)
                count = 3;
            else if ((c & 0xf8) == 0xf0)
                count = 4;
            else
                goto error;

            c &= 0x7f >> count;
            while (--count > 0) {
                if ((*p & 0xc0) != 0x80)
                    goto error;
                c = c << 6 | *p++ & 0x3f;
            }
        }

        if (i < sz)
            buf[i] = c;

        i++;
    }

    if (i < sz)
        buf[i] = 0;
    else
        buf[sz-1] = 0;

    return i;

error:
    if (i < sz)
        buf[i] = 0;
    else
        buf[sz-1] = 0;
    return -1;
}

/*
* utf8 converts from a UCS-4 string to a UTF-8 string.
* The first two parameters are the output buffer and its size.
* The third parameter is the input string, which must be null
* terminated.
* The return value is the number of bytes in the UTF-8 encoding of the
* input string, not counting the null terminator, even if the encoding
* exceeds the size of the output buffer.
* If a value in the input string is outside the legal range of Unicode
* code points, the output string contains characters up to the point of
* the error, and the return value is -1.
* UTF-16 surrogate code points are encoded, even though strictly they
* should be treated as errors.
* The output string is always null terminated, even if the buffer is
* too small or an error is encountered.
* If the buffer size is 0, the buffer pointer may be NULL.
*/
int utf8(uint8_t *buf, int sz, const uint32_t *s)
{
    int i = 0;
    int terminator = 0;

    while (*p) {
        uint32_t c = *p++;

        if (c > 0x10ffff) {
            if (i < sz)
                buf[i] = 0;
            return -1;
        }

        int count;
        if (c <= 0x7f)
            count = 1;
        else if (c <= 0x7ff)
            count = 2;
        else if (c <= 0xffff)
            count = 3;
        else
            count = 4;

        if (i + count < sz) {
            if (count == 1)
                buf[i] = c;
            else {
                buf[i] = 0xf00 >> count
                       | c >> (count - 1) * 6 & 0x7f >> count;
                for (j = 1; j < count; j++)
                    buf[i+j] = 0x80 | c >> (count - j - 1) * 6 & 0x3f;
            }
        }
        else if (!terminator) {
            buf[i] = 0;
            terminator = 1;
        }

        i += count;
    }

    if (i < sz)
        buf[i] = 0;

    return i;
}

Richard Damon

Jul 19, 2015, 4:33:26 PM

On 7/19/15 2:16 AM, Philip Lantz wrote:
> The articles here on UTF-8 conversion inspired me to write one of my
> own, and I wonder what you all think.
>

One thing to note: rather than just aborting the conversion on an error, you can use the character Unicode defines for exactly this purpose, the Replacement Character (U+FFFD). This lets you detect the bad sequence, replace it with U+FFFD, and continue processing the rest of the string. You might want the return value to indicate that errors were detected, perhaps returning -n if there was an error and +n if there was none.
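A minimal sketch of that suggestion, under stated assumptions: the function name ucs4_lossy, the resynchronization rule (a bad lead byte is skipped on its own; a truncated sequence stops at the first byte that is not a continuation byte) and the negated-count return are illustrative choices, not something specified in the thread. It keeps the buffer conventions of the posted ucs4(), except that nothing is written when the size is 0.

```c
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu

/*
 * Hypothetical variant of ucs4(): malformed sequences become U+FFFD
 * and decoding continues. Returns the number of characters produced,
 * negated if any replacement was made. Overlong forms and surrogates
 * are not rejected here, to keep the sketch short.
 */
int ucs4_lossy(uint32_t *buf, int sz, const uint8_t *p)
{
    int i = 0;
    int bad = 0;

    while (*p) {
        uint32_t c = *p++;

        if (c & 0x80) {
            int count;
            if ((c & 0xe0) == 0xc0)
                count = 2;
            else if ((c & 0xf0) == 0xe0)
                count = 3;
            else if ((c & 0xf8) == 0xf0)
                count = 4;
            else
                count = 0;              /* not a valid lead byte */

            if (count == 0) {
                c = REPLACEMENT_CHAR;
                bad = 1;
            } else {
                c &= 0x7fu >> count;
                while (--count > 0) {
                    if ((*p & 0xc0) != 0x80) {
                        /* truncated sequence: leave *p for the next pass */
                        c = REPLACEMENT_CHAR;
                        bad = 1;
                        break;
                    }
                    c = c << 6 | (*p++ & 0x3fu);
                }
            }
        }

        if (i < sz - 1)                 /* keep one slot for the terminator */
            buf[i] = c;
        i++;
    }

    if (sz > 0)
        buf[i < sz ? i : sz - 1] = 0;

    return bad ? -i : i;
}
```

With this convention a caller can still size a buffer from the magnitude of the result, and the sign alone says whether any substitution happened.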

Note also that your error checking is a bit incomplete: you aren't blocking overlong encodings. (There are other values that strictly shouldn't get through either.)
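As a rough illustration of the missing checks, here is a sketch of a helper that could be applied to each decoded value. The name utf8_decoded_ok and its interface are hypothetical; the surrogate and range checks are included on the assumption that they are among the "other values" meant above.

```c
#include <stdint.h>

/*
 * Hypothetical helper: returns 1 if code point c, decoded from a
 * sequence of `count` bytes (1..4), is acceptable in well-formed
 * UTF-8, and 0 otherwise.
 */
int utf8_decoded_ok(uint32_t c, int count)
{
    /* Smallest code point that genuinely needs `count` bytes. */
    static const uint32_t min_for_count[5] = { 0, 0, 0x80, 0x800, 0x10000 };

    if (count < 1 || count > 4)
        return 0;
    if (c < min_for_count[count])
        return 0;               /* overlong encoding */
    if (c >= 0xd800 && c <= 0xdfff)
        return 0;               /* UTF-16 surrogate */
    if (c > 0x10ffff)
        return 0;               /* beyond the Unicode range */
    return 1;
}
```

Inside ucs4(), a check like this would have to run after the continuation-byte loop, using a copy of count saved before the loop decrements it.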

Tim Rentsch

Jul 22, 2015, 1:57:08 PM

Let's see, some comments. (As usual in such cases I try to steer
clear of matters that are minor or purely lexical style choices.)

I tried compiling it; there were a couple of undeclared
variables.

Your choice of semantics is reasonable. It might be nice to get
more information on an error, but then again doing that might be
wasted effort. The choice made gives notice that an error has
occurred, and also protects against downstream problems, so it
seems like a good balance. The function level comments do a good
job of describing the semantics.

The comment before ucs4() says "If the buffer size is 0, the
buffer pointer may be NULL." However, this seems at odds with
the code after (and looking again now, also just before) 'error:'
in that function:

    error:
        if (i < sz)
            buf[i] = 0;
        else
            buf[sz-1] = 0;
        return -1;

To me this looks like if 'sz' is less than 1, or 'buf' is NULL,
then the 'else' branch acts inappropriately. I think that is
also true of the normal return path that appears just previously.
Also, in utf8(), I'm not sure if the logic around ensuring a 0
terminating byte gets written is correct; I think it succumbs
to the same problem but I haven't checked carefully to be sure.
In any event how that works (or is meant to work) is a bit subtle
and probably merits an explanatory comment.
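One way the terminator handling in utf8() might be made explicit is sketched below. It assumes the undeclared 'p' and 'j' in the posted code were meant to be the parameter 's' and a local loop index, and it assumes the return value is meant to be the byte length of the full encoding. The name utf8_checked and the 'end' variable are illustrative, not from the original post.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Sketch of utf8() with the terminator logic made explicit.
 * `end` is the index where the terminating 0 belongs: just past the
 * last character that fit completely. Once one character fails to
 * fit, i >= sz from then on, so no later character can fit either.
 * Nothing is written when sz <= 0, so buf may be NULL in that case.
 */
int utf8_checked(uint8_t *buf, int sz, const uint32_t *s)
{
    int i = 0;          /* bytes the full encoding needs so far */
    int end = 0;        /* bytes actually stored in buf */

    for (; *s; s++) {
        uint32_t c = *s;
        int count, j;

        if (c > 0x10ffff)
            break;              /* illegal value: stop, report -1 below */

        if (c <= 0x7f)
            count = 1;
        else if (c <= 0x7ff)
            count = 2;
        else if (c <= 0xffff)
            count = 3;
        else
            count = 4;

        if (i + count < sz) {   /* character and terminator both fit */
            if (count == 1) {
                buf[i] = (uint8_t)c;
            } else {
                buf[i] = (uint8_t)((0xf00 >> count)
                         | ((c >> ((count - 1) * 6)) & (0x7f >> count)));
                for (j = 1; j < count; j++)
                    buf[i + j] = (uint8_t)(0x80
                                 | ((c >> ((count - j - 1) * 6)) & 0x3f));
            }
            end = i + count;
        }
        i += count;
    }

    if (sz > 0)
        buf[end] = 0;           /* end < sz always holds here */

    return *s ? -1 : i;
}
```

The same idea applies to ucs4(): guarding the final store with a test for sz > 0 removes the write to buf[sz-1] when the size is 0.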

The high level structure of the function definitions is somewhat
dissatisfying. The code is locally simple but it seems like at
a higher level it's more complicated than it needs to be. Partly
this is just a personal dislike of what seems like more control
flow than necessary. To put this another way, it seems like it
would be good to try to simplify the overall organization,
perhaps in some cases computing the values needed by means of
expressions rather than control flow. As it is now the balance
seems weighted too heavily in the direction of local concerns
as opposed to broader considerations. At least that is my impression,
fwiw.
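No code accompanied this suggestion, so the following is only one possible reading of "expressions rather than control flow": the two if/else chains that pick a sequence length could each collapse into a single expression or table lookup. The function names and the table are hypothetical.

```c
#include <stdint.h>

/* Bytes needed to encode c; assumes c <= 0x10ffff was checked already. */
int utf8_encoded_len(uint32_t c)
{
    return 1 + (c > 0x7f) + (c > 0x7ff) + (c > 0xffff);
}

/*
 * Sequence length implied by a lead byte, or 0 if the byte cannot
 * start a sequence. Indexed by the top five bits of the byte.
 */
int utf8_sequence_len(uint8_t lead)
{
    static const uint8_t len[32] = {
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,  /* 0xxxxxxx */
        0, 0, 0, 0, 0, 0, 0, 0,                          /* 10xxxxxx */
        2, 2, 2, 2,                                      /* 110xxxxx */
        3, 3,                                            /* 1110xxxx */
        4,                                               /* 11110xxx */
        0                                                /* 11111xxx */
    };
    return len[lead >> 3];
}
```

With helpers like these, the decoder's error test becomes a single comparison against 0 instead of a trailing else branch.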