mbtowc question

Neil Booth

Nov 17, 2007, 7:34:11 AM
Consider the following code and the question in the #if 0-ed out
block. Is the #if 0-ed out line necessary for well-defined
behaviour?

Neil.

#include <assert.h>
#include <locale.h>
#include <stdlib.h>

/* Valid 2-byte shift-JIS character, not valid UTF-8 sequence. */
const char sjis[] = "\x95\x5c";
/* Valid UTF-8, of course. */
const char space[] = " ";

int main (void)
{
  wchar_t wc;

  /* Assume this locale exists and represents the obvious thing. */
  setlocale (LC_CTYPE, "ja_JP.UTF-8");

  /* Assert it is not state-dependent. */
  assert (mbtowc (&wc, 0, 1) == 0);

  /* Assert my charset beliefs. */
  assert (mbtowc (&wc, space, sizeof space) == 1);
  assert (mbtowc (&wc, sjis, sizeof sjis) == -1);

#if 0
  /* Is this call to mbtowc necessary in general, given the prior
     failure above, in order for the last call to mbtowc to behave
     reliably? The standard does not appear to address this directly,
     so I imagine it is required. */
  mbtowc (&wc, 0, 1);
#endif

  /* With the above line commented out, does the standard require
     a conforming implementation to not trigger the assertion?
     Having got this far, precisely the same assertion passed above. */
  assert (mbtowc (&wc, space, sizeof space) == 1);

  return 0;
}

Antoine Leca

Nov 19, 2007, 8:57:47 AM
In news:473edfbf$0$278$44c9...@news2.asahi-net.or.jp, Neil Booth wrote:

>
> /* Assume this locale exists and represents the obvious thing. */
> setlocale (LC_CTYPE, "ja_JP.UTF-8");
>
> /* Assert it is not state-dependent. */
> assert (mbtowc (&wc, 0, 1) == 0);

Why?
Is it a restriction you are imposing on implementations for your case in
order to simplify the problem, or are you asserting that the locale you
selected should not allow state-dependent encodings?


> /* Valid 2-byte shift-JIS character, not valid UTF-8 sequence. */
> const char sjis[] = "\x95\x5c";
>

> assert (mbtowc (&wc, sjis, sizeof sjis) == -1);
>
> #if 0
> /* Is this call to mbtowc necessary in general, given the prior
> failure above, in order for the last call to mbtowc to behave
> reliably? The standard does not appear to address this
> directly, so I imagine it is required. */
> mbtowc (&wc, 0, 1);
> #endif

First point: in any case, including this call will not hurt. It is
certainly special-cased, so its performance cost is probably low; furthermore,
it could be called _only_ from an error handler for the failing case above,
which is supposed to be exceptional.
This is especially important since you are dealing with an area of the
standard that is not well studied, so the probability of encountering a buggy
implementation is greater than for more mainstream functions.
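
A minimal sketch of the "reset only from the error path" idea, for
illustration only. The wrapper name mbtowc_reset_on_error is hypothetical
and does not come from the thread; it relies only on the 7.20.7 rule that a
call with a null s places mbtowc in its initial state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical wrapper (not from the thread): convert one multibyte
   character and, only when the conversion fails, put mbtowc back into
   its initial conversion state. */
int mbtowc_reset_on_error(wchar_t *pwc, const char *s, size_t n)
{
    int len;

    /* A null s is the reset/query form of mbtowc; this wrapper only
       handles real conversions. */
    assert(s != NULL);

    len = mbtowc(pwc, s, n);
    if (len == -1) {
        /* Null s: return to the initial state. The return value only
           reports whether the encoding is state-dependent, so it is
           ignored here. */
        (void)mbtowc(NULL, NULL, 0);
    }
    return len;
}
```

With a wrapper like this the extra call costs nothing on the success path,
and whether it is strictly required stops mattering to the caller.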


I cannot find a specification of the behaviour for your exact case, however
when I read the related 7.24.6p3:

If an /mbstate_t/ object has been altered by any of the functions
described in this subclause, and is then used with a different
multibyte character sequence, [...], the behavior is undefined*.

Considering that mbtowc() could perfectly well be implemented on top of
mbrtowc() with a private mbstate_t, this would make your case undefined
behaviour.
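
To make that concern concrete, here is a sketch of how such an
implementation might look. It is an assumption for illustration, not the
code of any real library; the names sketch_mbtowc and private_state are
hypothetical.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical private conversion state shared by all calls. */
static mbstate_t private_state;

/* Sketch of an mbtowc-like function layered on mbrtowc. */
int sketch_mbtowc(wchar_t *pwc, const char *s, size_t n)
{
    size_t r;

    if (s == NULL) {
        /* Reset form: a zero-filled mbstate_t is an initial state. */
        memset(&private_state, 0, sizeof private_state);
        return 0;   /* this sketch claims "not state-dependent" */
    }

    r = mbrtowc(pwc, s, n, &private_state);
    if (r == (size_t)-1 || r == (size_t)-2) {
        /* After (size_t)-1 the conversion state is unspecified; after
           (size_t)-2 it holds a partial character. Nothing resets
           private_state here, so the next call inherits it. */
        return -1;
    }
    return (int)r;
}
```

In this sketch a failed call leaves private_state untouched by any cleanup,
so a later call on an unrelated string starts from whatever mbrtowc left
behind, unless the caller issues the reset call first.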

Now, even if your interpretation (which makes it well defined without the
call) can be supported, the probability is quite high of encountering an
implementer who took the shortcut and implemented mbtowc() with
mbrtowc(), producing bad behaviour, and who thinks she is covered by the
above clause from 7.24.6.


Antoine

Neil Booth

Nov 21, 2007, 8:46:54 AM
to Antoine Leca
Thanks for your response.

Antoine Leca wrote:
>> /* Assume this locale exists and represents the obvious thing. */
>> setlocale (LC_CTYPE, "ja_JP.UTF-8");
>>
>> /* Assert it is not state-dependent. */
>> assert (mbtowc (&wc, 0, 1) == 0);
>
> Why?
> Is it a restriction you are imposing on implementations for your case in
> order to simplify the problem, or are you asserting that the locale you
> selected should not allow state-dependent encodings?

UTF-8 is not a state-dependent encoding. The assertion also shows that
there is no "internal state" to move out of the initial shift state.
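
For reference, the query that the assertion relies on can be isolated as
below. This is only a sketch; the helper name
mb_encoding_is_state_dependent is hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: report whether the current LC_CTYPE encoding is
   state-dependent, as mbtowc itself reports it. Per 7.20.7, a call with
   a null s also places mbtowc in its initial state as a side effect. */
int mb_encoding_is_state_dependent(void)
{
    return mbtowc(NULL, NULL, 0) != 0;
}
```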

>> /* Valid 2-byte shift-JIS character, not valid UTF-8 sequence. */
>> const char sjis[] = "\x95\x5c";
>>
>> assert (mbtowc (&wc, sjis, sizeof sjis) == -1);
>>
>> #if 0
>> /* Is this call to mbtowc necessary in general, given the prior
>> failure above, in order for the last call to mbtowc to behave
>> reliably? The standard does not appear to address this
>> directly, so I imagine it is required. */
>> mbtowc (&wc, 0, 1);
>> #endif
>
> First point: in any case, including this call will not hurt. It is
> certainly special-cased, so its performance cost is probably low; furthermore,
> it could be called _only_ from an error handler for the failing case above,
> which is supposed to be exceptional.

Agreed it can't hurt. I'd rather not do unnecessary things though.

> This is especially important since you are dealing with an area of the
> standard that is not well studied, so the probability of encountering a buggy
> implementation is greater than for more mainstream functions.
>
> I cannot find a specification of the behaviour for your exact case, however
> when I read the related 7.24.6p3:
>
> If an /mbstate_t/ object has been altered by any of the functions
> described in this subclause, and is then used with a different
> multibyte character sequence, [...], the behavior is undefined*.
>
> Considering that mbtowc() could perfectly be implemented with mbrtowc() with
> a private mbstate_t, this would make your case undefined behaviour.
>
> Now, even if your interpretation (which makes it well defined without the
> call) can be supported, the probability is quite high of encountering an
> implementer who took the shortcut and implemented mbtowc() with
> mbrtowc(), producing bad behaviour, and who thinks she is covered by the
> above clause from 7.24.6.

As it's not discussed, I suspect it boils down to 7.20.7: whether you
believe this call will "cause the internal conversion state of the
function to be altered as necessary." As the encoding is already
asserted not to be state-dependent, I find this a bit of a stretch.

You are right, there is an implementation with this behaviour, but the
author is considering changing it to be like all the other libraries
he tested (Linux, Windows, Solaris, ...). This is after fixing it so
that it was possible to get it out of a confused state at all with the
#if 0-ed out line; before then the library was permanently hosed for the
lifetime of the process. :(

Neil.

Antoine Leca

Nov 21, 2007, 10:25:31 AM
In news:474436CE.4070003@null, Neil Booth wrote:

> Antoine Leca wrote:
>>> /* Assert it is not state-dependent. */
>>> assert (mbtowc (&wc, 0, 1) == 0);
>>
>> Why?
>> Is it a restriction you are imposing on implementations for your
>> case in order to simplify the problem, or are you asserting that the
>> locale you selected should not allow state-dependent encodings?
>
> UTF-8 is not a state-dependent encoding.

I believe that the fact that something is not necessary in a given case does
not prevent an implementation from answering "something may be here" to the
feature-test functions, particularly when there is an opportunity for
optimisation.
The definition in 5.2.1.2, using /may/, opens this door.

When I have read the first half of a UTF-8 multibyte character, I consider
that I have some knowledge about the current character, which, following
7.24.6, I'll call /[conversion] state/.
I do not see where my view breaks 5.2.1.2.
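
A small sketch of what "having read the first half" looks like through the
restartable interface. The helper name count_partial_steps is hypothetical;
it feeds the bytes of one character to mbrtowc one at a time and counts how
often mbrtowc reports an incomplete character.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: feed up to len bytes of s to mbrtowc one byte per
   call. Returns the number of calls that returned (size_t)-2 before a
   character was completed, or -1 on an encoding error or if the bytes
   ran out first. */
int count_partial_steps(const char *s, size_t len, wchar_t *out)
{
    mbstate_t st;
    int partial = 0;
    size_t i;

    memset(&st, 0, sizeof st);      /* start in the initial state */
    for (i = 0; i < len; i++) {
        size_t r = mbrtowc(out, s + i, 1, &st);
        if (r == (size_t)-1)
            return -1;              /* encoding error */
        if (r == (size_t)-2) {
            /* Incomplete character: st now remembers the bytes seen so
               far, which is the knowledge described above. */
            partial++;
            continue;
        }
        return partial;             /* character completed */
    }
    return -1;                      /* incomplete at end of input */
}
```

Under a UTF-8 locale, a three-byte character would be expected to produce
two such partial steps, even though UTF-8 has no shift sequences.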


> The assertion also shows that there is no "internal state" to move
> out of the initial shift state.

In the context of the non-restartable functions of 7.20, I agree there is no
such necessity. However, I was being more general, and also considered the
internal implementation of the restartable ones.


> As it's not discussed, I suspect it boils down to 7.20.7: whether you
> believe this call will "cause the internal conversion state of the
> function to be altered as necessary." As the encoding is already
> asserted not to be state-dependent, I find this a bit of a stretch.
>
> You are right, there is an implementation with this behaviour, but the
> author is considering changing it to be like all the other libraries
> he tested (Linux, Windows, Solaris, ...). [...]

Now we are completely in "quality-of-implementation" territory, and I of
course agree with your conclusions.

This is why I would recommend always negating the first assert when faced
with self-synchronising encodings, even if this "stretches" a strict
application of 5.2.1.2, just to stay on the safe side.


Antoine
