how to make snprintf fail due to encoding error

Manlio Perillo

Oct 25, 2013, 3:21:33 PM

Hi.

The question may seem unusual, but I need to make snprintf fail due to an
encoding error in a portable way, to make sure the test suite of a
library I'm writing has 100% test coverage.

What is a simple and standard conforming method to make snprintf return a
negative value?

Currently I'm doing this (using GCC 4.7.2 on Debian GNU/Linux 7.2):

snprintf(buf, n, "Hello %ls", L"\U10FFFFFF");

and it works, but it is not C99 [1].

[1] by the way, it is strange that GCC in C99 standard mode accepts it

Thanks Manlio

James Kuyper

Oct 25, 2013, 6:19:40 PM

On 10/25/2013 03:21 PM, Manlio Perillo wrote:
> Hi.
>
> The question may seem unusual, but I need to make snprintf fail due to
> encoding error in a portable way, to make sure the test suite of a
> library I'm writing has 100% of test coverage.
>
> What is a simple and standard conforming method to make snprintf return a
> negative value?

...

"The snprintf function returns ... a negative value if an encoding error
occurred." (7.21.6.5p3), and that's the only way it can do so.

"An encoding error occurs if the character sequence presented to the
underlying mbrtowc function does not form a valid (generalized)
multibyte character, or if the code value passed to the underlying
wcrtomb does not correspond to a valid (generalized) multibyte
character. The wide character input/output functions and the byte
input/output functions store the value of the macro EILSEQ in errno if
and only if an encoding error occurs." (7.21.3p14)

The phrase "encoding error" is italicized, which is an ISO convention
for indicating that this paragraph defines the meaning of that phrase.
Therefore, in order to trigger an encoding error, you have to use
mbrtowc() or wcrtomb(), directly or indirectly.

You can't figure out what might cause this to happen just by looking at the
description of snprintf(). The relevant part of that description says
"The snprintf function is equivalent to fprintf, except ..."
(7.21.6.5p2). You have to look at the description of fprintf(), where
you find that only wcrtomb() is relevant in this case, and only when
using the "%lc" or "%ls" format specifiers. I see from your example code
that you were probably already aware of that fact, but I'm not very
familiar with these parts of the standard, so I had to track things down
systematically.

The behavior of "%lc" is defined in terms of "%ls", so I'll just quote
the description for "%ls":
" ... the argument shall be a pointer to the initial element of an array
of wchar_t type. Wide characters from the array are converted to
multibyte characters (each as if by a call to the wcrtomb function, with
the conversion state described by an mbstate_t object initialized to
zero before the first wide character is converted) ..." (7.21.6.1p8).

Therefore, what you need to do is create a wchar_t array containing a
value that does not correspond to a valid generalized multibyte
character. Which values have that property is implementation specific.
The most portable way to identify such a value is to loop over all
possibilities from WCHAR_MIN to WCHAR_MAX, passing each one to wcrtomb()
until it returns a negative value. There need not be any such value.
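
The search described above can be sketched roughly as follows. This is
only an illustration: the name find_unencodable_wchar is made up for
this example, and it relies on wcrtomb() reporting an encoding error as
(size_t)(-1), since its return type is unsigned. The answer depends on
the current locale, and the function may legitimately find nothing.

```c
#include <limits.h>
#include <stdint.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: scans WCHAR_MIN..WCHAR_MAX in the current locale.
   Stores the first value that wcrtomb() rejects in *out and returns 1,
   or returns 0 if every value converts (which the standard permits). */
int find_unencodable_wchar(wchar_t *out)
{
    wchar_t wc = WCHAR_MIN;

    for (;;) {
        char buf[MB_LEN_MAX];
        mbstate_t st;

        /* Start each probe from the initial conversion state. */
        memset(&st, 0, sizeof st);
        if (wcrtomb(buf, wc, &st) == (size_t)-1) {
            *out = wc;
            return 1;
        }
        if (wc == WCHAR_MAX)   /* stop before wc++ could overflow */
            break;
        wc++;
    }
    return 0;
}
```

With a 32-bit wchar_t the worst case is about four billion calls, which
is slow but finite; in practice a rejected value tends to turn up early.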

Szabolcs Nagy

Oct 25, 2013, 6:48:31 PM

Manlio Perillo <manlio_p...@spamlibero.it> wrote:
> encoding error in a portable way, to make sure the test suite of a
> library I'm writing has 100% of test coverage.

that won't work in general, several functions have failures
outside of your control

(in general you need to "mock" libc entirely to be able to
trigger all possible conformant implementation behaviour)

> Currently I'm doing this (using GCC 4.7.2 on Debian GNU/Linux 7.2):
>
> snprintf(buf, n, "Hello %ls", L"\U10FFFFFF");
>
> and it works, but it is not C99 [1].
>
> [1] by the way, it is strange that GCC in C99 standard mode accepts it

"The value of a string literal containing a multibyte character
or escape sequence not represented in the execution character set
is implementation-defined."

Manlio Perillo

Oct 26, 2013, 8:24:13 AM

Il Fri, 25 Oct 2013 19:21:33 +0000, Manlio Perillo ha scritto:

> Hi.
>
> The question may seem unusual, but I need to make snprintf fail due to
> encoding error in a portable way, to make sure the test suite of a
> library I'm writing has 100% of test coverage.
>
> What is a simple and standard conforming method to make snprintf return
> a negative value?
>

I realized that the choice of the question was a bit unfortunate.

The real question is something like this:
"What is one possible wide character value that will make wcrtomb
*always* fail with EILSEQ?".

By "always" I mean that an implementation where wcrtomb does not fail for
that value would not be standard conforming.

It seems the simplest answer is:

wchar_t invalid = (wchar_t) WEOF;

but I'm not 100% sure (it works on GCC).

Thanks Manlio

Manlio Perillo

Oct 26, 2013, 8:28:18 AM

Il Fri, 25 Oct 2013 22:48:31 +0000, Szabolcs Nagy ha scritto:

> Manlio Perillo <manlio_p...@spamlibero.it> wrote:
>> encoding error in a portable way, to make sure the test suite of a
>> library I'm writing has 100% of test coverage.
>
> that won't work in general, several functions have failures outside of
> your control
>

Fortunately this is not the case for my library, since the implementation
is very simple.

It is a unit test library, based on xUnit (I'm actually using Python's
unittest as the API reference), where test results are written to stdout
using the TAP protocol.

It is so simple, that 100% test coverage is not a problem, with both the
library code and test suite being strictly conforming to C99 (I hope).

> [...]

Thanks Manlio

Richard Damon

Oct 26, 2013, 1:45:54 PM

The problem is that C doesn't define what encoding the Wide Character or
Multi-Byte character strings will use, so you can't know in a
portable way what values would be an error.

You seem to be assuming the use of Unicode (UTF-8 for MB, and
UTF-16/UCS-2 or UTF-32/UCS-4 for WC), which is common, but not required.

Manlio Perillo

Oct 26, 2013, 2:24:50 PM

Il Sat, 26 Oct 2013 13:45:54 -0400, Richard Damon ha scritto:

> [...]

>> "What is one possible wide character value that will make wcrtomb
>> *always* fail with EILSEQ?".
>>
>> With always I mean that an implementation where wcrtomb does not fail,
>> is not standard conforming.
>>
>>
>> It seems the simplest answer is:
>>
>> wchar_t invalid = (wchar_t) WEOF;
>>
>> but I'm not 100% sure (it works on GCC).
>>

> [...]

>>
> The problem is that C doesn't define what encoding the Wide Character or
> Multi-Byte character strings will use, so you can't know in a
> portable way what values would be an error.
>

This is the reason why I was not sure if what I want is possible, and
posted the message here instead of comp.lang.c.

> You seem to be assuming the use of Unicode (UTF-8 for MB, and
> UTF-16/UCS-2 or UTF-32/UCS-4 for WC), which is common, but not required.

There is no such assumption in the code above, where WEOF is used.

I think using WEOF does not have the problems of the solution posted by
James Kuyper:

> The most portable way to identify such a value is to loop over all
> possibilities from WCHAR_MIN to WCHAR_MAX, passing each one to wcrtomb()
> until it returns a negative value. There need not be any such value.

On my system WEOF is -1, WCHAR_MIN is -2147483648 and WCHAR_MAX is
2147483647.

Thanks Manlio

James Kuyper

Oct 26, 2013, 7:34:45 PM

On 10/26/2013 08:24 AM, Manlio Perillo wrote:
> Il Fri, 25 Oct 2013 19:21:33 +0000, Manlio Perillo ha scritto:
>
>> Hi.
>>
>> The question may seem unusual, but I need to make snprintf fail due to
>> encoding error in a portable way, to make sure the test suite of a
>> library I'm writing has 100% of test coverage.
>>
>> What is a simple and standard conforming method to make snprintf return
>> a negative value?
>>
>
> I realized that the choice of the question was a bit unfortunate.
>
> The real question is something like this:
> "What is one possible wide character value that will make wcrtomb
> *always* fail with EILSEQ?".

There is no such character.

> With always I mean that an implementation where wcrtomb does not fail, is
> not standard conforming.
>
>
> It seems the simplest answer is:
>
> wchar_t invalid = (wchar_t) WEOF;

WEOF is guaranteed to have a value which does not
correspond to that of any member of the extended character set. The same
is NOT guaranteed for (wchar_t) WEOF.
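
Rather than assuming anything about (wchar_t) WEOF, a test can ask
wcrtomb() at run time whether that value is rejected. A minimal sketch,
with hypothetical helper names wcrtomb_rejects and weof_cast_rejected:

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if wcrtomb() reports an encoding error
   for wc in the current locale, 0 if it converts the value. */
int wcrtomb_rejects(wchar_t wc)
{
    char buf[MB_LEN_MAX];
    mbstate_t st;

    memset(&st, 0, sizeof st);   /* initial conversion state */
    return wcrtomb(buf, wc, &st) == (size_t)-1;
}

/* Probes the cast value instead of trusting it; a result of 0 means
   (wchar_t) WEOF is an ordinary convertible character here. */
int weof_cast_rejected(void)
{
    return wcrtomb_rejects((wchar_t)WEOF);
}
```

A test suite could skip the encoding-error case when
weof_cast_rejected() returns 0 rather than fail spuriously.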
--
James Kuyper

Manlio Perillo

Oct 27, 2013, 4:48:04 AM

Il Sat, 26 Oct 2013 19:34:45 -0400, James Kuyper ha scritto:

> [...]

>> The real question is something like this:
>> "What is one possible wide character value that will make wcrtomb
>> *always* fail with EILSEQ?".
>
> There is no such character.
>

So, this means that a conforming application cannot be written to test if
a wcrtomb implementation is conforming to C99?

>> With always I mean that an implementation where wcrtomb does not fail,
>> is not standard conforming.
>>
>>
>> It seems the simplest answer is:
>>
>> wchar_t invalid = (wchar_t) WEOF;
>
> WEOF is guaranteed to have a value which does not correspond to that of
> any member of the extended character set. The same is NOT guaranteed for
> (wchar_t) WEOF.

Thanks Manlio

James Kuyper

Oct 27, 2013, 8:51:53 AM

On 10/27/2013 04:48 AM, Manlio Perillo wrote:
> Il Sat, 26 Oct 2013 19:34:45 -0400, James Kuyper ha scritto:
>
>> [...]
>>> The real question is something like this:
>>> "What is one possible wide character value that will make wcrtomb
>>> *always* fail with EILSEQ?".
>>
>> There is no such character.
>>
>
> So, this means that a conforming application cannot be written to test if
> a wcrtomb implementation is conforming to C99?

Correct - it is impossible to prove conformance; only non-conformance
can be proven. That's true in general, not just in this particular case.
What you might be able to do in a reasonable amount of time, is check
all possible wchar_t values from WCHAR_MIN to WCHAR_MAX to see what
wcrtomb() returns.

If there are no values for which wcrtomb() returns a negative value,
then any wide string passed to fprintf("%ls", wide_string) that produces
a negative return value when ferror(stdout) and feof(stdout) are both
false, proves the implementation to be non-conforming.

If there are values for which wcrtomb() returns a negative value, then
if there is any wchar_t array containing at least one such value before
the terminating null wide character which, if passed to fprintf("%ls"),
fails to return a negative value, the implementation is non-conforming.
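
The two cases above amount to a consistency check between wcrtomb() and
the "%ls" conversion. A rough sketch, using snprintf() rather than
fprintf() so that no stream state is involved. The name
ls_agrees_with_wcrtomb is hypothetical, and a result of 1 is only the
absence of evidence against conformance for that one string:

```c
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical check: returns 1 when snprintf("%ls") and wcrtomb() agree
   about ws, 0 when they disagree (evidence of non-conformance). */
int ls_agrees_with_wcrtomb(const wchar_t *ws)
{
    char buf[MB_LEN_MAX];
    mbstate_t st;
    int has_bad = 0;
    int r;
    size_t i;

    /* "%ls" converts with one state object, zeroed before the first
       wide character, so do the same here. */
    memset(&st, 0, sizeof st);
    for (i = 0; ws[i] != L'\0'; i++) {
        if (wcrtomb(buf, ws[i], &st) == (size_t)-1) {
            has_bad = 1;
            break;
        }
    }

    /* A null buffer with size 0 only computes the return value. */
    r = snprintf(NULL, 0, "%ls", ws);

    return has_bad ? (r < 0) : (r >= 0);
}
```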

Notice that, in either case, there's an enormous number of possible
strings to be tested. Failing to deal correctly with any of those
strings would render an implementation non-conforming. Only by testing
every single such string under every possible circumstance that could
affect the behavior of fprintf() would you be able to prove conformance.
For instance, it might fail only if you pass a string containing the
Declaration of Independence on 2076-07-04 - so you have to check every
possible string after setting the clock to every possible value. The locale
might be relevant too. The amount of load on the machine might matter.

That's an example of why conformance is unprovable, only non-conformance
can be proven. Failure to prove non-conformance, after having made a
sufficiently large number and sufficiently diverse variety of attempts
to do so, is supporting evidence for the claim that the implementation
is conforming. However, it's not proof of that claim.
--
James Kuyper

Tim Rentsch

Oct 27, 2013, 11:42:56 AM

Manlio Perillo <manlio_p...@SPAMlibero.it> writes:

> What is a simple and standard conforming method to make snprintf return a
> negative value?

Apparently not possible in theory. In practice though
it should be easy to find one by doing

setlocale( LC_ALL, "C" );

and then doing a search using wcrtomb().

> Currently I'm doing this (using GCC 4.7.2 on Debian GNU/Linux 7.2):
>
> snprintf(buf, n, "Hello %ls", L"\U10FFFFFF");
>
> and it works, but it is not C99 [1].
>
> [1] by the way, it is strange that GCC in C99 standard mode accepts it

What makes you think it isn't C99? Did you look at 6.4.4.4 p1
in n1256?
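
As a sketch of the setlocale() suggestion above: whether wcrtomb()
rejects a value depends on the locale in effect, so a probe can select
the locale explicitly and restore the previous one afterwards. The
function name rejected_in_locale is hypothetical.

```c
#include <limits.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: returns 1 if wcrtomb() rejects wc under the named
   locale, 0 if it converts it, -1 if the locale could not be selected. */
int rejected_in_locale(const char *locale_name, wchar_t wc)
{
    char buf[MB_LEN_MAX];
    mbstate_t st;
    char *saved = NULL;
    const char *current = setlocale(LC_ALL, NULL);
    int result;

    /* Copy the current locale name: a later setlocale() call may
       overwrite the string that the query returned. */
    if (current != NULL) {
        saved = malloc(strlen(current) + 1);
        if (saved != NULL)
            strcpy(saved, current);
    }

    if (setlocale(LC_ALL, locale_name) == NULL) {
        free(saved);
        return -1;
    }

    memset(&st, 0, sizeof st);
    result = (wcrtomb(buf, wc, &st) == (size_t)-1);

    /* Put the previous locale back if its name was saved. */
    if (saved != NULL) {
        setlocale(LC_ALL, saved);
        free(saved);
    }
    return result;
}
```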

Manlio Perillo

Oct 27, 2013, 12:19:13 PM

Il Sun, 27 Oct 2013 08:51:53 -0400, James Kuyper ha scritto:

> [...]

> If there are values for which wcrtomb() returns a negative value, then
> if there is any wchar_t array containing at least one such value before
> the terminating null wide character which, if passed to fprintf("%ls"),

Note that in my original question I used "%ls" format string, but "%lc"
is perfectly fine, and this is what I'm using to test my function for
100% coverage:

my_fprintf_that_use_vsnprintf_internally("%lc", (wchar_t) WEOF);
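
For illustration, a wrapper of that kind might look like the sketch
below. The name format_checked and its return codes are hypothetical,
not the library's actual API; the point is the r < 0 branch, which can
only be reached through an encoding error.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical wrapper around vsnprintf(). Returns the number of
   characters written, -1 on an encoding error, -2 on truncation. */
int format_checked(char *buf, size_t n, const char *fmt, ...)
{
    va_list ap;
    int r;

    va_start(ap, fmt);
    r = vsnprintf(buf, n, fmt, ap);
    va_end(ap);

    if (r < 0)
        return -1;          /* encoding error: the hard-to-cover branch */
    if ((size_t)r >= n)
        return -2;          /* output did not fit in buf */
    return r;
}
```

Strictly speaking, "%lc" expects a wint_t argument, so a call would be
written format_checked(buf, sizeof buf, "%lc", (wint_t) wc).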

> [...]

> That's an example of why conformance is unprovable, only non-conformance
> can be proven.

That's not true, IMHO.

Consider a similar problem to the one I posted, but in this case I want
to make fopen fail.

The standard says that
"The rules for composing valid file names are implementation-defined."

This means that I can not compose a filename that will make fopen fail;
however there is a standard conforming way for doing what I want.
I can obtain a valid file name with tmpnam, then remove the file and call
fopen with the same name. I think that fopen SHALL fail in this case.
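
The steps described above can be sketched as follows. The function name
is hypothetical. Note that tmpnam() only generates a name and does not
create a file, so the remove() call is expected to fail and its result
is ignored.

```c
#include <stdio.h>

/* Hypothetical check: returns 1 if fopen() in read mode failed for a
   freshly generated name, 0 if it unexpectedly succeeded, and -1 if
   tmpnam() could not generate a name. */
int fopen_missing_file_fails(void)
{
    char name[L_tmpnam];
    FILE *fp;

    if (tmpnam(name) == NULL)
        return -1;

    remove(name);            /* make sure no such file is left behind */

    fp = fopen(name, "r");   /* read mode: the file must already exist */
    if (fp != NULL) {
        fclose(fp);
        return 0;
    }
    return 1;
}
```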

Thanks Manlio Perillo

Manlio Perillo

Oct 27, 2013, 2:44:03 PM

Il Sun, 27 Oct 2013 08:42:56 -0700, Tim Rentsch ha scritto:

> Manlio Perillo <manlio_p...@SPAMlibero.it> writes:
>
>> What is a simple and standard conforming method to make snprintf return
>> a negative value?
>
> Apparently not possible in theory. In practice though it should be easy
> to find one by doing
>
> setlocale( LC_ALL, "C" );
>

This should not be required, unless setlocale was already called.

> [...]

>
>> Currently I'm doing this (using GCC 4.7.2 on Debian GNU/Linux 7.2):
>>
>> snprintf(buf, n, "Hello %ls", L"\U10FFFFFF");
>>
>> and it works, but it is not C99 [1].
>>
>> [1] by the way, it is strange that GCC in C99 standard mode accepts it
>
> What makes you think it isn't C99? Did you look at 6.4.4.4 p1 in n1256?

Yes, and there is no reference about the \U escaping.

Regards Manlio Perillo

Keith Thompson

Oct 27, 2013, 3:22:23 PM

In both N1256 (C99) and N1570 (C11), 6.4.4.4p1 has:

    escape-sequence:
        ...
        universal-character-name

"universal-character-name" is defined in 6.4.3:

    universal-character-name:
        \u hex-quad
        \U hex-quad hex-quad

    hex-quad:
        hexadecimal-digit hexadecimal-digit
            hexadecimal-digit hexadecimal-digit
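
To make the grammar concrete: \u is followed by exactly four hexadecimal
digits and \U by exactly eight. A small sketch (the identifiers are only
illustrative) that spells the same character both ways in wide string
literals:

```c
#include <wchar.h>

/* Short form: \u plus one hex-quad (four hex digits). */
static const wchar_t short_form[] = L"\u00E9";

/* Long form: \U plus two hex-quads (eight hex digits), same character. */
static const wchar_t long_form[] = L"\U000000E9";

/* Returns 1 if both spellings produced the same wide string. */
int ucn_spellings_match(void)
{
    return wcscmp(short_form, long_form) == 0;
}
```

By the same grammar, \U10FFFFFF from the original post is syntactically
a universal-character-name, since eight hex digits follow the \U.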

--
Keith Thompson (The_Other_Keith) ks...@mib.org <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

Manlio Perillo

Oct 27, 2013, 3:43:41 PM

Il Sun, 27 Oct 2013 12:22:23 -0700, Keith Thompson ha scritto:

> [...]

> In both N1256 (C99) and N1570 (C11), 6.4.4.4p1 has:
>
>     escape-sequence:
>         ...
>         universal-character-name
>
> "universal-character-name" is defined in 6.4.3:
>
>     universal-character-name:
>         \u hex-quad
>         \U hex-quad hex-quad
>
>     hex-quad:
>         hexadecimal-digit hexadecimal-digit
>             hexadecimal-digit hexadecimal-digit

Ah, right; thanks.

I really missed it, since universal-character-name was not defined in
6.4.4.4.

Regards Manlio Perillo

James Kuyper

Oct 27, 2013, 4:16:09 PM

On 10/27/2013 12:19 PM, Manlio Perillo wrote:
> Il Sun, 27 Oct 2013 08:51:53 -0400, James Kuyper ha scritto:
>
>> [...]
>> If there are values for which wcrtomb() returns a negative value, then
>> if there is any wchar_t array containing at least one such value before
>> the terminating null wide character which, if passed to fprintf("%ls"),
>
> Note that in my original question I used "%ls" format string, but "%lc"
> is perfectly fine, and this is what I'm using to test my function for
> 100% coverage:
>
> my_fprintf_that_use_vsnprintf_internally("%lc", (wchar_t) WEOF);

That's certainly a feasible approach; but it misses out on the
possibility that the "%lc" specifier is implemented correctly, while the
"%ls" format specifier is not.

>> [...]
>> That's an example of why conformance is unprovable, only non-conformance
>> can be proven.
>
> That's not true, IMHO.

I think, from your discussion below, that you don't understand the point
I was making.

> Consider a similar problem to the one I posted, but in this case I want
> to make fopen fail.
>
> The standard says that
> "The rules for composing valid file names are implementation-defined."
>
> This means that I can not compose a filename that will make fopen fail;
> however there is a standard conforming way for doing what I want.
> I can obtain a valid file name with tmpnam, then remove the file and call
> fopen with the same name. I think that fopen SHALL fail in this case.

"Opening a file with read mode ('r' as the first character in the mode
argument) fails if the file does not exist or cannot be read." (7.21.5.3p4)

You didn't specify that you were trying to open the file in read mode,
but I'll take that as implied. You're correct that fopen() should fail
under those circumstances, unless a file with that name is created
between the call to remove() and the call to fopen() - either by some
other process, or by some other thread in the same process.

However, that just means that a successful call to fopen() which
specifies read mode in those circumstances proves non-conformance. An
unsuccessful call to fopen() in those circumstances does NOT prove
conformance. If you perform this test N times, failure on every single
test run does not prove conformance - if it would have succeeded if you
had run test number N+1, the implementation is non-conforming, no matter
how big N is.
--
James Kuyper