Initializing a character array with a string literal?

Jef Driesen

Mar 15, 2010, 5:28:44 AM

Hi,

Sorry for cross-posting to c.l.c and c.l.c++, but I would like to know
the answer in both C and C++.

I know it is possible to initialize a character array with a string literal:

char str[] = "hello";

which is often more convenient than having to write:

char str[] = {'h', 'e', 'l', 'l', 'o', 0};

But in my case the array is not a real string but a byte array. Hence I
don't want the terminating null character, and I use unsigned char for
the data type. Now, is it allowed to write this:

unsigned char str[5] = "hello";

It works fine with gcc (in C code), but with msvc (in C++ code) I get an
error "C2117: array bounds overflow".

So I wonder if this construct is allowed, and whether there is a
difference in C and C++.

Thanks,

Jef

Nick Keighley

Mar 15, 2010, 5:35:06 AM

This is one of those places where C and C++ differ. C allows the
construct, C++ doesn't. No doubt comp.lang.c++ can suggest some
workarounds.
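
A minimal C sketch of the construct in question, for illustration only. The names `greeting`, `greeting_size` and `greeting_last` are made up here and do not come from the original code. In C, when the array has room for exactly the characters of the literal, the terminating null character is simply not stored; a C++ compiler rejects the same declaration.

```c
#include <stddef.h>

/* Valid C: the array holds exactly the five characters of "hello".
   There is no room for the terminating null character, so it is dropped.
   A C++ compiler rejects this declaration. */
static const unsigned char greeting[5] = "hello";

/* Number of bytes in the array: 5, not 6. */
size_t greeting_size(void)
{
    return sizeof greeting;
}

/* Last byte of the array: 'o', not '\0'. */
unsigned char greeting_last(void)
{
    return greeting[sizeof greeting - 1];
}
```

Because no terminator is stored, the array must never be passed to functions such as strlen() or printf("%s") that expect a null-terminated string.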

Alf P. Steinbach

Mar 15, 2010, 5:53:23 AM

* Nick Keighley:

The easiest in C++ is just to accept the superfluous null byte and write

unsigned char bytes[] = "hello";

Cheers & hth.,

- Alf

Jef Driesen

Mar 15, 2010, 6:44:43 AM

That works, but it's a little bit inconvenient because I use
sizeof(bytes) in a few places.

Ersek, Laszlo

Mar 15, 2010, 7:17:06 AM

In article <h5nnn.397200$Dy7.2...@newsfe26.ams2>, Jef Driesen <jefdr...@hotmail.com.invalid> writes:

> But in my case the array is not a real string but a byte array. Hence I
> don't want the terminating null character, and I use unsigned char for
> the data type. Now, is it allowed to write this:
>
> unsigned char str[5] = "hello";

This makes me think that you want to use the array for "binary
purposes", like writing it to a socket or to a binary file, so that it
leaves the boundaries of the system. In that case, the above
initialization is not portable, because it initializes str[0] .. str[4]
to platform-dependent values.

If your execution character set is ASCII-based, the above will amount to

char unsigned str[5] = { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu };

If your execution character set is EBCDIC-based, it will amount to

char unsigned str[5] = { 0x88u, 0x85u, 0x93u, 0x93u, 0x96u };

Even if you're sure that the execution character set will be
ASCII-based, the byte array form is much clearer on the issue, in my
opinion.

lacos

Alf P. Steinbach

Mar 15, 2010, 7:46:43 AM

* Jef Driesen:

OK.

C++ solution for that:

typedef unsigned char Byte;
typedef Byte ByteArr5[5];

Byte data[] = "hello";
ByteArr5& bytes = reinterpret_cast<ByteArr5&>( data );

The awkwardness implies that you're working at cross-purposes with the language,
though. E.g. perhaps the size should be a named constant. Or perhaps use a
std::vector or boost::array or whatever. Or perhaps this part should really be
written in pure C and just accessed from C++. Something.

Cheers, & still hth.,

- Alf

Jef Driesen

Mar 15, 2010, 8:19:49 AM

It is indeed used as binary data, but the contents happen to be ASCII
data (and a number of zero bytes too, so it's definitely not usable as a
null-terminated string). The reason I like the string literal is
that it makes the initialization a lot easier to read. If I see

unsigned char str[5] = "hello";

unsigned char str[5] = {0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu};

it's not immediately clear that the second variant is equal to "hello".

But I have to admit I didn't know that the character 'h' is not always
equal to 0x68. I assumed that for characters in the ASCII range this is
safe?

Jef Driesen

Mar 15, 2010, 8:24:21 AM

The code is actually written in C. But it uses a number of C99 features
(such as variable declarations that are not at the top of a block) that
are not supported by the msvc C compiler, so I compile it as C++ code.

Thus adjusting my sizeof's is a less ugly solution in my case.
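
One way the sizeof adjustment might look, as a rough sketch that compiles as both C and C++. The array contents and the names `bytes`, `BYTES_LEN`, `bytes_len` and `bytes_strlen` are invented for illustration. The embedded zero byte mirrors the kind of data described earlier in the thread and shows why strlen() is not a substitute.

```c
#include <stddef.h>
#include <string.h>

/* Accept the implicit terminator so the declaration is valid in C and C++.
   The payload is 5 bytes: 'a', 'b', 0, 'c', 'd'. */
static const unsigned char bytes[] = "ab\0cd";

/* Payload length without the terminator; use this instead of sizeof(bytes). */
#define BYTES_LEN (sizeof bytes - 1)

/* Returns the payload length derived from sizeof: 5. */
size_t bytes_len(void)
{
    return BYTES_LEN;
}

/* Returns what strlen() sees: it stops at the embedded zero byte, so 2. */
size_t bytes_strlen(void)
{
    return strlen((const char *)bytes);
}
```

Keeping the "minus one" in a single named constant means the other uses of the length do not each have to remember the terminator.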

Tom St Denis

Mar 15, 2010, 8:51:50 AM

On Mar 15, 8:24 am, Jef Driesen <jefdrie...@hotmail.com.invalid>
wrote:

> The code is actually written in C. But it uses a number of C99 features
> (such as variable declarations that are not at the top of a block) that
> are not supported by the msvc C compiler, so I compile it as C++ code.
>
> Thus adjusting my sizeof's is a less ugly solution in my case.

Maybe the solution is to refactor your code so you can declare your
local variables at the top of block scope? Just saying...

Tom

Kaz Kylheku

Mar 15, 2010, 8:56:25 AM

On 2010-03-15, Ersek, Laszlo <la...@ludens.elte.hu> wrote:
> If your execution character set is ASCII-based, the above will amount to
>
> char unsigned str[5] = { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu };

If your execution character set is EBCDIC-based, it means that
you haven't yet managed to install Linux on that old IBM junker you
got at the swap meet.

Writing { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu } instead of "hello", in the
name of portability to EBCDIC is egregiously moronic, and a disservice
to whoever is signing your paycheck.

pete

Mar 15, 2010, 9:20:53 AM

Maybe you can define
Byte data[5] = "hello";
and also
const size_t data_size = sizeof(data);
in a special C file that doesn't use any C99 features
and which can be compiled as a C file,
and then declare them with the extern keyword in your C++ files.
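
A sketch of how that split might look, shown here as a single translation unit so it can be compiled on its own. The file names in the comments are hypothetical, and the `extern "C"` guard is an addition assumed for the C++ side; it is not part of the suggestion above.

```c
#include <stddef.h>

/* --- would go in a shared header, e.g. a hypothetical data.h --- */
#ifdef __cplusplus
extern "C" {
#endif
extern unsigned char data[5];   /* the bound lets sizeof(data) work in C++ too */
extern const size_t data_size;
#ifdef __cplusplus
}
#endif

/* --- would go in a C-only file, e.g. a hypothetical data.c --- */
unsigned char data[5] = "hello";        /* valid C: no terminator is stored */
const size_t data_size = sizeof data;   /* 5 */
```

On the C++ side data_size is an ordinary object rather than a compile-time constant, so it cannot be used as an array bound there; sizeof(data) can, as long as the declaration carries the [5].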

--
pete

Ersek, Laszlo

Mar 15, 2010, 9:27:05 AM

In article <FBpnn.258290$zD4.2...@newsfe19.ams2>,
Jef Driesen <jefdr...@hotmail.com.invalid> writes:

> It is indeed used as binary data, but the contents happen to be ASCII
> data (and a number of zero bytes too, so it's definitely not usable as a
> null-terminated string). The reason I like the string literal is
> that it makes the initialization a lot easier to read. If I see
>
> unsigned char str[5] = "hello";
> unsigned char str[5] = {0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu};
>
> it's not immediately clear that the second variant is equal to "hello".

And rightfully so, because the second variant does *not* equal "hello"
on an EBCDIC execution character set, for example.

I can offer no solution that is really pleasing to the eye. At best:

/* "hello" encoded in ASCII */
const char unsigned hello[] = { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu };

You describe your network protocol as sequences of specific octets. The
first variant doesn't initialize the array to specific octets if you
don't restrict further aspects of your environment. Of course you can
say that the program only works correctly on ASCII-based execution
character sets (I guess that covers the vast majority of systems today).
I just wanted to make you aware of your reliance on the basic execution
character set being encoded in ASCII.

(I used the word "octet" above. I usually check #if 8 == CHAR_BIT and
abort compilation with #error if char doesn't have exactly 8 bits. All
of C89, C99, SUSv1 and SUSv2 permit bigger bytes theoretically. Even if
no actual system with bytes wider than 8 bits might exist that also
supports the BSD sockets interface, I like to spell out this dependency
of my code explicitly.)
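
The CHAR_BIT check described above could be spelled roughly like this. The error message text and the helper `bits_per_byte` are illustrative only.

```c
#include <limits.h>

/* Refuse to compile on platforms where a byte is not exactly 8 bits wide. */
#if 8 != CHAR_BIT
#error "this code assumes 8-bit bytes (octets)"
#endif

/* Returns the number of bits in a char; always 8 if compilation got this far. */
int bits_per_byte(void)
{
    return CHAR_BIT;
}
```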

> But I have to admit I didn't know that the character 'h' is not always
> equal to 0x68. I assumed that for characters in the ASCII range this is
> safe?

I'd risk saying it is safe on most systems today. Perhaps you'll want to
document your dependence on the ASCII encoding of the basic execution
character set, instead of changing the code.
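
One possible way to make that documented dependence machine-checked, as a sketch only. It uses the old negative-array-size trick so it works without C11's _Static_assert; the typedef name, `hello` and `hello_matches_ascii` are invented for this example. The check is done in an ordinary expression rather than in #if, because the value of a character constant inside #if is not guaranteed to match the execution character set.

```c
#include <string.h>

/* Compile-time check: the array size is -1 (a compile error) unless the
   listed characters have their ASCII values in the execution character set. */
typedef char assert_ascii_execution_charset[
    ('h' == 0x68 && 'e' == 0x65 && 'l' == 0x6C && 'o' == 0x6F) ? 1 : -1];

/* Readable form of the data; relies on the check above. */
static const unsigned char hello[] = "hello";

/* Returns 1 if the literal produced the expected ASCII octets, else 0. */
int hello_matches_ascii(void)
{
    static const unsigned char expected[5] =
        { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu };
    return memcmp(hello, expected, sizeof expected) == 0;
}
```

This only checks the characters it lists; a real program would list whichever characters its literals actually use.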

lacos

Ersek, Laszlo

Mar 15, 2010, 9:40:15 AM

It is not without example, though.

$ less bzip2-1.0.5/CHANGES

----v----
1.0.2
~~~~~

[...]

* Hard-code header byte values, to give correct operation on platforms
using EBCDIC as their native character set (IBM's OS/390).
(Leland Lucius)

[...]
----^----

I agree that documenting reliance on ASCII may be a better way to go
than diminishing the readability of the source for a dubious increase in
portability. Being aware of the issue is useful in any case, IMHO.

lacos

Stuart Golodetz

Mar 15, 2010, 10:25:50 AM

I'll probably regret suggesting this, but:

#define UNTERMINATED_STRING(var, str) char var[sizeof(str)-1]; \
    memcpy(var, str, sizeof(str)-1);

(The macro is logically a single line; the backslash continues it.)

It's a bit of an icky approach (at least coming from a C++ background),
but it gets it done.
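
A usage sketch for a macro of this shape, assuming it is only used at block scope (the memcpy call is a statement, so it cannot appear at file scope). The function `unterminated_demo` and the variable `buf` are invented for the example, and the trailing semicolon is left to the caller here.

```c
#include <stddef.h>
#include <string.h>

/* Declares `var` as a char array sized to the literal minus its terminator,
   then copies the characters in. The caller supplies the final semicolon. */
#define UNTERMINATED_STRING(var, str) \
    char var[sizeof(str) - 1]; \
    memcpy(var, str, sizeof(str) - 1)

/* Copies the 5 unterminated bytes into `out` and returns their count. */
size_t unterminated_demo(char *out)
{
    UNTERMINATED_STRING(buf, "hello");
    memcpy(out, buf, sizeof buf);
    return sizeof buf;
}
```

Unlike a plain initializer, this copies at run time and the array cannot be const, which may or may not matter for the use case.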

Cheers,
Stu

Ersek, Laszlo

Mar 15, 2010, 4:55:45 PM

In article <20100315...@gmail.com>, Kaz Kylheku <kkyl...@gmail.com> writes:

Not to debate your point any further, but I'd like to add the following:

1. In C99, __STDC_ISO_10646__ defined by the implementation implies,
AFAICT, that "hello" will in fact translate to { 0x68u, 0x65u, 0x6Cu,
0x6Cu, 0x6Fu } (and possibly a trailing \0 if space allows). I think
this can be derived from 6.10.8p2, 6.4.5p3, 6.4.4.4p11 and 5.2.1.2p1:

char unsigned s[5] = "hello";
= { 'h', 'e', 'l', 'l', 'o' };
= { L'h', L'e', L'l', L'l', L'o' };

= { 0x68u, 0x65u, 0x6Cu, 0x6Cu, 0x6Fu };

2. It seems to me that all four versions of the SUS published till now
were explicitly written with EBCDIC in mind.

v1:
System Interface Definitions
Issue 4, Version 2
4.4 Character Set Description File
paragraph 7

----v----
The charmap file was introduced to resolve problems with the portability
of, especially, /localedef/ sources. This document set assumes that the
portable character set is constant across all locales, but does not
prohibit implementations from supporting two incompatible codings, such
as both ASCII and EBCDIC. Such dual-support implementations should have
all charmaps and /localedef/ sources encoded using one portable character
set, in effect cross-compiling for the other environment. [...]
----^----

v2:
http://www.opengroup.org/onlinepubs/007908775/xbd/charset.html#tag_001_004

v3:
http://www.opengroup.org/onlinepubs/000095399/xrat/xbd_chap06.html#tag_01_06_01

v4:
http://www.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap06.html#tag_21_06_01

v4 is POSIX:2008 too, thus not very old. Citing the linked-to passage of
the v4 rationale:

----v----
A.6.1 Portable Character Set

The portable character set is listed in full so there is no dependency
on the ISO/IEC 646:1991 standard (or historically ASCII) encoded
character set, although the set is identical to the characters defined
in the International Reference version of the ISO/IEC 646:1991 standard.

[...]

The statement about invariance in codesets for the portable character
set is worded to avoid precluding implementations where multiple
incompatible codesets are available (for instance, ASCII and EBCDIC).
[...]
----^----

I hoarded all this stuff together because your post made me ponder
whether these standards I care about do require ASCII-based encodings
from a conforming implementation. They seem not to.

I'm not obsessed with EBCDIC per se. I generally care that my
assumptions about the environment -- not guaranteed by relevant
standards -- are *conscious*.

lacos

Jef Driesen

Mar 15, 2010, 5:17:21 PM

Those declarations are not the only C99 feature I'm using. Refactoring
is an option, but there are a lot more urgent items on my todo list if
you know what I mean.

For now, knowing that there is a difference between C and C++, using a
null terminated string works in both cases and is not that ugly to deal
with.

Jorgen Grahn

Mar 26, 2010, 8:31:47 AM

["Followup-To:" header set to comp.lang.c.]

For me it's the other way around -- add C99 declarations, and the
biggest reason for refactoring goes away.

But does MSVC really not support C99 features which are as fundamental
as this one? I have no experience with that compiler, but I find it
hard to believe. An old version?

/Jorgen

--
// Jorgen Grahn <grahn@ Oo o. . .
\X/ snipabacken.se> O o .

Default User

Mar 26, 2010, 3:40:11 PM

Jorgen Grahn wrote:

> For me it's the other way around -- add C99 declarations, and the
> biggest reason for refactoring goes away.
>
> But does MSVC really not support C99 features which are as fundamental
> as this one? I have no experience with that compiler, but I find it
> hard to believe. An old version?

MS has not been particularly receptive towards C99. The version of the
C compiler in MSVC 2005 doesn't support that feature.

Brian