Char 26

Kristee Summerford

Aug 4, 2024, 6:47:53 PM8/4/24
to rilucfirmde
One of the things that has always confused (or frustrated) me about the F2812, and now the F28335, is that a char is not individually addressable (as far as I know). The smallest unit size for a variable is a 16-bit value. There are so many instances where I want to explicitly use a variable that is only 8 bits (256 values). For instance, if I transfer a packed struct via serial port from a PC to the TI chip and want to cast that serial byte stream to a struct, it doesn't work.
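Not from the original post, but the usual workaround on the C2000 is to unpack the received stream field by field instead of casting it onto a struct; the Packet layout and the names rx and unpack_packet below are purely illustrative:

    // On C2000, sizeof(char) == 1 but a char occupies 16 bits, so a packed
    // PC-side struct cannot simply be overlaid on the received buffer.
    struct Packet {                  // illustrative: 1 byte + 2 bytes on the PC side
        unsigned int type;           // received as one 8-bit value
        unsigned int value;          // received as a 16-bit little-endian value
    };

    void unpack_packet(const unsigned int *rx, struct Packet *p)
    {
        p->type  = rx[0] & 0xFFu;                              // byte 0
        p->value = (rx[1] & 0xFFu) | ((rx[2] & 0xFFu) << 8);   // bytes 1-2, little-endian
    }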

Another example: say the serial stream also carries a 1-byte checksum, and on the F28335 I re-compute the checksum by adding up the values received. Normally the 8-bit value would overflow above 255 and wrap, but it will not on the TI chip, so the comparison at the end won't work unless additional code is added.
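One common fix (again, not from the original post) is to mask the running sum back to 8 bits so it wraps the way it would on a byte-oriented machine; checksum8, rx_buf and len are illustrative names:

    unsigned int checksum8(const unsigned int *rx_buf, unsigned int len)
    {
        unsigned int sum = 0;
        for (unsigned int i = 0; i < len; i++) {
            // Each 16-bit location holds one received byte in its low 8 bits;
            // masking after every addition emulates the 8-bit wrap-around the PC sees.
            sum = (sum + (rx_buf[i] & 0xFFu)) & 0xFFu;
        }
        return sum;   // directly comparable to the 8-bit checksum sent by the PC
    }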


Joshua Hintze said: I'm curious how this design decision got into a TI product. It really makes code portability frustrating. Take, for instance, this post where he is porting some open-source FAT16/32 library code. ( )


History, about 15 years of it in fact. The C2000 core was originally a 16-bit architecture, and even then the smallest addressable unit was the native 16-bit word. When the 32-bit architecture was introduced, that legacy behaviour had to be maintained for compatibility.


Characters are stored as numbers, however. You can see the specific encoding in the ASCII chart. This means that it is possible to do arithmetic on characters, in which the ASCII value of the character is used (e.g. 'A' + 1 has the value 66, since the ASCII value of the capital letter A is 65). See the Serial.println reference for more on how characters are translated to numbers.
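A tiny illustration of that arithmetic, assuming an ASCII-based execution character set:

    #include <iostream>

    int main() {
        char c = 'A';                        // stored as the number 65
        std::cout << c + 1 << '\n';          // prints 66: char promotes to int in arithmetic
        std::cout << char('A' + 1) << '\n';  // prints B
        return 0;
    }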


A char* stores the starting memory location of a C-string. [1] For example, we can use it to refer to the same array s that we defined above. We do this by setting our char* to the memory location of the first element of s:
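The code that accompanied this explanation was not preserved; a minimal reconstruction of what it likely showed (the name myPointer is my own, s and myStringLiteral follow the surrounding text):

    char s[] = "test";                      // array holding its own modifiable copy of "test"
    char* myPointer = s;                    // points at the first element, s[0]
    const char* myStringLiteral = "test";   // points directly at the string literal itself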


This is different from the array s above, which we are allowed to modify. This is because the string literal "test" is automatically copied into the array at initialization phase. But with myStringLiteral, no such copying occurs. (Where would we copy to, anyways? There's no array to hold our data... just a lonely char*!)


[1] Technical note: char* merely stores a memory location to things of type char. It can certainly refer to just a single char. However, it is much more common to use char* to refer to C-strings, which are NUL-terminated character sequences, as shown above.


The char type can only represent a single character. When you have a sequence of characters, they are laid out next to each other in memory, and the location of the first character in that sequence is what gets assigned to test. test is nothing more than a pointer to the memory location of the first character in "testing", with the added information that the type it points to is char.


The bottom line, however, is that char x; will only define a single character. If you want a string of characters, you have to define an array of char or a pointer to char (which you'll initialize with a string literal, as above, more often than not).


There are real differences between the first two options though. char *test=... defines a pointer named test, which is initialized to point to a string literal. The string literal itself is allocated statically (typically right along with the code for your program), and you're not supposed to (attempt to) modify it -- thus the preference for char const *.


The char test[] = .. allocates an array. If it's a global, it's pretty similar to the previous except that it does not allocate a separate space for the pointer to the string literal -- rather, test becomes the name attached to the string literal itself.


If you do this as a local variable, test will still refer directly to the string literal - but since it's a local variable, it allocates "auto" storage (typically on the stack), which gets initialized (usually from a normal, statically allocated string literal) on every entry to the block/scope where it's defined.


The latter versions (with an array of char) can act deceptively similar to a pointer, because the name of an array will decay to the address of the beginning of the array anytime you pass it to a function. There are differences though. You can modify the array, but modifying a string literal gives undefined behavior. Conversely, you can change the pointer to point at some other chars, so something like:
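The snippet that originally followed is missing; a minimal reconstruction of the idea (test2 and the const qualifier on the pointer are my additions, the latter so the example compiles cleanly as modern C++):

    int main() {
        const char *test = "testing";    // pointer to a statically allocated string literal
        test = "another string";         // OK: the pointer itself can be re-pointed
        // test[0] = 'X';                // through a non-const char* this compiles, but
                                         // modifying a string literal is undefined behaviour

        char test2[] = "testing";        // array holding its own modifiable copy
        test2[0] = 'T';                  // OK: modifying the array is well-defined
        return 0;
    }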


The main thing people forgot to mention is that "testing" is an array of chars in memory; there is no such thing as a primitive string type in C++. Therefore, as with any other array, you can't treat it as if it were a single element.


Using a * says that this variable points to a location in memory. In this case, it is pointing to the location of the string "testing". With a char pointer you are not limited to a single character, because it can refer to the start of a whole sequence of characters.


The difference between this example and yours is that b, which is an array, decays to a pointer to the first element when assigned to a. So in this case a contains the address of a local variable which then goes out of scope.


String literals are statically allocated, so the pointer is valid indefinitely. If you had said char b[] = "stackoverflow", then you would be allocating a char array on the stack that would become invalid when the scope ended. This difference also shows up for modifying strings: char s[] = "foo" stack allocates a string that you can modify, whereas char *s = "foo" only gives you a pointer to a string that can be placed in read-only memory, so modifying it is undefined behaviour.
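A sketch of the two cases being contrasted here (the function names are mine, a and b follow the text above):

    const char* from_literal() {
        const char* a = "stackoverflow";   // points at a statically allocated literal
        return a;                          // fine: the literal outlives the function
    }

    const char* from_array() {
        char b[] = "stackoverflow";        // modifiable copy on the stack
        const char* a = b;                 // b decays to a pointer to its first element
        return a;                          // dangling: b's storage ends at the closing brace
    }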


Other people have explained that this code is perfectly valid. This answer is about your expectation that, if the code had been invalid, there would have been a runtime error when calling printf. It isn't necessarily so.


It's absolutely vital to understand that "undefined behavior" does not mean "the program will crash predictably". It means "anything can happen", and anything includes appearing to work as the programmer probably intended (on this computer, with this compiler, today).


As a final note, none of the aggressive debugging tools I have convenient access to (Valgrind, ASan, UBSan) track "auto" variable lifetimes in sufficient detail to trap this error, but GCC 6 does produce this amusing warning:


I think that, as a proof of the previous answers, it is good to take a look at what really sits inside your compiled code. People have already mentioned that string literals live in the read-only (text) part of the image, so the literals are simply always there. You can easily see this for a small piece of code.
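The listing that originally followed was not preserved; here is a stand-in, with the kind of inspection commands meant shown as comments (file name, section names and exact output depend on your toolchain):

    // find_literal.cpp
    #include <cstdio>

    int main() {
        const char* msg = "testing";   // the literal itself is baked into the binary image
        std::puts(msg);
        return 0;
    }

    // $ g++ -O2 find_literal.cpp -o find_literal
    // $ strings find_literal | grep testing     # the literal shows up in the binary
    // $ objdump -s -j .rodata find_literal      # dumps the read-only data section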


I've always wondered why the C++ standard library instantiates basic_[io]stream and all its variants using the char type instead of unsigned char. char means (depending on whether it is signed or not) that you can get overflow and underflow for operations like get(), which leads to implementation-defined values for the variables involved. Another example is when you want to output a byte, unformatted, to an ostream using its put function.
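For illustration (not from the original question), this is the usual way to move a raw byte through the char-based streams without tripping over plain char's signedness:

    #include <iostream>

    int main() {
        // get() returns an int_type rather than char precisely so that end-of-file
        // can be represented outside the range of valid character values.
        std::istream::int_type c = std::cin.get();
        if (c != std::istream::traits_type::eof()) {
            unsigned char byte = static_cast<unsigned char>(c);  // well-defined 0..255 value
            std::cout.put(static_cast<char>(byte));              // put() takes a plain char
        }
        return 0;
    }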


The type of a 1-byte character in C++ is "char", not "unsigned char". This gives implementations a bit more freedom to do the best thing on the platform (for example, the standards body may have believed that there exist CPUs where signed byte arithmetic is faster than unsigned byte arithmetic, although that's speculation on my part). Also for compatibility with C. The result of removing this kind of existential uncertainty from C++ is C# ;-)


Given that the "char" type exists, I think it makes sense for the usual streams to use it even though its signedness isn't defined. So maybe your question is answered by the answer to, "why didn't C++ just define char to be unsigned?"


I have always understood it this way: the purpose of the iostream classes is to read and/or write a stream of characters, which, if you think about it, are abstract entities that the computer only represents through a character encoding. The C++ standard takes great pains to avoid pinning down that encoding, saying only that "Objects declared as characters (char) shall be large enough to store any member of the implementation's basic character set". It doesn't need to force a particular basic character set in order to define the C++ language; the standard can leave the choice of character encoding to the implementation (the compiler together with an STL implementation) and merely note that char objects represent single characters in some encoding.


An implementation writer could choose a single-octet encoding such as ISO-8859-1 or even a double-octet encoding such as UCS-2. It doesn't matter. As long as a char object is "large enough to store any member of the implementation's basic character set" (note that this explicitly forbids variable-length encodings), then the implementation may even choose an encoding that represents basic Latin in a way that is incompatible with any common encoding!


It is confusing that the char, signed char, and unsigned char types all have "char" in their names, but it is important to keep in mind that char does not belong to the same family of fundamental types as signed char and unsigned char: signed char belongs to the signed integer types and unsigned char to the unsigned integer types, while plain char is a distinct type of its own.


The one similarity between the char, signed char, and unsigned char types is that "[they] occupy the same amount of storage and have the same alignment requirements". Thus, you can reinterpret_cast from char * to unsigned char * in order to determine the numeric value of a character in the execution character set.
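A small illustration of that cast; the array name is mine and the commented values assume an ASCII-based execution character set:

    #include <iostream>

    int main() {
        char text[] = "Az";
        // View the same storage as unsigned char to read the raw numeric values,
        // independent of whether plain char is signed on this platform.
        unsigned char* bytes = reinterpret_cast<unsigned char*>(text);
        std::cout << static_cast<int>(bytes[0]) << ' '     // 65 for 'A'
                  << static_cast<int>(bytes[1]) << '\n';   // 122 for 'z'
        return 0;
    }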


To answer your question, the reason why the STL uses char as the default type is because the standard streams are meant for reading and/or writing streams of characters, represented by char objects, not integers (signed char and unsigned char). The use of char versus the numeric value is a way of separating concerns.


The standard does not specify whether plain char is implemented as signed or unsigned - that is compiler-specific. It only specifies that char must be "enough" to hold the characters on your system - characters as they were understood in those days, which is to say, no Unicode.
