"A char variable is of the natural size to hold a character on a given
machine (typically a byte), and an int variable is of the natural size
for integer arithmetic on a given machine (typically a word)."
Now the last statement (i.e. sizeof(int) typically == a word)
certainly shows the age of the text here. In the meantime, the
"natural" size of an int has grown to a 32-bit DWORD on most machines,
whereas 64-bit int's are becoming more and more common.
But what does this mean for char?? I was always under the assumption
that sizeof(char) is ALWAYS guaranteed to be exactly 1 byte,
especially since there is no C++ "byte" type. As we now have the
wchar_t as an intrinsic data type, wouldn't this cement the fact that
char is always 1 byte?
What does the ANSI standard have to say about this?
Bob Hairgrove
rhairgro...@Pleasebigfoot.com
---
[ comp.std.c++ is moderated. To submit articles, try just posting with ]
[ your news-reader. If that fails, use mailto:std...@ncar.ucar.edu ]
[ --- Please see the FAQ before posting. --- ]
[ FAQ: http://www.jamesd.demon.co.uk/csc/faq.html ]
> In Bjarne Stroustrup's 3rd edition of "The C++ Programming Language",
> there is an interesting passage on page 24:
>
> "A char variable is of the natural size to hold a character
> on a given machine (typically a byte) and an int variable
> is of the natural size for integer arithmetic
> on a given machine (typically a word)."
>
> Now the last statement (i.e. sizeof(int) typically == a word)
> certainly shows the age of the text here.
> In the meantime, the "natural" size of an int
> has grown to a 32-bit DWORD on most machines,
> whereas 64-bit int's are becoming more and more common.
>
> But what does this mean for char?
> I was always under the assumption that sizeof(char)
> is ALWAYS guaranteed to be exactly 1 byte,
> especially since there is no C++ "byte" type.
> As we now have the wchar_t as an intrinsic data type,
> wouldn't this cement the fact that char is always 1 byte?
>
> What does the ANSI standard have to say about this?
A byte is a data size -- not a data type.
A byte is 8 bits on virtually every modern processor
and the memories are almost always byte addressable.
A machine word is as wide as the integer data path
through the Arithmetic and Logic Unit (ALU).
The old Control Data Corporation (CDC) computers
had 60 bit words and were word addressable.
Characters were represented by 60 bit words
or were packed into a word 10 at a time
which means that the CDC character code set
had just 64 distinct codes represented by a 6 bit byte.
> In Bjarne Stroustrup's 3rd edition of "The C++ Programming Language",
> there is an interesting passage on page 24:
>
> "A char variable is of the natural size to hold a character on a given
> machine (typically a byte), and an int variable is of the natural size
> for integer arithmetic on a given machine (typically a word)."
>
> Now the last statement (i.e. sizeof(int) typically == a word)
> certainly shows the age of the text here.
No. You are applying a corruption of the term "word". It does not mean 16
bits. It means the natural size for the machine, typically the register
size. On a 128-bit machine it is 128 bits. On an 8-bit machine it is 8
bits.
> In the meantime, the
> "natural" size of an int has grown to a 32-bit DWORD on most machines,
No it hasn't. Most machines do not have DWORDs. 32-bit machines often have
words and half words.
4-bit machines that became 8-bit machines that became 16-bit machines that
became 32-bit machines have DWORDs. Nobody else has anything half as silly.
>
> whereas 64-bit int's are becoming more and more common.
64-bit registers are becoming more common. Many people who believe in DWORDs
object to 64-bit ints because their religion says that ints are 32 bits.
>
>
> But what does this mean for char?? I was always under the assumption
> that sizeof(char) is ALWAYS guaranteed to be exactly 1 byte,
Your assumption is invalid.
> especially since there is no C++ "byte" type. As we now have the
> wchar_t as an intrinsic data type, wouldn't this cement the fact that
> char is always 1 byte?
No.
The type char can be 16 bits like Unicode or even 32 bits like the ISO
character sets.
>
>
> What does the ANSI standard have to say about this?
Have you read it?
The standard mandates sizeof(char)==1. The only requirements on the size
of an 'int' are those implied by the requirements that INT_MIN<=-32767,
and INT_MAX>=32767 (these limits are incorporated by reference from the
C standard, rather than being specified in the C++ standard itself).
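For concreteness, a minimal sketch (using nothing beyond <climits> and
<cstdio>) that prints the implementation-defined quantities in question:
#include <climits>  // CHAR_BIT, INT_MIN, INT_MAX
#include <cstdio>
int main()
{
    // sizeof(char) is 1 by definition; CHAR_BIT says how many bits that one byte has
    std::printf("CHAR_BIT     = %d\n", CHAR_BIT);
    std::printf("sizeof(char) = %u\n", (unsigned)sizeof(char));    // always 1
    std::printf("sizeof(int)  = %u\n", (unsigned)sizeof(int));     // implementation-defined
    std::printf("INT_MIN = %d, INT_MAX = %d\n", INT_MIN, INT_MAX); // at least +/-32767
    return 0;
}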
Bjarne's statement is technically incorrect, but true to the history of
C, when he identifies "char" more closely with "character" than with
"byte". His statement about "words" is actually more accurate;
traditionally a "word" of memory wasn't a fixed amount of memory, but
varied from machine to machine. On a 32-bit machine, a "word" should
properly be a 32-bit chunk of memory. However, when people are used to
programming only for a limited range of architectures, all of which
share the same word size, they tend to assume that "word" means the same
amount of memory on all machines, that it refers to on the machines
they're used to. If enough people do this, the term may even end up
being redefined; confusing people who still remember the original
definition.
|> Bob Hairgrove wrote:
|> > In Bjarne Stroustrup's 3rd edition of "The C++ Programming
|> > Language", there is an interesting passage on page 24:
|> > "A char variable is of the natural size to hold a character
|> > on a given machine (typically a byte) and an int variable
|> > is of the natural size for integer arithmetic
|> > on a given machine (typically a word)."
|> > Now the last statement (i.e. sizeof(int) typically == a word)
|> > certainly shows the age of the text here. In the meantime, the
|> > "natural" size of an int has grown to a 32-bit DWORD on most
|> > machines, whereas 64-bit int's are becoming more and more common.
Excuse me, but on 32 bit machines (at least the ones I've seen), DWORD
is 64 bits. The "traditional" widths (from IBM, since the 360) are:
BYTE 8 bits
HWORD 16 bits
WORD 32 bits
DWORD 64 bits
The only place I've seen otherwise is on 16 bit machines. Where word is
16 bits. Or, of course, on 36 bit machines, with 36 bit words, or 48
bit machines, with 48 bit words.
|> > But what does this mean for char? I was always under the
|> > assumption that sizeof(char) is ALWAYS guaranteed to be exactly 1
|> > byte, especially since there is no C++ "byte" type. As we now
|> > have the wchar_t as an intrinsic data type, wouldn't this cement
|> > the fact that char is always 1 byte?
The standard defines the results of sizeof as the size in bytes. And
guarantees that sizeof(char) == 1. So by definition, the size of a char
is one byte, even if that char has 32 bits.
|> > What does the ANSI standard have to say about this?
|>
|> A byte is a data size -- not a data type.
|> A byte is 8 bits on virtually every modern processor and the
|> memories are almost always byte addressable.
I'm not so sure. From what I've heard, more than a few DSPs use 32 bit
char's.
|> A machine word is as wide as the integer data path through the
|> Arithmetic and Logic Unit (ALU).
Or as wide as the memory bus?
I'm not sure that there is a real definition of "word". I've used
machines (Interdata 32/7) where the ALU was 16 bits wide, but the native
instruction set favored 32 bits (through judicious microcode), and if I
remember correctly, the memory bus was 32 bits wide (but it has been a
long time, and I could be mistaken).
|> The old Control Data Corporation (CDC) computers had 60 bit words
|> and were word addressable. Characters were represented by 60 bit
|> words or were packed into a word 10 at a time which means that the
|> CDC character code set had just 64 distinct codes represented by a 6
|> bit byte.
This wouldn't be legal in C/C++, since UCHAR_MAX must be at least 255.
A C/C++ implementation on this machine would probably use six 10 bit bytes
to the word. (This is C/C++ specific. The original use of byte was
for a 6 bit chunk of data.)
There have definitely been C implementations on 36 bit machines, normally
with 9 bit bytes, and there are implementations today (for DSPs) with 32
bit bytes. There probably are, and have been, others as well.
--
James Kanze mailto:ka...@gabi-soft.de
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)179 2607481
No, it's not invalid, it is precisely correct.
sizeof (char) is required to be one byte. Note
that byte size can and does vary among platforms,
and that a char (byte) is required by the C standard
to have at least eight bits, but is not prevented
from having more.
> > especially since there is no C++ "byte" type. As we now have the
> > wchar_t as an intrinsic data type, wouldn't this cement the fact that
> > char is always 1 byte?
>
> No.
A char is indeed always one byte, but the definition
of type 'wchar_t' has no influence upon this.
"sizeof(char) == one byte" is mandated by the standard.
>
> The type char can be 16 bits like Unicode or even 32 bits like the ISO
> character sets.
Yes, on machines with 16-bit or 32-bit *bytes*.
On a machine with e.g. 8-bit bytes, type 'char'
cannot represent every Unicode character. Thus
'wchar_t' was invented.
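A minimal sketch of checking this on a given implementation (assuming
<cwchar> provides WCHAR_MAX, as a conforming implementation should):
#include <cstdio>
#include <cwchar>   // WCHAR_MAX
int main()
{
    std::printf("sizeof(char)    = %u\n", (unsigned)sizeof(char));    // 1 by definition
    std::printf("sizeof(wchar_t) = %u\n", (unsigned)sizeof(wchar_t)); // typically 2 or 4
    if (WCHAR_MAX >= 0x10FFFFL)  // highest ISO 10646 code point
        std::printf("wchar_t can hold every ISO 10646 code point\n");
    else
        std::printf("wchar_t cannot hold every ISO 10646 code point\n");
    return 0;
}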
>
> >
> >
> > What does the ANSI standard have to say about this?
>
> Have you read it?
Have *you*? :-)
-Mike
> when people are used to
>programming only for a limited range of architectures, all of which
>share the same word size, they tend to assume that "word" means the same
>amount of memory on all machines, that it refers to on the machines
>they're used to. If enough people do this, the term may even end up
>being redefined; confusing people who still remember the original
>definition.
Yes, and a similar confusion already exists for "byte", which many
people incorrectly assume to mean "8-bit byte".
Genny
|> > In the meantime, the "natural" size of an int has grown to a
|> > 32-bit DWORD on most machines,
|> No it hasn't. Most machines do not have DWORDs. 32-bit machines
|> often have words and half words.
IBM 360's (the prototypical 32 bit machine) certainly have DWORDS. A
DWORD is a 8 byte quantity, often initialized with 16 BCD digits. (The
IBM 360 had machine instructions for all four operations on such
quantities, as well as instructions for 4 bit left and right shifts over
DWORDs. Very useful for Cobol, or other languages that used decimal
arithmetic. We once converted the BCD arithmetic routines in a Basic
interpreter from C to assembler -- something like 150 lines of C became
10 lines of assembler, and ran four or five orders of magnitude faster.)
|> 4-bit machines that became 8-bit machines that became 16-bit
|> machines that became 32-bit machines have DWORDs. Nobody else has
|> anything half as silly.
That's because nobody else has been around half as long:-)? Seriously,
historical reasons lead to all kinds of silliness, where the normal
registers are called extended, and the non-extended registers need a
special instruction prefix to access them.
In the mean time, there are 64 bit machines out there where int is only
32 bits, and you need long to get 64 bits. That sounds pretty silly,
too, until you realize that the vendors have a lot of customers who were
stupid enough to write code which depended on int being exactly 32 bits.
And making your customer feel like an idiot has never been a
particularly successful commercial policy, even if it is sometimes the
truth.
In the good old days (pre-360), of course, no one worried about
compatibility, so a WORD in IBM's assembler could change from one
machine to the next. We didn't get such silliness. But we did have to
rewrite all of our code every time we upgraded the processor.
|> > whereas 64-bit int's are becoming more and more common.
|> 64-bit registers are becoming more common. Many people who believe
|> in DWORDs object to 64-bit ints because their religion says that
|> ints are 32 bits.
|> > But what does this mean for char?? I was always under the
|> > assumption that sizeof(char) is ALWAYS guaranteed to be exactly 1
|> > byte,
|> Your assumption is invalid.
I think you misread something. He said that his assumption was that
sizeof(char) is guaranteed to be exactly one byte. Which is exactly
what the standard says.
|> > especially since there is no C++ "byte" type. As we now have the
|> > wchar_t as an intrinsic data type, wouldn't this cement the fact
|> > that char is always 1 byte?
|> No.
Yes. ISO 14882, 5.3.3 and ISO 9899 6.5.3.4.
|> The type char can be 16 bits like Unicode or even 32 bits like the
|> ISO character sets.
The type char can be 16 bits, or 32 bits. In the past, it has often
been 9 bits, and I think that there have also been 10 bit
implementations.
But the size of char in bytes is always 1.
|> > What does the ANSI standard have to say about this?
|> Have you read it?
Have you?
--
James Kanze mailto:ka...@gabi-soft.de
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)179 2607481
---
sizeof(char) is guaranteed to be 1. 1 what though? 1 memory allocation
unit. All other types must have sizes which are multiples of sizeof(char).
The standard makes no claim that 1 memory allocation unit == 1 byte. On a
system with a 16-bit "natural character", sizeof(char) and sizeof(wchar_t)
might both be 1, and sizeof(int), though it's 32 bits, would be 2 not 4.
Further, there's no guarantee that you have any access to the smallest
addressable unit of storage, only to storage which is allocated in multiples
of char. For example, on an 8051, the smallest addressable unit is 1 bit,
but char is still 8 bits on 8051 C compilers - those addressable bits are
simply outside the C/C++ memory model on such a system (of course, an 8051
compiler will provide a way to access them, but it will do so by an
extension - nothing in the standard makes it possible).
HTH
-cd
[...]
| Bjarne's statement is technically incorrect, but true to the history of
| C, when he identifies "char" more closely with "character" than with
^^^^^^^^^^
| "byte".
Firstly, note that B. Stroustrup didn't *identify* "char" with
"character"; rather, I quote (from the original poster):
"A char variable is of the natural size to hold a character on a given
machine (typically a byte)
Secondly, it has been the tradition that 'char', in C++, is the
natural type for holding characters, as exemplified by the standard
type std::string and the standard narrow streams.
--
Gabriel Dos Reis, dos...@cmla.ens-cachan.fr
>> But what does this mean for char?? I was always under the assumption
>> that sizeof(char) is ALWAYS guaranteed to be exactly 1 byte,
>
>Your assumption is invalid.
>
Check out Mike Wahler's response ... seems that the standard does
guarantee this (although a byte doesn't have to be 8 bits). That is,
the guarantee seems to be that sizeof(char)==1 under all
circumstances.
>> What does the ANSI standard have to say about this?
>
>Have you read it?
Hmm ... I thought it was more expensive than it is ... Now that I have
gone to www.ansi.org, I was delighted to discover that it is only $18.
I'm sure this will be well worth buying.
Bob Hairgrove
rhairgro...@Pleasebigfoot.com
> Excuse me, but on 32 bit machines (at least the ones I've seen), DWORD
> is 64 bits.
I guess you have not seen Microsoft Windows, then. Just try
#include <windows.h>
#include <stdio.h>
int main()
{
printf("%d\n", (int)sizeof(DWORD));
}
in MSVC++ 6 or so. It prints 4, and it uses 8-bit bytes.
Regards,
Martin
AFAIK that is for backwards compatibility with 16-bit DOS and Windows
3.x. A double word at the assembler level is still 64 bits.
And as we're well off-topic at this point...
Cheers,
Chris
Section 5.3.3: "The sizeof operator yields the number of bytes in the
object representation of its operand."
> system with a 16-bit "natural character", sizeof(char) and sizeof(wchar_t)
> might both be 1, and sizeof(int), though it's 32 bits, would be 2 not 4.
Correct. For instance, that means that on such a system, 'int' is two
16-bit bytes long.
I hadn't looked at that section before this morning. I'm surprised they
worded it that way, since it's patently false given the most common meaning
of 'byte' (8 bits). It would have helped if the standard actually defined
the word byte, or simply not used it at all. As is, the section is
confusing at best.
And yes, I realize that in the past 'byte' was used more flexibly, with
'bytes' being 6, 7, 8, 9, 10, 12, and even 15 bits on various systems.
Surely today, and as surely in 1998, most readers think "8 bits" when they
see the word "byte".
-cd
> On Sun, 14 Apr 2002 07:01:54 GMT, Witless <wit...@attbi.com> wrote:
>
> >> But what does this mean for char?? I was always under the assumption
> >> that sizeof(char) is ALWAYS guaranteed to be exactly 1 byte,
> >
> >Your assumption is invalid.
> >
>
> Check out Mike Wahler's response ... seems that the standard does
> guarantee this (although a byte doesn't have to be 8 bits). That is,
> the guarantee seems to be that sizeof(char)==1 under all
> circumstances.
That's not the issue. The hidden redefinition of "byte" is the issue.
{OT} This sleight of hand is similar to the IRS definition of income.
>
>
> >> What does the ANSI standard have to say about this?
> >
> >Have you read it?
>
> Hmm ... I thought it was more expensive than it is ... Now that I have
> gone to www.ansi.org, I was delighted to discover that it is only $18.
> I'm sure this will be well worth buying.
I wish you good luck with it.
It's not 'redefined', it's defined. And it's not hidden.
>
> {OT} This sleight of hand is similar to the IRS definition of income.
Sleight of hand? I agree with the IRS part, but not that
it applies to the standard.
-Mike
One byte.
> 1 memory allocation
> unit.
No.
> All other types must have sizes which are multiples of sizeof(char).
Right. 'char' and 'byte' are synonymous in C++
> The standard makes no claim that 1 memory allocation unit == 1 byte.
It absolutely does. See my quote of the standard elsethread.
> On a
> system with a 16-bit "natural character",
In this context, 'natural character' == byte.
>sizeof(char) and sizeof(wchar_t)
> might both be 1,
sizeof(char) is *required* to be one byte.
sizeof(wchar_t) is usually larger, typically two
(but it's implementation-defined).
> and sizeof(int), though it's 32 bits, would be 2 not 4.
Absolutely not. sizeof(int) is implementation-defined,
but is still expressed in bytes (i.e. chars). A 32-bit
int's sizeof will be 32 / CHAR_BIT.
>
> Further, there's no guarantee that you have any access to the smallest
> addressable unit of storage,
Yes there is. The byte is specified as smallest addressable unit.
>only to storage which is allocated in multiples
> of char.
Right. 'char' == 'byte'
>For example, on an 8051, the smallest addressable unit is 1 bit,
But not from C++.
> but char is still 8 bits on 8051 C compilers
Which means smallest addressable unit (from C++) is an eight-bit byte.
>- those addressable bits are
> simply outside the C/C++ memory model on such a system
Exactly.
> (of course, an 8051
> compiler will provide a way to access them, but it will do so by an
> extension - nothing in the standard makes it possible).
Right.
-Mike
It does define it, in section 1.7p1: "The fundamental storage unit in
the C++ memory model is the _byte_. A byte is at least large enough to
contain any member of the basic execution character set and is composed
of a contiguous sequence of bits, the number of which is
implementation-defined." The fact that "byte" is italicized, indicates
that this clause should be taken as defining that term. As far as
standardese goes (which isn't very far) you can't get much clearer than
that. In particular, pay special attention to the very last part of
that definition.
> In Bjarne Stroustrup's 3rd edition of "The C++ Programming Language",
> there is an interesting passage on page 24:
>
> "A char variable is of the natural size to hold a character on a given
> machine (typically a byte), and an int variable is of the natural size
> for integer arithmetic on a given machine (typically a word)."
>
> Now the last statement (i.e. sizeof(int) typically == a word)
> certainly shows the age of the text here. In the meantime, the
> "natural" size of an int has grown to a 32-bit DWORD on most machines,
> whereas 64-bit int's are becoming more and more common.
Who's "DWORD"? On a PowerPC, a DWORD is 64 bits, a WORD is 32 bits.
Neither Microsoft nor Intel define C++.
> But what does this mean for char?? I was always under the assumption
> that sizeof(char) is ALWAYS guaranteed to be exactly 1 byte,
> especially since there is no C++ "byte" type. As we now have the
> wchar_t as an intrinsic data type, wouldn't this cement the fact that
> char is always 1 byte?
>
> What does the ANSI standard have to say about this?
sizeof(char) is 1 by definition, always has been in C and C++, and
almost certainly always will be. Changing it would break far too much
existing, properly working, conforming code. So a char is 1 byte,
which contains at least 8 bits or possibly more.
There are now C++ compilers for 32 bit digital signal processors where
char, short, int and long are all 1 byte and share the same
representation. Each of those bytes contains 32 bits.
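One practical consequence: code that tacitly assumes 8-bit bytes can make
the assumption explicit, so a port to such a DSP fails at compile time
instead of misbehaving at run time. A minimal sketch:
#include <climits>
#if CHAR_BIT != 8
#error "This module assumes 8-bit bytes (octets); review it before porting."
#endif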
--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://www.eskimo.com/~scs/C-faq/top.html
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++ ftp://snurse-l.org/pub/acllc-c++/faq
|> James Kanze <ka...@gabi-soft.de> writes:
|> > Excuse me, but on 32 bit machines (at least the ones I've seen),
|> > DWORD is 64 bits.
|> I guess you have not seen Microsoft Windows, then. Just try
Not directly. I've written a few programs for Windows, but we always
used Java/Corba for the GUI parts, and I wrote the code in pretty much
standard C++. A priori, however, DWORD is an assembler concept, and not
something I'd expect to see in C/C++.
|> #include <windows.h>
|> #include <stdio.h>
|> int main()
|> {
|> printf("%d\n", (int)sizeof(DWORD));
|> }
|> in MSVC++ 6 or so. It prints 4, and it uses 8-bit bytes.
I presume that there are some reasons of backwards compatibility.
Although I'll admit that I don't see what something like DWORD is doing
in a C++, or even a C, interface. Somebody must have seriously muffed
the design, a long time ago.
--
James Kanze mailto:ka...@gabi-soft.de
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)179 2607481
---
Sure, nevertheless:
"IA-32 Intel® Architecture
Software Developer’s
Manual
Volume 1:
Basic Architecture
....
4.1. FUNDAMENTAL DATA TYPES
The fundamental data types of the IA-32 architecture are bytes,
words, doublewords, quadwords, and double quadwords (see Figure
4-1). A byte is eight bits, a word is 2 bytes (16 bits), a
doubleword is 4 bytes (32 bits), a quadword is 8 bytes (64 bits),
and a double quadword is 16 bytes (128 bits). "
regards,
alexander.
> Bob Hairgrove wrote:
>
> > In Bjarne Stroustrup's 3rd edition of "The C++ Programming Language",
> > there is an interesting passage on page 24:
> >
> > "A char variable is of the natural size to hold a character
> > on a given machine (typically a byte) and an int variable
> > is of the natural size for integer arithmetic
> > on a given machine (typically a word)."
> >
> > Now the last statement (i.e. sizeof(int) typically == a word)
> > certainly shows the age of the text here.
> > In the meantime, the "natural" size of an int
> > has grown to a 32-bit DWORD on most machines,
> > whereas 64-bit int's are becoming more and more common.
> >
> > But what does this mean for char?
> > I was always under the assumption that sizeof(char)
> > is ALWAYS guaranteed to be exactly 1 byte,
> > especially since there is no C++ "byte" type.
> > As we now have the wchar_t as an intrinsic data type,
> > wouldn't this cement the fact that char is always 1 byte?
> >
> > What does the ANSI standard have to say about this?
>
> A byte is a data size -- not a data type.
> A byte is 8 bits on virtually every modern processor
> and the memories are almost always byte addressable.
This is absurd and totally incorrect. Just for example, the Analog
Devices SHARC is a very modern processor. Its byte is 32 bits and
its memory is not octet addressable at all.
--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://www.eskimo.com/~scs/C-faq/top.html
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++ ftp://snurse-l.org/pub/acllc-c++/faq
---
>> Section 5.3.3: "The sizeof operator yields the number of bytes in the
>> object representation of its operand."
>
>I hadn't looked at that section before this morning. I'm surprised they
>worded it that way, since it's patently false given the most common meaning
>of 'byte' (8 bits).
The standard doesn't rely on the common meaning in fact: it uses the
term as explained in §1.7p1. Note also, to complete the definition,
that what the standard requires is that a byte is uniquely addressable
*within* C++ and not within the hardware architecture: the two "units"
can be different, with the char type either larger or smaller.
As an example of the latter, a machine where the hardware-addressable
unit is 32-bit can still have a C++ compiler with 8-bit chars (the
minimum anyway, remember that CHAR_BIT>=8), even though this requires
that addresses which are not a multiple of the machine-unit contain both the
actual address and a relative offset.
The long and the short of it is that the compiler can perform all kinds
of magic to make things appear that don't exist at the assembly level:
it is a shell, and we are its inhabitants, at least until we go and
start our favourite disassembler and take a look at the world
outside ;)
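To make that concrete, here is a hypothetical sketch of the kind of access
sequence such a compiler might generate for a char load. Memory is simulated
with an array, and the names (Word, load_char) are purely illustrative:
#include <cstdio>
typedef unsigned int Word;               // assume a 32-bit hardware word on the host
static Word memory[1] = { 0x44434241u }; // 'A','B','C','D' (ASCII) packed into one word
unsigned char load_char(unsigned byte_addr)
{
    Word w = memory[byte_addr / 4];      // word-addressed fetch
    return (unsigned char)((w >> (8 * (byte_addr % 4))) & 0xFFu); // extract the 8-bit byte
}
int main()
{
    for (unsigned a = 0; a < 4; ++a)
        std::printf("%c", load_char(a)); // prints "ABCD"
    std::printf("\n");
    return 0;
}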
P.S.: the only thing that leaves me perplexed is the apparent circular
definition constituted by 5.3.3 and 3.9p4. Does anybody know if it is
resolved in another part of the standard?
Genny.
Welcome to the wonderful world of Windows, where everything is a typedef
or a macro.
--
Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)
|> There are now C++ compilers for 32 bit digital signal processors
|> where char, short, int and long are all 1 byte and share the same
|> representation. Each of those bytes contains 32 bits.
A slightly different issue, but I believe that most, if not all of these
are freestanding implementations. There is some question whether int
and char can be the same size on a hosted implementation, since
functions like fgetc (inherited from C) must return a value in the range
[0...UCHAR_MAX] or EOF, which must be negative.
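The familiar consequence at the source level is that the result of fgetc
must be stored in an int, never in a char, so that EOF stays distinguishable
from every byte value. A minimal sketch (the file name is just an example):
#include <cstdio>
int main()
{
    std::FILE* f = std::fopen("input.txt", "rb");  // example file name
    if (!f)
        return 1;
    int c;                                         // int, not char: must also hold EOF
    while ((c = std::fgetc(f)) != EOF)
        std::putchar(c);
    std::fclose(f);
    return 0;
}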
--
James Kanze mailto:ka...@gabi-soft.de
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)179 2607481
---
|> Bob Hairgrove wrote:
|> > On Sun, 14 Apr 2002 07:01:54 GMT, Witless <wit...@attbi.com> wrote:
|> > >> But what does this mean for char?? I was always under the
|> > >> assumption that sizeof(char) is ALWAYS guaranteed to be exactly
|> > >> 1 byte,
|> > >Your assumption is invalid.
|> > Check out Mike Wahler's response ... seems that the standard does
|> > guarantee this (although a byte doesn't have to be 8 bits). That
|> > is, the guarantee seems to be that sizeof(char)==1 under all
|> > circumstances.
|> That's not the issue. The hidden redefinition of "byte" is the
|> issue.
What hidden redefinition? The definition for the word as used in the
standard is in 1.7, which is where all of the definitions are. As
usual, the standard uses a somewhat stricter definition that the
"normal" definition. In particular:
- Not all machines have addressable bytes. All C/C++ implementations
must have addressable bytes. This requirement can be met in one of
two ways: declaring machine words to be bytes (typical for DSP's),
or implementing some form of extended addressing, where char* is
larger than int* (typical for general purpose word addressed
machines).
- A byte may be less than 8 bits -- the first use of the word, in
fact, was for six bit entities. The C/C++ standard requires bytes
to have at least eight bits.
--
James Kanze mailto:ka...@gabi-soft.de
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)179 2607481
---
|> Carl Daniel <cpda...@pacbell.net> wrote in message
|> news:Im5u8.1420$Uf.127...@newssvr21.news.prodigy.com...
|> > "Bob Hairgrove" <rhairgro...@Pleasebigfoot.com> wrote in message
|> > news:3cb820f5...@news.ch.kpnqwest.net...
|> > > But what does this mean for char?? I was always under the
|> > > assumption that sizeof(char) is ALWAYS guaranteed to be exactly
|> > > 1 byte, especially since there is no C++ "byte" type. As we now
|> > > have the wchar_t as an intrinsic data type, wouldn't this cement
|> > > the fact that char is always 1 byte?
|> > sizeof(char) is guaranteed to be 1. 1 what though?
|> One byte.
|> > 1 memory allocation unit.
|> No.
I'd say yes. But the name of that memory allocation unit is "byte".
|> > All other types must have sizes which are multiples of
|> > sizeof(char).
|> Right. 'char' and 'byte' are synonymous in C++
Not quite. A C/C++ program cannot directly access "bytes"; it can only access
"objects", which are sequences of contiguous bytes. On the other hand,
the standard requires that for char and its signed and unsigned
variants, this sequence is exactly one element long, and that these
types (or at least unsigned char) contain no padding. So any
distinction between unsigned char and byte is purely formal.
|> > The standard makes no claim that 1 memory allocation unit == 1
|> > byte.
|> It absolutely does. See my quote of the standard elsethread.
|> > On a
|> > system with a 16-bit "natural character",
|> In this context, 'natural character' == byte.
According to the definition in the standard, at any rate.
|> >sizeof(char) and sizeof(wchar_t) might both be 1,
|> sizeof(char) is *required* to be one byte.
|> sizeof(wchar_t) is usually larger, typically two
|> (but it's implementation-defined).
The most frequent situation, I think, is 8 bit char's and 32 bit
wchar_t's. Anything less than about 21 bits for a wchar_t pretty much
makes them relatively useless, since the only widespread code set with
more than 8 bits is ISO 10646/Unicode, which requires 21 bits. (But of
course, the standard doesn't require wchar_t -- or anything else, for
that matter -- to be useful:-).)
|> > and sizeof(int), though it's 32 bits, would be 2 not 4.
|> Absolutely not. sizeof(int) is implementation-defined, but is still
|> express in bytes (i.e. chars). A 32-bit int's sizeof will be 32 /
|> CHAR_BIT.
Which on some DSPs is 1.
|> > Further, there's no guarantee that you have any access to the
|> > smallest addressable unit of storage,
|> Yes there is. The byte is specified as smallest addressible unit.
Yes and no. Before making any claims, it is important to state what you
are claiming. If the claim involves the smallest unit of addressable
storage at the hardware level, there is no guarantee -- the smallest
unit of addressable storage in C/C++ must be at least 8 bits, and there
exist machines with hardware addressable bits. If the claim refers to
the C/C++ memory model, it is true by definition.
--
James Kanze mailto:ka...@gabi-soft.de
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)179 2607481
---
|> "James Kuyper Jr." <kuy...@wizard.net> wrote in message
|> news:3CB9CE84...@wizard.net...
|> > Carl Daniel wrote:
|> > ....
|> > > sizeof(char) is guaranteed to be 1. 1 what though? 1 memory
|> > > allocation unit. All other types must have sizes which are
|> > > multiples of sizeof(char). The standard makes no claim that 1
|> > > memory allocation unit == 1 byte. On a
|> > Section 5.3.3: "The sizeof operator yields the number of bytes in
|> > the object representation of its operand."
|> I hadn't looked at that section before this morning. I'm surprised
|> they worded it that way, since it's patently false given the most
|> common meaning of 'byte' (8 bits). It would have helped if the
|> standard actually defined the word byte, or simply not used it at
|> all. As is, the section is confusing at best.
The word "byte" has never meant eight bits. Historically, the word was
invented at IBM to refer to a unit of addressable memory smaller than a
word -- I believe that the first use of the word was for 6 bit units.
The standard, of course, doesn't use byte in this sense -- a word
addressed machine doesn't have bytes, but an implementation of C or C++
on it does. The standard actually uses the word with two different (but
not incompatible) meanings.
The first definition is given in 1.7. The identity of these bytes with
char/unsigned char/signed char is not explicitly stated, but follows
from the fact that sizeof on these types must return 1, and that these
types (or at least unsigned char) cannot contain padding or bits which
don't participate in the value.
The second definition is indirectly given in 17.3.2.1.3.1, which
defines "null-terminated byte string". In this case, a "null-terminated
byte string" is an array of char, signed char or unsigned char, which is
delimited by a 0 sentinel object. In the given context, the implication
is that the "bytes" are actually characters (or parts of multi-byte
characters), and that the sequence doesn't contain the value 0.
|> And yes, I realize that in the past 'byte' was used more flexibly,
|> with 'bytes' being 6, 7, 8, 9, 10, 12, and even 15 bits on various
|> systems. Surely today, and as surely in 1998, most readers think "8
|> bits" when they see the word "byte".
Not just in the past. The fact that machines with bytes of other than 8
bits have become rare doesn't negate the fact that when you do talk of
them, the word "byte" doesn't mean 8 bits. And the distinction is still
relevant -- look at any of the RFC's, for example, and you'll find
that when 8 bits is important, the word used is octet, and not byte.
--
James Kanze mailto:ka...@gabi-soft.de
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)179 2607481
---
A simpler way to clarify the distinction is to point out that a byte is a
unit used to measure memory, while char is a data type that is defined as
fitting into one byte. As such, 'char' is a much richer concept than
'byte'.
....
> The most frequent situation, I think, is 8 bit char's and 32 bit
> wchar_t's. Anything less than about 21 bits for a wchar_t pretty much
> makes them relatively useless, since the only widespread code set with
> more than 8 bits is ISO 10646/Unicode, which requires 21 bits. (But of
> course, the standard doesn't require wchar_t -- or anything else, for
There's a 16 bit variable of it. While I don't use either version of it
in any of my own programs, from what I'd heard here and on comp.std.c,
I'd gotten the impression that the 16 bit variant was more widely used
than the 32 bit version. Yours is the first mention I've ever seen of a
21 bit version - or are you specifying the number of bits actually used
by the 32 bit version?
Right, this is called "a memory location" or "a memory granule", AFAIK.
> is 32-bit can still have a C++ compiler with 8-bit chars ....
Well, things are getting much more interesting with *threads*
added into play. Just for your information (it is probably
off-topic here; at least in this thread):
http://www.opengroup.org/austin/aardvark/finaltext/xbdbug.txt
(see "Defect in XBD 4.10 Memory Synchronization (rdvk# 26)")
regards,
alexander.
Not with respect to the C++ standard. Section 5.3.3p1 says "The sizeof
operator yields the number of bytes in the object representation of its
operand. ... sizeof(char), sizeof(signed char), and sizeof(unsigned
char) are 1."
> > especially since there is no C++ "byte" type. As we now have the
> > wchar_t as an intrinsic data type, wouldn't this cement the fact that
> > char is always 1 byte?
>
> No.
>
> The type char can be 16 bits like Unicode or even 32 bits like the ISO
> character sets.
Yes, but under the C++ standard, that simply means that a "byte" will
become 16 or 32 bits, respectively. That's what the CHAR_BIT macro is
for.
> sizeof(char) is guaranteed to be 1. 1 what though? 1 memory allocation
> unit. All other types must have sizes which are multiples of sizeof(char).
> The standard makes no claim that 1 memory allocation unit == 1 byte.
It certainly does: 1.7, [intro.memory]/1:
# The fundamental storage unit in the C++ memory model is the byte. A
# byte is at least large enough to contain any member of the basic
# execution character set and is composed of a contiguous sequence of
# bits, the number of which is implementation-defined.
> On a system with a 16-bit "natural character", sizeof(char) and
> sizeof(wchar_t) might both be 1, and sizeof(int), though it's 32
> bits, would be 2 not 4.
On such a system, a byte would have 16 bits.
Regards,
Martin
| Carl Daniel wrote:
| ....
| > sizeof(char) is guaranteed to be 1. 1 what though? 1 memory allocation
| > unit. All other types must have sizes which are multiples of sizeof(char).
| > The standard makes no claim that 1 memory allocation unit == 1 byte. On a
|
| Section 5.3.3: "The sizeof operator yields the number of bytes in the
| object representation of its operand."
Exactly. The question is what you think "byte" means in the C++
standards text.
--
Gabriel Dos Reis, dos...@cmla.ens-cachan.fr
---
> James Kanze <ka...@gabi-soft.de> writes:
>
> > Excuse me, but on 32 bit machines (at least the ones I've seen), DWORD
> > is 64 bits.
>
> I guess you have not seen Microsoft Windows, then. Just try
Microsoft(R) Windows(!tm) is not based on 32 bits but on 16 bits.
>
>
> #include <windows.h>
> #include <stdio.h>
>
> int main()
> {
> printf("%d\n", (int)sizeof(DWORD));
> }
>
> in MSVC++ 6 or so. It prints 4, and it uses 8-bit bytes.
.... which is consistent with its ancestry.
The standard is what *defines* these issues. No
way can it be 'false'.
>given the most common meaning
> of 'byte' (8 bits).
Irrelevant. The size of a byte is only required to be
*at least* eight bits, but is allowed to be larger.
>It would have helped if the standard actually defined
> the word byte,
It does. Smallest addressable storage unit.
The system you describe above's 1-bit addressable
unit does not meet the requirement of at least
eight bits. So from C++, smallest addressable
unit for that machine is whichever larger unit
with at least eight bits is addressable. Perhaps
a 'word'.
> or simply not used it at all.
There has to be defined some point of reference.
It's a byte.
>As is, the section is
> confusing at best.
Yes, "IOS-ese" takes a while to understand.
>
> And yes, I realize that in the past 'byte' was used more flexibly, with
> 'bytes' being 6, 7, 8, 9, 10, 12, and even 15 bits on various systems.
It still is.
> Surely today, and as surely in 1998, most readers think "8 bits" when they
> see the word "byte".
And they're wrong. "Eight bits" == "octet".
-Mike
|> "James Kuyper Jr." <kuy...@wizard.net> writes:
|> [...]
|> | Bjarne's statement is technically incorrect, but true to the history of
|> | C, when he identifies "char" more closely with "character" than with
|> ^^^^^^^^^^
|> | "byte".
|> Firstly, note that B. Stroustrup didn't *identify* "char" with
|> "character"; rather, I quote (from the original poster):
|> "A char variable is of the natural size to hold a character on a
|> given machine (typically a byte)
|> Secondly, it has been the tradition that 'char', in C++, is the
|> natural type for holding characters, as exemplified by the standard
|> type std::string and the standard narrow streams.
For you and me, maybe, but one could argue that we are being
anachronistic. For most modern applications which deal with text, I
suspect that there is no natural type for holding characters -- wchar_t
comes close, but there are systems where it is not sufficiently large to
hold an ISO 10646 character. And it is rarely well supported. The
result is that if I had to write an application dealing with text, I'd
probably end up defining my own character type (which might be a typedef
to wchar_t, if portability weren't a real concern -- all of the
machines I normally deal with define wchar_t as a 32 bit type).
--
James Kanze mailto:ka...@gabi-soft.de
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)179 2607481
---
Not everyone uses the term correctly, not even (apparently) Intel. I'll
consider an ISO specification more authoritative than a company
specification any day (though it could still be wrong). I'd like to
know; do any of the other ISO standards define the "byte"? If so, which
definition do they use?
| Jack Klein <jack...@spamcop.net> writes:
|
| |> There are now C++ compilers for 32 bit digital signal processors
| |> where char, short, int and long are all 1 byte and share the same
| |> representation. Each of those bytes contains 32 bits.
|
| A slightly different issue, but I believe that most, if not all of these
| are freestanding implementations. There is some question whether int
| and char can be the same size on a hosted implementation, since
| functions like fgetc (inherited from C) must return a value in the range
| [0...UCHAR_MAX] or EOF, which must be negative.
Or you may just look at the requirements imposed by the standard
std::string class in clause 21.
A conforming hosted implementation cannot have
values_set(char) == values_set(int)
because every bit in a char representation participates in the value
representation, i.e. all bits in a char are meaningful.
--
Gabriel Dos Reis, dos...@cmla.ens-cachan.fr
---
However, since there are other ways of detecting file errors and end of
file, than checking for EOF, that doesn't absolutely require that EOF be
outside the range of char values. In fact, I gather that the consensus
of the C committee has been that it doesn't, though I couldn't find any
currently listed DR on the issue - however, the place I searched only
goes back to DR 201.
I discovered that definition in 1.7 too late - too bad it isn't mentioned in
the index or cross-referenced in 5.3.3. I still maintain that it was a poor
choice of word, since the definition the standard uses (and gives) is not
the common one (these days).
-cd
Unicode 2.0 had 40-some-odd thousand characters, so a 16-bit variable
could hold all possible values. Unicode 3.0 has over 90,000 characters,
so a 16-bit variable doesn't work. There's a UTF-16 encoding, but that's
analogous to a multi-byte character string in C and C++: a pain to work
with. Java apologists will tell you that this is no big deal, because
the characters that require two 16-bit values are rarely used. But then,
they're stuck with Java's choice of 16 bits for character types, so
naturally they claim that it doesn't really matter.
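For the curious, this is what the pain looks like: one code point above
U+FFFF becomes two 16-bit units. A minimal sketch of the UTF-16 surrogate
computation (the code point is just an example):
#include <cstdio>
int main()
{
    unsigned long cp = 0x1D11EUL;                         // a character beyond the 16-bit range
    unsigned long v  = cp - 0x10000UL;                    // 20 bits left to encode
    unsigned high = (unsigned)(0xD800u + (v >> 10));      // high (leading) surrogate
    unsigned low  = (unsigned)(0xDC00u + (v & 0x3FFu));   // low (trailing) surrogate
    std::printf("U+%05lX -> %04X %04X\n", cp, high, low); // prints: U+1D11E -> D834 DD1E
    return 0;
}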
--
Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)
---
It is a Microsoft design, after all. :-)
======================================= MODERATOR'S COMMENT:
I almost bounced this as a flame...
Clause 21 is very large and complicated; it would help if you could be
more specific about what you're referring to.
> A conforming hosted implementation cannot have
>
> values_set(char) == values_set(int)
I'm not clear what you're saying. The standard doesn't define anything
called values_set(), so I presume you're just using it as convenient
shorthand for the set of valid values for the type. However, I can't see
any reason why that would be prohibited.
> because every bit in a char representation participates in the value
> representation, i.e. all bits in a char are meaningful
I don't see how that requirement would be violated by an implementation
which had CHAR_MIN==INT_MIN, and CHAR_MAX==INT_MAX.
"Alexander Terekhov" <tere...@web.de> wrote in message
news:3CBAD3A5...@web.de...
>
> Well, things are getting much more interesting with *threads*
> added into play. Just for your information (it is probably
> off-topic here; at least in this thread):
>
> http://www.opengroup.org/austin/aardvark/finaltext/xbdbug.txt
> (see "Defect in XBD 4.10 Memory Synchronization (rdvk# 26)")
==** BEGIN PASTE **==
Problem:
Defect code : 3. Clarification required
d...@dvv.ru (Dima Volodin) wrote:
....
The standard doesn't provide any definition on memory location [POSIX is
a C API, so it must be done in C terms?]. Also, as per standard C rules,
access to one memory location [byte?] shouldn't have any effect on a
different memory location. POSIX doesn't seem to address this issue, so
the assumption is that the usual C rules apply to multi-threaded
programs. On the other hand, the established industry practices are such
that there is no guarantee of integrity of certain memory locations when
modification of some "closely residing" memory locations is performed.
The standard either has to clarify that access to distinct memory
locations doesn't have to be locked [which, I hope, we all understand,
is not a feasible solution] or incorporate current practices in its
wording providing users with means to guarantee data integrity of
distinct memory locations. "Please advise."
---
http://groups.google.com/groups?hl=en&selm=3B0CEA34.845E7AFF%40compaq.com
Dave Butenhof (David.B...@compaq.com) wrote:
....
POSIX says you cannot have multiple threads using "a memory location"
without explicit synchronization. POSIX does not claim to know, nor
try to specify, what constitutes "a memory location" or access to it,
across all possible system architectures. On systems that don't use
atomic byte access instructions, your program is in violation of the
rules.
==**END PASTE**==
I don't like that answer, as it seems it would be near impossible to write
portable code without some common notion of atomically updatable memory
location. But isn't this actually what type sig_atomic_t (sizeof >= 1) is
intended for?
hys
--
Hillel Y. Sims
hsims AT factset.com
|> James Kanze wrote:
|> > Jack Klein <jack...@spamcop.net> writes:
|> > |> There are now C++ compilers for 32 bit digital signal
|> > |> processors where char, short, int and long are all 1 byte and
|> > |> share the same representation. Each of those bytes contains
|> > |> 32 bits.
|> > A slightly different issue, but I believe that most, if not all of
|> > these are freestanding implementations. There is some question
|> > whether int and char can be the same size on a hosted
|> > implementation, since functions like fgetc (inherited from C) must
|> > return a value in the range [0...UCHAR_MAX] or EOF, which must be
|> > negative.
|> However, since there are other ways of detecting file errors and end
|> of file, than checking for EOF, that doesn't absolutely require that
|> EOF be outside the range of char values. In fact, I gather that the
|> consensus of the C committee has been that it doesn't, though I
|> couldn't find any currently listed DR on the issue - however, the
|> place I searched only goes back to DR 201.
The C standard definitely requires that all characters in the basic
character set be positive, and the EOF be negative.
The open issue is, I think, whether fgetc is required to be able to
return *all* values in the range of 0...UCHAR_MAX. For actual characters,
this is not a problem -- if we have 32 bit char's, it is certain that
some of the values will not be used as a character. (ISO 10646, for
example, only uses values in the range 0...0x10FFFF.) But fgetc can
also be used to read raw "bytes"; what happens then?
What I suspect is that on an implementation using 32 bit char's, fgetc
in fact will return something in the range 0...255, or -1 for EOF.
IMHO, this should be a legal implementation, however, I don't think that
the current C standard is unambiguously clear that this is the case.
--
James Kanze mailto:ka...@gabi-soft.de
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)179 2607481
---
|> James Kanze <ka...@gabi-soft.de> writes:
|> | Jack Klein <jack...@spamcop.net> writes:
|> | |> There are now C++ compilers for 32 bit digital signal
|> | |> processors where char, short, int and long are all 1 byte and
|> | |> share the same representation. Each of those bytes contains
|> | |> 32 bits.
|> | A slightly different issue, but I believe that most, if not all of
|> | these are freestanding implementations. There is some question
|> | whether int and char can be the same size on a hosted
|> | implementation, since functions like fgetc (inherited from C) must
|> | return a value in the range [0...UCHAR_MAX] or EOF, which must be
|> | negative.
|> Or you may just look at the requirements imposed by the standard
|> std::string class in clause 21.
The advantage of basing the argument on fgetc is that it becomes a C
problem as well, and not something specific to C++.
|> A conforming hosted implementation cannot have
|> values_set(char) == values_set(int)
|> because every bit in a char representation participates in the value
|> representation, i.e. all bits in a char are meaningful.
Where does it say this? (Section 21 is large.)
What are the implications for an implementation which wants to support
ISO 10646 on a 32 bit machine? The smallest type it can declare which
supports ISO 10646 is 32 bits.
--
James Kanze mailto:ka...@gabi-soft.de
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)179 2607481
---
| Gabriel Dos Reis wrote:
| ....
| > Or you may just look at the requirements imposed by the standard
| > std::string class in clause 21.
|
| Clause 21 is very large and complicated;
Not that complicated; it suffices to look at the first two pages.
21.1.2/2
For a certain character container type char_type, a related
container type INT_T shall be a type or class which can represent
all of the valid characters converted from the corresponding
char_type values, as well as an end-of-file value, eof(). The type
int_type represents a character container type which can hold
end-of-file to be used as a return type of the iostream class member
functions.
The case in interest is when char_type == char and int_type == int.
Now, look at the table 37 (Traits requirements)
X::eof() yields: a value e such that X::eq_int_type(e,X::to_int_type(c))
is false for all values c.
(by 21.1.1/1, c is of type char).
The standard also says that any bit pattern for char represents a
valid char value, therefore eof() can't be in the values-set of char.
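A small sketch of the machinery that argument rests on; on a typical
implementation eof() is -1, which to_int_type(c) never equals for a real
character:
#include <string>   // std::char_traits
#include <cstdio>
int main()
{
    typedef std::char_traits<char> traits;
    traits::int_type e = traits::eof();                // must differ from every converted char
    traits::int_type c = traits::to_int_type('a');     // a genuine character value
    std::printf("eof() = %d, to_int_type('a') = %d, eq_int_type: %d\n",
                e, c, (int)traits::eq_int_type(e, c)); // eq_int_type is false here
    return 0;
}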
[...]
| I don't see how that requirement would be violated by an implementation
| which had CHAR_MIN==INT_MIN, and CHAR_MAX==INT_MAX.
See above.
--
Gabriel Dos Reis, dos...@cmla.ens-cachan.fr
---
[...]
| The open issue is, I think, whether fgetc is required to be able to
| return *all* values in the range of 0...UCHAR_MAX. For actual characters,
| this is not a problem -- if we have 32 bit char's, it is certain that
| some of the values will not be used as a character.
That won't be conforming, since the standard says that any bit pattern
represents a valid value.
--
Gabriel Dos Reis, dos...@cmla.ens-cachan.fr
---
In itself, that would merely mean that 'int' can't be the int_type for
'char'. The clincher is that 21.1.3.1 explicitly specifies that int_type
for char_traits<char> must be 'int'. Therefore, I concede your point.
Someone was keeping a list of C/C++ differences - this should be added
to that list; C makes no such guarantee.
It explicitly restricts that guarantee to unsigned char; char is allowed
to be signed.
Witless wrote:
>
> "Martin v. Löwis" wrote:
>
> > James Kanze <ka...@gabi-soft.de> writes:
> >
> > > Excuse me, but on 32 bit machines (at least the ones I've seen), DWORD
> > > is 64 bits.
> >
> > I guess you have not seen Microsoft Windows, then. Just try
>
> Microsoft(R) Windows(!tm) is not based on 32 bits but on 16 bits.
>
The 32 bit Windows versions (95 and later, NT and later) are 32 bit environments
as far as the app is concerned.
|> > The most frequent situation, I think, is 8 bit char's and 32 bit
|> > wchar_t's. Anything less than about 21 bits for a wchar_t pretty
|> > much makes them relatively useless, since the only widespread code
|> > set with more than 8 bits is ISO 10646/Unicode, which requires 21
|> > bits. (But of course, the standard doesn't require wchar_t -- or
|> > anything else, for
|> There's a 16 bit variable of it. While I don't use either version of
|> it in any of my own programs, from what I'd heard here and on
|> comp.std.c, I'd gotten the impression that the 16 bit variant was
|> more widely used than the 32 bit version. Yours is the first mention
|> I've ever seen of a 21 bit version - or are you specifying the
|> number of bits actually used by the 32 bit version?
The code set occupies values in the range 0...0x10FFFF. That requires
21 bits.
The standard specifies several ways to represent the code set. The most
natural way (the only one which doesn't involve multi-something
encodings) uses 32 bit values, with the code on the lower bits, and the
upper bits 0. There are also variants with 8 and 16 bits; these do NOT
offer the full 32 bits.
Of the machines I've seen (and can remember), wchar_t is most often 32
bits. In fact, the only exception seems to be Windows; all of the
Unixes I can remember (Linux, Solaris, AIX -- and I think HP/UX, but my
memory is a bit weak there) have 32 bits.
--
James Kanze mailto:ka...@gabi-soft.de
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)179 2607481
---
"James Kuyper Jr." wrote:
>
>
> Yes, but under the C++ standard, that simply means that a "byte" will
> become 16 or 32 bits, respectively. That's what the CHAR_BIT macro is
> for.
>
The problem is that this just isn't practical in most cases. Remember that
char plays double duty both as the native character type and the fundamental
memory unit. Yes, you could have 16 bit chars, but you lose the ability
to address 8 bit sized memory for all practical purposes. You're more
or less doomed (as Windows NT does) to use wchar_t's. The only sad thing
is that C++ doesn't define wchar_t interfaces to everything.
| Gabriel Dos Reis wrote:
| >
| > "James Kuyper Jr." <kuy...@wizard.net> writes:
| ....
| > Now, look at the table 37 (Traits requirements)
| >
| > X::eof() yields: a value e such that X::eq_int_type(e,X::to_int_type(c))
| > is false for all values c.
| >
| > | I don't see how that requirement would be violated by an implementation
| > | which had CHAR_MIN==INT_MIN, and CHAR_MAX==INT_MAX.
| >
| > See above.
|
| In itself, that would merely mean that 'int' can't be the int_type for
| 'char'. The clincher is that 21.1.3.1 explicitly specifies that int_type
| for char_traits<char> must be 'int'. Therefore, I concede your point.
|
| Someone was keeping a list of C/C++ differences - this should be added
| to that list; C makes no such guarantee.
Which "no such guarantee"?
-- Gaby
Well, the real "problem" here is also known as "word-tearing"
(and there is also a somewhat similar/related performance problem
of "false-sharing").
There was even comp.std.c thread on this in the past:
http://groups.google.com/groups?threadm=3B54AB12.7F555834%40dvv.org
(with GRANULARIZE(X) macros, etc ;-))
Personally, I just love this topic! ;-) My view/opinion on this:
http://groups.google.com/groups?as_umsgid=3C3F0C77.CFF9CADC%40web.de
http://groups.google.com/groups?as_umsgid=3C428BC0.1D5F2D90%40web.de
> But isn't this actually what type sig_atomic_t (sizeof >= 1) is
> intended for?
AFAICT, "Nope":
http://groups.google.com/groups?as_umsgid=3B02A7A4.C6FEDC23%40dvv.org
*static volatile sig_atomic_t* vars could only help/work for *single-
threaded* asynchrony w.r.t access to volatile sig_atomic_t STATIC
object(s) in the thread itself and its interrupt/async.signal
handler(s);
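A minimal sketch of the one use the C/C++ standards actually bless for it:
a flag shared between a program and its own signal handler, not between
threads (SIGINT and the self-raise are only for demonstration):
#include <csignal>
#include <cstdio>
static volatile std::sig_atomic_t got_signal = 0;
void handler(int)
{
    got_signal = 1;  // storing to a volatile sig_atomic_t is the portable thing to do here
}
int main()
{
    std::signal(SIGINT, handler);
    std::raise(SIGINT);           // deliver the signal to ourselves; the handler runs now
    if (got_signal)
        std::printf("flag was set by the handler\n");
    return 0;
}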
BTW, I've "collected" some C/C++ "volatile"/sig_atomic_t stuff
in the following article and the "Hardware port" thread:
http://groups.google.com/groups?as_umsgid=3CB1EE1D.5671E923%40web.de
http://groups.google.com/groups?threadm=a8hgtr%24euj%241%40news.hccnet.nl
Also, FYI w.r.t. C/C++ volatiles and threads (I mean "atomicity" and
"visibility", etc):
http://groups.google.com/groups?as_umsgid=L9JR7.478%24BK1.14104%40news.cpqcorp.net
And, finally, FYI on memory "granularity":
http://www.tru64unix.compaq.com/docs/base_doc/DOCUMENTATION/V51_HTML/ARH9RBTE/DOCU0007.HTM#gran_sec
regards,
alexander.
> > 4.1. FUNDAMENTAL DATA TYPES
> > The fundamental data types of the IA-32 architecture are bytes,
> > words, doublewords, quadwords, and double quadwords (see Figure
> > 4-1). A byte is eight bits, a word is 2 bytes (16 bits), a
> > doubleword is 4 bytes (32 bits), a quadword is 8 bytes (64 bits),
> > and a double quadword is 16 bytes (128 bits). "
> Not everyone uses the term correctly, not even (apparently) Intel.
What is wrong with Intel's usage? If a byte means "an 8-bit quantity",
then they're right. If a byte means "the smallest addressable unit of
storage on a particular architecture", then they are still right. What
definition of "byte" makes Intel's usage incorrect?
DS
8-bits <= 1 C++ byte <= 'natural-word'-bits
(where # of bits in a "natural-word" and actual # of bits used for the
"byte" are platform-specific)
("C++ byte" not necessarily equivalent to platform-specific "byte")
It could theoretically be 8, 9, 10, 11, 12, ... 16, ... 32, ... 64, or even
maybe 128 bits on some current graphics processors (guessing), or anything
in between too (theoretically). It even makes sense: there are some machines
(DSPs) where "char" (as in character, as in human-readable text) is not a
very heavily used concept vs efficient 32-bit numerical processing, so they
just define 'char' (1 byte!) to refer to the full 32-bits of machine storage
for efficiency (otherwise they'd probably have to do all sorts of bit
masking arithmetic).
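A trivial way to see what your own implementation chose (a sketch; the output
is of course platform-specific):

    #include <climits>
    #include <iostream>

    int main()
    {
        // CHAR_BIT is the number of bits in one C++ byte, i.e. in one char;
        // the standard only promises that it is at least 8.
        std::cout << "bits per byte:   " << CHAR_BIT << '\n';
        std::cout << "sizeof(char):    " << sizeof(char) << " (always 1)\n";
        std::cout << "sizeof(wchar_t): " << sizeof(wchar_t) << '\n';
        return 0;
    }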
"Mike Wahler" <mkwa...@ix.netcom.com> wrote in message
news:a9cvlo$lj5$1...@slb2.atl.mindspring.net...
> It does. Smallest addressable storage unit.
> The system you describe above's 1-bit addressable
> unit does not meet the requirement of at least
> eight bits. So from C++, smallest addressable
> unit for that machine is whichever larger unit
> with at least eight bits is addressable. Perhaps
> a 'word'.
>
> > or simply not used it at all.
>
> There has to be defined some point of reference.
> It's a byte.
>
>
> > Surely today, and as surely in 1998, most readers think "8 bits" when they
> > see the word "byte".
>
Well not anymore! 1 C++ Byte >= 8-bits! :-)
hys
--
Hillel Y. Sims
hsims AT factset.com
---
Sorry, that got garbled due to lack of sleep. I meant "16 bit version";
"16 bit encoding" would have been even better, but I didn't even think
of that wording.
> > in any of my own programs, from what I'd heard here and on comp.std.c,
> > I'd gotten the impression that the 16 bit variant was more widely used
> > than the 32 bit version. Yours is the first mention I've ever seen of a
> > 21 bit version - or are you specifying the number of bits actually used
> > by the 32 bit version?
> >
>
> Unicode 2.0 had 40-some-odd thousand characters, so a 16-bit variable
> could hold all possible values. Unicode 3.0 has over 90,000 characters,
> so a 16-bit variable doesn't work. There's a UTF-16 encoding, but that's
> analogous to a multi-byte character string in C and C++: a pain to work
> with. Java apologists will tell you that this is no big deal, because
> the characters that require two 16-bit values are rarely used.
I suspect they're correct. A fairly large portion of the C++ world can
get away with 8-bit characters, an even larger portion will never need
to go beyond 16-bits. Of course, the people who need the larger
characters, need them all the time.
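To make the "pain" concrete, here is a rough sketch (not part of any library)
of how a code point above 0xFFFF has to be split into a UTF-16 surrogate pair:

    #include <iostream>

    void print_utf16(unsigned long cp)   // cp is assumed to be a valid code point
    {
        std::cout << std::hex;
        if (cp < 0x10000) {
            std::cout << cp << '\n';                        // one 16-bit unit
        } else {
            unsigned long v = cp - 0x10000;                 // 20 significant bits
            std::cout << (0xD800 + (v >> 10)) << ' '        // high surrogate
                      << (0xDC00 + (v & 0x3FF)) << '\n';    // low surrogate
        }
    }

A 16-bit "character" type therefore only indexes code units, not characters,
which is the multi-byte-string problem all over again.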
| Gabriel Dos Reis <dos...@cmla.ens-cachan.fr> writes:
|
| |> James Kanze <ka...@gabi-soft.de> writes:
|
| |> | Jack Klein <jack...@spamcop.net> writes:
|
| |> | |> There are now C++ compilers for 32 bit digital signal
| |> | |> processors where char, short, int and long are all 1 byte and
| |> | |> share the same representation. Each of those bytes contains
| |> | |> 32 bits.
|
| |> | A slightly different issue, but I believe that most, if not all of
| |> | these are freestanding implementations. There is some question
| |> | whether int and char can be the same size on a hosted
| |> | implementation, since functions like fgetc (inherited from C) must
| |> | return a value in the range [0...UCHAR_MAX] or EOF which must be
| |> | negative.
|
| |> Or you may just look at the requirements imposed by the standard
| |> std::string class in clause 21.
|
| The advantage of basing the argument on fgetc is that it becomes a C
| problem as well, and not something specific to C++.
Yeah, a clever way of getting rid of the problem ;-)
By the wording concerning some functions in <ctype.h>, I gather that
EOF cannot be a valid 'unsigned char' converted to int.
| |> A conforming hosted implementation cannot have
|
| |> values_set(char) == values_set(int)
|
| |> because every bit in a char representation participates in a value
| |> representation, i.e. all bits in a char are meaningful.
|
| Where does it say this. (Section 21 is large.)
Table 37 says that char_traits<char>::eof() -- identical to EOF --
should return a value not equal to any char value (converted to int).
| What are the implications for an implementation which wants to support
| ISO 10646 on a 32 bit machine? The smallest type it can declare which
| supports ISO 10646 is 32 bits.
Then it must make sure values-set(char) is a strict subset of
values-set(int) (for example having a 64-bit int). Or it doesn't ;-)
--
Gabriel Dos Reis, dos...@cmla.ens-cachan.fr
---
See 1.7p1.
| James Kanze wrote:
| >
| > Jack Klein <jack...@spamcop.net> writes:
| >
| > |> There are now C++ compilers for 32 bit digital signal processors
| > |> where char, short, int and long are all 1 byte and share the same
| > |> representation. Each of those bytes contains 32 bits.
| >
| > A slightly different issue, but I believe that most, if not all of these
| > are freestanding implementations. There is some question whether int
| > and char can be the same size on a hosted implementation, since
| > functions like fgetc (inherited from C) must return a value in the range
| > [0...UCHAR_MAX] or EOF which must be negative.
|
| However, since there are other ways of detecting file errors and end of
| file, than checking for EOF, that doesn't absolutely require that EOF be
| outside the range of char values.
Huh?!? The C++ standard requires that all bits in a char participate
in a char value representation. And EOF is not a character.
--
Gabriel Dos Reis, dos...@cmla.ens-cachan.fr
---
> Clause 21 is very large and complicated; it would help if you could be
> more specific about what you're referring to.
He probably means 21.1.2 [lib.char.traits.typedefs], which states that
character traits must have a type or class (int_type) that can represent
all the valid characters converted from the character type, plus an
end-of-file value. It does not state, however, that this type must be
"int".
--
Ray Lischner, author of C++ in a Nutshell (forthcoming, Q4 2002)
http://www.tempest-sw.com/cpp/
If you need 32-bit characters and the language you're using doesn't
support them you've got a problem, regardless of the rationalizing that
language designers may engage in.
--
Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)
---
| Gabriel Dos Reis wrote:
| >
| > "James Kuyper Jr." <kuy...@wizard.net> writes:
| >
| > | Carl Daniel wrote:
| > | ....
| > | > sizeof(char) is guaranteed to be 1. 1 what though? 1 memory allocation
| > | > unit. All other types must have sizes which are multiples of sizeof(char).
| > | > The standard makes no claim that 1 memory allocation unit == 1 byte. On a
| > |
| > | Section 5.3.3: "The sizeof operator yields the number of bytes in the
| > | object representation of its operand."
| >
| > Exact. The question is what you think "byte" means in the C++
| > standards text.
|
| See 1.7p1.
I know that paragraph very well. Thanks. But from your assertions,
it wasn't clear that you knew of that paragraph.
--
Gabriel Dos Reis, dos...@cmla.ens-cachan.fr
---
What is the correct use? In other contexts a word is the minimum
addressable unit; then a word on all the x86 family would be an octet.
> consider an ISO specification more authoritative than a company
> specification any day (though it could still be wrong).
ISO has no authority to define the universal meaning of a word. No more
than Intel; that is, they can only define the meaning it has in their own documents.
Regards.
| On Monday 15 April 2002 09:14 pm, James Kuyper Jr. wrote:
|
| > Clause 21 is very large and complicated; it would help if you could be
| > more specific about what you're referring to.
|
| He probably means 21.1.2 [lib.char.traits.typedefs], which states that
| character traits must have a type or class (int_type) that can represent
| all the valid characters converted from the character type, plus an
| end-of-file value. It does not state, however, that this type must be
| "int".
Thanks.
I would add that the standard says that when char_type is char then
int_type must be int.
--
Gabriel Dos Reis, dos...@cmla.ens-cachan.fr
---
>Alexander Terekhov wrote:
>....
>> "IA-32 Intel® Architecture
>> Software Developer's
>> Manual
>> Volume 1:
>> Basic Architecture
>> ....
>> 4.1. FUNDAMENTAL DATA TYPES
>>
>> The fundamental data types of the IA-32 architecture are bytes,
>> words, doublewords, quadwords, and double quadwords (see Figure
>> 4-1). A byte is eight bits, a word is 2 bytes (16 bits), a
>> doubleword is 4 bytes (32 bits), a quadword is 8 bytes (64 bits),
>> and a double quadword is 16 bytes (128 bits). "
>
>Not everyone uses the term correctly, not even (apparently) Intel. I'll
>consider an ISO specification more authoritative than a company
>specification any day (though it could still be wrong).
But if such a document doesn't exist, who uses the term "correctly"?
There are many meanings of the same term, and each one is "correct" as
long as it is consistent. A "byte" can be an IA-32 data type, an IDL
type, a C/C++ storage unit, and many other things (yes, different
things, not only different sizes, since I can't identify a data type
with a storage unit)
The above quote would be wrong if it pretended to be a general
definition, but it's OK if (as I believe) it is intended to mean
"Within this specification, the term byte refers to...". In this
respect it's no way different from what the C++ standard does.
Note that I'm not saying that this de facto overloading of the term
(as well as of the terms "word", "dword" and others) isn't annoying.
It is! :)
Genny.
Practicality is an issue for the implementation to worry about. As long
as the standard allows each implementation enough freedom to choose
practical values for those type sizes, it's done its job. If one
implementor decides that the most practical thing for their market is a
16-bit char, that's permitted. If another decides that the most
practical thing for their market is an 8-bit char and a 16-bit wchar_t,
that's permitted. The two implementations might be targeting different
markets, or one of them might be mistaken, but the C++ standard has been
designed to let each of them be conforming, and code that needs to be
portable will be designed to work correctly in either case (which can be
a highly non-trivial exercise in many cases).
The definition I'm familiar with can be paraphrased by saying that if
it's correctly described as a 32-bit machine, then the word size is 32
bits.
> > consider an ISO specification more authoritative than a company
> > specification any day (though it could still be wrong).
>
> ISO has no authority to define the universal meaning of a word. No more
No one has that authority. However, ISO does have the authority to
define the usage within ISO documents, and the usage by anyone who cares
about ISO standards. Which includes me.
I'm not sure why. Could you be a little less laconic, and explain what
your point is?
I'm not very familiar with Java; I got the impression from what you said
earlier that they did support the larger range of characters, they just
supported them inconveniently, using a multi-byte encoding.
|> James Kanze <ka...@gabi-soft.de> writes:
|> | Gabriel Dos Reis <dos...@cmla.ens-cachan.fr> writes:
|> | |> James Kanze <ka...@gabi-soft.de> writes:
|> | |> | Jack Klein <jack...@spamcop.net> writes:
|> | |> | |> There are now C++ compilers for 32 bit digital signal
|> | |> | |> processors where char, short, int and long are all 1
|> | |> | |> byte and share the same representation. Each of those
|> | |> | |> bytes contains 32 bits.
|> | |> | A slightly different issue, but I believe that most, if not
|> | |> | all of these are freestanding implementations. There is
|> | |> | some question whether int and char can be the same size on a
|> | |> | hosted implementation, since functions like fgetc (inherited
|> | |> | from C) must return a value in the range [0...UCHAR_MAX] or EOF
|> | |> | which must be negative.
|> | |> Or you may just look at the requirements imposed by the
|> | |> standard std::string class in clause 21.
|> | The advantage of basing the argument on fgetc is that it becomes a
|> | C problem as well, and not something specific to C++.
|> Yeah, a clever way of getting rid of the problem ;-)
I almost added something to the effect of letting the C committee do the
work:-).
|> By the wording concerning some functions in <ctype.h>, I gather that
|> EOF cannot be a valid 'unsigned char' converted to int.
I'm not sure. The wording says that the functions must work for all
values in the range 0...UCHAR_MAX and EOF. *IF* one of the values in
the range 0...UCHAR_MAX results in EOF when converted to int, I don't
think that it is a problem as long as that value isn't alpha, numeric,
etc., i.e. as long as all functions return 0.
If we suppose that char has 32 bits, and uses ISO 10646, this isn't a
problem, since all of the values greater than 0x10FFFF are invalid
characters, and should return false. (EOF must be negative, which would
mean an unsigned char value of 0x80000000 or greater. Supposing typical
implementations.)
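The usual idiom behind that requirement, for reference (a sketch):

    #include <cctype>

    // The <cctype> functions accept either EOF or a value representable as
    // unsigned char, hence the conventional cast when starting from plain char.
    bool is_letter(char c)
    {
        return std::isalpha(static_cast<unsigned char>(c)) != 0;
    }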
|> | |> A conforming hosted implementation cannot have
|> | |> values_set(char) == values_set(int)
|> | |> because every bit in a char representation participates in a
|> | |> value representation, i.e. all bits in a char are meaningful.
|> | Where does it say this. (Section 21 is large.)
|> Table 37 says that char_traits<char>::eof() -- identical to EOF --
|> should return a value not equal to any char value (converted to
|> int).
Not quite. Table 37 says that char_traits<charT>::eof() must return a
value e for which eq_int_type(e,to_int_type(c)) is false for all c.
For 32 bit ISO 10646, I use char_type == int_type == unsigned int (on a
32 bit machine), with:
int_type to_int_type( char_type c )
{
    return c < 0x110000 ? c : 0 ;
}
I'm not sure, but I believe that this is legal. (At any rate, it seems
the most useful solution.)
|> | What are the implications for an implementation which wants to
|> | support ISO 10646 on a 32 bit machine? The smallest type it can
|> | declare which supports ISO 10646 is 32 bits.
|> Then it must make sure values-set(char) is a strict subset of
|> values-set(int) (for example having a 64-bit int). Or it doesn't
|> ;-)
That is the crux of my question. On the two machines I use (a 32 bit
Sparc under Solaris 2.7 and a PC under Linux), wchar_t is a 32 bit
quantity, there are no integral data types larger than 32 bits. I don't
want wchar_t to be any smaller, since it must be at least 21 bits for
ISO 10646. This means that *if* int_type must be larger than char_type,
I have to define a class type for it. But in practice, I don't need for
it to be larger, since in fact, all legal characters are in the range
0...0x10FFFF.
--
James Kanze mailto:ka...@gabi-soft.de
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)179 2607481
---
|> > Not everyone uses the term correctly, not even (apparently)
|> > Intel. I'll
|> What is the correct use? In other contexts a word is the minimum
|> addressable unit; then a word on all the x86 family would be an
|> octet.
A word is normally the bus width in the ALU; it is often larger than the
minimum addressable unit. The correct word for the minimal addressable
unit is byte, although this is normally only used if this unit is
smaller than a word (as it is on most modern processors).
--
James Kanze mailto:ka...@gabi-soft.de
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)179 2607481
---
|> Huh?!? The C++ standard requires that all bits in a char
|> participate in a char value representation. And EOF is not a
|> character.
However, as far as I can see, it doesn't place any constraints with
regards as to what a character can be (except that the characters in the
basic character set must have positive values, even if char is signed).
The requirement that EOF not be a character doesn't mean that it cannot
be a legal char value.
--
James Kanze mailto:ka...@gabi-soft.de
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)179 2607481
---
As does every programming language, I suppose. The point of wide
characters is to not have to deal with multi-byte encodings. The Java
libraries have a bunch of code that assumes that a single character is
not part of a multi-character sequence.
--
Pete Becker
Dinkumware, Ltd. (http://www.dinkumware.com)
---
No need for such a complex description.
Various platforms implement a 'byte' with the number
of bits deemed most 'appropriate'.
A C or C++ program required the host platform to provide
(or perhaps 'emulate') a byte with at least eight bits.
The C and C++ data type representing this 'byte' is
type 'char'. So regardless of the number of bits therein,
sizeof(char) == one byte. Period. Forever and ever, amen. :-)
-Mike
> > Not everyone uses the term correctly, not even (apparently) Intel. I'll
>
> What is the correct use? In other contexts a word is the minimum
> addressable unit; then a word on all the x86 family would be an octet.
>
> > consider an ISO specification more authoritative than a company
> > specification any day (though it could still be wrong).
>
> ISO has no authority to define the universal meaning of a word.
> No more than Intel; that is, they can only define the meaning it has in
> their own documents.
I believe the term in question is "machine word".
This is not a data type any more than "byte" is a data type.
It is simply the width of the typical [integral] "data path"
in a given computer architecture --
the width of a data bus, register, ALU, etc.
The meaning of the term "word" is typically "overloaded"
by computer architects to describe data types which may be interpreted
as characters, integers, fixed or floating point numbers, etc.
These type definitions are only meaningful
within the context of a particular computer architecture.
Intel's word, doubleword, quadword and double quadword types
are all based upon the original 8086 architecture's 16 bit machine word
and have remained fixed as the actual machine word size increased
to 32 then 64 bits with the introduction of new architectures
in the same family.
> |> A machine word is as wide as the integer data path throught the
> |> Arithmetic and Logic Unit (ALU).
>
> Or as wide as the memory bus?
>
> I'm not sure that there is a real definition of "word".
I think it's one of those terms... ;)
The Sony PlayStation2 has 128-bit registers and has 128-bit
busses to the memory and other systems, but then the instruction
set typically works on the low 64 bits of those registers, even
though "int" is 32-bits for some bizzaro reason, and all of the
docs call the 128-bit values "quadwords".
-tom!
The guarantee that you cited from table 37, which (translated into C
terms) says EOF!=(int)c for all values of c.
|> James Kanze <ka...@gabi-soft.de> writes:
|> [...]
|> | The open issue is, I think, whether fgetc is required to be able
|> | to return *all* values in the range of 0...UCHAR_MAX. For actual
|> | characters, this is not a problem -- if we have 32 bit char's, it
|> | is certain that some of the values will not be used as a
|> | character.
|> That won't be conforming, since the standard says that any bit
|> pattern represent a valid value.
But it doesn't require that fgetc actually be able to return all valid
values.
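(For reference, this is the classic idiom the whole fgetc/EOF argument is
about -- the result has to be kept in an int so that EOF remains
distinguishable from character values:)

    #include <cstdio>

    int main()
    {
        int ch;                                  // int, not char
        while ((ch = std::getchar()) != EOF)
            std::putchar(ch);
        return 0;
    }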
--
James Kanze mailto:ka...@gabi-soft.de
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)179 2607481
---
|> "James Kuyper Jr." <kuy...@wizard.net> writes:
|> | Gabriel Dos Reis wrote:
|> | ....
|> | > Or you may just look at the requirements imposed by the standard
|> | > std::string class in clause 21.
|> | Clause 21 is very large and complicated;
|> Not that complicated;
I'll admit that there are worse. I was just looking for an excuse for
my laziness.
|> it suffices to look at the first two pages.
|> 21.1.2/2
|> For a certain character container type char_type, a related
|> container type INT_T shall be a type or class which can represent
|> all of the valid characters converted from the corresponding
|> char_type values, as well as an end-of-file value, eof(). The type
|> int_type represents a character container type which can hold
|> end-of-file to be used as a return type of the iostream class member
|> functions.
OK. Consider the case of 32 bit char, int and long, using ISO 10646 as
a code set. And read the text *very* carefully. It doesn't say that
INT_T must be able to represent all valid values which can be put in a
char_type. It says that it must be able to represent all valid
*characters* -- in this case, all values in the range 0...0x10FFFF --
plus a singular value for eof (say, 0xFFFFFFFF).
Other constraints mean that such an implementation would have to use
some somewhat particular definitions for some of the other functions,
but I think that such an implementation would be legal. I would feel
better about it if it were more clearly stated somewhere that
"character" doesn't necessarily mean all possible values that can be
stored in a "char_type", but if this isn't what is meant, why use the
word character?
|> The case of interest is when char_type == char and int_type == int.
|> Now, look at the table 37 (Traits requirements)
|> X::eof() yields: a value e such that X::eq_int_type(e,X::to_int_type(c))
|> is false for all values c.
|> (by 21.1.1/1, c is of type char).
X::eof() yields 0xFFFFFFFF.
X::to_int_type( char_type c ) is constrained to always yield a value
less than 0x110000, e.g.:
int_type to_int_type( char_type c )
{
    return c > 0 && c < 0x110000 ? c : 0 ;
}
X::eq_int_type simply uses ==.
Where is the error in this implementation?
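Pulling the pieces together, the traits being defended would look roughly
like this (a sketch only; the struct name is made up, and a 32 bit unsigned
int is assumed):

    struct ucs4_traits
    {
        typedef unsigned int char_type;     // 32 bit ISO 10646 code points
        typedef unsigned int int_type;      // same size, by design

        static int_type to_int_type( char_type c )
        {
            return c < 0x110000 ? c : 0 ;   // every valid character maps to itself
        }
        static bool eq_int_type( int_type a, int_type b )
        {
            return a == b ;
        }
        static int_type eof()
        {
            return 0xFFFFFFFF ;             // never equal to to_int_type(c) for any c
        }
    };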
|> The standard also says that any bit pattern for char represents a
|> valid char value, therefore eof() can't be in the values-set of
|> char.
Table 37 talks of characters, not valid char values. Not all valid char
values need be valid characters.
> The word "byte" has never meant eight bits. Historically...
Ironically, words mean whatever they get used to mean, and as
long as a word has a definition that is understood by the
involved parties, that definition is valid regardless of what is
"official".
Indeed, dictionaries are developed to track the changes in our
language. "Irregardless", for instance, doesn't make any
etymological sense, and yet it is used so often and always has
the same definition every time it's used that it has made its way
into many dictionaries.
Popular usage of the word "byte" does mean "eight bits" or
"octet", regardless of what ISO says and regardless of what IBM
once did 40 or 50 years ago.
Merriam-Webster currently defines a byte to be "a group of eight
binary digits...", and since dictionaries get definitions from
popular usage, we can assume that this definition is what most
people use as their definition of "byte". This does not mean
that the ISO is wrong, of course, it just means that they are
defining byte to be something other than the popular usage.
As an example, a "nice" girl referred to a prostitute in
Victorian England. The meaning of "nice" has morphed over the
years; the only thing defining it was popular usage and
understanding of what the word meant.
> The fact that machines with bytes of other than 8 bits have
> become rare doesn't negate the fact that when you do talk of
> them, the word "byte" doesn't mean 8 bits. And the distinction
> is still relevant. -- look at any of the RFC's, for example, and
> you'll find that when 8 bits is important, the word used is
> octet, and not byte.
Yes; the distinction is still relevant in that they need to
define these words to something other than the popular
definition. This doesn't make the standards and RFCs wrong, just
anachronistic. ;)
-tom!
"the term" I was referring to was "word".
> What is wrong with Intel's usage? If a byte means "an 8-bit quantity",
> then they're right. If a byte means "the smallest addressable unit of
> storage on a particular architecture", then they are still right. What
> definition of "byte" makes Intel's usage incorrect?
If it's properly described as a 32-bit architecture, then "word" should
indicate a 32-bit unit of memory.
|> James Kanze wrote:
|> > The word "byte" has never meant eight bits. Historically...
|> Ironically, words mean whatever they get used to mean, and as long
|> as a word has a definition that is understood by the involved
|> parties, that definition is valid regardless of what is "official".
True, but words are used within distinct communities. Here, we are
talking of a specialized technical community; how the man on the street
uses the word (or if he has even heard of it) is irrelevant: when we use
the word stack, or loop, in this forum, it generally also has a meaning
quite different from that used by the man on the street.
[...]
|> Popular usage of the word "byte" does mean "eight bits" or "octet",
|> regardless of what ISO says and regardless of what IBM once did 40
|> or 50 years ago.
I'm not sure that there is a popular usage of the word "byte". If so,
it is very recent, and probably is 8 bits. But that is separate from
the technical usage, just as the use of stack or loop with regards to
programming is different from other uses.
|> Merriam-Webster currently defines a byte to be "a group of eight
|> binary digits...", and since dictionaries get definitions from
|> popular usage, we can assume that this definition is what most
|> people use as their definition of "byte". This does not mean that
|> the ISO is wrong, of course, it just means that they are defining
|> byte to be something other than the popular usage.
And that Merriam-Webster is giving a general definition, and not a
technical one. IMHO, if they don't mention its use with a meaning
other than 8 bits, they are wrong; the two uses are related, and
presenting one without the other is highly misleading, since the
definition they do give "sounds" technical. They might, of course,
label my usage as "technical", or give some other indication that it is
not the everyday usage.
With regards to the technical meaning, it is significant to note that
technical documents in which the unit must be 8 bits (descriptions of
network protocols, etc.) do NOT use the word byte, but octet.
|> As an example, a "nice" girl referred to a prostitute in Victorian
|> England. The meaning of "nice" has morphed over the years; the only
|> thing defining it was popular usage and understanding of what the
|> word meant.
A good dictionary will still give this meaning, indicating, of course,
that it is archaic.
I would agree that we are in a situation where the word byte is changing
meaning, and 50 years from now, it probably will mean 8 bits. For the
moment, even if many people assume 8 bits, the word is still
occasionally used for other sizes, and still retains to some degree its
older meaning. (This is, of course, *why* it isn't used in protocol
descriptions.)
|> > The fact that machines with bytes of other than 8 bits have become
|> > rare doesn't negate the fact that when you do talk of them, the
|> > word "byte" doesn't mean 8 bits. And the distinction is still
|> > relevant. -- look at any of the RFC's, for example, and you'll
|> > find that when 8 bits is important, the word used is octet, and
|> > not byte.
|> Yes; the distinction is still relevant in that they need to define
|> these words to something other than the popular definition. This
|> doesn't make the standards and RFCs wrong, just anachronistic. ;)
Not even anachronistic. Just more precise and more technical than
everyday usage.
In the case of the C/C++, the use is a bit special, even with regards to
the older meaning. I'd actually favor a different word here, but I
don't have any suggestions.
And what about the use in the library section, where there is a question
of multi-byte characters -- I've never heard anyone use anything else
but "multi-byte characters" when referring to the combining codes in 16
bit Unicode, for example. So at least in this compound word, byte has
retained a more general meaning.
In the case of the RFC's and the various standards for the OSI
protocols, I see no reason to switch from "octet" to "byte". The word
"octet" is well established, and is precise, and makes it 100% clear
that exactly 8 bits are involved. Even if "byte" is generally
understood to be 8 bits, why choose the less precise word?
--
James Kanze mailto:ka...@gabi-soft.de
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)179 2607481
---
Please allow me to summarize where we stand thus far:
byte#1: one opinion is that when C++98 and C standards refer to "byte", they
are referring to a strictly 8-bit byte
byte#2: one opinion is that when C++98 and C standards refer to "byte", they
are referring to an implementation defined unit somehow related to representing
a minimum-sized char-type on that processor architecture. Some variants of this
opinion would permit a byte (i.e., char) of 6-bits to 15-bits. Other variants
of this opinion would permit a byte (i.e., char) of nearly any size: 16-bits,
32-bits, 64-bits, 128-bits, ad infinitum.
Obviously the C++98 standard can be read in two drastically different and
substantially incongruent ways: byte#1 and byte#2. The C++98 standard does not
explicitly define the term "byte" nor does it normatively reference a standard
which itself in turn explicitly defines the term "byte".
DEFECT: C++98's ambiguous use of the term "byte" without providing an
explicit definition which selects exactly one of the alternative definitions of
"byte" is itself a fundamental defect from which a series of troublesome
alternative interpretations (and thus troublesome alternative compiler
implementations) may flow.
I see at least two ways of resolving this:
resolution#1: Omit any & all mention of the word "byte". In C++0x and in any
C++98 corrigenda, strictly use only the word "octet" instead of C++98's "byte".
resolution#2: Explicitly pick one of the two alternative definitions of
"byte": byte#1 or byte#2. Explicitly definite "byte" in C++0x and in any C++98
corrigenda.
Obviously some people who staunchly subscribe to byte#2 would consider
resolution#1 as moving away from their position. Likewise, if byte#1 were to be
chosen as part of implementing resolution#2, some of those people who staunchly
subscribe to byte#2 would consider resolution#2=byte#1 as moving away from their
position.
Because of this thread's volume of seemingly-endless debate about what the
word "byte" is, I expect to see this defect added to the official C++98 defect
list. I expect to see this defect resolved in a C++98 corrigendum which then is
folded into C++0x. If C++98 meant for "byte" (char) to be an 8-bit byte =
octet, then explicitly define "byte" with such strictness. If C++98 meant for
"byte" (char) to be 6-bits to 15-bits, up to 16-bits, up to 32-bits, up to
64-bits, up to 128-bits, or so forth, then explicitly define "byte" with such
rich semantics.
Note that some byte#2-oriented postings on this thread have been tantamount
to redefining/hijacking C's/C++'s historically 8(ish)-bit char to be
UTF16/UTF32/UCS2/UCS4-capable for non-UTF8 Unicode. Character encoding schemes
composed of value-sets whose size is greater than 255 graphemes (e.g.,
Unicode, ISO/IEC 10646) are the purpose for which wchar_t has always been intended.
Or equivalently, the "supreme court" of C++ needs to normatively decide how
the C++98 "constitution" is to be interpreted regarding how "byte" is to
interpreted regarding char and sizeof(char).
---
IMO that's worth a defect report; any other opinions?
3.9p4:
-----------
The object representation of an object of type T is the sequence of
N unsigned char objects taken up by the object of type T, where N
equals sizeof(T). The value representation of an object is the set
of bits that hold the value of type T. For POD types, the value
representation is a set of bits in the object representation that
determines a value, which is one discrete element of an
implementation-defined set of values.37)
37) The intent is that the memory model of C++ is compatible with
that of ISO/IEC 9899 Programming Language C.
-----------
.... so this definition of "object representation of an object of type T"
relies on "sizeof(T)".
5.3.3
-----------
The sizeof operator yields the number of bytes in the object representation
of its operand. .....
-----------
.... and "sizeof(T)" seems to rely on T's "object representation"
Daniel Miller wrote:
>
> Note that some byte#2-oriented postings on this thread have been tantamount
> to redefining/hijacking C's/C++'s historically 8(ish)-bit char to be
> UTF16/UTF32/UCS2/UCS4-capable for non-UTF8 Unicode. Character encoding schemes
> composed of value-sets whose size is greater than 255 graphemes (e.g.,
> Unicode, ISO/IEC 10646) are the purpose for which wchar_t has always been intended.
You would do better to avoid making inflammatory statements like "hijack" in
proposals that you are trying to push on people.
Frankly, if char's were just designed to hold characters, then allowing them
to be something larger if you were on a native, let's say, UTF16 machine
would be reasonable.
However, char plays double duty as the "minimal addressable memory unit."
As much as one would wish to redefine char to a larger size, one can not
do so without losing the ability to address something smaller.
If you want to fix the terminology to allow the same latitude currently
allowed (16 bit chars, let's say), that's fine. If you want to somehow
restrict chars to 8 bits you have two problems:
1. You then need to fix the fact that wchar_t is not fully supported in C++.
2. You are still deciding that a certain class of machines that have had C/C++
compilers implemented for them are no longer allowed a conforming implementation
because of the infeasibility of exactly 8 bit char size on them.
Then Win32 is not correctly described as a 32-bit machine?
> > ISO has no authority to define the universal meaning of a word. No more
> No one has that authority. However, ISO does have the authority to
> define the usage within ISO documents, and the usage by anyone who cares
> about ISO standards. Which includes me.
In the context of this newsgroup the relevant standard does not define
WORD.
Regards.
Keep in mind that people can only sustain that opinion by ignoring the
explicit definitions provided in those standards.
> byte#2: one opinion is that when C++98 and C standards refer to "byte", they
> are referring to an implementation defined unit somehow related to representing
> a minimum-sized char-type on that processor architecture. ...
Almost correct. It's the minimum-sized addressable unit, and it must be
able to hold every element of the basic execution character set.
However, it needn't be a character type as far as the architecture is
concerned, and it can't be the minimum-sized char-type on that
architecture, if the minimum-sized char-type is less than 8 bits.
For instance, there have been machines with a word size of 36 bits, and
configurable byte sizes; the byte size could be set as low as 5 bits,
allowing 7 bytes per word. That mode allowed only for capital letters
and punctuation - there was no room even for digits, much less lower
case. By your description of byte#2, a C++ implementation on such a
machine would be required to use it in the 5-bit mode. However, what the
standard actually requires is that char must be able to hold at least 96
different values, and that unsigned char have a range which implicitly
requires at least 8 bits per byte. I've never heard anyone indicate
whether there was ever a C implementation for that machine, and there
almost certainly was not a C++ implementation. However, there could
have been. For such an implementation, the mode which put 4 9-bit bytes
in a word would have been the most logical configuration.
> ... Some variants of this
> opinion would permit a byte (i.e., char) of 6-bits to 15-bits. ...
6 or 7 bit bytes would violate requirements specifically laid out by
the standards. Note: the 8-bit limit is not explicit; it's derived from
the requirements on UCHAR_MAX. Those requirements are not explicitly
part of the C++ standard, but are instead incorporated by reference from
the C standard.
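A C++98-flavoured restatement of those derived limits, purely for
illustration:

    #include <climits>

    #if CHAR_BIT < 8
    #error "not a conforming implementation: a byte must have at least 8 bits"
    #endif
    #if UCHAR_MAX < 255
    #error "not a conforming implementation: unsigned char must cover 0..255"
    #endif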
> ... Other variants
> of this opinion would permit a byte (i.e., char) of nearly any size: 16-bits,
> 32-bits, 64-bits, 128-bits, ad infinitum.
Yes; the C and C++ standard explicitly allow for an unspecified, and
therefore arbitrarily large, number of bits per byte.
> Obviously the C++98 standard can be read in two drastically different and
> substantially incongruent ways: byte#1 and byte#2.
No - it cannot be read to match byte#1; the people who support that
point of view have failed to read the relevant clauses. Modulo the
corrections I've given above, byte#2 is pretty much exactly what the
standard actually says.
What's at issue here is not whether the standards mean what they
explicitly say about what a byte is; what's at issue is whether they
should say something different, or use different terminology to say it.
> ... The C++98 standard does not
> explicitly define the term "byte" nor does it normatively reference a standard
> which itself in turn explicitly defines the term "byte".
Completely false. See section 1.7p1 in the C++ standard. In particular,
pay close attention to the last part of the second sentence, which makes
the number of bits in a byte explicitly implementation-defined. See
section 3.6 of the C99 standard. In particular, pay close attention to
Note 2, in 3.6p3. The note is, of course, non-normative, but it
explicitly and correctly points out the absence of a size specification
for a byte in the normative section of the text, making it clear that
this absence was intentional. See section 5.2.4.2.1 of the C99 standard,
for the limits on the valid ranges of character types, which implicitly
require that a char be at least 8 bits. Pay particular attention to
paragraph 2 of that section.
....
> resolution#2: Explicitly pick one of the two alternative definitions of
> "byte": byte#1 or byte#2. Explicitly definite "byte" in C++0x and in any C++98
> corrigenda.
Already achieved, without any change to the standard.
....
> Note that some byte#2-oriented postings on this thread have been tantamount
> to redefining/hijacking C's/C++'s historically 8(ish)-bit char to be
Historically, for as long as there's been a C standard, it's explicitly
defined a byte in a way that allows for it to be larger than 8 bits. The
C++ standard merely continued that tradition.
The comment about sizeof(T) is a side issue; the comment is true, and a
useful thing to know, but does not play a part in defining what the
object representation is, nor how big it is. It might be better to make
that comment non-normative text, since it's redundant with 5.3.3, and as
currently written might give the mistaken impression that N is
determined by sizeof(), rather than simply being reported by it.
> 5.3.3
> -----------
> The sizeof operator yields the number of bytes in the object representation
> of its operand. .....
> -----------
> .... and "sizeof(T)" seems to rely on T's "object representation"
There's no circularity in the actual meaning. The size of a byte is
implementation-defined. The representation of an object takes up an
implementation-defined amount of memory space, which must be a positive
integral number of bytes. sizeof(object) reports how many bytes that is.
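A short sketch of that non-circular reading in practice (the struct and its
values are just for illustration; the output is implementation-specific):

    #include <iostream>

    struct Pod { int i; char c; };

    int main()
    {
        Pod p = { 0x12345678, 'x' };

        // The object representation of p is the sizeof(Pod) unsigned char
        // objects it occupies; sizeof merely reports how many there are.
        const unsigned char* bytes = reinterpret_cast<const unsigned char*>(&p);
        for (unsigned n = 0; n < sizeof p; ++n)
            std::cout << std::hex << static_cast<unsigned>(bytes[n]) << ' ';
        std::cout << '\n';
        return 0;
    }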
Or it's not correctly described as having a word size other than 32
bits. Take your choice.
> > > ISO has no authority to define the universal meaning of a word. No more
> > No one has that authority. However, ISO does have the authority to
> > define the usage within ISO documents, and the usage by anyone who cares
> > about ISO standards. Which includes me.
>
> In the context of this newsgroup the relevant standard does not define
> WORD.
The C++ standard makes no use of the term with this meaning, and
therefore doesn't need to define it; its meaning is indeed
off-topic for this newsgroup. It came up only because of a quotation from
Stroustrup that used the term. However, I'm sure there are other ISO
standards that do define it. Hopefully, all ISO standards that define the
term give it mutually compatible definitions, but I wouldn't be surprised
to hear otherwise.
|> The debate on this thread is resembling a purely academic
|> debating society regarding the ontology versus phenomenology of
|> certain words. Let us pragmatically refocus on identifying
|> defects/ambiguities in the C++98 standard and how to fix them.
|> Please allow me to summarize where we stand thus far:
|> byte#1: one opinion is that when C++98 and C standards refer to
|> "byte", they are referring to a strictly 8-bit byte
|> byte#2: one opinion is that when C++98 and C standards refer to
|> "byte", they are referring to an implementation defined unit somehow
|> related to representing a minimum-sized char-type on that processor
|> architecture. Some variants of this opinion would permit a byte
|> (i.e., char) of 6-bits to 15-bits. Other variants of this opinion
|> would permit a byte (i.e., char) of nearly any size: 16-bits, 32-bits,
|> 64-bits, 128-bits, ad infinitum.
I've not actually seen either of these opinions, except at the very
beginning of the thread. Both the C and the C++ standards explicitly
define what they mean by byte, and both explicitly say that it is NOT
necessarily 8 bits. (See ISO 9899, 3.6 and ISO 14882, 1.7.)
Having established what the C and the C++ standards mean by byte, the
thread has thus drifted to the question of what the word "normally"
means, outside of the standard.
|> Obviously the C++98 standard can be read in two drastically
|> different and substantially incongruent ways: byte#1 and byte#2.
|> The C++98 standard does not explicitly define the term "byte" nor
|> does it normatively reference a standard which itself in turn
|> explicitly defines the term "byte".
This is simply false. See the sections mentioned above. In
particular, from the C++ standard (second sentence of 1.7): "A byte
[...] is a contiguous sequence of bits, the number of which is
implementation-defined." I don't see what could be clearer.
(Elsewhere, the standard states that the sizeof operator returns the
size in bytes, that sizeof(unsigned char) must be 1, and that an
unsigned char must be able to hold all of the values in the range
0...255. These requirements, taken together, mean that a byte must have
at least 8 bits.)
|> DEFECT: C++98's ambiguous use of the term "byte" without
|> providing an explicit definition which selects exactly one of the
|> alternative definitions of "byte" is itself a fundamental defect
|> from which a series of troublesome alternative interpretations (and
|> thus troublesome alternative compiler implementations) may flow.
No defect. You just haven't bothered reading the standard, or even the
preceding posts in this thread.
|> I see at least two ways of resolving this:
|> resolution#1: Omit any & all mention of the word "byte". In C++0x
|> and in any C++98 corrigenda, strictly use only the word "octet"
|> instead of C++98's "byte".
Except that the intention is precisely NOT to require strictly 8 bits,
but 8 or more bits. For whatever reasons, the intent of the C and the
C++ standard is to allow efficient implementations on any conceivable
hardware, including one with 36 bit words and 9 bit bytes (a hardware
which has actually existed).
|> resolution#2: Explicitly pick one of the two alternative
|> definitions of "byte": byte#1 or byte#2. Explicitly definite "byte"
|> in C++0x and in any C++98 corrigenda.
|> Obviously some people who staunchly subscribe to byte#2 would
|> consider resolution#1 as moving away from their position. Likewise,
|> if byte#1 were to be chosen as part of implementing resolution#2,
|> some of those people who staunchly subscribe to byte#2 would
|> consider resolution#2=byte#1 as moving away from their position.
At least within this thread, I don't think that there have been any
arguments that the C/C++ should change so that it would not allow an
effective implementation on a machines which don't directly support 8
bit bytes.
This is a separate argument. It is, IMHO, a reasonable argument -- I
don't think that there are any machines capable of a hosted environment
sold today that have anything but 8 bit bytes. I think that the last
one sold was probably long enough ago that it need not be considered.
(But I am far from sure about this.) On the other hand, there ARE DSP's
today which define the size of a byte as 32 bits (and sizeof(int) as 1);
any change should allow these to continue to exist. And I think that
there would be considerable resistance to a change which made the basic
language requirements different for hosted and free-standing
environments.
|> Because of this thread's volume of seemingly-endless debate about
|> what the word "byte" is, I expect to see this defect added to the
|> official C++98 defect list. I expect to see this defect resolved in
|> a C++98 corrigendum which then is folded into C++0x. If C++98 meant
|> for "byte" (char) to be an 8-bit byte = octet, then explicitly
|> define "byte" with such strictness. If C++98 meant for "byte"
|> (char) to be 6-bits to 15-bits, up to 16-bits, up to 32-bits, up to
|> 64-bits, up to 128-bits, or so forth, then explicitly define "byte"
|> with such rich semantics.
|> Note that some byte#2-oriented postings on this thread have been
|> tantamount to redefining/hijacking C's/C++'s historically 8(ish)-bit
|> char to be UTF16/UTF32/UCS2/UCS4-capable for non-UTF8 Unicode.
|> Character encoding schemes composed of value-sets whose size is
|> greater than 255 graphemes (e.g., Unicode, ISO/IEC 10646) is the
|> purpose for which wchar_t has always been intended.
I'm curious as to where you got the ideas about C's "historically 8-bit
char". In Kernighan and Richie, "The C Programming Language", 1978
(page 34), the authors explicitely state that the sizes of the data
types are not defined by the language, and include a table of some
current implementations which includes a 9 bit byte. In 1978, the word
byte certainly did not have any implications of 8 bits, as there were
still many machines on the market which had other size bytes.
Note that this is all irrelevant to Tom Plunket's points, to which I was
responding. He and I do not disagree about what the standard says, or
should say, but about the state of the *evolution* of the general
meaning of byte. I think we both agree that it didn't originally mean 8
bits, and we both agree that in 50 years or more, it will definitely mean
a unit of 8 bits -- barring some unforeseeable historical quirk. I
think we also both agree that given this evolution, it would be best if
the C/C++ found another term.
--
James Kanze mailto:ka...@gabi-soft.de
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)179 2607481
---
James, now I too understand the intent of this paragraph; your explanation
was crystal clear!
So one could say ..
"The object representation of an object of type T is the (finite:-) set
of all unsigned char objects taken up by that object of type T".
Some years ago, when I came to 5.3.3 and 3.9p4 after a long round
trip through the document, trying to solve a problem with a stack
of related cross-references almost blasting my head, I realized
this 'circularity' (not really one, as I now know) and aborted my
undertaking with a lot of unkind words leaving my mouth ...
So I would appreciate it if the next version would make this
paragraph clearer in the sense you have explained it.
> > 5.3.3
> > -----------
> > The sizeof operator yields the number of bytes in the object representation
> > of its operand. .....
> > -----------
> > .... and "sizeof(T)" seems to rely on T's "object representation"
>
> There's no circularity in the actual meaning. The size of a byte is
> implementation-defined. The representation of an object takes up an
> implementation-defined amount of memory space, which must be a positive
> integral number of bytes. sizeof(object) reports how many bytes that is.
yes to all
Thanks,
Markus.
| Gabriel Dos Reis <dos...@cmla.ens-cachan.fr> writes:
|
| |> Huh?!? The C++ standard requires that all bits in a char
| |> participate in a char value representation. And EOF is not a
| |> character.
|
| However, as far as I can see, it doesn't place any constraints with
| regards as to what a character can be (except that the characters in the
| basic character set must have positive values, even if char is signed).
Surely, the standard does define what "character" means.
Have a look at 17.1.2.
--
Gabriel Dos Reis, dos...@cmla.ens-cachan.fr
[...]
| |> it suffices to look at the first two pages.
|
| |> 21.1.2/2
| |> For a certain character container type char_type, a related
| |> container type INT_T shall be a type or class which can represent
| |> all of the valid characters converted from the corresponding
| |> char_type values, as well as an end-of-file value, eof(). The type
| |> int_type represents a character container type which can hold
| |> end-of-file to be used as a return type of the iostream class member
| |> functions.
|
| OK. Consider the case of 32 bit char, int and long, using ISO 10646 as
| a code set. And read the text *very* carefully.
I did.
[...]
| |> The standard also says that any bit pattern for char represents a
| |> valid char value, therefore eof() can't be in the values-set of
| |> char.
|
| Table 37 talks of characters, not valid char values. Not all valid char
| values need be valid characters.
Sure, they do.
Let's look at the definitions given at the begining of the library (17.1)
17.1.2 character
in clauses 21, 22, and 27, means any object which, when treated
sequentially, can represent text. The term does *not only mean char
and wchar_t objects*, but *any value* that can be represented by a type
that provides the definitions specified in these clauses.
(Emphasis is mine).
--
Gabriel Dos Reis, dos...@cmla.ens-cachan.fr
---
>Markus Mauhart wrote:
>>
>> "Gennaro Prota" <gennar...@yahoo.com> wrote ...
>> >
>> > P.S.: the only thing that leaves me perplexed is the apparent circular
>> > definition constituted by 5.3.3 and 3.9p4. Does anybody know if it is
>> > resolved in an other part of the standard?
>>
>> IMO that's worth a defect report; any other opinions?
>>
>> 3.9p4:
>> -----------
>> The object representation of an object of type T is the sequence of
>> N unsigned char objects taken up by the object of type T, where N
>> equals sizeof(T).
[...]
>The comment about sizeof(T) is a side issue; the comment is true, and a
>useful thing to know, but does not play a part in defining what the
>object representation is, nor how big it is.
Gulp! This is because you probably know the intended wording, but it's
not what is written there :)
Anyhow, it seems to me that moving the comment to a non-normative part
wouldn't solve another problem: objects of the same type can have
different sizes. Example:
class A {};
class B : public A { public: int i;};
A a;
B b;
The A sub-object in b can occupy 0 bytes, while the complete object a
cannot (1.8p5). Now how do you apply the text from 3.9p4 quoted above?
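For illustration only (typical results; the standard merely permits them):

    #include <iostream>

    class A {};
    class B : public A { public: int i; };

    int main()
    {
        std::cout << sizeof(A) << '\n';   // at least 1: a complete object of an
                                          // empty class never has size zero
        std::cout << sizeof(B) << '\n';   // often just sizeof(int): the empty A
                                          // sub-object may occupy zero bytes
        return 0;
    }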
Moreover you cannot say that sizeof "yields the number of bytes in the
object representation of its operand", since AFAIK what is defined by
the standard is the object representation of an object, not that of a
(parenthesis enclosed name of a) type.
>There's no circularity in the actual meaning. The size of a byte is
>implementation-defined.
You meant "the size of an object", I suppose.
> The representation of an object takes up an
>implementation-defined amount of memory space, which must be a positive
>integral number of bytes. sizeof(object) reports how many bytes that is.
Genny.
I'm afraid that I don't see that. I'm not a fan of the "mind-reading"
school for interpreting the standard. I can see how this wording is
misleading, but not how it's incorrect. 3.9p4 says that N==sizeof(T);
that's perfectly true. It doesn't say that sizeof(T) determines what the
value of N is. It doesn't actually say what it is that determines the
value of N, it just describes some facts involving N. The standard does
not determine what the value of N is; that's up to the implementation.
> Anyhow, it seems to me that moving the comment in a non-normative part
> wouldn't solve another problem: objects of the same type can have
> different sizes. Example:
>
> class A {};
> class B : public A { public: int i;};
>
> A a;
> B b;
>
> The A sub-object in b can occupy 0 bytes, while the complete object a
> cannot (1.8p5). Now how do you apply the text from 3.9p4 quoted above?
Good point; I don't know. I'd recommend filing a DR on that issue.
> Moreover you cannot say that sizeof "yields the number of bytes in the
> object representation of its operand", since AFAIK what is defined by
> the standard is the object representation of an object, not that of a
> (parenthesis enclosed name of a) type.
>
> >There's no circularity in the actual meaning. The size of a byte is
> >implementation-defined.
>
> You meant "the size of an object", I suppose.
No, I meant "the size of a byte". See 1.7p1: "A byte is ... composed of
... bits, the number of which is implementation-defined."
|> James Kanze <ka...@gabi-soft.de> writes:
|> | Gabriel Dos Reis <dos...@cmla.ens-cachan.fr> writes:
|> | |> Huh?!? The C++ standard requires that all bits in a char
|> | |> participate in a char value representation. And EOF is not a
|> | |> character.
|> | However, as far as I can see, it doesn't place any constraints
|> | with regards as to what a character can be (except that the
|> | characters in the basic character set must have positive values,
|> | even if char is signed).
|> Surely, the standard does define what "character" means.
No. We all know what "character" means. And that it has nothing to do
with char, wchar_t, etc. (A "character" is not a numerical value, for
example.)
|> Have a look at 17.1.2.
It is an interesting definition. In particular the "[...] any object
which, when treated sequentially, can represent text" part. I'm not to
sure what that is supposed to mean -- whether an object represents text
or not depends on how it is interpreted, and a char[] doesn't
necessarily represent text, whereas in specific contexts, a double[]
may. (APL basically uses the equivalent of float[] to represent text.
So if I write an APL interpreter in C++...)
About the only way to make sense of it is to suppose that the word "object"
was meant to be taken very literally -- the use of "object" instead of
"type" is intentional, and of course, a 32 bit wchar_t which contains
the value 0x5a5a5a5a is not a character, because there is no way in which,
taking it sequentially (whatever that is supposed to mean -- I suppose
it is an attempt to cover multi-byte characters), it can be taken to
represent text.
Given the wording, I wouldn't read too much into this definition. And I
think some sort of clarification is necessary. I have proposed an
implementation with 32 bit characters, where int_type and char_type are
identical. I think that the standard can be interpreted two ways, one
of which forbids this implementation, and another which allows it. I
would like to know which interpretation is correct.
From a practical point of view, the implementation seems more than
reasonable, and exceedingly useful. So I would like to see it allowed.
But there may be reasons of which I am unaware which argue for
forbidding it.
--
James Kanze mailto:ka...@gabi-soft.de
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
Ziegelhüttenweg 17a, 60598 Frankfurt, Germany Tel. +49(0)179 2607481
---
IMHO Merriam-Webster is a /terrible/ dictionary, so I'm now convinced that
/whatever/ "byte" means it cannot be that, or not only that.
FWIW the Shorter Oxford (which I would rate as a "fairly decent" dictionary)
defines "byte" as: "Computing. A group of binary digits (usu. eight) operated
on as a unit."
That seems to be pretty-much on the mark.
Cheers,
Daniel.