
Re: compilers, endianness and padding


Ian Collins

May 13, 2013, 2:08:25 AM

Jack Adrian Zappa wrote:
> On May 9, 3:07 am, Joshua Maurice <joshuamaur...@googlemail.com>
> wrote:
>> I don't see a use case. You still haven't provided one. What is a
>> specific concrete example where you would use this?
>
> Specific use case: Communication between a microcontroller and other
> systems via streams or file storage. Requirements: Need to be able
> to stipulate the binary representation of data between systems for
> clean and compatible interchange while requiring low memory and
> processing overhead on the part of the microcontroller.

This is a common situation which has already been implemented by a
myriad of libraries.

The natural solution is to use the byte order of the slowest system,
the micro.

--
Ian Collins


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

James K. Lowden

May 13, 2013, 2:10:29 AM

On Thu, 9 May 2013 16:06:39 -0700 (PDT)
Öö Tiib <oot...@hot.ee> wrote:

> > One can enforce byte order using shifts and masks, but it is
> > inefficient.

Efficiency is always measured with respect to something. If
byte-swapping is needed, the compiler can do it more efficiently than
the programmer.
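
For reference, a minimal sketch of the "shifts and masks" approach being discussed. The helper names load_le16 and store_le16 are hypothetical and not from the thread; the sketch assumes 8-bit bytes and gives the same result on little-endian and big-endian hosts.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers, not taken from the thread.
// Read a 16-bit value stored least-significant byte first.
inline std::uint16_t load_le16(const unsigned char* p) {
    assert(p != nullptr);
    return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

// Write a 16-bit value least-significant byte first.
inline void store_le16(unsigned char* p, std::uint16_t v) {
    assert(p != nullptr);
    p[0] = static_cast<unsigned char>(v & 0xFF);
    p[1] = static_cast<unsigned char>(v >> 8);
}
```

Whether this costs anything in practice depends on the compiler; many optimizers recognize the pattern, but that is not guaranteed.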

> It is inefficient to keep data in memory in some file format.

Hmm, what would you say about mmap(2)?

> It is inefficient and non-portable to keep data in file in memory
> layout of particular platform.

That is exactly what many DBMSs do.

> So something what you request feels to be doomed to be fundamentally
> inefficient.

Quite to the contrary: the OP is asking for language support for
something currently handled by runtime libraries. If you start with
something like this (notionally)

struct { LE int16_t name_len; char name[1]; } sb;

where name is a character sequence whose length is expressed in
name_len -- which wouldn't be unusual in an input buffer on a socket,
say -- then you'd be able to say (notionally)

std::string name(sb.name, static_cast<size_t>(sb.name_len));

or, for that matter,

size_t len = sb.name_len; // implicit conversion
std::string name(sb.name, len);

meaning, of course, that the intermediate variable "len" would not be
needed, because endian-conversion would be implicit when the parameter
is passed.

That would be more efficient -- and definitely more robust -- than

std::string name(sb.name, htons(sb.name_len));

especially because htons(3) won't convert from little-endian format.

Jack Adrian Zappa's suggestion is obviously good. Endianism
conversion is tedious and error-prone. The compiler knows the
endianism of the object code it is generating. By allowing the
programmer to denote endianism of I/O structures, the compiler enables
endian conversion at the lowest possible cost and highest possible
convenience.

The objection is raised that this functionality need not be in the
language proper, but could be in the library. OK, let there be
std::integer defined with compile-time endianism and implicit
conversion to built-in types, such that the above notation would work.
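
As a rough idea of what such a library type might look like, here is a minimal sketch. The names le_int16, name_record and read_name are hypothetical, and the name field is given a fixed size (rather than the post's char name[1]) only to keep the example well-defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical stand-in for the "std::integer" imagined above: a 16-bit
// integer that is always stored little-endian and converts implicitly
// to and from the host's native int16_t.
class le_int16 {
public:
    le_int16() : bytes_{0, 0} {}
    le_int16(std::int16_t v) {
        const std::uint16_t u = static_cast<std::uint16_t>(v);
        bytes_[0] = static_cast<unsigned char>(u & 0xFF);
        bytes_[1] = static_cast<unsigned char>(u >> 8);
    }
    operator std::int16_t() const {
        const std::uint16_t u =
            static_cast<std::uint16_t>(bytes_[0] | (bytes_[1] << 8));
        std::int16_t v;
        std::memcpy(&v, &u, sizeof v);  // reinterpret the bits
        return v;
    }
    const unsigned char* data() const { return bytes_; }

private:
    unsigned char bytes_[2];
};

// Assumed wire layout: 2-byte little-endian length, then the characters.
struct name_record {
    le_int16 name_len;
    char name[6];
};

inline std::string read_name(const name_record& sb) {
    const std::int16_t len = sb.name_len;  // implicit endian conversion
    assert(len >= 0 && static_cast<std::size_t>(len) <= sizeof sb.name);
    return std::string(sb.name, static_cast<std::size_t>(len));
}
```

The conversion operator is what lets sb.name_len be used wherever a built-in integer is expected, which is the notation the post asks for.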

The objection is raised that this language feature might be used
inappropriately, not in an I/O context. C++ is rope. How you use it
is up to you. Compared to operator overloading, this is nothing.

At the very least, the compiler should provide a built-in similar to
__FILE__ to denote endianism, saving the library writer the need to
depend on header files. I fail to see any advantage to requiring the
programmer to detect something known full well to the compiler.
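
A sketch of what is available without a standard built-in, under stated assumptions: __BYTE_ORDER__ and __ORDER_LITTLE_ENDIAN__ are predefined macros on GCC and Clang, not standard C++, so the code falls back to a run-time probe elsewhere. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Run-time probe: inspect the first byte of a known 32-bit value.
inline bool probe_little_endian() {
    const std::uint32_t one = 1;
    unsigned char first = 0;
    std::memcpy(&first, &one, 1);
    return first == 1;
}

// Prefer the compiler's own knowledge where it is exposed.
// __BYTE_ORDER__ and __ORDER_LITTLE_ENDIAN__ are GCC/Clang predefined
// macros, not standard C++; other compilers use the probe instead.
inline bool host_is_little_endian() {
#if defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__)
    return __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__;
#else
    return probe_little_endian();
#endif
}
```

C++20 later standardized this as std::endian in the <bit> header.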

--jkl


--

James K. Lowden

May 13, 2013, 2:12:27 AM

On Thu, 9 May 2013 00:03:42 -0700 (PDT)
Thomas Richter <th...@math.tu-berlin.de> wrote:

> (otherwise, please explain to me how to serialize a pointer).

I'm sorry, but I consider this a trope. Not only is "serializing a
pointer" a solved problem in two dozen libraries since, oh, CORBA, but

ostream& operator<<(ostream&, const char *)

has been, as you well know, defined in namespace std for 25 years.

Why do people think pointers can't be serialized?

> This is the wrong place for it because the philosophy of *this*
> language is a different one. It is "do not pay for what you do not
> need", and I do not need it. I can write portable I/O just fine
> without the help of the language. I use libraries for that.

I find it odd that

char *s = "hello";
cout << s;

works, but

struct { char *s; } s = { "hello" };
cout << s;

does not. I do not understand why we accept serialization of built-in
types, and resolutely refuse to standardize -- or even support the
standardization of -- serialization of user-defined types.

The minimum I would like to see is the ability to iterate over the
members of a structure. Suppose they were described as an array of
tuples of {type, size, constness}. Then we could serialize abstractly
along the lines of

struct { ... } foo;
for_each(members_of(foo).begin(), ... );

The compiler could readily support automatic typing of the members_of
elements and supply sufficient metadata to traverse an
inheritance/aggregation tree. These are the necessary missing
ingredients to a standard serialization library.

There need not be any cost. The metadata are required only if
referenced. Nothing prevents the optimizer from stripping it away.
Nothing prevents the compiler from segregating the metadata somewhere
such that it is not loaded into memory unless it's used.
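
To make the idea concrete, here is a sketch in which the member table is written by hand, standing in for what the compiler would emit under this proposal. Everything here (member_info, member_type, sample, sample_members, describe) is hypothetical; no such facility exists in the language being discussed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <sstream>
#include <string>

enum class member_type { int32, float64 };

// One entry per member: roughly the {type, size, constness} tuple above,
// plus a name and an offset so the member can be located.
struct member_info {
    const char* name;
    member_type type;
    std::size_t offset;
    std::size_t size;
    bool is_const;
};

struct sample { std::int32_t id; double weight; };

// Hand-written stand-in for compiler-generated metadata.
static const member_info sample_members[] = {
    {"id",     member_type::int32,   offsetof(sample, id),     sizeof(std::int32_t), false},
    {"weight", member_type::float64, offsetof(sample, weight), sizeof(double),       false},
};

// Generic-looking traversal: no member is named in this function.
inline std::string describe(const sample& s) {
    const unsigned char* base = reinterpret_cast<const unsigned char*>(&s);
    std::ostringstream out;
    for (const member_info& m : sample_members) {
        out << m.name << '=';
        if (m.type == member_type::int32) {
            std::int32_t v;
            assert(m.size == sizeof v);
            std::memcpy(&v, base + m.offset, sizeof v);
            out << v;
        } else {
            double v;
            assert(m.size == sizeof v);
            std::memcpy(&v, base + m.offset, sizeof v);
            out << v;
        }
        out << ';';
    }
    return out.str();
}
```

The table is static data that is only touched by describe, which is consistent with the claim that the metadata need cost nothing unless referenced.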

But wait, you say. Why is serialization so important? I ask you, why
is std::string special?

When C++ was young, the liabilities of uninitialized pointers and
null-terminated character arrays in C were widely acknowledged. C++
answered them with references (foo_t&) and std::string. The language
succeeded by answering the needs of the day.

Both language and library were designed when networks were still
strange and nonstandard, when people still paid attention to the ISO
model and SNA and X.400. Networking was a bespoke business; the
problem of transmitting a data structure from one machine to another
was hardly standardized at the operating system level, let alone
between applications. Cfront appeared in 1985; the likes of CORBA not
until 1992, NCSA Mosaic in 1993. Indeed, when Java arrived in 1995
its main claim to fame other than GC was built-in networking.

Given the networks of the day (primitive), the machines of the day
(slower by 6 orders of magnitude), and the experience with C++ at the
time (roughly nil) Stroustrup & friends restricted themselves to a
single, well understood problem: std::string. To answer my own
question, std::string is special because its need was recognized in
1985.

In 2013, the need for standardized serialization has become clear.
Dozens of incompatible serialization libraries have been written. All
are more awkward than would be necessary if they didn't lack metadata
support from the compiler.

Who reading this list hasn't written operator<< for some user-defined
type and wearily wondered why each member had to be named
individually, each collection iterated over explicitly, and the format
of the whole structure re-invented and re-expressed, explicitly, by
hand, *again*?! And yet again for operator>>?

We spend far too much time today dealing with I/O, recreating by hand
the very metadata discarded by the compiler. With support from the
compiler, C++ could provide easy, standard, robust, efficient,
reliable I/O for user-defined types of arbitrary complexity. Until
then, we'll party on like it's 1985.

Francis Glassborow

May 13, 2013, 8:28:28 AM

On 13/05/2013 07:12, James K. Lowden wrote:
>
> On Thu, 9 May 2013 00:03:42 -0700 (PDT)
> Thomas Richter <th...@math.tu-berlin.de> wrote:
>
>> (otherwise, please explain to me how to serialize a pointer).
>
> I'm sorry, but I consider this a trope. Not only is "serializing a
> pointer" a solved problem in two dozen libraries since, oh, CORBA,
> but
>
> ostream& operator<<(ostream&, const char *)
>
> has been, as you well know, defined in namespace std for 25 years.
>
> Why do people think pointers can't be serialized?
>

Perhaps I am confused (quite likely :) but surely serialisation is not
just the ability to output data but the ability to read it back in a
meaningful way. On most systems the value of a pointer (i.e. an
address) is specific to the current run. If it is on the stack it only
survives as long as the stack has not been unwound past that point,
and if it is on the heap it only survives until that allocation has
been released. To put it more simply, a pointer only survives until
the memory for the relevant object has been released.

Francis

Andy Champ

May 13, 2013, 8:26:18 AM

On 13/05/2013 07:08, Ian Collins wrote:
>
> The natural solution is to use the byte order of the slowest system,
> the micro.

Some of us will recall that a 6800 has the opposite byte order to the
8080. A quick Google search suggests to me that all Intel ones have LSB first,
and other microcontrollers (Freescale was the first one I found) have
MSB first...

We'll also recall that once upon a time the Intel order was called
"byteswapped", but they've made great strides with the Swiftian idea
of endianism. And I for one can never remember which is little endian.

Incidentally, I think Intel have it right, and we humans have it wrong
- least significant digit first means that you can read the number in
one pass. 1234 should be one and twenty and three hundred and four
thousand.

Andy


--

Martin Bonner

May 13, 2013, 1:08:48 PM

On Monday, May 13, 2013 12:30:05 PM UTC+1, Andy Champ wrote:
> Incidentally, I think Intel have it right, and we humans have it wrong

Not all humans.

> - least significant digit first means that you can read the number in
> one pass. 1234 should be one and twenty and three hundred and four
> thousand.

Which is (once you have converted all the characters for right-to-left),
how it would read in Arabic and Hebrew.

Seungbeom Kim

May 13, 2013, 6:19:48 PM

On 2013-05-12 23:12, James K. Lowden wrote:
>
> ostream& operator<<(ostream&, const char *)
>
> has been, as you well know, defined in namespace std for 25 years.
>
> Why do people think pointers can't be serialized?

This (char*) is a special case; no other pointer type gets that
special treatment from the standard library.

What people mean by serializing pointers is that when you have 'struct
node { int value; node* left; node* right; } n;', for example, you
cannot simply do 'fwrite(&n, sizeof(n), 1, fp)' (and expect to be able
to read it back with valid semantics).

> I find it odd that
>
> char *s = "hello";
> cout << s;
>
> works,

Again, char* is a special case. Mainly because C used char* values to
represent string values.

> but
>
> struct { char *s; } s = { "hello" };
> cout << s;
>
> does not. I do not understand why we accept serialization of
> built-in types, and resolutely refuse to standardize -- or even
> support the standardization of -- serialization of user-defined
> types.

How do you define the serialization format for an arbitrary UDT? Why
should the language standard define one?

For example, given the node type mentioned above, what's THE ONE
correct way to serialize a binary tree?

> The minimum I would like to see is the ability to iterate over the
> members of a structure. Suppose they were described as an array of
> tuples of {type, size, constness}. Then we could serialize
> abstractly along the lines of
>
> struct { ... } foo;
> for_each(members_of(foo).begin(), ... );

That would be very cool, but even before being able to iterate over
struct members, the most fundamental problem to be solved is how to
represent types as data, I believe.

Maybe variadic templates could help, because they are what can handle
an arbitrary-length list of arbitrary-type data currently. Given
something like

template<typename... Args> Ret foo(Args&&... params);

a call to foo(some_struct) could automatically be expanded by the
compiler to foo(some_struct.member1, some_struct.member2, ...).
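
A sketch of that shape with the expansion written out by hand, since the language does not perform it. The names write_fields, employee and write_employee are hypothetical, and the one-field-per-line text format is just an assumption for the example.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Base case: no fields left to write.
inline void write_fields(std::ostream&) {}

// Write the first field, then recurse on the remaining ones.
template <typename First, typename... Rest>
void write_fields(std::ostream& os, const First& first, const Rest&... rest) {
    os << first << '\n';
    write_fields(os, rest...);
}

struct employee {
    std::string name;
    int age;
    double salary;
};

// The member list the post wishes the compiler would generate.
inline void write_employee(std::ostream& os, const employee& e) {
    write_fields(os, e.name, e.age, e.salary);
}
```

Only write_employee knows the member names; write_fields itself is fully generic, which is the part a compiler-provided expansion would make reusable.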

But again, I guess lots of UDTs need more than just what the template
expansion can do for serialization (as imposed by the external
format).

> Stroustrup & friends restricted themselves to a single, well
> understood problem: std::string. To answer my own question,
> std::string is special because its need was recognized in 1985.

What makes you think std::string is special in the current context?
It's just a class type, which happens to be included in the standard
library and thus be supported better by other components in the same
library. The core language doesn't give it any special treatment.

--
Seungbeom Kim

Andy Champ

May 13, 2013, 6:16:43 PM

On 13/05/2013 18:08, Martin Bonner wrote:
> On Monday, May 13, 2013 12:30:05 PM UTC+1, Andy Champ wrote:

>> Incidentally, I think Intel have it right, and we humans have it
>> wrong
>
> Not all humans.
>
>> - least significant digit first means that you can read the number
>> in one pass. 1234 should be one and twenty and three hundred and
>> four thousand.
>
> Which is (once you have converted all the characters for
> right-to-left), how it would read in Arabic and Hebrew.

Ever watched an Arab typing? The letters go in from the right, and the
cursor follows along. Some of the letters wriggle as new ones go in -
they change shape to match. Then he gets to a number, and it goes in
on the LEFT of the cursor, MSD first. Next letter and the cursor jumps
left past the number...

Andy

James K. Lowden

May 14, 2013, 2:11:46 AM

On Mon, 13 May 2013 16:19:48 CST
Seungbeom Kim <musi...@bawi.org> wrote:

Seungbeom, I want to acknowledge the care you took to pose a reasonably
hard example problem. I probably missed something, but I hope I've
shown it is readily solved.

> On 2013-05-12 23:12, James K. Lowden wrote:
>
> > I find it odd that
> >
> > char *s = "hello";
> > cout << s;
> >
> > works,
>
> Again, char* is a special case. Mainly because C used char* values
> to represent string values.
>
> > but
> >
> > struct { char *s; } s = { "hello" };
> > cout << s;
> >
> > does not. I do not understand why we accept serialization of
> > built-in types, and resolutely refuse to standardize -- or even
> > support the standardization of -- serialization of user-defined
> > types.
>
> How do you define the serialization format for an arbitrary UDT?

By iteration over the members.

There is no such thing as an "arbitrary UDT". Every UDT is built up
from primitive types, and every memberwise I/O operation eventually
boils down to (de-)serialization of those primitive types.

> Why should the language standard define one?

The language should define one so we can stop reinventing I/O for
every possible combination.

I do not mean C++ should suddenly see I/O added to the language
definition. I do mean that the language needs some small but
important extensions before iostreams can be extended to support
generic types.

> For example, given the node type mentioned above, what's THE ONE
> correct way to serialize a binary tree?

Tell me this: what's the one correct way to serialize a double?

We don't need *the* one correct way. We would benefit, though, from a
correct, reversible way. There is no reason it can't be done
mechanically.

For the reader's reference, the struct in question is

struct node { int value; node* left; node* right; } n;

How, you ask? I'd do something I bet very like what you would do.
What I do not see is why the standard library couldn't do it for me,
with a little information from the compiler.

I hope someone better versed in graph theory will come to my rescue,
but here's a plausible Monday night hack:

byte  type   size  value
   0  node     20  -
   0  int       4  x
   4  node*     8  20
  12  node*     8  40
  20  node     20  -      // n.left
  20  int       4  y
  24  node*     8  60
  32  node*     8  80
  40  node     20  -
  40  int       4  z      // n.right
  44  node*     8  100
  52  node*     8  120
etc.

I wrote that in ASCII of course, because we're two humans
communicating. For communication between C++ programs, the above
information would better be tokenized.

The serialization system would recognize "node" as a UDT, taken from
the list of types provided by the compiler, and would therefore have
access to the metadata array describing the members. Pointers are
denoted as offsets into the stream. In reality, the stream reflects
what the compiler itself must do to maintain the graph in memory.
(Because, after all, pointers are just offsets from zero into the
linear address space we call "memory".)

Of course, nothing prevents a graph built from such a structure from
having cycles. OTOH nothing prevents the serializer from detecting
cycles.
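
A minimal sketch of that scheme, simplified to record indices rather than byte offsets. The names record, flatten and rebuild are hypothetical; the "seen" map is one possible way to detect shared nodes and cycles, not necessarily what a standard library would do.

```cpp
#include <cassert>
#include <map>
#include <vector>

struct node { int value; node* left; node* right; };

// Flattened form: child pointers become indices into the record list,
// with -1 standing for a null pointer.
struct record { int value; int left; int right; };

// Pre-order walk. Nodes already emitted are looked up in `seen`, so a
// shared node or a cycle is written once and referenced by index.
inline int flatten(const node* n, std::vector<record>& out,
                   std::map<const node*, int>& seen) {
    if (n == nullptr) return -1;
    std::map<const node*, int>::const_iterator it = seen.find(n);
    if (it != seen.end()) return it->second;
    const int index = static_cast<int>(out.size());
    seen[n] = index;
    out.push_back(record{n->value, -1, -1});
    const int l = flatten(n->left, out, seen);
    const int r = flatten(n->right, out, seen);
    out[index].left = l;
    out[index].right = r;
    return index;
}

// Inverse: allocate all nodes, then turn indices back into pointers.
// The root, if any, is nodes[0].
inline void rebuild(const std::vector<record>& in, std::vector<node>& nodes) {
    nodes.assign(in.size(), node{0, nullptr, nullptr});
    const int count = static_cast<int>(in.size());
    for (int i = 0; i < count; ++i) {
        assert(in[i].left < count && in[i].right < count);
        nodes[i].value = in[i].value;
        nodes[i].left  = in[i].left  < 0 ? nullptr : &nodes[in[i].left];
        nodes[i].right = in[i].right < 0 ? nullptr : &nodes[in[i].right];
    }
}
```

The rebuilt pointers have different numeric values from the originals, but the shape of the graph is preserved.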

> > The minimum I would like to see is the ability to iterate over the
> > members of a structure. Suppose they were described as an array
> > of tuples of {type, size, constness}. Then we could serialize
> > abstractly along the lines of
> >
> > struct { ... } foo;
> > for_each(members_of(foo).begin(), ... );
>
> That would be very cool, but even before being able to iterate over
> struct members, the most fundamental problem to be solved is how to
> represent types as data, I believe.

I simply don't see the problem. As I said, every struct or class
eventually is composed of built-in types. The compiler is able to
manage the structures in memory. The debugger is able to represent
them on the screen. What do you think is so different about a stream
that it requires a sad and lonely human being to write the I/O
routines?

> But again, I guess lots of UDTs need more than just what the
> template expansion can do for serialization (as imposed by the
> external format).

ISTM it's not as hard as you think. You'll agree that inheritance is
a tree, and that trees can be unambiguously represented and traversed.
Structures you'll agree can be described as an array of types. If I
gave you a tree of arrays arbitrarily and recursively defined, but
with each element defined in advance -- because I'm a compiler, and
all my types are known by ODR -- then surely you would be able to
iterate over the whole steaming mass and write it to a file.

The problem as I see it is that the type system is unavailable at
runtime. The information I'm describing -- class hierarchy, member
structure -- is discarded by the compiler (except insofar as it's made
available to the debugger).

Although the vogue term is "reflection", the idea is older than
ancient. Classes in Smalltalk could be interrogated at runtime.
(Heck, IIRC classes could be *modified* at runtime. But we won't go
there!)

> > Stroustrup & friends restricted themselves to a single, well
> > understood problem: std::string. To answer my own question,
> > std::string is special because its need was recognized in 1985.
>
> What makes you think std::string is special in the current context?
> It's just a class type, which happens to be included in the standard
> library and thus be supported better by other components in the same
> library. The core language doesn't give it any special treatment.

Exactly. Because the core language discards information the standard
library could otherwise use to handle UDTs generically, std::string
had to be explicitly and painstakingly integrated into the standard
library. Before the advent of the Internet, std::string was the
answer to the one well known I/O problem, namely char*. In that day
and age, it was deemed worthwhile to craft a single-purpose type,
rather than expose the type system for the library's use.

I cannot reliably take std::string from one library and pass it to
operator<< in another. There are all sorts of little geegaws in
std::string because the compiler does not provide the requisite
information: the library must "know" the name of the char* pointer,
and the length. The library cannot simply iterate over the members
and deal with each one in turn.

Stroustrup has often expressed the wish that C++ would develop a
standard library for UIs and databases. Actually, though, those are
only two examples of C++'s poor I/O support: with the exception of
files, the standard library is silent wrt I/O. That glaring void is
invisible to us only because we're accustomed to it.

One reason, surely, is lack of standardization at the OS level.
Another, just as surely, is the impossibility of writing a library
capable of dealing with user-defined types.

Without compiler support, C++ is "just another language" participating
in the IDL-driven language-neutral serialization circus. Inevitably,
the IDL defines the very structures that could be better defined
directly in C++. Twice the complexity, half the features, and none of
the fun.

C++17 represents a chance to fill that void, as it were, with
standardized, programmatically accessible metadata. Sure, let's storm
the castle! But first let's answer their questions about the speed of
a flying swallow. Perhaps they'll lower the drawbridge.

--jkl



--

James K. Lowden

May 14, 2013, 2:32:19 AM

On Mon, 13 May 2013 06:28:28 CST
Francis Glassborow <francis.g...@btinternet.com> wrote:

> > has been, as you well know, defined in namespace std for 25 years.
> >
> > Why do people think pointers can't be serialized?
>
> Perhaps I am confused (quite likely :) but surely serialisation is
> not just the ability to output data but the ability to read it back
> in a meaningful way. On most systems the value of a pointer (i.e. an
> address) is specific to the current run. If it is on the stack it
> only survives as long as the stack has not been unwound past that
> point, and if it is on the heap it only survives until that
> allocation has been released. To put it more simply, a pointer only
> survives until the memory for the relevant object has been released.

Maybe we don't mean the same thing. To me, "serialize a pointer" is
the same as "marshal the data". On output, chase the pointer and
write the data. On input, allocate the memory, assign the pointer,
and fill the buffer.
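
A sketch of "marshal the data" for a char* under assumed conventions: a 4-byte little-endian length prefix followed by the characters. The function names marshal and unmarshal and the wire format are assumptions for illustration; the length is not validated, as real code would need to do.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <istream>
#include <ostream>
#include <sstream>

// Output: chase the pointer and write what it points at, length first.
inline void marshal(std::ostream& os, const char* s) {
    assert(s != nullptr);
    const std::uint32_t len = static_cast<std::uint32_t>(std::strlen(s));
    char header[4];
    for (int i = 0; i < 4; ++i)
        header[i] = static_cast<char>((len >> (8 * i)) & 0xFF);
    os.write(header, 4);
    os.write(s, static_cast<std::streamsize>(len));
}

// Input: allocate the memory, assign the pointer, fill the buffer.
// Returns nullptr on a short read; the caller owns the result and
// must release it with delete[].
inline char* unmarshal(std::istream& is) {
    char header[4];
    if (!is.read(header, 4)) return nullptr;
    std::uint32_t len = 0;
    for (int i = 0; i < 4; ++i)
        len |= static_cast<std::uint32_t>(
                   static_cast<unsigned char>(header[i])) << (8 * i);
    char* s = new char[len + 1];
    if (!is.read(s, static_cast<std::streamsize>(len))) {
        delete[] s;
        return nullptr;
    }
    s[len] = '\0';
    return s;
}
```

The address returned by unmarshal is unrelated to the one passed to marshal, which is the sense in which the pointer's value is immaterial here.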

The value of the pointer itself is immaterial. Indeed, the value of
the pointer in *memory* is immaterial, right? We don't usually care
what the numeric value of a pointer is -- we don't even care that
it's a number -- but only that it can be dereferenced to get at the
pointed-at value.

--jkl

Thomas Richter

May 14, 2013, 2:41:14 AM

On 13.05.2013 08:12, James K. Lowden wrote:
>
> On Thu, 9 May 2013 00:03:42 -0700 (PDT)
> Thomas Richter<th...@math.tu-berlin.de> wrote:
>
>> (otherwise, please explain to me how to serialize a pointer).
>
> I'm sorry, but I consider this a trope. Not only is "serializing a
> pointer" a solved problem in two dozen libraries since, oh, CORBA,
> but
>
> ostream& operator<<(ostream&, const char *)
>
> has been, as you well know, defined in namespace std for 25 years.

This does not serialize a pointer. It serializes the object the
pointer points to, which is quite something different.

> Why do people think pointers can't be serialized?

Because the value of the pointer is specific to the run of the
program. Specifically, the above "serialization" cannot distinguish
between two pointers that point to the identical object, and two
pointers that point to similar objects. This can make quite a
difference in program code.
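
The distinction can be shown in a few lines. This is only a sketch of one common workaround, numbering distinct addresses in order of first appearance; the helper name object_ids is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Give each distinct address an id in order of first appearance.
// Pointers to the same object get the same id; pointers to different
// objects get different ids, even if the objects compare equal.
inline std::vector<int> object_ids(const std::vector<const int*>& pointers) {
    std::map<const int*, int> ids;
    std::vector<int> out;
    for (std::size_t i = 0; i < pointers.size(); ++i) {
        const int next = static_cast<int>(ids.size());
        // insert() leaves an existing entry untouched and returns it.
        out.push_back(
            ids.insert(std::make_pair(pointers[i], next)).first->second);
    }
    return out;
}
```

Writing only the pointed-to values would produce identical output for {&a, &a} and {&a, &b} when a == b; writing the ids alongside keeps the two cases apart.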

>> This is the wrong place for it because the philosophy of *this*
>> language is a different one. It is "do not pay for what you do not
>> need", and I do not need it. I can write portable I/O just fine
>> without the help of the language. I use libraries for that.
>
> I find it odd that
>
> char *s = "hello";
> cout<< s;
>
> works, but
>
> struct { char *s; } s = { "hello" };
> cout<< s;
>
> does not.

Of course it works. Supply the right operator for your structure, and
off you go. There is absolutely nothing special about std::string. The
standard committee just chose to provide an operator<< for string
because that is considerably more useful, whereas they could not
predict what your structures look like and how they should appear
printed on screen (or disk).
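
For instance, a minimal sketch of such an operator. The struct is given a name (greeting, hypothetical) so that the operator can refer to it, and the "(null)" text is an arbitrary choice.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

struct greeting { const char* s; };

// User-supplied output operator: the author of the type decides how
// it should appear on the stream.
inline std::ostream& operator<<(std::ostream& os, const greeting& g) {
    return os << (g.s ? g.s : "(null)");
}
```

With this in place, "cout << g" compiles for a greeting g just as it does for a plain char pointer.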

> I do not understand why we accept serialization of built-in types,
> and resolutely refuse to standardize -- or even support the
> standardization of -- serialization of user-defined types.

std::string is not "built-in". It is a library solution that is
specified because it is of general use. The structures and classes in
your code are likely of much less general use, but if they are of
*some* general use, they probably have output operators that are
specified in *some* standard. Standard C++ is really what is supposed
to be useful for every user of C++.

> The minimum I would like to see is the ability to iterate over the
> members of a structure.

Again, I typically don't need that, but it would include some
overhead. For example, the structure layout would likely need to be
stored somewhere at run time. I don't need this overhead. But if you
do, I'm sure a library solution is feasible which does that.

> Suppose they were described as an array of tuples of {type, size,
> constness}. Then we could serialize abstractly along the lines of
>
> struct { ... } foo;
> for_each(members_of(foo).begin(), ... );
>
> The compiler could readily support automatic typing of the
> members_of elements and supply sufficient metadata to traverse an
> inheritance/aggregation tree. These are the necessary missing
> ingredients to a standard serialization library.
>
> There need not be any cost. The metadata are required only if
> referenced. Nothing prevents the optimizer from stripping it away.
> Nothing prevents the compiler from segregating the metadata
> somewhere such that it is not loaded into memory unless it's used.

Well, propose a solution. I personally wouldn't care much since I
wouldn't need it, and if I need serialization, I only need a partial
serialization - the above "automation" does not do the right thing if
I have pointers somewhere.

> But wait, you say. Why is serialization so important? I ask you,
> why is std::string special?

It isn't special at all. It's a library solution like any other
classes, too. It was just considered to be standardized because it is
quite useful for a large audience.

> When C++ was young, the liabilities of uninitialized pointers and
> null-terminated character arrays in C were widely acknowledged. C++
> answered them with references (foo_t&) and std::string. The
> language succeeded by answering the needs of the day.
>
> Both language and library were designed when networks were still
> strange and nonstandard, when people still paid attention to the ISO
> model and SNA and X.400. Networking was a bespoke business; the
> problem of transmitting a data structure from one machine to
> another was hardly standardized at the operating system level, let
> alone between applications. Cfront appeared in 1985; the likes of
> CORBA not until 1992, NCSA Mosaic in 1993.

...and CORBA is dead nowadays, but provides serialization in - wait -
C++. So where is your problem, you have the solution. (Well, the C++
binding of Corba is awkward, but that's a Corba problem, not a C++
problem).

> Indeed, when Java arrived in 1995 its main claim to fame other than
> GC was built-in networking.

Which is also a library solution. Java has the advantage of a very
rich "standard library" because its application domain is narrower
than that of C++. But C++ runs on platforms java does not run on, so
you gain something, and you lose something. I'll certainly not stop
anyone from using Java. Actually, I'm programming a lot in java these
days, but also in C++.

> Given the networks of the day (primitive), the machines of the day
> (slower by 6 orders of magnitude), and the experience with C++ at
> the time (roughly nil) Stroustrup& friends restricted themselves to
> a single, well understood problem: std::string. To answer my own
> question, std::string is special because its need was recognized in
> 1985.

The problem is just that you now assume networking to be part of the
language, but C++ also runs on platforms that have nothing like
that. So it's not there. If you need networking in C or C++, the
solution is to pick *other* standards that solve these problems for
you. C++ does not intend to solve problems like GUIs or
serialization. There are solutions for such problems on the market,
and written down as standards, so where's the problem using them?

> In 2013, the need for standardized serialization has become clear.

Not to me. I don't have this problem in my day job. Really. If I had,
I would probably pick a language that solves the problem in a better
way. I use C++ because it is a powerful rich language that allows me
to write fast algorithms with a good structure. If GUIs are my
problem, I pick Java - or probably use a creator for HTML5 documents,
as java applets are also dying out.

> We spend far too much time today dealing with I/O, recreating by
> hand the very metadata discarded by the compiler. With support from
> the compiler, C++ could provide easy, standard, robust, efficient,
> reliable I/O for user-defined types of arbitrary complexity. Until
> then, we'll party on like it's 1985.

I would rather say, you picked the wrong tool for the job in the first
place. I don't know what you do or which types of problems you want to
solve, but C++ does not sound like the solution for your problem. If
your programs are mostly I/O bound, I would check for programming
languages that offer a higher abstraction level than C++. If most of
your job is writing serialization code, this is even more an indicator
that C++ is the wrong choice for the problem.

Greetings,
Thomas

Thomas Richter

May 14, 2013, 2:42:18 AM

On 13.05.2013 08:10, James K. Lowden wrote:
>
> On Thu, 9 May 2013 16:06:39 -0700 (PDT)
> Öö Tiib<oot...@hot.ee> wrote:
>
>>> One can enforce byte order using shifts and masks, but it is
>>> inefficient.
>
> Efficiency is always measured with respect to something. If
> byte-swapping is needed, the compiler can do it more efficiently
> than the programmer.

I kind of doubt that, but anyhow...

>> It is inefficient to keep data in memory in some file format.
>
> Hmm, what would you say about mmap(2)?

Good for raw data. Not good if I have to fiddle around with the
data. As soon as I have to touch non-native endian data more than
probably a couple of times, it is more efficient just to convert them
by hand than to rely on the compiler to arrange accesses to the
data. Actually, in many cases it does not even make sense for C++ data
types. Did I say "pointer"?

>> It is inefficient and non-portable to keep data in file in memory
>> layout of particular platform.
>
> That is exactly what many DBMSs do.

I'm not an expert in DBMSs at all, but this sounds like a bad idea to
me.

>> So something what you request feels to be doomed to be
>> fundamentally inefficient.
>
> Quite to the contrary: the OP is asking for language support for
> something currently handled by runtime libraries.

So what's so bad about runtime libraries? A lot of stuff in C++ is
handled by runtime libraries. Does it make a difference to you whether
the runtime library has an ANSI C++ logo printed on it or not?

> Jack Adrian Zappa's suggestion is obviously good. Endianism
> conversion is tedious and error-prone. The compiler knows the
> endianism of the object code it is generating. By allowing the
> programmer to denote endianism of I/O structures, the compiler
> enables endian conversion at the lowest possible cost and highest
> possible convenience.

Use a library for that.

> The objection is raised that this functionality need not be in the
> language proper, but could be in the library. OK, let there be
> std::integer defined with compile-time endianism and implicit
> conversion to built-in types, such that the above notation would
> work.

Then create a library for that if you need it.
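As a rough sketch of what such a library type could look like, assuming a hypothetical class `le_uint16` (not a standard or existing library name) with implicit conversion to and from the built-in type:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical library type: a 16-bit unsigned integer that is always
// stored as two little-endian bytes, whatever the host byte order.
class le_uint16 {
public:
    le_uint16() { bytes_[0] = 0; bytes_[1] = 0; }
    le_uint16(std::uint16_t v) { store(v); }   // implicit on purpose
    le_uint16& operator=(std::uint16_t v) { store(v); return *this; }

    // Implicit conversion back to the native representation.
    operator std::uint16_t() const {
        return static_cast<std::uint16_t>(bytes_[0] | (bytes_[1] << 8));
    }

    // Raw wire bytes, low byte first.
    const unsigned char* data() const { return bytes_; }

private:
    void store(std::uint16_t v) {
        bytes_[0] = static_cast<unsigned char>(v & 0xFFu);
        bytes_[1] = static_cast<unsigned char>(v >> 8);
    }
    unsigned char bytes_[2];
};

static_assert(sizeof(le_uint16) == 2, "wrapper must not add padding");

// An I/O structure in the spirit of the notional example in this thread.
struct wire_header {
    le_uint16 name_len;
    char      name[1];
};
```

With this, `h.name_len = 0x0102;` stores the bytes 02 01, and `h.name_len + 1` yields a native integer. The conversion happens on every access, which is the cost the thread is arguing about.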

> The objection is raised that this language feature might be used
> inappropriately, not in an I/O context. C++ is rope. How you use
> it is up to you. Compared to operator overloading, this is nothing.

C++ never stops you from shooting yourself in the foot, but its
language features let you check for mistakes in many places. So I'm
afraid I don't quite get the argument.

> At the very least, the compiler should provide a built-in similar to
> __FILE__ to denote endianism, saving the library writer the need to
> depend on header files.

But you can have that nowadays right away. Autoconf has a built-in
check for that, and generates a define for exactly that if you need
it. Again, I don't see your point. If you need it, it's right in front
of your face to grab it.
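A sketch of how that looks in practice. Autoconf's AC_C_BIGENDIAN defines WORDS_BIGENDIAN on big-endian hosts, and GCC and Clang predefine `__BYTE_ORDER__`. The macro name `HOST_BIG_ENDIAN` and the function `host_is_little_endian` are invented for this example, and the final fallback to little-endian is an assumption, not something the compiler guarantees.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Runtime probe: inspect the first byte of a known integer.
inline bool host_is_little_endian() {
    const std::uint32_t probe = 1;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Compile-time answer where one is available.
#if defined(WORDS_BIGENDIAN)                 /* set by Autoconf's AC_C_BIGENDIAN */
#  define HOST_BIG_ENDIAN 1
#elif defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__) && \
      (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)   /* GCC / Clang predefines */
#  define HOST_BIG_ENDIAN 1
#else
#  define HOST_BIG_ENDIAN 0                  /* ASSUMPTION: little-endian otherwise */
#endif
```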

> I fail to see any advantage to requiring the programmer to detect
> something known full well to the compiler.

And why should the compiler detect it if there are already tools that
can do that for you right away?

Greetings,
Thomas

Tobias Müller

May 14, 2013, 2:44:09 AM
Andy Champ <no....@nospam.invalid> wrote:
> Incidentally, I think Intel have it right, and we humans have it
> wrong - least significant digit first means that you can read the
> number in one pass.

But inside one byte, the bits are again ordered the other way round.
I would agree with you if the bits were ordered in the same way, but a
different order for bits and bytes is just awkward.

> 1234 should be one and twenty and three hundred and four thousand.

Yes, but with the usual little endian representation it's 2143. (Not
really but similar)

Tobi

Joshua Maurice

May 14, 2013, 1:33:26 PM
On May 13, 11:11 pm, "James K. Lowden" <jklow...@speakeasy.net> wrote:
> The serialization system would recognize "node" as a UDT, taken from
> the list of types provided by the compiler, and would therefore have
> access to the metadata array describing the members. Pointers are
> denoted as offsets into the stream. In reality, the stream reflects
> what the compiler itself must do to maintain the graph in memory.
> (Because, after all, pointers are just offsets from zero into the
> linear address space we call "memory".)
>
> Of course, nothing prevents a graph built from such a structure from
> having cycles. OTOH nothing prevents the serializer from detecting
> cycles.

I take it you are talking about something very much like the Java
standard serialization scheme, and how more or less you can use a
generic piece of code to serialize any object graph in Java. You can
do that in Java. It can even handle cycles.

You cannot do this with generic C++. You can do this with Java because
of the very strong type safety. C and thus C++ have inherited some
weaker typing rules that make this impossible in the general case.

How would you serialize this data structure?

struct Foo { void* p; size_t x; size_t y; };

What's the length of the data pointed-to by the void pointer? Is it
double data? Integer data? String data? UTF-8 or some other encoding?

Throw in legal but obfuscated pointer arithmetic, and you've
already lost.
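To make the difficulty concrete, here is a sketch of a serializer for Foo that only works under a convention the type itself cannot express. The assumption that `p` points to `x * y` doubles, and the names `put_le64` and `serialize_as_double_grid`, are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

struct Foo { void* p; std::size_t x; std::size_t y; };

static_assert(sizeof(double) == sizeof(std::uint64_t),
              "sketch assumes a 64-bit double");

// Append a 64-bit value as eight little-endian bytes.
inline void put_le64(std::vector<unsigned char>& out, std::uint64_t v) {
    for (int i = 0; i < 8; ++i)
        out.push_back(static_cast<unsigned char>((v >> (8 * i)) & 0xFFu));
}

// ASSUMPTION, known only to the programmer: f.p points to f.x * f.y doubles.
// Nothing in Foo's declaration says so; a generic serializer could not know.
inline std::vector<unsigned char> serialize_as_double_grid(const Foo& f) {
    const std::size_t count = f.x * f.y;
    assert(f.p != nullptr || count == 0);

    std::vector<unsigned char> out;
    put_le64(out, f.x);
    put_le64(out, f.y);

    const double* d = static_cast<const double*>(f.p);
    for (std::size_t i = 0; i < count; ++i) {
        std::uint64_t bits = 0;
        std::memcpy(&bits, &d[i], sizeof bits);   // reinterpret the double's bits
        put_le64(out, bits);
    }
    return out;
}
```

If `p` actually pointed at string data, the same call would compile and silently produce garbage, which is the point being made above.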

Andy Champ

May 14, 2013, 1:33:56 PM
On 14/05/2013 07:44, Tobias Müller wrote:
> Andy Champ <no....@nospam.invalid> wrote:
> > Incidentally, I think Intel have it right, and we humans have it
> > wrong - least significant digit first means that you can read the
> > number in one pass.
>
> But inside one byte, the bits are again ordered the other way round.
> I would agree with you if the bits were ordered in the same way, but a
> different order for bits and bytes is just awkward.
>
Are they? How do you know?

I always regard memory as a grid. The address is vertical, and the bit
position is horizontal.


> > 1234 should be one and twenty and three hundred and four thousand.
>
> Yes, but with the usual little endian representation it's 2143. (Not
> really but similar)
>
That's an artefact of humans using decimal, and computers using bytes -
you're treating it as a BCD number.

1234 read least-significant first is 0x4d2. In LSB-first memory that's
one byte of D2, and another of 04.

In memory it's
11010010
00000100

(or possibly
01001011
00100000)
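The two layouts above can be reproduced with a few lines of code. This is only an illustration of the arithmetic; the helper names are invented, and the order in which bits are printed is a display choice, not something the hardware exposes.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Low and high bytes of a 16-bit value, extracted arithmetically.
inline unsigned char low_byte(std::uint16_t v) {
    return static_cast<unsigned char>(v & 0xFFu);
}
inline unsigned char high_byte(std::uint16_t v) {
    return static_cast<unsigned char>(v >> 8);
}

// Render a byte with the most significant bit on the left.
inline std::string bits_msb_first(unsigned char b) {
    std::string s;
    for (int i = 7; i >= 0; --i) s += ((b >> i) & 1) ? '1' : '0';
    return s;
}

// Render a byte with the least significant bit on the left.
inline std::string bits_lsb_first(unsigned char b) {
    std::string s;
    for (int i = 0; i < 8; ++i) s += ((b >> i) & 1) ? '1' : '0';
    return s;
}
```

For 1234 (0x04D2), the low byte is D2 and the high byte is 04. Printed MSB-first they give the first pair of rows above, and printed LSB-first they give the second pair.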

Andy

Bart van Ingen Schenau

May 14, 2013, 5:08:22 PM
Trees are not that difficult to serialize. How about a slightly more
complex structure:

class X {
    struct t {
        size_t a;
        char* b;
    };

    size_t c;
    union {
        char d[sizeof(t)];
        t e;
    } f;
};


An astute reader might recognize this as a possible layout for
std::string with short-string optimization. I have deliberately used non-
descriptive names for the members, because the compiler can't derive
meaning from the names either.
What would an auto-generated serialization format for this structure look
like?

<snip>

> The problem as I see it is that the type system is unavailable at
> runtime. The information I'm describing -- class hierarchy, member
> structure -- is discarded by the compiler (except insofar as it's made
> available to the debugger).
>
> Although the vogue term is "reflection", the idea is older than ancient.
> Classes in Smalltalk could be interrogated at runtime. (Heck, IIRC
> classes could be *modified* at runtime. But we won't go there!)

But, having reflection is not sufficient to do proper serialization of
user-defined types.
Look at the type X declared above. An auto-generated, reflection-based
serialization scheme would probably be quite complex. But a hand-written
serialization routine would probably use the same format as for a char
array. All the rest is not an intrinsic part of the type, but is present
to allow certain optimizations and does not need to be included in the
serialization format.

>
> --jkl

Bart v Ingen Schenau

James K. Lowden

May 16, 2013, 8:40:16 AM
On Mon, 13 May 2013 23:41:14 -0700 (PDT)
Thomas Richter <th...@math.tu-berlin.de> wrote:

> > The minimum I would like to see is the ability to iterate over the
> > members of a structure.
>
> Again, I typically don't need that, but it would include some
> overhead. For example, the structure layout would likely need to be
> stored somewhere at run time. I don't need this overhead. But if you
> do, I'm sure a library solution is feasible which does that.

Really? Name one.

The compiler does not supply the information that a library would need
to execute generic serialization. That is why every serialization
library used by C++ requires the buffers to be externally defined.

> > We spend far too much time today dealing with I/O, recreating by
> > hand the very metadata discarded by the compiler. With support
> > from the compiler, C++ could provide easy, standard, robust,
> > efficient, reliable I/O for user-defined types of arbitrary
> > complexity. Until then, we'll party on like it's 1985.
>
> I would rather say, you picked the wrong tool for the job in first
> place. I don't know what you do or which types of problems you want to
> solve, but C++ does not sound like the solution for your problem.

I think it's safe to say I understand the problem I'm trying to solve
better than you do.

> If your programs are mostly I/O bound, I would check for programming
> languages that offer a higher abstraction level than C++.

I pretty much have to use C++, because I'm writing, um, C++ libraries.
I didn't say the programs were I/O bound. I am saying they're I/O
awkward. That matters because any useful computation requires input
and output.

> > Why do people think pointers can't be serialized?
>
> In specific, the above "serialization" cannot distinguish
> between two pointers that point to the identical object, and two
> pointers that point to similar objects.

Pointers are nothing more than offsets, counters from zero into linear
memory. To write them out and read them back in requires only to
encode and decode them. If it couldn't be done, "marshalling" wouldn't
be part of our working vocabulary.
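A minimal sketch of that encode/decode step for one simple case, a singly linked list that may contain a cycle. The types `Node` and `FlatNode` and the functions `marshal` and `unmarshal` are hypothetical names, and the sketch relies on the programmer knowing the node type, which is exactly the metadata under discussion.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

struct Node     { int value; Node* next; };
struct FlatNode { int value; long next_index; };   // -1 encodes a null pointer

// Replace each pointer by the index of the node it refers to.
// A node seen twice ends the walk, so cycles are handled.
inline std::vector<FlatNode> marshal(const Node* head) {
    std::map<const Node*, long> index_of;
    std::vector<const Node*> order;
    for (const Node* n = head;
         n != nullptr && index_of.find(n) == index_of.end();
         n = n->next) {
        index_of[n] = static_cast<long>(order.size());
        order.push_back(n);
    }

    std::vector<FlatNode> out;
    for (std::size_t i = 0; i < order.size(); ++i) {
        const Node* next = order[i]->next;
        FlatNode f = { order[i]->value, next ? index_of[next] : -1L };
        out.push_back(f);
    }
    return out;
}

// Rebuild the nodes and turn indices back into pointers.
// The pointers refer into the returned vector's storage, so the result
// should be kept (or moved), not copied element by element.
inline std::vector<Node> unmarshal(const std::vector<FlatNode>& flat) {
    std::vector<Node> nodes(flat.size());
    for (std::size_t i = 0; i < flat.size(); ++i) {
        assert(flat[i].next_index < static_cast<long>(flat.size()));
        nodes[i].value = flat[i].value;
        nodes[i].next  = flat[i].next_index < 0
                             ? nullptr
                             : &nodes[static_cast<std::size_t>(flat[i].next_index)];
    }
    return nodes;
}
```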

> The problem is just that you now assume networking to be part of the
> language

No. The problem is you're looking through the wrong end of the
telescope. You seem to think the problem is unimportant and can be
solved with a library. It cannot, and it matters.

We have typeof and sizeof and offsetof. We do not have, say, memberof
or nameof or parentof or childof. We have, in short, incomplete
ad hoc access to program metadata. Without complete access to that
data, it is impossible to do many things that can be done in Java or
Smalltalk or Python.
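To illustrate how far sizeof and offsetof go today, here is a sketch of a member table that the programmer has to write and maintain by hand. The struct `Packet`, the type `MemberInfo`, and the function `find_member` are invented for this example; nothing here is generated by the compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct Packet {
    std::uint16_t id;
    std::uint32_t length;
    char          tag[4];
};

// One row of hand-written metadata per member.
struct MemberInfo {
    const char* name;
    std::size_t offset;
    std::size_t size;
};

// Must be kept in sync with Packet manually; the language offers no
// way to enumerate the members automatically.
static const MemberInfo packet_members[] = {
    { "id",     offsetof(Packet, id),     sizeof(Packet::id)     },
    { "length", offsetof(Packet, length), sizeof(Packet::length) },
    { "tag",    offsetof(Packet, tag),    sizeof(Packet::tag)    },
};
static const std::size_t packet_member_count =
    sizeof(packet_members) / sizeof(packet_members[0]);

// Look a member up by name; returns nullptr if there is no such member.
inline const MemberInfo* find_member(const char* name) {
    assert(name != nullptr);
    for (std::size_t i = 0; i < packet_member_count; ++i)
        if (std::strcmp(packet_members[i].name, name) == 0)
            return &packet_members[i];
    return nullptr;
}
```

The duplication between the struct definition and the table is the part that a "memberof" facility, if it existed, would remove.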

Many people wrongly assume that's because C++ is compiled. It's not.
It's because of *how* it's compiled. It's because the C++ compiler
discards the metadata, or makes it available only in nonstandard ways
to the debugger, rather than to the language itself.

It applies to more than I/O, too. For example, what happens when you
dereference a NULL pointer? You get a PDP-11-era "segmentation fault"
and, if you're lucky, a core file. Then, with the help of the debugger
and a symbol table, you can track down the problem, maybe. (Another
example of interpreting serialized pointers, btw.)

Now, suppose you could trap SIGSEGV and walk the stack in your program,
using standard functions to identify each object by name and type.
Better, because if that information were made available in a standard
way, a standard library could evolve to deal with it, and you could
just call, say, std::stacktrace in your exception handler.

> If most of your job is writing serialization code, this is even more
> an indicator that C++ is the wrong choice for the problem.

Do this for me, Richard. Examine every C++ ODBC library you can find,
and think about why they're such a pain in the tukkus. Go ahead, I'll
wait. While you're there, consider whether or not there might not be a
large class of programs that would benefit from more convenient -- nay,
*standard* -- access to a DBMS.

--jkl

James K. Lowden

May 16, 2013, 8:44:12 AM
On Tue, 14 May 2013 10:33:26 -0700 (PDT)
Joshua Maurice <joshua...@googlemail.com> wrote:

> On May 13, 11:11 pm, "James K. Lowden" <jklow...@speakeasy.net> wrote:
> > The serialization system would recognize "node" as a UDT, taken
> > from the list of types provided by the compiler, and would
> > therefore have access to the metadata array describing the
> > members. Pointers are denoted as offsets into the stream. In
> > reality, the stream reflects what the compiler itself must do to
> > maintain the graph in memory. (Because, after all, pointers are
> > just offsets from zero into the linear address space we call
> > "memory".)
> >
> > Of course, nothing prevents a graph built from such a structure
> > from having cycles. OTOH nothing prevents the serializer from
> > detecting cycles.
>
> I take it you are talking about something very much like the Java
> standard serialization scheme, and how more or less you can use a
> generic piece of code to serialize any object graph in Java. You can
> do that in Java. It can even handle cycles.

Yes. Not that I admire Java; I don't. But I do think the C++
community has a blind spot when it comes to I/O.

> You cannot do this with generic C++. You can do this with Java
> because of the very strong type safety. C and thus C++ have
> inherited some weaker typing rules that make this impossible in the
> general case.
>
> How would you serialize this data structure?
>
> struct Foo { void* p; size_t x; size_t y; };

Given the language as it stands today, that can't be done, it's true.
It might well represent

double[10][10];

and there's no way to know. And that's exactly what I'm saying should
change.

Consider: the compiler (or the heap) knows exactly the extent of the
memory pointed to by Foo::p. The problem is not that "pointers can't
be serialized". The problem is that the compiler provides
insufficient information to the would-be serializer.

Please note this is not a deficiency of the type system. The compiler
has no trouble dealing with your struct Foo. It's just a little
tight-lipped about how.

Even if you can demonstrate that it's impossible in the general case
-- something I'm not going to make easy! -- must we then say there
isn't a large class of problems for which it *would* be useful?

I would say it falls in the same class of operations as the copy
constructor. The compiler generates one for you if you invoke it. If
it doesn't DTRT you're well advised to write your own.

> Throw in legal but obfuscated pointer arithmetic, and you've
> already lost.

Not at all. If it's valid C++, it's representable externally, too,
just as in memory.

I'm actually saying something extremely simple and very easy to agree
with: if all the aspects of the program were available *to* the
program, the program could do exactly what the machine does. No one
thinks it's peculiar that a debugger chases pointers, or that gdb can
recapture them from a core file. We can argue about need and cost,
but the answer to feasibility is found in implementations all around
us.

--jkl

James K. Lowden

May 16, 2013, 8:47:52 AM
On Tue, 14 May 2013 15:08:22 CST
Bart van Ingen Schenau <ba...@ingen.ddns.info.invalid> wrote:

> Trees are not that difficult to serialize. How about a slightly more
> complex structure:
>
> class X {
>     struct t {
>         size_t a;
>         char* b;
>     };

As I mentioned elsewhere, it's necessary in the general case for the
compiler to provide the extent as well as the value of a pointer. IOW

sizeof(X::t::b) == sizeof(char*)
X x;
x.t.b = char s[10];
extentof(x.t.b) == 10;

Every pointer -- static, free store, or automatic -- always has some
number of bytes allocated to it. (That number might be zero.) The
language deficiency is that it does not make that information
available to the programmer. Instead, it requires the programmer to
track it independently and duplicatively. And often, it might be
noted, incorrectly.

Someone will object that keeping track of the size of memory allocated
to a pointer will add 8 bytes to every pointer. Not true! Remember,
every time you say

char *s = "hello";

the compiler set aside those 6 bytes and placed the next variable
*after* them. Change it just a little

char s[] = "hello";

and suddenly sizeof(s) works. Yet the pointer is the same size. Move
to the heap

char *s = malloc(6);

and the heap must do as the compiler does, setting aside 6 bytes. I'm
simply pointing out that the language could expose that fact with

extentof(s);

at *no* cost. Not just a little: none. The information is already
there, in the executable image, or on the stack, or in the free store.
What's missing is a bit of syntax.
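The array-versus-pointer difference mentioned above can be shown directly. In this sketch, `extent_of` is a made-up helper, not the proposed `extentof`. It works only because the extent is part of an array's type, and it cannot be called on a plain pointer, where that information is no longer in the type.

```cpp
#include <cassert>
#include <cstddef>

// Deduce the extent of a true array from its type.
// Deliberately has no overload for pointers: once an array has decayed
// to a pointer, the type no longer carries the size.
template <typename T, std::size_t N>
constexpr std::size_t extent_of(T (&)[N]) {
    return N * sizeof(T);
}
```

Given `char s[] = "hello";`, both `sizeof(s)` and `extent_of(s)` are 6, while for `const char* p = "hello";` the expression `sizeof(p)` is just the size of a pointer and `extent_of(p)` does not compile.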

>     size_t c;
>     union {
>         char d[sizeof(t)];
>         t e;
>     } f;
> };

At first glance, this seems no problem at all, insofar as sizeof(f) is
known at compile time. The problem I think you're alluding to is that
two different compilers might arrange f differently, and nothing about
the bit pattern of the union tells us what to do.

My answer is simple, once again, although at a trivial cost. It must
be possible to know which member of f was last written. Why? Because
if f.t was written, serialization demands its endianism be honored.

One might hope, though, that this sort of malarky might fade into
history if endianism were dealt with in the language proper.

--jkl

Peter C. Chapin

May 16, 2013, 12:59:59 PM
On Thu, 16 May 2013, James K. Lowden wrote:

> Many people wrongly assume that's because C++ is compiled. It's
> not. It's because of *how* it's compiled. It's because the C++
> compiler discards the metadata, or makes it available only in
> nonstandard ways to the debugger, rather than to the language
> itself.

It sounds like you are asking for a standard facility for doing
reflection in C++. I agree that would be a useful thing at times. The
language does have a few bits and pieces of that feature now (I'm
thinking of RTTI, for example), but that's clearly a shadow of what
full reflection support would be like.

I wonder, though, about the runtime costs such a capability would
impose and if it could be implemented in such a way as to have zero
cost for the many programs that don't need it.

Peter

Edward Rosten

May 16, 2013, 1:08:42 PM
On Thu, 16 May 2013 05:44:12 -0700, James K. Lowden wrote:

> Consider: the compiler (or the heap) knows exactly the extent of the
> memory pointed to by Foo::p. The problem is not that "pointers
> can't be serialized". The problem is that the compiler provides
> insufficient information to the would-be serializer.
>
> Please note this is not a deficiency of the type system. The
> compiler has no trouble dealing with your struct Foo. It's just a
> little tight-lipped about how.
>
> Even if you can demonstrate that it's impossible in the general case
> -- something I'm not going to make easy! -- must we then say there
> isn't a large class of problems for which it *would* be useful?

That's not entirely obvious to me.

The heap knows the extent of the memory which is being pointed to and
could go and serialise it. However, what it doesn't and can't easily
know is the type.

Is Foo::p pointing to a double inside an array of doubles or is it
pointing to the double inside:

struct Bar
{
    double d;
    some_other_type* a_pointer;
};

> I'm actually saying something extremely simple and very easy to
> agree with: if all the aspects of the program were available *to*
> the program, the program could do exactly what the machine does. No
> one thinks it's peculiar that a debugger chases pointers, or that
> gdb can recapture them from a core file. We can argue about need
> and cost, but the answer to feasibility is found in implementations
> all around us.

The debugger can always chase Foo::p and get what it's pointing to. It
can then increment the pointer by sizeof(double) and show you what the
next set of bits look like in double form, but it can't tell you if
that operation is a valid one.

-Ed

Edward Rosten

May 16, 2013, 1:11:31 PM
On Thu, 16 May 2013 05:40:16 -0700, James K. Lowden wrote:

> Now, suppose you could trap SIGSEGV and walk the stack in your
> program, using standard functions to identify each object by name
> and type. Better, because if that information were made available
> in a standard way, a standard library could evolve to deal with it,
> and you could just call, say, std::stacktrace in your exception
> handler.

That would severely hobble the optimizer.

While C++ certainly has a stack (or if you prefer a FILO for automatic
variables), the standard says very little about it, purposefully.

Once a modern optimizer has got its hands on your code, and has done
passes of strip mining, loop unswitching, CSE elimination, value
propagation, inlining, scalar replacement and so on, the resulting
machine code often bears surprisingly little resemblance to the
original code.

This is a good thing. It allows us to write libraries that use
intermediate proxy objects which end up as efficient as if we wrote
the code out long-hand without the library.

But the result is that not only might the objects not actually exist
in memory, but the position in the code might not even correspond to a
line in the program in any meaningful manner.

There is some information there, but the results of things like
debuggers and profilers can get very inaccurate at high optimization
levels for certain types of code because of the optimizations. In
order to make something like std::stacktrace useful, constraints would
have to be placed on what the optimizer isn't allowed to mess up.

-Ed

Bart van Ingen Schenau

May 17, 2013, 5:49:47 AM
On Thu, 16 May 2013 05:47:52 -0700, James K. Lowden wrote:

> On Tue, 14 May 2013 15:08:22 CST
> Bart van Ingen Schenau <ba...@ingen.ddns.info.invalid> wrote:
>
>> Trees are not that difficult to serialize. How about a slightly
>> more complex structure:
>>
>> class X {
>>     struct t {
>>         size_t a;
>>         char* b;
>>     };
>
> As I mentioned elsewhere, it's necessary in the general case for the
> compiler to provide the extent as well as the value of a pointer.
<snip - explanation about extentof >
>
>>     size_t c;
>>     union {
>>         char d[sizeof(t)];
>>         t e;
>>     } f;
>> };
>
> At first glance, this seems no problem at all, insofar as sizeof(f)
> is known at compile time. The problem I think you're alluding to is
> that two different compilers might arrange f differently, and
> nothing about the bit pattern of the union tells us what to do.

No, the problem I am alluding to is that the compiler can't know which
member of the union is supposed to be valid and thus if it can chase
the pointer in the structure at all.

Here is the same class again with more meaningful names:

class String {
    struct large_t {
        size_t capacity;
        char* data;
    };

    size_t length;
    union {
        char short[sizeof(large_t)];
        large_t large;
    } value;
};

For short strings (up to sizeof(large_t)), the string data is stored
directly in value.short. For larger strings, a dynamically allocated
area, referred to by value.large.data is used.

How would the compiler decide when to chase the value.large.data
pointer and when to just dump the bytes from value.short?
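For comparison, a hand-written serializer can answer that question because its author knows the rule that ties `length` to the active union member. The following is a simplified sketch under that assumption. It uses the member name `small_buf` instead of `short` (which is a reserved word in real C++), stores no terminating NUL, and is not how any particular standard library implements its string.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

class String {
public:
    explicit String(const char* s) : length_(std::strlen(s)) {
        if (is_small()) {
            for (std::size_t i = 0; i < length_; ++i)
                value_.small_buf[i] = s[i];
        } else {
            value_.large.capacity = length_;
            value_.large.data = new char[length_];
            std::memcpy(value_.large.data, s, length_);
        }
    }
    ~String() { if (!is_small()) delete[] value_.large.data; }
    String(const String&) = delete;
    String& operator=(const String&) = delete;

    // Wire format is the same as for a plain char array: just the bytes.
    // Capacity and the choice of representation are not written out.
    std::vector<unsigned char> serialize() const {
        const char* p = is_small() ? value_.small_buf : value_.large.data;
        return std::vector<unsigned char>(p, p + length_);
    }

private:
    struct large_t { std::size_t capacity; char* data; };

    // The rule a generic, reflection-based serializer would not know:
    // the length decides which union member is meaningful.
    bool is_small() const { return length_ <= sizeof(large_t); }

    std::size_t length_;
    union {
        char    small_buf[sizeof(large_t)];
        large_t large;
    } value_;
};
```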

> My answer is simple, once again, although at a trivial cost. It
> must be possible to know which member of f was last written. Why?
> Because if f.t was written, serialization demands its endianism be
> honored.

Are you really proposing to add a hidden member to all unions to track
which member was last written to? Just in case it might need
serialization and the endianness might matter? And have you
considered that your proposed serialization feature might be
standardized to use an endian-neutral serialization format?

> One might hope, though, that this sort of malarky might fade into
> history if endianism were dealt with in the language proper.

I would not hold my breath. For either.

{ Quoted signature removed -mod }

Bart v Ingen Schenau

Öö Tiib

May 17, 2013, 5:55:53 AM
On Thursday, 16 May 2013 15:47:52 UTC+3, James K. Lowden wrote:
> Every pointer -- static, free store, or automatic -- always has some
> number of bytes allocated to it. (That number might be zero.)

That is in very limited meaning of "every". There are pointers that
point at sub-objects of other objects, pointers that point at elements
in containers or arrays and pointers that point at one past last
elements of arrays. Such might need to be serialized as well.

> The language deficiency is that it does not make that information
> available to the programmer.

It often lacks that information. It is up to programmer to bind all
the information that he needs into properties of types or objects from
what he designs the software. Several such types (with more
information bound around pointers) are readily available in standard
library (like containers, iterators or smart pointers). A 'char*' or
'void*' can point almost anywhere and be valid doing so. So programmer
must know what it is if he uses those.

> Instead, it requires the programmer to track it independently and
> duplicatively. And often, it might be noted, incorrectly.

Current C++ may be used without declaring any raw pointers, never
using keyword 'delete' and using keyword 'new' only to initialize
unique_ptr. Even that because C++11 forgot to add 'make_unique' that
will likely be added by C++14. It might be inconvenient or inefficient
at places but it can be used like that. So it can't be said that C++
requires programmers to track pointers. It is purely matter of free
will at the moment.
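A small sketch of that style in C++11, with `Record` and `make_record` as made-up names. The only `new` appears in the initialization of a `unique_ptr`, and the extents travel with the containers rather than being tracked separately.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

struct Record {
    std::string         name;      // knows its own length
    std::vector<double> samples;   // knows its own size
};

// No raw pointer is declared and 'delete' is never written;
// ownership is released automatically when the unique_ptr goes away.
inline std::unique_ptr<Record> make_record(const std::string& name,
                                           std::size_t sample_count) {
    std::unique_ptr<Record> r(new Record());
    r->name = name;
    r->samples.assign(sample_count, 0.0);
    return r;
}
```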

> Someone will object that keeping track of the size of memory
> allocated to a pointer will add 8 bytes to every pointer. Not true!
> Remember, every time you say
>
> char *s = "hello";
>
> the compiler set aside those 6 bytes and placed the next variable
> *after* them.

I don't think so. That 's' may be made to point into the middle of some
"yellohello" in read-only memory, since only the observable behavior
must stay the same.

Edward Diener

May 18, 2013, 9:45:59 AM
The compiler knows no such thing. It only knows that 'void* p' is a
pointer. What you think the "heap" ( whatever that is ) knows you need
to explain.

You cannot expect a run-time type system to do analysis of a program's
instructions at run-time to determine what a particular serialization
should be. The overhead of that would be ridiculous even if it were
possible.

You have made a good point that the compiler could have enough
information to do run-time serialization of some objects, so there is
no need to throw out your ideas entirely. But given the much more
liberal notation of data in C++ ( in particular C++'s pointer notation
) than in other popular statically typed languages, doing automatic
serialization of all UDTs would be impossible without the end-user
specifying a notation, a la IDL, to do some of it.

I also believe that the possibility of run-time type programming could
be implemented in C++ with greater support from the compiler, but I am
not naive enough to think this can be done with every UDT without some
sort of extended notation ( syntax ).

James K. Lowden

May 21, 2013, 4:21:06 AM
On Fri, 17 May 2013 02:49:47 -0700 (PDT)
Bart van Ingen Schenau <ba...@ingen.ddns.info.invalid> wrote:

> How would the compiler decide when to chase the value.large.data
> pointer and when to just dump the bytes from value.short?
>
> > My answer is simple, once again, although at a trivial cost. It
> > must be possible to know which member of f was last written. Why?
> > Because if it was written, serialization demands its endianism be
> > honored.
>
> Are you really proposing to add a hidden member to all unions to track
> which member was last written to?

Yes, approximately.

I am not saying that unions should acquire a hidden member contiguous
with the union's memory. The compiled program need only track which
member was written to.

We agree, yes, that data members can be represented as a table, much as
virtual functions are represented as a vtable? Let's pretend there is
such a thing, and call it a "dtable". Like the vtable, the dtable need
not be *in* the union/struct/class, need not perturb the memory
layout.

A dtable would typically have only a handful of rows, because most
structures (I bet) have less than a dozen or so members. Surely 255
members in a union is rare. So usually one byte per instance would
suffice to track which member was last written to.

Such a byte could be used to automatically throw an exception in the
event the union is used for "type punning". I happen to like type
punning and am somewhat baffled as to why the compiler writers were
allowed to prohibit it, but if that's to be the rule, then this would
be a feature.
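What such tracking might look like if written by hand today is sketched below. The class `IntOrFloat` is hypothetical, and unlike the scheme described above it stores the tag byte next to the union, so it does change the object's size. It is meant only to show the bookkeeping and the exception on a mismatched read.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

class IntOrFloat {
public:
    enum Member : unsigned char { None, Int, Float };

    IntOrFloat() : last_written_(None) {}

    void set_int(std::int32_t v) { u_.i = v; last_written_ = Int; }
    void set_float(float v)      { u_.f = v; last_written_ = Float; }

    // Reads are checked against the one-byte tag.
    std::int32_t get_int() const { require(Int);   return u_.i; }
    float        get_float() const { require(Float); return u_.f; }

    // A serializer could consult this to decide which member to write.
    Member last_written() const { return last_written_; }

private:
    void require(Member m) const {
        if (last_written_ != m)
            throw std::logic_error("read of a union member that was not last written");
    }

    union { std::int32_t i; float f; } u_;
    Member last_written_;
};
```

C++17's `std::variant` later standardized a similar idea as a library type, with `index()` reporting the active alternative.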

> Just in case it might need serialization and the endianness might
> matter?

It's not obvious to me that the compiler would generate metadata "just
in case" but rather "in the event". Metadata aren't needed unless
referenced; perhaps like templates they need not be generated
unless and until they're referenced.

The simplest way I can see to address the binary-bloat concern is to
make metadata generation optional, a la RTTI, and let link-time errors
inform the user of the need to turn it on. To make those errors
clearer, the compiler might generate a marker, say,
__SERIALIZATION__ to enable compile-time detection of metadata
availability. That would permit a linker to report that object A in
a.o needs to be compiled with the -serializable option because it did
not provide the symbol required by libserialization.so.

Actually, there is a simpler way: keep the concern in proportion. The
dtable exists once per struct/class. Let's guess that each name in a
struct requires N bytes of metadata, plus the characters of the name
itself, where 4 <= N <= 16 (unless someone can show otherwise). For a
typical int->string map, whose value_type is std::pair<int,string>, the
dtable might be something like 64 bytes if the row size is on the high
side and the names are longish. By one measure, that might be seen as
2X cost; the metadata might be as big as the structure itself. OTOH,
the metadata is per type, not per instance, and data structures are
typically dwarfed by code and data size. It would be interesting to
see it implemented and to measure the effect on something like, say, Qt. I'd
lay my chips in the under 1% range.

Bear in mind that the dtable is not 100% cost and could well prove to
be a net savings. After all, it would enable the creation of
libraries to replace what today is bespoke code. If you measure cost
as functionality over lines of code, that's a clear win. Can
individual programmers write more efficient operator<< methods than
would be implemented in a library? Some, perhaps, but not on
average. And they would still be able to do that; they just won't have
to.

Another way to look at the question is that, as everyone reading this
list knows, efficiency and correctness are improved any time logic is
reduced to a table.

> And have you considered that your proposed serialization feature
> might be standardized to use an endian-neutral serialization format?

If wishes were horses then beggars would ride. ;-)

That indeed is where I would like to end up. If structure metadata
were exposed in the language, I'm sure the denizens of Boost would have
a field day. We might see JNI compatibility, and json/yaml/fotm
(flavor of the month). I personally would like to see library
support for scatter/gather I/O because it's essential to efficient
DBMS interfaces.

--jkl

James K. Lowden

May 21, 2013, 4:21:28 AM
On Thu, 16 May 2013 10:11:31 -0700 (PDT)
Edward Rosten <firstname.d...@googlemail.com> wrote:

> > Now, suppose you could trap SIGSEGV and walk the stack in your
> > program, using standard functions to identify each object by name
> > and type.
>
> That would severely hobble the optimizer.
....
> Once a modern optimizer has got its hands on your code, and has done
> passes of strip mining, loop unswitching, CSE elimination, value
> propagation, inlining, scalar replacement and so on, the resulting
> machine code often bears surprisingly little resemblance to the
> original code.
....
> But the result is that not only might the objects not actually exist
> in memory, but the position in the code might not even correspond to a
> line in the program in any meaningful manner.

Well, it wouldn't really hobble the optimizer, would it? The report
might be misleading in some circumstances, nothing we don't deal with
already. There's still a call stack, even if some functions are
inlined.

--jkl

James K. Lowden

May 21, 2013, 4:23:37 AM
On Thu, 16 May 2013 09:59:59 -0700 (PDT)
"Peter C. Chapin" <PCh...@vtc.vsc.edu> wrote:

> On Thu, 16 May 2013, James K. Lowden wrote:
>
> > Many people wrongly assume that's because C++ is compiled. It's
> > not. It's because of *how* it's compiled. It's because the C++
> > compiler discards the metadata, or makes it available only in
> > nonstandard ways to the debugger, rather than to the language
> > itself.
>
> It sounds like you are asking for a standard facility for doing
> reflection in C++.

Yes, except that "reflection" means different things to different
people. One day, I would like to see the standard library support
automatic serialization of user-defined types. Today, I'm only
lobbying for such changes to the language as would be required to make
that possible.

> The language does have a few bits and pieces of that feature now (I'm
> thinking of RTTI, for example)

Yes.

> but that's clearly a shadow of what full reflection support would be
> like.

Depending on "clearly" and "shadow", maybe. I submit it's already
being done for debuggers. What we're missing is language support for
access to that same information.

There are some difficulties. From what I know of the GNU debug
symbol format, the problem of program metadata hasn't seen the hands of
a data analyst. I wouldn't want to see that become the standard. At
best it's input to a first draft.

> I wonder, though, about the runtime costs such a capability would
> impose and if it could be implemented in such a way as to have zero
> cost for the many programs that don't need it.

It's a fair question, and I have a ready answer. :-)

First, like RTTI, it could be optional. Not even every program *I*
write needs reflection.

Second, it's much less intrusive than RTTI. RTTI changes how a
program behaves. Program metadata are purely passive. Now, that
doesn't mean you can treat it like debug symbols and shear them off
with strip(1), because they might be referenced. But it's hard to see
how providing syntactic access to the metadata would affect execution
efficiency.

Well, actually there's some reason: If we say the program will make
pointer extents accessible, there is some small overhead. For example,
the size of a static character array isn't recorded in the executable
today. It's known to the compiler, and the compiler uses that
information while determining the layout of the program text, but then
discards it. Similarly on the stack for automatics. The "don't
pay for what you don't use" folks will be all over that like white on
paper. See point #1. ;-)

Third, it's hard to make a case for highly efficient metadata. We're
supporting I/O after all. Metadata could reside at the end of
the executable and not even be paged in unless referenced.

Perhaps it would be possible to intelligently emit only referenced
metadata. I don't know, and we don't need to know that at this stage.

The questions today are feasibility and utility and, as ever,
efficiency. To feasibility and efficiency I believe I've provided
reasonable arguments or, if not, am prepared to keep trying.

As to the general utility, that's certainly a matter of opinion, fodder
for the usenet and beyond. To me it's clear I/O matters: all
algorithms have input and output, and all applications, sooner or
later, exchange data with external sources, be they files or databases
or other network connections. If I/O didn't matter, Larry Ellison
wouldn't be a household name, and XML wouldn't be the bane of every
programmer's existence.

A general serialization library for C++, even one that wasn't 100%
machine-independent for void*, would be a huge boon to C++
programmers. For that ever to become a reality, we need changes to the
language.

--jkl

Edward Rosten

May 21, 2013, 7:39:41 AM
On Tue, 21 May 2013 01:21:28 -0700, James K. Lowden wrote:

> Well, it wouldn't really hobble the optimizer, would it?

That depends on how much you want the result to be standardised.

> The report might be misleading in some circumstances, nothing we
> don't deal with already. There's still a call stack, even if some
> functions are inlined.

Going on personal experience, I've had some surprisingly large
programs report an error in `main' when they segfaulted. Naturally,
almost all of the work was done in ancillary functions, so the result
of that was an almost useless report.

I'm not saying it's a common case, but it can happen.

For the change not to interfere with the optimizer, this would have to
be quite close to an optional feature in that there may be anything
from almost no information up to complete information (the type you
get when compiled with full debugging and no optimizations). Of course
on some platforms it would have to be disabled entirely.

I'm not claiming that this is not a useful feature (far from it), just
that it would be tricky to standardise.

-Ed

James K. Lowden

May 21, 2013, 10:02:47 AM
On Sat, 18 May 2013 06:45:59 -0700 (PDT)
Edward Diener <eldi...@tropicsoft.invalid> wrote:

> >> struct Foo { void* p; size_t x; size_t y; };
> >
> > Given the language as it stands today, that can't be done, it's
> > true. It might well represent
> >
> > double[10][10];
> >
> > and there's no way to know. And that's exactly what I'm saying
> > should change.
> >
> > Consider: the compiler (or the heap) knows exactly the extent of the
> > memory pointed to by Foo::p.
>
> The compiler knows no such thing. It only knows that 'void* p' is a
> pointer.

Of course it does, as you well know. The information isn't recorded in
the pointer, but any allocated memory -- no matter how allocated -- has
size. That size is tracked by the compiler/stack/heap. It's C++'s way
of keeping everything from using the same space.

If you constrain yourself to thinking that the only thing the compiler
knows about void* is sizeof(void*) and its value, yes, your Foo can't be
serialized. If you allow the language to be changed to track the size
of any memory assigned to any pointer, it can.

> What you think the "heap" ( whatever that is ) knows you need
> to explain.

What do you want me to call it? The free store? I'm not trying to be
obscure; I'm using standard terminology afaik. Last I checked,
operator new was just malloc in heap's clothing, at least in GNU's
implementation.

> You cannot expect a run-time type system to do analysis of a program's
> instructions at run-time to determine what a particular serialization
> should be.

I don't believe I ever suggested any run-time analysis. All I've ever
mentioned is the potential utility of metadata:

1. static metadata about class/union/struct members
2. the static type-system graph
3. pointer extents
4. last-used union member

Also useful would be

5. __ENDIAN__ or similar, and automatic conversion

Of these #1 is #1, because it would permit automatic handling of
the collection-of-records that are the bread and butter of I/O.
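
For illustration only, here is roughly what item #1 looks like when it is written by hand today. The names Record, MemberInfo, record_members and pack are all hypothetical; the point of the proposal is that the compiler would supply the table instead of the programmer. This is a sketch, not a portable wire format: byte order is left as the host's.

```cpp
#include <cstddef>   // offsetof, std::size_t
#include <cstdint>   // std::int32_t
#include <vector>

// A plain record of the kind that is typically read from or written to a file.
struct Record {
    std::int32_t id;
    double       balance;
    char         name[16];
};

// Hypothetical hand-written stand-in for compiler-provided metadata:
// one name/offset/size entry per member.
struct MemberInfo {
    const char* name;
    std::size_t offset;
    std::size_t size;
};

static const MemberInfo record_members[] = {
    { "id",      offsetof(Record, id),      sizeof(std::int32_t) },
    { "balance", offsetof(Record, balance), sizeof(double) },
    { "name",    offsetof(Record, name),    sizeof(char[16]) },
};

// Copy each member's bytes, in declaration order and without padding,
// into a buffer. Byte order is whatever the host uses.
std::vector<unsigned char> pack(const Record& r) {
    std::vector<unsigned char> out;
    const unsigned char* base = reinterpret_cast<const unsigned char*>(&r);
    for (const MemberInfo& m : record_members) {
        out.insert(out.end(), base + m.offset, base + m.offset + m.size);
    }
    return out;
}
```

The table has to be kept in step with the struct by hand, which is exactly the duplication that language support would remove.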

--jkl

James K. Lowden

May 21, 2013, 10:08:03 AM
On Thu, 16 May 2013 10:08:42 -0700 (PDT)
Edward Rosten <firstname.d...@googlemail.com> wrote:

> Is Foo::p pointing to a double inside an array of doubles or is it
> pointing to the double inside:
>
> struct Bar
> {
> double d;
> some_other_type* a_pointer;
> };
>
> > I'm actually saying something extremely simple and very easy to
> > agree with: if all the aspects of the program were available *to*
> > the program, the program could do exactly what the machine does. No
> > one thinks it's peculiar that a debugger chases pointers, or that
> > gdb can recapture them from a core file. We can argue about need
> > and cost, but the answer to feasibility is found in implementations
> > all around us.
>
> The debugger can always chase Foo::p and get what it's pointing to. It
> can then increment the pointer by sizeof(double) and show you what the
> next set of bits look like in double form, but it can't tell you if
> that operation is a valid one.

True, the heap has no type information. What it does have is the bits
and the extent. The meaning of those bits -- the structure imposed on
them by the programmer -- is known to the compiler, else the program
couldn't run, right?

So, yes, the heap can't distinguish between

> struct Foo { void* p; size_t x; size_t y; };
and
> struct Bar
> {
> double d;
> some_other_type* a_pointer;
> };

but that's not the problem we're discussing. The problem is: given
Foo above, how to know how much to write, or, for that matter, how
much to allocate upon reading.

You may say, correctly, that splatting out a void* buffer from the heap
isn't very portable, that because we know nothing about what it's
pointing to, we have no way of reconstructing the object on the far
side unless the machines agree on many particulars. My take on that
is: so what?

We're wandering down the rabbit hole here, concocting structures with
less and less information. At the same time we're raising the bar if
we insist that such structures be transmitted both faithfully *and*
portably.

Let's exclude void* from consideration of *portable* serialization.
That's really OK. The objective is not to move untyped bits over the
wire; we have write(2) for that. The objective is to move perfectly
ordinary workaday data -- int, double, string, char* -- the everyday
stuff of I/O buffers, the kinds of things handled by XML and JSON and
YAML and DBMSs millions of times every day.

--jkl

Edward Diener

May 22, 2013, 2:15:25 AM
On 5/21/2013 10:02 AM, James K. Lowden wrote:
> On Sat, 18 May 2013 06:45:59 -0700 (PDT)
> Edward Diener <eldi...@tropicsoft.invalid> wrote:
>
>>>> struct Foo { void* p; size_t x; size_t y; };
>>>
>>> Given the language as it stands today, that can't be done, it's
>>> true. It might well represent
>>>
>>> double[10][10];
>>>
>>> and there's no way to know. And that's exactly what I'm saying
>>> should change.
>>>
>>> Consider: the compiler (or the heap) knows exactly the extent of the
>>> memory pointed to by Foo::p.
>>
>> The compiler knows no such thing. It only knows that 'void* p' is a
>> pointer.
>
> Of course it does, as you well know. The information isn't recorded in
> the pointer, but any allocated memory -- no matter how allocated -- has
> size. That size is tracked by the compiler/stack/heap. It's C++'s way
> of keeping everything from using the same space.

The compiler/stack/heap are all different things. From your 'struct'
above the compiler only knows that p is a pointer to void. That pointer
can be to anywhere in memory and can point to anything. It does not have
to point to dynamic memory or be on the stack. I tend to doubt that
today the information ( length in bytes ) about that 'void * p' is kept
anywhere while the program is running by the run-time system.

>
> If you constrain yourself to thinking that the only thing the compiler
> knows about void* is sizeof(void*) and its value, yes, your Foo can't be
> serialized. If you allow the language to be changed to track the size
> of any memory assigned to any pointer, it can.
>
>> What you think the "heap" ( whatever that is ) knows you need
>> to explain.
>
> What do you want me to call it? The free store? I'm not trying to be
> obscure; I'm using standard terminology afaik. Last I checked,
> operator new was just malloc in heap's clothing, at least in GNU's
> implementation.

I know what you mean by "heap". I was only questioning the idea of what
the "heap" knows.

The 'void * p' can point to anything. To serialize it correctly the
run-time system has to know to what it points. I am not saying that is
impossible, but that the run-time system has to have a means to keep
track of it. That may be overhead for which an end-user may not want to
pay ( in size or speed ).

>
>> You cannot expect a run-time type system to do analysis of a program's
>> instructions at run-time to determine what a particular serialization
>> should be.
>
> I don't believe I ever suggested any run-time analysis. All I've ever
> mentioned is the potential utility of metadata:
>
> 1. static metadata about class/union/struct members
> 2. the static type-system graph
> 3. pointer extents
> 4. last-used union member
>
> Also useful would be
>
> 5. __ENDIAN__ or similar, and automatic conversion
>
> Of these #1 is #1, because it would permit automatic handling of
> the collection-of-records that are the bread and butter of I/O.

I am not in principle against a run-time system that can track what you
want it to track but I think you may be underestimating the speed/size
costs as well as the effort involved. If there were overhead I would
want an end-user to be able to opt out of it. Not everyone will agree
that the ability to automatically serialize data should be paid for in
terms of either slower code or bigger code.

Edward Rosten

May 22, 2013, 12:27:06 PM
On Tue, 21 May 2013 07:08:03 -0700, James K. Lowden wrote:

>> The debugger can always chase Foo::p and get what it's pointing to. It
>> can then increment the pointer by sizeof(double) and show you what the
>> next set of bits look like in double form, but it can't tell you if
>> that operation is a valid one.
>
> True, the heap has no type information. What it does have is the bits
> and the extent. The meaning of those bits -- the structure imposed on
> them by the programmer -- is known to the compiler, else the program
> couldn't run, right?


Just to confirm what you mean:

let's say we have:

struct Foo
{
//some variables
double a_double;
//some more variables;
};

Let's say we have a pointer:

double* d;

One can imagine that the information already exists so that the system
can know whether d points to the stack, the heap or some other area not
covered by the other two. Let's assume that it points to the stack.

If d points into the heap somewhere, it is not too much of a stretch to
know where the beginning of the chunk of allocated memory is, and how big
it is.

> So, yes, the heap can't distinguish between
>
>> struct Foo { void* p; size_t x; size_t y; };
> and
>> struct Bar
>> {
>> double d;
>> some_other_type* a_pointer;
>> };
>
> but that's not the problem we're discussing. The problem is: given Foo
> above, how to know how much to write, or, for that matter, how much to
> allocate upon reading.

I'm not sure it does solve that entirely. What it would solve is that the
serialisation system could know that the memory that d points into has
been serialised. That way it could know that (a) the pointee of d has
been written and (b) where it is, so it could then serialise d itself in
some appropriate way.

In principle, if you know the chunk of memory that d points into, then
you could dump out that entire chunk. If d points into double[10], then
it would work (endian and format issues aside). If it points into a struct
with pointers and other data, then the compiler would dump out a bunch of
meaningless information. I suppose that's OK, since that information
wouldn't have been reachable from d anyway, so it does not really matter
if it's filled with junk.

Is this effectively what you're saying?


-Ed

Thomas Richter

May 22, 2013, 12:28:01 PM
On 16.05.2013 14:40, James K. Lowden wrote:
> On Mon, 13 May 2013 23:41:14 -0700 (PDT)
> Thomas Richter<th...@math.tu-berlin.de> wrote:
>
>> > The minimum I would like to see is the ability to iterate over the
>> > members of a structure.
>>
>> Again, I typically don't need that, but it would include some
>> overhead. For example, the structure layout would likely need to be
>> stored somewhere at run time. I don't need this overhead. But if you
>> do, I'm sure a library solution is feasible which does that.
>
> Really? Name one.

You already did. CORBA does solve the problem nicely. Probably "inverse"
from the way you want it because you first have to write the IDL and not
the C++ classes, but anyhow, it'll do. The C++ mapping is awkward, but
there's supposed to be a new one probably this year, or maybe already
last year. I'm no longer actively using CORBA, but it does its job.

If I had the same problem, I would probably write my "classes to be
serialized" in a restricted C++ dialect, then write a yacc or bison grammar
to parse that, and generate the necessary code automatically. All that
would go into a makefile, so the make architecture would ensure that
everything you need is built as you go.
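
A lighter-weight variant of the same idea, for comparison: instead of an external yacc/bison step, the preprocessor can act as the generator (the so-called X-macro technique). This is a different mechanism from the one described above, and POINT_FIELDS, Point and to_text are made-up names for the sketch.

```cpp
#include <sstream>
#include <string>

// Hypothetical field list: the single place where the members are declared.
#define POINT_FIELDS(X) \
    X(int,    x)        \
    X(int,    y)        \
    X(double, weight)

// Expand the list once to declare the struct members...
struct Point {
#define DECLARE_FIELD(type, name) type name;
    POINT_FIELDS(DECLARE_FIELD)
#undef DECLARE_FIELD
};

// ...and once more to generate a "name=value;" text serializer.
std::string to_text(const Point& p) {
    std::ostringstream out;
#define WRITE_FIELD(type, name) out << #name << '=' << p.name << ';';
    POINT_FIELDS(WRITE_FIELD)
#undef WRITE_FIELD
    return out.str();
}
```

The price is that the class has to be declared through the macro rather than in ordinary C++ syntax, which is the same kind of "restricted dialect" trade-off.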

Or I would pick Java right away, which has such things built in.

In either way, I don't need it in C++, and it would be a burden for C++
to add it because it is a service that is adding complexity without much
benefit for everyone. It is a specific solution for a very specific problem.

> The compiler does not supply the information that a library would need
> to execute generic serialization. That is why every serialization
> library used by C++ requires the buffers to be externally defined.

See above, you can supply everything you need just with a bit of
handwork that can be automated.


>> > We spend far too much time today dealing with I/O, recreating by
>> > hand the very metadata discarded by the compiler. With support
>> > from the compiler, C++ could provide easy, standard, robust,
>> > efficient, reliable I/O for user-defined types of arbritrary
>> > complexity. Until then, we'll party on like it's 1985.
>>
>> I would rather say, you picked the wrong tool for the job in first
>> place. I don't know what you do or which types of problems you want to
>> solve, but C++ does not sound like the solution for your problem.
>
> I think it's safe to say I understand the problem I'm trying to solve
> better than you do.

So may it be, but may I then ask why you pick C++ instead of, say, Java?

>> If your programs are mostly I/O bound, I would check for programming
>> languages that offer a higher abstraction level than C++.
>
> I pretty much have to use C++, because I'm writing, um, C++ libraries.

So why do you write C++ libraries if you need to solve database
problems? These are typically I/O bound in my experience, and you don't
need the low-level control C++ offers.

> I didn't say the programs were I/O bound. I am saying they're I/O
> awkward. That matters because any useful computation requires input
> and output.

That is so general that it applies to every program. Actually, a program
without output is pretty much useless, I would say.

>> > Why do people think pointers can't be serialized?
>>
>> In specific, the above "serialization" cannot distinguish
>> between two pointers that point to the identical object, and two
>> pointers that point to similar objects.
>
> Pointers are nothing more than offsets, counters from zero into linear
> memory. To write them out and read them back in requires only to
> encode and decode them. If it couldn't be done, "marshalling" wouldn't
> be part of our working vocabulary.

Sorry, but how is that supposed to work? If I de-serialize two objects,
how can one ensure that they have the same offset they had when they
were serialized? You can't, so it won't work. You need to find some
abstract description of the graph described by the pointers, and this is
problem-dependent and cannot be solved by just "following the pointers".
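
To make the distinction concrete, here is a minimal sketch of such an abstract description for the simplest possible graph, a singly linked chain. Addresses are replaced by indices, so two pointers to the identical object end up with the same index and a cycle survives the round trip. Node, FlatNode and flatten are illustrative names, and the traversal rule ("follow next") is precisely the problem-dependent knowledge that the programmer has to supply.

```cpp
#include <cstddef>
#include <map>
#include <vector>

struct Node {
    int   value;
    Node* next;   // may be null, shared, or part of a cycle
};

// Hypothetical wire form: each node becomes (value, index of next),
// with -1 standing for a null pointer.
struct FlatNode {
    int value;
    int next_index;
};

// Walk the nodes reachable from 'head', giving each distinct address an
// index. The walk stops at null or at the first node already seen.
std::vector<FlatNode> flatten(const Node* head) {
    std::map<const Node*, int> index_of;
    std::vector<const Node*> order;
    for (const Node* n = head; n && !index_of.count(n); n = n->next) {
        index_of[n] = static_cast<int>(order.size());
        order.push_back(n);
    }
    std::vector<FlatNode> out;
    for (const Node* n : order) {
        out.push_back(FlatNode{ n->value, n->next ? index_of[n->next] : -1 });
    }
    return out;
}
```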

>> The problem is just that you now assume networking to be part of the
>> language
>
> No. The problem is you're looking through the wrong end of the
> telescope. You seem to think the problem is unimportant and can be
> solved with a library. It cannot, and it matters.

That depends on your problem domain. The problem actually *is*
unimportant for most C++ users, which is exactly why it is not part of
the language. It is a problem for a specific application domain, but as
I already pointed out above, *if* you want to solve it with C++, there
are enough tools around to do that job for you. But as I also pointed
out, I probably won't pick C++ in the first place for that. I would use code
other people had written already, in another language that is better
adapted to the problem. You seem to believe that most problems are
database and serialization problems, but they aren't. Honestly, I
believe you're holding the telescope the wrong way. (-;

> We have typeof and sizeof and offsetof. We do not have, say, memberof
> or nameof or parentof or childof.

In terms of inheritance, it should be possible to do that by language
means and template magic, but I never tried. In terms of "pointers", it
cannot be solved by the language in the first place. C++ has no language
means to identify a tree as a tree, and to serialize it as a tree just
by seeing two pointers. Essential information is missing for that. For
example, that there are no cycles in the graph.


> We have, in short, incomplete
> ad hoc access to program metadata. Without complete access to that
> data, it is impossible to do many things that can be done in Java or
> Smalltalk or Python.

But then, what is your problem, I ask you again. Why don't you do such
things in Java, Smalltalk or Python? Every language has its natural
application domain, and the type of problems you describe neither
require the low-level support of C++, nor does C++ provide language
features for that. You can build them, and you can invent tools to
support the compiler in your build chain to support them, but the mere
fact that you need to should tell you that it's the wrong tool for the job.

> Many people wrongly assume that's because C++ is compiled. It's not.
> It's because of *how* it's compiled. It's because the C++ compiler
> discards the metadata, or makes it available only in nonstandard ways
> to the debugger, rather than to the language itself.

Correct, but that's how C++ is defined. It is a design decision that has
been made, on purpose, because C++ is a derivative of C, and as such
still very close to the bare metal. Making such features accessible
would impose run-time overhead most people don't need, so it wasn't
required.

> It applies to more than I/O, too. For example, what happens when you
> dereference a NULL pointer? You get a PDP-11-era "segmentation fault"
> and, if you're lucky, a core file.

No, you don't. You get undefined behavior, that's all C++ tells you. If
you expect more, you're in bad luck. Actually, I know platforms where
nothing happens at all.

> Then, with the help of the debugger
> and a symbol table, you can track down the problem, maybe. (Another
> example of interpreting serialized pointers, btw.)

You may, you may not. C++ does not define what happens. An
implementation could do various things, from starting the debugger and
pointing at the code, to ignoring the event, up to formatting the hard
disk (actually, I had a situation where such a thing at least corrupted
my hard disk, on a less secured environment than Unix is).

> Now, suppose you could trap SIGSEGV and walk the stack in your program,
> using standard functions to identify each object by name and type.

Which would impose quite an implementation overhead not needed by most
programs. But still, an implementation could support it. But, actually,
I *don't need it*, so I don't want to pay for the feature. Actually,
such a thing would probably require that function inlining adjusts the
stack frame, or puts additional data on the stack frame, and that's an
overhead I don't want to pay for because my problems are CPU-bound and
not I/O-bound, and I'm not willing to pay for a feature I don't need. If
I need this feature, I pick Java.

> Better, because if that information were made available in a standard
> way, a standard library could evolve to deal with it, and you could
> just call, say, std::stacktrace in your exception handler.

Use Java if you need that. No problem. C++ is not the answer to all
problems. It is a good answer for CPU-bound problems that require a
complex program structure so there is no natural way to do that in C.

You start all from the wrong premise "I want to use C++", instead of "I
want to solve my problem". If the language is your problem, you need to
pick a different one.

>> If most of your job is writing serialization code, this is even more
>> an indicator that C++ is the wrong choice for the problem.
>
> Do this for me, Richard. Examine every C++ ODBC library you can find,
> and think about why they're such a pain in the tukkus. Go ahead, I'll
> wait. While you're there, consider whether or not there might not be a
> large class of programs that would benefit from more convenient -- nay,
> *standard* -- access to a DBMS.

I honestly don't care about ODBC libraries, not a bit. It's not my
problem domain. And if it were, I wouldn't pick C++ for that. I'm doing
quite a bit here, and I've picked C++, Java, Python, bash script and
probably a couple of other languages for what solved my problem best. I
don't see a need for C++ supporting what you request because *if* that
should be my problem, there are already better solutions in other languages.

David Lowndes

May 23, 2013, 1:37:26 AM
>If I had the same problem, I would probably write my "classes to be
>serialized" in a restricted C++ dialect

Perhaps something along those lines (define a set of C++ classes that
have some standard library serialisation facilities) would be a good
way to go?

Provided that users used them appropriately (just for serialisation
representations), and there were convenient ways of moving internal
working data representations between them, they might not be too
burdensome?

Dave

Joshua Maurice

May 23, 2013, 1:38:23 AM
On May 22, 9:27 am, Edward Rosten
<firstname.dot.lastn...@googlemail.com> wrote:
> let's say we have:
>
> struct Foo
> {
> //some variables
> double a_double;
> //some more variables;
>
> };
>
> Let's say we have a pointer:
>
> double* d;
>
> One can imaging that the information already exists so that the system
> can know whether d points to the stack, the heap or some other area not
> covered by the other two. Let's assume that it points to the stack.
>
> If d points into the heap somewhere, it is not too much of a stretch to
> know where the beginning of the chunk of allocated memory is, and how big
> it is.

Actually, no. Suppose you ask for an array of double:
new double[7];
It's my understanding that a lot of heap implementations will not
return you /exactly/ an array of 7 doubles. It'll actually return to
you an array of 8 doubles. Nowhere will it track that it allocated
exactly 7. Instead, it keeps track that it gave you 8 or less, and it
doesn't care beyond that. It's actually quite possible for arrays of
POD types to not contain the actual size at all. It's sometimes more
efficient to not contain the exact size and instead round up to
multiples of 2 to use a small number of separate pools.
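
As a rough sketch of that kind of policy (an assumption about how some allocators behave, not a description of any particular one), a pooling allocator might round each request up to the next power of two and remember only the pool. With 8-byte doubles, a request for 7 doubles is 56 bytes and lands in the 64-byte pool, and the original 56 is not recorded anywhere. The function name size_class is made up.

```cpp
#include <cstddef>

// Hypothetical allocator policy: round every request up to the next power
// of two, so that a small number of fixed-size pools can serve all requests.
// Only the returned block size is remembered, not the requested size.
std::size_t size_class(std::size_t requested_bytes) {
    std::size_t block = 1;
    while (block < requested_bytes) {
        block *= 2;
    }
    return block;
}
```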

James K. Lowden

May 23, 2013, 1:41:50 AM
On Wed, 22 May 2013 09:27:06 -0700 (PDT)
Edward Rosten <firstname.d...@googlemail.com> wrote:

> Just to confirm what you mean:
>
> let's say we have:
>
> struct Foo
> {
> //some variables
> double a_double;
> //some more variables;
> };
>
> Let's say we have a pointer:
>
> double* d;
....
> In principle, if you know the chunk of memory that d points into, then
> you could dump out that entire chunk. If d points into double[10],
> then it would work (endian and format issues aside). If it points
> into a struct with pointers and other data, then the compiler would
> dump out a bunch of meaningless information. I suppose that's OK,
> since that information wouldn't have been reachable from d anyway, so
> it does not really matter if it's filled with junk.
>
> Is this effectively what you're saying?

Almost. I believe you're thinking along these lines:

Foo foo;
double *d = &foo.a_double;
serialize(d);

and you're thinking the compiler/heap/stack will associate d with foo
because foo was (somehow, somewhere) allocated, and that the
serialization library would serialize foo in order to serialize d.

To my mind that's not really necessary. The assignment

double *d = &foo.a_double;

assigns the address of a single double (not an array) to d. We can
serialize d by itself, without reference to foo, given the following:

1. Provided by the compiler today:
typeof(d) == nonstandardized string
sizeof(d) == sizeof(double*)

2. Discarded by the compiler today:
nameof(d) == "d"
extentof(d) == sizeof(double)

3. Part of the executing image:
d == &foo.a_double;

That is part of what I'm saying. I hope that part's a little clearer
now. :-)
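
For what it's worth, items 1 and 2 can be approximated today at the point of declaration, which may help show what is being asked for. In this sketch PointerInfo and DESCRIBE_POINTER are invented names; the macro has to be applied where the variable's spelling and static type are still visible, and it simply assumes the pointee is a single object, which is the part the language cannot currently confirm.

```cpp
#include <cstddef>
#include <string>
#include <typeinfo>

// Hypothetical record of what is known about a pointer variable.
struct PointerInfo {
    std::string name;          // spelled name of the variable
    std::string type;          // implementation-defined type name
    std::size_t pointer_size;  // sizeof the pointer itself
    std::size_t extent;        // bytes of the (assumed single) pointee
};

// DESCRIBE_POINTER is an illustrative macro, not a standard facility.
// It assumes the pointer refers to exactly one object of its static type.
#define DESCRIBE_POINTER(p) \
    PointerInfo{ #p, typeid(p).name(), sizeof(p), sizeof(*(p)) }
```

Given `double *d`, `DESCRIBE_POINTER(d)` yields the name "d", a pointer size of sizeof(double*), and an extent of sizeof(double). Once d is passed to another function, none of that can be recovered.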

--jkl

James K. Lowden

May 23, 2013, 1:45:18 AM
On Wed, 22 May 2013 00:15:25 CST
Edward Diener <eldi...@tropicsoft.invalid> wrote:

> >> The compiler knows no such thing. It only knows that 'void* p' is a
> >> pointer.
> >
> > Of course it does, as you well know. The information isn't
> > recorded in the pointer, but any allocated memory -- no matter how
> > allocated -- has size. That size is tracked by the
> > compiler/stack/heap. It's C++'s way of keeping everything from
> > using the same space.
>
> The compiler/stack/heap are all different things.

That is part of the challenge. Pointers can be assigned values in a
variety of ways. If you make them "bigger" in the sense of adding an
extent attribute, you have to touch each member of the menagerie.

> From your 'struct' above the compiler only knows that p is a pointer
> to void

That's true when defined. Once assigned,

string s;
void *p = reinterpret_cast<void*>(&s);

it's a pointer to something. That information is discarded
today, but need not be. Indeed, it would be valuable to keep; only a
few days ago there was a discussion on this list about the danger of
casting to void* and casting the result to any type other than the
original. By retaining the cast-from information, the running system
could throw an exception if that were done.
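
A library-level approximation of "retaining the cast-from information" might look like the sketch below. CheckedVoidPtr is a hypothetical class, not an existing facility; it stores the std::type_index of the original type next to the address and throws std::bad_cast on a mismatched cast back.

```cpp
#include <typeindex>
#include <typeinfo>

// Hypothetical wrapper: a void* that remembers the type it was made from.
class CheckedVoidPtr {
public:
    template <typename T>
    explicit CheckedVoidPtr(T* p) : ptr_(p), type_(typeid(T)) {}

    // Cast back; throws std::bad_cast if T is not the original type.
    template <typename T>
    T* as() const {
        if (type_ != std::type_index(typeid(T))) {
            throw std::bad_cast();
        }
        return static_cast<T*>(ptr_);
    }

private:
    void*           ptr_;
    std::type_index type_;
};
```

The cost is one extra word per pointer plus a comparison per cast, and it only works where the wrapper is used instead of a raw void*.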

Absent better information, void* points to a sequence of bytes.

> That pointer can be to anywhere in memory and can point to
> anything. It does not have to point to dynamic memory or be on the
> stack. I tend to doubt that today the information ( length in bytes )
> about that 'void * p' is kept anywhere while the program is running
> by the run-time system.

I'm sure you're right. As I said, for a file-scope variable:

static const char *name = "Galileo";

the compiler reserves space in the object code for the string
"Galileo", but the size of that reservation is discarded.

> I know what you mean by "heap". I was only questioning the idea of
> what the "heap" knows.

Ah. I've always thought it peculiar that

char *p = malloc(10);
free(p);

works but

char *p = new char[10];
delete p;

leaks, or worse: formally it is undefined behavior.

I know it's been justified time and again, but I don't see how it can
be seen as anything other than a step backwards. I wonder how many
cycles have been saved versus hours wasted.
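
For reference, a short sketch of the pairings as the language defines them: malloc with free, new[] with delete[], and std::unique_ptr<char[]> when the pairing should be automatic. The function name owned_buffer_demo is made up.

```cpp
#include <cstdlib>   // std::malloc, std::free
#include <memory>    // std::unique_ptr

// Each allocation form has its own matching release form.
char owned_buffer_demo() {
    char* a = static_cast<char*>(std::malloc(10));
    std::free(a);                        // malloc pairs with free

    char* b = new char[10];
    delete[] b;                          // new[] pairs with delete[]

    std::unique_ptr<char[]> c(new char[10]);
    c[0] = 'x';                          // released with delete[] automatically
    return c[0];
}
```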

> I am not in principle against a run-time system that can track what
> you want it to track but I think you may be underestimating the
> speed/size costs as well as the effort involved.

You may be right. I've been misunderestimated myself on occasion.

> If there were overhead I would want an end-user to be able to opt out
> of it. Not everyone will agree that the ability to automatically
> serialize data should be paid for in terms of either slower code or
> bigger code.

Acknowledged. OTOH, it's easy to overestimate the costs, because they
are so often currently borne by the programmer.

Let's separate two areas of concern: static metadata and pointer
extents.

Incorporating static metadata -- basically, a name-type-size tuple for
every structure member -- in the object code will make the object code
bigger. That's undeniable. If I were writing a compiler, I'm sure I'd
hear from users complaining about that, and be under pressure to offer
an option to remove it. (There will even be proprietary-software
concerns. Not every closed-source vendor will want his data structures
clearly disclosed.) I think it would be interesting to measure, though,
especially if we restrict ourselves to being able to iterate over the
members of a struct/class.

I don't see how extending the language to provide static metadata
imposes any runtime cost. I don't believe it's terribly difficult
given that every compiler already provides the information to
debuggers. (ISTM debuggers would then be easier to write.)

Just metadata and only metadata would be a boon to anyone doing C++
I/O, especially to library writers. If you want to see more people
using C++, that's surely one way to get there.

In the general case, serialization requires inheritance metadata, and
pointer extents, too. I'm personally not all that exercised about
inheritance, but neither does the inheritance graph strike me as
particularly difficult to represent. It would support some interesting
use cases. For instance, it would be possible to explain a Koenig
lookup without parsing the code.

To provide pointer extents requires runtime support. There is some
complexity for that reason among others. But it's not at all clear the
cost is nearly so great as it seems at first blush.

I believe the compiled program should track the extent of every
pointer. When people object about efficiency, I'm actually puzzled,
because I find that whenever I'm dealing with a pointer, any pointer,
I'm always tracking and testing against the extent:

A *
foo( A *a, size_t len ) {
    assert(a);
    for( A *p=a; p < a + len; ++p ) {
        if( bar(p) )
            return p;
    }
    return NULL;
}

Who hasn't written that 1000 times in one form or another? How does
one deal with pointers without tracking the bounds of what they point
to? Who iterates over a string today relying on the NUL terminator
without reference to the allocated extent of the buffer?

Once we accept that every pointer has an extent, and that to use that
pointer we track its extent, why not ask the compiler to do the work
for us?

It may at first seem an unbearable cost, because it may seem that
a pointer's extent must be updated whenever it's incremented, and may
seem that

*p

becomes analogous to vector::at() instead of vector::operator[]. But
neither of those suppositions is accurate.

In the first place, it's not necessary to update the pointer's extent
or to check every dereference. In my A *a example above, the system
can compute the extent of p at will

extentof(p) == extentof(a) - (p - a) * sizeof(*a)

In the second place, that computation need not take place unless
demanded. If extentof is never invoked, the information to produce it
need not exist in the executable.
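
One way to picture this, as a sketch only: a pointer that carries the end of its allocation, so the remaining extent is computed on demand rather than updated on every increment. The type Bounded, its extent() member, and the helper bound() are hypothetical stand-ins for the proposed extentof, and the extent is counted in bytes here.

```cpp
#include <cstddef>

// Hypothetical "pointer with extent". The end of the allocation is
// stored once; incrementing p never touches it.
template <typename T>
struct Bounded {
    T*       p;    // current position
    const T* end;  // one past the last element of the allocation

    // Remaining extent in bytes, computed only when asked for.
    std::size_t extent() const {
        return static_cast<std::size_t>(end - p) * sizeof(T);
    }

    Bounded& operator++() { ++p; return *this; }  // no bookkeeping
    T& operator*() const { return *p; }           // unchecked, like a raw *p
};

// Builds a Bounded from an array whose size is still visible.
template <typename T, std::size_t N>
Bounded<T> bound(T (&a)[N]) {
    Bounded<T> b = { a, a + N };
    return b;
}
```

Dereferencing stays unchecked, as with operator[], and the subtraction in extent() runs only for callers that ask for it.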

Most important, though: we already bear the cost. We're tracking
the length of allocated objects, passing lengths with pointers, testing
against boundaries. We have been since 1975. Would 2017 be too soon
to move that information into the language, where it would be more
convenient? Not to mention unerringly correct?

A *
foo( A a[] ) {
    A *p = a;
    for( ; p < a + extentof(a)/sizeof(a[0]); ++p ) {
        if( bar(p) )
            return p;
    }
    return NULL;
}
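
For comparison, today's C++ can already recover the extent in the one case where the array type is still visible to the callee: pass the array by reference and let the compiler deduce the element count. This is only an approximation of the extentof idea, since it stops working as soon as the array decays to a plain pointer. The struct A and the predicate bar() below are placeholders for the ones in the example above.

```cpp
#include <cstddef>

// Placeholder element type and predicate, purely illustrative.
struct A { int value; };
inline bool bar(const A* p) { return p->value < 0; }

// The array is passed by reference, so N is deduced by the compiler
// instead of being passed by hand as a separate length argument.
template <std::size_t N>
A* foo(A (&a)[N]) {
    for (A* p = a; p < a + N; ++p) {
        if (bar(p))
            return p;
    }
    return NULL;
}
```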

I just don't see the problem. Or, rather, I don't see the technical
problem. ;-)

--jkl

James K. Lowden

May 23, 2013, 1:49:04 AM
On Wed, 22 May 2013 09:28:01 -0700 (PDT)
Thomas Richter <th...@math.tu-berlin.de> wrote:

> On 16.05.2013 14:40, James K. Lowden wrote:
> > On Mon, 13 May 2013 23:41:14 -0700 (PDT)
> > Thomas Richter<th...@math.tu-berlin.de> wrote:
> >
> >> > The minimum I would like to see is the ability to iterate
> >> > over the members of a structure.
> >>
> >> Again, I typically don't need that, but it would include some
> >> overhead. For example, the structure layout would likely need to
> >> be stored somewhere at run time. I don't need this overhead. But
> >> if you do, I'm sure a library solution is feasible which does that.
> >
> > Really? Name one.
>
> You already did. CORBA does solve the problem nicely. Probably
> "inverse" from the way you want it because you first have to write
> the IDL and not the C++ classes, but anyhow, it'll do.

So say you. I don't consider CORBA a "library solution" (setting aside
whether it's a solution at all, to anything). CORBA is not a C++
library. It's a foreign system with C++ bindings.

> If I had the same problem, I would probably write my "classes to be
> serialized" in a restricted C++ dialect, then write a yacc or bison
> code to parse that, and generate the necessary code automatically.

Yup. Instead of using the language you already have, you'd recreate
the information discarded by the compiler in a bespoke, nonstandard
way.

What you could not do is extend I/O streams to deal with user-defined
types.

> Or I would pick java right away, which has such things built in.

Yeah, except there is no "pick". Stroustrup says there are billions of
lines of extant C++, and some 3 million C++ programmers. Every one of
those systems reads and writes data. Many thousands of those
programmers have dealt with I/O, some not very well.

C++ has some unique features, even some *good* unique features.
Judging from the billions of lines and the lousy I/O support, it's safe
to say it gets picked for good reason, and that I/O isn't one of them.

> So why do you write C++ libraries if you need to solve database
> problems? These are typically I/O bound in my experience, and you
> don't need the low-level control C++ offers.
>
> > I didn't say the programs were I/O bound. I am saying they're I/O
> > awkward. That matters because any useful computation requires input
> > and output.
>
> That is so general that it applies to every program. Actually, a
> program without output is pretty much useless, I would say.

Good. So we agree, at least that output matters?

In case it helps, the problems I face day-to-day are compute-bound.
Efficiency is very important. So is getting data in and out of the
system, and sharing it with the rest of the organization.

> > Pointers are nothing more than offsets, counters from zero into
> > linear memory. To write them out and read them back in requires
> > only to encode and decode them. If it couldn't be done,
> > "marshalling" wouldn't be part of our working vocabulary.
>
> Sorry, but how is that supposed to work? If I de-serialize two
> objects, how can one ensure that they have the same offset they had
> when they were serialized?

I don't see the problem. If, when those objects were serialized, the
values of their pointers were stored, too, then on de-serialization you
would have the exact information you needed to know if they were one and
the same object. I don't know what you mean by "same offset"; they
don't need to have the same locations in memory, except relative to
each other.

> >> The problem is just that you now assume networking to be part of
> >> the language
> >
> > No. The problem is you're looking through the wrong end of the
> > telescope. You seem to think the problem is unimportant and can be
> > solved with a library. It cannot, and it matters.
>
> That depends on your problem domain. The problem actually *is*
> unimportant for most C++ users, which is exactly why it is not part of
> the language.

That's not clear at all. The language was invented before it was
used. As I pointed out elsewhere, it predates the Internet, the
dominance of tcp/ip, and SQL. I rather think C++ reflects
its era and its origin in the land of C, not what's important to the 3
million programmers using it in 2013.

> You seem to believe that most problems are database and serialization
> problems, but they aren't. Honestly, I believe you're holding the
> telescope the wrong way. (-;

Kudos for the apropos emoticon!

No, I'm simply addressing myself to the problem of I/O because I note
it's a void in the language and the cause of a lot of work and
nonstandardized behavior that would otherwise be unnecessary. It's
something I claim to know something about, and something we cannot
program around.

> > We have typeof and sizeof and offsetof. We do not have, say,
> > memberof or nameof or parentof or childof.
>
> In terms of inheritance, it should be possible to do that by language
> means and template magic

Maybe, but that is not a general solution.

> In terms of "pointers", it cannot be solved by the language in first
> place. C++ has no language means to identify a tree as a tree

Nor does it need one. Do you deny that the tree exists, or that the
language could provide an iterator for it?

> > Many people wrongly assume that's because C++ is compiled. It's
> > not. It's because of *how* it's compiled. It's because the C++
> > compiler discards the metadata, or makes it available only in
> > nonstandard ways to the debugger, rather than to the language
> > itself.
>
> Correct, but that's how C++ is defined. It is a design decision that
> has been made, on purpose, because C++ is a derivative of C, and as
> such still very close to the bare metal. Making such features
> accessible would impose run-time overhead most people don't need, so
> it wasn't required.

Yes, that's how it's defined. I dispute the costs. I'm not aware of
any research on the topic that quantifies them. We already bear much of
the cost ourselves, by repeating metadata in the form of data.

> > Now, suppose you could trap SIGSEGV and walk the stack in your
> > program, using standard functions to identify each object by name
> > and type.
>
> Which would impose quite an implementation overhead not needed by most
> programs. But still, an implementation could support it. But,
> actually, I *don't need it*, so I don't want to pay for the feature.
> Actually, such a thing would probably require that function inlining
> adjusts the stack frame, or puts additional data on the stack frame,
> and that's an overhead I don't want to pay for because my problems
> are CPU-bound and not I/O-bound, and I'm not willing to pay for a
> feature I don't need. If I need this feature, I pick Java.

I suppose you know that the debate over what's "needed" is as old as C
itself. I imagine you've read the Unix Haters Handbook.

It's not clear to me that the stack would need adjustment, or that
CPU-bound programs pay a measurable price for stack size. I've never
heard of someone making a program faster by minimizing frame sizes. So
I'm not sure that argument holds water.

I think we both know that on today's hardware small is fast because
cache is king. That mostly goes to loop optimization, where the stack
is irrelevant.

> > While you're there, consider whether or not there might not be a
> > large class of programs that would benefit from more convenient --
> > nay, *standard* -- access to a DBMS.
>
> I honestly don't care about ODBC libraries, not a bit. It's not my
> problem domain. And if it would, I wouldn't pick C++ for that. I'm
> doing quite a bit here, and I've picked C++, java, python,
> bash-script and probably a couple of other languages for what solved
> my problem best.

I use other languages, too.

I hope I've made plain that other languages are beside the point.
C++ systems exist. Even if we could write ODBC libraries for them in
other languages, we couldn't automatically generate the I/O functions.

You suggest yacc and Corba. I think you actually understand why
they're nonstarters. Until you reach a large number of structures,
those solutions are *harder* to implement than writing operator<< by
hand. Both are tedious and made necessary by small but crucial missing
metadata in the C++ executable image. Neither is a C++ solution to a
C++ problem.
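
For reference, this is the by-hand chore in question, sketched for a made-up struct Point. Every member is named a second time in the inserter, and nothing forces the two lists to stay in step when the struct changes. The helper to_text() is only there to make the output easy to inspect.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Illustrative user-defined type.
struct Point {
    int x;
    int y;
    std::string label;
};

// Hand-written inserter: the member list is repeated here because the
// language offers no way to iterate over the members of Point.
std::ostream& operator<<(std::ostream& os, const Point& p) {
    return os << "x=" << p.x << " y=" << p.y << " label=" << p.label;
}

// Convenience wrapper that formats a Point into a string.
std::string to_text(const Point& p) {
    std::ostringstream os;
    os << p;
    return os.str();
}
```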

You seem to assume that if convenient I/O is important, it's important
enough to use some other language. That's simply not the case. Lack
of metadata makes C++ I/O harder than it needs to be. There's no virtue
in that, and not much defense, just Zero Mostel on the rooftop belting
out, "Tradition!".

You say it's not possible to present inheritance metadata or to
provide pointer extents. That argument loses on theoretical grounds.
Whether or not it's possible depends entirely on how the language is
defined, and it's that definition we're discussing.

You're concerned about cost, and you have a lot of company. I don't
think it's justified. If it is, the feature can be optional. RTTI is
a compile-time option in all compilers I've used. Heck, some on this
list don't even use the heap.

Why am I skeptical? It's no trouble to find examples of structures
for which pointer extents aren't needed. For every such example, there
are thousands of uses of functions like memcpy(3), which pass the
extent on the stack, by hand, because it's not computable. To me,
that's obviously wrong, an example of doing by hand -- with the
attendant error probability -- something that could be done better by
the machine. If pointer extents were made part of the language, the
whole standard library could be made smaller, faster, and safer to
use.

Like you, I live in the land of colossal computing where programs are
measured in compute-hours (or days). Unlike you, apparently, my
programs also move a lot of data over the wire. I believe the second
problem can be made more tractable at no expense, and perhaps some
benefit, to the first. If that's not true, I hope someone will
demonstrate why not, instead of instantly assuming ipso facto that every
feature has a cost.

--jkl

Edward Rosten

May 24, 2013, 8:28:45 AM

On Wed, 22 May 2013 22:38:23 -0700, Joshua Maurice wrote:

> Actually, no. Suppose you ask for an array of double:
> new double[7];
> It's my understanding that a lot of heap implementations will not
> return you /exactly/ an array of 7 doubles. It'll actually return to
> you an array of 8 doubles. Nowhere will it track that it allocated
> exactly 7. Instead, it keeps track that it gave you 8 or less, and
> it doesn't care beyond that. It's actually quite possible for arrays
> of POD types to not contain the actual size at all. It's sometimes
> more efficient to not contain the exact size and instead round up to
> multiples of 2 to use a small number of separate pools.

Presumably, though that would not matter for these purposes. In this
case, the serialiser would write out a bit more data than is required,
and then load it in on the other end. That data would contain junk
which would be dutifully saved and reloaded, but it wouldn't then be
used by the program.

Of course, (de)serialising junk is a waste of bandwidth and storage,
so it wouldn't be ideal, but I don't see that it would be too much of
a problem otherwise.


-Ed

Thomas Richter

May 24, 2013, 1:59:24 PM
On 23.05.2013 07:49, James K. Lowden wrote:
>
>> You already did. CORBA does solve the problem nicely. Probably
>> "inverse" from the way you want it because you first have to write
>> the IDL and not the C++ classes, but anyhow, it'll do.
>
> So say you. I don't consider CORBA a "library solution" (setting aside
> whether it's a solution at all, to anything). CORBA is not a C++
> library. It's a foreign system with C++ bindings.

So may it be - but does it matter? You need a solution, you got one.
Why does C++ need to address this problem if there is a solution that works?

>> If I had the same problem, I would probably write my "classes to be
>> serialized" in a restricted C++ dialect, then write a yacc or bison
>> code to parse that, and generate the necessary code automatically.
>
> Yup. Instead of using the language you already have, you'd recreate
> the information discarded by the compiler in a bespoke, nonstandard
> way.

Why "nonstandard"? It's probably another standard. But certainly not the
C++ standard. As I already said, I believe there is no need to put this
burden on the shoulders of the C++ standard as there are solutions that
address this problem, and they work with C++. IOW, why do you care
whether this is part of C++ or not?

> What you could not do is extend I/O streams to deal with user-defined
> types.

Why not? Of course I could. It's a matter of generating the code for it.

>> Or I would pick java right away, which has such things built in.
>
> Yeah, except there is no "pick". Stroustrup says there are billions of
> lines of extant C++, and some 3 million C++ programmers. Every one of
> those systems reads and writes data. Many thousands of those
> programmers have dealt with I/O, some not very well.

Why is that an argument against using java? Sorry, I don't understand.
The 3 million lines of code certainly do not need serialization. While I
cannot give you numbers, I would believe that only a minority of them
would profit from it, or solved it otherwise.

In the same vein, one could argue that C++ needs a standard GUI library,
a library that allows users to create graphical front-ends. It's really
missing, there are 3 million lines of C++ out there... and actually, I
(personally) would profit from it.

> C++ has some unique features, even some *good* unique features.
> Judging from the billions of lines and the lousy I/O support, it's safe
> to say it gets picked for good reason, and that I/O isn't one of them.

Probably not. GUIs are not a strong argument for C++ either. But then
again, why solve the problem with C++ in the first place? It is
probably not the right language for that.

>> So why do you write C++ libraries if you need to solve database
>> problems? These are typically I/O bound in my experience, and you
>> don't need the low-level control C++ offers.
>>
>>> I didn't say the programs were I/O bound. I am saying they're I/O
>>> awkward. That matters because any useful computation requires input
>>> and output.
>>
>> That is so general that it applies to every program. Actually, a
>> program without output is pretty much useless, I would say.
>
> Good. So we agree, at least that output matters?

Sure. So we need a GUI library, right? Something to make my output
readable by a user, in a convenient way. And output means output to HTML
these days, so actually, we need an HTML + CSS creator within standard C++.

*Neither* of these features is part of C++ today. I believe for good
reason. There are already solutions for them, so there is no need to put
them into the standard.

> In case it helps, the problems I face day-to-day are compute-bound.
> Efficiency is very important. So is getting data in and out of the
> system, and sharing it with the rest of the organization.

So again, why not java? Have you made benchmarks?

>>> Pointers are nothing more than offsets, counters from zero into
>>> linear memory. To write them out and read them back in requires
>>> only to encode and decode them. If it couldn't be done,
>>> "marshalling" wouldn't be part of our working vocabulary.
>>
>> Sorry, but how is that supposed to work? If I de-serialize two
>> objects, how can one ensure that they have the same offset they had
>> when they were serialized?
>
> I don't see the problem. If, when those objects were serialized, the
> values of their pointers were stored, too, then on de-serialization you
> would have the exact information you needed to know if they were one and
> the same object.

How, and why? You need a unique identifier that identifies all the
objects you have written, then again on input, you need to scan this
information, need to keep the IDs in a local database, and re-set the
pointers correctly. Ok, probably do-able. But why do I need to put all
this stuff into the C++ core?
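
The ID scheme described above can be sketched in a few lines, which may help when judging how much machinery it really needs. This is an illustration under simplifying assumptions, not a general serializer: all nodes are handed over in one list, each node has a single pointer member, and a node's ID is simply its position in that list. The names Node, Record, save, and load are invented here.

```cpp
#include <cstddef>
#include <map>
#include <vector>

struct Node {
    int   value;
    Node* next;  // may be null, may point at a node shared with others
};

// Serialized form: the pointer is replaced by an ID (-1 for null).
struct Record {
    int value;
    int next_id;
};

// Assign each node an ID (its position in `nodes`) and write links as IDs.
std::vector<Record> save(const std::vector<Node*>& nodes) {
    std::map<const Node*, int> id;  // the "local database" of identities
    for (std::size_t i = 0; i < nodes.size(); ++i)
        id[nodes[i]] = static_cast<int>(i);

    std::vector<Record> out;
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        Record r;
        r.value = nodes[i]->value;
        r.next_id = -1;
        std::map<const Node*, int>::const_iterator it = id.find(nodes[i]->next);
        if (nodes[i]->next && it != id.end())
            r.next_id = it->second;
        out.push_back(r);
    }
    return out;
}

// Rebuild the nodes at fresh addresses, then re-set pointers from the IDs.
void load(const std::vector<Record>& in, std::vector<Node>& nodes) {
    nodes.clear();
    nodes.resize(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        const int id = in[i].next_id;
        nodes[i].value = in[i].value;
        nodes[i].next =
            (id >= 0 && static_cast<std::size_t>(id) < in.size()) ? &nodes[id] : NULL;
    }
}
```

Two nodes that shared a target before saving share one again after loading, even though every address has changed.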

> I don't know what you mean by "same offset"; they
> don't need to have the same locations in memory, except relative to
> each other.

They will certainly not end up at the same relative offset. At least not
under a multitasking operating system.


>>>> The problem is just that you now assume networking to be part of
>>>> the language
>>>
>>> No. The problem is you're looking through the wrong end of the
>>> telescope. You seem to think the problem is unimportant and can be
>>> solved with a library. It cannot, and it matters.
>>
>> That depends on your problem domain. The problem actually *is*
>> unimportant for most C++ users, which is exactly why it is not part of
>> the language.
>
> That's not clear at all. The language was invented before it was
> used. As I pointed out elsewhere, it predates the Internet, the
> dominance of tcp/ip, and SQL. I rather think C++ reflects
> its era and its origin in the land of C, not what's important to the 3
> million programmers using it in 2013.

...and GUIs, don't forget the GUIs.... (-:

Honestly, for all that stuff there are solutions that interact nicely with
C++. So again, why does that have to go into the language core?

> No, I'm simply addressing myself to the problem of I/O because I note
> it's a void in the language and the cause of a lot of work and
> nonstandardized behavior that would otherwise be unnecessary. It's
> something I claim to know something about, and something we cannot
> program around.

You can't? I can. See above, it looks like a practical solution that
works. I mean, CORBA basically works this way. You write the IDL, and
the IDL compiler creates the C++ skeletons from that. Works. Ok, the
CORBA C++ binding stinks, but that's a different issue.

>>> We have typeof and sizeof and offsetof. We do not have, say,
>>> memberof or nameof or parentof or childof.
>>
>> In terms of inheritance, it should be possible to do that by language
>> means and template magic
>
> Maybe, but that is not a general solution.
>
>> In terms of "pointers", it cannot be solved by the language in first
>> place. C++ has no language means to identify a tree as a tree
>
> Nor does it need one. Do you deny that the tree exists, or that the
> language could provide an iterator for it?

Neither, but that's not the problem. Writing out pointers as offsets
from a base address is not a solution either. You need something more
complicated to make it work in general.


>>> Now, suppose you could trap SIGSEGV and walk the stack in your
>>> program, using standard functions to identify each object by name
>>> and type.
>>
>> Which would impose quite an implementation overhead not needed by most
>> programs. But still, an implementation could support it. But,
>> actually, I *don't need it*, so I don't want to pay for the feature.
>> Actually, such a thing would probably require that function inlining
>> adjusts the stack frame, or puts additional data on the stack frame,
>> and that's an overhead I don't want to pay for because my problems
>> are CPU-bound and not I/O-bound, and I'm not willing to pay for a
>> feature I don't need. If I need this feature, I pick Java.
>
> I suppose you know that the debate over what's "needed" is as old as C
> itself. I imagine you've read the Unix Haters Handbook.
>
> It's not clear to me that the stack would need adjustment, or that
> CPU-bound programs pay a measurable price for stack size. I've never
> heard of someone making a program faster by minimizing frame sizes. So
> I'm not sure that argument holds water.

Actually, I did. I do some heavy signal processing here, and believe me,
it does make a huge difference if you can inline a function without
overhead directly into the code implementing a tight loop. It is a world
where you gain customers just because you're 10% faster than your
competitor, and I need this 10% I can gain by clever programming. (Then
again, the 10% also requires doing some things in assembler, but I'll
avoid using that whenever possible, unless I really have to. And I
rarely have to these days, luckily.)

> I think we both know that on today's hardware small is fast because
> cache is king. That mostly goes to loop optimization, where the stack
> is irrelevant.

Stack is relevant if you need to manipulate it to create a stack frame
for capturing a segfault correctly. And no, I definitely *do not* want
that overhead in my code, because, yes, it does matter, and yes, I have
observed exactly that.

> I hope I've made plain that other languages are beside the point.

Why?

> C++ systems exist. Even if we could write ODBC libraries for them in
> other languages, we couldn't automatically generate the I/O functions.

Why not? I don't see your problem. I'm using makefiles plus parsers for
that.

> You suggest yacc and Corba. I think you actually understand why
> they're nonstarters.

Actually no. I have a project here that uses OpenCL. The OpenCL source
that is shipped is pre-processed from some other "almost-like" OpenCL by
some parsing plus macro magic. It embeds nicely into the build system.
I did this simply because the OpenCL that is defined in the language
lacked some features I needed. So I created a "minimalistic" language
around it.

Ok, that's not C++ (at least not this part), but it's the same type of
solution.

> Until you reach a large number of structures,
> those solutions are *harder* to implement than writing operator<< by
> hand. Both are tedious and made necessary by small but crucial missing
> metadata in the C++ executable image. Neither is a C++ solution to a
> C++ problem.

Why do I need a C++ solution? I need *a solution*.

> You seem to assume that if convenient I/O is important, it's important
> enough to use some other language. That's simply not the case. Lack
> of metadata makes C++ I/O harder than it needs to be. There's no virtue
> in that, and not much defense, just Zero Mostel on the rooftop belting
> out, "Tradition!".

For most of the cases, I/O is too special to have a generic solution for
that. The I/O I need to do is byte-oriented, and I need to build all the
structures themselves. It does not make sense to have that generated by
C++ at all because the I/O format is too far away from any type of
structure or data type C++ might offer. So, when I do I/O, I'll do that
myself. Depending on the problem, even on the bit level if it must be
(and it must be, sometimes). So the solution you're offering is likely
not a solution I would consider helpful - it is too special for me and
doesn't solve my problem. It's not that I'm serializing data structures
right away.

> You say it's not possible to present inheritance metadata or to
> provide pointer extents. That argument loses on theoretical grounds.
> Whether or not it's possible depends entirely on how the language is
> defined, and it's that definition we're discussing.
>
> You're concerned about cost, and you have a lot of company. I don't
> think it's justified. If it is, the feature can be optional. RTTI is
> a compile-time option in all compilers I've used. Heck, some on this
> list don't even use the heap.

RTTI has some storage overhead, but no running time overhead, so that's
acceptable. Still, I turn it off. I don't need it. Sometimes I do not
even need or want exceptions (or have my own mechanisms of dealing with
them). So I can turn them off.

> Why am I skeptical? It's no trouble to find examples of structures
> for which pointer extents aren't needed. For every such example, there
> are thousands of uses of functions like memcpy(3), which pass the
> extent on the stack, by hand, because it's not computable. To me,
> that's obviously wrong, an example of doing by hand -- with the
> attendant error probability -- something that could be done better by
> the machine. If pointer extents were made part of the language, the
> whole standard library could be made smaller, faster, and safer to
> use.

I kinda doubt this, unless you've tried and can show some results.

> Like you, I live in the land of colossal computing where programs are
> measured in compute-hours (or days). Unlike you, apparently, my
> programs also move a lot of data over the wire. I believe the second
> problem can be made more tractable at no expense, and perhaps some
> benefit, to the first. If that's not true, I hope someone will
> demonstrate why not, instead of instantly assuming ipso facto that every
> feature has a cost.

The basic questions I have, and for which I haven't found a convincing answer, are:
*) Why is C++ the right solution for the job?
*) Why is using external tools on C++ not a good solution?

I've used both, i.e. switched languages if I had other problems (like
GUIs, they're working like a charm in Java) or used external tools (like
my OpenCL generator) if I had something that required tight integration
into C++.

So long,
Thomas

Edward Rosten

May 24, 2013, 2:00:41 PM
On Wed, 22 May 2013 22:41:50 -0700, James K. Lowden wrote:

> 2. Discarded by the compiler today:
> nameof(d) == "d"
> extentof(d) == sizeof(double)

That's the bit I'm not sure works. If, for instance, a pointer to d is passed to a
separate translation unit before serialisation, the compiler can't tell
whether d is pointing to an individual double or an array of doubles.

At this point there's little choice but to serialise the entire allocated
block that d points into.

Though, the serialisation can be done naively, since clearly it cannot
contain anything other than doubles that would need to be valid on the
other end.

I'm not actually 100% sure about that last point. If you can legitimately
construct a pointer to a

struct NotSure {
    double first_member;
    // other stuff
};


from a pointer to a double then you can't even assume that the only data
pointed to is doubles. I can't remember if this is legal in C++ offhand.
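
For a standard-layout struct the conversion is permitted: a pointer to the struct and a pointer to its first member can be converted to each other with reinterpret_cast. The sketch below assumes C++11 for static_assert and std::is_standard_layout; the member other_stuff and the function owner_of are made up for the example.

```cpp
#include <type_traits>

struct NotSure {
    double first_member;
    int    other_stuff;  // stands in for "other stuff"
};

// The guarantee applies only to standard-layout types, so check it.
static_assert(std::is_standard_layout<NotSure>::value,
              "NotSure must be standard-layout for the cast below");

// Recover the enclosing struct from a pointer to its first member.
// Valid only if d really points at the first_member of a NotSure.
inline NotSure* owner_of(double* d) {
    return reinterpret_cast<NotSure*>(d);
}
```

So a serialiser handed a bare double* cannot assume that the surrounding block holds nothing but doubles.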

-Ed

Francis Glassborow

May 24, 2013, 8:06:11 PM
On 24/05/2013 18:59, Thomas Richter wrote:
> On 23.05.2013 07:49, James K. Lowden wrote:
>>
>>> You already did. CORBA does solve the problem nicely. Probably
>>> "inverse" from the way you want it because you first have to write
>>> the IDL and not the C++ classes, but anyhow, it'll do.
>>
>> So say you. I don't consider CORBA a "library solution" (setting aside
>> whether it's a solution at all, to anything). CORBA is not a C++
>> library. It's a foreign system with C++ bindings.
>
> So may it be - but does it matter? You need a solution, you got one.
> Why does C++ need to address this problem if there is a solution that
> works.
>

WG21 is usually very reluctant to introduce changes to the core unless
they solve wider problems. I suspect that reflection might help to solve
serialisation problems but it will also help in quite a few other ways
(or so I am told). So the serialisation problem might add motivation for
reflection.

However there is another issue. The problem is not just endian-ness. C++
specifies very little about fundamental types and that is a deliberate
design decision that has been around since the beginning (and has been
reconsidered and re-affirmed). If different implementations have
different widths for int how will serialisation help in transfer of int
data between programs even given a solution to the endian-ness?

Now rather than messing with the core of the language it would, I think,
be very helpful if implementers ensured that relevant meta-data was
retained in object files so that linkers could ensure that all TUs were
compatible (e.g. every object file should contain a header identifying
the settings of all compiler switches).

Please note that even extern "C" only works for compatible
implementations of C and C++

Regards
Francis

woodb...@googlemail.com

May 25, 2013, 2:59:29 AM
{ Please remove double-spacing in the quoting. Reformatted. -mod }

On Tuesday, May 14, 2013 1:11:46 AM UTC-5, James K. Lowden wrote:
> Without compiler support, C++ is "just another language" participating
> in the IDL-driven language-neutral serialization circus. Inevitably,
> the IDL defines the very structures that could be better defined
> directly in C++. Twice the complexity, half the features, and none of
> the fun.

I agree with your comments about the language-neutral circus.
CORBA's approach seemed to provide mediocre marshalling support
for a number of languages, but not strong support for any of
the languages.

The C++ Middleware Writer (see link below) is an online
code generator that aims to provide strong marshalling
support for C++. I'm not thinking about support for
another language at this point, because there's still a
lot of work to do to improve the C++ support. As ever,
your help is appreciated in terms of identifying the
most distressing weaknesses.


Brian
Ebenezer Enterprises
http://webEbenezer.net

woodb...@googlemail.com

May 25, 2013, 3:00:28 AM
{ Please remove double-spacing in the quoting. Reformatted. -mod }

On Friday, May 24, 2013 7:06:11 PM UTC-5, Francis Glassborow wrote:
> However there is another issue. The problem is not just endian-ness. C++
> specifies very little about fundamental types and that is a deliberate
> design decision that has been around since the beginning (and has been
> reconsidered and re-affirmed). If different implementations have
> different widths for int how will serialisation help in transfer of int
> data between programs even given a solution to the endian-ness?
>

Is that something that can be dodged by telling people to
use int8_t, int16_t and so on in a marshalling context? That's
what I recommend if I find someone talking about int in that
context. If you're aiming toward portability, I think you have
to have clarity about the meaning of the types involved.
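
A small sketch of that advice, assuming C++11 for static_assert: fix the width with uint32_t and fix the byte order with shifts, so the bytes on the wire are the same whatever the host's int width or endianness. The function names put_u32_le and get_u32_le are invented here, and the 8-bit-byte check is an added assumption.

```cpp
#include <climits>
#include <cstdint>

// The scheme assumes 8-bit bytes; refuse to compile elsewhere.
static_assert(CHAR_BIT == 8, "this sketch assumes 8-bit bytes");

// Write a 32-bit value as four bytes, least significant byte first,
// regardless of the host's own byte order.
inline void put_u32_le(std::uint32_t v, unsigned char* out) {
    out[0] = static_cast<unsigned char>(v & 0xFFu);
    out[1] = static_cast<unsigned char>((v >> 8) & 0xFFu);
    out[2] = static_cast<unsigned char>((v >> 16) & 0xFFu);
    out[3] = static_cast<unsigned char>((v >> 24) & 0xFFu);
}

// Read the value back from the same four-byte layout.
inline std::uint32_t get_u32_le(const unsigned char* in) {
    return  static_cast<std::uint32_t>(in[0])
         | (static_cast<std::uint32_t>(in[1]) << 8)
         | (static_cast<std::uint32_t>(in[2]) << 16)
         | (static_cast<std::uint32_t>(in[3]) << 24);
}
```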


Brian
Ebenezer Enterprises
http://webEbenezer.net


Seungbeom Kim

May 25, 2013, 7:34:38 PM

On 2013-05-24 05:28, Edward Rosten wrote:
>
> Of course, (de)serialising junk is a waste of bandwidth and storage,
> so it wouldn't be ideal, but I don't see that it would be too much
> of a problem otherwise.

For one thing, trying to read uninitialized junk results in undefined
behavior, and you cannot get around it because it happens
automatically.

--
Seungbeom Kim

Francis Glassborow

unread,
May 25, 2013, 7:47:51 PM5/25/13
to

On 25/05/2013 08:00, woodb...@googlemail.com wrote:
> { Please remove double-spacing in the quoting. Reformatted. -mod }
>
> On Friday, May 24, 2013 7:06:11 PM UTC-5, Francis Glassborow wrote:
>> However there is another issue. The problem is not just
>> endian-ness. C++ specifies very little about fundamental types and
>> that is a deliberate design decision that has been around since the
>> beginning (and has been reconsidered and re-affirmed). If different
>> implementations have different widths for int how will
>> serialisation help in transfer of int data between programs even
>> given a solution to the endian-ness?
>>
>
> Is that something that can be dodged by telling people to use
> int8_t, int16_t and so on in a marshalling context? That's what I
> recommend if I find someone talking about int in that context.

And what do you do about systems where a char has more than 8 bits?
And yes, they do exist.

> If you're aiming toward portability, I think you have to have clarity
> about the meaning of the types involved.

Indeed but the C++ type system is not designed for portability at that
level. There has never been a requirement for portability at that
level. Remember that this thread is about requiring core support for
the issues of endian-ness, padding etc. My point, in case anyone
missed it, is that such support would require fundamental changes to
the core of C++. There is nothing to prevent implementers from
providing support but requiring that they do so is, IMNSHO, a
non-starter.

Francis

Thomas Richter

unread,
May 25, 2013, 7:44:11 PM5/25/13
to

Am 25.05.2013 09:00, schrieb woodb...@googlemail.com:
> { Please remove double-spacing in the quoting. Reformatted. -mod }
>
> On Friday, May 24, 2013 7:06:11 PM UTC-5, Francis Glassborow wrote:
>> However there is another issue. The problem is not just
>> endian-ness. C++ specifies very little about fundamental types and
>> that is a deliberate design decision that has been around since the
>> beginning (and has been reconsidered and re-affirmed). If different
>> implementations have different widths for int how will
>> serialisation help in transfer of int data between programs even
>> given a solution to the endian-ness?
>>
>
> Is that something that can be dodged by telling people to use
> int8_t, int16_t and so on in a marshalling context? That's what I
> recommend if I find someone talking about int in that context. If
> you're aiming toward portability, I think you have to have clarity
> about the meaning of the types involved.

Which, again, would narrow the application domain of the
language. While these types are available on most desktop and server
systems, it is not unusual for embedded systems not to support all of
them. I'm aware of a couple of TI DSPs where char=short=int are all 16
bits wide. So at the very least, the language would need to be split up
into a core that does not define such types, and a language on top that
adds further constraints.

Actually, this makes me believe that the C++ standard is not the right
place for it. You can always create another standard that refers to
C++ but extends it by adding requirements to the core.

Greetings,
Thomas

David Lowndes

unread,
May 25, 2013, 7:41:31 PM5/25/13
to

>Is that something that can be dodged by telling people to use int8_t,
>int16_t and so on in a marshalling context?

I agree, the whole idea of trying to serialise *anything* seems
fundamentally problematic to me.

Surely, being able to serialise just specific types & classes (that
could be machine independently implemented), is a simpler and much
more achievable goal.

Dave

woodb...@googlemail.com

unread,
May 26, 2013, 3:10:57 AM5/26/13
to
On Saturday, May 25, 2013 10:50:02 PM UTC, Francis Glassborow wrote:
>
> And what do you do about systems where a char has more than 8 bits?
> And yes, they do exist.
>

#include <climits>

#if CHAR_BIT != 8
#error Only 8 bit char supported
#endif

I think I got that from James Kanze. It's kind of a poor man's answer
until there's more time to work on that.

> Indeed but the C++ type system is not designed for portability at
> that level. There has never been a requirement for portability at
> that level.

I guess you're saying that int8_t and friends aren't part of the C++
type system. But they work fine as far as I can tell. They may not
be guaranteed, but it seems like they are widely available.

Brian
Ebenezer Enterprises
http://webEbenezer.net


Thomas Richter

unread,
May 27, 2013, 4:45:18 AM5/27/13
to
On 26.05.2013 09:10, woodb...@googlemail.com wrote:
> On Saturday, May 25, 2013 10:50:02 PM UTC, Francis Glassborow wrote:
>>
>> And what do you do about systems where a char has more than 8 bits?
>> And yes, they do exist.
>>
>
> #if CHAR_BIT != 8
> #error Only 8 bit char supported
> #endif
>
> I think I got that from James Kanze. It's kind of a poor man's
> answer until there's more time to work on that.

Actually, if I have the issue, I solve it by autoconf - basically
because this is exactly what autoconf was made for. It can also give
you the endianness of the system should you require it. It is just
another solution outside of the C++ standard, but again a solution
that works.

>> Indeed but the C++ type system is not designed for portability at
>> that level. There has never been a requirement for portability at
>> that level.
>
> I guess you're saying that int8_t and friends aren't part of the C++
> type system. But they work fine as far as I can tell. They may not
> be guaranteed, but it seems like they are widely available.

I don't think that Francis said that. They are part of the type
system, but they may or may not exist, depending on the system
architecture. On some embedded systems, they surely do not exist. It
was simply not a requirement of the C++ design to enforce that these
types must exist as it would limit the applicability of C++.
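
As a minimal sketch of how a program can cope with that, assuming a
C++11 <cstdint>: the macro INT8_MAX is defined only when int8_t itself
exists, so it can select a fallback. The names wire_i8, has_exact_i8
and wire_i8_bits are hypothetical and exist only for this
illustration.

```cpp
#include <climits>
#include <cstdint>

// INT8_MAX is provided by <cstdint> only if the exact-width type exists.
#ifdef INT8_MAX
typedef std::int8_t wire_i8;        // exact 8-bit type is available
const bool has_exact_i8 = true;
#else
typedef std::int_least8_t wire_i8;  // smallest type with at least 8 bits
const bool has_exact_i8 = false;
#endif

// Storage width, in bits, of the type actually chosen on this platform.
const int wire_i8_bits = static_cast<int>(sizeof(wire_i8) * CHAR_BIT);
```

On a DSP with 16-bit chars the fallback branch would be taken and
wire_i8_bits would report 16, which the marshalling code then has to
handle explicitly.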

There are other languages that have stronger requirements on the
hardware they run on, and thus narrower application fields. Such
languages have types of guaranteed bitdepths. C++ does not.

So long,
Thomas

woodb...@googlemail.com

unread,
May 27, 2013, 5:21:55 PM5/27/13
to
On Monday, May 27, 2013 8:45:18 AM UTC, Thomas Richter wrote:
>
> I don't think that Francis said that. They are part of the type
> system, but they may or may not exist, depending on the system
> architecture. On some embedded systems, they surely do not exist. It
> was simply not a requirement of the C++ design to enforce that these
> types must exist as it would limit the applicability of C++.

I've never worked on embedded systems but I would like to. Couldn't
most embedded systems support a subset of int8_t and pals? I think
it's OK if a platform just supports one or two of them.

Edward Rosten

unread,
May 28, 2013, 7:12:05 AM5/28/13
to
On Sat, 25 May 2013 17:34:38 -0600, Seungbeom Kim wrote:

> On 2013-05-24 05:28, Edward Rosten wrote:
>>
>> Of course, (de)serialising junk is a waste of bandwidth and
>> storage, so it wouldn't be ideal, but I don't see that it would be
>> too much of a problem otherwise.
>
> For one thing, trying to read uninitialized junk results in
> undefined behavior, and you cannot get around it because it happens
> automatically.

That doesn't matter: we're talking about something new so the
implementation is free to define it in this case. In practice, if the
memory is mapped in, and not MMIO, then reading as bytes is
harmless. The allocator can know that since it was responsible for
allocating the memory in the first place.

-Ed


--

Martin Bonner

unread,
May 28, 2013, 5:20:16 PM5/28/13
to
On Monday, May 27, 2013 10:21:55 PM UTC+1, woodb...@googlemail.com wrote:
> I've never worked on embedded systems but I would like to. Couldn't
> most embedded systems support a subset of int8_t and pals? I think
> it's OK if a platform just supports one or two of them.

A few years ago, there were signal processors which could only support
32-bit arithmetic. That worked: sizeof(short) == sizeof(int) ==
sizeof(long) == 1. If you are wondering how sizeof(long) can be 1, the
answer is that CHAR_BIT was 32 ... that makes int8_t and int16_t tricky.
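
The relationship can be sketched as follows: sizeof counts chars, not
octets, so the storage width of a type is sizeof(T) * CHAR_BIT. The
template name storage_bits is hypothetical, and the result includes
any padding bits the type may have.

```cpp
#include <climits>
#include <cstddef>

// Storage bits occupied by T: sizeof counts chars, CHAR_BIT counts
// the bits in each char.
template <typename T>
struct storage_bits {
    static const std::size_t value = sizeof(T) * CHAR_BIT;
};
```

On the signal processors described above, storage_bits<long>::value
would be 32 even though sizeof(long) is 1.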

These days, I suspect the successor devices probably have CHAR_BIT==64.

James K. Lowden

unread,
Jun 2, 2013, 4:37:45 PM6/2/13
to

On Fri, 24 May 2013 17:06:11 -0700 (PDT)
Francis Glassborow <francis.g...@btinternet.com> wrote:

> >> I don't consider CORBA a "library solution" (setting aside
> >> whether it's a solution at all, to anything). CORBA is not a C++
> >> library. It's a foreign system with C++ bindings.
> >

> > So may it be - but does it matter? You need a solution, you got
> > one. Why does C++ need to address this problem if there is a
> > solution that works.
>
> WG21 is usually very reluctant to introduce changes to the core
> unless they solve wider problems. I suspect that reflection might
> help to solve serialisation problems but it will also help in quite
> a few other ways (or so I am told) So the serialisation problem
> might add motivation for reflection.
>
> However there is another issue. The problem is not just endian-ness.
> C++ specifies very little about fundamental types and that is a
> deliberate design decision that has been around since the beginning
> (and has been reconsidered and re-affirmed). If different
> implementations have different widths for int how will serialisation
> help in transfer of int data between programs even given a solution
> to the endian-ness?

There's no general solution to that, right, because we can't fit 10
pounds of flour in a 5-pound sack?

In a more limited way, though, I don't see the difficulty. The
protocol can define or transmit the information required for the
receiver to interpret the data correctly.

As the guy who'd like to write that library, I'm stymied not by things
I can detect but by things I can't. I can, awkwardly, determine the
endianism and size of integer types. I can't know the number of bits
(or sign) of a char, but I'm willing to assume or specify 8 bits,
signed, because that's hugely popular.

I'd be happier, though, with access to a table whose tuples gave name,
bit-size, endianism, and bit-semantics (not only sign but schema e.g.
two's complement) for every primitive type. It's not enough just to
embed such a table in the executable or translation unit. For it to
be used by a library, it must be accessible through the language.

(Someone will object, correctly, that we cannot know what bit-level
semantics may be implemented in the future, and conclude, incorrectly,
that we cannot therefore enumerate the widely used small set of
existing ones. The problem is neither large nor complex. As an order
of magnitude, the variety of int sizes, char sizes (in bits), integer
bit semantics, and floating point representations for which, say, GNU
can produce executable code each number fewer than 10.)

Where the language really falls down is class/struct member
enumeration. There's simply no way to iterate over the members of a
struct. Because of that, there's no way for a library to map the rows
in a database to an in-memory struct, even if the names and types of
the struct members exactly match those of the columns in the database
table. And that's just one important specific example. In general
it's impossible to map any C++ data structure onto an external
definition, except by hand.

> Now rather than messing with the core of the language it would, I
> think, be very helpful if implementers ensured that relevant
> meta-data was retained in object files so that linkers could ensure
> that all TUs were compatible (e.g. every object file should contain
> a header identifying the settings of all compiler switches.

Yes, except for "rather than". :-)

I doubt there is much demand for linkers that can ensure TU
compatibility. The answer to that has generally been "recompile".

Even a linker -- as currently defined -- that could detect
incompatible TUs couldn't do much more than abort with a diagnostic.
A "linker" that marshalls incompatible data across TUs would be a very
different beast indeed.

As a practical matter, getting implementations to do anything in a
standard way not specified by the standard is a lost cause. We have
25 years of name-mangling as a poster child. :-(

--jkl

James K. Lowden

unread,
Jun 2, 2013, 4:41:23 PM6/2/13
to

On Fri, 24 May 2013 10:59:24 -0700 (PDT)
Thomas Richter <th...@math.tu-berlin.de> wrote:

> > Like you, I live in the land of colossal computing where programs
> > are measured in compute-hours (or days). Unlike you, apparently,
> > my programs also move a lot of data over the wire.
> > The basic questions I have, and I haven't found a convincing
> > answer is:
> *) Why is C++ the right solution for the job?

I have answered that to the best of my ability. When I tell someone
that my compute jobs are measured in CPU-days, and am then asked why
C++ is the right language, I tend to agree with Jamie Zawinski that
maybe tending bar would be more profitable.

> *) Why is using external tools on C++ not a good solution?

First, because it requires two parties to agree on a tool. You seem
to start with the assumptions that the programmer begins with a
clean slate and controls all technical choices. Drop those
assumptions, and imposing an external tool becomes infeasible.

My personal hobbyhorse is database client libraries: They are harder
to use than they need to be because the programmer is forced to
define operator<< (or similar) for every data structure representing
a row in the database he wants to use. That work is
boring, error-prone, and pointless. It would not be improved in any
way by introducing CORBA into the mix.
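
To make the complaint concrete, here is a sketch of the hand-written
boilerplate in question. The row type Customer, its members, the '|'
separator and the helper to_text are all invented for illustration
and do not come from any particular database library.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical struct mirroring one row of a database table.
struct Customer {
    int id;
    std::string name;
    double balance;
};

// Every member has to be restated by hand; adding a column to the
// table means editing this function as well.
std::ostream& operator<<(std::ostream& os, const Customer& c) {
    return os << c.id << '|' << c.name << '|' << c.balance;
}

// Convenience wrapper that formats one row into a string.
std::string to_text(const Customer& c) {
    std::ostringstream out;
    out << c;
    return out.str();
}
```

Nothing in the language lets a library generate that operator from
the struct definition alone, which is the gap being described.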

The absence of metadata in the compiled output forces the programmer
to restate the contents of any user-defined structure when the data
pass through a process boundary. To say that work would be
unnecessary if the compiler included the requisite metadata isn't just
obvious; it's tautological.

--jkl

Richard Damon

unread,
Jun 3, 2013, 12:08:37 AM6/3/13
to

On 6/2/13 1:37 PM, James K. Lowden wrote:
>
> As the guy who'd like to write that library, I'm stymied not by
> things I can detect but by things I can't. I can, awkwardly,
> determine the endianism and size of integer types. I can't know the
> number of bits (or sign) of a char, but I'm willing to assume or
> specify 8 bits, signed, because that's hugely popular.
>

Small correction, but these ARE known and available at compile time
through the macros CHAR_BIT and CHAR_MIN (if CHAR_MIN == 0, char is
unsigned; if CHAR_MIN < 0, it is signed).
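
A minimal sketch of that check, using only <climits>; the constant
names char_bits and char_is_signed are made up for this example.

```cpp
#include <climits>

// Number of bits in a char; the standard guarantees at least 8.
const int char_bits = CHAR_BIT;

// CHAR_MIN is 0 when plain char is unsigned and negative when signed.
const bool char_is_signed = (CHAR_MIN < 0);
```

Both values are constant expressions, so they can also drive #if
directives or template arguments.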

Öö Tiib

unread,
Jun 3, 2013, 12:10:04 AM6/3/13
to

On Sunday, 2 June 2013 22:40:06 UTC+3, James K. Lowden wrote:
> I can't know the number of bits (or sign) of a char, but I'm willing
> to assume or specify 8 bits, signed, because that's hugely popular.

Maybe I misunderstand something, but the following compile-time
constants have been available since C++98 (via <limits>):

int const non_sign_bits = std::numeric_limits<char>::digits;
bool const is_signed = std::numeric_limits<char>::is_signed;

Thomas Richter

unread,
Jun 4, 2013, 8:47:43 AM6/4/13
to
On 02.06.2013 22:37, James K. Lowden wrote:
>
>> However there is another issue. The problem is not just
>> endian-ness. C++ specifies very little about fundamental types and
>> that is a deliberate design decision that has been around since the
>> beginning (and has been reconsidered and re-affirmed). If
>> different implementations have different widths for int how will
>> serialisation help in transfer of int data between programs even
>> given a solution to the endian-ness?
>
> There's no general solution to that, right, because we can't fit 10
> pounds of flour in a 5-pound sack?

Not as far as the use cases of C++ are defined.

> In a more limited way, though, I don't see the difficulty. The
> protocol can define or transmit the information required for the
> receiver to interpret the data correctly.

So consider the case where a system with 8-bit chars transmits data to
a system that does not have 8-bit chars, but 16 bit integers
only. Does that mean that the target system needs to emulate 8-bit
integer? If so, then this is in conflict with the C++ model.

> As the guy who'd like to write that library, I'm stymied not by
> things I can detect but by things I can't. I can, awkwardly,
> determine the endianism and size of integer types. I can't know the
> number of bits (or sign) of a char, but I'm willing to assume or
> specify 8 bits, signed, because that's hugely popular.

Yes, but it is already beyond what C++ can ensure.

> I'd be happier, though, with access to a table whose tuples gave
> name, bit-size, endianism, and bit-semantics (not only sign but
> schema e.g. two's complement) for every primitive type. It's not
> enough just to embed such a table in the executable or translation
> unit. For it to be used by a library, it must be accessible through
> the language.

But you can do that right now. It's a technique called "type traits".
For each type you want to consider, you create a "traits" template
that lists all its features. It doesn't come with the language, but it
is not hard to do either.

Just to give you an idea what I do by the means of autoconf and
traits:


/// TypeTrait
// Abstract base type, specialized versions supply all the data required.
template<typename type>
struct TypeTrait {
};
// This structure performs the reverse lookup: From ID to type.
template<int ID>
struct IDTrait {
};
///

/// TypeTrait<UBYTE>
template<>
struct TypeTrait<UBYTE> {
  //
  enum {
    isSigned = false
  };
  enum {
    isFloat = false
  };
  enum {
    TypeID = CTYP_UBYTE
  };
  enum {
    ByteSize = 1
  };
  enum {
    BitSize = 8
  };
  enum {
    MSBMask = 0x80UL
  };
  enum {
    Min = MIN_UBYTE
  };
  enum {
    Max = MAX_UBYTE
  };
  //
  // Related types
  typedef BYTE Signed;
  typedef UBYTE Unsigned;
};
template<>
struct IDTrait<CTYP_UBYTE> {
  typedef UBYTE Type;
};
///

/// TypeTrait<BYTE>
template<>
struct TypeTrait<BYTE> {
  //
  enum {
    isSigned = true
  };
  enum {
    isFloat = false
  };
  enum {
    TypeID = CTYP_BYTE
  };
  enum {
    ByteSize = 1
  };
  enum {
    BitSize = 8
  };
  enum {
    Min = MIN_BYTE
  };
  enum {
    Max = MAX_BYTE
  };
  enum {
    SignBit = 7
  };
  enum {
    SignMask = 0x80
  };
  //
  // Related types
  typedef BYTE Signed;
  typedef UBYTE Unsigned;
};
template<>
struct IDTrait<CTYP_BYTE> {
  typedef BYTE Type;
};
///

plus many others. In this project, I need 8-bit ints, and check that
by autoconf. I do not care about endianness here.
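
For readers without the autoconf-generated definitions (UBYTE,
CTYP_UBYTE and so on), a comparable trait can be sketched on top of
<limits> and <climits> alone. This is not the code from the project
described above; IntTrait and its member names are hypothetical.

```cpp
#include <climits>
#include <limits>

// Hypothetical trait that derives its values from the standard
// library instead of from configure-time macros.
template <typename T>
struct IntTrait {
    static const bool isSigned  = std::numeric_limits<T>::is_signed;
    static const int  ByteSize  = sizeof(T);             // measured in chars
    static const int  BitSize   = sizeof(T) * CHAR_BIT;  // storage bits
    static const int  ValueBits = std::numeric_limits<T>::digits;  // bits excluding sign
};
```

The trade-off is that this version reports whatever the platform
provides, whereas the autoconf approach rejects platforms that lack
the required 8-bit types at configure time.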

> (Someone will object, correctly, that we cannot know what bit-level
> semantics may be implemented in the future, and conclude,
> incorrectly, that we cannot therefore enumerate the widely used
> small set of existing ones. The problem is neither large nor
> complex. As an order of magnitude, the variety of int sizes, char
> sizes (in bits), integer bit semantics, and floating point
> representations for which, say, GNU can produce executable code
> each number less than 10.)

I still don't understand your problem then. If you say it is neither a
large nor a complex problem for you, where is the problem in the
first place? I use templates like the above, and I populate them by
autoconf. Problem solved. Why do I need to extend C++ as a language
for that, all the more so since the types I enumerate here (e.g. the
8-bit unsigned integer you see above) are not even available on some
machines C++ cares about? That I do not care about these machines is
another matter, but it is my matter, and not the matter of the
standards committee.

So long,
Thomas

Thomas Richter

unread,
Jun 4, 2013, 8:56:29 AM6/4/13
to
{ This thread is becoming somewhat repetitive. Please keep that in
mind; thanks! -mod }


On 02.06.2013 22:41, James K. Lowden wrote:
>
> On Fri, 24 May 2013 10:59:24 -0700 (PDT)
> Thomas Richter<th...@math.tu-berlin.de> wrote:
>
>>> Like you, I live in the land of colossal computing where programs
>>> are measured in compute-hours (or days). Unlike you, apparently,
>>> my programs also move a lot of data over the wire. The basic
>>> questions I have, and I haven't found a convincing answer is:
>> *) Why is C++ the right solution for the job?
>
> I have answered that to the best of my ability. When I tell someone
> that my compute jobs are measured in CPU-days, and am then asked why
> C++ is the right language, I tend to agree with Jamie Zawinski that
> maybe tending bar would be more profitable.

Well, to ask this question again, have you tried anything *but* C++,
then?

>> *) Why is using external tools on C++ not a good solution?
>
> First, because it requires two parties to agree on a tool.

Why do you need to agree on such a tool? You would need to agree on an
interface, indeed, but whether this interface is hand-written, or
machine-written should not matter for your customers.

> You seem to start with the assumptions that the programmer
> begins with a clean slate and controls all technical choices. Drop
> those assumptions, and imposing an external tool becomes infeasible.

It remains perfectly feasible. I still don't get the problem. Describe
the structures you need to handle in your database in some external
language, probably cut-down C. Write a small parser, probably in yacc
or bison, that creates from it the C++ classes plus the necessary code
to map them to the database layer. You don't need to agree with anyone
how these tools work because they are not part of the interface. They
just create the code that implements the interface.

> My personal hobbyhorse is database client libraries: They are harder
> to use than would otherwise need be because the programmer is forced
> to define operator<< (or similar) for every data structure
> representing a row in the database he wants to use.
> That work is boring, error-prone, and pointless. It would not be
> improved in any way by introducing CORBA into the mix.

Well, probably not, but CORBA is a solution to a different problem. It
is not a database problem it solves (though I personally wouldn't want
to use CORBA for anything these days, because it is a dead technology,
but that's a different matter). Anyhow, why do you create these
classes by hand in the first place?

> The absence of metadata in the compiled output forces the programmer
> to restate the contents of any user-defined structure when the data
> pass through a process boundary. To say that work would be
> unnecessary if the compiler included the requisite metadata isn't
> just obvious; it's tautological.

It becomes less tautological if you start to realize that the compiler
is only one tool in a larger toolchain, and that you probably need to
re-think your toolchain, and use another tool and another language to
include the metadata you need, and from that derive the input the
compiler needs. C++ is not the end of all problems. It is one tool out
of many to solve a problem.

Greetings,
Thomas

AdrianH

unread,
May 8, 2013, 3:32:41 AM5/8/13
to
I've been looking around and am just shocked that after so many years
there still doesn't seem to be any way of using the optimizer or
templating system in any consistent way across compilers to generate a
compile time endianness value. Does anyone know why the C++ committee
has steered clear of this?

Even if templates or optimizers are not capable of stating this, the
COMPILERS should have something EMBEDDED to state this. It is, after
all, generating the code. And there are constants lying about in
memory too that it has to create. So logically, it knows what the
endianness of the target binary is going to be, even if it doesn't
know what it is outside the data segment that the constants are stored
in. I only know a little about multi-endian architectures, but IIRC,
they have a way to automagically transfer data between different
endian segments correctly.

Still, with cloud computing and other interoperable computing models,
COMPILERS need to take responsibility to make it possible for
programmers to make sane code without such clumsy methods as an
endian.h header file. Padding is another issue that the committee has
stayed clear of. This is just so infuriating in this day and age.

Actually, a FAR BETTER alternative would be to be able to force a type
to be of a specific endianness and have padding of a struct be able to
be forced as well. If you don't specify, then the compiler is free to
do what it wants. This way, the compiler can do whatever
optimizations it needs to make it work as optimally as possible in
either case, removing the endianness and padding issues from the non-
standard compiler world, to a more cooperative interoperable world.
Those who don't need it, don't have to worry about it. Those who do,
don't have to go writing code that could potentially barf on some
systems.

I think we should storm the C++ committee's castle and tell them that
this insanity has gone on long enough. Who's with me? :)

Comments are definitely welcome.


A

Andy Champ

unread,
May 8, 2013, 8:22:26 AM5/8/13
to
On 08/05/2013 08:32, AdrianH wrote:
> I've been looking around and am just shocked that there still after
> so many years, there doesn't seem to be any way of using the
> optimizer or templating system in any consistent way across
> compilers to generate a compile time endianness value. Does anyone
> know why the C++ committee has steered clear of this?

If you force a compiler to store data in the normal (human style)
format, or the byteswapped (Intel style) format, then the processors
which access the data and have native formats that differ from your
stored format will have to perform a byte swapping operation on every
memory access to that structure. This is likely to cause an
unacceptable performance degradation. I have never worked on one, but
I understand there are processors that store a 32-bit quantity in a
3412 byte order, not the common 1234 or 4321!

For padding, different processors have different padding requirements,
and it is difficult to predict future requirements. Should the
compilers force padding of everything to 8K page boundaries, just to
be certain that the padding is accurate for all future processors?

Remember that the format of the numbers themselves may vary. I've
worked on systems where a native integer is 16, 24, 32 and 36 bits
long. To force the 36-bit one down to 32 bits would require an
operation on every memory store - probably every operation, for true
compatibility - and to force some of the 16-bit ones, which have no
hardware multiply or divide, to perform every operation in 32 bits
would be far too slow.

IMHO the best place to perform these conversions is during I/O. I'm
going to stand on the battlements and defend the castle alongside
those who produce the specification.

Andy
--
(this time anyway... I still think iterators into set should not be const!)

Joshua Maurice

unread,
May 8, 2013, 8:23:44 AM5/8/13
to
On May 8, 12:32 am, AdrianH <adrianh....@googlemail.com> wrote:
> I've been looking around and am just shocked that there still after
> so many years, there doesn't seem to be any way of using the
> optimizer or templating system in any consistent way across
> compilers to generate a compile time endianness value. Does anyone
> know why the C++ committee has steered clear of this?
[snip]

I'm sorry. I'm confused and lost. Exactly what functionality do you
think you want? Can you give a specific example please?

Jack Adrian Zappa

unread,
May 8, 2013, 4:23:55 PM5/8/13
to
On May 8, 8:23 am, Joshua Maurice <joshuamaur...@googlemail.com>
wrote:
> I'm sorry. I'm confused and lost. Exactly what functionality do you
> think you want? Can you give a specific example please?

I'm thinking about interoperability between compilers and platforms.
Streaming/storing data to be interpreted by another computer on
another system.

Currently we have things like htons(), htonl(), ntohs() and ntohl(),
which are fairly primitive and depend on macros defined in
non-standard header files, usually starting with endian.h. This is
the best we have at the moment.

By making endianness into a modifier, (and perhaps other attributes
like bit size as well), this can allow the compiler to deal with the
low level details, optimizing the system to the best of its ability.
After all, the compiler knows the size and endianness of the base types
on the native system it is compiling for. And with strict aliasing
coming about, trying to determine such things is going to be harder if
not impossible. Also, it is currently not possible (across compilers)
to determine this at compile time anyway (hence the endian.h header
file, which is architecture specific and is not on all systems).
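
For comparison, a run-time check that avoids both endian.h and
aliasing problems can be sketched with memcpy. It assumes 8-bit chars
and a C++11 <cstdint> that provides uint32_t; the enum ByteOrder and
the function detect_byte_order are names invented for this example,
and the result is not a compile-time constant.

```cpp
#include <climits>
#include <cstdint>
#include <cstring>

#if CHAR_BIT != 8
#error This sketch assumes 8-bit char
#endif

enum ByteOrder { LittleEndian, BigEndian, OtherEndian };

// Copy a known 32-bit pattern into a byte array and inspect the layout.
// memcpy sidesteps the strict-aliasing rules that a pointer cast would hit.
inline ByteOrder detect_byte_order() {
    const std::uint32_t probe = 0x01020304u;
    unsigned char bytes[sizeof probe];
    std::memcpy(bytes, &probe, sizeof probe);
    if (bytes[0] == 0x04 && bytes[1] == 0x03 &&
        bytes[2] == 0x02 && bytes[3] == 0x01)
        return LittleEndian;
    if (bytes[0] == 0x01 && bytes[1] == 0x02 &&
        bytes[2] == 0x03 && bytes[3] == 0x04)
        return BigEndian;
    return OtherEndian;  // e.g. a mixed-endian layout
}
```

Optimizers commonly fold such a function to a constant, but nothing
in the standard promises that, which is the complaint being made.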


A

Jack Adrian Zappa

unread,
May 8, 2013, 4:23:35 PM5/8/13
to
Oh, that new google groups is crap. Just ate my two new messages. >:(

{ Two earlier submissions of yours did arrive in the moderation queue;
I'm choosing the more recent ones to be accepted. -mod/sk }

On May 8, 8:22 am, Andy Champ <no....@nospam.invalid> wrote:

> If you force a compiler to store data in the normal (human style)
> format, or the byteswapped (Intel style) format, then the processors
> which access the data and have native formats that differ from your
> stored format will have to perform a byte swapping operation on every
> memory access to that structure. This is likely to cause an
> unacceptable performance degradation. I have never worked on one, but
> I understand there are processors that store a 32-bit quantity in a
> 3412 byte order, not the common 1234 or 4321!

I didn't specify endianness, this is compiler driven (actually
hardware, but the compiler 'knows what is best').

> For padding, different processors have different padding requirements,
> and it is difficult to predict future requirements. Should the
> compilers force padding of everything to 8K page boundaries, just to
> be certain that the padding is accurate for all future processors?

I didn't specify padding either. But this would be limited to
structs, probably only PODs. A minimum would be something like
gcc's __attribute__((packed)) to remove all padding, but it may be
useful to go halfway and allow an arbitrary alignment specification.
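
Short of a packing attribute, a layout can at least be verified in
standard C++11. The sketch below orders members so that no padding is
expected and then checks the result with static_assert; the struct
WireHeader and its fields are hypothetical, and the checks assume
8-bit chars and the availability of the exact-width types.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical on-the-wire header. Members are ordered largest first
// so that natural alignment should leave no gaps between them.
struct WireHeader {
    std::uint32_t length;
    std::uint16_t kind;
    std::uint16_t flags;
};

// Compilation fails if the implementation inserts padding, instead of
// the program silently exchanging a different layout.
static_assert(sizeof(WireHeader) == 8, "unexpected padding in WireHeader");
static_assert(offsetof(WireHeader, kind) == 4, "unexpected offset of kind");
static_assert(offsetof(WireHeader, flags) == 6, "unexpected offset of flags");
```

This does not force a layout the way a packed attribute would; it
only turns a mismatch into a compile-time error.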

> Remember that the format of the numbers themselves may vary. I've
> worked on systems where a native integer is 16, 24, 32 and 36 bits
> long. To force the 36-bit one down to 32 bits would require an
> operation on every memory store - probably every operation, for true
> compatibility - and to force some of the 16-bit ones, which have no
> hardware multiply or divide, to perform every operation in 32 bits
> would be far too slow.
>
> IMHO the best place to perform these conversions is during I/O. I'm
> going to stand on the battlements and defend the castle alongside
> those who produce the specification.

Yes, these types do exist and need to be accounted for. And yes,
these would be slower than native representations, but this is for
interoperability, not for speed. Not using these parts of the
language would not cause a performance degradation. Using them would
allow the compiler to select the least amount of degradation possible.


--

Seungbeom Kim

unread,
May 9, 2013, 2:59:21 AM5/9/13
to
On 2013-05-08 00:32, AdrianH wrote:
>
> Still, with cloud computing and other interoperable computing
> models, COMPILERS need to take responsibility to make it possible
> for programmers to make sane code without such clumsy methods as an
> endian.h header file. Padding is another issue that the committee
> has stayed clear of. This is just so infuriating in this day and
> age.
>
> Actually, a FAR BETTER alternative would be to be able to force a
> type to be of a specific endianness and have padding of a struct be
> able to be forced as well.

As far as I know, these issues (endianness and padding) are all
relevant in serialization or external data formats, and I don't see
how good it would be to enforce a specific endianness or padding in
internal data.

The natural way is to read data from outside (file, network, etc.)
doing any necessary conversion, process the data in memory in the
machine's native format, and write data to outside doing any necessary
conversion.
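
A minimal sketch of that pattern, assuming a little-endian wire format
and 8-bit bytes; the function names store_le32, load_le32 and
round_trip_check are made up for this example. The shifts operate on
values, not on memory, so the code does not depend on the host's byte
order.

```cpp
#include <cassert>
#include <cstdint>

// Write v into out[0..3] in little-endian order, whatever the host order is.
inline void store_le32(unsigned char* out, std::uint32_t v) {
    out[0] = static_cast<unsigned char>(v & 0xFFu);
    out[1] = static_cast<unsigned char>((v >> 8) & 0xFFu);
    out[2] = static_cast<unsigned char>((v >> 16) & 0xFFu);
    out[3] = static_cast<unsigned char>((v >> 24) & 0xFFu);
}

// Read four little-endian bytes back into a native integer.
inline std::uint32_t load_le32(const unsigned char* in) {
    return  static_cast<std::uint32_t>(in[0])
         | (static_cast<std::uint32_t>(in[1]) << 8)
         | (static_cast<std::uint32_t>(in[2]) << 16)
         | (static_cast<std::uint32_t>(in[3]) << 24);
}

// Convert on the way out and on the way in; work on native values in between.
inline void round_trip_check(std::uint32_t v) {
    unsigned char wire[4];
    store_le32(wire, v);
    assert(wire[0] == (v & 0xFFu));   // least significant byte goes first
    assert(load_le32(wire) == v);
}
```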

I can imagine that some people want to read and write structs all at
once by a call to [f]read/[f]write, without any conversion, but even
if endianness and padding can be controlled, the method has very
limited applicability because it cannot handle pointer members and
non-POD structs (e.g. one with a vtbl). It cannot handle even very
simple structs that have a 'std::string name;' or 'const char* name;'
member but only something much more primitive like 'char
name[NAMELEN_MAX];'. So controlling endianness and padding will not
help and you have to learn to do proper serialization at some point
anyway.
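
As a rough illustration of what "proper serialization" can look like
for such a struct, here is a sketch with a hypothetical Person record
and an assumed wire format (16-bit little-endian age, 16-bit
little-endian name length, then the name bytes). None of these names
come from a real library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record; the std::string member rules out a raw fwrite().
struct Person {
    std::uint16_t age;
    std::string   name;
};

// Append a 16-bit value as two little-endian bytes.
inline void put_u16(std::vector<unsigned char>& out, std::uint16_t v) {
    out.push_back(static_cast<unsigned char>(v & 0xFFu));
    out.push_back(static_cast<unsigned char>(v >> 8));
}

// Read two little-endian bytes starting at pos.
inline std::uint16_t get_u16(const std::vector<unsigned char>& in, std::size_t pos) {
    return static_cast<std::uint16_t>(in[pos] | (in[pos + 1] << 8));
}

inline std::vector<unsigned char> serialize(const Person& p) {
    assert(p.name.size() <= 0xFFFFu);   // length must fit the 16-bit prefix
    std::vector<unsigned char> out;
    put_u16(out, p.age);
    put_u16(out, static_cast<std::uint16_t>(p.name.size()));
    out.insert(out.end(), p.name.begin(), p.name.end());
    return out;
}

inline Person deserialize(const std::vector<unsigned char>& in) {
    assert(in.size() >= 4);             // two 16-bit fields at minimum
    Person p;
    p.age = get_u16(in, 0);
    const std::size_t len = get_u16(in, 2);
    assert(in.size() >= 4 + len);       // the name must be fully present
    const auto first = in.begin() + 4;
    p.name.assign(first, first + static_cast<std::ptrdiff_t>(len));
    return p;
}
```

The string is written as a length followed by its characters, so no
pointer ever reaches the wire.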

--
Seungbeom Kim

Thomas Richter

May 9, 2013, 3:03:42 AM
On 08.05.2013 09:32, AdrianH wrote:
> I've been looking around and am just shocked that, after so many
> years, there still doesn't seem to be any way of using the
> optimizer or templating system in any consistent way across
> compilers to generate a compile time endianness value. Does anyone
> know why the C++ committee has steered clear of this?

Probably because there is no reason to steer into this?

> Even if templates or optimizers are not capable of stating this.
> The COMPILERS should have something EMBEDDED to state this.

Sorry, but which use case do you want to solve? If any, then as far as
I see, only I/O is affected, and then this is not even an issue if you
design your I/O properly. But proper software design is not a matter
of a standard or a committee, but rather of the program author.

> Actually, a FAR BETTER alternative would be to be able to force a
> type to be of a specific endianness and have padding of a struct be
> able to be forced as well.

This would affect performance dramatically, for little benefit. If you
want to serialize structures or classes, then you cannot write them to
disk directly anyhow, even with a standard-defined padding and
endianness (otherwise, please explain to me how to serialize a
pointer). So it requires a user-written solution in the first place,
and given that the slow part of serialization is the I/O, why
shouldn't I write out the structures in some defined form to begin
with? Say, ASCII?

> If you don't specify, then the compiler is free to do what it wants.

s/wants/should do/ And this is basically because, if you forced the
compiler to do otherwise, things would be rather slow. C and C++ are
designed that way, namely with performance in mind.

> This way, the compiler can do whatever optimizations it needs to
> make it work as optimally as possible in either case, removing the
> endianness and padding issues from the non-standard compiler world,
> to a more cooperative interoperable world.

The way C is designed, this is and should be a matter for the
program author.

> Those who don't need it, don't have to worry about it. Those who
> do, don't have to go writing code that could potentially barf on
> some systems.

Actually, there is little you can do on systems with ANSI C only: not
even a graphical user interface, or such elementary things as getting
a key press. C doesn't have all that. It does have the ability to link
to system-defined functions and to adapt to its environment, and
thereby becomes portable. Thus, what you describe is probably more an
issue of "upper level" operating system standards (not that Microsoft,
for example, would care), and it's surely not a matter of C.

> I think we should storm the C++ committee's castle and tell them
> that this insanity has gone on long enough. Who's with me? :)

I'm not.

Joshua Maurice

May 9, 2013, 3:07:19 AM
On May 8, 12:30 pm, Jack Adrian Zappa <adrianh....@googlemail.com>
wrote:
> On May 8, 8:22 am, Andy Champ <no....@nospam.invalid> wrote:
> > Remember that the format of the numbers themselves may vary. I've
> > worked on systems where a native integer is 16, 24, 32 and 36 bits
> > long. To force the 36-bit one down to 32 bits would require an
> > operation on every memory store - probably every operation, for
> > true compatibility - and to force some of the 16-bit ones, which
> > have no hardware multiply or divide, to perform every operation in
> > 32 bits would be far too slow.
>
> > IMHO the best place to perform these conversions is during
> > I/O. I'm going to stand on the battlements and defend the castle
> > alongside those who produce the specification.
>
> Yes, these types do exist and need to be accounted for. And yes,
> these would be slower than native representations, but this is for
> interoperability, not for speed. Not using these parts of the
> language would not cause a performance degradation. Using them
> would allow the compiler to select the least amount of degradation
> possible.

I don't see a use case. You still haven't provided one. What is a
specific concrete example where you would use this?

I just thought of something. Is this because you want an executable
compiled for one system to be executable by another? Is this your use
case? That's unrealistic, and padding and endianness are the least of
your concerns. C++ is not the language you want if you want to be able
to port an executable to many different systems. That just runs
counter to several of the design goals of C++, and changing C++ to do
that would probably require an almost completely new compiler
implementation (and a VM) to make "C++" executables executable on many
different systems. Example: Java.

Andy Champ

May 9, 2013, 3:14:56 AM
On 08/05/2013 21:23, Jack Adrian Zappa wrote:
> Oh, that new google groups is crap. Just ate my two new
> messages. >:(

I take it you are also AdrianH, the original poster - same email
address, but different name? That's a little confusing.

>
<snip>
>
> I didn't specify endianness, this is compiler driven (actually
> hardware, but the compiler 'knows what is best').
>
Well, your original post says "I only know a little about multi-endian
architectures, but IIRC, they have a way to automagically transfer data
from different endian segments correctly."

None of the processors I've ever used have variable endian settings.

But I'm not going to comment directly on your post, but ask you to
step back - what is the problem that you want to solve that makes you
want to specify endianness? Ditto for padding.

Andy

Jack Adrian Zappa

May 9, 2013, 10:25:39 AM
You're right, padding is not so much of an issue. It is just somewhat
of a convenience, assuming you don't use pointers, and I thought I had
mentioned that this would be only for POD structs. If not, then
that's my bad. I had someone mention padding to me at one point (not
in this thread), which got me on to this tangent. However, it
could potentially make for cleaner code which is easier to read.

And this is not for internal data. It is for serialization, but in a
way that allows for the binary data to be defined explicitly, rather
than implicitly from the architecture you are coming from.

The rationale is to be able to transfer binary data efficiently across
a medium. One can enforce byte order using shifts and masks, but it
is inefficient. Though I am starting to think that this isn't going
to help. However, somehow knowing internal representation could allow
for efficient communication between systems using like types, while
allowing for a slightly less efficient translation system for those of
different types.


--

Jack Adrian Zappa

May 9, 2013, 8:00:01 PM
On May 9, 3:03 am, Thomas Richter <t...@math.tu-berlin.de> wrote:
> > Even if templates or optimizers are not capable of stating this.
> > The COMPILERS should have something EMBEDDED to state this.
>
> Sorry, but which use case do you want to solve? If any, then as far as
> I see, only I/O is affected, and then this is not even an issue if you
> design your I/O properly. But proper software design is not a matter
> of a standard or a committee, but rather of the program author.

The problem is to be able to transmit binary data cleanly (code-wise)
and efficiently (little to no binary manipulation). Being able to
determine endianness would allow for more efficient passing of binary
data. If I didn't want efficiency, I'd use XML.
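
For what it is worth, the byte order can already be probed at run time
in portable code. The sketch below is only an illustration; ByteOrder,
host_byte_order and can_skip_swap are names invented for it, and it
assumes std::uint32_t exists on the target. (C++20 later added
std::endian in <bit> for a compile-time answer.)

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

enum class ByteOrder { little, big, other };

// Inspect how the host lays out a known 32-bit value in memory.
inline ByteOrder host_byte_order() {
    const std::uint32_t probe = 0x01020304u;
    unsigned char bytes[sizeof probe];
    std::memcpy(bytes, &probe, sizeof probe);
    if (bytes[0] == 0x04 && bytes[1] == 0x03 && bytes[2] == 0x02 && bytes[3] == 0x01)
        return ByteOrder::little;
    if (bytes[0] == 0x01 && bytes[1] == 0x02 && bytes[2] == 0x03 && bytes[3] == 0x04)
        return ByteOrder::big;
    return ByteOrder::other;   // e.g. a middle-endian layout
}

// Two peers reporting the same known order can exchange raw 32-bit values
// without swapping; anything else needs a conversion step.
inline bool can_skip_swap(ByteOrder local, ByteOrder remote) {
    return local == remote && local != ByteOrder::other;
}
```

A pair of devices could exchange this value once during a handshake and
then pick either the raw path or the converting path.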

> > Actually, a FAR BETTER alternative would be to be able to force a
> > type to be of a specific endianness and have padding of a struct be
> > able to be forced as well.
>
> This would affect performance dramatically, for little benefit. If you
> want to serialize structures or classes, then you cannot write them to
> disk directly anyhow, even with a standard-defined padding and
> endianness (otherwise, please explain me how to serialize a
> pointer). So it requires a user-generated solution in first place, and
> given that the slow part about serialization is the I/O part, why
> shouldn't I write out the structures in some defined form in first
> place. Say ASCII?

It would not if used in the confined context of IO. And yes, you
can't serialize a pointer. That constraint is needed. As for writing
in ASCII, there are many places where limited bandwidth is a problem
as well as limited CPU cycles. I'm not talking about a desktop, or
even a laptop. I'm talking about seriously constrained
architectures. Yes, IO is slow, but it can be done concurrently
through hardware with some other operation(s). Not stealing cycles
from that/those operation(s) is important.

> > If you don't specify, then the compiler is free to do what it wants.
>
> s/wants/should do/ And this basically because if you would force the
> compiler to do otherwise, things would be rather slow. C or C++ are
> designed that way, namely with performance in mind.

Yes, that is my intent. This is not for use outside of IO as I stated
previously.

> > This way, the compiler can do whatever optimizations it needs to
> > make it work as optimally as possible in either case, removing the
> > endianness and padding issues from the non-standard compiler world,
> > to a more cooperative interoperable world.
>
> The way how C is designed this is and should be the matter of the
> program author.

Not sure what you are saying here.

> > Those who don't need it, don't have to worry about it. Those who
> > do, don't have to go writing code that could potentially barf on
> > some systems.
>
> Actually, there is little you can do on systems with ANSI-C only. Not
> even a graphical user interface or such elementary things like getting
> a key press. C doesn't have all that. It does have the ability to link
> to system defined functions and to adapt to its environment, and by
> that become portable. Thus, what you describe is probably more an
> issue of "upper level" operating system standards (not that Microsoft,
> for example, would care), and it's surely not a matter of C.

No, low level. I'm trying to say that there should be some standard
way of interrogating the binary layout of numbers. And not everything
has an OS.

Öö Tiib

May 9, 2013, 7:06:39 PM
On Thursday, 9 May 2013 16:30:02 UTC+3, Jack Adrian Zappa wrote:
> The rationale is to be able to transfer binary data efficiently across
> a medium.

You can read the C++ FAQ; it has a section about serialization. Then
you can study various libraries that support binary serialization
to various formats. Then you can come up with a more useful
suggestion for how C++ could support or simplify writing such
libraries more efficiently. Right now it feels as though you have
some misconceptions about serialization.

> One can enforce byte order using shifts and masks, but it
> is inefficient.

It is inefficient to keep data in memory in some file format. It is
inefficient and non-portable to keep data in a file in the memory
layout of a particular platform. So what you request seems
doomed to be fundamentally inefficient.

Andy Champ

May 9, 2013, 7:07:09 PM
On 09/05/2013 15:25, Jack Adrian Zappa wrote:
> And this is not for internal data. It is for serialization, but in a
> way that allows for the binary data to be defined explicitly, rather
> than implicitly from the architecture you are coming from.

We've all lived for years with the pain of serialisation. It's a lot
less than the pain of translating on the fly in and out of memory.

Remember that your binary data might not make sense to my computer, even
when the byte order is correct - I might not like the size of your
integer. And floating points are a nightmare...
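
A small sketch of how code can at least state such assumptions up
front, so that a platform which does not meet them fails to compile
instead of silently exchanging garbage. The helper names float_to_bits
and bits_to_float are invented for this example, and the checks cover
only the 32-bit IEEE 754 case.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(CHAR_BIT == 8, "this sketch assumes 8-bit bytes");
static_assert(std::numeric_limits<float>::is_iec559,
              "this sketch assumes IEEE 754 float");
static_assert(sizeof(float) == sizeof(std::uint32_t),
              "this sketch assumes a 32-bit float");

// Reinterpret the float's bits as an integer, which can then be written
// with any fixed byte order.
inline std::uint32_t float_to_bits(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Inverse of float_to_bits.
inline float bits_to_float(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```

On a machine with 36-bit words the first assertion fails, which is the
point: the mismatch is reported at build time.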

Andy

Jack Adrian Zappa

May 9, 2013, 8:02:29 PM
On May 9, 3:14 am, Andy Champ <no....@nospam.invalid> wrote:
> On 08/05/2013 21:23, Jack Adrian Zappa wrote:
>
> > Oh, that new google groups is crap. Just ate my two new
> > messages. >:(
>
> I take it you are also AdrianH, the original poster - same email
> address, but different name? That's a little confusing.

Yeah, thank google groups.

> > I didn't specify endianness, this is compiler driven (actually
> > hardware, but the compiler 'knows what is best').
>
> Well, your original post says "I only know a little about multi-endian
> architectures, but IIRC, they have a way to automagically transfer data
> from different endian segments correctly."
>
> None of the processors I've ever used have variable endian settings.

PowerPC is one of a few examples.

> But I'm not going to comment directly on your post, but ask you to
> step back - what is the problem that you want to solve that makes you
> want to specify endianness? Ditto for padding.

Padding is more a convenience, syntactic sugar to allow for reading
part of a binary stream by overlaying a struct over top of it.

Endianness is for communication compatibility. Being able to know
what computer A uses in comparison to computer B would allow for less
bit manipulation if A and B are compatible.

Jens Schmidt

May 9, 2013, 8:00:57 PM
Jack Adrian Zappa wrote:

> The rationale is to be able to transfer binary data efficiently across
> a medium. One can enforce byte order using shifts and masks, but it
> is inefficient. Though I am starting to think that this isn't going
> to help. However, somehow knowing internal representation could allow
> for efficient communication between systems using like types, while
> allowing for a slightly less efficient translation system for those of
> different types.

For "it is inefficient", as usual, please provide measurements. You will
be surprised by modern compilers and their optimization capabilities.
For example, ensuring the byte order by shifts and masks often compiles
into the same machine code as accessing the bytes through aliasing.
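
A sketch of the two variants one might compare, with memcpy standing in
for the aliasing access; the function names are made up for this
example. Whether both really compile to the same instructions depends
on the compiler, target and flags, so the generated assembly (e.g. from
g++ -O2 -S) has to be inspected rather than assumed.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Variant 1: assemble the value from individual bytes with shifts.
inline std::uint32_t load_le32_shift(const unsigned char* p) {
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Variant 2: copy the bytes straight into a native integer. The result
// matches variant 1 only on a little-endian host.
inline std::uint32_t load_native32(const unsigned char* p) {
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

// True when the native layout of a 32-bit integer is little-endian.
inline bool host_is_little_endian() {
    const unsigned char probe[4] = {0x04, 0x03, 0x02, 0x01};
    return load_native32(probe) == 0x01020304u;
}
```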
--
Greetings,
Jens Schmidt

Jack Adrian Zappa

May 9, 2013, 8:00:16 PM
On May 9, 3:07 am, Joshua Maurice <joshuamaur...@googlemail.com>
wrote:
> I don't see a use case. You still haven't provided one. What is a
> specific concrete example where you would use this?

Specific use case: Communication between a microcontroller and other
systems via streams or file storage.
Requirements: Need to be able to stipulate the binary representation
of data between systems for clean and compatible communication
interchange while requiring low memory and processing overhead on the
part of the microcontroller.

> Is this because you want an executable
> compiled for one system to be executable by another?

No.

Jack Adrian Zappa

May 9, 2013, 8:03:40 PM
On May 8, 8:22 am, Andy Champ <no....@nospam.invalid> wrote:
> Remember that the format of the numbers themselves may vary. I've
> worked on systems where a native integer is 16, 24, 32 and 36 bits
> long. To force the 36-bit one down to 32 bits would require an
> operation on every memory store - probably every operation, for true
> compatibility - and to force some of the 16-bit ones, which have no
> hardware multiply or divide, to perform every operation in 32 bits
> would be far too slow.

Hey Andy,

I was thinking about this and realized that std::int8_t, std::int16_t,
std::int32_t and std::int64_t types would not be implemented on such
systems as they are optional. I find that interesting as it is part
of a 'standard' library <cstdint>. This seems to show that already
the C++ community is sort of splintering which is not necessarily a
good thing. Would it not be a better idea that more generic types,
even if slower, be made available to bring the community back
together?
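
For comparison, <cstdint> does already offer something in that
direction: the exact-width types are optional, but the "least" and
"fast" variants are required everywhere. A minimal sketch, with mask32
and add_wrap32 as hypothetical helper names, of getting 32-bit
wrap-around behaviour even where the native word is wider:

```cpp
#include <climits>
#include <cstdint>

// std::uint_least32_t is required to exist; std::uint32_t is optional.
// On a machine with 36-bit words the least type may be wider than 32 bits,
// so the value is masked down to the 32 bits the wire format carries.
constexpr std::uint_least32_t mask32(std::uint_least32_t v) {
    return v & 0xFFFFFFFFul;
}

// Addition modulo 2^32 that behaves the same regardless of the native width.
constexpr std::uint_least32_t add_wrap32(std::uint_least32_t a,
                                         std::uint_least32_t b) {
    return mask32(a + b);
}

static_assert(sizeof(std::uint_least32_t) * CHAR_BIT >= 32,
              "guaranteed by the standard");
static_assert(add_wrap32(0xFFFFFFFFul, 1u) == 0u, "wraps at 2^32");
```

The mask costs an extra operation on the wider machines, which is
exactly the slowdown being discussed.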


A


--

Richard

May 10, 2013, 2:36:18 AM
[Please do not mail me a copy of your followup]

Jack Adrian Zappa <adria...@googlemail.com> spake the secret code
<a38ee95a-04fe-434e...@b2g2000yqe.googlegroups.com>
thusly:

>On May 9, 3:14 am, Andy Champ <no....@nospam.invalid> wrote:
>> None of the processors I've ever used have variable endian
>> settings.
>
>PowerPC is one of a few examples.

MIPS R3000 is another example, although I don't know how far forward
they carried that down the product line.
--
"The Direct3D Graphics Pipeline" free book <http://tinyurl.com/d3d-pipeline>
The Computer Graphics Museum <http://computergraphicsmuseum.org>
The Terminals Wiki <http://terminals.classiccmp.org>
Legalize Adulthood! (my blog) <http://legalizeadulthood.wordpress.com>

Andy Champ

May 10, 2013, 10:17:10 AM
On 10/05/2013 07:36, Richard wrote:
> [Please do not mail me a copy of your followup]
>
> Jack Adrian Zappa <adria...@googlemail.com> spake the secret code
> <a38ee95a-04fe-434e...@b2g2000yqe.googlegroups.com>
> thusly:
>
>> On May 9, 3:14 am, Andy Champ <no....@nospam.invalid> wrote:
>>> None of the processors I've ever used have variable endian
>>> settings.
>>
>> PowerPC is one of a few examples.
>
> MIPS R3000 is another example, although I don't know how far forward
> they carried that down the product line.
>
OK. I'll rephrase that. None of the processors I've used in the
past... It appears later SPARC and ARM chips are variable too. I
expect to be using ARM in the near future.

But it's still a problem best addressed during I/O; the cost of
translating once on the way in and once on the way out is likely to be
many times less than the cost of translating on the fly during
processing.

Andy


--

Thomas Richter

May 10, 2013, 11:23:13 AM

On 10.05.2013 02:00, Jack Adrian Zappa wrote:

>> Sorry, but which use case do you want to solve? If any, then as far
>> as I see, only I/O is affected, and then this is not even an issue
>> if you design your I/O properly. But proper software design is not
>> a matter of a standard or a committee, but rather of the program
>> author.
>
> The problem is to be able to transmit binary data cleanly (code
> wise) and efficiently (no to little binary manipulation). Being able
> to determine endianness would allow for more efficient passing of
> binary data. If I didn't want efficiency, I'd use XML.

So, in other words, you want to trade memory bandwidth for IO
bandwidth? This seems like a bad trade to me. Memory bandwidth is
orders of magnitude higher, and the performance "degradation" you get
by doing endian swaps yourself in program code is small compared to
the degradation you get by requiring a specific memory order.

I have been writing libraries for years, and I/O speed was always the
slowest part of the operation, so adding an endian switch there never
made a difference. Where things do make a difference is in tight signal
processing loops. If the compiler had to do an implicit endian switch
there for me, this would have a measurable performance impact.

Of course, if you have different data, please feel free to report.

>> This would affect performance dramatically, for little benefit. If
>> you want to serialize structures or classes, then you cannot write
>> them to disk directly anyhow, even with a standard-defined padding
>> and endianness (otherwise, please explain me how to serialize a
>> pointer). So it requires a user-generated solution in first place,
>> and given that the slow part about serialization is the I/O part,
>> why shouldn't I write out the structures in some defined form in
>> first place. Say ASCII?
>
> It would not if used in the confined context of IO.

This is so, but why does it matter then? That is, if I have an
endian-guaranteed word in the I/O-oriented structure, and a
native-endian word somewhere in the heavy-processing chain, I need to
copy the data from one to another anyhow, so I do not gain anything in
terms of program simplicity.

> And yes, you can't serialize a pointer. That constraint is needed.
> As for writing in ASCII, there are many places where limited
> bandwidth is a problem as well as limited CPU cycles. I'm not
> talking about a desktop, or even a laptop. I'm talking about
> seriously constrained architectures.

Look, the cost of endian swaps etc. has to go *somewhere*. Either
you do it manually, or you let the compiler do it for you. C and C++
have the ability to be extensible by libraries. All that you need can
be done in such libraries if you want the syntactic sugar.
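
To illustrate the kind of library-level sugar meant here, a minimal
sketch of a wrapper type that always stores its value as four
little-endian bytes. The class le_uint32 is hypothetical and written
for this example only; libraries such as Boost.Endian provide
production-quality types in the same spirit.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical wrapper type: the value is always held as four little-endian
// bytes and converted on every read or write.
class le_uint32 {
public:
    le_uint32() : bytes_{} {}
    le_uint32(std::uint32_t v) : bytes_{} { *this = v; }

    // Store: split the native value into bytes, least significant first.
    le_uint32& operator=(std::uint32_t v) {
        for (int i = 0; i < 4; ++i)
            bytes_[i] = static_cast<unsigned char>((v >> (8 * i)) & 0xFFu);
        return *this;
    }

    // Load: reassemble the native value from the stored bytes.
    operator std::uint32_t() const {
        std::uint32_t v = 0;
        for (int i = 0; i < 4; ++i)
            v |= static_cast<std::uint32_t>(bytes_[i]) << (8 * i);
        return v;
    }

    // Raw bytes in wire order, e.g. for passing to a write routine.
    const unsigned char* data() const { return bytes_; }

private:
    unsigned char bytes_[4];
};

static_assert(sizeof(le_uint32) == 4, "no padding expected around a char array");
```

The conversion cost is still paid on every access; the wrapper only
hides it, so such a type belongs in the I/O structures and not in the
processing loops.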

> Yes, IO is slow, but it can be done concurrently through hardware
> with some other operation(s). Not stealing cycles from that/those
> operation(s) is important.

Either it is, or it isn't. For I/O, it isn't, because I/O speed
dominates. For all other accesses, e.g. signal processing, memory
bandwidth is more important, and then data should be in the native
organization. What you say simply makes no sense to me: either the
endian swap and padding are free, in which case you can do them
yourself anyhow, or they are not, and then it does not matter whether
the compiler performs them implicitly. It is at best a question of
syntactic sugar. And if so, I'm pretty sure boost has a solution for
that, i.e. to make the program look simpler for serialization
purposes. But the operations have to go somewhere, so why should I
clutter the core language with that?

>>> This way, the compiler can do whatever optimizations it needs to
>>> make it work as optimally as possible in either case, removing the
>>> endianness and padding issues from the non-standard compiler
>>> world, to a more cooperative interoperable world.
>>
>> The way how C is designed this is and should be the matter of the
>> program author.
>
> Not sure what you are saying here.

It is the duty of the program author to ensure interoperability of
I/O. This is how C and C++ are designed.

>> Actually, there is little you can do on systems with ANSI-C
>> only. Not even a graphical user interface or such elementary things
>> like getting a key press. C doesn't have all that. It does have the
>> ability to link to system defined functions and to adapt to its
>> environment, and by that become portable. Thus, what you describe
>> is probably more an issue of "upper level" operating system
>> standards (not that Microsoft, for example, would care), and it's
>> surely not a matter of C.
>
> No, low level. I'm trying to say that there should be some standard
> of interrogating the binary layout of numbers. And not everything
> has an OS.

Then define one, but please not within ANSI C or C++. This is the
wrong place for it because the philosophy of *this* language is a
different one. It is "do not pay for what you do not need", and I do
not need it. I can write portable I/O just fine without the help of
the language. I use libraries for that.

Greetings,
Thomas