ok, nifty idea. in my own LZ decoder, these cases are currently less
efficient, as they involve falling back to byte-based copies rather
than 32-bit or 64-bit copies.
I tried using 128-bit SSE copies, but found that on average (for the
often short LZ matches) this costs more than using 32 or 64 bit GPR copies.
like the LZ4 decoders I had looked at, the copy operations may write a
few bytes past the end of the span.
> In my own optimization to the LZ4 reference code I use unaligned SSE
> vector stores to handle any repeated pattern of lengths from 1 to 16 bytes:
>
> I load 16 bytes starting from where the reference pattern starts, then I
> use a pair of table-driven permute operations to repeat the relevant
> bytes across a pair of SSE regs, before I use a store loop like this to
> fill out the target memory area:
>
>
> do {
>     _mm_storeu_si128((__m128i *)target,        sse_reg_1);
>     _mm_storeu_si128((__m128i *)(target + 16), sse_reg_2);
>     target += update_length;
> } while (target < block_end);
>
> update_length is of course the largest multiple of the pattern length
> that is less than or equal to 32, i.e.
>
> update_length = (32 / pattern_length) * pattern_length;
>
> except that I simply store that alongside the permutation tables.
>
ok, may have to try similar.
>> a slightly less naive loop, ex:
>> cs=src; cse=src+sz; ct=dst;
>> while(cs<cse)
>> *ct++=*cs++;
>>
>> tends to peak out at around 900 MB/s.
>> it is ~1.5 GB/s if copying 32 bits at a time (int32), and 2.0 GB/s if
>> using SSE operations (MOVDQU).
>>
>> I have a Xeon box where a similar MOVDQU copy pulls off 2.5GB/s, but
>> int32 copy is 1.0 GB/s.
>
> Yeah, unaligned 16-byte stores tend to vary significantly in speed
> between cpu models. Most handle all stores within a cache line with zero
> extra cost, crossing cache lines (which happens ~25% of the time) costs
> an extra cycle or three, while page crossings can be so expensive as to
> significantly impact the average speed.
yeah. for larger copies, they seem to be the fastest option, but tend to
suffer a bit for small irregular copies.
ex:
while(cs<cse)
    { *(u64 *)ct=*(u64 *)cs; ct+=8; cs+=8; }
may be faster than:
while(cs<cse)
{
    x0=_mm_loadu_si128((__m128i *)cs);
    _mm_storeu_si128((__m128i *)ct, x0);
    ct+=16; cs+=16;
}
though, cases where one has a match longer than 64 bytes or so (which
seems to be roughly the break-even point) are uncommon enough that more
time ends up being spent overall in the 'if()' branch deciding whether
to use an SSE copy, so it works out cheaper to just always use a GPR copy.
could depend a lot on the data being decoded though.
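for completeness, the GPR wildcopy written out in portable form might
look like this (a sketch; using memcpy of a fixed 8 bytes to dodge
strict-aliasing issues, which compilers turn into a single unaligned
load/store on x86):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* unconditional 8-byte wildcopy; assumes n > 0, may write up to 7
   bytes past dst+n (so the destination needs slack), and for
   overlapped LZ matches is only safe when the match offset is >= 8 */
static void wildcopy8(uint8_t *dst, const uint8_t *src, size_t n)
{
    uint8_t *end = dst + n;
    do {
        uint64_t v;
        memcpy(&v, src, 8);   /* compiles to one unaligned 64-bit load */
        memcpy(dst, &v, 8);   /* and one unaligned 64-bit store */
        src += 8;
        dst += 8;
    } while (dst < end);
}
```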
>>
>>
>> I have a VQ codec, as noted, pulling off 1.0 to 1.5 GB/s (255 to 384
>> megapixels/second) on the BGRx decode path, for a single-threaded
>> plain-C decoder.
>>
>> this is around 2x as fast as the H.264 HW decode on this PC, and 3x to
>> 4x as fast as XviD.
>>
>> though, like LZ4 vs Deflate, the VQ codec does have worse compression,
>> but it can passably do 720p30 at 8 Mbps and 1080p at 12 Mbps, so it
>> doesn't seem too horrible (while pretty high bitrates vs YouTube videos,
>> this is a bit lower than typical ATSC broadcast TV using MPEG-2).
>
> Nice!
yep.
this particular design (BT1H) was originally meant for my robot
projects, for use in streaming video encoding on ARM-based hardware
(700MHz ARM11), as an attempt to improve the quality/bitrate over its
predecessor at a hopefully modest speed cost (BT1G was a similar
design, but lacked an entropy coder, opting instead for a fixed-format
byte-oriented design).
however, it has done a bit better in general than expected, and is
currently doing better than my previous VQ codecs in many areas.
I have thus far mostly been using it for desktop capture, as the encoder
is currently fast enough on my K10 to encode 1080p30 at ~15% CPU load.
the current single-threaded encoder maxes the encode thread at around
45-50 fps though.
in tests it is fast enough to encode 480p30 on a Raspberry Pi (700MHz
ARM11), and (still single threaded) 720p30 on a Raspberry Pi 2 (900MHz
ARM Cortex A7 quad-core). on the RPi2, 1080p30 could be theoretically
done with a tandem encoder design.
its per-thread decode speed still doesn't reach an older codec (BT1C) in
certain modes, but it does match the old speeds in the mode I typically
used it in (essentially RPZA-like with 23-bit color endpoints and using
my BTLZH encoder as an entropy backend).
basically, in certain edge cases, my older codec could reach absurd
decode speeds (gigapixel/second territory for multi-threaded decoding).
however, these weren't really "actually usable" scenarios (for anything
more than a benchmark, things would be seriously IO-bound).
however, quality/bitrate in BT1H is significantly better than BT1C (1).
at the time, BTLZH (an extended Deflate variant) reached a speed of
around 250 MB/sec for decode, but I had more recently micro-optimized
BTLZH to 475 MB/sec. though, in pure Huffman tests, it is still a little
slower than ZLIBH and FSE-HUF (140MB/s vs 200MB/s).
BTLZH basically extends Deflate slightly (and remains backwards
compatible by design), getting BZ2 like compression but much faster
decoding, mostly by using a larger sliding window and borrowing a few
tricks from LZMA (though sticking with static Huffman, adding the
ability to predict match lengths and offsets from previous matches).
apparently the design competes in the same general space as Zstd.
both seem to have otherwise fairly comparable stats.
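the LZMA-style repeated-offset prediction can be sketched like this (the
two-slot history and the code values are illustrative of the general
trick, not BTLZH's actual bitstream encoding):

```c
#include <assert.h>
#include <stdint.h>

/* decoder-side state: most recent match offsets, most recent first */
typedef struct { uint32_t rep[2]; } lz_state;

/* map a small "offset code" to a real offset: codes 0 and 1 reuse a
   recent offset cheaply, anything else takes an explicitly coded one */
static uint32_t resolve_offset(lz_state *st, int code, uint32_t raw_offset)
{
    uint32_t off;
    if (code == 0) {
        off = st->rep[0];            /* reuse most recent offset */
    } else if (code == 1) {
        off = st->rep[1];            /* second most recent; move to front */
        st->rep[1] = st->rep[0];
        st->rep[0] = off;
    } else {
        off = raw_offset;            /* explicitly coded offset */
        st->rep[1] = st->rep[0];
        st->rep[0] = off;
    }
    return off;
}
```

the win is that repeated offsets are very common in structured data, and
a 1-2 bit code is much cheaper than re-coding a full offset.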
superficially, BT1C and BT1H are similar in that both use 4x4 pixel
blocks, and blocky VQ, with each block typically as 4x4x2b with 2-bits
encoding a value interpolated between a pair of endpoints.
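the basic 2-bit interpolated block decode can be sketched roughly like
this (DXT/RPZA-style; the index layout, rounding, and BGRx packing are
my own guesses for illustration, not the actual BT1C/BT1H formats):

```c
#include <assert.h>
#include <stdint.h>

/* build a 4-entry palette by interpolating between two RGB endpoints,
   then select per pixel with 2-bit indices (16 pixels -> 32 index bits) */
static void decode_block_4x4x2(uint32_t indices, const uint8_t ep0[3],
                               const uint8_t ep1[3], uint32_t out[16])
{
    uint32_t pal[4];
    for (int i = 0; i < 4; i++) {
        int w = i * 255 / 3;                         /* 0, 85, 170, 255 */
        uint8_t r = (uint8_t)((ep0[0]*(255-w) + ep1[0]*w + 127) / 255);
        uint8_t g = (uint8_t)((ep0[1]*(255-w) + ep1[1]*w + 127) / 255);
        uint8_t b = (uint8_t)((ep0[2]*(255-w) + ep1[2]*w + 127) / 255);
        pal[i] = (uint32_t)b | ((uint32_t)g << 8) |
                 ((uint32_t)r << 16) | 0xFF000000u;  /* pack as BGRx */
    }
    for (int p = 0; p < 16; p++)                     /* 2 bits per pixel */
        out[p] = pal[(indices >> (2 * p)) & 3];
}
```

this is also where the "output pixels are a series of 32-bit integer
stores" speed comes from: the per-pixel work is just a table lookup.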
a bit of a difference though comes up in terms of how color endpoints
are handled:
BT1C primarily used a pair of 23-bit RGB endpoints (R8G8B7);
BT1H uses YUVD or YUVDyuv, or essentially a vector in YUV space
(essentially the JPEG YCbCr colorspace).
YUVD could encode a pair of endpoints differing in luma, whereas YUVDyuv
encodes a pair of endpoints in luma/chroma space.
BT1C used BTLZH as the entropy backend, whereas BT1H uses AdRice and a
raw bitstream.
BT1H also allows more readily using a number of different types of
blocks, ex:
flat (no bits);
2x2x1 (4 bits);
2x2x2 (8 bits);
4x2x2 (16 bits);
2x4x2 (16 bits);
4x4x1 (16 bits);
4x4x2 (32 bits);
4x4x3 (48 bits);
4x4x2Y2x2x2UV (48 bits, YUV 4:2:0);
4x4x3Y2x2x2UV (64 bits, YUV 4:2:0);
4x4x2YUV (96 bits, YUV 4:4:4);
4x4x3Y4x4x2UV (112 bits, YUV 4:4:4).
the type of block is chosen per-block based on quality level and
heuristics.
the 4:2:0 and 4:4:4 blocks interpolate YUV from YUVDyuv endpoints for
individual pixels. these are used for blocks with high chroma contrast
(there was an issue with my past VQ codecs I called "the Rainbow Dash
issue", where sharp transitions in chroma, such as Rainbow Dash's mane,
look terrible without blocks that can handle higher chroma resolution).
however, for the majority of blocks, the contrast is a bit lower, and
cheaper blocks can be used (fewer bits and less resolution).
for most blocks, the YUV<->RGB conversion is per-block, not per-pixel.
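the YUV->RGB step itself, assuming the full-range JPEG-style YCbCr
mentioned earlier, can be sketched in 16.16 fixed point (coefficients
rounded by me; arithmetic right-shift of negatives assumed, as on
typical compilers):

```c
#include <assert.h>
#include <stdint.h>

static uint8_t clamp255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* full-range JPEG-style YCbCr -> RGB, 16.16 fixed point:
   R = Y + 1.402*(Cr-128)
   G = Y - 0.344*(Cb-128) - 0.714*(Cr-128)
   B = Y + 1.772*(Cb-128) */
static void ycbcr_to_rgb(int y, int cb, int cr,
                         uint8_t *r, uint8_t *g, uint8_t *b)
{
    int u = cb - 128, v = cr - 128;
    *r = clamp255(y + (( 91881 * v) >> 16));
    *g = clamp255(y - (( 22554 * u + 46802 * v) >> 16));
    *b = clamp255(y + ((116130 * u) >> 16));
}
```

doing this once per block rather than 16 times per block is a big part
of why the per-block conversion is the cheaper path.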
lower quality levels force the encoder to use more cheaper blocks, and
to use harsher quantization on the color-vector deltas.
there were a few block types which achieved YUV 420 and 444 via encoding
more color-points, but these aren't really used at this point as they
compress worse and are slower to encode/decode.
generally, the currently used blocks all use a single color-vector per
block; the higher-chroma blocks just interpolate the components
independently, as opposed to interpolating between a pair of points.
a few of the directly-coded 420 and 444 blocks could conceivably be used
for lossless encoding, but the current decoder is not capable of doing so.
>>
>> the YUY2 decode path is faster in an Mpix/sec sense (gets around 500
>> Mpix/sec), but lower in terms of raw bytes (YUY2 only uses 1/2 as many
>> bytes as 32-bit BGRx).
>>
>> though, a lot of this is mostly because the VQ decode process does a lot
>> less work per pixel vs a traditional transform-based codec, and also in
>> many cases, producing output pixels is a series of 32-bit integer stores.
>>
>>
>> another (sort of) trick is handling Adaptive Rice coding in a similar
>> manner to conventional static-Huffman coding, which allows both AdRice
>> and SMTF+AdRice to perform similarly to static Huffman, but AdRice is a
>> little cheaper overall as it has less overhead than Huffman.
>>
I am not sure if the current video codec could work out as-is if Huffman
coded. the original use-case it was designed for is mostly why Rice is
used.
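for reference, the bit-level logic of one adaptive Rice decode step
might look like this (a sketch; the adaptation rule is my guess at the
general idea, and the actual codec apparently handles it table-driven,
more like a static-Huffman decoder, rather than bit-at-a-time):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* minimal MSB-first bit reader */
typedef struct { const uint8_t *buf; size_t pos; } bitreader;

static int read_bit(bitreader *br)
{
    int b = (br->buf[br->pos >> 3] >> (7 - (br->pos & 7))) & 1;
    br->pos++;
    return b;
}

static unsigned read_bits(bitreader *br, int n)
{
    unsigned v = 0;
    while (n-- > 0) v = (v << 1) | read_bit(br);
    return v;
}

/* one adaptive Rice decode step: unary quotient, k-bit remainder,
   then nudge k based on the quotient length */
static unsigned adrice_decode(bitreader *br, int *k)
{
    unsigned q = 0;
    while (read_bit(br)) q++;             /* '1' bits, '0' terminator */
    unsigned r = read_bits(br, *k);
    unsigned val = (q << *k) | r;
    if (q == 0 && *k > 0) (*k)--;         /* symbol was small: shrink k */
    else if (q > 1 && *k < 24) (*k)++;    /* long quotient: grow k */
    return val;
}
```

since k only changes by at most 1 per symbol, the (symbol, length,
new-k) tuples can be precomputed per k into lookup tables, which is
roughly how it can be made to decode at static-Huffman-like speeds.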
I am also not sure Rice is currently the best option for block commands,
which make up a fair part of the bitstream at lower bitrates and seem
highly saturated.
it is possible low-bitrate operation could benefit from the use of an
optional secondary range-coder step (say, feeding Rice-coded bits
through a bitwise range coder), but I have reservations here.
luckily, at least AdRice should be a little easier to mesh with a
bitwise range-coder than Huffman was. TBD whether it is worth the effort
to implement and test this case.