
"Need for Speed - C++ versus Assembly Language"

277 views

Lynn McGuire

May 8, 2017, 1:34:17 PM
"Need for Speed - C++ versus Assembly Language"

https://www.codeproject.com/Articles/1182676/Need-for-Speed-Cplusplus-versus-Assembly-Language

Neat! I believe that ANY C++ compiler and linker duo can beat my hand-
written assembly language. It's been decades since I wrote any assembly
language.

Lynn

Vir Campestris

May 8, 2017, 4:34:18 PM
More to the point - I can write a big programme in C++ and make it work.

Then profile it, and see how I can tune it. Which is more likely to be
by algorithm than tweaking instructions.

I haven't written assembler beyond initial system start for _years_.
Though about 15 years ago I was astonished to find myself writing
machine code...

Andy

Richard

May 8, 2017, 5:31:23 PM
[Please do not mail me a copy of your followup]

Vir Campestris <vir.cam...@invalid.invalid> spake the secret code
<xbOdnZFhp-_fS43E...@brightview.co.uk> thusly:

>On 08/05/2017 18:33, Lynn McGuire wrote:
>> "Need for Speed - C++ versus Assembly Language"
>>
>>
>https://www.codeproject.com/Articles/1182676/Need-for-Speed-Cplusplus-versus-Assembly-Language
>>
>>
>> Neat ! I believe that ANY C++ compiler and linker duo can beat my hand
>> written assembly language. Its been decades since I wrote any assembly
>> language.
>>
>More to the point - I can write a big programme in C++ and make it work.

I like that C++ allows me to apply efficient abstractions at low
levels as well as high levels.

Modeling memory-mapped read-only registers with zero overhead compared
to assembly is no big deal for C++. It would be amazingly painful in
Java or C#.
--
"The Direct3D Graphics Pipeline" free book <http://tinyurl.com/d3d-pipeline>
The Terminals Wiki <http://terminals-wiki.org>
The Computer Graphics Museum <http://computergraphicsmuseum.org>
Legalize Adulthood! (my blog) <http://legalizeadulthood.wordpress.com>

jacobnavia

May 8, 2017, 6:25:57 PM
Le 08/05/2017 à 22:34, Vir Campestris a écrit :
> Then profile it, and see how I can tune it. Which is more likely to be
> by algorithm than tweaking instructions.

Then, for a given algorithm, you discover that by tweaking instructions
you can improve speed by 50%, 100%, or maybe more.

The compiler must generate correct code for ALL possible C++
programs.

The assembler programmer is generating instructions in his/her mind for
THIS program only.

This scope reduction makes the human much more flexible than any
compiler, since he/she KNOWS what he/she is doing in the global context
of the algorithm's implementation.

I am an assembly language programmer, and I distribute a compiler system
based on the C language; the compiler generates assembly.

So, I like (and I have done it very often) to outsmart ANY compiler in
just a few lines of ASM. It is fun.

Contrary to what you could believe, I am not saying everyone should
program in ASM. Asm is not for everyone.

The given example is not really a fast asm program. As in any language,
you can write fast or slow programs in asm. The examples use too many
shufps instructions, for instance. Can't that be arranged otherwise in
the pipeline?

No preloading of the data is done using the advanced prefetch
instructions of the x86. Why?

Look, asm is not for everyone.

Just for the people that like it.

:-)

jacob




jacobnavia

May 8, 2017, 6:29:14 PM
Le 08/05/2017 à 23:31, Richard a écrit :
> Modeling memory-mapped read-only registers with zero overhead compared
> to assembly is no big deal for C++. It would be amazingly painful in
> Java or C#.

Maybe you can explain what you mean by

"memory-mapped read-only registers"

?

In any case, if those registers have zero overhead they can't exist at
run time. All operations in the circuit take a certain time (also
known as a "cycle"). Zero overhead is something that doesn't exist.


jacobnavia

May 8, 2017, 6:33:24 PM
Le 08/05/2017 à 19:33, Lynn McGuire a écrit :
> Neat ! I believe that ANY C++ compiler and linker duo can beat my hand
> written assembly language. Its been decades since I wrote any assembly
> language.

It is funny that you do not even understand the irony of that sentence.

Your hand-written asm is bad because you haven't a clue: it has been
decades since you wrote any assembly, and everything has changed in the
last decade.

So you shouldn't write any asm; stick to your C++. As for generalizing
from your own personal incompetence in asm to some general principle of
programming languages... I have some doubts about its scope.

:-)

jacobnavia

May 8, 2017, 6:41:52 PM
Le 08/05/2017 à 19:33, Lynn McGuire a écrit :
In that example, a slow asm program is compared to the compiler output.
Yes, the compiler wins. You can write fast asm programs, but you can
also write ones so slow that they are slower than compiler output.

I am an assembly language programmer, and I distribute a compiler system
based on the C language; the compiler generates assembly.

So, I like (and I have done it very often) to outsmart ANY compiler in
just a few lines of ASM. It is fun.

Contrary to what you could believe, I am not saying everyone should
program in ASM. Asm is not for everyone.

The examples use too many shufps instructions, for instance. Can't that
be arranged otherwise in the pipeline?

Yes, a pipeline. Most machines now are deeply pipelined, and that is one
of the essential points. Compilers schedule for the pipeline now, so it
is no wonder that a program that uses only half of the register file and
is not pipelined is slower. Big deal.

No preloading of the data is done using the advanced prefetch
instructions of the x86. Why?

Look, asm is not for everyone.

Just for the people that like it.

It has been decades since you programmed anything in asm. So go on
using your C++ compiler and do not speak about things you know nothing
about.

jacob




Cholo Lennon

May 8, 2017, 11:10:41 PM
On 05/08/2017 07:41 PM, jacobnavia wrote:
> Le 08/05/2017 à 19:33, Lynn McGuire a écrit :
>> "Need for Speed - C++ versus Assembly Language"
>>
>> https://www.codeproject.com/Articles/1182676/Need-for-Speed-Cplusplus-versus-Assembly-Language
>>
>>
>>
>> Neat ! I believe that ANY C++ compiler and linker duo can beat my hand
>> written assembly language. Its been decades since I wrote any assembly
>> language.
>>
>> Lynn
>
> In that example, a slow asm program is compared to the compiler output.
> Yes, the compiler wins. You can write fast asm programs, but you can
> write also so slow ones, that those programs are slower than a compiler.

That happens a lot with C++. Everyone who wants to compare her/his
language of choice feels the need to do it against C++, but usually
the comparisons are not fair, due to personal bias and because she/he is
really not proficient in C++. The result: horrible C++ code vs.
optimized/beautiful "foo" code.


--
Cholo Lennon
Bs.As.
ARG

David Brown

May 9, 2017, 7:29:12 AM
On 09/05/17 00:29, jacobnavia wrote:
> Le 08/05/2017 à 23:31, Richard a écrit :
>> Modeling memory-mapped read-only registers with zero overhead compared
>> to assembly is no big deal for C++. It would be amazingly painful in
>> Java or C#.
>
> Maybe you can explain what you mean with
>
> "memory-mapped read-only registers"

On a lot of systems, hardware registers are accessed as though they were
memory at specific fixed locations. (On x86, they often use a different
memory space - specific I/O operations rather than memory load/store
operations. But on many other processors it all just looks like memory
to the cpu.) Sometimes these registers are read/write, sometimes they
are read-only, occasionally they are write-only, and sometimes the very
act of reading them or writing them triggers other events. For example,
on microcontrollers it is common for a serial port (UART) to have a
"data register". Writing a value to that triggers sending the character
out on the serial port, reading from it takes the next character out of
a receive FIFO. Such memory accesses must therefore be done very
carefully - using "volatile" in C or C++.

I have no experience with Java or C#, but I would think the virtual
machine layer between the source code and the actual operations would
make it very difficult to get this to work as the programmer expects.

Scott Lurndal

May 9, 2017, 9:12:22 AM
David Brown <david...@hesbynett.no> writes:
>On 09/05/17 00:29, jacobnavia wrote:
>> Le 08/05/2017 à 23:31, Richard a écrit :
>>> Modeling memory-mapped read-only registers with zero overhead compared
>>> to assembly is no big deal for C++. It would be amazingly painful in
>>> Java or C#.
>>
>> Maybe you can explain what you mean with
>>
>> "memory-mapped read-only registers"
>
>On a lot of systems, hardware registers are accessed as though they were
>memory at specific fixed locations. (On x86, they often use a different
>memory space - specific I/O operations rather than memory load/store
>operations.

X86 has three distinct "address spaces" for I/O:

1) A 64 KB I/O space accessed by the IN and OUT
instructions. This space is used by various legacy
(ISA bus) devices such as the keyboard controller and
serial ports typically provided on the Platform Controller
Hub (PCH) (southbridge) chip. For example, the BIOS/UEFI
firmware writes an 8-bit value to port 0x80 that indicates
which part of the BIOS is currently running for debugging
purposes.
2) One or more regions carved out of the address space
that map to the PCI address space for any plug-in cards
via Base Address Registers (BARs) in the PCI configuration space.
3) The PCI configuration space (accessed indirectly via addresses
0xCF8 and 0xCFC in the I/O space or directly via PCI-Express
extended configuration access method (ECAM) which is mapped
into the physical address space).

As David points out, these registers often have side effects upon
access (for example, some bits in a register may have R/W1C semantics
where a write of a bit will cause that bit to be reset in the register;
typically used with interrupt status registers). This precludes
any speculative operations to these registers by the CPU (which is why
they're mapped into non-cacheable memory types by the host kernel using
the x86/amd MTRR registers)

> Such memory accesses must therefore be done very
>carefully - using "volatile" in C or C++.

volatile only affects compiler optimizations, not the hardware, so
volatile may not be sufficient to ensure correct ordering on
an OOO core with a weaker memory model than x86 (e.g. ARM64), in
which case additional precautions (barriers such as DSB) must be
taken by software to ensure the external observer (the device)
sees memory accesses in the correct order.

Even x86 requires memory barrier instructions in certain multithreaded
situations (see, for example, the network stack in linux where skb
handling requires fences for certain operations, DAMHIKT).
>
>I have no experience with Java or C#, but I would think the virtual
>machine layer between the source code and the actual operations would
>make it very difficult to get this to work as the programmer expects.

Technically, one can use JNI with java to access hardware - but that
is more like bolting a drone motor on a F-18 and expecting it to
perform well.

Scott Lurndal

May 9, 2017, 9:14:14 AM
Cholo Lennon <cholo...@hotmail.com> writes:

>
>That's happens a lot with C++. Everyone who wants to compare her/his
>language of choice has the necessity to do it against C++, but usually
>the comparisons are not fair due to person bias and because she/he is
>really not proficient in C++. Result: Horrible C++ code vs
>optimized/beautiful "foo" code

I challenge you to take any reasonable COBOL program and try to make it
"optimized/beautiful C++" code. Good luck with that.

Cholo Lennon

May 9, 2017, 9:56:49 AM
You are being too literal... what I wanted to say is that people usually
put effort into tuning their code (in their favourite language), but not
into the C++, which ends up awful, or is just a transliteration of the
code in the other language (ignoring the way C++ is used). This, of
course, results in an unfair comparison.

David Brown

May 9, 2017, 10:19:54 AM
Yes, indeed. You usually also need help from the MMU to mark the area
as non-cacheable, and perhaps other details to avoid store buffers,
write combining, speculative reads, etc. I don't do this kind of thing
on x86 so I don't know the details here, but it was "fun" getting it all
correct on a dual core PowerPC microcontroller.

bitrex

May 9, 2017, 11:24:57 AM
On 05/08/2017 06:25 PM, jacobnavia wrote:

> This scope reduction makes the human much more flexible than any
> compiler since he/she KNOWS what is he/she doing in the global context
> of algorithm implementation.
>
> I am an assembly language programmer and I distribute a compiler system
> based on the C language, The compiler generates assembly.
>
> So, I like (and I have done it very often) to outsmart ANY compiler in
> just a few lines of ASM. It is fun.

It's probably true on x86, but I never really knew much x86 assembly.
However, back when I was pretty good at 8-bit AVR assembly and also
wrote code in C, I found avr-gcc extremely difficult to beat, at least
when it came to relatively short procedural functions. I knew the best
way to do that sequence of, say, 20 operations. It knew, too. The ISA
for those RISC microcontrollers is pretty brief and there are only so
many ways to do things, I guess, and the manufacturer says its ISA was
optimized to help C compilers output efficient assembly anyway.

The only thing it sometimes seemed pathologically reluctant to do was
convert switch statements/if-then-else structures into indirect jump
tables when that would've made sense from both an execution-speed and
code-space perspective.






bitrex

May 9, 2017, 11:33:58 AM
On 05/08/2017 05:31 PM, Richard wrote:

>
> Modeling memory-mapped read-only registers with zero overhead compared
> to assembly is no big deal for C++. It would be amazingly painful in
> Java or C#.
>

Basically impossible in Java. If you need to access
architecture-specific machine registers at all Java is definitely the
wrong language. Unless you mean "model" in the sense of writing an
emulator or something...

sp...@potato.field

May 9, 2017, 11:37:44 AM
On Tue, 9 May 2017 11:24:42 -0400
bitrex <bit...@de.lete.earthlink.net> wrote:
>On 05/08/2017 06:25 PM, jacobnavia wrote:
>
>> This scope reduction makes the human much more flexible than any
>> compiler since he/she KNOWS what is he/she doing in the global context
>> of algorithm implementation.
>>
>> I am an assembly language programmer and I distribute a compiler system
>> based on the C language, The compiler generates assembly.
>>
>> So, I like (and I have done it very often) to outsmart ANY compiler in
>> just a few lines of ASM. It is fun.
>
>It's probably true on x86, but I never really knew much x86 assembly.

The optimisation techniques used in modern compilers have been developed
by teams of dozens or even hundreds of people over the years. Anyone who
thinks they can outsmart a modern compiler at assembly optimisation for
x86 in all but a few edge cases is deluding themselves. The instruction
set is now so large and the pipelines so complex that it's next to
impossible for most people to really get a good idea of which
instructions to use and how to sequence them to get the best result.

--
Spud

bitrex

May 9, 2017, 11:38:02 AM
On 05/09/2017 09:12 AM, Scott Lurndal wrote:

>
> Technically, one can use JNI with java to access hardware - but that
> is more like bolting a drone motor on a F-18 and expecting it to
> perform well.
>

Sort of defeats the point of the whole "walled garden" virtual machine
paradigm of the language, yeah?

If your application requires that you have to access machine-specific
hardware via some method other than the APIs provided then I'd say
you're SOL and shouldn't be using that language.

jacobnavia

May 9, 2017, 11:48:52 AM
I have optimized my 128-bit code in asm, and it is about 100% to 200%
faster than gcc's output.

I said (and you cite my words so you must have read it...) that the
scope reduction possible for a human but not for a compiler is the
crucial point that makes the difference.

A compiler must respect calling conventions, for instance. An asm
programmer need not.

bitrex

May 9, 2017, 11:53:38 AM
I do know it very much seems that way on 8 bit, at least. ;-)

The few times I have found myself resorting to inline assembly recently,
it's because in little embedded applications it sometimes makes sense to
enforce certain things that a general-purpose C/C++ compiler like gcc
wouldn't know about, as you're essentially dealing with programmable
hardware that interfaces with other hardware, not a general-purpose
computer.

For example, GCC pretty much ignores the "register" keyword, but with a
little hacking in asm you can enforce that yes, this 4-byte global
variable should remain in this set of registers forever and ever. As you
have dozens and dozens of GPRs available, occupying a couple of them
permanently doesn't affect the efficiency of the rest of the code at
all; and as the architecture is very simple, you know it's always going
to be more expensive to read and write this particular variable (which
you need to update every interrupt cycle) in and out of SRAM than to
just leave it in place.

Bonita Montero

May 9, 2017, 12:04:04 PM
> I have optimized my 128 bit code in asm and it is about 100% to 200%
> faster than gcc.

I think you're thinking of SSE code. You can use SSE intrinsics
in C++ and get the same or even better performance from the compiler.

bitrex

May 9, 2017, 12:08:45 PM
On 05/09/2017 11:48 AM, jacobnavia wrote:

> I have optimized my 128 bit code in asm and it is about 100% to 200%
> faster than gcc.
>
> I said (and you cite my words so you must have read it...) that the
> scope reduction possible for a human but not for a compiler is the
> crucial point that makes the difference.

It seems like a tautological statement that's true on its face, though.
To take an absurd example: if you know your architecture, your register
size, and your instruction set, you're intimately familiar with all
those aspects, and your data set is small enough, then you could
certainly write an asm program that never once struck out to main memory
to do any of its real work. You could write an asm program to write
pseudorandom 8-bit characters to the display buffer, bare-metal, faster
than any std::string-based implementation ever could.

Okay. So what.

Scott Lurndal

May 9, 2017, 12:52:02 PM
Or just use a good auto-vectorizing compiler.

https://en.wikipedia.org/wiki/Automatic_vectorization

jacobnavia

May 9, 2017, 1:03:15 PM
Sure but then... you are programming in asm dear!

Bonita Montero

May 9, 2017, 1:11:29 PM
>> I think you're thinking about SSE-code. You can use SSE-intrinsics
>> in C++ and have the same or even better performance of a compiler.

> Sure but then... you are programming in asm dear!

That's like saying you're programming in asm when you write "a = b + c",
because you can imagine the resulting code.

Bonita Montero

May 9, 2017, 1:16:25 PM
>> I think you're thinking about SSE-code. You can use SSE-intrinsics
>> in C++ and have the same or even better performance of a compiler.

> Or just use a good auto-vectorizing compiler.
> https://en.wikipedia.org/wiki/Automatic_vectorization

Automatic vectorization works only on code patterns close to those the
compiler recognizes. Slight differences often prevent
auto-vectorization.
And consider the main loop of the STREAM Triad benchmark:

void tuned_STREAM_Triad(STREAM_TYPE scalar)
{
    ssize_t j;

    for (j = 0; j < STREAM_ARRAY_SIZE; j++)
        a[j] = b[j] + scalar * c[j];
}

This loop can easily be detected by a compiler to be vectorizable.
But the compiler doesn't know whether the arrays are aligned properly
for the SSE/AVX data types, so it can't use the aligned loads/stores.

Scott Lurndal

May 9, 2017, 1:31:02 PM
Except STREAM is benchmarking the memory system (specifically
bandwidth), and benchmarks are generally written in such a way as to
prevent overly aggressive compiler optimizations.

Bonita Montero

May 9, 2017, 2:29:32 PM
> ..., and benchmarks are generally written in such a way
> as to prevent overly agressive compiler optimizations.

Compiler optimizations are arranged to cope with common benchmarks as
well as possible.

Chris Vine

May 9, 2017, 2:34:14 PM
Don't be so fucking sexist. Apart from which, you have missed the
point.

jacobnavia

May 9, 2017, 3:16:30 PM
Le 09/05/2017 à 20:34, Chris Vine a écrit :
> Don't be so fucking sexist.

Sexist, because I say that using intrinsics is using assembly language?

Gareth Owen

May 9, 2017, 3:34:35 PM
Don't be obtuse.

Chris Vine

May 9, 2017, 3:36:32 PM
Don't snip your post (it's in the record) which was:

"Sure but then... you are programming in asm dear!"

It doesn't sound any better on the repeat.

Robert Wessel

May 9, 2017, 3:40:35 PM
On Tue, 9 May 2017 21:16:22 +0200, jacobnavia <ja...@jacob.remcomp.fr>
wrote:
No, the use of "dear" to refer to a woman in this context. Consider,
for example, the last paragraph of the US Department of the Interior's
sexual harassment policy:

https://www.doi.gov/pmb/eeo/Sexual-Harassment

It's a belittlement of the person you're talking to. But unlike
saying "(you're wrong), you idiot" (which is merely a bit rude),
"(you're wrong), dear", is interpreted as "your failing is that you're
a woman".

I know English is not your first language, but you write it well
enough that people wouldn't guess that. And my understanding is that
a similar construct in French does not carry the same negative
connotation.

Chris Vine

May 9, 2017, 3:54:07 PM
On Tue, 09 May 2017 14:40:40 -0500
Robert Wessel <robert...@yahoo.com> wrote:
[snip]
> I know English is not your first language, but you write it well
> enough that people wouldn't guess that. And my understanding is that
> a similar construct in French does not carry the same negative
> connotation.

I think you are being much, much too generous. Condescension is
condescension in any language, including French. How would you
translate "dear"? "Cherie" in this context is equally unacceptable.
For interest's sake, what's your acceptable alternative in French?

When faced with a neanderthal, best to face it and not pretend that it
is normal.

Gareth Owen

May 9, 2017, 4:03:27 PM
Amen.

jacobnavia

May 9, 2017, 4:05:26 PM
Le 09/05/2017 à 21:53, Chris Vine a écrit :
> When faced with a neanderthal, best to face it and not pretend that it
> is normal.

This is completely stupid.

1) I did not even know it was a woman I was writing to. I just saw
vaguely "Montero" when clicking "Reply".

2) I did not mean any condescending tone. I thought "dear" was just a
way of being familiar.

In any case, if I offended someone, I offer my apologies here to Mrs Montero.

jacobnavia

May 9, 2017, 4:08:30 PM
Le 09/05/2017 à 22:03, Gareth Owen a écrit :
>> When faced with a neanderthal, best to face it and not pretend that it
>> is normal.
> Amen.

You and Chris Vine have been bullying me for years. Insult after
insult; I am used to your stuff.

No one can insult me unless I give some importance to their words.

Chris Vine

May 9, 2017, 4:15:21 PM
That is not an apology. You would not use the word "dear" to a man,
and if you claim that I am wrong, I do not believe you. This is
transparent quibbling on your part: I have never seen you address a man
as "dear" here before.

If you do want to make an apology, the correct response is: "I realise I
was being condescending and sexist, and I am very sorry. I will try
and do better. In the meantime, please accept my sincere apology".

I don't think you have that within you.

Robert Wessel

May 9, 2017, 4:21:22 PM
My inadequate French notwithstanding...

Chéri would be odd in that sort of context (it's really only
appropriate between intimates), ma chère would be a better equivalent
for Jacob's usage (I'm obviously assuming good intentions here). And
it appears to connote more respect than affection.

Now *I* think it still sounds sexist, but a bunch of French speakers I
know seem much more relaxed about it.

jacobnavia

May 9, 2017, 4:26:17 PM
Le 09/05/2017 à 22:21, Robert Wessel a écrit :

> My inadequate French notwithstanding...
>
> Chéri would be odd in that sort of context (it's really only
> appropriate between intimates), ma chère would be a better equivalent
> for Jacob's usage (I'm obviously assuming good intentions here). And
> it appears to connote more respect than affection.
>
> Now *I* think it still sounds sexist, but a bunch of French speakers I
> know seem much more relaxed about it.
>

It wasn't at all my intention to be sexist. Maybe condescending, since
it is obvious (to me at least) that when you use assembler intrinsics
you are no longer programming in C++.

Now, Chris Vine and Gareth Owen will seize any opportunity to insult me;
that has been well known for years.

Let's stop this polemic. Now we are no longer discussing asm or C++ but
whether I am sexist or not. Great.

Chris Vine

May 9, 2017, 4:34:54 PM
On Tue, 09 May 2017 15:21:26 -0500
Robert Wessel <robert...@yahoo.com> wrote:
> On Tue, 9 May 2017 20:53:52 +0100, Chris Vine
> <chris@cvine--nospam--.freeserve.co.uk> wrote:
>
> >On Tue, 09 May 2017 14:40:40 -0500
> >Robert Wessel <robert...@yahoo.com> wrote:
> >[snip]
> >> I know English is not your first language, but you write it well
> >> enough that people wouldn't guess that. And my understanding is
> >> that a similar construct in French does not carry the same negative
> >> connotation.
> >
> >I think you are being much, much too generous. Condescension is
> >condescension in any language, including French. How would you
> >translate "dear"? "Cherie" in this context is equally unacceptable.
> >For interest's sake, what's your acceptable alternative in French?
> >
> >When faced with a neanderthal, best to face it and not pretend that
> >it is normal.
>
>
> My inadequate French notwithstanding...
>
> Chéri would be odd in that sort of context (it's really only
> appropriate between intimates), ma chère would be a better equivalent
> for Jacob's usage (I'm obviously assuming good intentions here). And
> it appears to connote more respect than affection.
>
> Now *I* think it still sounds sexist, but a bunch of French speakers I
> know seem much more relaxed about it.

It is impressive that you have a number of native French speakers at
your fingertips to get a view in such rapid time.

My step mother-in-law and a niece and nephew of mine are French so I
will have to ask them when I next see them. My French gets me about
without difficulty but is not totally fluent (I don't spend quite
enough time there for that). In the meantime I will keep my
significant reservations about your conclusions until I do so.

Robert Wessel

May 9, 2017, 4:54:42 PM
On Tue, 9 May 2017 21:34:36 +0100, Chris Vine
I never said I polled them in response to your message. I've heard
the usage over time, without indication that there are problematic
connotations. And basically never in the English sense of "don't
worry your pretty little head over it, dear". Could I be wrong?
Especially given my limited knowledge? Sure. Which is why I
qualified both my statements.

Ian Collins

May 9, 2017, 5:06:50 PM
On 05/ 9/17 10:25 AM, jacobnavia wrote:
> Le 08/05/2017 à 22:34, Vir Campestris a écrit :
>> Then profile it, and see how I can tune it. Which is more likely to be
>> by algorithm than tweaking instructions.
>
> Then, for a given algorithm, you discover that tweaking instructions you
> can improve speed by 50, 100% or maybe more.
>
> The compiler must generate a program correct for for ALL possible C++
> programs.
>
> The assembler programmer is generating instructions in his/her mind for
> THIS program only.
>
> This scope reduction makes the human much more flexible than any
> compiler since he/she KNOWS what is he/she doing in the global context
> of algorithm implementation.

I agree with what you say, but I think you should qualify it by
saying that the hand-rolled code will be faster on the processor (model,
not family) it was written for, but might not be faster on next year's
model.

I have seen quite a lot (too much!) of hand-rolled code (both ASM and C
or C++) that was a good idea at the time it was written, but is a
hindrance to both performance and maintainability now.

--
Ian

woodb...@gmail.com

May 9, 2017, 6:42:11 PM
On Tuesday, May 9, 2017 at 1:34:14 PM UTC-5, Chris Vine wrote:


Please don't swear here.


Brian


Chris Vine

May 9, 2017, 7:26:25 PM
Brian,

I thought you might post this. I was kind of waiting for it.

You would do much better for your crack-pot approach to religion if you
were to address yourself to the original issue. Sadly, you miss the
point again. You see no problem with condescending, offensive sexism:
you are entranced by the word "fucking", which I applied only after
careful consideration, and in this case appropriately.

Sigh.

Gareth Owen

May 10, 2017, 2:08:27 AM
jacobnavia <ja...@jacob.remcomp.fr> writes:

> You, and Chris Vine are always bulling me since years. Insults after
> insults, I am used to your stuff.

You take anyone who disagrees with you as a personal insult.
Get over yourself.

> No one can insult me, unless I give some importance to their words.

Right back at you fella.

jacobnavia

May 10, 2017, 3:15:29 AM
Le 09/05/2017 à 23:06, Ian Collins a écrit :
> I agree with what you say, but I think you should to qualify it by
> saying that the hand rolled code will be faster on the processor (model,
> not family) it was written for, but might not be faster on next year's
> model.

That can be the case. It suffices to say that Intel is a real example
here, with shifts becoming more expensive than multiplies in some
models, for instance.

But this applies to compiled code also.

What I am speaking about is this

int i;

for (i = 0; i < 32; i++) {
    if (data & (1 << i))
        break;
}

This searches for the rightmost bit set in "data".

The WHOLE loop can be replaced by a single instruction (either bsf or
bsr, I do not remember right now).

The point is, a human UNDERSTANDS what the machine is doing, and can
optimize things that no compiler is now able to recognize.

David Brown

unread,
May 10, 2017, 5:26:28 AM5/10/17
to
On 09/05/17 17:53, bitrex wrote:
> On 05/09/2017 11:37 AM, sp...@potato.field wrote:
>> On Tue, 9 May 2017 11:24:42 -0400
>> bitrex <bit...@de.lete.earthlink.net> wrote:
>>> On 05/08/2017 06:25 PM, jacobnavia wrote:
>>>
>>>> This scope reduction makes the human much more flexible than any
>>>> compiler since he/she KNOWS what he/she is doing in the global context
>>>> of algorithm implementation.
>>>>
>>>> I am an assembly language programmer and I distribute a compiler system
>>>> based on the C language, The compiler generates assembly.
>>>>
>>>> So, I like (and I have done it very often) to outsmart ANY compiler in
>>>> just a few lines of ASM. It is fun.
>>>
>>> It's probably true on x86, but I never really knew much x86 assembly.
>>
>> The optimisation techniques used in modern compilers have been
>> developed by
>> teams of dozens or even hundreds over the years. Any one person who
>> thinks
>> they can outsmart a modern compiler in assembly optimisation for x86
>> in all but
>> a few edge cases is deluding themselves. The instruction set is now so
>> large
>> and the pipelines so complex that its next to impossible for most
>> people to
>> really get a good idea of which instructions to use and how to
>> sequence them
>> to get the best result.
>>
>
> I do know it very much seems that way on 8 bit, at least. ;-)
>
> The few times I have found myself resorting to inline assembly there
> recently is that in little embedded applications it sometimes makes
> sense to enforce certain things that a universal C/C++ compiler like gcc
> wouldn't know about, as you're essentially dealing with programmable
> hardware that interfaces with other hardware, not a general purpose
> computer.
>
> For example GCC pretty much ignores the "register" keyword, but with a
> little hacking in asm you can enforce that yes, this 4 byte global
> variable should remain in this set of registers forever and ever. As you
> have dozens and dozens of GPRs available occupying a couple of them
> permanently doesn't affect the efficiency of the rest of the code at
> all, and as the architecture is very simple you know that it's always
> going to be more expensive to be reading and writing this particular
> variable that you need to be updating every interrupt cycle in and out
> of SRAM then just leaving it in place.

There is a medium ground here, between C/C++ and assembler - the use of
compiler and target extensions. Someone mentioned that a C compiler
might not do as good a job as an assembler programmer on SIMD
vectorisation, because it does not know about the alignments - thus you
have compiler extensions such as gcc's __attribute__((aligned)) to give
the compiler that information.

In this case, you want a global variable to remain in processor
registers - you can do that with a gcc extension:

register uint8_t glob asm("r8");

(I can't remember the syntax for an asm register variable that uses
multiple registers.)

And while the AVR has quite a number of GPRs, they get used up quickly
because they are all 8 bit - reserving 4 for a global variable is likely
to affect code quality somewhat.


David Brown

unread,
May 10, 2017, 5:37:45 AM5/10/17
to
Human understanding of compiler manuals is also useful when you want
optimal code. For gcc, this is just "__builtin_ffs(data)". Many other
compilers will have similar extensions or intrinsics. You might need to
make a macro that is wrapped in conditional compilation depending on the
compiler (with your standard C code above as a fall-back), but it is
still much more portable than assembly - and will give more
opportunities for compiler optimisation.
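A minimal sketch of such a conditionally-compiled wrapper (the function name
and the choice of 32-bit input are my own; the fallback branch is just the
standard loop quoted above):

```cpp
#include <cstdint>

// find_first_set: 1-based index of the lowest set bit, 0 if no bit is
// set (i.e. POSIX ffs() semantics). Uses the gcc/clang intrinsic where
// available, otherwise falls back to the plain standard C++ loop.
static inline int find_first_set(std::uint32_t data) {
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_ffs(static_cast<int>(data));  // a single bsf/tzcnt on x86
#else
    for (int i = 0; i < 32; ++i) {
        if (data & (1u << i))
            return i + 1;
    }
    return 0;
#endif
}
```

Either branch gives the same answers, so the intrinsic is purely an
optimisation, not a semantic change.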


jacobnavia

unread,
May 10, 2017, 6:41:45 AM5/10/17
to
Yes, gcc is the best compiler in the universe, David, I know that.

Now, that was an EXAMPLE of course.

But this is moot. Do not use assembly, it is better that you stick to c++.

David Brown

unread,
May 10, 2017, 8:14:43 AM5/10/17
to
If you think that, that's fine. If you prefer to read what I wrote,
rather than making snide remarks, you will see that I gave that as an
example - because it is an example that is easily tested and verified,
and easy for you to look up the manual. I /could/ have picked
CodeWarrior 10.1 for the PowerPC as an example which has something
similar - but that would involve me looking up the details, and you
would not be able to check them. I am fairly confident that MSVC,
clang, Intel icc, and various other compilers have similar features -
which is why I wrote exactly that.

> Now, that was an EXAMPLE of course.
>

Of course it was. And I showed an example of how an understanding of
your tools can mean you might not need assembly for that kind of
purpose. There will be many other examples where you might at first
think you'd need to write hand-coded assembly for efficiency, but
compilers can generate as good or better code. There will be a few
examples where the hand-written assembly really is significantly better
than even the best compilers can generate, even with compiler-specific
extensions - but such examples are getting fewer and more obscure as
compilers get better.

> But this is moot. Do not use assembly, it is better that you stick to c++.
>

It almost always is better to stick to C or C++. It is very rare that
using assembly makes sense. There are few cases where there is a
significant speedup - and in many cases, it may look like the assembly
code is faster when in fact it is not. Making assembly code that is
faster on a wide variety of targets, rather than just one particular
model of cpu, is particularly hard. Making such code in a way that
interacts efficiently with surrounding code is also a problem - the
hand-written assembly may be faster in isolation, but in combination with
other code in C or C++, the total result is slower.

One of the few situations where assembly can be faster is precisely your
example - when the cpu supports an instruction that is difficult to
express in plain C, or difficult for a compiler to identify in plain C
(let's forget about builtins and intrinsics for the moment). In that
case, you might well want to make a static inline function that wraps
the assembly instruction. You want the assembly involved to be minimal.
So for example, I have these definitions for some ARM code:

static inline uint16_t swapEndian16(uint16_t x) {
    uint16_t y;
    asm ("rev16 %[y], %[x]" : [y] "=r" (y) : [x] "r" (x) : );
    return y;
}

static inline uint32_t swapEndian32(uint32_t x) {
    uint32_t y;
    asm ("rev %[y], %[x]" : [y] "=r" (y) : [x] "r" (x) : );
    return y;
}

(If it makes you feel better, pretend it is for the CodeWarrior ARM
compiler, not gcc - that compiler supports the same syntax for inline
assembly.)

These minimal assembly wrappers let me take advantage of the best
assembly instructions for the job, while allowing the compiler to
generate the rest of the code.


Scott Lurndal

unread,
May 10, 2017, 8:43:01 AM5/10/17
to
jacobnavia <ja...@jacob.remcomp.fr> writes:
>Le 09/05/2017 à 23:06, Ian Collins a écrit :
>> I agree with what you say, but I think you should qualify it by
>> saying that the hand rolled code will be faster on the processor (model,
>> not family) it was written for, but might not be faster on next year's
>> model.
>
>That can be the case. It suffices to say that Intel is a good
>example here, with shifts becoming more expensive than multiplies in
>some models, for instance.
>
>But this applies to compiled code also.
>
>What I am speaking about is this
>
>int i;
>
>for (i = 0; i < 32; i++) {
>    if (data & (1 << i))
>        break;
>}
>
Or the programmer can use a compiler intrinsic, such as
GCC's __builtin_ffsll (or __builtin_clz for leftmost bit).

e.g.

static inline int log2(uint64_t x) {
    int i = 0;
    //while (x >>= 1) { i++; }
    i = 63 - __builtin_clzll(x);   // note: __builtin_clzll(0) is undefined
    if (i < 0) i = 0;
    return i;
}

Bonita Montero

unread,
May 10, 2017, 8:54:02 AM5/10/17
to
> int i;
>
> for (i = 0; i < 32; i++) {
>     if (data & (1 << i))
>         break;
> }

There are intrinsics for this purpose.
And no, this is not assembly.

jacobnavia

unread,
May 10, 2017, 9:06:17 AM5/10/17
to
Le 10/05/2017 à 14:53, Bonita Montero a écrit :
>> int i;
>>
>> for (i = 0; i < 32; i++) {
>>     if (data & (1 << i))
>>         break;
>> }
>
> There are intrinsics for this purpose.

Not in all compilers, but anyway, this is an example of a long high
level construct that can be converted to a single instruction.

Byte swapping is also such an example, and there are many others.

> And no, this is not assembly.
>

In the case of an intrinsic, certainly, it is not assembler. It is a
non-portable construct geared to a single compiler.

jacobnavia

unread,
May 10, 2017, 9:10:14 AM5/10/17
to
Le 10/05/2017 à 14:53, Bonita Montero a écrit :
Yes, there are intrinsics in some compilers for this.

Many other examples are available:

o Carry handling in the four operations.
o Overflow testing
o Interrupts

etc.

David Brown

unread,
May 10, 2017, 9:29:56 AM5/10/17
to
On 10/05/17 15:06, jacobnavia wrote:
> Le 10/05/2017 à 14:53, Bonita Montero a écrit :
>>> int i;
>>>
>>> for (i = 0; i < 32; i++) {
>>>     if (data & (1 << i))
>>>         break;
>>> }
>>
>> There are intrinsics for this purpose.
>
> Not in all compilers, but anyway, this is an example of a long high
> level construct that can be converted to a single instruction.
>
> Byte swapping is also such an example, and there are many others.
>


uint32_t endianSwap1(uint32_t x) {
    return ((x & 0xff) << 24)
         | ((x & 0xff00) << 8)
         | ((x & 0xff0000) >> 8)
         | ((x & 0xff000000) >> 24);
}

uint32_t endianSwap2(uint32_t x) {
    return __builtin_bswap32(x);
}

gcc turns both of these into a "bswap" instruction. Maybe not all
compilers will do so, but it is certainly possible for a compiler to
recognise such patterns.

>> And no, this is not assembly.
>>
>
> In the case of an intrinsic certainly, it is not assembler. It is a non
> portable construct geared to a single compiler.
>

Yes, indeed. No one denies that to get optimal code for a target you
will sometimes need target-specific extensions that may not be portable
to other targets, or may not be portable to other compilers. But in
either case, they are still more portable than assembly - it's a
half-way option.


jacobnavia

unread,
May 10, 2017, 9:56:31 AM5/10/17
to
Le 10/05/2017 à 15:29, David Brown a écrit :
> they are still more portable than assembly

x86 assembly is fully portable to:

MAC OS X
Windows
Linux

That's almost 100% of the PC market.

Scott Lurndal

unread,
May 10, 2017, 10:30:22 AM5/10/17
to
jacobnavia <ja...@jacob.remcomp.fr> writes:
>Le 10/05/2017 à 15:29, David Brown a écrit :
>> they are still more portable than assembly
>
>x86 assembly is fully portable to:
>
>MAC OS X
>Windows
>Linux

Nonsense - linux runs on hundreds of processor types.

>
>That's almost 100% of the PC market.

Which is almost irrelevant now, as the PC market is less
than 10% of the overall computer market.

David Brown

unread,
May 10, 2017, 10:30:25 AM5/10/17
to
On 10/05/17 15:56, jacobnavia wrote:
> Le 10/05/2017 à 15:29, David Brown a écrit :
>> they are still more portable than assembly
>
> x86 assembly is fully portable to:
>
> MAC OS X
> Windows
> Linux

No it is not.

x86 assembly code is not directly portable to different assemblers.
Inline assembly in C is a little better - if you use gcc's format, it is
portable to gcc, icc, clang, and perhaps other compilers.

x86 code is either 32-bit or 64-bit, and not directly portable from one
to the other - even though much of it is the same, there are usually
still changes to be made.

x86 assembly code will work on a range of x86 chips if you stick to a
common subset - but if you are trying to write optimal code (and if you
are not, why are you bothering with assembly?) then you need to
fine-tune it for all sorts of different x86 devices. On one cpu, MMX
instructions might be faster - on another, SSE. On one device,
unrolling a loop might be faster but on a different device, instruction
prefetch buffers may mean the loop format is faster.


>
> That's almost 100% of the PC market.

That is a rapidly declining share of the computing world, and one in
which the small speed optimisations available with assembly is of
declining relevance. x86 assembly is useful for compiler writers,
low-level support libraries (clearly it is useful to /you/), low-level
systems code which C cannot handle (such as working with interrupts),
and occasional libraries where it is worth making an enormous effort for
tiny speed differences. For almost all people programming for x86
systems, if you are using assembly for anything except fun, you are
making a mistake.

On other platforms, especially the embedded world, there is more scope
for useful assembly - but only in a tiny fraction of code.



Bonita Montero

unread,
May 10, 2017, 11:14:21 AM5/10/17
to
>> There are intrinsics for this pupose.

> Not in all compilers, ...

In all relevant compilers, i.e. g++, clang, msvc++ and intel-c++.

> Byte swapping is also such an example, and there are many others.

The above compilers supply intrinsics covering everything the
C++ language itself doesn't.

> In the case of an intrinsic certainly, it is not assembler.
> It is a non portable construct geared to a single compiler.

And assembly is portable?

Bonita Montero

unread,
May 10, 2017, 11:37:13 AM5/10/17
to
> Many other examples are available:
> o Carry handling in the four operations.
> o Overflow testing
> o Interrupts

These aren't many examples, and they are rarely needed.

Bonita Montero

unread,
May 10, 2017, 12:04:31 PM5/10/17
to
> uint32_t endianSwap1(uint32_t x) {
>     return ((x & 0xff) << 24)
>          | ((x & 0xff00) << 8)
>          | ((x & 0xff0000) >> 8)
>          | ((x & 0xff000000) >> 24);
> }

MSVC does the same.

Richard

unread,
May 10, 2017, 12:23:43 PM5/10/17
to
[Please do not mail me a copy of your followup]

Ian Collins <ian-...@hotmail.com> spake the secret code
<enepb0...@mid.individual.net> thusly:

>I have seen quite a lot (too much!) of hand rolled code (both ASM and C or
>C++) that was a good idea at the time it was written, but is a hindrance to
>both performance and maintainability now.

A good example here is the ancient DOS fractal rendering program
FRACTINT, which I forked and modernized to Win32 as Iterated Dynamics.
<https://github.com/legalizeadulthood/iterated-dynamics>

I eliminated all the assembly code that was there to make it fast on a
286 (I am not kidding) and used the C equivalent code that was there
for the unix port. The whole thing got significantly faster, even
without any profiling. The assembly code used 16-bit instructions
which are on the "legacy compatibility" portion of a modern processor,
not the part that runs fast.

Granted, this is an extreme example as you're not likely to have 20+
year old assembly in your code base. Or maybe you do....
--
"The Direct3D Graphics Pipeline" free book <http://tinyurl.com/d3d-pipeline>
The Terminals Wiki <http://terminals-wiki.org>
The Computer Graphics Museum <http://computergraphicsmuseum.org>
Legalize Adulthood! (my blog) <http://legalizeadulthood.wordpress.com>

Richard

unread,
May 10, 2017, 12:26:20 PM5/10/17
to
[Please do not mail me a copy of your followup]

Chris Vine <chris@cvine--nospam--.freeserve.co.uk> spake the secret code
<20170510002...@bother.homenet> thusly:

>On Tue, 9 May 2017 15:41:48 -0700 (PDT)
>woodb...@gmail.com wrote:
>> On Tuesday, May 9, 2017 at 1:34:14 PM UTC-5, Chris Vine wrote:
>>
>> Please don't swear here.
>>
>> Brian
>
>Brian,
>
>I thought you might post this. I was kind of waiting for it.

Unfortunately this whole sexual harassment digression is exactly the
sort of off-topic soap box speech that Brian is giving. You're both
off-topic, lack self-control and presume moral superiority.

Scott Lurndal

unread,
May 10, 2017, 1:29:26 PM5/10/17
to
legaliz...@mail.xmission.com (Richard) writes:
>[Please do not mail me a copy of your followup]

>
>Granted, this is an extreme example as you're not likely to have 20+
>year old assembly in your code base. Or maybe you do....

Even older in a couple of cases (mainly the boot code in the hypervisor
which needs to start in real-mode, transition through protected mode,
turn on paging then switch to long-mode).

Alf P. Steinbach

unread,
May 10, 2017, 4:04:08 PM5/10/17
to
On 08-May-17 7:33 PM, Lynn McGuire wrote:
> "Need for Speed - C++ versus Assembly Language"
>
> https://www.codeproject.com/Articles/1182676/Need-for-Speed-Cplusplus-versus-Assembly-Language
>
>
> Neat ! I believe that ANY C++ compiler and linker duo can beat my hand
> written assembly language. Its been decades since I wrote any assembly
> language.

It's not an achievement to write worse assembly code than a compiler (I
haven't looked at the above, but that's your claim and my impression
from the discussion). So I don't really understand your posting.

But sometimes a little assembly is just what you need for speed.

My main and only example is the humble trampoline function, to use as a
C callback. It puts the address of a C++ object in the proper place
(hopefully a register) for the C++ implicit `this` argument, and jumps
– woohoo! we're jumping! – to a member function.

Yes there are umpteen ways to do this in pure C++, including picking up
that C++ object address from a global (maybe with thread local storage).
Whatever. But the trampoline is sort of maximally efficient.
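For comparison, here is a sketch of the pure-C++ variant alluded to above -
the object pointer travels through the callback's user-data parameter instead
of being baked into generated code. The callback signature and names here are
hypothetical, not from any particular library:

```cpp
// Hypothetical C-style callback signature with a user-data pointer.
extern "C" typedef void (*c_callback)(int event, void* user_data);

struct Handler {
    int count = 0;
    void on_event(int event) { count += event; }
};

// The "trampoline": a function with C linkage that recovers the object
// from user_data and forwards the call to the member function,
// supplying the implicit `this` argument.
extern "C" inline void handler_trampoline(int event, void* user_data) {
    static_cast<Handler*>(user_data)->on_event(event);
}

// Registration with a hypothetical C API would then look like:
//   register_cb(handler_trampoline, &my_handler);
```

The assembly trampoline Alf describes removes even the user-data indirection,
at the cost of generating the stub per object at run time.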

The good puppy over at Stack Overflow, whose secret real name I once
knew (I sort of collected the real names of many of the folks there),
once tried to propose a feature for the standard library, with support
for trampolines. As far as I know nothing came of it, except that he met
some committee members. It may be that without someone to champion the
proposal for him he had to use his real name; I don't know.


Cheers!,

- Alf

David Brown

unread,
May 10, 2017, 4:31:39 PM5/10/17
to
On 10/05/17 18:04, Bonita Montero wrote:
>
> MSVC does the same.
>

Please do not quote my posts without proper attribution. Doing so is a
massive breach of Usenet etiquette.

jacobnavia

unread,
May 10, 2017, 5:14:59 PM5/10/17
to
Yes. Asm is never *needed*, it is fun.

Most programs are written in high-level languages and automatically
translated - C++, for instance.

Prefetching, pipeline construction are difficult to do for a given C++
program. Since the language doesn't offer any way to do that, you rely
on automatic translation.

Or not.

I don't really understand why this emphasis on C++. Yes, this is a C++
group etc. But...

Many people think that constructing programs that use the bare
instructions of the processor gives you a real perspective of what is
going on there.

Note that I am not in any way trying to say that C++ should be replaced
by asm. Just that asm gives you insights.

jacob

David Brown

unread,
May 10, 2017, 5:48:24 PM5/10/17
to
On 10/05/17 23:14, jacobnavia wrote:
> Le 10/05/2017 à 17:36, Bonita Montero a écrit :
>>> Many other examples are available:
>>> o Carry handling in the four operations.
>>> o Overflow testing
>>> o Interrupts
>>
>> This aren't many examples and these are rarely needed.
>
> Yes. Asm is never *needed*, it is fun.

Sarcasm like this does not work well without additional clues that you
get in spoken conversation.

Most assembly that is written is /not/ needed. (Indeed, I believe that
most code written in C could be better written in other languages.) But
only most - for some purposes, assembly is the better or only choice.

>
> Most programs are high level programs automatically translated, for
> instance C++.
>
> Prefetching, pipeline construction are difficult to do for a given C++
> program. Since the language doesn't offer any way to do that, you rely
> on automatic translation.
>

And often the compiler can do a better job of it than an assembly
programmer can. Failing that, implementation extensions can help (like
__builtin_prefetch in gcc), and failing that, most compilers will let
you make small inline function that wraps a piece of inline assembly.
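As a sketch of that middle ground, a software-prefetch hint in otherwise
portable code (the look-ahead distance of 16 elements is an arbitrary
assumption that would need tuning per target):

```cpp
#include <cstddef>

// Sum an array, hinting the hardware prefetcher a short distance ahead.
// __builtin_prefetch is a pure hint on gcc/clang; on other compilers
// the guard simply drops it, leaving plain standard C++.
long sum_prefetched(const long* data, std::size_t n) {
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + 16 < n)
            __builtin_prefetch(&data[i + 16]);
#endif
        sum += data[i];
    }
    return sum;
}
```

Whether the hint helps at all depends on the cpu and the access pattern -
on a simple linear scan like this, the hardware prefetcher usually wins
anyway, which is rather the point being made above.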

> Or not.
>
> I don't really understand why this emphasis on C++. Yes, this is a C++
> group etc. But...
>
> Many people think that constructing programs that use the bare
> instructions of the processor gives you a real perspective of what is
> going on there.

Here I agree with you - I think doing some assembly programming is good
for a developer when they are learning to program - it gives them a much
better understanding of what goes on under the hood. But that is for
learning - /real/ work should be done mainly in other languages (not
necessarily C or C++). And for serious low-level or high efficiency C
or C++ development, it is good to be able to understand the generated
assembly code, even if you can't write it.

Gareth Owen

unread,
May 11, 2017, 2:27:46 AM5/11/17
to
David Brown <david...@hesbynett.no> writes:

> On 10/05/17 23:14, jacobnavia wrote:
>> Prefetching, pipeline construction are difficult to do for a given C++
>> program. Since the language doesn't offer any way to do that, you rely
>> on automatic translation.
>>
>
> And often the compiler can do a better job of it than an assembly
> programmer can. Failing that, implementation extensions can help
> (like __builtin_prefetch in gcc), and failing that, most compilers
> will let you make small inline function that wraps a piece of inline
> assembly.

The interesting thing here is that Jacob claims that his hand-rolled
assembler is almost always faster than translated code - and the
original article makes the opposite claim - that peephole optimisations
and knowledge of CPU scheduling issues make the compiler-generated ASM
so baroque that no human is likely to generate it.

But all the code being benchmarked is *right there*.

There's a way for Jacob to prove that the failings of the hand-written
ASM in the article is due to the skill of the programmer, not the
irreducible complexity of the CPU scheduling.

So come on Jacob - prove your point. Show us your ASM code, and how it
outperforms both the original author's ASM and the C++ generated code.

There's literally only one way to prove that humans can write faster ASM
code than compilers on modern processors, and until then, it's just
claims.

Gareth (who can run faster than Usain Bolt, but doesn't care to prove it)

David Brown

unread,
May 11, 2017, 3:29:09 AM5/11/17
to
On 11/05/17 08:27, Gareth Owen wrote:
> David Brown <david...@hesbynett.no> writes:
>
>> On 10/05/17 23:14, jacobnavia wrote:
>>> Prefetching, pipeline construction are difficult to do for a given C++
>>> program. Since the language doesn't offer any way to do that, you rely
>>> on automatic translation.
>>>
>>
>> And often the compiler can do a better job of it than an assembly
>> programmer can. Failing that, implementation extensions can help
>> (like __builtin_prefetch in gcc), and failing that, most compilers
>> will let you make small inline function that wraps a piece of inline
>> assembly.
>
> The interesting thing here is that Jacob claims that his hand-rolled
> assembler is almost always faster than translated code - and the

I don't believe he has claimed it is /always/ the case - merely that it
is sometimes the case. He gave some possible situations, many of which
can be handled by today's compilers.

> original article makes the opposite claim - that peephole optimisations
> and knowledge of CPU scheduling issues make the compiler-generated ASM
> so baroque that no human is likely to generate it.
>

It is difficult to judge the quality of code like this. Algorithmic
choices can make a huge difference, and it is hard to separate
differences in algorithm detail from differences in implementation
detail. It is all too easy to say the assembly code was badly written,
or the C++ code was inefficient (what numpty thought that storing the
matrices on the heap was a good plan?).

> But all the code being benchmarked is *right there*.
>
> There's a way for Jacob to prove that the failings of the hand-written
> ASM in the article is due to the skill of the programmer, not the
> irreducible complexity of the CPU scheduling.
>
> So come on Jacob - prove your point. Show us your ASM code, and how it
> outperforms both the original author's ASM and the C++ generated code.

That would be a very time-consuming task, with little to gain.

>
> There's literally only one way to prove that humans can write faster ASM
> code than compilers on modern processors, and until then, it's just
> claims.

I think it would be far more informative to consider a case where Jacob
thinks the C vs. assembly speed difference is relevant enough to make it
worth the effort writing in assembly - a case where he already /has/
written the code in assembly. So let us consider 128 bit integer
arithmetic. We define a large calculation, written in C, that uses 128
bit integers extensively. Jacob compiles it to run as an executable
with his compiler that uses his assembly library for the calculations.
The executable should be targeted to run on a modern 64-bit x86
processor, preferably both Windows and Linux versions.

One or more other people write the same calculations, using a 128 bit
integer type that is written in C or C++. We want at least three
variants - one that is pure, standard C (or C++) with only common
implementation assumptions stemming from the hardware. One that is free
to use any extensions available on that compiler /except/ built-in
128-bit types. And one that uses the compiler's built-in 128-bit types.

Then we compare speeds on several different computers.
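To make the variants concrete, here is a sketch of one primitive, 128-bit
addition, in the "pure standard C++" flavour and the "built-in type" flavour
(the struct layout and names are my own, not Jacob's library):

```cpp
#include <cstdint>

// Variant 1: pure standard C++ - a 128-bit value as two 64-bit halves.
struct u128 {
    std::uint64_t hi, lo;
};

static inline u128 add128(u128 a, u128 b) {
    u128 r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo ? 1 : 0);  // carry out of the low half
    return r;
}

#ifdef __SIZEOF_INT128__
// Variant 3: the compiler's built-in 128-bit type (gcc/clang extension) -
// the compiler is free to use add-with-carry instructions directly.
static inline unsigned __int128 add128_builtin(unsigned __int128 a,
                                               unsigned __int128 b) {
    return a + b;
}
#endif
```

The interesting question for the benchmark is whether the compiler turns the
`r.lo < a.lo` carry idiom into the same adc instruction the hand-written
assembly would use.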

jacobnavia

unread,
May 11, 2017, 4:35:34 AM5/11/17
to
Le 11/05/2017 à 08:27, Gareth Owen a écrit :
> There's literally only one way to prove that humans can write faster ASM
> code than compilers on modern processors, and until then, it's just
> claims

Exactly. Never program in asm, just stick to C++. Nothing is better than
gcc, and will never be.

Amen

Bonita Montero

unread,
May 11, 2017, 7:08:35 AM5/11/17
to
>> There's literally only one way to prove that humans can write faster
>> ASM code than compilers on modern processors, and until then, it's
>> just claims

> Exactly. Never program in asm, just stick to C++. Nothing is better
> than gcc, and will never be.

You are imputing to him the kind of stereotyped statements that are
characteristic of your own claims about assembly.

Scott Lurndal

unread,
May 11, 2017, 8:35:30 AM5/11/17
to
jacobnavia <ja...@jacob.remcomp.fr> writes:
>Le 11/05/2017 à 08:27, Gareth Owen a écrit :
>> There's literally only one way to prove that humans can write faster ASM
>> code than compilers on modern processors, and until then, it's just
>> claims
>
>Exactly. Never program in asm, just stick to C++.

This is a reasonable dictum for the vast majority of C++ programmers.

> Nothing is better than gcc, and will never be.

I've never heard this sentiment expressed as such. I think you must
be building a strawman to burn down.

Jerry Stuckle

unread,
May 11, 2017, 10:00:05 AM5/11/17
to
On 5/11/2017 2:27 AM, Gareth Owen wrote:
> David Brown <david...@hesbynett.no> writes:
>
>> On 10/05/17 23:14, jacobnavia wrote:
>>> Prefetching, pipeline construction are difficult to do for a given C++
>>> program. Since the language doesn't offer any way to do that, you rely
>>> on automatic translation.
>>>
>>
>> And often the compiler can do a better job of it than an assembly
>> programmer can. Failing that, implementation extensions can help
>> (like __builtin_prefetch in gcc), and failing that, most compilers
>> will let you make small inline function that wraps a piece of inline
>> assembly.
>
> The interesting thing here is that Jacob claims that his hand-rolled
> assembler is almost always faster than translated code - and the
> original article makes the opposite claim - that peephole optimisations
> and knowledge of CPU scheduling issues make the compiler-generated ASM
> so baroque that no human is likely to generate it.
>

Compilers are written by humans, and so is the code generated by the
compilers.

A good assembler programmer knows the language. An expert assembler
programmer knows the language *and the processor*. He/she can write the
same code the compiler generates, so his/her code is never slower.
However, the compiler still has constraints on it based on its design.
The programmer has no such constraints.

Additionally, the compiler generates code for a specific series of
processors - i.e. 32 or 64 bit. If the programmer knows the code will
only run on one specific processor, he/she can write code specific to
that processor.

With that said, there are very few assembler programmers with the
required level of expertise nowadays. But they are out there.

--
==================
Remove the "x" from my email address
Jerry Stuckle
jstu...@attglobal.net
==================

David Brown

unread,
May 11, 2017, 11:09:34 AM5/11/17
to
On 11/05/17 15:59, Jerry Stuckle wrote:
> On 5/11/2017 2:27 AM, Gareth Owen wrote:
>> David Brown <david...@hesbynett.no> writes:
>>
>>> On 10/05/17 23:14, jacobnavia wrote:
>>>> Prefetching, pipeline construction are difficult to do for a given C++
>>>> program. Since the language doesn't offer any way to do that, you rely
>>>> on automatic translation.
>>>>
>>>
>>> And often the compiler can do a better job of it than an assembly
>>> programmer can. Failing that, implementation extensions can help
>>> (like __builtin_prefetch in gcc), and failing that, most compilers
>>> will let you make small inline function that wraps a piece of inline
>>> assembly.
>>
>> The interesting thing here is that Jacob claims that his hand-rolled
>> assembler is almost always faster than translated code - and the
>> original article makes the opposite claim - that peephole optimisations
>> and knowledge of CPU scheduling issues make the compiler-generated ASM
>> so baroque that no human is likely to generate it.
>>
>
> Compilers are written by humans, and so is the code generated by the
> compilers.

Well, sort of. Compilers are written by humans, yes (though there can
be layers of indirection, such as yacc).

Humans write the code and macros that generate the object code sequences
- they don't write the actual assembly code. When writing the output
macro for translating "x * y", for example, the human-written code will
say something like "Make sure x is in a register - call it rA. Make
sure y is in a register - call it rB. Find a free register rD.
Generate a "mul rD, rA, rB" instruction. The result is in rD, and the
flag register is updated". the compiler will interlace this with other
instructions, depending on processor scheduling and pipelining. It may
lead to instructions to load a register from memory, it may not. Some
types of code lead to complex object code generation - a switch might
lead to a series of comparisons, a binary tree of comparisons, a jump
table, calculated jumps, or a mixture.

The compiler code generation is not just a copy-and-paste of sequences
of hand-written assembly with a few register renames. There was a time,
long ago, when that was the case (and it may still be the case in
simpler compilers), but not now.

The information about what instructions to use are, of course, given by
a human - as is information about timing, pipelines, scheduling, etc.,
that helps the compiler pick between alternative solutions and interlacing.


>
> A good assembler programmer knows the language. An expert assembler
> programmer knows the language *and the processor*.

Agreed.

> He/she can write the
> same code the compiler generates, so his/her code is never slower.

In theory, yes. In practice - no, except for very short sequences or
particular special cases.

> However, the compiler still has constraints on it based on its design.
> The programmer has no such constraints.

In theory, yes - in practice, no. The programmer has constraints -
there are limits to how well he/she can track large numbers of
registers, or processor pipelines, or multi-issue scheduling. Making a
small change in one part of the assembly can have knock-on effects in
the timings in other parts. A human programmer simply cannot keep track
of it all without an inordinate amount of time and effort.

So /sometimes/ an assembler programmer can do better, especially on
short sequences or particular cases that might map well to processor
instructions but poorly to C code. But often it is simply too much
effort - even an expert assembly programmer will not be willing to spend
the time on the tedious detail, and has a high chance of getting at
least something wrong.

In particular, if you insist on writing clear and maintainable code in
an appropriate timeframe, as most professional programmers aim for, then
it is very rare that even an expert assembler programmer will beat a
good compiler. It is perfectly possible to write clear and maintainable
assembly code - but it is rarely the fastest possible result on a modern
chip.

>
> Additionally, the compiler generates code for a specific series of
> processors - i.e. 32 or 64 bit. If the programmer knows the code will
> only run on one specific processor, he/she can write code specific to
> that processor.

Many compilers can generate code specifically for particular target
processors. (Sorry, Jacob, but I must use gcc as the example again - it
is the compiler I know best.) gcc has options to generate code that
will work on a range of cpus (within a family such as x86-32) but have
scheduling optimised for one particular target or subfamily. Or it can
generate code that /requires/ a particular subfamily feature. Or it can
generate a number of implementations for a given function, and pick the
best one at run-time based on the actual cpu that is being used.

Yes, all of that /can/ be done by an assembler programmer - but it is
unrealistic to think that it /would/ be done, except in extreme cases.

>
> With that said, there are very few assembler programmers with the
> required level of expertise nowadays. But they are out there.
>

Agreed. And they are mostly spending their time doing something
/useful/ with those skills, rather than trying to beat compiler output
by a fraction of a percent on one particular processor. For example,
they are involved in writing compilers :-)



Mr Flibble

unread,
May 11, 2017, 2:09:34 PM5/11/17
to
On 09/05/2017 20:40, Robert Wessel wrote:
> On Tue, 9 May 2017 21:16:22 +0200, jacobnavia <ja...@jacob.remcomp.fr>
> wrote:
>
>> Le 09/05/2017 à 20:34, Chris Vine a écrit :
>>> Don't be so fucking sexist.
>>
>> ????
>> Sexist because I tell that using intrinsics is using assembly language?
>>
>> ?????
>
>
> No, the use of "dear" to refer to a woman in this context. Consider,
> for example, the last paragraph of the US Department of the Interior's
> sexual harassment policy:
>
> https://www.doi.gov/pmb/eeo/Sexual-Harassment
>
> It's a belittlement of the person you're talking to. But unlike
> saying "(you're wrong), you idiot" (which is merely a bit rude),
> "(you're wrong), dear", is interpreted as "your failing is that you're
> a woman".
>
> I know English is not your first language, but you write it well
> enough that people wouldn't guess that. And my understanding is that
> a similar construct in French does not carry the same negative
> connotation.

Utter nonsense; I often refer to idiotic blokes (men) with the epithet
"dear" to express my condescension of their fucktardedness. *You* are
being sexist by claiming that "dear" just refers to women.

And English *is* my first language (grade 'A', English O-level).

/Flibble

Gareth Owen

unread,
May 11, 2017, 2:16:29 PM5/11/17
to
sc...@slp53.sl.home (Scott Lurndal) writes:

>> Nothing is better than gcc, and will never be.
>
> I've never heard this sentiment expressed as such. I think you must
> be building a strawman to burn down.

Yup. But given that the last time I dared suggest he may not be right
he accused me of bullying him, I'll gladly settle for this pathetic
strawman argument that everyone immediately spotted as such.

jacobnavia

unread,
May 11, 2017, 2:30:32 PM5/11/17
to
Correction. You wrote this:

Le 09/05/2017 à 22:03, Gareth Owen a écrit :
>> When faced with a neanderthal, best to face it and not pretend that it
>> is normal.
> Amen.

You said Amen. You weren't just saying that I may not be right, you just
tried to start a polemic for the nth time.

Now you propose that I spend a month full time developing an asm program
to prove that I can do better than gcc. Well, I have no interest in
doing that. Just do not believe me and let's close this.

I have no interest in proving anything about asm to you or anyone else.
Nobody is forcing you to do that, and you can live the rest of your life
never doing any asm as far as I am concerned.

Gareth Owen

unread,
May 11, 2017, 2:36:11 PM5/11/17
to
Jerry Stuckle <jstu...@attglobal.net> writes:

> Compilers are written by humans, and so is the code generated by the
> compilers.

My roomba was designed by humans, but that doesn't mean a human hoovered
the sitting room.

> A good assembler programmer knows the language. An expert assembler
> programmer knows the language *and the processor*. He/she can write the
> same code the compiler generates, so his/her code is never slower.

In theory, yes of course. In practice, compilers and programmers - yes,
even expert programmers - tend to produce code that looks very
different.

It's like saying that a human with a pen, paper and pocket calculator
can solve a large travelling salesman optimisation problem using the
same algorithm as a computer. Absolutely true, in theory.

But computers are really good at algorithmically solving combinatorial
optimisation problems quickly, and people aren't. Even the experts.
Sure, they can solve them, but they can't do it quickly enough to be
practical.

> However, the compiler still has constraints on it based on its design.
> The programmer has no such constraints.

The programmer does have constraints: time & patience.

jacobnavia

unread,
May 11, 2017, 2:42:07 PM5/11/17
to
Le 11/05/2017 à 20:09, Mr Flibble a écrit :
>
> Utter nonsense; I often refer to idiotic blokes (men) with the epithet
> "dear" to express my condescension of their fucktardedness. *You* are
> being sexist by claiming that "dear" just refers to women.
>
> And English *is* my first language (grade 'A', English O-level).
>
> /Flibble

Well, that is the meaning I had in mind. I stated afterwards that I
never meant anything that has to do with gender but yes, it was slightly
condescending. I thought she missed something obvious: when you program
in asm intrinsics with the compiler, you are doing asm and not C++.

That discussion has nothing to do with women or men or gender
discussions. Anyone can miss the obvious. That has nothing to do with
being male or female...



Gareth Owen

unread,
May 11, 2017, 2:44:34 PM5/11/17
to
jacobnavia <ja...@jacob.remcomp.fr> writes:

> Le 11/05/2017 à 20:16, Gareth Owen a écrit :
>> sc...@slp53.sl.home (Scott Lurndal) writes:
>>
>>>> Nothing is better than gcc, and will never be.
>>>
>>> I've never heard this sentiment expressed as such. I think you must
>>> be building a strawman to burn down.
>>
>> Yup. But given that the last time I dared suggest he may not be right
>> he accused me of bullying him, I'll gladly settle for this pathetic
>> strawman argument that everyone immediately spotted as such.
>>
>
> Correction. You wrote this:
>
> Le 09/05/2017 à 22:03, Gareth Owen a écrit :
>>> When faced with a neanderthal, best to face it and not pretend that it
>>> is normal.
>> Amen.
>
> You said Amen.

I did.

> You weren't just saying that I may not be right, you just tried to
> start a polemic for the nth time.

"Trying to start a polemic"? "Bullying"?

Anyway that wasn't what you described as "Bullying". You used that to
describe some previous (unstated) interaction between us.

On that I have literally no idea what you're talking about.

On this - you were being sexist & condescending, and you got called on it.

"Amen" here just means "I support Chris that sexism should be called
out"

> Now you propose that I spend a month full time developing an asm
> program to prove that I can do better than gcc.

It was MSVC++ actually. But never let the facts get in the way of a
good polemic.

> Just do not believe me and let's close this.

OK. I shall.

Gareth Owen

unread,
May 11, 2017, 2:46:02 PM5/11/17
to
Mr Flibble <flibbleREM...@i42.co.uk> writes:

> Utter nonsense; I often refer to idiotic blokes (men) with the epithet
> "dear" to express my condescension of their fucktardedness. *You* are
> being sexist by claiming that "dear" just refers to women.

You might.

But Jacob doesn't, or at least he very rarely does on Usenet.

Until now.

When it happened to be addressed to a woman, which he assures us was
absolutely just a really, really big coincidence.

Anyone want to buy a bridge?

jacobnavia

unread,
May 11, 2017, 3:06:03 PM5/11/17
to
Le 11/05/2017 à 20:45, Gareth Owen a écrit :
> Mr Flibble <flibbleREM...@i42.co.uk> writes:
>
>> Utter nonsense; I often refer to idiotic blokes (men) with the epithet
>> "dear" to express my condescension of their fucktardedness. *You* are
>> being sexist by claiming that "dear" just refers to women.
>
> You might.
>
> But Jacob doesn't, or at least he very rarely does on Usenet.
>

I am rarely condescending, but sometimes when I think that somebody
misses the obvious I can use this "dear" language construct.

I could understand it if you said that being condescending is bad, which
I would agree with to some extent.

What is bothering is the following conclusion:

> Until now.
>
> When it happened to be addressed to a woman, which he assures was
> absolutely just a really, really big coincidence.
>

Since women represent 50% of the planet's population I have a 50% chance
that when I am condescending, I am speaking with a woman.

And here is the trolling: accusing people you do not know, and about
whom you have no data whatsoever, of having hidden prejudices, etc. etc.

> Anyone want to buy a bridge?
>
No, just do it yourself. Stop trolling and discuss your views in a
normal, unemotional tone. That's all it takes.


Gareth Owen

unread,
May 11, 2017, 3:09:56 PM5/11/17
to
jacobnavia <ja...@jacob.remcomp.fr> writes:

> Discuss your views in a normal, unemotional tone. That's all it takes.

Iatre, therapeuson seauton ("Physician, heal thyself")

Jerry Stuckle

unread,
May 11, 2017, 3:45:25 PM5/11/17
to
They cause the assembly code to be generated. It may be done indirectly
as in your example (which is good for compilers targeting different
platforms), or it may generate machine code directly.

> The compiler code generation is not just a copy-and-paste of sequences
> of hand-written assembly with a few register renames. There was a time,
> long ago, when that was the case (and it may still be the case in
> simpler compilers), but not now.
>

I never said it was a copy-and-paste.

> The information about what instructions to use are, of course, given by
> a human - as is information about timing, pipelines, scheduling, etc.,
> that helps the compiler pick between alternative solutions and interlacing.
>

That is correct - and even the code that determines the results
generated by those rules is created by humans.

>
>>
>> A good assembler programmer knows the language. An expert assembler
>> programmer knows the language *and the processor*.
>
> Agreed.
>
>> He/she can write the
>> same code the compiler generates, so his/her code is never slower.
>
> In theory, yes. In practice - no, except for very short sequences or
> particular special cases.
>

Maybe in your case.

>> However, the compiler still has constraints on it based on its design.
>> The programmer has no such constraints.
>
> In theory, yes - in practice, no. The programmer has constraints -
> there are limits to how well he/she can track large numbers of
> registers, or processor pipelines, or multi-issue scheduling. Making a
> small change in one part of the assembly can have knock-on effects in
> the timings in other parts. A human programmer simply cannot keep track
> of it all without an inordinate amount of time and effort.
>

In practice, yes. There are limits to both registers and pipelines
which a good programmer can handle. The same with scheduling. But I
disagree that a small change in one part of the assembly will have any
noticeable effect on unrelated code.

> So /sometimes/ an assembler programmer can do better, especially on
> short sequences or particular cases that might map well to processor
> instructions but poorly to C code. But often it is simply too much
> effort - even an expert assembly programmer will not be willing to spend
> the time on the tedious detail, and has a high chance of getting at
> least something wrong.
>

In some cases, no amount of effort is "too much". I can think of many
real-time systems where even saving a microsecond is important to the
code. Data collection and analysis of large scientific projects such as
the VLA (large array of radio telescopes), the NSF's LIGO (gravitational
wave detector) and CERN's Large Hadron Collider are just three examples.

> In particular, if you insist on writing clear and maintainable code in
> an appropriate timeframe, as most professional programmers aim for, then
> it is very rare that even an expert assembler programmer will beat a
> good compiler. It is perfectly possible to write clear and maintainable
> assembly code - but it is rarely the fastest possible result on a modern
> chip.
>

It is perfectly possible to write clean and maintainable assembly code -
and make it the fastest possible result. It's done every day at each of
the above projects.

>>
>> Additionally, the compiler generates code for a specific series of
>> processors - i.e. 32 or 64 bit. If the programmer knows the code will
>> only run on one specific processor, he/she can write code specific to
>> that processor.
>
> Many compilers can generate code specifically for particular target
> processors. (Sorry, Jacob, but I must use gcc as the example again - it
> is the compiler I know best.) gcc has options to generate code that
> will work on a range of cpus (within a family such as x86-32) but have
> scheduling optimised for one particular target or subfamily. Or it can
> generate code that /requires/ a particular subfamily feature. Or it can
> generate a number of implementations for a given function, and pick the
> best one at run-time based on the actual cpu that is being used.
>
> Yes, all of that /can/ be done by an assembler programmer - but it is
> unrealistic to think that it /would/ be done, except in extreme cases.
>

As I said - there are many instances in the scientific world where that
is necessary. And there are programmers who do it.

>>
>> With that said, there are very few assembler programmers with the
>> required level of expertise nowadays. But they are out there.
>>
>
> Agreed. And they are mostly spending their time doing something
> /useful/ with those skills, rather than trying to beat compiler output
> by a fraction of a percent on one particular processor. For example,
> they are involved in writing compilers :-)
>
>
>

Yes, they are doing something useful with their skills - and being paid
quite handsomely for it. I only wish I had the level of expertise they do.

David Brown

unread,
May 12, 2017, 4:29:51 AM5/12/17
to
On 11/05/17 21:45, Jerry Stuckle wrote:
> On 5/11/2017 11:09 AM, David Brown wrote:
>> On 11/05/17 15:59, Jerry Stuckle wrote:
>>> On 5/11/2017 2:27 AM, Gareth Owen wrote:
>>>> David Brown <david...@hesbynett.no> writes:
>>>>

<snipping the bits we agree on>

>>> However, the compiler still has constraints on it based on its design.
>>> The programmer has no such constraints.
>>
>> In theory, yes - in practice, no. The programmer has constraints -
>> there are limits to how well he/she can track large numbers of
>> registers, or processor pipelines, or multi-issue scheduling. Making a
>> small change in one part of the assembly can have knock-on effects in
>> the timings in other parts. A human programmer simply cannot keep track
>> of it all without an inordinate amount of time and effort.
>>
>
> In practice, yes. There are limits to both registers and pipelines
> which a good programmer can handle. The same with scheduling.

It gets /really/ fun on a processor like MIPS, with branch delay slots
to consider!

> But I
> disagree that a small change in one part of the assembly will have any
> noticeable effect on unrelated code.

"/will/ have a noticeable effect" implies that it is always or usually the
case - I said "/can/", because it is something that /can/ happen. I
don't think it is common, but it is absolutely possible. One example
situation is where you are trying to keep all your relevant data in
registers. Changes in one part of the function can mean you need
different priorities in a different part - either earlier or later. It
might not mean /big/ changes, but it will mean changes. In C, you can
make another local variable when you want it - in assembly, if you run
out of registers you need to make changes.

Another case is when you have limited branch or conditional execution
capability. Some processors have support for small and fast versions
that are limited in size. For example, the ARM Thumb-2 instruction set
has an "if-then-else" (IT) construct that supports up to four conditional
instructions. Make a change that goes over that limit of four, and you
have to re-structure the code to use conditional branches.

This kind of thing is usually more relevant on RISC cpus than x86
(especially in 32-bit mode) where your register usage is already very
limited in flexibility.

All this might not have a noticeable effect on code performance - but it
does have an effect on the time and effort spent.

>
>> So /sometimes/ an assembler programmer can do better, especially on
>> short sequences or particular cases that might map well to processor
>> instructions but poorly to C code. But often it is simply too much
>> effort - even an expert assembly programmer will not be willing to spend
>> the time on the tedious detail, and has a high chance of getting at
>> least something wrong.
>>
>
> In some cases, no amount of effort is "too much".

True.

> I can think of many
> real-time systems where even saving a microsecond is important to the
> code. Data collection and analysis of large scientific projects such as
> the VLA (large array of radio telescopes), the NSF's LIGO (gravitational
> wave detector) and CERN's Large Hadron Collider are just three examples.

No, those are /not/ good examples. It is true that minimising latencies
here can be very important - but it is not the case that it is done by
using assembly. For these sorts of systems, coding is done carefully in
C or C++ for maintainability, correctness, flexibility, and lower
development costs. When the smallest latencies are needed (and the
smallest variation in latencies, which is usually more important), they
switch to programmable logic. Assembly gives very little benefit over C
or C++ in terms of speed, no benefit in terms of latency variation (it
is still subject to caches, interrupts, etc.), and much higher
development costs and risks. Programmable logic lets them get latency
variation down to a single clock cycle - it is worth the cost.

On the other hand, these same groups need to do massive analysis of the
data afterwards - passing it through filters, fourier transforms, etc.,
that need as high throughput as possible but which can tolerate
variations in timings. This sort of thing is done on arrays of
commodity processors (or perhaps graphics processors). Here it can be
worth making the kernels of these filters in assembly - you know the
target processor, and saving 2% average time will save 2% hardware
budget and 2% electricity costs. It is absolutely worth the effort
going for assembly there.

>
>> In particular, if you insist on writing clear and maintainable code in
>> an appropriate timeframe, as most professional programmers aim for, then
>> it is very rare that even an expert assembler programmer will beat a
>> good compiler. It is perfectly possible to write clear and maintainable
>> assembly code - but it is rarely the fastest possible result on a modern
>> chip.
>>
>
> It is perfectly possible to write clean and maintainable assembly code -
> and make it the fastest possible result. It's done every day at each of
> the above projects.

Possible? Yes. But it is a rare thing - especially on "big"
processors. It is a different matter for small microcontrollers, where
it is far easier to track exactly what you are doing because there is a
small instruction set, no scheduling or multiple issues, very clear and
simple pipelines, no caches, etc.

>
>>>
>>> Additionally, the compiler generates code for a specific series of
>>> processors - i.e. 32 or 64 bit. If the programmer knows the code will
>>> only run on one specific processor, he/she can write code specific to
>>> that processor.
>>
>> Many compilers can generate code specifically for particular target
>> processors. (Sorry, Jacob, but I must use gcc as the example again - it
>> is the compiler I know best.) gcc has options to generate code that
>> will work on a range of cpus (within a family such as x86-32) but have
>> scheduling optimised for one particular target or subfamily. Or it can
>> generate code that /requires/ a particular subfamily feature. Or it can
>> generate a number of implementations for a given function, and pick the
>> best one at run-time based on the actual cpu that is being used.
>>
>> Yes, all of that /can/ be done by an assembler programmer - but it is
>> unrealistic to think that it /would/ be done, except in extreme cases.
>>
>
> As I said - there are many instances in the scientific world where that
> is necessary. And there are programmers who do it.
>

I agree that it /is/ done, I merely say that it is very rarely worth the
effort. Massive scientific processing is an example where it sometimes
/is/ worth the effort.

asetof...@gmail.com

unread,
May 12, 2017, 1:06:34 PM5/12/17
to
Woman and computer it will be 3%

fir

unread,
May 14, 2017, 2:00:13 PM5/14/17
to
On Monday, 8 May 2017 at 19:34:17 UTC+2, Lynn McGuire wrote:
> "Need for Speed - C++ versus Assembly Language"
>
> https://www.codeproject.com/Articles/1182676/Need-for-Speed-Cplusplus-versus-Assembly-Language
>
> Neat ! I believe that ANY C++ compiler and linker duo can beat my hand
> written assembly language. It's been decades since I wrote any assembly
> language.
>

I must agree with Navia and Stuckle in this thread.

Who is expecting that simple, lame assembly will beat a C/C++ compiler
straight away?

Certainly not me, especially in the presented case, where the assembly
really seems unoptimised and quite lamely written <<and very lamely
tested - the author of this should really test parts of it, give the
asm output from the compiler, and describe the differences>>

But if some effort were put into such assembly, I would expect it to
beat the compiler.

- by how much is an open question to me, though two points (maybe among
yet more possible ones) were mentioned: -- compilers are really bad at
writing vectorised SIMD code -- a programmer can rewrite the original
source where a compiler can't

- of course it will cost some effort, but this effort and complexity is
not 'insane', as some say (and lie).. the problem is learning assembly,
but after that it is not such an amazingly hard or big amount of work -
anyway, there is some amount of effort to put in here, and each person
imo needs to calculate for themselves whether it is worth doing (it
depends on two separate 'standpoints', as I call them - the standpoint
of a 'producer' or the standpoint of an 'optimiser')

As I mentioned once on comp.lang.c, I am quite able to beat the
compiler, but I prefer to do it at the C level, as it is easier.

(I once gave an example where I started from some code for 'casting' a
virtual textured sky 'dome' through virtual camera coordinates onto the
screen, on the CPU; starting from something like a 30-60 ms frame time,
I quickly got about 25 ms, then 9-7 ms, and 2 ms after a week)
- note, beating the optimising compiler 15x or more

And I still see ways to optimise it yet more, but such optimisation is
funny, as it resembles a kind of hourglass shape - in the first phase
you simplify the code to get only muls and adds (with some
precomputation, tabularisation and 'linearisation'), and at this stage
the core of the code gets simpler and 'flatter', but after that, when I
began to add yet more cases and switches, the already-optimised code
began to get big, stiff and complex ;c
(so I stopped it; I also didn't go down to assembly level, though I
believe assembly would make it faster - but I also believe not amazingly
faster (like 10x), because there is an imo uncrossable hard limit of
memory bandwidth - so I believed asm could mostly help in improving the
'cpu flow' or 'cpu register flow', but not with the core problem, which
is that processing a given amount of memory must take its time *)

* BTW, I still don't know why they don't improve this memory bandwidth
like 5 or 10 times. What's the problem? Can't they do something like a
10-times-wider memory channel/bus, or something like that?

Still, I believe that rewriting all these plasma loops on the CPU is
profitable and can do better than compilers (the problem imo, though,
is that the CPU, and thus asm, is not the hard limit on efficiency if
the code is already optimised - the bottleneck is RAM bandwidth (and
hell, I don't know why) - if they improved the bandwidth, the generated
asm could become the main bottleneck again, probably again allowing
theoretical gains from hand assembly like 10x - which now I would not
expect, as I said, not because the generated asm is 'so good' but
because RAM is the limit, not the CPU)

I may be somewhat wrong here, but this is how I see it and what I have
experience with.

Jerry Stuckle

unread,
May 15, 2017, 12:02:56 AM5/15/17
to
But in that case you're not talking about unrelated parts of the code.
They are quite closely related.

> Another case is when you have limited branch or conditional execution
> capability. Some processors have support for small and fast versions
> that are limited in size. For example, the ARM Thumb-2 instruction set
> has an "if-then-else" (IT) construct that supports up to four conditional
> instructions. Make a change that goes over that limit of four, and you
> have to re-structure the code to use conditional branches.
>

Once again, you're not talking unrelated code.

> This kind of thing is usually more relevant on RISC cpus than x86
> (especially in 32-bit mode) where your register usage is already very
> limited in flexibility.
>
> All this might not have a noticeable effect on code performance - but it
> does have an effect on the time and effort spent.
>

But it *can* have an effect on code performance, and in some cases that
is quite important.

>>
>>> So /sometimes/ an assembler programmer can do better, especially on
>>> short sequences or particular cases that might map well to processor
>>> instructions but poorly to C code. But often it is simply too much
>>> effort - even an expert assembly programmer will not be willing to spend
>>> the time on the tedious detail, and has a high chance of getting at
>>> least something wrong.
>>>
>>
>> In some cases, no amount of effort is "too much".
>
> True.
>
>> I can think of many
>> real-time systems where even saving a microsecond is important to the
>> code. Data collection and analysis of large scientific projects such as
>> the VLA (large array of radio telescopes), the NSF's LIGO (gravitational
>> wave detector) and CERN's Large Hadron Collider are just three examples.
>
> No, those are /not/ good examples. It is true that minimising latencies
> here can be very important - but it is not the case that it is done by
> using assembly. For these sorts of systems, coding is done carefully in
> C or C++ for maintainability, correctness, flexibility, and lower
> development costs. When the smallest latencies are needed (and the
> smallest variation in latencies, which is usually more important), they
> switch to programmable logic. Assembly gives very little benefit over C
> or C++ in terms of speed, no benefit in terms of latency variation (it
> is still subject to caches, interrupts, etc.), and much higher
> development costs and risks. Programmable logic lets them get latency
> variation down to a single clock cycle - it is worth the cost.
>

Those are excellent examples, and the people who run those projects pay
top dollar for programmers who can do just what I described. Virtually
all of the time-critical code is assembler. Only non-critical sections
are C, C++ or other languages.

> On the other hand, these same groups need to do massive analysis of the
> data afterwards - passing it through filters, fourier transforms, etc.,
> that need as high throughput as possible but which can tolerate
> variations in timings. This sort of thing is done on arrays of
> commodity processors (or perhaps graphics processors). Here it can be
> worth making the kernels of these filters in assembly - you know the
> target processor, and saving 2% average time will save 2% hardware
> budget and 2% electricity costs. It is absolutely worth the effort
> going for assembly there.
>

Throughput after the fact is not as critical as throughput during the
event. For real time processing, if you miss something, it is gone forever.

>>
>>> In particular, if you insist on writing clear and maintainable code in
>>> an appropriate timeframe, as most professional programmers aim for, then
>>> it is very rare that even an expert assembler programmer will beat a
>>> good compiler. It is perfectly possible to write clear and maintainable
>>> assembly code - but it is rarely the fastest possible result on a modern
>>> chip.
>>>
>>
>> It is perfectly possible to write clean and maintainable assembly code -
>> and make it the fastest possible result. It's done every day at each of
>> the above projects.
>
> Possible? Yes. But it is a rare thing - especially on "big"
> processors. It is a different matter for small microcontrollers, where
> it is far easier to track exactly what you are doing because there is a
> small instruction set, no scheduling or multiple issues, very clear and
> simple pipelines, no caches, etc.
>

I've seen million+ lines of assembler code, even on mainframes. It is
done much more often than you think.

>>
>>>>
>>>> Additionally, the compiler generates code for a specific series of
>>>> processors - i.e. 32 or 64 bit. If the programmer knows the code will
>>>> only run on one specific processor, he/she can write code specific to
>>>> that processor.
>>>
>>> Many compilers can generate code specifically for particular target
>>> processors. (Sorry, Jacob, but I must use gcc as the example again - it
>>> is the compiler I know best.) gcc has options to generate code that
>>> will work on a range of cpus (within a family such as x86-32) but have
>>> scheduling optimised for one particular target or subfamily. Or it can
>>> generate code that /requires/ a particular subfamily feature. Or it can
>>> generate a number of implementations for a given function, and pick the
>>> best one at run-time based on the actual cpu that is being used.
>>>
>>> Yes, all of that /can/ be done by an assembler programmer - but it is
>>> unrealistic to think that it /would/ be done, except in extreme cases.
>>>
>>
>> As I said - there are many instances in the scientific world where that
>> is necessary. And there are programmers who do it.
>>
>
> I agree that it /is/ done, I merely say that it is very rarely worth the
> effort. Massive scientific processing is an example where it sometimes
> /is/ worth the effort.
>

And I'm saying you're wrong - there are times when it is quite
important, and it is done this way. Much more often than you think.

And BTW - another example is in weather prediction. Most of the models
are written in assembler and run on supercomputers. Even at that, to
get an accurate 24 hour forecast would take 72 hours of processing time.

>>>>
>>>> With that said, there are very few assembler programmers with the
>>>> required level of expertise nowadays. But they are out there.
>>>>
>>>
>>> Agreed. And they are mostly spending their time doing something
>>> /useful/ with those skills, rather than trying to beat compiler output
>>> by a fraction of a percent on one particular processor. For example,
>>> they are involved in writing compilers :-)
>>>
>>
>> Yes, they are doing something useful with their skills - and being paid
>> quite handsomely for it. I only wish I had the level of expertise they do.
>>
>

Christian Gollwitzer

unread,
May 15, 2017, 3:32:28 PM5/15/17
to
Am 15.05.17 um 06:02 schrieb Jerry Stuckle:
> And BTW - another example is in weather prediction. Most of the models
> are written in assembler and run on supercomputers.

Do you have a reference for this? Supercomputers are used, no doubt, but
I wouldn't have expected that they write the PDE solvers in assembly.

I found one at https://github.com/yyr/wrf which is written in Fortran.

Christian

Gareth Owen

unread,
May 15, 2017, 3:48:58 PM5/15/17
to
Christian Gollwitzer <auri...@gmx.de> writes:

> Am 15.05.17 um 06:02 schrieb Jerry Stuckle:
>> And BTW - another example is in weather prediction. Most of the models
>> are written in assembler and run on supercomputers.
>
> Do you have a reference for this? Supercomputers are used, no doubt,
> but I wouldn't have expected that they write the PDE solvers in
> assembly.

Most of them are in Fortran - some are in C and C++. I've worked on and
seen codebases for large scale climate and weather modeling, and I've
seen literally no assembler. In fact, few of them are even running on
the same architectures on which they originally ran (and many are run on
multiple architectures, as most predictions are ensemble averages of
deliberately mildly different runs - different architectures help with that).

It might have been true 30 years ago, which is Jerry's default point of
reference.

This is hardly surprising, as easy portability to the latest-and-fastest
processors is far more critical for performance than squeezing 3% out of
the current processor using assembler.

Jerry Stuckle

unread,
May 15, 2017, 4:12:46 PM5/15/17
to
You won't find these programs on github. And anything you find on
github in this area is not used by serious forecasters.

Jerry Stuckle

unread,
May 15, 2017, 4:18:10 PM5/15/17
to
On 5/15/2017 3:48 PM, Gareth Owen wrote:
> Christian Gollwitzer <auri...@gmx.de> writes:
>
>> Am 15.05.17 um 06:02 schrieb Jerry Stuckle:
>>> And BTW - another example is in weather prediction. Most of the models
>>> are written in assembler and run on supercomputers.
>>
>> Do you have a reference for this? Supercomputers are use, no doubt,
>> but I wouldn't have expected that they write the PDE solvers in
>> assembly.
>
> Most of them are in Fortran - some are in C and C++. I've worked on and
> seen codebases for large scale climate and weather modeling, and I've
> seen literally no assembler. In fact, few of them are even running on
> the same architectures on which they originally run (and many are run on
> multiple-archs as most predictions are ensemble averages of deliberately
> mildly different runs - different architectures help with that).
>

Yes, some code is written in Fortran. But there is also a significant
amount of assembler - I've seen it.

> It might have been true 30 years ago, which is Jerry's default point of
> reference.
>

My reference is fresh - this year, in fact.

> This is hardly surprising, as easy portability to the latest-and-fastest
> processors is far more critical for performance than squeezing 3% out of
> the current processor using assembler.
>

Then you haven't seen the code for the models used on those
supercomputers. I'm referring specifically to GFS and European models,
although others are similar.

fir

unread,
May 15, 2017, 4:40:38 PM5/15/17
to
I remember a thread once about a big-number library - it was probably GMP - which uses assembly as far as I remember (someone might try to find info on how assembly is faster there)... I also remember a probably-fastest fractal/mandelbrot 'explorer' that also uses assembly, AFAIR. I would tend to believe that most of the fastest CPU solutions go to assembly anyway (but I don't do research in this field).

Gareth Owen

unread,
May 15, 2017, 4:43:31 PM5/15/17
to
Jerry Stuckle <jstu...@attglobal.net> writes:

> Then you haven't seen the code for the models used on those
> supercomputers. I'm referring specifically to GFS and European models,
> although others are similar.

https://www.researchgate.net/publication/228791697_FAMOUS_faster_using_parallel_computing_techniques_to_accelerate_the_FAMOUSHadCM3_climate_model_with_a_focus_on_the_radiative_transfer_algorithm

Here's a paper about the HadCM3 model - which is written in C and Fortran.
As discussed there, when they need real performance they use massively
parallel frameworks like OpenCL, not assembler. Most of the discussion in
the paper is about porting the speed-critical sections *from* Fortran.

Gareth Owen

unread,
May 15, 2017, 4:44:51 PM5/15/17
to
fir <profes...@gmail.com> writes:

> I remember a thread once about a big-number library - it was probably
> GMP - which uses assembly as far as I remember (someone might try to
> find info on how assembly is faster there)... I also remember a
> probably-fastest fractal/mandelbrot 'explorer' that also uses assembly,
> AFAIR. I would tend to believe that most of the fastest CPU solutions
> go to assembly anyway (but I don't do research in this field)

One place where assembly is widely used is FFmpeg, for encoding video.

Chris M. Thomasson

unread,
May 15, 2017, 4:54:22 PM5/15/17
to
Wrt fast fractal explorers, assembly on the cpu is very slow compared to
implementing the fractal in a GPU shader.

fir

unread,
May 15, 2017, 5:10:52 PM5/15/17
to
Well, that's obvious (the GPU has more processing power), but we're talking here about the CPU.

As to the GPU: the GPU also has its own C and its own assembly. I was once interested in how much faster assembly coding on the GPU is compared to C coding on the GPU, but it is hard to find people able to answer that.

PS: I once posted on comp.lang.c some simple tests related to this topic (comparing scalar C, asm - well, half-asm/intrinsics - and GPU C). The test is interesting in its simplicity, so I'll repaste it; someone could take it as a base for a benchmark.


>>>paste below>>>>

Tonight I ran my first GPU C code that actually does something, and made some tests.

This is a simple Mandelbrot drawing code. First I ran the scalar version:



int mandelbrot_n( float cRe, float cIm, int max_iter )
{
    float re = cRe;
    float im = cIm;

    float rere = re*re;
    float imim = im*im;

    for (int n = 1; n <= max_iter; n++)
    {
        im = (re+re)*im + cIm;
        re = rere - imim + cRe;

        rere = re*re;
        imim = im*im;

        if ( (rere + imim) > 4.0 )
            return n;
    }

    return 0;
}

For 256 x 256 pixels x 1000 iterations it takes 90 ms.


Then I made an SSE intrinsics version:

__attribute__((force_align_arg_pointer))
__m128i mandelbrot_n_sse( __m128 cre, __m128 cim, int max_iter )
{
    __m128 re = _mm_setzero_ps();
    __m128 im = _mm_setzero_ps();

    __m128 _1 = _mm_set_ps1(1.);
    __m128 _4 = _mm_set_ps1(4.);

    __m128 iteration_counter = _mm_set_ps1(0.);

    for (int n = 0; n <= max_iter; n++)
    {
        __m128 re2 = _mm_mul_ps(re, re);
        __m128 im2 = _mm_mul_ps(im, im);
        __m128 radius2 = _mm_add_ps(re2, im2);

        __m128 compare_mask = _mm_cmplt_ps( radius2, _4 );
        iteration_counter = _mm_add_ps( iteration_counter,
                                        _mm_and_ps(compare_mask, _1) );
        if (_mm_movemask_ps(compare_mask) == 0) break;

        __m128 ren = _mm_add_ps( _mm_sub_ps(re2, im2), cre );
        __m128 reim = _mm_mul_ps(re, im);

        __m128 imn = _mm_add_ps( _mm_add_ps(reim, reim), cim );

        re = ren;
        im = imn;
    }

    __m128i n = _mm_cvtps_epi32(iteration_counter);

    return n;
}

This runs in 20 ms (more than 4 times faster - I don't know why).

(The processor I ran it on is an old Core 2 E6550 at 2.33 GHz anyway - I have a better machine with AVX support but didn't use it here yet.)


Then I made the OpenCL code:

"__kernel void square( \n" \
" __global int* input, \n" \
" __global int* output, \n" \
" const unsigned int count) \n" \
"{ \n" \
" int i = get_global_id(0); \n" \
" if(i < count) \n" \
" { \n" \
" int x = i%256; \n" \
" // if(x>=256) return; \n" \
" int y = i/256; \n" \
" // if(y>=256) return; \n" \
" float cRe = -0.5 + -1.5 + x/256.*3.; \n" \
" float cIm = 0.0 + -1.5 + y/256.*3.; \n" \
" float re = 0; \n" \
" float im = 0; \n" \
" int n = 0; \n" \
" for( n=0; n<=1000; n++) { \n" \
" if( re * re + im * im > 4.0 ) { output[256*y+x] = n + 256*n + 256*256*n; return;} \n" \
" float re_n = re * re - im * im + cRe; \n" \
" float im_n = 2 * re * im + cIm; \n" \
" re = re_n; \n" \
" im = im_n; \n" \
" } \n" \
" output[256*y+x] = 250<<8; \n" \
" } \n" \
"} \n" \
"\n";

This works with no problem and runs at 7 ms
(I have a weak GPU, a GT 610).

How to optimise this GPU version? Is it common to write such scalar code on the GPU, or is there some way of writing something like SSE intrinsics here, or some other kind of optimisation?


(Anyway, I must say I don't fully agree with those critics of GPU/OpenCL
coding - this works easily and fine, at least for some cases. Especially
good is that there is not too much slowdown when launching the GPU from
the CPU and getting the results back - it seems I can run it within a
1 millisecond window, so that's very fine. I believe harder code may get
slower, but I also believe that with a better card I may do better than
7 ms.)

fir

unread,
May 15, 2017, 5:29:41 PM5/15/17
to
Well, you're comparing different hardware - the stronger hardware will just win the race.

As to the GPU, my estimate (which was more a basis for thought) was that an 'average' GPU is 'roughly' '10 times' stronger than an 'average' CPU (especially comparing the memory bandwidth of a GPU vs. that of a CPU). But I have also heard that it is especially good at executing organized, simple 'code fibers'; when many branches and more complex kernels arise, it slows down radically and can even fall below the CPU (I don't know the details - I was once interested in learning OpenCL but dropped it).

Jerry Stuckle

unread,
May 15, 2017, 5:50:20 PM5/15/17
to
As I said - I was referring to the GFS and European models. However,
HadCM3 is a climate prediction model, not a weather forecasting model.
Two entirely different solutions to two entirely different problems.

Weather forecasting models don't try to predict more than about 10 days
in advance, and try to get very accurate within 3-5 days. Climate
prediction models such as HadCM3 are more interested in long-term forecasts.

Not to say HadCM3 is not important - it is. But it's not the same. And
you can do parallel processing in assembler, also. After all, they all
end up as machine language, anyway.