
Simplest compiler for C


Mr. Man-wai Chang

Nov 14, 2016, 5:57:19 AM

Is it called Tiny C Compiler? is it still alive, and well? :)

Mark Storkamp

Nov 14, 2016, 9:44:45 AM
In article <o0c598$feb$5...@dont-email.me>,
"Mr. Man-wai Chang" <toylet...@gmail.com> wrote:

> Is it called Tiny C Compiler? is it still alive, and well? :)

<http://lmgtfy.com/?q=tiny+c+compiler>

<http://lmgtfy.com/?q=simplest+c+compiler>

Mr. Man-wai Chang

Nov 14, 2016, 9:46:13 AM
You don't know Tiny C compiler? :)

Rick C. Hodgin

Nov 14, 2016, 11:15:59 AM
On Monday, November 14, 2016 at 5:57:19 AM UTC-5, Mr. Man-wai Chang wrote:
> Is it called Tiny C Compiler? is it still alive, and well? :)

I've used Tiny C Compiler. It is very fast in compilation, and runs
pretty slow in execution. For quick-and-dirty jobs it's more than
fast enough. It's fast enough to be used in scripting languages in
lieu of other script code.

Niklaus Wirth's "Algorithms + Data Structures = Programs" and Jan van
de Snepscheut's "What Computing is All About" are also nice resources
for learning how to create compilers. And, of course, Tiny C by Marc
Feeley.

https://en.wikipedia.org/wiki/Algorithms_%2B_Data_Structures_%3D_Programs
http://www.springer.com/us/book/9781461276395
https://gist.github.com/KartikTalwar/3095780

Best regards,
Rick C. Hodgin

Mr. Man-wai Chang

Nov 14, 2016, 12:29:40 PM
On 15/11/2016 12:15 AM, Rick C. Hodgin wrote:
> I've used Tiny C Compiler. It is very fast in compilation, and runs
> pretty slow in execution. For quick-and-dirty jobs it's more than
> fast enough. It's fast enough to be used in scripting languages in
> lieu of other script code.

You can use it as a teaching tool. I wonder whether it runs in modern Linux.

> Niklaus Wirth's "Algorithms + Data Structures = Programs" and Jan van
> de Snepscheut "What Computing is All About" are also nice resources
> for learning how to create compilers. And, of course, Tiny C by Marc
> Feeley.

Thank you, Master. When I am bored, reading them can be entertaining
like puzzles. :)

luser droog

Nov 14, 2016, 12:40:09 PM
There is also "Small C", which is published in
Dr. Dobb's Toolbox of C with many related articles
and tools. (It was originally user contributions
to the magazine.) AIUI Tiny C has a simpler design.

And "Smaller C" is also similar. You can find the
author's announcements around Usenet, IIRC in comp.lang.misc.

Dutch Ingraham

Nov 14, 2016, 12:44:03 PM
Definitely a bot...

supe...@casperkitty.com

Nov 14, 2016, 12:53:47 PM
On Monday, November 14, 2016 at 10:15:59 AM UTC-6, Rick C. Hodgin wrote:
> I've used Tiny C Compiler. It is very fast in compilation, and runs
> pretty slow in execution. For quick-and-dirty jobs it's more than
> fast enough. It's fast enough to be used in scripting languages in
> lieu of other script code.

How is it for execution if one "helps it out" by writing code which mimics
the sequence of operations of efficient machine code? For example, how would
its code for:

void send_bytes(volatile unsigned char *dest, unsigned char *src, int n)
{
    while(n--)
    {
        *dest = *src++; // Assume dest is an I/O register; no inc needed there
    }
}

compare with something like:

void send_bytes(volatile unsigned char *dest, unsigned char *src, int n)
{
    if (n)
    {
        src += n;
        n = -n;
        do
        {
            *dest = src[n]; // Assume dest is an I/O register; no inc needed there
        } while(++n);
    }
}

On x86, I would expect the compiler to modify the counter and pointer
separately in the first loop, and would not be surprised if a simple
compiler had to do a compare with zero on each loop pass in addition to
the decrement, while the latter variation of the code could be easily
peephole optimized to use a [base+disp] addressing mode, and to eliminate
the compare-with-zero after the increment.

Since a relatively small amount of code needs to be maximally efficient,
having to write a few portions in "clunky" style would seem for many use
cases a small price to pay for faster compilation.
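
A quick way to check that empirically is a little timing harness like the
sketch below (my names send_bytes_v1/send_bytes_v2, and the volatile "sink"
standing in for the I/O register, are just for the test):

#include <stdio.h>
#include <time.h>

static volatile unsigned char sink;   /* stands in for the I/O register */

static void send_bytes_v1(volatile unsigned char *dest, unsigned char *src, int n)
{
    while (n--)
    {
        *dest = *src++;
    }
}

static void send_bytes_v2(volatile unsigned char *dest, unsigned char *src, int n)
{
    if (n)
    {
        src += n;
        n = -n;
        do
        {
            *dest = src[n];
        } while (++n);
    }
}

int main(void)
{
    static unsigned char buf[65536];
    clock_t t;
    int i;

    t = clock();
    for (i = 0; i < 10000; i++)
        send_bytes_v1(&sink, buf, (int)sizeof buf);
    printf("v1: %.2f seconds\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    t = clock();
    for (i = 0; i < 10000; i++)
        send_bytes_v2(&sink, buf, (int)sizeof buf);
    printf("v2: %.2f seconds\n", (double)(clock() - t) / CLOCKS_PER_SEC);
    return 0;
}

Built with each compiler, the gap between v1 and v2 (and between compilers)
shows how much the "clunky" style actually buys.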

BartC

Nov 14, 2016, 1:23:27 PM
On 14/11/2016 17:29, Mr. Man-wai Chang wrote:
> On 15/11/2016 12:15 AM, Rick C. Hodgin wrote:
>> I've used Tiny C Compiler. It is very fast in compilation, and runs
>> pretty slow in execution. For quick-and-dirty jobs it's more than
>> fast enough. It's fast enough to be used in scripting languages in
>> lieu of other script code.
>
> I wonder whether it runs in modern Linux.

I managed to compile 'hello world' with TCC.EXE from Windows, using
'wine', after duplicating the needed files in ./lib and ./include. However,
that produces HELLO.EXE, so you also need 'wine' to run any output.

But 'wine' also seems to slow it down considerably (eg. 150ms on Windows
=> 1000ms on Linux). Although this was a virtual Linux, which might be a
factor too.

--
Bartc

Rick C. Hodgin

Nov 14, 2016, 1:31:50 PM
I have never tried to optimize anything for it. Here is its timing
compared to other compilers on a simple Mandelbrot calculation, from a
benchmark a fellow C developer sent me in 2014 (timer resolution only to
0.01 seconds):

GCC -O3 or -O2 1.00 seconds
GCC -O or -O1 1.13 seconds
Java 1.23 seconds
C# 1.34 seconds
VisualBasic.Net 1.52 seconds
FreePascal -O3 1.92 seconds
GCC, no options 1.96 seconds
GCC -Os 2.38 seconds
Borland C 2.80 seconds
TinyC 3.23 seconds
FreeBasic 3.36 seconds
euc -gcc -con 3.58 seconds
Oxygen basic 3.77 seconds
PowerBasic 4.03 seconds

As I recall, it compiles on the fly, always requiring source code. It
also doesn't include any optimizations. It has poor register use, and
could be greatly improved with even a minimal optimizer.

You can read about the tradeoffs here:

https://en.wikipedia.org/wiki/Tiny_C_Compiler#Compiled_program_performance

It's still quite a work beast. :-) It's just not suitable for production
use outside of scripts (in my opinion). I was surprised to read that it
has bested GCC on some tests, however (stated in the Wikipedia link above).

Anton Shepelev

Nov 14, 2016, 1:38:03 PM
Mr. Man-wai Chang:

> Is it called Tiny C Compiler? is it still alive, and well? :)

There is also neatcc by a Muslim programmer:

http://litcave.rudi.ir/

Its full source is about 178 Kb. This compiler implements
"a large subset of ANSI C" and has optimisations.

--
() ascii ribbon campaign - against html e-mail
/\ http://preview.tinyurl.com/qcy6mjc [archived]

Anton Shepelev

Nov 14, 2016, 1:49:23 PM
Rick C. Hodgin:

> Niklaus Wirth's "Algorithms + Data Structures =
> Programs" and Jan van de Snepscheut "What Comput-
> ing is All About" are also nice resources for
> learning how to create compilers.

Wirth has written a book titled "Compiler Construction":

http://www.ethoberon.ethz.ch/WirthPubl/CBEAll.pdf

BartC

Nov 14, 2016, 1:57:41 PM
On 14/11/2016 18:31, Rick C. Hodgin wrote:
> On Monday, November 14, 2016 at 12:53:47 PM UTC-5, supe...@casperkitty.com wrote:

>> Since a relatively small amount of code needs to be maximally efficient,
>> having to write a few portions in "clunky" style would seem for many use
>> cases a small price to pay for faster compilation.
>
> I have never tried to optimize anything for it. In a benchmark a fellow
> C developer sent me in 2014, its timing compared to other compilers on a
> simple Mandlebrot calculation (timer resolution only to 0.01 seconds):
>
> GCC -O3 or -O2 1.00 seconds
> GCC -O or -O1 1.13 seconds
> Java 1.23 seconds
> C# 1.34 seconds
> VisualBasic.Net 1.52 seconds
> FreePascal -O3 1.92 seconds
> GCC, no options 1.96 seconds
> GCC -Os 2.38 seconds
> Borland C 2.80 seconds
> TinyC 3.23 seconds
> FreeBasic 3.36 seconds
> euc -gcc -con 3.58 seconds
> Oxygen basic 3.77 seconds
> PowerBasic 4.03 seconds
>
> As I recall, it compiles on the fly, always requiring source code.

No. It works like any other compiler, in the ability to generate object
files and link into executables. But its speed makes it practical to run
from source each time.

> You can read about the tradeoffs here:
>
> https://en.wikipedia.org/wiki/Tiny_C_Compiler#Compiled_program_performance
>
> It's still quite a work beast. :-) It's just not suitable for production
> use outside of scripts (in my opinion). I was surprised to read that it
> has bested GCC on some tests, however (stated in the Wikipedia link above).

IME, on (my) real programs, not benchmarks, TCC generates code which is
around half the speed of gcc. Unless switch statements are used, in which
case it can slow down considerably (I guess it handles those as
if-else-if chains).

(On one of my compiler projects - which uses switch statements heavily -
C code compiled with TCC still managed some 350,000 lines per second.
Other C compilers managed twice that speed or more, while gcc -O3 was
nearly three times the speed. But figures can depend on the exact
processor in use.)
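
(A quick way to test that guess: a switch-heavy microbenchmark like the
sketch below - the little state-machine loop is just an invented example -
built once with tcc and once with gcc -O3. A jump table versus an
if-else-if chain should show up as a large timing gap.)

#include <stdio.h>
#include <time.h>

int main(void)
{
    unsigned long acc = 0;
    unsigned int state = 0;
    long i;
    clock_t t = clock();

    for (i = 0; i < 200000000L; i++) {
        switch (state & 15) {          /* dense cases: a good jump-table candidate */
            case 0:  acc += 1;  break;
            case 1:  acc += 3;  break;
            case 2:  acc -= 2;  break;
            case 3:  acc ^= 7;  break;
            case 4:  acc += 11; break;
            case 5:  acc -= 5;  break;
            case 6:  acc += 13; break;
            case 7:  acc ^= 9;  break;
            case 8:  acc += 17; break;
            case 9:  acc -= 4;  break;
            case 10: acc += 19; break;
            case 11: acc ^= 21; break;
            case 12: acc += 23; break;
            case 13: acc -= 6;  break;
            case 14: acc += 29; break;
            default: acc ^= 31; break;
        }
        state = state * 1103515245u + 12345u;   /* cheap LCG to vary the case */
    }

    printf("acc=%lu, %.2f seconds\n", acc,
           (double)(clock() - t) / CLOCKS_PER_SEC);
    return 0;
}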

So TCC is not a toy compiler. However, it's a little unsatisfactory that
TCC depends on being compiled with gcc -O3 for its speed!

> It
> also doesn't include any optimizations. It has poor register use, and
> could be greatly improved with even a minimal optimizer.
>

(IMO also it wouldn't be hard to tweak it to generate more reasonable
code, for someone familiar with its workings.)

--
Bartc

Rick C. Hodgin

Nov 14, 2016, 2:02:30 PM
On Monday, November 14, 2016 at 1:49:23 PM UTC-5, Anton Shepelev wrote:
> Rick C. Hodgin:
>
> > Niklaus Wirth's "Algorithms + Data Structures =
> > Programs" and Jan van de Snepscheut "What Comput-
> > ing is All About" are also nice resources for
> > learning how to create compilers.
>
> Wirth has written a book titled "Compiler construc-
> tion":
>
> http://www.ethoberon.ethz.ch/WirthPubl/CBEAll.pdf

Thank you, sir!

Rick C. Hodgin

Nov 14, 2016, 2:07:46 PM
On Monday, November 14, 2016 at 1:57:41 PM UTC-5, Bart wrote:
> On 14/11/2016 18:31, Rick C. Hodgin wrote:
> > As I recall, it compiles on the fly, always requiring source code.
>
> No. It works like any other compiler, in the ability to generate object
> files and link into executables. But its speed makes it practical to run
> from source each time.

My mistake.

> > You can read about the tradeoffs here:
> >
> > https://en.wikipedia.org/wiki/Tiny_C_Compiler#Compiled_program_performance
> >
> > It's still quite a work beast. :-) It's just not suitable for production
> > use outside of scripts (in my opinion). I was surprised to read that it
> > has bested GCC on some tests, however (stated in the Wikipedia link above).
>
> IME, on (my) real programs not benchmarks, TCC generates code which is
> around half the speed of gcc. Unless switch statements are used then it
> can slow down considerably (I guess it handles those as if-else-if chains).

My experience has been 1/2 speed to 1/4 speed, and typically in that
range. But I haven't used it much, as most of my "C code" is really
"C++ code" in C clothing. It doesn't compile natively without some
little fixups (mostly adding back in the struct keyword, and changing
some single-line comments).

> (On one of my compiler projects - which uses switch statements heavily -
> C code compiled with TCC still managed some 350,000 lines per second.
> Other C compilers twice that speed or more, while gcc -O3 was nearly
> three times the speed. But figures can depend on the exact processor in
> use.)
>
> So TCC is not a toy compiler. However, it's a little unsatisfactory that
> TCC depends on being compiled with gcc -O3 for its speed!

I think its greatest asset is the one-pass compiler that allows raw
source code to be used. I always thought that was a pretty awesome
thing.
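
If I recall correctly, TCC can also run a C file directly from source with
'tcc -run file.c', and its documentation describes a "C script" mode where
the file starts with a shebang line. A minimal sketch (the
/usr/local/bin/tcc path is just whatever your install uses):

#!/usr/local/bin/tcc -run
#include <stdio.h>

int main(void)
{
    printf("Hello from a C script compiled on the fly\n");
    return 0;
}

Make it executable and run it like a shell script, or skip the shebang and
just invoke 'tcc -run hello.c'.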

> > It
> > also doesn't include any optimizations. It has poor register use, and
> > could be greatly improved with even a minimal optimizer.
> >
>
> (IMO also it wouldn't be hard to tweak it to generate more reasonable
> code, for someone familiar with its workings.)

It would be a fun "week of coding" project for a small team. :-)

jacobnavia

Nov 14, 2016, 2:09:43 PM

So, you are running a simulation of Linux under Windows. In that
emulation you run the Windows emulator, which eventually runs your program.

No wonder!

Gareth Owen

Nov 14, 2016, 2:11:21 PM
And all of us are inside The Matrix.

BartC

Nov 14, 2016, 2:42:44 PM
AFAIK native code ought to run at full speed, as it is not
emulating the x86 processor. And some programs do.

But others (like some of mine!) run slower than they ought, even leaving
wine out of it - perhaps 30-50% slower. And one seemed to be
sensitive to some cache issues. So something about it (VirtualBox) is
causing problems.

At the moment however I can only generate executable code to run
natively on Linux via gcc, which is yet another factor. (And I didn't
forget the -O3!)

--
Bartc

fir

Nov 14, 2016, 3:01:03 PM
Speaking about efficiency (I may repeat it, though I already said it more than once - though not many people mention it, even though those who really optimise should know it):

c beats c in speed like 5-10-20 times

This is because there is C and there is C: if you optimise you will
get rid of divisions, unroll elaborate things, precompute,
tabularise, maybe even linearise by tiling, and more -

after that your code is virtually like memset (more precisely memcpy with some slight arithmetic and branching) - in short, there is C and C

Rick C. Hodgin

Nov 14, 2016, 3:10:37 PM
I experience the same variability. I have tentatively traced it down
to long uptimes, as I typically use the VMs full-screen, scale them
down to a small window, and then save them where they are.

Periodically rebooting seems to resolve some of the variability. I
run Windows Server 2003 in my VM in Linux. I run Linux Mint and Ubuntu
in my VM in Windows. Running Windows in Linux seems to be faster than
running Linux in Windows. But it's still fast enough, as I grew up in
the 80486 DOS days (as I'm pretty sure you did too). :-) Even slower
computers are more than fast enough for most things, I find.

fir

Nov 14, 2016, 3:17:27 PM
Worth mentioning (I also said this more than once; I usually dislike
repeating myself, though in this case I repeat it as this is in some way a
nice 'cool'/'sweet' topic) that from this point you may get further
using SIMD intrinsics (or hand assembly) to rewrite it and
obtain possibly yet better results (but this needs assembly practice
to do it with bigger pain, also I'm not sure as to the possible obtainable speedups - if that would get 3x I would consider it worth it, but if
30%, possibly not -- and I was usually not doing it because it makes the code
stiff/rigid (hard to make changes), and even after C-level unrolling and
precalculation it is already stiff/rigidized - but the topic is cool)

fir

Nov 14, 2016, 3:21:48 PM
> to do it with bigger pain,

to do it without bigger pain*

supe...@casperkitty.com

Nov 14, 2016, 3:40:17 PM
On Monday, November 14, 2016 at 2:01:03 PM UTC-6, fir wrote:
> c beats c in speed like 5-10-20 times

Perhaps you should have added some qualifiers to your uses of "C" above, to
distinguish what flavors you're referring to?

Some people judge the quality of a compiler based upon the performance and
quality of the code it can generate for a program written in
strictly-conforming C; others judge it based upon the performance of the
code it can generate when wielded by a programmer familiar with the
strengths of the underlying platform.

Today's compilers are much better than those of years past at turning
simply-written code into an optimized version *in cases where the code uses
patterns that they recognize*. In some cases, the performance they can
achieve given simply-written code may be better than the performance they
can achieve when given a "hand-optimized" version. On the other hand, the
non-optimized version may end up performing very poorly if it uses a pattern
that the compiler doesn't recognize.

fir

Nov 14, 2016, 4:08:49 PM
On Monday, November 14, 2016 at 9:40:17 PM UTC+1, supe...@casperkitty.com wrote:
> On Monday, November 14, 2016 at 2:01:03 PM UTC-6, fir wrote:
> > c beats c in speed like 5-10-20 times
>
> Perhaps you should have added some qualifiers to your uses of "C" above, to
> distinguish what flavors you're referring to?
>
perhaps not
[if i would do that then though i wanted to say by it would vanish]

fir

Nov 14, 2016, 4:13:05 PM
> [if i would do that then though i wanted to say by it would vanish]

thought*

David Brown

Nov 15, 2016, 3:50:39 AM
On 14/11/16 20:09, jacobnavia wrote:
>
> So, you are running a simulation of linux under windows. In that
> emulation you run the windows emulator, that eventually runs your program.
>

Virtualisation is not simulation. Hardware interaction and privileged
instructions need to be virtualised, and thus run slower. Normal code
runs at normal speed.

And Wine is not an emulator ("WINE" is an acronym for exactly that:
"Wine Is Not an Emulator"). It creates a Windows-like environment for the
code, and implements the Windows API with translations to Linux and X APIs
as needed. For some types of code, this translation makes programs run
noticeably slower. For other types of code, the faster implementations of
libraries and the better underlying OS make Windows code run faster under
Wine. And for the normal code running the application - the straight x86
code - the speed will be the same, as it runs directly on the same
processor.

> No wonder!

I imagine the difference is simply that programs under Wine take a
little longer to start up as Wine needs to create a small Windows-like
environment for the program before starting the exe code itself.

David Brown

Nov 15, 2016, 4:05:41 AM
On 14/11/16 20:42, BartC wrote:
> On 14/11/2016 19:09, jacobnavia wrote:
>>
>> So, you are running a simulation of linux under windows. In that
>> emulation you run the windows emulator, that eventually runs your
>> program.
>>
>> No wonder!
>
> AFAIK native code code ought to run at full speed, as it is not
> emulating the x86 processor. And some programs do.

Yes.

>
> But others (like some of mine!) run slower than they ought, even leaving
> wine out of it, but perhaps 30-50% slower. And one seemed to be
> sensitive to some cache issues. So something about it (Virtualbox) is
> causing problems.

Make sure you isolate any startup delays from the timing - Wine has to
create a bit of Windows-like environment at startup.

Cache issues /can/ have an effect, though usually the speed difference
of the main x86 code between native and VirtualBox is very minor. One
thing to note is that when an OS is running natively, it has more
information about the processors and can make more intelligent
scheduling decisions. For example, if you have 4 cores with SMT, then
the Linux kernel knows which cores are shared between two threads, and
takes that into account when scheduling up to 8 running processes. But
if you create a virtual machine and tell it to use up to 8 processors,
the kernel in the virtual machine only sees 8 cpus but knows nothing
about the SMT pairs, or cache sizes, etc. It also loses (almost) all
information about processor affinity and cache contents - normally the
Linux kernel will try to re-schedule processes on the same core in order
to minimise cache misses. It can't do that when a Windows host OS has
played with the cpus and caches behind its back. (I have no idea how
smart the cpu scheduler is on Windows these days, I only know about the
Linux one.)


>
> At the moment however I can only generate executable code to run
> natively on Linux via gcc, which is yet another factor. (And I didn't
> forget the -O3!)
>

Often -O2 is faster than -O3, as -O3 does a lot more loop unrolling and
other code transformations that increase the size of code. L1 caches,
prefetches, BTBs, etc., usually mean that the smaller code in -O2 is
faster than the expanded -O3 code. But it can depend on details of the
code - picking the best switches for the very fastest code generation is
more of an art than a science. Unless you are interested in reading the
documentation and doing a lot of trial and error, stick to these:

-O2 // Basic solid optimisation
-march=native // Tell gcc to take full advantage of /your/ cpu
-ffast-math // Makes floating point much faster, but not IEEE


And try with something a little more demanding than "Hello, world!",
even if it is little more than a big summation loop, to isolate the
actual code performance.
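
For example, something like this quick sketch of a big summation loop,
compiled once with -O0 and once with the options above, is enough to see
the difference (note that -ffast-math may reorder the additions, so the
printed sum can differ slightly between builds):

#include <stdio.h>
#include <time.h>

int main(void)
{
    volatile double sink;      /* keeps the optimiser from deleting the loop */
    double sum = 0.0;
    long i;
    clock_t t = clock();

    for (i = 1; i <= 400000000L; i++)
        sum += 1.0 / (double)i;

    sink = sum;
    printf("sum=%f, %.2f seconds\n", sink,
           (double)(clock() - t) / CLOCKS_PER_SEC);
    return 0;
}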


Mr. Man-wai Chang

Nov 15, 2016, 5:07:24 AM
On 15/11/2016 2:37 AM, Anton Shepelev wrote:
> Mr. Man-wai Chang:
>
>> Is it called Tiny C Compiler? is it still alive, and well? :)
>
> There is also neatcc by a Muslim programmer:
> http://litcave.rudi.ir/

I read that page. Too many extra functions to be considered "simplest".
It's trying to do a lot of things.


Mr. Man-wai Chang

Nov 15, 2016, 6:18:29 AM
On 15/11/2016 1:43 AM, Dutch Ingraham wrote:
>
> Definitely a bot...

Describe this "bot" in C. You have unlimited time to do it. :)


Mr. Man-wai Chang

Nov 15, 2016, 6:20:49 AM
On 15/11/2016 2:31 AM, Rick C. Hodgin wrote:
> I have never tried to optimize anything for it. In a benchmark a fellow
> C developer sent me in 2014, its timing compared to other compilers on a
> simple Mandlebrot calculation (timer resolution only to 0.01 seconds):
>
> GCC -O3 or -O2 1.00 seconds

Should we report() and printf() it to 'Guinness World Records', Masters? :)

Mr. Man-wai Chang

Nov 15, 2016, 7:16:03 AM
On 15/11/2016 2:37 AM, Anton Shepelev wrote:
>
> Its full source is about 178 Kb. This compiler implements
> "a large subset of ANSI C" and has optimisations.

Can a compiler rely on another compiler, and then on even more compilers,
like a stack of cards if not dead bodies, and still claim to be the simplest
and smallest? :)

BartC

Nov 15, 2016, 7:56:57 AM
On 14/11/2016 18:31, Rick C. Hodgin wrote:

> I have never tried to optimize anything for it. In a benchmark a fellow
> C developer sent me in 2014, its timing compared to other compilers on a
> simple Mandlebrot calculation (timer resolution only to 0.01 seconds):
>
> GCC -O3 or -O2 1.00 seconds
> GCC -O or -O1 1.13 seconds
> Java 1.23 seconds
> C# 1.34 seconds
> VisualBasic.Net 1.52 seconds
> FreePascal -O3 1.92 seconds
> GCC, no options 1.96 seconds
> GCC -Os 2.38 seconds
> Borland C 2.80 seconds
> TinyC 3.23 seconds
> FreeBasic 3.36 seconds
> euc -gcc -con 3.58 seconds
> Oxygen basic 3.77 seconds
> PowerBasic 4.03 seconds

I knocked up some Mandelbrot benchmark too (the sort of code gcc excels
in because it depends almost exclusively on one or two tight, nested
loops). Results were:

gcc -O3 400 msec either 32 or 64 bits
gcc -O0 980 msec 32 bits (870 msec 64 bits)
TCC 2700 msec 32 bits
TCC 1100 msec 64 bits
lccwin opt 1280 msec 32 bits (760 msec 64 bits)
lccwin unopt 870 msec 32 bits (840 msec 64 bits)
DMC opt 810 msec 32 bits
DMC unopt 1240 msec 32 bits

(My compiler 650 msec 64 bits)
(Interpreted 7300 msec 64-bit interpreter)

Had some trouble with gcc -O3 as it was giving timings of 0 msec unless
I did something with the result (a 6Mbyte image), so all timings include
summing the pixels (/and then printing the answer/).

I don't quite get why TCC64 is that much faster than TCC32.

(And yes, lccwin32 unoptimised was faster than optimised!)

> As I recall, it compiles on the fly, always requiring source code. It
> also doesn't include any optimizations. It has poor register use, and
> could be greatly improved with even a minimal optimizer.

With my compiler, while there is a weak optimiser, it doesn't work for
floating point ops which the inner loops use here. So the code generated
is very simple, but not that terrible either. That's why I think TCC
could do better even without a proper optimiser, just a bit more care.

--
Bartc

Rick C. Hodgin

Nov 15, 2016, 8:34:12 AM
Interesting results. Here are some other results from the 2014 test
run by my colleague for other systems. He did note that he did not
go out of his way to optimize any of this code in any of these
languages, but only kept the basic algorithm as it was:

-- Byte Code --
tinyc5 12.37 seconds
Euphoria v3.1 12.45 seconds
my-tc5 13.43 seconds
pe 64-bit 14.26 seconds
my-tc5 16.55 seconds
pe 32-bit 17.96 seconds
micro-e @ -o -g 23.05 seconds
Euphoria v4.04 24.36 seconds
my-tc5 27.62 seconds
micro-e @ -o- -g 34.17 seconds
Pike 44.39 seconds
micro-e @ -o -s 47.41 seconds
onlymandel-real-m32 57.76 seconds
toy2.c 58.39 seconds
Tinypas.c 58.97 seconds
Ruby 60.25 seconds
onlymandel-real-m64 60.42 seconds
Euphoria v4.1 beta 61.40 seconds
onlymandelbrot-m32 65.96 seconds
micro-e @ -o- -s 67.72 seconds
vspl 68.76 seconds
onlymandel-64-m32 71.14 seconds
sal2\try3\sal 71.36 seconds
onlymandel-realfn-m32 72.37 seconds
onlymandel-m64 76.30 seconds
Lua 80.14 seconds
onlymandel-64-m64 83.69 seconds
jwillia basic 109 seconds
hoc 140 seconds
NaaLaa 168 seconds
CInt 201 seconds
Python 274 seconds
Yabasic 278 seconds
SpecBAS 358 seconds

-- AST --
lang\epp3\calc3a 54.87 seconds
tinyp12 55.43 seconds
my-tc5 AST 98.87 seconds

-- Interpreter ---
ChipMunkBasic 442 seconds
BBCBasic 531 seconds
ThinBasic 543 seconds
FBSL 551 seconds
CH 582 seconds
Scriba 618 seconds
SI 1020 seconds
LittleC 2428 seconds
PicoC 2478 seconds
DDS5 3033 seconds

Here's the code he used:

proc mandel() {
    accum = 0
    count = 0
    while (count < 1545) {
        left_edge = -420
        right_edge = 300
        top_edge = 300
        bottom_edge = -300
        x_step = 7
        y_step = 15

        max_iter = 200

        y0 = top_edge
        while (y0 > bottom_edge) {
            x0 = left_edge
            while (x0 < right_edge) {
                y = 0
                x = 0
                the_char = 32
                x_x = 0
                y_y = 0
                i = 0
                while (i < max_iter && x_x + y_y <= 800) {
                    x_x = int((x * x) / 200)
                    y_y = int((y * y) / 200)
                    if (x_x + y_y > 800 ) {
                        the_char = 48 + i
                        if (i > 9) {
                            the_char = 64
                        }
                    } else {
                        temp = x_x - y_y + x0
                        if ((x < 0 && y > 0) || (x > 0 && y < 0)) {
                            y = int(-1 * ((-1 * (x * y)) / 100)) + y0
                        } else {
                            y = int(x * y / 100) + y0
                        }
                        x = temp
                    }

                    i = i + 1
                }
                accum = accum + the_char

                x0 = x0 + x_step
            }
            y0 = y0 - y_step
        }
        if (count % 300 == 0) {
            print accum, " "
        }

        count = count + 1
    }
    print accum, "\n"
}

He called it simply with:

mandel()

Here's the C version:

#include <stdio.h>
int main()
{
    int left_edge, right_edge, top_edge, bottom_edge, max_iter,
        x_step, y_step, y0, x0, x, y, i, x_x, y_y, temp, the_char,
        accum, count;

    accum = 0;
    count = 0;
    while (count < 1545) {
        left_edge = -420;
        right_edge = 300;
        top_edge = 300;
        bottom_edge = -300;
        x_step = 7;
        y_step = 15;

        max_iter = 200;

        y0 = top_edge;
        while (y0 > bottom_edge) {
            x0 = left_edge;
            while (x0 < right_edge) {
                y = 0;
                x = 0;
                the_char = ' ';
                x_x = 0;
                y_y = 0;
                i = 0;
                while (i < max_iter && x_x + y_y <= 800) {
                    x_x = (x * x) / 200;
                    y_y = (y * y) / 200;
                    if (x_x + y_y > 800 ) {
                        the_char = '0' + i;
                        if (i > 9) {
                            the_char = '@';
                        }
                    } else {
                        temp = x_x - y_y + x0;
                        if ((x < 0 && y > 0) || (x > 0 && y < 0)) {
                            y = (-1 * ((-1 * (x * y)) / 100)) + y0;
                        } else {
                            y = x * y / 100 + y0;
                        }
                        x = temp;
                    }

                    i = i + 1;
                }
                accum = accum + the_char;

                x0 = x0 + x_step;
            }
            y0 = y0 - y_step;
        }
        if (count % 300 == 0) {
            printf("%d\n", accum);
        }

        count = count + 1;
    }
    printf("%d\n", accum);
    return 0;
}

> I don't quite get why TCC64 is that much faster than TCC32.
>
> (And yes, lccwin32 unoptimised was faster than optimised!)

That's unexpected.

My goals with CAlive is to produce functionality first, and basic
optimizations shortly after that functionality. I intend to write
a very comprehensive optimizer after everything is functional and
well debugged (so, I'm guessing 2021-2022 is when I will begin the
optimizer). I plan to make it world class, but we'll see. I defer
to the Lord's plans for my life in these matters. I press ahead
and it seems the harder I press forward the more things in life
press back against me. The desire is there ... I'll just have to
see if it's in His will for me to complete the project.

> > As I recall, it compiles on the fly, always requiring source code. It
> > also doesn't include any optimizations. It has poor register use, and
> > could be greatly improved with even a minimal optimizer.
>
> With my compiler, while there is a weak optimiser, it doesn't work for
> floating point ops which the inner loops use here. So the code generated
> is very simple, but not that terrible either. That's why I think TCC
> could do better even without a proper optimiser, just a bit more care.

I almost have everything in CAlive designed. I still have some work to
do on code generation. I plan to begin working on a lesser version of
it as soon as I get my assembler completed. Work will proceed in
parallel with the work on my OS as I am using that simple assembler
and simple C compiler as its core code base. CAlive will be more
comprehensive with LiveCode (edit-and-continue), but my first simple C
compiler will just be build and link.

We'll see though ... I defer all of my plans to a James 4:15
consideration ("If the Lord wills..."), and I would not presume to
impose my own schedule atop the Lord's plans. Only to convey my inner
drives/desires/wishes.

Rick C. Hodgin

Nov 15, 2016, 9:32:52 AM
Bartc wrote:
> I don't get why TCC64 is that much faster than TCC32.

I would guess two main reasons:

(1) Larger FPU/SIMD register set
(2) It receives new development focus.

They probably spend more time on optimizing 64-bit code as it's the
future of high-end machines. Just a guess though.

Mr. Man-wai Chang

Nov 15, 2016, 9:50:36 AM
On 15/11/2016 10:32 PM, Rick C. Hodgin wrote:
> I would guess two main reasons:
>
> (1) Larger FPU/SIMD register set
> (2) It receives new development focus.
>
> They probably spend more time on optimizing 64-bit code as it's the
> future of high-end machines. Just a guess though.

Masters, are we still talking about C the programming language?

BTW, anyone of you using AMD PC? :)

Rick C. Hodgin

Nov 15, 2016, 10:46:53 AM
What world do you want to live in, Mr. Man-wai Chang? The one where
everything is wholly divided and isolated into finite and explicitly
purposed forms? No structural building should ever be beautiful? No
library should also include artwork on the walls?

There are a plethora of people in this group who will speak to you
about any issue of C you'd like to address. The C Standard language,
quirks of a particular C compiler implementation, etc.

I don't know about you, but to me those subjects are interesting, but
they are also boring when that's all there is. We who are C developers
likely share some common interests in things related to C. Each of us
has a domain and range and our domains and ranges overlap in some
common areas. Should we devote a dedicated group to addressing each
and every one of those common overlaps? Or should we have the more
aesthetic world view that we can come together and share in these
related interests here in this group?

That's the group I prefer.

> BTW, anyone of you using AMD PC? :)

Mr. Man-wai Chang, are we still talking about C the programming language?

BartC

Nov 15, 2016, 11:07:20 AM
On 15/11/2016 13:34, Rick C. Hodgin wrote:
> On Tuesday, November 15, 2016 at 7:56:57 AM UTC-5, Bart wrote:

>> I knocked up some Mandelbrot benchmark too (the sort of code gcc excels
>> in because it depends almost exclusively on one or two tight, nested
>> loops). Results were:

> Here's the code he used:
>
> proc mandel() {
> accum = 0
> count = 0

(For someone translating the code: "/" means integer divide; "%" means
mod, and everything works with integers. It's not clear what the int()
casts are for unless it's converting to an int in languages where "/"
gives a floating point result.)

> He called it simply with:
>
> mandel()
....
>
> Here's the C version:
>
> #include <stdio.h>
> int main()
> {
....

Excellent benchmark. However my compiler has a problem with it! It's
very slow, over 10 seconds, about the same as TCC.

While even unoptimised gcc is 3 seconds. (Optimised, something under 2
seconds.) I suspect it's all those divides (gcc eliminates those even
with -O0). I'll have to do something about it.


(With interpreted versions, I get:

Python 2: 370 seconds
My interpreter: 35 seconds
pypy 24 seconds
LuaJIT: 6 seconds

However the last two are JIT tracing interpreters and probably end up
executing native code much of the time. Normal CPython, and my
interpreter (even with a bit of ASM acceleration) still execute a
byte-code at a time so are pure interpreters.)

> I almost have everything in CAlive designed. I still have some work to
> do on code generation. I plan to begin working on a lesser version of
> it as soon as I get my assembler completed. Work will proceed in
> parallel with the work on my OS as I am using that simple assembler
> and simple C compiler as its core code base. CAlive will be more
> comprehensive with LiveCode (edit-and-continue), but my first simple C
> compiler will just be build and link.

The next compiler of this kind I'm planning will eliminate building and
linking (compiling all the sources of a project into a ready-to-run
executable in memory).

But that model doesn't sit well with C applications as they expect to
statically link with third party libraries as precompiled object files.
I only deal with those as dynamic shared libraries.

Because everything is C, it is harder to see where to draw the line with
the C sources of your program, and the C of other add-on modules from
elsewhere. With my proposed system anything pre-existing as C source, or
as an object file, is an external, foreign module that must be made into
a separate shared library.

(After all you don't statically link all the code comprising an OS into
your program!)
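
(Incidentally, for the "ready-to-run executable in memory" idea: TCC itself
already exposes that through libtcc - compile a string of C straight into
memory and call the result. Something like the sketch below, if I remember
the 0.9.2x API correctly; the tcc_relocate() signature has varied between
releases, so treat the details as approximate. Build with something like
'gcc demo.c -ltcc -ldl'.)

#include <stdio.h>
#include <libtcc.h>

static const char *prog =
    "int add(int a, int b) { return a + b; }\n";

int main(void)
{
    TCCState *s = tcc_new();
    int (*add)(int, int);

    if (!s) return 1;
    tcc_set_output_type(s, TCC_OUTPUT_MEMORY);         /* compile straight to RAM */
    if (tcc_compile_string(s, prog) < 0) return 1;
    if (tcc_relocate(s, TCC_RELOCATE_AUTO) < 0) return 1;

    add = (int (*)(int, int))tcc_get_symbol(s, "add"); /* fetch the jitted function */
    if (!add) return 1;
    printf("2 + 3 = %d\n", add(2, 3));

    tcc_delete(s);
    return 0;
}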

--
Bartc

Mr. Man-wai Chang

Nov 15, 2016, 11:13:01 AM
On 15/11/2016 11:46 PM, Rick C. Hodgin wrote:
>
> What world do you want to live in, Mr. Man-wai Chang? The one where
> everything is wholly divided and isolated into finite and explicitly
> purposed forms? No structural building should ever be beautiful? No
> library should also include artwork on the walls?
>
> There are a plethora of people in this group who will speak to you
> about any issue of C you'd like to address. The C Standard language,
> quirks of a particular C compiler implementation, etc.
>
> I don't know about you, but to me those subjects are interesting, but
> they are also boring when that's all there is. We who are C developers
> likely share some common interests in things related to C. Each of us
> has a domain and range and our domains and ranges overlap in some
> common areas. Should we devote a dedicated group to addressing each
> and every one of those common overlaps? Or should be have the more
> aesthetic world view that we can come together and share in these
> related interests here in this group?
>
> That's the group I prefer.

Please don't get emotional, Master.

> > > (1) Larger FPU/SIMD register set
>>
>> BTW, anyone of you using AMD PC? :)
>
> Mr. Man-wai Chang, are we still taking about C the programming language?

I saw registers. Aren't they related to CPU architecture? Then could we
extend the chat to AMD vs Intel CPUs?

Rick C. Hodgin

Nov 15, 2016, 11:59:57 AM
Not a bad time though on your interpreter!

> > I almost have everything in CAlive designed. I still have some work to
> > do on code generation. I plan to begin working on a lesser version of
> > it as soon as I get my assembler completed. Work will proceed in
> > parallel with the work on my OS as I am using that simple assembler
> > and simple C compiler as its core code base. CAlive will be more
> > comprehensive with LiveCode (edit-and-continue), but my first simple C
> > compiler will just be build and link.
>
> The next compiler of this kind I'm planning will eliminate building and
> linking (compiling all the sources of a project into a ready-to-run
> executable in memory).
>
> But that model doesn't sit well with C applications as they expect to
> statically link with third party libraries as precompiled object files.
> I only deal with those as dynamic shared libraries.
>
> Because everything is C, it is harder to see where to draw the line with
> the C sources of your program, and the C of other add-on modules from
> elsewhere. With my proposed system anything pre-existing as C source, or
> as an object file, is an external, foreign module that must be made into
> a separate shared library.
>
> (After all you don't statically link all the code comprising an OS into
> your program!)

I think of code differently than most people do. I like code
that can be migrated from any location in memory to any other location,
and I see the input and output as a protocol, and their global variable
requirements as a linking solution requirement, and that's it. So long
as everything is hooked up and documented properly, then a system should
be able to literally dynamically reconfigure itself on-the-fly as needed,
allowing for much more dynamic systems than we see today.

I plan to target that very ability in force with CAlive and the kernel
design I use for my OS. From the ground up it will be part of what I
call a "LiveCode environment," which is a completely dynamically
configurable runtime environment, all governed by the basic set of
rules and protocols required to make that plug-in-module model work.

We'll see though. Right now it's all talk. I have yet to do it. I
could be wrong in my thinking, or in the estimation of my coding
abilities. I may fall down on my face and go boom. :-)

Rick C. Hodgin

Nov 15, 2016, 12:03:03 PM
On Tuesday, November 15, 2016 at 11:13:01 AM UTC-5, Mr. Man-wai Chang wrote:
> On 15/11/2016 11:46 PM, Rick C. Hodgin wrote:
> >
> > What world do you want to live in, Mr. Man-wai Chang? The one where
> > everything is wholly divided and isolated into finite and explicitly
> > purposed forms? No structural building should ever be beautiful? No
> > library should also include artwork on the walls?
> >
> > There are a plethora of people in this group who will speak to you
> > about any issue of C you'd like to address. The C Standard language,
> > quirks of a particular C compiler implementation, etc.
> >
> > I don't know about you, but to me those subjects are interesting, but
> > they are also boring when that's all there is. We who are C developers
> > likely share some common interests in things related to C. Each of us
> > has a domain and range and our domains and ranges overlap in some
> > common areas. Should we devote a dedicated group to addressing each
> > and every one of those common overlaps? Or should be have the more
> > aesthetic world view that we can come together and share in these
> > related interests here in this group?
> >
> > That's the group I prefer.
>
> Please don't get emotional, Master.

"And don't call me 'Master.' I'm not a Spanish person. I've never even
been to Spain."

Extra credit to anyone who can recognize the original quote I've adapted
here. I left in a few clues.

> > > > (1) Larger FPU/SIMD register set
> >>
> >> BTW, anyone of you using AMD PC? :)
> >
> > Mr. Man-wai Chang, are we still taking about C the programming language?
>
> I saw registers.

I see skies of blue. And clouds of white. The bright blessed day.
The dark sacred night. And I think to myself: What a wonderful
world.

> Aren't they related to CPU architecture?

Yes. And in the same way ham is related to a ham sandwich.

> Then could we extend the chat to AMD vs Intel CPUs?

Well of course we could, but were we to do so your hypocrisy would
start glowing.

Mr. Man-wai Chang

Nov 15, 2016, 12:14:33 PM
On 16/11/2016 1:02 AM, Rick C. Hodgin wrote:
>
> "And don't call me 'Master.' I'm not a Spanish person. I've never even
> been to Spain."

No, I was referring to Star Wars, Master. :)

> I see skies of blue. And clouds of white. The bright blessed day.
> The dark sacred night. And I think to myself: What a wonderful
> world.

Were you impressed and cursed by this song? :)

>> I saw registers.
>> Aren't they related to CPU architecture?
>
> Yes. And in the same way ham is related to a ham sandwich.
>
>> Then could we extend the chat to AMD vs Intel CPUs?
>
> Well of course we could, but were we to do so your hypocrisy would
> start glowing.

I will start a new topic about this, so as not to *pollute* this thread. :)

Rick C. Hodgin

Nov 15, 2016, 12:24:21 PM
On Tuesday, November 15, 2016 at 12:14:33 PM UTC-5, Mr. Man-wai Chang wrote:
> On 16/11/2016 1:02 AM, Rick C. Hodgin wrote:
> >
> > "And don't call me 'Master.' I'm not a Spanish person. I've never even
> > been to Spain."
>
> No, I was referring to Star Wars, Master. :)

Well ... you have no cause to call me "Master." For I am not a master,
but am a consistent student. I shall remain so, I presume, until the
day I depart this world.

> > I see skies of blue. And clouds of white. The bright blessed day.
> > The dark sacred night. And I think to myself: What a wonderful
> > world.
>
> Were you impressed and cursed by this song? :)

I laughed. I cried. It became a part of me.

Actually, I look out across the world and even here in the midst of
all the destruction from sin and death, the beauty of God's intended
creation remains. In butterflies, flowers, birds, aquatic life, and
even in each of us. Consider the design required to make a human's
arm work as flawlessly as a human's arm works. We don't have the
technology today to create prosthetic limbs that work anywhere near as
well, yet even the tiniest baby has this grand masterpiece of engineering.

So yes, such things affect me in a multitude of ways when I step back
and consider them richly.

> >> I saw registers.
> >> Aren't they related to CPU architecture?
> >
> > Yes. And in the same way ham is related to a ham sandwich.
> >
> >> Then could we extend the chat to AMD vs Intel CPUs?
> >
> > Well of course we could, but were we to do so your hypocrisy would
> > start glowing.
>
> I will start a new topic about this, so as not to *pollute* this thread. :)

You seem to do that a lot. I am reminded of Ramine.

Mr. Man-wai Chang

Nov 15, 2016, 12:27:22 PM
On 16/11/2016 1:24 AM, Rick C. Hodgin wrote:
> Well ... you have no cause to call me "Master." For I am not a master,
> but am a consistent student. I shall remain so, I presume, until the
> day I depart this world.

Everyone can be an eternal student, Master. There is no end to learning,
as time and space changes. :)

>>
>> I will start a new topic about this, so as not to *pollute* this thread. :)
>
> You seem to do that a lot. I am reminded of Ramine.

That's why I called you "Master". I just don't understand the extended
chat here, as it goes *PHILOSOPHICAL*! :)

BartC

Nov 15, 2016, 4:35:30 PM
On 15/11/2016 16:59, Rick C. Hodgin wrote:
> On Tuesday, November 15, 2016 at 11:07:20 AM UTC-5, Bart wrote:

>> (With interpreted versions, I get:
>>
>> Python 2: 370 seconds
>> My interpreter: 35 seconds
>> pypy 24 seconds
>> LuaJIT: 6 seconds

> Not a bad time though on your interpreter!

With that program there's an interesting comparison between pure C, and
using some assembly (as apparently modern C compilers are so good you
don't need assembly any more!).

First the timing was reduced to 26 seconds by using better idioms more
appropriate to the language (So 'sqr(x)' instead of 'x*x'; '++i' instead
of 'i=i+1' etc; no point in detecting such patterns if the language
provides a way to express them directly. And I bet the C compiler does
detect them and takes advantage.)

Timings were:
                 32-bit   64-bit
M/ASM 1:           --       26 seconds   ASM layer + tweaks
M/ASM 2:           --       49 seconds   ASM layer only
gcc C, gcc-O3:     86       92 seconds   Uses label pointers
Std C, gcc-O3:    109      115 seconds

M is my language and my weak compiler, with one ASM module added. That
uses 64-bits.

So, on the same 64-bit platform, adding an ASM layer (which is /extra/
overhead in many cases), and using a rubbish compiler, is still more
than twice as fast as standard C using gcc-O3 (49:115).

(The 26-second figure is when some extra tweaks are applied such as
combining some byte-codes, but that would be an unfair comparison of ASM
vs. C as some of that could be done with C too, but would not be
worthwhile.)

(As an extra note, I believe that CPython for Linux also makes use of
gcc's label pointers, making it a bit faster than on Windows. But that
test still took five minutes...)

--
Bartc


Rick C. Hodgin

Nov 15, 2016, 6:33:38 PM
On Tuesday, November 15, 2016 at 4:35:30 PM UTC-5, Bart wrote:
> On 15/11/2016 16:59, Rick C. Hodgin wrote:
> > On Tuesday, November 15, 2016 at 11:07:20 AM UTC-5, Bart wrote:
>
> >> (With interpreted versions, I get:
> >>
> >> Python 2: 370 seconds
> >> My interpreter: 35 seconds
> >> pypy 24 seconds
> >> LuaJIT: 6 seconds
>
> > Not a bad time though on your interpreter!
>
> With that program there's an interesting comparison between pure C, and
> using some assembly (as apparently modern C compilers are so good you
> don't need assembly any more!).
>
> First the timing was reduced to 26 seconds by using better idioms more
> appropriate to the language (So 'sqr(x)' instead of 'x*x'; '++i' instead
> of 'i=i+1' etc; no point in detecting such patterns if the language
> provides a way to express them directly. And I bet the C compiler does
> detect them and takes advantage.)
>
> Timings were:
> 32-bit 64-bit
> M/ASM 1: -- 26 seconds ASM layer + tweaks
> M/ASM 2: -- 49 seconds ASM layer only
> gcc C, gcc-O3: 86 92 seconds Uses label pointers
> Std C, gcc-O3: 109 115 seconds

I'm impressed with your offerings, Bart.

> M is my language and my weak compiler, with one ASM module added. That
> uses 64-bits.
>
> So, on the same 64-bit platform, adding an ASM layer (which is /extra/
> overhead in many cases), and using a rubbish compiler, is still more
> than twice as fast as standard C using gcc-O3 (49:115).
>
> (The 26-second figure is when some extra tweaks are applied such as
> combining some byte-codes, but that would be an unfair comparison of ASM
> vs. C as some of that could be done with C too, but would not be
> worthwhile.)
>
> (As an extra note, I believe that CPython for Linux also makes use of
> gcc's label pointers, making it a bit faster than on Windows. But that
> test still took five minutes...)

If you get a chance to examine the assembly generated by TCC64 and TCC32,
I'd be interested in what you find out (if it's primarily register usage,
or if it's more toward extra optimizations applied in 64-bit code, or if
it's something else).

BartC

Nov 16, 2016, 9:58:17 AM
On 15/11/2016 23:33, Rick C. Hodgin wrote:
> On Tuesday, November 15, 2016 at 4:35:30 PM UTC-5, Bart wrote:

> I knocked up some Mandelbrot benchmark too (the sort of code gcc excels
> in because it depends almost exclusively on one or two tight, nested
> loops). Results were:
>
> gcc -O3 400 msec either 32 or 64 bits
> gcc -O0 980 msec 32 bits (870 msec 64 bits)
> TCC 2700 msec 32 bits
> TCC 1100 msec 64 bits [Now getting <900msec]

> If you get a chance to examine the assembly generated by TCC64 and TCC32,
> I'd be interested in what you find out (if it's primarily register usage,
> or if it's more toward extra optimizations applied in 64-bit code, or if
> it's something else).

Here's the bit of C code forming the inner loop:

static int mandel (double x, double y, int max_iters) {
    double a, b, a2;
    int i;
    a = b = 0.0;
    for (i=1; i<=max_iters; ++i) {
        a2 = (a*a-b*b)+x;
        b = a*b*2.0+y;
        a = a2;
        if (a*a+b*b >= 4.0) {
            return i-1;
        }
    }
    return max_iters;
}

The TCC32 code for part of the body of the loop is:

0040104C DD45F8 fld qword ptr [ebp-8]
0040104F DC4DF8 fmul qword ptr [ebp-8]
00401052 DD55D8 fst qword ptr [ebp-28h]
00401055 DDD8 fstp st(0)
00401057 DD45F0 fld qword ptr [ebp-10h]
0040105A DC4DF0 fmul qword ptr [ebp-10h]
0040105D DC6DD8 fsubr qword ptr [ebp-28h]
00401060 DC4508 fadd qword ptr [ebp+8]
00401063 DD55E8 fst qword ptr [ebp-18h]
00401066 DDD8 fstp st(0)
00401068 DD45F8 fld qword ptr [ebp-8]
0040106B DC4DF0 fmul qword ptr [ebp-10h]
0040106E DD55D0 fst qword ptr [ebp-30h]
00401071 DDD8 fstp st(0)
00401073 DD0508504000 fld qword ptr [405008h]
00401079 DC4DD0 fmul qword ptr [ebp-30h]
0040107C DC4510 fadd qword ptr [ebp+10h]
0040107F DD55F0 fst qword ptr [ebp-10h]
00401082 DDD8 fstp st(0)
00401084 DD45E8 fld qword ptr [ebp-18h]
00401087 DD55F8 fst qword ptr [ebp-8]
0040108A DDD8 fstp st(0)
0040108C DD45F8 fld qword ptr [ebp-8]
0040108F DC4DF8 fmul qword ptr [ebp-8]
00401092 DD55C8 fst qword ptr [ebp-38h]
00401095 DDD8 fstp st(0)
00401097 DD45F0 fld qword ptr [ebp-10h]
0040109A DC4DF0 fmul qword ptr [ebp-10h]
0040109D DC45C8 fadd qword ptr [ebp-38h]
004010A0 DD55C0 fst qword ptr [ebp-40h]
004010A3 DDD8 fstp st(0)
004010A5 DD0510504000 fld qword ptr [405010h]
004010AB DD45C0 fld qword ptr [ebp-40h]
004010AE DAE9 fucompp

And the TCC64 code:

[0001511] c745ec01000000 movl $0x1,0xffffffec(%rbp) modrm:
[0001518] eb76 jmp 1638
[0001520] f20f1045f8 movsd 0xfffffff8(%rbp),%xmm0 modrm:
[0001525] f20f5945f8 mulsd 0xfffffff8(%rbp),%xmm0 modrm:
[0001530] f20f104df0 movsd 0xfffffff0(%rbp),%xmm1 modrm:
[0001535] f20f594df0 mulsd 0xfffffff0(%rbp),%xmm1 modrm:
[0001540] f20f5cc1 subsd %xmm1,%xmm0 modrm: 3/1/0
[0001544] f20f584510 addsd 0x10(%rbp),%xmm0 modrm: 1/5/0
[0001549] f20f1145e0 movsd %xmm0,0xffffffe0(%rbp) modrm:
[0001554] f20f1045f8 movsd 0xfffffff8(%rbp),%xmm0 modrm:
[0001559] f20f5945f0 mulsd 0xfffffff0(%rbp),%xmm0 modrm:
[0001564] f20f58c0 addsd %xmm0,%xmm0 modrm: 3/0/0
[0001568] f20f584518 addsd 0x18(%rbp),%xmm0 modrm: 1/5/0
[0001573] f20f1145f0 movsd %xmm0,0xfffffff0(%rbp) modrm:
[0001578] f20f1045e0 movsd 0xffffffe0(%rbp),%xmm0 modrm:
[0001583] f20f1145f8 movsd %xmm0,0xfffffff8(%rbp) modrm:
[0001588] f20f1045f8 movsd 0xfffffff8(%rbp),%xmm0 modrm:
[0001593] 660f28c8 movapd %xmm0,%xmm1 modrm: 3/0/1
[0001597] f20f594df8 mulsd 0xfffffff8(%rbp),%xmm1 modrm:
[0001602] f20f1045f0 movsd 0xfffffff0(%rbp),%xmm0 modrm:
[0001607] f20f5945f0 mulsd 0xfffffff0(%rbp),%xmm0 modrm:
[0001612] f20f58c1 addsd %xmm1,%xmm0 modrm: 3/1/0
[0001616] 660f2e05f8590000 ucomisd 23032,%xmm0 modrm: 0/5/0


So XMM versus x87 (still surprising as my own tests didn't show that
much difference).

--
Bartc

supe...@casperkitty.com

Nov 16, 2016, 11:44:49 AM
On Wednesday, November 16, 2016 at 8:58:17 AM UTC-6, Bart wrote:
> Here's the bit of C code forming the inner loop:
>
> static int mandel (double x,double y,int max_iters) {
> double a,b,a2;
> int i;
> a = b = 0.0;
> for (i=1; i<=max_iters; ++i) {
> a2 = (a*a-b*b)+x;
> b = a*b*2.0+y;
> a = a2;
> if (a*a+b*b >= 4.0) {
> return i-1;
> }
> }
> return max_iters;
> }

The loop computes a*a and b*b twice. How would performance be affected
if each loop computed a*a, b*b, and a*b, checked if a*a+b*b was out of
range, and if not computed the new values of a and b from those? I would
think that placing the multiply a*b between the addition b*b and the use
of that value in the comparison would ease pipeline contention.
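
A rough, untested sketch of that restructuring (the names aa, bb, ab are
just for illustration; note the escape test now happens at the top of the
pass, so the returned iteration count can differ by one from the original):

static int mandel2 (double x, double y, int max_iters) {
    double a = 0.0, b = 0.0;
    int i;
    for (i = 1; i <= max_iters; ++i) {
        double aa = a*a;        /* each product computed once per pass */
        double bb = b*b;
        double ab = a*b;        /* scheduled between b*b and the compare */
        if (aa + bb >= 4.0) {
            return i-1;
        }
        a = (aa - bb) + x;
        b = ab*2.0 + y;
    }
    return max_iters;
}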

Rick C. Hodgin

Nov 16, 2016, 11:54:27 AM
Nice catch! The double computation occurs after the first iteration.
I would not have caught that manually with my eye.

Rick C. Hodgin

Nov 16, 2016, 11:56:59 AM
Is there a TCC64 switch that forces it to use the FPU? Or a TCC32
switch which forces it to use XMM? Also, has anyone ever introduced
you to your new best friend in GDB:

set disassembly-flavor intel

? :-)

fir

Nov 16, 2016, 12:10:56 PM
The FPU is said to be obsolete (and I rather believe it).
1) This FPU code seems terrible - each line accessing RAM, no registers involved - a disaster.
2) This XMM code seems to be the same in that respect.

Extremely bad.

supe...@casperkitty.com

Nov 16, 2016, 12:12:01 PM
On Wednesday, November 16, 2016 at 10:54:27 AM UTC-6, Rick C. Hodgin wrote:
> On Wednesday, November 16, 2016 at 11:44:49 AM UTC-5, supercat wrote:
> > The loop computes a*a and b*b twice. How would performance be affected
> > if each loop computed a*a, b*b, and a*b, checked if a*a+b*b was out of
> > range, and if not computed the new values of a and b from those? I would
> > think that placing the multiply a*b between the addition b*b and the use
> > of that value in the comparison would ease pipeline contention.
>
> Nice catch! The double computation occurs after the first iteration.
> I would not have caught that manually with my eye.

Looking at tcc's output, it also looks like the biggest problem is that it
makes no effort to cut down on loads and stores. A single-pass compiler
may have limited opportunities to cut down on such things if it has to
allow for the possibility of code taking the address of a variable for
which code has already been generated, but perhaps using the "register"
keyword might help? If the compiler reserves some registers for such
variables and maps as many register-qualified variables as will fit to
those registers, it could gain the ability to generate much more efficient
code in a single pass. Back in the 1980s, Turbo C could accommodate two
register-qualified variables of 16-bit or "near"-pointer type, using exactly
that approach.
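
For instance, the hint might look like this in the inner loop posted above
(purely illustrative - I haven't checked whether current tcc actually maps
"register" variables to machine registers):

static int mandel (double x, double y, int max_iters) {
    register double a = 0.0, b = 0.0;   /* candidates for real registers */
    register int i;
    double a2;
    for (i = 1; i <= max_iters; ++i) {
        a2 = (a*a - b*b) + x;
        b  = a*b*2.0 + y;
        a  = a2;
        if (a*a + b*b >= 4.0)
            return i-1;
    }
    return max_iters;
}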

BartC

Nov 16, 2016, 12:19:44 PM
On 16/11/2016 16:44, supe...@casperkitty.com wrote:
> On Wednesday, November 16, 2016 at 8:58:17 AM UTC-6, Bart wrote:
>> Here's the bit of C code forming the inner loop:
>>
>> static int mandel (double x,double y,int max_iters) {
>> double a,b,a2;
>> int i;
>> a = b = 0.0;
>> for (i=1; i<=max_iters; ++i) {
>> a2 = (a*a-b*b)+x;
>> b = a*b*2.0+y;
>> a = a2;
>> if (a*a+b*b >= 4.0) {
>> return i-1;
>> }
>> }
>> return max_iters;
>> }
>
> The loop computes a*a and b*b twice.

I did play around at one point with remembering the a*a and b*b for the
next iteration. I think it slowed it down! But that might have been
before I ported it to a compiled language.

Trying that tweak now, gcc is 10% faster, but Pelles C is 25% slower and
lccwin32 is 40% slower! But I'd need to do more careful checks (and
write the result to an image file to see if it still produces something
that looks like a Mandelbrot set; that was excluded from the test as
writing out the ASCII ppm file dominated the timings).

> How would performance be affected
> if each loop computed a*a, b*b, and a*b, checked if a*a+b*b was out of
> range, and if not computed the new values of a and b from those? I would
> think that placing the multiply a*b between the addition b*b and the use
> of that value in the comparison would ease pipeline contention.
>

(I was going to post the code the other day but noticed the result -
summing all elements of the resulting image - was slightly different
each time! So I'd have to check it out first. Then process into
self-contained C.)

--

Rick C. Hodgin

Nov 16, 2016, 12:34:23 PM
It is obsolete, but it's still heavily used by old and new software.

> 1) this fpu code seem to be terrible - each line accessing to ram, no registers involved - disaster

Look at the FPU instruction set and you'll see why.

> 2) this xmm code seems to be the same in that respect
>
> extremally bad

Look at the SIMD instruction set and you'll see why.

Rick C. Hodgin

Nov 16, 2016, 12:36:44 PM
It might be hard for 32-bit code generation because there are already
a limited number of registers available, and some 32-bit ops require
their fixed register use and cannot be altered via encoding options.

For 64-bit it should be easily handled.

BartC

Nov 16, 2016, 2:09:46 PM
It was a mix-up of 0-based and 1-based indexing. The code is below, set
up to generate a 3K x 2K pixel greyscale image. Uncomment lines at end
to write a .ppm file into test.ppm:

------------------------------
// (Parts auto-generated)
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef int32_t int32;
typedef unsigned char byte;

enum {imwidth = 3072};
enum {imheight = 2048};

static byte image[imheight][imwidth];

static int32 mandel (double,double,int32);
static void create_fractal (double,double,double,double,int32);
static void writepgm (char *,int32);

static int32 mandel (double x,double y,int32 max_iters) {
double a,b,a2;
int32 i;
a = b = 0.0;

for (i=0; i<max_iters; ++i) {
a2 = (a*a-b*b)+x;
b = a*b*2.0+y;
a = a2;
if (a*a+b*b >= 4.0) {
return i;
}
}
return max_iters;
}

static void create_fractal (double min_x,double max_x,
double min_y,double max_y,int32 iters) {
double pixel_size_x, pixel_size_y;
int32 x, y;
double i, r;

pixel_size_x = (max_x-min_x)/imwidth;
pixel_size_y = (max_y-min_y)/imheight;
i = min_y;
for (y=0; y<imheight; ++y) {
i += pixel_size_y;
r = min_x;
for (x=0; x<imwidth; ++x) {
r += pixel_size_x;
image[y][x] = mandel(r,i,iters);
}
}
}

static void writepgm (char * file,int32 maxpixel) {
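// Write the image as a plain-text ("P2") greyscale PGM, with maxpixel as the maximum grey value.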
int32 x;
int32 y;
FILE * f;
f = fopen(file,"w");
fprintf(f,"%s\n","P2");
fprintf(f,"%d %d\n",imwidth,imheight);
fprintf(f,"%d\n",maxpixel);
for (y=0; y<imheight; ++y) {
for (x=0; x<imwidth; ++x) {
fprintf(f,"%u%s",image[y][x]," ");
}
fprintf(f,"\n");
}
fclose(f);
}

int main (void) {
static int32 maxpixel;
int32 x, y;
int32 sum=0;
maxpixel = 20;

create_fractal(-2.0,1.0,-1.0,1.0,maxpixel);

for (y=0; y<imheight; ++y) {
for (x=0; x<imwidth; ++x) {
sum += image[y][x];
}
}

printf("%d\n",sum);

// Uncomment next two lines to output as PPM image:
// puts("Writing test.ppm:");
// writepgm("test.ppm",maxpixel);

}
------------------------------

Algorithm derived from here (not in C):
http://nbviewer.jupyter.org/gist/harrism/f5707335f40af9463c43

--
Bartc


supe...@casperkitty.com

Nov 16, 2016, 2:41:29 PM
On Wednesday, November 16, 2016 at 11:36:44 AM UTC-6, Rick C. Hodgin wrote:
> On Wednesday, November 16, 2016 at 12:12:01 PM UTC-5, supercat wrote:
> > Looking at tcc's output, it also looks like the biggest problem is that it
> > makes no effort to cut down on loads and stores. A single-pass compiler
> > may have limited opportunities to cut down on such things if it has to
> > allow for the possibility of code taking the address of a variable for
> > which code has already been generated, but perhaps using the "register"
> > keyword might help? If the compiler reserves some registers for such
> > variables and maps as many register-qualified variables as will fit to
> > those registers, it could gain the ability to generate much more efficient
> > code in a single pass. Back in the 1980s, Turbo C could accommodate two
> > register-qualified variables of 16-bit or "near"-pointer type, using exactly
> > that approach.
>
> It might be hard for 32-bit code generation because there are already
> a limited number of registers available, and some 32-bit ops require
> their fixed register use and cannot be altered via encoding options.

Turbo C used SI and DI for register variables, and the calling convention
specified that any function using those registers must save them on entry
and restore them on exit. AX and DX would conflict with multiply, BX with
addressing [at least one of BX, SI, and DI needed to be left free], and CX
with shifting, but the only operations that would need SI and DI are the
string-move instructions which aren't used for general-purpose code
generation. Given that the 80386 allows EDX to be used for addressing, it
would not be necessary to leave EBX free for that purpose, so it could
be added as another register variable. The 8087's floating-point registers
could also be used to hold floating-point values.

While eliminating all of the loads and stores within the loop would likely
be better than merely eliminating most of them, eliminating even some of the
loads and stores could offer a significant performance win relatively
cheaply.

Rick C. Hodgin

Nov 16, 2016, 2:45:34 PM
It would be fairly easy to do manually. Give it a try. The FPU regs
are constantly shifting as things are processed through the stack, so
it will take some care to keep everything persistent, but it's
possible.

David Brown

Nov 16, 2016, 5:12:51 PM
When working with floating point, a gcc option that can sometimes make a
very significant difference is "-ffast-math". This enables a range of
floating point optimizations, but breaks strict IEEE compliance (which
is why it is not enabled automatically with any -O flags). IEEE
floating point puts a lot of restrictions on things like re-ordering of
floating point operations. It would be very interesting to see how that
affects the code and its speed. (Of course, if any of the other
compilers have equivalent flags, try them too.)
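
As a tiny illustration of the kind of re-ordering at stake (an illustrative
snippet, not taken from the benchmark): under strict IEEE rules the two
expressions below must be evaluated as written, and with standard doubles
they give different results, which is exactly why the compiler is not
normally allowed to treat them as interchangeable.

#include <stdio.h>

int main(void)
{
    double x = 0.1, y = 0.2, z = 0.3;
    printf("%.17g\n", (x + y) + z);   /* 0.60000000000000009 with IEEE doubles */
    printf("%.17g\n", x + (y + z));   /* 0.59999999999999998 */
    return 0;
}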


Chris M. Thomasson

Nov 16, 2016, 5:22:03 PM
Damn. I see this flag as having negative effects on my cipher based on
floating point. I basically need a standard representation of floating
point in order to get it to decrypt correctly, without loss across
different compilers. So, IEEE needs to be strictly honored in order for
things to work. This is why I named it Funny Fractal Encryption. The
Funny part is the nasty nest of floating point issues!

;^o

James R. Kuyper

Nov 16, 2016, 6:03:08 PM
On 11/16/2016 05:21 PM, Chris M. Thomasson wrote:
...
> Damn. I see this flag as having negative effects on my cipher based on
> floating point. I basically need a standard representation of floating
> point in order to get it to decrypt correctly, without loss across
> different compilers. So, IEEE needs to be strictly honored in order for
> things to work. This is why I named it Funny Fractal Encryption. The
> Funny part is the nasty nest of floating point issues!

Is strictly honoring IEEE requirements sufficient? IEEE says, in many
contexts, for an expression that has a mathematical value which cannot
be represented exactly, that the result is either the nearest higher or
nearest lower representable value, chosen in an implementation-defined
manner. In other contexts, the result is either the nearest
representable value, or the larger or smaller representable value
immediately adjacent to the nearest representable value, chosen in an
implementation-defined manner. If encrypting on a platform which makes
one choice in each of these cases, and decrypting on a platform which
makes a different choice, would you be able to recover the original
data? My experience with fractals is that the calculations typically
cause exponential growth of small differences, so I would expect that
such differences would be catastrophic for your algorithm.

This is why some people recommended against using floating point
calculations as the basis for your encryption algorithm.
(6.4.4.2p3)

Rick C. Hodgin

Nov 16, 2016, 9:54:28 PM
I was going to sit down tonight and manually go through this, but I
went into Visual Studio 2010 and wanted to see what type of 32-bit
code it would produce when optimized. It did the same basic thing
you suggest, Supercat:

--- c:\temp\test.cpp begin ---------------------------------
1: // test.cpp
2: #include <stdio.h>
3: #include <time.h>
4:
5: int mandel(double x, double y, int max_iters)
6: {
7: double a, b, a2;
8: int i;
9:
10: a = 0.0;
11: b = 0.0;
12: for (i = 1; i <= max_iters; i++)
13: {
14: a2 = (a*a - b*b) + x;
15: b = 2.0*a*b + y;
16: a = a2;
17: if (a*a+b*b >= 4.0)
18: return i - 1;
19: }
20:
21: return max_iters;
22: }
23:
24: int main(int argc, char* argv[])
25: {
00D51000 push ebp
00D51001 mov ebp,esp
00D51003 and esp,0FFFFFFC0h
00D51006 sub esp,34h
00D51009 push ebx
00D5100A push esi
00D5100B push edi
26: int max;
27: double x, y;
28: clock_t start;
29: clock_t end;
30:
31:
32: // Iterate for a while
33: start = clock();
00D5100C mov edi,dword ptr [__imp__clock (0D5209Ch)]
00D51012 call edi
34: for (x = 2.0f, max = 0; x > -3.0; x -= 0.0005)
00D51014 fld qword ptr [__real@4000000000000000 (0D52110h)]
00D5101A fld st(0)
00D5101C xor esi,esi
00D5101E fldz
00D51020 mov ebx,eax
00D51022 fld st(0)
37: max += mandel(x, y, 40);
00D51024 lea ecx,[esi+1]
00D51027 fmul st,st(1)
00D51029 fstp qword ptr [esp+38h]
00D5102D fld qword ptr [__real@c000000000000000 (0D52130h)]
00D51033 fld st(1)
00D51035 fld qword ptr [esp+38h]
00D51039 jmp main+67h (0D51067h)
34: for (x = 2.0f, max = 0; x > -3.0; x -= 0.0005)
00D5103B fldz
37: max += mandel(x, y, 40);
00D5103D mov ecx,1
00D51042 fld qword ptr [__real@c000000000000000 (0D52130h)]
00D51048 fld st(1)
00D5104A fld qword ptr [esp+38h]
00D5104E jmp main+67h (0D51067h)
35: {
36: for (y = -2.0; y < 2.0; y += 0.0005)
00D51050 fstp st(1)
37: max += mandel(x, y, 40);
00D51052 mov ecx,1
00D51057 fldz
00D51059 fxch st(1)
00D5105B fld st(1)
00D5105D fld qword ptr [esp+38h]
00D51061 jmp main+67h (0D51067h)
00D51063 fxch st(1)
00D51065 fxch st(3)
00D51067 fld st(1)
00D51069 fmul st,st(2)
00D5106B fsubrp st(1),st
00D5106D fadd st,st(4)
00D5106F fxch st(1)
00D51071 fmul st,st(5)
00D51073 fmulp st(3),st
00D51075 fxch st(2)
00D51077 fadd st,st(1)
00D51079 fld st(2)
00D5107B fld st(1)
00D5107D fmul st,st(2)
00D5107F fld st(4)
00D51081 fmulp st(5),st
00D51083 fadd st(4),st
00D51085 fxch st(4)
00D51087 fcomp qword ptr [__real@4010000000000000 (0D52108h)]
00D5108D fnstsw ax
00D5108F test ah,1
00D51092 je main+105h (0D51105h)
00D51094 inc ecx
00D51095 cmp ecx,28h
00D51098 jle main+63h (0D51063h)
00D5109A fstp st(1)
00D5109C mov eax,28h
00D510A1 fstp st(0)
00D510A3 fstp st(1)
00D510A5 fld qword ptr [__real@3f40624dd2f1a9fc (0D52128h)]
00D510AB add esi,eax
00D510AD fadd st(1),st
00D510AF fxch st(1)
00D510B1 fcom st(3)
00D510B3 fnstsw ax
00D510B5 test ah,5
00D510B8 jnp main+50h (0D51050h)
34: for (x = 2.0f, max = 0; x > -3.0; x -= 0.0005)
00D510BA fstp st(0)
00D510BC fsubp st(1),st
00D510BE fld qword ptr [__real@c008000000000000 (0D52120h)]
00D510C4 fcomp st(1)
00D510C6 fnstsw ax
00D510C8 test ah,5
00D510CB jnp main+3Bh (0D5103Bh)
00D510D1 fstp st(0)
00D510D3 fstp st(0)
38:
39: }
40: end = clock();
00D510D5 call edi
41:
42: printf("Max: %d, time: %lf\n", max, (double)(end - start) / CLOCKS_PER_SEC);
00D510D7 sub eax,ebx
00D510D9 mov dword ptr [esp+38h],eax
00D510DD fild dword ptr [esp+38h]
00D510E1 sub esp,8
00D510E4 fdiv qword ptr [__real@408f400000000000 (0D52118h)]
00D510EA fstp qword ptr [esp]
00D510ED push esi
00D510EE push offset string "Max: %d, time: %lf\n" (0D520F4h)
00D510F3 call dword ptr [__imp__printf (0D520A4h)]
00D510F9 add esp,10h
43: return 0;
44: }
00D510FC pop edi
00D510FD pop esi
00D510FE xor eax,eax
00D51100 pop ebx
00D51101 mov esp,ebp
00D51103 pop ebp
00D51104 ret
--- c:\temp\test.cpp end ----------------------------------

I'll write a manual one and post it. We could move to alt.lang.asm if
you'd prefer.

David Brown

Nov 17, 2016, 3:57:41 AM
You need repeatable results that are not easily affected by things like
code re-ordering. IEEE compliance (standard on gcc, unless you use
-ffast-math or -Ofast on later versions) is your best bet when you want
to use the compiler's native floating point support. However, as James
says it is not really sufficient. Ideally, you should probably write
your own floating point routines (using integer arithmetic) to be sure
of exactly what they do.

(Personally, I think floating point is completely unsuitable for
encryption systems or lossless compression. But as long as you are
enjoying yourself and learning something with your "Funny Fractal"
stuff, then keep at it!)
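
For instance, fixed point is one simple deterministic alternative. A minimal
sketch (the format, names and range handling here are assumptions, not a
worked-out design):

#include <stdint.h>

/* Q47.16 fixed point: real value = raw / 65536.  Only integer operations,
   so results are bit-identical wherever 64-bit integers behave the same.
   (No overflow handling here, and right-shifting negative values is
   implementation-defined, so a real version would need to pin that down.) */
typedef int64_t fix;

static fix fix_from_int(int v)   { return (fix)v * 65536; }
static fix fix_add(fix a, fix b) { return a + b; }
static fix fix_sub(fix a, fix b) { return a - b; }
static fix fix_mul(fix a, fix b) { return (a * b) >> 16; }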

BartC

Nov 17, 2016, 6:26:04 AM
On 16/11/2016 22:12, David Brown wrote:
> On 16/11/16 18:19, BartC wrote:

>> Trying that tweak now, gcc is 10% faster, but Pelles C is 25% slower and
>> lccwin32 is 40% slower!

> When working with floating point, a gcc option that can sometimes make a
> very significant difference is "-ffast-math". This enables a range of
> floating point optimizations, but breaks strict IEEE compliance (which
> is why it is not enabled automatically with any -O flags). IEEE
> floating point puts a lot of restrictions on things like re-ordering of
> floating point operations. It would be very interesting to see how that
> affects the code and its speed. (Of course, if any of the other
> compilers have equivalent flags, try them too.)

That didn't make any discernible difference (to gcc).

Perhaps just as well as the earlier part of the thread was more about
how far ahead gcc was in code speed compared to compilers such as TCC; I
didn't want to make the gap even bigger!

[Written yesterday but forgot to post it.]

--
Bartc



Rick C. Hodgin

Nov 17, 2016, 7:45:06 AM
On Wednesday, November 16, 2016 at 9:54:28 PM UTC-5, Rick C. Hodgin wrote:
> I was going to sit down tonight and manually go through this, but I
> went into Visual Studio 2010 and wanted to see what type of 32-bit
> code it would produce when optimized. It did the same basic thing
> you suggest, Supercat:
>
> [snip]
>
> I'll write a manual one and post it. We could move to alt.lang.asm if
> you'd prefer.

I've got it down to less than 20 instructions, but it's computing
something incorrectly right now. I have to debug it.

It uses all 8 st(x) registers, with the upper five being consistently
used to store x, y, 2, a, b. All computations take place in the lower
three. I did have to use one fxch instruction to re-order st(1) and
st(0), but apart from that it's all natural.

We'll see. I'll debug it later tonight and post.

supe...@casperkitty.com

Nov 17, 2016, 11:42:46 AM
My main point of curiosity is what degree of semantic analysis a compiler
would need to apply to let a programmer achieve performance equivalent to
that of a more sophisticated compiler, if the programmer is willing to
tweak the code for optimal performance.

Did you happen to test whether the "register" keyword does anything? I
didn't see any explicit mention of it in the tcc documentation, but the
approach exemplified by Turbo C in the 1980s would seem fairly simple and
straightforward, and provide a lot of bang for the buck.

Rick C. Hodgin

Nov 17, 2016, 12:17:26 PM
On Thursday, November 17, 2016 at 11:42:46 AM UTC-5, supe...@casperkitty.com wrote:
> On Thursday, November 17, 2016 at 6:45:06 AM UTC-6, Rick C. Hodgin wrote:
> > On Wednesday, November 16, 2016 at 9:54:28 PM UTC-5, Rick C. Hodgin wrote:
> > > I'll write a manual one and post it. We could move to alt.lang.asm if
> > > you'd prefer.
> >
> > I've got it down to less than 20 instructions, but it's computing
> > something incorrectly right now. I have to debug it.
> >
> > It uses all 8 st(x) registers, with the upper five being consistently
> > used to store x, y, 2, a, b. All computations take place in the lower
> > three. I did have to use one fxch instruction to re-order st(1) and
> > st(0), but apart from that it's all natural.
> >
> > We'll see. I'll debug it later tonight and post.
>
> My main point of curiosity is what degree of semantic analysis a compiler
> would need to apply to let a programmer achieve performance equivalent to
> that of a more sophisticated compiler, if the programmer is willing to
> tweak the code for optimal performance.

Interesting. I don't see how you could separate the two. I think any
compiler capable of generating enough useful information to allow a
competent assembly language programmer to manually tweak the code to
produce optimal results through the compilation and its optimizations
would have to be on a par with a compiler that could, by knowing
what the needs are (or even being able to hint at them), optimize the
code in that way by itself anyway.

I do like the ability to debug the optimization process, however, such
as flagging why certain optimizations were not considered or, in the
case of speculative optimization pathways, ultimately discarded.

> Did you happen to test whether the "register" keyword does anything?
> I didn't see any explicit mention of it in the tcc documentation, but
> the approach exemplified by Turbo C in the 1980s would seem fairly
> simple and straightforward, and provide a lot of bang for the buck.

I don't have the compiler Bart used. The one I have accepts the register
keyword, but I believe it's simply ignored in more recent compilers
(since around 2003, I think).

I've found this entire exercise to be fascinating. To think through
the needs of the optimizer at this level, where the redundant multiplies
at the tail end of one iteration are removed by carrying their results
into the following iteration ... it's really opened my mind to considering
code optimization in a new way. I honestly never would've seen that on my
own without having gone through part of the compiler design, seen it in
the output assembly, and then put it together. Looking only at the source
code, I would not have picked up on it.

I've had to really think through the depths of what the optimizer will
have to consider in producing optimal code. It's going to be a very
comprehensive and complex processing engine, and one that must be
distinctly targeted for each machine.

It's also made me consider the idea of a stack-memory FPU for my own
CPU design. We only have 8 registers in the FPU, and they operate
on a stack. So I had the idea of an overflow onto a real stack,
which allowed far more than 8 registers, but when you get above 8 it
is simply slower, but no slower than it would be referencing memory
manually, and it would actually be faster because internally the
hardware could buffer an additional 24 registers, for example, in a
very close level-0 cache, so it would be just as easy to load in a large
block of constant data, and then reference it.

It might also be interesting to introduce a freeze frame portion,
which doesn't migrate through the stack, but is a hard upper area
which is loaded and then referenced continually without adjusting
each operand through the stack.

Food for thought.

Rick C. Hodgin

Nov 17, 2016, 12:42:53 PM
On Thursday, November 17, 2016 at 12:17:26 PM UTC-5, Rick C. Hodgin wrote:
> It's also made me consider the idea of a stack-memory FPU for my own
> CPU design. We only have 8 registers in the FPU, and they operate
> on a stack. So I had the idea of an overflow onto a real stack,
> which allowed far more than 8 registers, but when you get above 8 it
> is simply slower, but no slower than it would be referencing memory
> manually, and it would actually be faster because internally the
> hardware could buffer an addition 24 registers, for example, in a
> very close level-0 cache, so it would be just as easy load in a large
> block of constant data, and then reference it.
>
> It might also be interesting to introduce a freeze frame portion,
> which doesn't migrate through the stack, but is a hard upper area
> which is loaded and then referenced continually without adjusting
> each operand through the stack.
>
> Food for thought.

Here's the post on comp.arch related to it:

https://groups.google.com/d/msg/comp.arch/-0Fi33j0vWQ/nUiiCO2VBQAJ

supe...@casperkitty.com

Nov 17, 2016, 12:57:30 PM
On Thursday, November 17, 2016 at 11:17:26 AM UTC-6, Rick C. Hodgin wrote:
> On Thursday, November 17, 2016 at 11:42:46 AM UTC-5, supercat wrote:
> > My main point of curiosity is what degree of semantic analysis a compiler
> > would need to apply to let a programmer achieve performance equivalent to
> > that of a more sophisticated compiler, if the programmer is willing to
> > tweak the code for optimal performance.
>
> Interesting. I don't see how you could separate the two. I think any
> compiler capable of generating enough useful information to allow a
> competent assembly language programmer to manually tweak the code to
> produce optimal results through the compilation and its optimizations
> would have to be on the same par as a compiler that could, by knowing
> what the needs are (or even being able to hint at them), optimize the
> code in that way by itself anyway.

A compiler wouldn't need to do any semantic analysis to assign to registers
as many register-qualified variables of certain types as will fit into
registers, and then use those rather than using values in memory all the
time. A compiler that does fancier semantic analysis could almost certainly
optimize better than one which does not, but the $50,000 question is how
*much* better. If someone with a dinky compiler could with relatively little
work produce code that was only 20% slower than one with an optimizing
compiler, and if the dinky compiler had a much shorter build cycle, the
dinky compiler might well be more usable than the fancier one.

> I do like the ability to debug the optimization process, however, such
> as flagging why certain optimizations were not considered or, in the
> case of speculative optimization pathways, ultimately discarded.

Compilers and programmers will "know" different things about a program and
the inputs it will receive; optimizations using either along cannot be
nearly as effective as optimizations that combine them. Optimizations based
upon things the programmer tells the compiler would seem to be in just about
every way cheaper and better than aggressive optimizations based upon UB,
but for some reason seem less fashionable.

Rick C. Hodgin

Nov 17, 2016, 1:08:25 PM
On Thursday, November 17, 2016 at 12:57:30 PM UTC-5, supe...@casperkitty.com wrote:
> On Thursday, November 17, 2016 at 11:17:26 AM UTC-6, Rick C. Hodgin wrote:
> > On Thursday, November 17, 2016 at 11:42:46 AM UTC-5, supercat wrote:
> > > My main point of curiosity is what degree of semantic analysis a compiler
> > > would need to apply to let a programmer achieve performance equivalent to
> > > that of a more sophisticated compiler, if the programmer is willing to
> > > tweak the code for optimal performance.
> >
> > Interesting. I don't see how you could separate the two. I think any
> > compiler capable of generating enough useful information to allow a
> > competent assembly language programmer to manually tweak the code to
> > produce optimal results through the compilation and its optimizations
> > would have to be on the same par as a compiler that could, by knowing
> > what the needs are (or even being able to hint at them), optimize the
> > code in that way by itself anyway.
>
> A compiler wouldn't need to do any semantic analysis to assign to registers
> as many register-qualified variables of certain types as will fit into
> registers, and then use those rather than using values in memory all the
> time.

Any compiler capable of placing values in registers through use of the
register keyword should be able to determine by itself the optimal use
of all registers, automatically assigning them to permanent register
homes during optimization, loaded in the prologue code, and stored in
the epilogue code (if required).

> A compiler that does fancier semantic analysis could almost certainly
> optimize better than one which does not, but the $50,000 question is how
> *much* better. If someone with a dinky compiler could with relatively little
> work produce code that was only 20% slower than one with an optimizing
> compiler, and if the dinky compiler had a much shorter build cycle, the
> dinky compiler might well be more usable than the fancier one.

Oh! You're talking about not having an optimizer, but only this simple
ability to insert a variable into a register and keep it there, as by
the developer's decision, and not the compiler's.

I don't see that as any kind of possibility without having an optimizer
because the distance in code from that ability (to rearrange things to
use an explicit register for a value, rather than referencing it in
memory all the time), to a full optimizer capable of performing a host
of speculative tests to determine which arrangement is best, is not far
at all.

To be able to place values in registers like that, and have a code
generation engine which "works around" that constraint ... that's a
non-trivial ability in and of itself.

> > I do like the ability to debug the optimization process, however, such
> > as flagging why certain optimizations were not considered or, in the
> > case of speculative optimization pathways, ultimately discarded.
>
> Compilers and programmers will "know" different things about a program and
> the inputs it will receive; optimizations using either along cannot be
> nearly as effective as optimizations that combine them. Optimizations based
> upon things the programmer tells the compiler would seem to be in just about
> every way cheaper and better than aggressive optimizations based upon UB,
> but for some reason seem less fashionable.

I think experience has proven that's not true. It's very rare these
days that a human being can produce more optimal code than a compiler
can consistently. And, given the amount of time and expertise it takes
on behalf of the human doing the hand-optimization ... it's not really
worth it except for in those cases where that extra few percentage
points of performance increase really make a difference.

I am amazed sometimes at the optimization choices made by compilers on
various types of code. They include things I would never have come up
with on my own. However, having then seen them in code they are then
brought into my thinking, opening up new possibilities.

What I want to do with the RDC Optimizer is create a generic engine
which can be programmed with hard constraints and variables, and then
produce the optimal output based on factors. These could be code size
code speed, minimal use of a register set, maximal use of one, and so
on.

What I want to create is a stand-alone system that can handle this type
of optimization in a generic manner, such that it can be used by any
project. All that's required is the correct profile to be loaded for
the target.

A lot of work ahead of me. I'm still waiting to see if I can actually
do it. I think I can. And I can visualize how I'm going to do it.
I'm just not on the other side of having done it yet.

I am working on my assembler again, however. I took several weeks off
because I had pushed myself to the point of exhaustion.

supe...@casperkitty.com

Nov 17, 2016, 2:35:46 PM
On Thursday, November 17, 2016 at 12:08:25 PM UTC-6, Rick C. Hodgin wrote:
> Oh! You're talking about not having an optimizer, but only this simple
> ability to insert a variable into a register and keep it there, as by
> the developer's decision, and not the compilers.
>
> I don't see that as any kind of possibility without having an optimizer
> because the distance in code from that ability (to rearrange things to
> use an explicit register for a value, rather than referencing it in
> memory all the time), to a full optimizer capable of performing a host
> of speculative tests to determine which arrangement is best, is not far
> at all.

Optimizations without the keyword would require a multi-pass compiler.
The cost of a pass to determine register usage prior to code generation
may not be huge, but if compilation speed is a priority, allowing variables
to be fixed in registers using the "register" keyword would allow a
fraction of the benefits of register optimization to be obtained at a lower
compilation-time cost than would be needed for more sophisticated analysis.

> To be able to place values in registers like that, and have a code
> generation engine which "works around" that constraint ... that's a
> non-trivial ability in and of itself.

Not really. If a processor has enough registers that the code generator
could get by even if some were unusable, all the generator needs to do is
(1) allow a field in the symbol table to indicate whether a variable is in
a register or memory, and (2) have a code generator support instruction
forms that can use a register instead of memory. If the code generator
would use registers as a rolling-window value stack, declaring variables
as "register" might in some cases increase the number of register spills in
code that isn't using those variables, but in other cases it might offer
a big savings in variable-access costs without causing any extra register
spills.
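
Concretely, the bookkeeping could be as small as something like this (field
names invented purely for illustration):

/* Minimal symbol-table support for honoring "register": each variable
   either lives at a frame offset or is pinned to a specific register. */
enum storage { STORE_FRAME, STORE_REG };

struct symbol {
    const char   *name;
    int           type;      /* e.g. T_INT, T_NEARPTR, ...              */
    enum storage  where;     /* STORE_REG if the hint was honored       */
    int           reg;       /* which register, when where == STORE_REG */
    int           offset;    /* frame offset, when where == STORE_FRAME */
};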

> supercat wrote:
> > Compilers and programmers will "know" different things about a program and
> > the inputs it will receive; optimizations using either along cannot be
> > nearly as effective as optimizations that combine them. Optimizations based
> > upon things the programmer tells the compiler would seem to be in just about
> > every way cheaper and better than aggressive optimizations based upon UB,
> > but for some reason seem less fashionable.
>
> I think experience has proven that's not true. It's very rare these
> days that a human being can produce more optimal code than a compiler
> can consistently. And, given the amount of time and expertise it takes
> on behalf of the human doing the hand-optimization ... it's not really
> worth it except for in those cases where that extra few percentage
> points of performance increase really make a difference.

I didn't claim that humans, alone, could produce better code. My claim was
that generation of optimal code requires exploitation of *both* the things
that compilers know and humans may not, and the things humans would know but
compilers wouldn't know unless told. If a program can indicate to the
compiler e.g. "This variable should be less than 23 here, and I wouldn't
mind if the program terminated abnormally if and when the code figures out
that it won't be or wasn't", such a construct would allow many optimizations
to be made that wouldn't be possible based upon UB alone, while also being
much safer and making it easier for anyone reading the code to know the
programmer intent behind it.
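
In today's C the closest one can come is an explicit check (a sketch of the
idea; the construct described above would let the compiler get the same
information without the programmer spelling out the branch):

#include <stdlib.h>

/* States the assumption, defines what happens when it is violated
   (abnormal termination), and lets the compiler assume x < 23 in any
   code dominated by the call. */
static inline int assume_lt23(int x)
{
    if (x >= 23)
        abort();
    return x;
}

so that e.g. table[assume_lt23(idx)] both documents the intent and gives the
optimizer something concrete to work with.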

Rick C. Hodgin

Nov 17, 2016, 3:02:42 PM
On Thursday, November 17, 2016 at 2:35:46 PM UTC-5, supe...@casperkitty.com wrote:
> On Thursday, November 17, 2016 at 12:08:25 PM UTC-6, Rick C. Hodgin wrote:
> > Oh! You're talking about not having an optimizer, but only this simple
> > ability to insert a variable into a register and keep it there, as by
> > the developer's decision, and not the compilers.
> >
> > I don't see that as any kind of possibility without having an optimizer
> > because the distance in code from that ability (to rearrange things to
> > use an explicit register for a value, rather than referencing it in
> > memory all the time), to a full optimizer capable of performing a host
> > of speculative tests to determine which arrangement is best, is not far
> > at all.
>
> Optimizations without the keyword would require a multi-pass compiler.
> The cost of pass to determine register usage prior to code generation
> may not be huge, but if compilation speed is a priority, allowing variables
> to be fixed in registers using the "register" keyword would allow a
> fraction of the benefits of register optimization to be obtained at a lower
> compilation-time cost than would be needed for most sophisticated analysis.

Are there still people targeting a one-pass compiler? I had people in
this C group tell me that I wouldn't be able to compile CAlive source
code in one pass, as if that was a reason to reconsider some of my
syntax allowances. My reply was, "So?"

In this day and age with $25 quad-core CPUs running over 2 GHz, and
16GB memory sticks that today cost $70 each, and 2TB hard drives which
cost $70 each ... what limitations on compute are we talking about?

I have no intention at any point in the future designing something
which targets a minimal machine. My goals are for the machines we
have now, and will have in the future.

I recognize your comment above as being true, but it's also true that
in order to have a good buggy whip it needs a crosshatch of some kind
on the handle so it doesn't come out when you're driving in the rain.
While that's also true ... it's just not very relevant any longer.

> > To be able to place values in registers like that, and have a code
> > generation engine which "works around" that constraint ... that's a
> > non-trivial ability in and of itself.
>
> Not really. If a processor has enough registers that the code generator
> could get by even if some were unusable, all the generator needs to do is
> (1) allow a field in the symbol table to indicate whether a variable is in
> a register or memory, and (2) have a code generator support instruction
> forms that can use a register instead of memory. If the code generator
> would use registers as a rolling-window value stack, declaring variables
> as "register" might in some cases increase the number of register spills in
> code that isn't using those variables, but in other cases it might offer
> a big savings in variable-access costs without causing any extra register
> spills.

My apologies. I thought we were discussing the 32-bit code generation
of TCC32 on this mandel() function, and in this particular case the x86.
The 80386 and later have a very limited register set, and certain operations
can only be conducted in certain registers. Within that constraint,
it's very difficult to generate code which honors developer-specified
register assignments.

But, for AMD64 or some RISC machines with 32+ registers, it would be
much easier and could be handled in a first pass effort, but there is
then also the issue of spill and fill as you don't want to trump some
values used in another function that may have also been optimizing
something to use registers ... so now you're into a depth of call
analysis to see how you can structure registers to affect the fewest
things in the least way. A single-pass compiler could not do that,
and the benefits of assigning registers could be easily undone by
the spill/fill overhead requirements.

I don't think you'll find it can be done easily or well on nearly all
applications, and that in some cases it would even hurt performance.

> > supercat wrote:
> > > Compilers and programmers will "know" different things about a program and
> > > the inputs it will receive; optimizations using either along cannot be
> > > nearly as effective as optimizations that combine them. Optimizations based
> > > upon things the programmer tells the compiler would seem to be in just about
> > > every way cheaper and better than aggressive optimizations based upon UB,
> > > but for some reason seem less fashionable.
> >
> > I think experience has proven that's not true. It's very rare these
> > days that a human being can produce more optimal code than a compiler
> > can consistently. And, given the amount of time and expertise it takes
> > on behalf of the human doing the hand-optimization ... it's not really
> > worth it except for in those cases where that extra few percentage
> > points of performance increase really make a difference.
>
> I didn't claim that humans, alone, could produce better code. My claim was
> that generation of optimal code requires exploitation of *both* the things
> that compilers know and humans may not, and the things humans would know but
> compilers wouldn't know unless told. If a program can indicate to the
> compiler e.g. "This variable should be less than 23 here, and I wouldn't
> mind if the program terminated abnormally if and when the code figures out
> that it won't be or wasn't", such a construct would allow many optimizations
> to be made that wouldn't be possible based upon UB alone, while also being
> much safer and making it easier for anyone reading the code to know the
> programmer intent behind it.

My personal view is that optimization is an essential component of any
program that will enter production and isn't required to change often,
or even to be hot-patched. It only makes sense, as it is a maximal use
of available resources with the least expenditure
of energy. But ... that being said, I don't see computers ever getting
slower. I see electrical engineers working on newer technologies which
continue to increase the width of CPU engines, and the speed at which
they operate, reducing energy footprints and expanding capabilities.

I see a future where even simple things like greeting cards which play
music when you open them have a quad-core 64-bit CPU or greater. Why?
Because it will be so inexpensive to manufacture that model in bulk
that it will be used for a wide range of things.

The goals I have for my compiler are to make it function properly, be
feature rich, with later considerations given to speed of compilation,
and speed of code execution. However, I do intend to make the product
eventually have world-class performance, with my sincere heart's desire
to make it be the best in the world.

Right now I'm targeting 16+ core machines with nearly unlimited memory
and hard disk space, with 10+ Gbps Internet speeds. I believe we'll
have those machines in the 2020s, and that's when I see my compiler
truly hitting its stride.

We'll see though. Something could happen to me before then. I leave
my plans with the Lord (James 4:15) and press on toward them knowing
that /if/ I succeed it will only be because of Him and His allowance.
I for one would enjoy doing a great many other things as well. I just
happen to have skills in this area many other people don't have, and I
want to use what I've been given to offer up something to Him and to
all of mankind, so they can all benefit from that which He first gave
me. I see this compiler, called CAlive for a reason, being the vehicle
which moves that effort forward.

But, we'll see. :-)

BartC

Nov 17, 2016, 3:08:49 PM
On 17/11/2016 19:35, supe...@casperkitty.com wrote:
> On Thursday, November 17, 2016 at 12:08:25 PM UTC-6, Rick C. Hodgin wrote:
>> Oh! You're talking about not having an optimizer, but only this simple
>> ability to insert a variable into a register and keep it there, as by
>> the developer's decision, and not the compilers.
>>
>> I don't see that as any kind of possibility without having an optimizer
>> because the distance in code from that ability (to rearrange things to
>> use an explicit register for a value, rather than referencing it in
>> memory all the time), to a full optimizer capable of performing a host
>> of speculative tests to determine which arrangement is best, is not far
>> at all.
>
> Optimizations without the keyword would require a multi-pass compiler.
> The cost of pass to determine register usage prior to code generation
> may not be huge, but if compilation speed is a priority, allowing variables
> to be fixed in registers using the "register" keyword would allow a
> fraction of the benefits of register optimization to be obtained at a lower
> compilation-time cost than would be needed for most sophisticated analysis.

I've tried using such keywords (in the compiler for my C-like language).
It didn't really work.

Local variables can have restricted, nested or overlapping lifetimes,
allowing the same register to be used for several variables. That's hard
to do when it must be specified manually. (An example is the loop
variable in a for-loop.)

If too many are specified, then registers must be saved and restored,
when doing calls for example, whether they are currently in use or not.

Then some variables you might want to keep in registers are actually
globals.

But the main problem is not wanting to clutter up the code with
optimisation hints. (It's bad enough tying things down with static
declarations everywhere!)

An optimiser that can work with normal, unhinted code is preferable.

(What I also like is to have the language be able to express things more
naturally, so given:

char s[50], t[50];

it would be nicer to be able to write:

s = t;

than for the compiler to have to figure out that:

char *p=s, *q=t;

for (i=0; i<sizeof(s); ++i) *p++ = *q++;

could be converted to:

memcpy(s, t, sizeof(s));

(You could just write memcpy anyway, but you are still doing the
compiler's job for it, and you are not expressing things in the most
natural way.)

>> I think experience has proven that's not true. It's very rare these
>> days that a human being can produce more optimal code than a compiler
>> can consistently. And, given the amount of time and expertise it takes
>> on behalf of the human doing the hand-optimization ... it's not really
>> worth it except for in those cases where that extra few percentage
>> points of performance increase really make a difference.
>
> I didn't claim that humans, alone, could produce better code.

I took my Mandelbrot benchmark and tried some manual, inline ASM. I
improved the timing by just over 10% (even with a*a and b*b evaluated
once per iteration).

That is, the timing reduced from 650ms to 570ms, using my own compiler.
When converted to C and compiled with gcc, it was 400ms or less
depending on how the code was tweaked.

This is XMM-based code; I haven't looked yet to see what gcc does.
(Probably it inlines that function too but most of the execution time is
inside that loop.)

(Writing manual XMM-based ASM was horrible; I'd rather use HLL and
forget the 10-12% penalty, even if gcc wasn't an option!)

--
Bartc

Rick C. Hodgin

Nov 17, 2016, 3:21:40 PM
On Thursday, November 17, 2016 at 3:08:49 PM UTC-5, Bart wrote:
> (Writing manual XMM-based ASM was horrible; I'd rather use HLL and
> forget the 10-12% penalty, even if gcc wasn't an option!)

I agree completely.

It's hard enough navigating an FPU stack, but to then think in more
than one dimension, about more than one computation, when it's not
a mere mirror of the operation such as by processing four identical
items at a time ... it's just intense. I also don't know the SIMD
instructions well enough to be able to use them in the most efficient
manner.

I do look forward to writing the optimizer for those engines though.
I think that may actually be my greatest programming achievement.

supe...@casperkitty.com

Nov 17, 2016, 4:24:05 PM
On Thursday, November 17, 2016 at 2:02:42 PM UTC-6, Rick C. Hodgin wrote:
> On Thursday, November 17, 2016 at 2:35:46 PM UTC-5, supercat wrote:
> > Optimizations without the keyword would require a multi-pass compiler.
> > The cost of pass to determine register usage prior to code generation
> > may not be huge, but if compilation speed is a priority, allowing variables
> > to be fixed in registers using the "register" keyword would allow a
> > fraction of the benefits of register optimization to be obtained at a lower
> > compilation-time cost than would be needed for most sophisticated analysis.
>
> Are there still people targeting a one-pass compiler? I had people in
> this C group tell me that I wouldn't be able to compile CAlive source
> code in one pass, as if that was a reason to reconsider some of my
> syntax allowances. My reply was, "So?"

The thread was benchmarking, among other things, tcc. I may be mistaken,
but I was under the impression that tcc is a single-pass compiler which
places a substantial design emphasis on compilation speed.

> In this day and age with $25 quad-core CPUs running over 2 GHz, and
> 16GB memory sticks that today cost $70 each, and 2TB hard drives which
> cost $70 each ... what limitations on compute are we talking about?

Primarily situations in which it would be necessary to be able to compile
and generate code on the fly. For example, if one wanted to write a virtual
machine with the ability to perform just-in-time compilation, a design that
generates C code and compiles that could be more versatile than one that
tries to generate machine code, but if the compiler isn't very fast the
system might spend more time trying to optimize code than it would have
spent running a non-optimized version.

> > Not really. If a processor has enough registers that the code generator
> > could get by even if some were unusable, all the generator needs to do is
> > (1) allow a field in the symbol table to indicate whether a variable is in
> > a register or memory, and (2) have a code generator support instruction
> > forms that can use a register instead of memory. If the code generator
> > would use registers as a rolling-window value stack, declaring variables
> > as "register" might in some cases increase the number of register spills in
> > code that isn't using those variables, but in other cases it might offer
> > a big savings in variable-access costs without causing any extra register
> > spills.
>
> My apologies. I thought we were discussing the 32-bit code generation
> of TCC32 on this madel() function, and in this particular case the x86.
> The 80386 up has a very limited register set, and certain operations
> can only be conducted in certain registers. Within that constraint,
> it's very difficult to generate code which honors developer-specified
> register assignments.

Turbo C was able to get by reserving SI and DI for register variables
when targeting the 8088/8086. I see no reason a version targeting the
80386 instruction set shouldn't be able to reserve at least ESI and EDI,
and possibly EBX as well. Are there any cases where simple code generation
would need more than EAX, ECX (for shift), EDX, and ESP/EBP (stack and
frame)? On the floating-point side, would there be any difficulty keeping
track of which x87 registers hold clean or dirty variables and simply caching
the register values on an as-convenient basis?

> But, for AMD64 or some RISC machines with 32+ registers, it would be
> much easier and could be handled in a first pass effort, but there is
> then also the issue of spill and fill as you don't want to trump some
> values used in another function that may have also been optimizing
> something to use registers ... so now you're into a depth of call
> analysis to see how you can structure registers to affect the fewest
> things in the least way. A single-pass compiler could not do that,
> and the benefits of assigning registers could be easily undone by
> the spill/fill overhead requirements.

Multi-pass compilers are often useful, but there may be other use cases
involving dynamic code generation where compiler performance is critical.

> My personal view is the issue of optimization is an essential component
> of any program that will enter into production, and isn't required to
> be changed often, or even need to be hot-patched. It only makes sense
> as it is a maximal use of available resources with the least expenditure
> of energy. But ... that being said, I don't see computers ever getting
> slower. I see electrical engineers working on newer technologies which
> continue to increase the width of CPU engines, and the speed at which
> they operate, reducing energy footprints and expanding capabilities.

Mobile changes things quite a bit. I don't think battery technology will
ever improve to the point that devices that do a lot will be able to run
so long that consumers wouldn't prefer that they run even longer.

> I see a future where even simple things like greeting cards which play
> music when you open them have a quad-core 64-bit CPU or greater. Why?
> Because it will be so inexpensive to manufacture that model in bulk
> that it will be used for a wide range of things.

What kind of battery is that greeting card going to have? The number of
computations that can be performed per joule of energy has increased over
the years, but slower processors can often perform more computations per
joule than faster ones.

Rick C. Hodgin

Nov 17, 2016, 4:56:06 PM
On Thursday, November 17, 2016 at 4:24:05 PM UTC-5, supe...@casperkitty.com wrote:
> On Thursday, November 17, 2016 at 2:02:42 PM UTC-6, Rick C. Hodgin wrote:
> > On Thursday, November 17, 2016 at 2:35:46 PM UTC-5, supercat wrote:
> > > Optimizations without the keyword would require a multi-pass compiler.
> > > The cost of pass to determine register usage prior to code generation
> > > may not be huge, but if compilation speed is a priority, allowing variables
> > > to be fixed in registers using the "register" keyword would allow a
> > > fraction of the benefits of register optimization to be obtained at a lower
> > > compilation-time cost than would be needed for most sophisticated analysis.
> >
> > Are there still people targeting a one-pass compiler? I had people in
> > this C group tell me that I wouldn't be able to compile CAlive source
> > code in one pass, as if that was a reason to reconsider some of my
> > syntax allowances. My reply was, "So?"
>
> The thread was benchmarking, among other things, tcc. I may be mistaken,
> but I was under the impression that tcc is a single-pass compiler which
> places a substantial design emphasis on compilation speed.

It does. Its goals were to be compatible, fast, small, and not the
best in terms of performance. It met all of its goals. :-)

> > In this day and age with $25 quad-core CPUs running over 2 GHz, and
> > 16GB memory sticks that today cost $70 each, and 2TB hard drives which
> > cost $70 each ... what limitations on compute are we talking about?
>
> Primarily situations in which it would be necessary to be able to compile
> and generate code on the fly. For example, if one wanted to write a virtual
> machine with the ability to perform just-in-time compilation, a design that
> generates C code and compiles that could be more versatile than one that
> tries to generate machine code, but if the compiler isn't very fast the
> system might spend more time trying to optimize code than it would have
> spent running an non-optimized version.

To be honest, I would rather have a slower JIT that's able to produce
better code because the compilation will be run once, stuck into a cache
where it's then run many times. In addition, having more efficient
code generation would result in less energy use long-term.

> > > Not really. If a processor has enough registers that the code generator
> > > could get by even if some were unusable, all the generator needs to do is
> > > (1) allow a field in the symbol table to indicate whether a variable is in
> > > a register or memory, and (2) have a code generator support instruction
> > > forms that can use a register instead of memory. If the code generator
> > > would use registers as a rolling-window value stack, declaring variables
> > > as "register" might in some cases increase the number of register spills in
> > > code that isn't using those variables, but in other cases it might offer
> > > a big savings in variable-access costs without causing any extra register
> > > spills.
> >
> > My apologies. I thought we were discussing the 32-bit code generation
> > of TCC32 on this madel() function, and in this particular case the x86.
> > The 80386 up has a very limited register set, and certain operations
> > can only be conducted in certain registers. Within that constraint,
> > it's very difficult to generate code which honors developer-specified
> > register assignments.
>
> Turbo C was able to get by reserving SI and DI for register variables
> when targeting the 8088/8086. I see no reason version targeting the
> 80386 instruction set shouldn't be able to reserve at least ESI and EDI,
> and possibly EBX as well. Are there any cases where simple code generation
> would need more than EAX, ECX (for shift), EDX, and ESP/EBP (stack and
> frame)?

I think the compiler would gain more by using those registers locally,
for whatever values are hot across a few instructions, rather than
pinning one value in a register and keeping it there, which may force
additional memory accesses elsewhere because there aren't enough spare
registers left to cycle data through.

Modern CPUs are also OoO (out-of-order), so the re-load from the L1 data
cache can be moved a few instructions earlier: issue the memory read,
stuff some other instructions in between, and by the time the value is
actually used it has already been retrieved.

There are also some hints you can provide to the CPU about what to
prefetch into which cache, allowing oft-used memory to be brought
into L1 and explicitly kept there for a time.
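
(For example, gcc and clang expose a software prefetch hint as
__builtin_prefetch; a quick sketch:)

/* Ask for data[i + 16] while summing data[i]; the arguments are
   (address, rw: 0 = read, locality: 0..3, 3 = keep in all cache levels). */
double sum_with_prefetch(const double *data, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&data[i + 16], 0, 3);
        s += data[i];
    }
    return s;
}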

> On the floating-point side, would there be any difficulty keeping
> track of which x87 registers hold clean or dirty variables and simply caching
> the register values on an as-convenient basis?

No. It's non-trivial, but it's also not overtly complex. In the
same way, the ebp register can be freed by not using a stack frame.
The compiler could keep track of relative references as data is pushed
onto or off the stack. That's what happens when you use the compiler
option to remove stack frame pointers. ebp also has the added side-
effect of being beneficial because its default segment is SS:, which
means without any opcode override bytes it's accessing data directly
on the stack, though in many OSes all segment registers are mapped
to the same value and they use paging to isolate visible portions.

> > But, for AMD64 or some RISC machines with 32+ registers, it would be
> > much easier and could be handled in a first pass effort, but there is
> > then also the issue of spill and fill as you don't want to trump some
> > values used in another function that may have also been optimizing
> > something to use registers ... so now you're into a depth of call
> > analysis to see how you can structure registers to affect the fewest
> > things in the least way. A single-pass compiler could not do that,
> > and the benefits of assigning registers could be easily undone by
> > the spill/fill overhead requirements.
>
> Multi-pass compilers are often useful, but there may be other use cases
> involving dynamic code generation where compiler performance is critical.

I can't see a real case where compiler performance is that critical. And
given that multiple cores are the norm, I would go so far as to say
that if initial launch speed were so crucial that it required the fastest
possible compilation, then go ahead and introduce a quick single pass,
as part of a larger mechanism that does more advanced analysis across n
passes, to generate the "rough" code which begins running initially,
followed up with the "smooth" optimized code once the
additional passes are completed.

But I would consider that to be such a niche application that I can
honestly not see a use for it.

> > My personal view is the issue of optimization is an essential component
> > of any program that will enter into production, and isn't required to
> > be changed often, or even need to be hot-patched. It only makes sense
> > as it is a maximal use of available resources with the least expenditure
> > of energy. But ... that being said, I don't see computers ever getting
> > slower. I see electrical engineers working on newer technologies which
> > continue to increase the width of CPU engines, and the speed at which
> > they operate, reducing energy footprints and expanding capabilities.
>
> Mobile changes things quite a bit. I don't think battery technology will
> ever improve to the point that devices that do a lot will be able to run
> so long that consumers wouldn't prefer that they run even longer.

Consider the power envelope's we're dealing with. The original 4004 CPU
was built using a 10,000 nm process technology. It operated at a clock
speed of around 160 KHz when first introduced (IIRC). It required 15V,
had 2300 transistors, and consumed 0.45 watts, or about 0.196 milliwatts per
transistor (roughly 195,000 nanowatts). It was introduced in 1971.

Flash forward to 2016, 45 years later, and we now have CPUs like Intel's
Core-M CPU which have 1.3 billion transistors, and consume 4.5 watts,
or 0.000003 milliwats per transistor (3 nanowatts) on a 14 nm process
technology. They now have 10 nm, and are pushing toward 8 and 6.

That's a tremendous increase in 45 years. Where are we headed? Physics
is going to give us a hard wall in a few years in silicon feature sizes.
We won't be able to get down much below 6nm because there isn't enough
signal to indicate the switch. So, electrical engineers are going to
have to figure something else out. Whether it's carbon nanotubes, or
photonic, or some other thing ... I don't know, but it's coming. Our
ingenuity and resource dollars will keep going.

> > I see a future where even simple things like greeting cards which play
> > music when you open them have a quad-core 64-bit CPU or greater. Why?
> > Because it will be so inexpensive to manufacture that model in bulk
> > that it will be used for a wide range of things.
>
> What kind of battery is that greeting card going to have? The number of
> computations that can be performed per joule of energy has increased over
> the years, but slower processors can often perform more computations per
> joule than faster ones.

Considering the power trend, decreasing 65,000x in 45 years, a rate of
about 1444x per year, where will be in 10 years? We'll be to the point
where that tiny 64-bit CPU built on atoms is able to have the same 1.5+
billion transistors for 1/10,000th the power budget, resulting in a
birthday card that consumes more power to amplify the digital signal
into the range of audible hearing than it does to power the entire
digital engine. And, that digital engine can be written in C# running
atop .NET 9.5, in interpreted mode, because it also contains a gigabyte
of on-die memory. :-)

Who knows ... my point is if you would've told someone from the 1950s
that they'd someday have a greeting card they could open up and it plays
a song for them, they'd think you were crazy. So, when I say it will
someday soon be played on a quad-core 64-bit CPU where it takes more
power to amplify the digital signal to audible sound than it does to
run the computing device generating the sound ... it's not so far-
fetched as you might think.

supe...@casperkitty.com

unread,
Nov 17, 2016, 6:15:23 PM11/17/16
to
On Thursday, November 17, 2016 at 3:56:06 PM UTC-6, Rick C. Hodgin wrote:
> Consider the power envelopes we're dealing with. The original 4004 CPU
> was built using a 10,000 nm process technology. It operated at a clock
> speed of around 160 KHz when first introduced (IIRC). It required 15V,
> had 2300 transistors, and consumed 0.45 watts, or about 0.195 milliwatts
> per transistor (195,000 nanowatts). It was introduced in 1971.

Processors which use active pull-ups and passive pull-downs (PMOS, used
in the 4004), and those which use active pull-downs and passive pull-ups
(NMOS, used in the Z80 and 6502) are going to be a lot more power-hungry
than those which use active pull-ups and active pull-downs (CMOS). The RCA
COSMAC CDP1802 microprocessor (introduced in 1976) is rather slow, but if
one doesn't need to run too many instructions, all one need do to get its
current draw down below 1uA is to stop feeding it clocks when it doesn't
need to do anything.

> Flash forward to 2016, 45 years later, and we now have CPUs like Intel's
> Core-M CPU which have 1.3 billion transistors, and consume 4.5 watts,
> or 0.000003 milliwatts per transistor (3 nanowatts) on a 14 nm process
> technology. They now have 10 nm, and are pushing toward 8 and 6.
>
> That's a tremendous increase in 45 years. Where are we headed? Physics
> is going to give us a hard wall in a few years in silicon feature sizes.
> We won't be able to get down much below 6nm because there isn't enough
> signal to indicate the switch. So, electrical engineers are going to
> have to figure something else out. Whether it's carbon nanotubes, or
> photonic, or some other thing ... I don't know, but it's coming. Our
> ingenuity and resource dollars will keep going.

Power consumption has been going down, and battery capacity has been going
up, but the effect of that is to make things that wouldn't be practical,
practical. No matter how much they improve, there will always be some
applications where it would be helpful to improve battery life even further.

> Considering the power trend, decreasing 65,000x in 45 years, a rate of
> about 1444x per year, where will we be in 10 years?

A change of 65000x over 45 years would be about 27% per year, not 1444x.
Further, much of the change occurred between 1971 and 1976. The COSMAC
could run more than twice the speed of the 4004 while drawing under 5mW,
so if your figures are right that would imply that the improvements since
then are closer to 320x than to 65000x.
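
For anyone who wants to check that arithmetic, here is a quick sketch
that simply takes the 65000x and 45-year figures above at face value
(link with -lm):

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Compound annual factor implied by a 65000x change over 45 years */
    double factor = pow(65000.0, 1.0 / 45.0);
    printf("%.3fx per year (~%.0f%%)\n", factor, (factor - 1.0) * 100.0);
    /* prints roughly: 1.279x per year (~28%) */
    return 0;
}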

> Who knows ... my point is if you would've told someone from the 1950s
> that they'd someday have greeting card they could open up and it plays
> a song for them, they'd think you were crazy. So, when I say it will
> someday soon be played on a quad-core 64-bit CPU where it takes more
> power to amplify the digital signal to audible sound than it does to
> run the computing device generating the sound ... it's not so far-
> fetched as you might think.

Such a thing would have been possible in 1950 using a small clockwork-
powered record player. A little thick, but not outrageously so. An
electronic song player small enough to fit in a cigarette pack would
have been entirely feasible by 1970. Such a thing wouldn't be using a
CPU, of course, but even today most greeting cards still use specialized
circuitry rather than a CPU.

Chris M. Thomasson

unread,
Nov 17, 2016, 8:01:54 PM11/17/16
to
On 11/16/2016 3:02 PM, James R. Kuyper wrote:
> On 11/16/2016 05:21 PM, Chris M. Thomasson wrote:
> ...
>> Damn. I see this flag as having negative effects on my cipher based on
>> floating point. I basically need a standard representation of floating
>> point in order to get it to decrypt correctly, without loss across
>> different compilers. So, IEEE needs to be strictly honored in order for
>> things to work. This is why I named it Funny Fractal Encryption. The
>> Funny part is the nasty nest of floating point issues!
>
> Is strictly honoring IEEE requirements sufficient? IEEE says, in many
> contexts, for an expression that has a mathematical value which cannot
> be represented exactly, that the result is either the nearest higher or
> nearest lower representable value, chosen in an implementation-defined
> manner.

Well, then I have to explicitly document which implementation-defined
choices are compatible.


> In other contexts, the result is either the nearest
> representable value, or the larger or smaller representable value
> immediately adjacent to the nearest representable value, chosen in an
> implementation-defined manner.

Ditto.


> If encrypting on a platform which makes
> one choice in each of these cases, and decrypting on a platform which
> makes a different choice, would you be able to recover the original
> data?

There could be a real chance of loss, especially for very large files.


> My experience with fractals is that the calculations typically
> cause exponential growth of small differences, so I would expect that
> such differences would be catastrophic for your algorithm.

There just might be a way around this by explicitly documenting the
algorithm's exact requirements.


> This is why some people recommended against using floating point
> calculations as the basis for your encryption algorithm.
> (6.4.4.2p3)

Yes. However, one of the reasons why I chose fractals as a basis for a
cipher is because they are so darn highly sensitive to initial
conditions. This is a double-edged sword indeed!

Humm...
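
To make the "highly sensitive to initial conditions" point concrete,
here is a small sketch -- not your cipher, just a generic chaotic map --
showing how a one-ulp difference in the starting value turns into a
completely different result after a few dozen iterations. A single
differently-rounded operation between encrypt and decrypt would have
the same effect.

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Two starting values that differ by one ulp. */
    double x = 0.4;
    double y = nextafter(0.4, 1.0);

    for (int i = 0; i < 60; i++) {
        x = 4.0 * x * (1.0 - x);   /* logistic map, chaotic at r = 4 */
        y = 4.0 * y * (1.0 - y);
    }

    printf("x    = %.17g\n", x);
    printf("y    = %.17g\n", y);
    printf("diff = %g\n", fabs(x - y));   /* no longer small */
    return 0;
}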

Chris M. Thomasson

unread,
Nov 18, 2016, 4:05:39 AM11/18/16
to
This makes me think of the 128-bit floats in Jacob's LCC compiler...

David Brown

unread,
Nov 18, 2016, 4:41:19 AM11/18/16
to
On 17/11/16 22:23, supe...@casperkitty.com wrote:

> Multi-pass compilers are often useful, but there may be other use cases
> involving dynamic code generation where compiler performance is critical.
>

Why do you think that multi-pass compilation is slower than single-pass?

The reason single-pass compilation existed was memory limitations, not
processing time. Compilers simply could not read an entire file at once,
hold all the relevant data structures in memory, and generate all the
code. So they would read a bit of the source code and construct the
necessary data structures (symbol maps, etc.), generating output code or
assembly as they went along.

No modern code generator for a serious language will be single-pass.
That applies to JIT or other speed-critical compilation just as much as
for any other type of compilation. Multi-pass compilation and holding
more complete data structures in memory are, in fact, vital to getting
faster compilation speeds.

So for Rick's tools, where compilation speed is relevant during
edit-while-debug work, it is essential that the compiler is /not/ single
pass - it should be keeping everything in memory so that it can handle
small code changes without re-compiling everything.
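
To illustrate that shape, here is a toy sketch (entirely hypothetical,
and far simpler than any real compiler): the source has already been
turned into an in-memory tree, and the "passes" are then just separate
walks over that same tree -- no intermediate files, and nothing is
re-read or re-parsed.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical toy: an expression AST held entirely in memory, with two
   separate passes walking the same tree.  Nothing is re-read from disk. */
typedef struct Node {
    char op;                  /* '+', '*', or 0 for a literal */
    int  value;               /* used when op == 0            */
    struct Node *lhs, *rhs;
} Node;

static Node *lit(int v)
{
    Node *n = calloc(1, sizeof *n);
    n->value = v;
    return n;
}

static Node *bin(char op, Node *l, Node *r)
{
    Node *n = calloc(1, sizeof *n);
    n->op = op; n->lhs = l; n->rhs = r;
    return n;
}

/* Pass 1: constant folding, performed in place on the in-memory tree. */
static void fold(Node *n)
{
    if (!n || n->op == 0) return;
    fold(n->lhs);
    fold(n->rhs);
    if (n->lhs->op == 0 && n->rhs->op == 0) {
        n->value = (n->op == '+') ? n->lhs->value + n->rhs->value
                                  : n->lhs->value * n->rhs->value;
        free(n->lhs); free(n->rhs);
        n->lhs = n->rhs = NULL;
        n->op = 0;
    }
}

/* Pass 2: code emission (here just printed as stack-machine pseudo-ops). */
static void emit(const Node *n)
{
    if (!n) return;
    if (n->op == 0) { printf("push %d\n", n->value); return; }
    emit(n->lhs);
    emit(n->rhs);
    printf("%s\n", n->op == '+' ? "add" : "mul");
}

int main(void)
{
    Node *ast = bin('+', bin('*', lit(2), lit(3)), lit(4));  /* 2*3 + 4 */
    fold(ast);    /* first pass over the tree  */
    emit(ast);    /* second pass over the tree */
    return 0;
}

Whether the middle stage runs one pass or fifty over that tree makes
little difference to I/O; the expensive part of old-style multi-pass
compilers was re-reading and re-parsing text, which never happens here.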




Rick C. Hodgin

unread,
Nov 18, 2016, 6:22:17 AM11/18/16
to
I didn't have a chance to work on it last night. I'll try to get it
tonight. The version I have right now recomputes a*a and b*b each
pass. I'll make a second version which saves that operation.
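
For anyone who wants to try the same comparison, here is a sketch of the
two inner-loop shapes being discussed -- a generic Mandelbrot-style
iteration, not claimed to be the actual test code:

#include <stdio.h>

/* Version 1: recomputes a*a and b*b in both the loop test and the body. */
static int iterate_recompute(double ca, double cb, int max_iter)
{
    double a = 0.0, b = 0.0;
    int iter = 0;
    while (a*a + b*b <= 4.0 && iter < max_iter) {
        double t = a*a - b*b + ca;
        b = 2.0*a*b + cb;
        a = t;
        ++iter;
    }
    return iter;
}

/* Version 2: keeps the squares in locals and reuses them. */
static int iterate_cached(double ca, double cb, int max_iter)
{
    double a = 0.0, b = 0.0, aa = 0.0, bb = 0.0;
    int iter = 0;
    while (aa + bb <= 4.0 && iter < max_iter) {
        b  = 2.0*a*b + cb;
        a  = aa - bb + ca;
        aa = a*a;
        bb = b*b;
        ++iter;
    }
    return iter;
}

int main(void)
{
    /* Both versions must agree; only the work per iteration differs. */
    printf("%d %d\n", iterate_recompute(-0.75, 0.1, 1000),
                      iterate_cached(-0.75, 0.1, 1000));
    return 0;
}

Whether the second form actually wins depends on how good the compiler
already is at common-subexpression elimination; with optimization
enabled, many compilers will generate much the same code for both.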

Mr. Man-wai Chang

unread,
Nov 18, 2016, 6:23:29 AM11/18/16
to
On 18/11/2016 7:22 PM, Rick C. Hodgin wrote:
>
> I didn't have a chance to work on it last night. I'll try to get it
> tonight. The version I have right now recomputes a*a and b*b each
> pass. I'll make a second version which saves that operation.

Take it easy. Most existing C compilers are working fine, doing their
jobs.... and C programmers are not complaining.

Rick C. Hodgin

unread,
Nov 18, 2016, 6:36:20 AM11/18/16
to
Feel free to live life in granny gear if you like. I desire to kick it
up into overdrive and press the "Eco" button for maximum efficiency and
speed.

Mr. Man-wai Chang

unread,
Nov 18, 2016, 6:52:36 AM11/18/16
to
On 18/11/2016 7:36 PM, Rick C. Hodgin wrote:
>
> Feel free to live life in granny gear if you like. I desire to kick it
> up into overdrive and press the "Eco" button for maximum efficiency and
> speed.

Honestly and frankly speaking:

Could C programmers come to this newsgroup to complain about C compilers? :)

BartC

unread,
Nov 18, 2016, 7:12:18 AM11/18/16
to
I keep hearing this nonsense about TCC's speed being due to its being
single pass.

Multi-pass used to mean rereading either the entire source file, or
intermediate files that have been generated, but that's not really
relevant any more except perhaps to large compilers such as gcc and
clang, which may perhaps do just that, and it could explain their
sluggishness.

(That is, they consist of discrete parts which communicate via files or
pipes.)

With the stuff I do, source code is read into memory, parsed into some
internal structure (eg. an AST for the code and ST for the symbols),
then multiple passes may be done on that data. That can still be fast.

Dealing with C has some extra challenges, as the preprocessor could be a
separate pass (but I believe it can be done as it goes along), while
recompiling the same header multiple times for different modules could
be necessary even for a so-called single-pass compiler.

--
Bartc

David Brown

unread,
Nov 18, 2016, 7:24:31 AM11/18/16
to
Modern gcc has many dozens of passes for different types of
optimisations, and these are done within the same process and with
everything in memory.

The pre-processing part ("cpp"), the "C to assembly" part ("cc1"), and
the "assembly to object file" part ("as" from binutils), are all
discrete programs that communicate via files or pipes. And you are
correct that this architecture may be an influence on the slower
compilation speed of gcc - though I doubt if it is a major point, as
most of the hard work is done in the cc1 part. The difference will be
more relevant on Windows - on *nix, multiple processes connected by
files or pipes are very efficient, while Windows has a lot more
overhead here.

Jerry Stuckle

unread,
Nov 18, 2016, 9:45:52 AM11/18/16
to
No, multi-pass compilers existed, even on small machines. The first
programming I did was in FORTRAN on an IBM 1410 with 4,000 bytes of
core. Coding was done on punch cards.

But the compiler was a multi-pass compiler even then. It stored
intermediate code on disk and worked from there. A little later we were
running FORTRAN on an early IBM 360 with no disks - just tape drives
(and tape operating system). Again, multi-pass, and fun to watch the
tapes spin back and forth. I don't remember how much memory this one
had, but again it wasn't much.

No, it wasn't lack of memory. It was a design decision by K & R.

--
==================
Remove the "x" from my email address
Jerry Stuckle
jstu...@attglobal.net
==================

james...@verizon.net

unread,
Nov 18, 2016, 1:38:38 PM11/18/16
to
On Wednesday, November 16, 2016 at 6:03:08 PM UTC-5, James R. Kuyper wrote:
...
> Is strictly honoring IEEE requirements sufficient? IEEE says, in many
> contexts, for an expression that has a mathematical value which cannot
> be represented exactly, that the result is either the nearest higher or
> nearest lower representable value, chosen in an implementation-defined
> manner. In other contexts, the result is either the nearest
> representable value, or the larger or smaller representable value
> immediately adjacent to the nearest representable value, chosen in an
> implementation-defined manner.

I was working from memory when I wrote that, and my memory is less than perfect. What I described are the C rules for conversions of integers or floating point constants in the source code to floating point values.
What IEEE says about those cases is a bit more complicated, and I'm not sure whether it's equivalent. However what it says about most other operations is that they must produce the same result as if the exact value was computed with infinite precision, and then rounded to a representable value according to the current rounding rules.

Therefore, I think you could ensure consistent floating point results if you avoid any use of floating point constants (good luck with that!), and require that your code be compiled using an implementation of C that

1) fully conforms to C99 or later,

2) pre#defines __STDC_IEC_559__, requiring full conformance with Annex F. Note F.2 allows only one format each for the 'float' and 'double' types, but 'long double' has much more freedom - to retain precise control over your results, I recommend making no use of long double.

3) #defines the macro for your preferred rounding direction in <fenv.h>. Call fesetround() to force use of that method, which might not be the default.

4) #defines FLT_EVAL_METHOD in <float.h> as either 0 or 1. A value of 2 would allow the use of long double for intermediate results of calculations involving doubles, which should be avoided for the same reason that long double itself should be avoided.

However, I'm not willing to guarantee that this is sufficient - it's not something I've ever tried to do. When I use floating point, a certain amount of inaccuracy is always assumed to be unavoidable; I wouldn't ordinarily use it where exactly reproducible results are needed across multiple platforms.
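
A sketch of how those requirements might be checked for and requested in
code -- it simply mirrors the list above, and, as noted, no guarantee is
offered that it is sufficient:

#include <fenv.h>
#include <float.h>
#include <stdio.h>

#if !defined(__STDC_IEC_559__)
#error "Annex F (IEC 60559) support is required"
#endif

#if FLT_EVAL_METHOD != 0 && FLT_EVAL_METHOD != 1
#error "FLT_EVAL_METHOD must be 0 or 1 for predictable intermediate precision"
#endif

#pragma STDC FENV_ACCESS ON   /* we read/modify the floating-point environment */

int main(void)
{
    /* Force a known rounding direction instead of relying on the default. */
    if (fesetround(FE_TONEAREST) != 0) {
        fprintf(stderr, "could not set rounding mode\n");
        return 1;
    }

    /* ... encrypt/decrypt here, using only float/double, no long double ... */
    return 0;
}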

Rick C. Hodgin

unread,
Nov 19, 2016, 8:55:41 PM11/19/16
to
Finally got around to it. Here are the results, and the code I used.
I really have to hand it to Microsoft. They've had significant
compiler improvements since Visual Studio 2010:

// Visual Studio 2015 Winner:
// Debug: Max : 366085784, time: 3.267
// Max custom: 366085784, time: 2.154 <--- Custom
// Release: Max : 366085784, time: 1.441 <--- Compiler
// Max custom: 366085784, time: 1.767

// Visual Studio 2012
// Debug: Max : 366085784, time: 3.470
// Max custom: 366085784, time: 2.182 <--- Custom
// Release: Max : 366085784, time: 1.751 <--- Compiler
// Max custom: 366085784, time: 1.782

// Visual Studio 2010
// Debug: Max : 366085784, time: 3.655
// Max custom: 366085784, time: 2.276 <--- Custom
// Release: Max : 366085784, time: 2.343
// Max custom: 366085784, time: 2.066 <--- Custom

Source code:

http://pastebin.com/VKu0RCiM

Am curious to hear what other compilers do.

Chris M. Thomasson

unread,
Nov 20, 2016, 6:39:07 PM11/20/16
to
Thank you James! This is exactly the type of information I am looking for.