
Compiler speed testing


James Harris

Nov 5, 2022, 11:03:47 AM
Any thoughts on what would be in a good piece of source code to use to
test a compiler's speed?

I guess I could do with something that's (1) realistic, not just an
assignment statement repeated over and over again, (2) large enough to
give lengthy timings, (3) easy enough to make, and (4) valid over
multiple generations of the compiler. But what should be in it?

Here are some comparative timings taken just now. They read and compile
my parser (as it's the largest source file in the compiler). It is about
18k in size, and from it the compiler produces an asm file of a little
over 100k.

cda 566 ms
cdb 540 ms
cdc 600 ms
cdd 24 ms

The reason for the gratifying jump in performance at the end is that I
added input and output buffering to cdd, but it's got me wondering about
how to test the times taken by future compilers.

I should say that I don't expect compile times always to improve. The
run times of later compilers would likely go up as I add facilities and
down as I switch to other mechanisms. But it would still be something
I'd like to keep an eye on.


--
James Harris


Bart

Nov 5, 2022, 12:51:24 PM
Your figures are too confused to comment on meaningfully. For a start,
I'd like to know the reason for that 25 times speedup! What was it
spending 96% of its time doing in those earlier versions?

I don't believe there's any real buffering involved in 18KB input.

But I'm also used to working with line counts rather than file sizes;
how many lines was the input, and how many lines was the output?

Generating ASM source is usually somewhat slower than writing binary,
but it should still be fast enough (it becomes an issue if you want the
fastest possible speeds).


I can tell you that my current compiler manages some 700K lines per
second (this is source code to EXE) on the test file in my second link
below.

If I tell it to generate ASM, it drops to 260K lines per second (taking
2.8 seconds to generate 2.2M lines). (The input file is 9MB, and the
generated ASM is 87MB; the EXE was 8MB.)

Since I now once again support a C target, I can accelerate it further
by some 30-40%, so just pushing 1Mlps on that first test.


But ultimately, compilation just has to be fast enough not to be a
nuisance (beyond that it's just a sport).

Your 24ms sounds fine. 600ms would be mildly annoying for me, but is
still tolerable. For a while I ran an /interpreted/ compiler taking up
to half a second per module, but that was with independent compilation.

Since I now work with whole program compilers, the speed is a lot more
important. Currently, no applications of mine take more than 100ms, once
file-cacheing comes into play.



Here are links to tests that I'm sure I posted before:


https://github.com/sal55/langs/blob/master/Compilertest1.md

https://github.com/sal55/langs/blob/master/Compilertest3.md

These were all done on a slower machine. My current one (where I can get
1Mlps) uses an 'AMD Ryzen 3 2650U 2.6GHz', which while faster, I think
is still low-end.



James Harris

Nov 5, 2022, 1:54:16 PM
In a word, IO. It was reading and writing one character at a time -
which was enough to start with. You may remember we discussed this a few
years ago:

https://groups.google.com/g/comp.lang.misc/c/nABLfzd08dA/m/WImDDyDUCAAJ

>
> I don't believe there's any real buffering involved in 18KB input.

It was definitely buffering. With the new compiler if I change the
buffer size to 1 then it takes as long as it did before. Here's how the
run time changes with different buffer sizes:

bufsize, approx time in ms
1, 650
2, 340
3, 240
4, 180
8, 110
16, 70
32, 45
64, 35
128, 28
256, 26
512, 23
1024, 23
2048, 22
4096, 21

>
> But I'm also used to working with line counts rather than file sizes;
> how many lines was the input, and how many lines was the output?

Good question.

Input: 678 lines including blank lines
Output: 6742 lines including blank lines

>
> Generating ASM source is usually somewhat slower than writing binary,
> but it should still be fast enough (it becomes an issue if you want the
> fastest possible speeds).
>
>
> I can tell you that my current compiler manages some 700K lines per
> second (this is source code to EXE) on the test file in my second link
> below.
>
> If I tell it to generate ASM, it drops to 260K lines per second (taking
> 2.8 seconds to generate 2.2M lines. (The input file is 9MB, and the
> generated ASM is 87MB; the EXE was 8MB.)

Based on the above I make mine 32k lines per second.

>
> Since I now once again support a C target, I can accelerate it further
> by some 30-40%, so just pushing 1Mlps on that first test.
>
>
> But ultimately, compilation just has to be fast enough not to be a
> nuisance (beyond that it's just a sport).

Yes, the compiler is certainly 'fast enough' for now but it would be fun
to try to improve the speed once it is more mature.

There's a lot I need to do to the compiler so that the language it can
compile is more complete (there's a lot missing at the moment), but IO
buffering was one thing which I thought would make a big difference
fairly easily.

>
> Your 24ms sounds fine. 600ms would be mildly annoying for me, but is
> still tolerable. For a while I ran an /interpreted/ compiler taking up
> to half a second per module, but that was with independent compilation.
>
> Since I now work with whole program compilers, the speed is a lot more
> important. Currently, no applications of mine take more than 100ms, once
> file-cacheing comes into play.
>
>
>
> Here are links to tests that I'm sure I posted before:
>
>
> https://github.com/sal55/langs/blob/master/Compilertest1.md
>
> https://github.com/sal55/langs/blob/master/Compilertest3.md
>
> These were all done on a slower machine. My current one (where I can get
> 1Mlps) uses an 'AMD Ryzen 3 2650U 2.6GHz', which while faster, I think
> is still low-end.

Surprisingly, that processor doesn't come up at

https://www.cpubenchmark.net/singleCompare.php

From /proc/cpuinfo I have the following.

Intel(R) Pentium(R) Silver J5005 CPU @ 1.50GHz


--
James Harris



Bart

Nov 5, 2022, 4:15:05 PM
On 05/11/2022 17:54, James Harris wrote:
> On 05/11/2022 16:51, Bart wrote:

>> Your figures are too confused to comment on meaningfully. For a start,
>> I'd like to know the reason for that 25 times speedup! What was it
>> spending 96% of its time doing in those earlier versions?
>
> In a word, IO. It was reading and writing one character at a time -
> which was enough to start with. You may remember we discussed this a few
> years ago:
>
>  https://groups.google.com/g/comp.lang.misc/c/nABLfzd08dA/m/WImDDyDUCAAJ
>
>>
>> I don't believe there's any real buffering involved in 18KB input.
>
> It was definitely buffering. With the new compiler if I change the
> buffer size to 1 then it takes as long as it did before.

> bufsize, approx time in ms
> 1, 650
> 2, 340

I assumed that such a small file would be loaded in one go by the OS
anyway. Then any calls you do, even reading a character at a time, would
read from the OS's in-memory buffer.

So to take 0.65 seconds to read 18KB seems puzzling. 28KB per second?
That's roughly the transfer rate from a floppy disk! Yet this is memory
to memory on a modern PC with GHz clock rates. Something funny is going on.

A loop like this:

to 18'000 do
c:=fgetc(f)
od

in /interpreted/ code (calling the C function) is too fast to measure.

But reading a 7.8MB input file with such a loop, a character at a time
in scripting code, takes 0.47 seconds.

>> If I tell it to generate ASM, it drops to 260K lines per second
>> (taking 2.8 seconds to generate 2.2M lines. (The input file is 9MB,
>> and the generated ASM is 87MB; the EXE was 8MB.)
>
> Based on the above I make mine 32k lines per second.

This is the sort of speed of compilers like gcc. Is yours still written
in Python? I thought you had it self-hosted.

>> These were all done on a slower machine. My current one (where I can
>> get 1Mlps) uses an 'AMD Ryzen 3 2650U 2.6GHz', which while faster, I
>> think is still low-end.
>
> Surprisingly, that processor doesn't come up at
>
>   https://www.cpubenchmark.net/singleCompare.php
>
> From /proc/cpuinfo I have the following.
>
>   Intel(R) Pentium(R) Silver J5005 CPU @ 1.50GHz


Probably because it's 3650U not 2650U. The rating on that site shows it
at 3900 compared with 3050 of your device.

So not as low end as I expected (I assumed it was, being the second
cheapest PC in the shop). My friend has a laptop with an i5 in it and my
brief tests with that showed it to be 50% faster than my machine.



Bart

Nov 5, 2022, 4:44:33 PM
On 05/11/2022 20:15, Bart wrote:

>>> These were all done on a slower machine. My current one (where I can
>>> get 1Mlps) uses an 'AMD Ryzen 3 2650U 2.6GHz', which while faster, I
>>> think is still low-end.
>>
>> Surprisingly, that processor doesn't come up at
>>
>>    https://www.cpubenchmark.net/singleCompare.php
>>
>>  From /proc/cpuinfo I have the following.
>>
>>    Intel(R) Pentium(R) Silver J5005 CPU @ 1.50GHz
>
>
> Probably because it's 3650U not 2650U. The rating on that site shows it

It's not, it's 3250U. I know it's right this time because I did copy+paste!



Bart

Nov 5, 2022, 4:52:15 PM
On 05/11/2022 20:15, Bart wrote:
> On 05/11/2022 17:54, James Harris wrote:
>> On 05/11/2022 16:51, Bart wrote:
>
>>> Your figures are too confused to comment on meaningfully. For a
>>> start, I'd like to know the reason for that 25 times speedup! What
>>> was it spending 96% of its time doing in those earlier versions?
>>
>> In a word, IO. It was reading and writing one character at a time -
>> which was enough to start with. You may remember we discussed this a
>> few years ago:
>>
>>   https://groups.google.com/g/comp.lang.misc/c/nABLfzd08dA/m/WImDDyDUCAAJ
>>
>>>
>>> I don't believe there's any real buffering involved in 18KB input.
>>
>> It was definitely buffering. With the new compiler if I change the
>> buffer size to 1 then it takes as long as it did before.
>
> > bufsize, approx time in ms
> > 1, 650
> > 2, 340
>
> I assumed that such a small file would be loaded in one go by the OS
> anyway. Then any calls you do, even reading a character at a time, would
> read from the OS's in-memory buffer.
>
> So to take 0.65 seconds to read 18KB seems puzzling. 28KB per second?
> That's roughly the transfer rate from a floppy disk! Yet this is memory
> to memory on a modern PC with GHz clock rates. Something funny is going on.

I should have read your link first. There I compared the speed to a
serial port at 2400 baud. At least it has improved from that!

I think with source files, unless you're trying to run on a restricted
machine that only has a few KB of RAM, just grab the whole file at once
into a buffer or a string directly accessible by your lexer.

Otherwise you're going to be getting these silly speeds. Loading and
scanning 18KB of data should be too fast to measure.


James Harris

Nov 5, 2022, 5:22:00 PM
On 05/11/2022 20:15, Bart wrote:
> On 05/11/2022 17:54, James Harris wrote:
>> On 05/11/2022 16:51, Bart wrote:
>
>>> Your figures are too confused to comment on meaningfully. For a
>>> start, I'd like to know the reason for that 25 times speedup! What
>>> was it spending 96% of its time doing in those earlier versions?
>>
>> In a word, IO. It was reading and writing one character at a time -
>> which was enough to start with. You may remember we discussed this a
>> few years ago:
>>
>>   https://groups.google.com/g/comp.lang.misc/c/nABLfzd08dA/m/WImDDyDUCAAJ
>>
>>>
>>> I don't believe there's any real buffering involved in 18KB input.
>>
>> It was definitely buffering. With the new compiler if I change the
>> buffer size to 1 then it takes as long as it did before.
>
> > bufsize, approx time in ms
> > 1, 650
> > 2, 340
>
> I assumed that such a small file would be loaded in one go by the OS
> anyway. Then any calls you do, even reading a character at a time, would
> read from the OS's in-memory buffer.
>
> So to take 0.65 seconds to read 18KB seems puzzling. 28KB per second?
> That's roughly the transfer rate from a floppy disk! Yet this is memory
> to memory on a modern PC with GHz clock rates. Something funny is going on.

Don't forget all the context switching to and from kernel mode - for
both read (18k) and write (100k).

>
> A loop like this:
>
>    to 18'000 do
>       c:=fgetc(f)
>    od
>
> in /interpreted/ code (calling the C function) is too fast to measure.

Two points:

1. That only reads - c. 18k. To be a fair test you would also have to
write c. 100k.

2. I'd expect fgetc to be buffered.

You can see the difference if you run under strace. It will show the
individual syscalls. Without buffering there should be something like
118,000 reads/writes. With buffering of 512 there should only be about
230 such syscalls - about 1/500th of the number. I suspect that kernel
calls and returns are where a lot of the time will be going if there's
no buffering.

Still not convinced? Take a look at dd. When run on a file with 112,000
bytes:

time dd if=infile of=/dev/null bs=1
time dd if=infile of=/dev/null bs=512

The first takes 370 ms. The second just 5ms. QED, I think. :)

Perhaps more interesting is where an individual compiler spends its
time: how long in lexing, statement parsing, expression parsing, IR
generation, IR alterations, optimisation, code gen, etc. At some point I
may add code to gather such info.


>
> But reading a 7.8MB input file with such a loop, a character at a time
> in scripting code, takes 0.47 seconds.
>
>>> If I tell it to generate ASM, it drops to 260K lines per second
>>> (taking 2.8 seconds to generate 2.2M lines. (The input file is 9MB,
>>> and the generated ASM is 87MB; the EXE was 8MB.)
>>
>> Based on the above I make mine 32k lines per second.
>
> This is the sort of speed of compilers like gcc. Is yours still written
> in Python? I thought you had it self-hosted.

The compiler is self hosted. The first one, which I now call cda, was
written in asm. The others, cdb to cdd, are written in my language and
compiled via asm. No other language is used in compilation.

What you may be remembering is that Python is used for running test scripts.

>
>>> These were all done on a slower machine. My current one (where I can
>>> get 1Mlps) uses an 'AMD Ryzen 3 2650U 2.6GHz', which while faster, I
>>> think is still low-end.
>>
>> Surprisingly, that processor doesn't come up at
>>
>>    https://www.cpubenchmark.net/singleCompare.php
>>
>>  From /proc/cpuinfo I have the following.
>>
>>    Intel(R) Pentium(R) Silver J5005 CPU @ 1.50GHz
>
>
> Probably because it's 3650U not 2650U. The rating on that site shows it
> at 3900 compared with 3050 of your device.

I see your other post that it's the 3250U.

As for comparing, unless your compiler is multithreaded it's probably
best to use the CPU's single-thread rating. Mine comes in at 1206. Yours
at 1812 - about 50% faster.


--
James Harris



James Harris

Nov 5, 2022, 5:33:14 PM
> I should have read your link first.

Always!

;)

> There I compared the speed to a
> serial port at 2400 baud. At least it has improved from that!
>
> I think with source files, unless you're trying to run on a restricted
> machine that only has a few KB of RAM, just grab the whole file at once
> into a buffer or a string directly accessible by your lexer.

There's absolutely no need to do that. I just tried it by making each
buffer 256k. Both files are smaller than the buffers so there would be
only one read and one write. But it made no clear difference to the run
time. A buffer of 256k and a buffer of 0.5k both came in with runs
around 20 to 30 ms.


--
James Harris



Bart

Nov 5, 2022, 6:10:07 PM
On 05/11/2022 21:21, James Harris wrote:
> On 05/11/2022 20:15, Bart wrote:

>> So to take 0.65 seconds to read 18KB seems puzzling. 28KB per second?
>> That's roughly the transfer rate from a floppy disk! Yet this is
>> memory to memory on a modern PC with GHz clock rates. Something funny
>> is going on.
>
> Don't forget all the context switching to and from kernel mode - for
> both read (18k) and write (100k).

Yeah but, why?

>>
>> A loop like this:
>>
>>     to 18'000 do
>>        c:=fgetc(f)
>>     od
>>
>> in /interpreted/ code (calling the C function) is too fast to measure.
>
> Two points:
>
> 1. That only reads - c. 18k. To be a fair test you would also have to
> write c. 100k.

Nope, still zero, even with an fputc loop run 100K times. These are
still tiny files.

Whatever fgetc/fputc are doing behind the scenes with buffering, it's
being done right.

> 2. I'd expect fgetc to be buffered.
>
> You can see the difference if you run under strace. It will show the
> individual syscalls. Without buffering there should be something like
> 118,000 reads/writes. With buffering of 512 there should only be about
> 230 such syscalls - about 1/500th of the number. I suspect that kernel
> calls and returns is where a lot of the time will be going if there's no
> buffering.
>
> Still not convinced? Take a look at dd. When run on a file with 112,000
> bytes:
>
> time dd if=infile of=/dev/null bs=1
> time dd if=infile of=/dev/null bs=512
>
> The first takes 370 ms. The second just 5ms. QED, I think. :)

OK. TBH I still don't understand what's going on (I don't know what 'dd'
is). Are you specifically doing file reads with the lowest-level system
calls possible, and requesting it's all done a character at a time?

If so, why? I just use C's `fread()` to read an entire file at once; job
done. If all the 100s of apps on my PC all used the same slow file read
methods, the machine would grind to a halt.


> Perhaps more interesting is where an individual compiler spends its
> time: how long in lexing, statement parsing, expression parsing, IR
> generation, IR alterations, optimisation, code gen, etc. At some point I
> may add code to gather such info.


Here are some figures from a recent test (any smaller input, is too fast
to measure accurately):

Load modules    0 msec  Load all sources (here, 9MB mostly in 1 file)
Parsing       216 msec  Parse and create AST
Name Resolve   47 msec  Scan AST and resolve names
Type Analysis  84 msec
Codegen       131 msec  To x64 representation
'SS'          138 msec  To x64 machine code
Write EXE       8 msec  (varies 0-16 ms) Build EXE image & write (7.6MB)

Total        ~630 msec

This benefits from file-cacheing. Maybe the OS has unfinished business
with committing the EXE to disk after it returns from an `fclose` call;
I don't know. The elapsed time is some 0.7 seconds for the whole job.

File-loading should simply not be an issue. Usually a compiler is run on
a file that has just been edited, so it should still be in memory.

Compilations from 'cold' are uncommon. But file ops are anyway not under
your control (or maybe they are for you!).


James Harris

Nov 6, 2022, 3:36:54 AM
On 05/11/2022 22:10, Bart wrote:
> On 05/11/2022 21:21, James Harris wrote:
>> On 05/11/2022 20:15, Bart wrote:
>
>>> So to take 0.65 seconds to read 18KB seems puzzling. 28KB per second?
>>> That's roughly the transfer rate from a floppy disk! Yet this is
>>> memory to memory on a modern PC with GHz clock rates. Something funny
>>> is going on.
>>
>> Don't forget all the context switching to and from kernel mode - for
>> both read (18k) and write (100k).
>
> Yeah but, why?

I don't understand the question but see below on timings.

..

>> Still not convinced? Take a look at dd. When run on a file with
>> 112,000 bytes:
>>
>> time dd if=infile of=/dev/null bs=1
>> time dd if=infile of=/dev/null bs=512
>>
>> The first takes 370 ms. The second just 5ms. QED, I think. :)
>
> OK. TBH I still don't understand what's going on (I don't know what 'dd'
> is). Are you specifically doing file reads by doing as low-level systems
> calls as possible, and requesting it's all done a character at a time?

In the example dd copies infile to /dev/null (which drops anything sent
to it). bs= sets the block size, i.e. how much can be returned from each
read() syscall.

The one which makes about 230 read() syscalls executes in 5 ms. The one
which reads the same amount of data but uses 118,000 syscalls takes over
70 times as long. As the input file is the same in each case, the
comparison shows how slow syscalls can be, each one adding about 3
microseconds, if I've calculated it correctly, to the overall task.

>
> If so, why? I just use C's `fread()` to read an entire file at once; job
> done. If all the 100s of apps on my PC all used the same slow file read
> methods, the machine would grind to a halt.

If your parser needs to backtrack in the source, that makes sense. But
as shown in my prior post there's no appreciable speed advantage.

>
>
>> Perhaps more interesting is where an individual compiler spends its
>> time: how long in lexing, statement parsing, expression parsing, IR
>> generation, IR alterations, optimisation, code gen, etc. At some point
>> I may add code to gather such info.
>
>
> Here are some figures from a recent test (any smaller input, is too fast
> to measure accurately):
>
> Load modules    0 msec  Load all sources (here, 9MB mostly in 1 file)

Are you sure that's not mapping the file into virtual memory rather than
reading it?

> Parsing       216 msec  Parse and create AST
> Name Resolve   47 msec  Scan AST and resolve names
> Type Analysis  84 msec
> Codegen       131 msec  To x64 representation
> 'SS'          138 msec  To x64 machine code
> Write EXE       8 msec (varies 0-16 ms) Build EXE image & write (7.6MB)
>
> Total        ~630 msec
>
> This benefits from file-cacheing. Maybe the OS has unfinished business
> with committing the EXE to disk after it returns from an `fclose` call;
> I don't know. The elapsed time is some 0.7 seconds for the whole job.
>
> File-loading should simply not be an issue. Usually a compiler is run on
> a file that has just been edited, so it should still be in memory.

Yes, the whole file could be cached.

>
> Compilations from 'cold' are uncommon. But file ops are anyway not under
> your control (or maybe they are for you!).
>
>

--
James Harris


Bart

Nov 6, 2022, 9:58:06 AM
On 06/11/2022 08:36, James Harris wrote:
> On 05/11/2022 22:10, Bart wrote:

>> If so, why? I just use C's `fread()` to read an entire file at once;
>> job done. If all the 100s of apps on my PC all used the same slow file
>> read methods, the machine would grind to a halt.
>
> If your parser need to backtrack in the source that makes sense. But as
> shown in my prior post there's no appreciable speed advantage.


Backtracking isn't the reason. In a fast tokeniser, you want to traverse
source code by simply incrementing a pointer, for example:

doswitch lxsptr++^
when 'a'..'z', '_', '$' then
...

What you don't want is to have to call into the OS for every character.

A basic lexer (recognising tokens but not doing identifier lookups) can
get through some 300Mcps on my machine.

But a loop scanning a file with fgetc(), and nothing else, manages only
60Mcps. And that's my apparently fast fgetc.

Here's that test program that manages 300Mcps on my machine:

https://github.com/sal55/langs/blob/master/clex.c

This is machine generated from my language, and ought to work on
Linux. (However, if your Linux doesn't use 1000 ticks per second for
clock(), the figures shown will be out by a factor of 1000; this should
be obvious; maybe fix it on line 1000.)

Instructions to build and run are at the top. While ostensibly for C
syntax, it'll probably cope with anything including your parser code. If
it does, prepare a version which is 100 or 1000 times bigger (just
duplicate).



>>
>>
>>> Perhaps more interesting is where an individual compiler spends its
>>> time: how long in lexing, statement parsing, expression parsing, IR
>>> generation, IR alterations, optimisation, code gen, etc. At some
>>> point I may add code to gather such info.
>>
>>
>> Here are some figures from a recent test (any smaller input, is too
>> fast to measure accurately):
>>
>> Load modules    0 msec  Load all sources (here, 9MB mostly in 1 file)
>
> Are you sure that's not mapping the file into virtual memory rather than
> read it?

It's just doing fread(). However this is not the correct load time. It
excludes loading the project file for the program, which is the lead
module, usually very small. But here, the lead module is the primary 9MB
source file.

But even properly timed, I still usually get a time of 0ms (resolution
is 8/16msec anyway). After all, this read mainly comes down to copying
9MB of data from one memory buffer to another; how long can that take?

If I create a test file which is 10 times the size (90MB) then it takes
30msec. This is 5000 times the size of your parser module. As I said,
loading should take no time at all.

(I like to exclude file load and write times, firstly because I'm timing
my compiler, not OS operations. But also because the compiler can be
invoked on source code that is already in memory, as part of some
resident application, and it may generate code in memory to be run
directly. But as it happens these are tiny overheads anyway.)


James Harris

Nov 9, 2022, 6:54:14 AM
On Sunday, 6 November 2022 at 14:58:06 UTC, Bart wrote:
> On 06/11/2022 08:36, James Harris wrote:
> > On 05/11/2022 22:10, Bart wrote:
>
> >> If so, why? I just use C's `fread()` to read an entire file at once;
> >> job done. If all the 100s of apps on my PC all used the same slow file
> >> read methods, the machine would grind to a halt.

I should say I want my compiler to be able to run on a range of hardware so I don't want to assume that the memory size will always be greater than the source size - though I can see definite advantages to your approach, especially in terms of speed or if the hardware lacks paging.

> >
> > If your parser need to backtrack in the source that makes sense. But as
> > shown in my prior post there's no appreciable speed advantage.
> Backtracking isn't the reason. In a fast tokeniser, you want to traverse
> source code by simply incrementing a pointer, for example:
>
> doswitch lxsptr++^
> when 'a'..'z', '_', '$' then
> ...

Understood. Presumably you find the beginning and end of the token and then, where appropriate, copy it from there to a symbol table (or have the symbol table point at it where it stands).

I am assuming from our other discussion that lxsptr++^ retrieves the character which lxsptr points at and then advances lxsptr to point to the next character position.

>
> What you don't want is to have to call into the OS for every character.

For sure. If each syscall adds 2 or 3 microseconds, say, to the work required they can soon add up.

>
> A basic lexer (recognising tokens but not doing identifier lookups) can
> get through some 300Mcps on my machine.

Surely such an algorithm may come down to little more than how fast the machine can scan memory. To make a good test of a compiler's performance a load of other things would need to be included in the source it had to compile. Such things as many identifiers of different lengths, symtab insertions, retrievals, scope creation and destruction, different constructs: loops, ifs, switches, exceptions, etc. It would make sense also to have plenty of inefficient source code for the optimiser to do its magic on.

Such source is probably best generated by a program - which would be satisfyingly contrarian. ;)

--
James


Bart

Nov 9, 2022, 9:12:33 AM
On 09/11/2022 11:54, James Harris wrote:
> On Sunday, 6 November 2022 at 14:58:06 UTC, Bart wrote:
>> On 06/11/2022 08:36, James Harris wrote:
>>> On 05/11/2022 22:10, Bart wrote:
>>
>>>> If so, why? I just use C's `fread()` to read an entire file at once;
>>>> job done. If all the 100s of apps on my PC all used the same slow file
>>>> read methods, the machine would grind to a halt.
>
> I should say I want my compiler to be able to run on a range of hardware so I don't want to assume that the memory size will always be greater than the source size - though I can see definite advantages to your approach, especially in terms of speed or if the hardware lacks paging.

I think that these days everyone runs their compilers on decent
hardware, like PCs with plenty of ram and storage. Even if the target is
some embedded device.

Even the original Raspberry Pi had some 0.5GB IIRC.

I'm not planning to run my tools on any 64KB or smaller system anytime
soon. I passed that stage decades ago.

So, what is the largest source file you're ever likely to have to deal
with? Even a 1Mloc file will only be 30-40MB, barely 1% of a typical
PC's memory.

The largest input I deal with is 50,000 lines which represents all the
source files of the application, for my /whole-program/ compiler.

This totals approx 1MB, or 1/8000th of my machine's RAM. So it would be
ludicrous to start using tiny buffers on the off-chance that I would one
day have to deal with inputs 1000s of times bigger representing one
single program.

Also bear in mind that a smaller target usually means a correspondingly
smaller program.

> Understood. Presumably you find the beginning and end of the token and then, where appropriate, copy it from there to a symbol table (or have the symbol table point at it where it stands).
>
> I am assuming from our other discussion that lxsptr++^ retrieves the character which lxsptr points at and then advances lxsptr to point to the next character position.

Yes.

>>
>> What you don't want is to have to call into the OS for every character.
>
> For sure. If each syscall adds 2 or 3 microseconds, say, to the work required they can soon add up.
>
>>
>> A basic lexer (recognising tokens but not doing identifier lookups) can
>> get through some 300Mcps on my machine.
>
> Surely such an algorithm may come down to little more than how fast the machine can scan memory.

The 300Mcps figure is how many characters it gets through while it's
only doing tokenising. The actual figure was 200-360Mcps during basic
tokenising, depending on input (sqlite3.c is 40% comments so that gives
a faster throughput).

A loop that only scans characters (say adding up their values to ensure
it's not optimised out) manages up to 2500Mcps (here it's independent of
the style of content).

My point was that if you introduce a bottleneck like a complex OS call
per-character, it can have a significant impact, one that is trivially
avoided.

> To make a good test of a compiler's performance a load of other things would need to be included in the source it had to compile. Such things as many identifiers of different lengths, symtab insertions, retrievals, scope creation and destruction, different constructs: loops, ifs, switches, exceptions, etc. It would make sense also to have plenty of inefficient source code for the optimiser to do its magic on.
>
> Such source is probably best generated by a program - which would be satisfyingly contrarian. ;)

Such source is best taken from real programs, as synthesised ones will
never have that variety. Machine-generated input is best for development
as you can then concentrate on specific token types, or short vs long
names, or random vs similar identifiers, that sort of thing.

(Have a look at sqlite3.c here:
https://github.com/sal55/langs/blob/master/Parsing/sqlite3.c. This is
quite brutal code for a C compiler. Also
https://github.com/sal55/langs/blob/master/fann4.zip shows my own
740Kloc test input.)

David Brown

Nov 9, 2022, 9:44:59 AM
I agree with everything Bart says here. It has been perhaps 50 years
since it has been necessary to assume a compiler might not be able to
handle a whole file at a time. (And that means having enough memory to
do the compilation, optimisation and generation of the object file - not
just hold the source code.)

For smaller targets, people use bigger development systems - again, for
the last 50 years or so. Cross-compilation is the norm for small targets.

Memory space on build machines only matters when you have huge programs,
and even then it is usually linking rather than pure compilation (and
especially link-time optimisation) that you need a more powerful build
system. So if you are building Firefox or LibreOffice, with link-time
optimisation, you'll want a system with as many cores and as much ram as
you can afford. For more "normal" programs, you can happily assume the
hosting computer has unlimited memory.

Bart

Nov 10, 2022, 5:56:01 AM
On 09/11/2022 11:54, James Harris wrote:

> Understood. Presumably you find the beginning and end of the token and then, where appropriate, copy it from there to a symbol table (or have the symbol table point at it where it stands).

I did use this latter idea for a while: the ST points into the source
code. But it had too many problems:

* The rest of the program likes zero-terminated strings, but I can't
inject a zero here, because it could overwrite part of the next token. I
had a scheme where it kept a token in hand, or used some convoluted
logic to get around it, but it became too much trouble.

* It means the source text is modified, either because, due to
case-insensitivity, `Abc` is changed to `abc`, or also because there
could be an injected 0 byte. In either case this caused problems (for
example in reporting an error and displaying that line of source code,
and in some cases where the same source has to be retokenised).

In the end I took the hit and copied the identifier onto the heap. The
compiler is still plenty fast.

There are similar, and more severe, issues with:

* Scanning numeric constants with separators. I liked to remove the
separators to form a compact token (so `1_234_567` becomes `1234567`),
to simplify later conversion to its value (you don't always know while
scanning whether it's int or float for example). But this really messes
up the source code if done in-place.

* Scanning string constants with embedded escape sequences. I wanted the
final string, with escapes converted, to be in the same place. (The new
string will never be longer than the original - I think).
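Both of those steps can be sketched with a scratch copy instead of in-place modification (Python for brevity; the escape set is illustrative, not the actual language's). The escape decoder also shows why the result can never be longer than the input: each multi-character escape shrinks to one byte.

```python
def compact_number(token):
    """Strip digit-group separators: '1_234_567' -> '1234567'.
    Done before deciding whether the token is an int or a float."""
    return token.replace("_", "")

# Illustrative escape set, not any particular language's.
ESCAPES = {ord("n"): 0x0A, ord("t"): 0x09, ord("\\"): 0x5C, ord('"'): 0x22}

def decode_escapes(buf, start, end):
    """Decode backslash escapes in buf[start:end] (a bytearray),
    writing the result back over the same region; returns the new end.
    The write index can never overtake the read index, so the decoded
    text is never longer than the original."""
    w = r = start
    while r < end:
        c = buf[r]
        if c == 0x5C and r + 1 < end:   # backslash: consume the escape
            r += 1
            c = ESCAPES.get(buf[r], buf[r])
        buf[w] = c
        w += 1
        r += 1
    return w
```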


anti...@math.uni.wroc.pl

Nov 10, 2022, 1:45:27 PM
James Harris <james.h...@gmail.com> wrote:
>
> time dd if=infile of=/dev/null bs=1
> time dd if=infile of=/dev/null bs=512
>
> The first takes 370 ms. The second just 5ms. QED, I think. :)

I was going to write that this is too much, system calls should
be much faster. But then I checked and on two machines I got
timings roughly agreeing with yours (almost the same time on
a slow one and about half on a rather fast one). But I also
checked on third and I got much lower time: 24ms for 100k file.

The third machine runs old OS, two first have newer system. So
it looks that we see security patches in action. The claim was
that slowdown of "typical" programs will be moderate, but 'bs=1'
is atypical and gives some idea of how bad it can get.

--
Waldek Hebisch

anti...@math.uni.wroc.pl

Nov 10, 2022, 2:20:36 PM
Bart <b...@freeuk.com> wrote:
> >> On 06/11/2022 08:36, James Harris wrote:
> >>> On 05/11/2022 22:10, Bart wrote:
> >>
> >>>> If so, why? I just use C's `fread()` to read an entire file at once;
> >>>> job done. If all the 100s of apps on my PC all used the same slow file
> >>>> read methods, the machine would grind to a halt.
> >
> > I should say I want my compiler to be able to run on a range of hardware so I don't want to assume that the memory size will always be greater than the source size - though I can see definite advantages to your approach, especially in terms of speed or if the hardware lacks paging.
>
> I think that these days everyone runs their compilers on decent
> hardware, like PCs with plenty of ram and storage. Even if the target is
> some embedded device.
>
> Even the original Raspberry Pi had some 0.5GB IIRC.
>
> I'm not planning to run my tools on any 64KB or smaller system anytime
> soon. I passed that stage decades ago.
>
> So, what is the largest source file you're ever likely to have to deal
> with? Even a 1Mloc file will only be 30-40MB, barely 1% of a typical
> PC's memory.
>
> The largest input I deal with is 50,000 lines which represents all the
> source files of the application, for my /whole-program/ compiler.
>
> This totals approx 1MB, or 1/8000th of my machine's RAM. So it would be
> ludicrous to start using tiny buffers on the off-chance that I would one
> day have to deal with inputs 1000s of times bigger representing one
> single program.

You seem to care about speed. In the past, buffers on the order of 4k
should theoretically have given the best performance: the buffer was
sized to be significantly smaller than the L1 cache, so that reads
would not disturb too much of the L1 cache's contents. In the modern
security climate it seems that the OS takes any possible pretext to
flush caches, defeating the benefit of small buffers.
Still, using small buffers is easy, so why use big ones if
they do not give measurable advantages?

--
Waldek Hebisch

Bart

Nov 10, 2022, 3:54:13 PM
Because it's extra complexity for no benefit. Lexing code will have to
deal with the possibility that a token is split across two buffers which
means checking that on every character, which is much more likely to
slow things down.
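To make the objection concrete, here is a sketch (not anyone's actual lexer; the a-z token set and tiny buffer size are invented) of what buffered lexing forces on you: the identifier scan must test the buffer end on every character and carry partial tokens across refills.

```python
import io

BUFSIZE = 8  # absurdly small, to force tokens across buffer boundaries

def lex_idents(f):
    """Yield a-z identifier tokens from a binary stream read in small
    buffers; partial tokens are carried across refills."""
    partial = b""
    while True:
        buf = f.read(BUFSIZE)
        if not buf:
            if partial:
                yield partial.decode()
            return
        i = 0
        while i < len(buf):
            c = buf[i]
            if 97 <= c <= 122:           # a-z: inside an identifier
                j = i
                while j < len(buf) and 97 <= buf[j] <= 122:
                    j += 1               # the extra bounds check, every char
                partial += buf[i:j]
                i = j
                if i < len(buf):         # token ended inside this buffer
                    yield partial.decode()
                    partial = b""
                # else: token may continue in the next buffer
            else:
                if partial:
                    yield partial.decode()
                    partial = b""
                i += 1
```

With whole-file input the `partial` bookkeeping and the per-character
`j < len(buf)` test simply disappear.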

I remember /having/ to use such techniques over 30 years ago, because
memory was limited. Now there is tons of memory, but we still have to
access the file system using 4KB buffers within our apps?

The largest source file I have is 80KB, which is 0.001% of the memory on
my PC already. Now you say I should access that 80KB in 4KB chunks; 4KB
is 0.00005% of my machine's 8GB. It seems that it was pretty pointless
having all that memory, as I'm not allowed to use it!

Sorry, but I can't see the point. This would only affect loadtime
(getting a file's contents into /my/ buffer), which is already too fast
to measure: about 30ms to read 90MB, so perhaps 30us to read my 80KB in
one go.

anti...@math.uni.wroc.pl

Nov 10, 2022, 9:32:52 PM
Well, with a reasonable language and compiler organisation the cost is
very low: one puts an otherwise-invalid character after the buffer.
If one uses a dispatch table, handling the invalid character is just one
more entry in the table, with almost no cost for processing
normal characters.
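Here is that sentinel idea sketched out (Python standing in for the real thing; the a-z-only "language" is invented). The appended '\0' byte is invalid in the language, so the inner run loop needs no bounds test: the sentinel's table entry terminates it.

```python
def tokenise(src):
    """Lex a-z identifier runs from src (bytes) using a sentinel:
    an invalid byte is appended so the scan loop needs no bounds check."""
    buf = bytearray(src) + b"\x00"    # sentinel: invalid in the language
    is_ident = [False] * 256          # the dispatch table
    for c in range(ord("a"), ord("z") + 1):
        is_ident[c] = True            # '\0' stays False: one more entry
    tokens = []
    i = 0
    n = len(src)
    while i < n:
        if is_ident[buf[i]]:
            j = i
            while is_ident[buf[j]]:   # no j < n test: the sentinel stops us
                j += 1
            tokens.append(buf[i:j].decode())
            i = j
        else:
            i += 1
    return tokens
```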

> I remember /having/ to use such techniques over 30 years ago, because
> memory was limited. Now there is tons of memory, but we still have to
> access the file system using 4KB buffers within our apps?
>
> The largest source file I have is 80KB, which is 0.001% of the memory on
> my PC already. Now you say I should access that 80KB in 4KB chunks; 4KB
> is 0.00005% of my machine's 8GB. It seems that it was pretty pointless
> having all that memory, as I'm not allowed to use it!

Well, your file is already in memory (in the system cache), so you
are using memory anyway. And there are other good uses for it.

> Sorry, but I can't see the point. This would only affect loadtime
> (getting a file's contents into /my/ buffer), which is already too fast
> to measure: about 30ms to read 90MB, so perhaps 30us to read my 80KB in
> one go.

Well, you should be used to the fact that other folks do not see
the point of what you are doing. So do not worry; points of
view differ...

--
Waldek Hebisch

Bart

Nov 11, 2022, 6:57:07 AM
Which is currently seen as the end of a token. One technique I use is to first
identify the span or length of a 'long' token (name, number, string) and
then to deal separately with that substring (copy to ST, turn into a
number, etc).
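That two-step technique, sketched minimally (the identifier syntax is assumed, not taken from any particular language): a tight loop finds the token's span, and only then is the single substring handed on to the symbol table or number converter.

```python
def scan_ident(src, i):
    """Return (token, next_index): first find the extent of the
    identifier starting at src[i], then slice it out in one go."""
    j = i
    while j < len(src) and (src[j].isalnum() or src[j] == "_"):
        j += 1
    return src[i:j], j
```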

If I had to use buffering, I'm sure I could find a way of making it work
efficiently (I must have done in the past), but it will be more effort
while not being as efficient or as simple.

It's convenient these days to treat small files (say files under 100MB)
as in-memory strings.

This is from my scripting language:

s:=readstrfile("/mx/big/fann4.m")
println s.len

It reads that same 9MB test file (here already cached). It takes 60ms to
load that file and create a string which is 9M characters long.

> Well, your file is already in memory (in system cache), so you
> are using memory. And there are other good uses.

Yes, in most cases the file is already somewhere in memory, just outside
my program. Because:

* I've just been editing it
* It has already been compiled in the last minute or so
* It has just been downloaded, copied etc.

So the cost I'm seeing is that of transferring that data (up to 80KB per
file) from wherever the OS is keeping it, into my memory space.

I've just restarted my machine. The load time for a 9MB input is:

* 13/31ms from SSD (two different copies; I don't know the reason
they're that different)
* 185ms from HD

Second time around, load times were 0ms for both (or nearer to 0 than to
8 or 16ms)

This file is about 100 times bigger than my biggest single module. I
don't imagine that reading a file buffered would make any difference.



James Harris

Nov 13, 2022, 10:48:18 AM
On 09/11/2022 14:12, Bart wrote:
> On 09/11/2022 11:54, James Harris wrote:
>> On Sunday, 6 November 2022 at 14:58:06 UTC, Bart wrote:
>>> On 06/11/2022 08:36, James Harris wrote:
>>>> On 05/11/2022 22:10, Bart wrote:
>>>
>>>>> If so, why? I just use C's `fread()` to read an entire file at once;
>>>>> job done. If all the 100s of apps on my PC all used the same slow file
>>>>> read methods, the machine would grind to a halt.
>>
>> I should say I want my compiler to be able to run on a range of
>> hardware so I don't want to assume that the memory size will always be
>> greater than the source size - though I can see definite advantages to
>> your approach, especially in terms of speed or if the hardware lacks
>> paging.
>
> I think that these days everyone runs their compilers on decent
> hardware, like PCs with plenty of ram and storage. Even if the target is
> some embedded device.

When I started this I took your view - that compilations (including
cross compilations) would run on normal PCs. That meant that compilers
and other build programs would have plenty of resources to play with.
However, someone pointed out to me that an application which is running
on a target machine may also want to execute a compile step. I don't
imagine it would be used much but it's a fair point. Therefore, in my
case, it makes sense to use only resources which are needed. A program
which works on a small machine will still work on a full-blown PC
whereas the converse is not necessarily true.

...

>> To make a good test of a compiler's performance a load of other things
>> would need to be included in the source it had to compile. Such things
>> as many identifiers of different lengths, symtab insertions,
>> retrievals, scope creation and destruction, different constructs:
>> loops, ifs, switches, exceptions, etc. It would make sense also to
>> have plenty of inefficient source code for the optimiser to do its
>> magic on.
>>
>> Such source is probably best generated by a program - which would be
>> satisfyingly contrarian. ;)
>
> Such source is best taken from real programs, as synthesised ones will
> never have that variety. Machine-generated input is best for development
> as you can then concentrate on specific token types, or short vs long
> names, or random vs similar identifiers, that sort of thing.

Well, I suspect that a human-written program may be too small for an
indicative speed test. The time taken to compile such source would
likely be overshadowed by overheads.


--
James Harris


Bart

Nov 13, 2022, 4:29:00 PM
So the requirement when writing any cross-compiler is that the compiler
should be able to self-host on the same target?

I'm sceptical as to how useful and how practical that will be.

Because it's not just about the source file. You may be able to reduce
memory requirements for scanning the source file to 4KB, but all the
other data structures still need to represent the whole file.

In my compiler, all those data structures, if the memory is never freed
when no longer needed, add up to approx 30 times the size of the
source code. About 10 times if considering only the AST and symbol and
type tables.

Now you could structure the compiler so that all those outputs are
written out to files too. Then, congratulations, you've written a 1970s
compiler.

But consider also that the target machine might not have a file system,
or if it has, it's too small.

> Well, I suspect that a human-written program may be too small for an
> indicative speed test. The time taken to compile such source would
> likely be overshadowed by overheads.

The SQLite example is some 250Kloc of actual code in one file.

My 'Fann' example is 50-100 lines (depending on language) which is
duplicated for a large line count. So it's semi-synthesised, but look at
any lines at random however, and it looks like real code.

James Harris

Nov 13, 2022, 5:05:22 PM
On 13/11/2022 21:28, Bart wrote:
> On 13/11/2022 15:48, James Harris wrote:

...

>> When I started this I took your view - that compilations (including
>> cross compilations) would run on normal PCs. That meant that compilers
>> and other build programs would have plenty of resources to play with.
>> However, someone pointed out to me that an application which is
>> running on a target machine may also want to execute a compile step. I
>> don't imagine it would be used much but it's a fair point. Therefore,
>> in my case, it makes sense to use only resources which are needed. A
>> program which works on a small machine will still work on a full-blown
>> PC whereas the converse is not necessarily true.
>
> So the requirement when writing any cross-compiler is that the compiler
> should be able to self-host on the same target?

No, the goal is just to use resources as required! See below.

I could not quite work out what you meant about self-hosting on the same
target, and while I used the common term I don't have a specific concept
of a cross compiler. My build process is intended to be:

source to IR
IR adjustments to suit target
IR to target

The latter part would be target specific.

>
> I'm sceptical as to how useful and how practical that will be.
>
> Because, it's not just about the source file. You may be able to reduce
> memory requirements for scanning the source file to 4KB, but all the
> other data structures still need to represent the whole file.

Sure.

>
> In my compiler, all those data structures, if the memory is never freed
> when no longer needed, add up to approx 30 times the size of the
> source code. About 10 times if considering only the AST and symbol and
> type tables.

That's interesting. How large would your tree and your composite symbol
table typically be compared with the source?

>
> Now you could structure the compiler so that all those outputs are
> written out to files too. Then, congratulations, you've written a 1970s
> compiler.

As it happens, my IR and ST are intended to be written out! They are
meant to be the primary means of program distribution, though I am not
doing that at the moment.

>
> But consider also that the target machine might not have a file system,
> or if it has, it's too small.

I wouldn't see a file system as essential but the executable program has
to be got into the target machine's memory somehow. Maybe you are
thinking of some environments which I would find difficult to compile for?


--
James Harris


Bart

Nov 13, 2022, 7:23:21 PM
On 13/11/2022 22:05, James Harris wrote:
> On 13/11/2022 21:28, Bart wrote:

>> In my compiler, all those data structures, if the memory is never
>> freed when no longer needed, add to up to approx 30 times the size of
>> the source code. About 10 times if considering only the AST and symbol
>> and type tables.
>
> That's interesting. How large would your tree and your composite symbol
> table typically be compared with the source?

I'd expect the AST to take more memory than the source code. For example
this line:

a:=b+c*d

takes 9/10 bytes.

But this is represented as 7 AST nodes:

(assign a (add b (mul c d)))

Each AST node in my static compiler is 64 bytes (this is on x64 where
pointers are 8 bytes). So that's 448 bytes. ST entries are less relevant
here as they are shared (there are only 4 in all).

So, 20-40 times more than the source code for this example. My tests
showed 10x, because a lot of the source doesn't generate ASTs
(declarations, comments) and involves more white space, plus usually
there are keywords and longer identifiers.
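The node arithmetic can be checked with a toy count, tuple-trees standing in for real nodes (the 64-byte node size is Bart's figure, not something this sketch measures):

```python
NODE_BYTES = 64  # size of one AST node in the x64 compiler described above

def count_nodes(ast):
    """Count nodes in a tuple-tree such as
    ('assign', 'a', ('add', 'b', ('mul', 'c', 'd')))."""
    if isinstance(ast, tuple):
        # the operator itself is one node, plus its operands
        return 1 + sum(count_nodes(child) for child in ast[1:])
    return 1  # leaf: a name or constant

tree = ("assign", "a", ("add", "b", ("mul", "c", "d")))
```

Seven nodes at 64 bytes each gives the 448 bytes quoted for the
9/10-byte source line.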

The next data structure is a representation of x64 code using a list of
records. There is one per x64 instruction and occupies 32 bytes. My
expression generates 5 instructions, so that's a further 160 bytes.

The actual machine code generated in the next stage is some 20 bytes,
when a, b, c, d are stack-frame variables.

This probably comes across as excessive, but long repetitions of
`a:=b+c*d` were one of my early tests, and a lot of compilers have severe
problems with them, including running out of memory.

My compiler takes 1.5GB for 2M lines of this, of which the above
accounts for 1.3GB. Some compilers had trouble getting beyond even 20K
lines.

The source in this case is 18MB, which is dwarfed by the 1500MB memory
usage. My compiler turns that 18MB source into a 40MB EXE in about 2
seconds.

Tom Lake

Nov 24, 2022, 9:34:18 AM
So you don't care about how fast the generated code runs, instead focusing on how fast the program compiles?
That seems irrelevant to me, since you're going to compile only once or a few times to debug it but run the program many times (if it's useful).

Bart

Nov 24, 2022, 9:53:15 AM
On 24/11/2022 14:34, Tom Lake wrote:
> On Saturday, November 5, 2022 at 11:03:47 AM UTC-4, James Harris wrote:
>> Any thoughts on what would be in a good piece of source code to use to
>> test a compiler's speed?
>>
>> I guess I could do with something that's (1) realistic, not just an
>> assignment statement repeated over and over again, (2) large enough to
>> give lengthy timings, (3) easy enough to make, and (4) it would be valid
>> over multiple generations of the compiler. But what should be in it?
>>
>> Here are some comparative timings taken just now. They read and compile
>> my parser (as it's the largest source file in the compiler). It is about
>> 18k in size and from it the compiler produces am asm file of a little
>> over 100k.
>>
>> cda 566 ms
>> cdb 540 ms
>> cdc 600 ms
>> cdd 24 ms
>>
>> The reason for the gratifying jump in performance at the end is that I
>> added input and output buffering to cdd but it's got me wondering about
>> testing the times taken by future compilers.
>>
>> I should say that I don't expect compile times always to improve. The
>> run times of later compilers would likely go up as I add facilities and
>> down as I switch to other mechanisms. But it would still be something
>> I'd like to keep an eye on.


> So you don't care about how fast the generated code runs, instead focusing on how fast the program compiles?
> That seems irrelevant to me since you're going to compile only once or a few times to debug it but run the program many times (if it's useful)

I can run my compiler hundreds of times a day, so fast edit-run cycles
are vital during intensive development (and I am very impatient, and I
don't like losing concentration).

The speed of the generated code is actually less important: for my
stuff, unoptimised code might only be half the speed of optimised, but
during development is largely irrelevant anyway. For production versions
of working programs, there are ways to get those faster if necessary.

However, compilation speed can easily vary by 100:1 between fast and
slow compilers (on one test, but for diverse languages, I estimated
80,000:1). So, yes, a slow compiler could do with speeding up.

Complicated languages can hinder that, but I believe the language here,
like mine, is straightforward to compile.

As a matter of interest, how long do you typically wait on build-time
for a project? (And for what sort of line-count.) How long if
incremental compilation tricks are taken out of play, and a full build
is needed?