
Dec 20, 2023, 1:08:54 PM12/20/23

to

Vir Campestris <vir.cam...@invalid.invalid> wrote:

> This is not the right group for this - but I don't know where is.

> Suggestions on a postcard please...

I'm crossposting this to comp.arch, where they may have some ideas.

> For reasons I won't go into I've been writing some code to evaluate

> memory performance on my AMD Ryzen 5 3400G.

>

> It says in the stuff I've found that each core has an 8-way set

> associative L1 data cache of 128k (and an L1 instruction cache); an L2

> cache of 512k, also set associative; and there's an L3 cache of 4MB.

>

> To measure the performance I have three nested loops.

>

> The innermost one goes around a loop incrementing a series of 64 bit

> memory locations. The length of the series is set by the outermost loop.

>

> The middle one repeats the innermost loop so that the number of memory

> accesses is constant regardless of the series length.

>

> The outermost one sets the series length. It starts at 1, and doubles it

> each time.

>

> I _thought_ what would happen is that as I increase the length of the

> series after a while the data won't fit in the cache, and I'll see a

> sudden slowdown.

>

> What I actually see is:

>

> With a series length of 56 to 128 bytes I get the highest speed.

>

> With a series length of 500B to 1.5MB, I get a consistent speed of about

> 2/3 the highest speed.

>

> Once the series length exceeds 1.5MB the speed drops, and is consistent

> from then on. That I can see is main memory speed, and is about 40% of

> the highest.

>

> OK so far.

>

> But...

> Series length 8B is about the same as the 56 to 128 speed. Series length

> 16B is a bit less. Series length 32 is a lot less. Not as slow as main

> memory, but not much more than half the peak speed. My next step up is

> the peak speed. Series length 144 to 448 is slower still - slower in

> fact than the main memory speed.

>

> WTF?

>

> I can post the code (C++, but not very complex) if that would help.

For 'series length 8B/16B/32B' do you mean 8 bytes? ie 8B is a single 64

bit word transferred?

What instruction sequences are being generated for the 8/16/32/64 byte

loops? I'm wondering if the compiler is using different instructions,

eg using MMX, SSE, AVX to do the operations. Maybe they are having

different caching behaviour?

It would help if you could tell us the compiler and platform you're using,

including version.
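One way to check that, assuming the source is saved as cache.cpp and built as ./cache (the names used later in the thread), is to look at the generated code directly, e.g.:

$ g++ -O2 -S cache.cpp -o cache.s    # assembly listing of the inner loops
$ objdump -d ./cache | less          # or disassemble the built binary

and see whether the inner loop is a plain 64-bit add/store or SSE/AVX loads and stores.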

Theo


Dec 20, 2023, 2:00:40 PM12/20/23

to

Can we see the code ??

Can you present a table of the timing results ??

Dec 21, 2023, 9:11:18 AM12/21/23

to

On 20/12/2023 18:08, Theo wrote:

> Vir Campestris <vir.cam...@invalid.invalid> wrote:

>> This is not the right group for this - but I don't know where is.

>> Suggestions on a postcard please...

>

> I'm crossposting this to comp.arch, where they may have some ideas.

>

<snip>

>

> For 'series length 8B/16B/32B' do you mean 8 bytes? ie 8B is a single 64

> bit word transferred?

>

Yes. My system has a 64 bit CPU and 64 bit main memory.

> What instruction sequences are being generated for the 8/16/32/64 byte

> loops? I'm wondering if the compiler is using different instructions,

> eg using MMX, SSE, AVX to do the operations. Maybe they are having

> different caching behaviour?

>

the loop sizes.

> It would help if you could tell us the compiler and platform you're using,

> including version.

>

Copyright (C) 2021 Free Software Foundation, Inc.

This is free software; see the source for copying conditions. There is NO

warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Which of course tells you I'm running Ubuntu!

On 20/12/2023 18:58, MitchAlsup wrote:

>

> Can we see the code ??

>

> Can you present a table of the timing results ??

I've run this with more detailed increments on the line size, but here are my results for powers of 2.

Size 1 gave 3.82242e+09 bytes/second.

Size 2 gave 3.80533e+09 bytes/second.

Size 4 gave 2.68017e+09 bytes/second.

Size 8 gave 2.33751e+09 bytes/second.

Size 16 gave 2.18424e+09 bytes/second.

Size 32 gave 2.10243e+09 bytes/second.

Size 64 gave 1.99371e+09 bytes/second.

Size 128 gave 1.98475e+09 bytes/second.

Size 256 gave 2.01653e+09 bytes/second.

Size 512 gave 2.00884e+09 bytes/second.

Size 1024 gave 2.02713e+09 bytes/second.

Size 2048 gave 2.01803e+09 bytes/second.

Size 4096 gave 3.26472e+09 bytes/second.

Size 8192 gave 3.85126e+09 bytes/second.

Size 16384 gave 3.85377e+09 bytes/second.

Size 32768 gave 3.85293e+09 bytes/second.

Size 65536 gave 2.06793e+09 bytes/second.

Size 131072 gave 2.06845e+09 bytes/second.

The run continues with larger sizes, but the results are roughly stable from there on.

The code I have put in a signature block; there's no point in risking

someone posting it again. I've commented it, but no doubt not in all the

right places! I'd be interested to know what results other people get.

Thanks

Andy

--

#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
    // If your computer is much slower or faster than mine
    // you may need to adjust this value.
    constexpr size_t NextCount = 1 << 28;
    std::vector<uint64_t> CacheStore(NextCount, 0);

    // Get a raw pointer to the vector.
    // On my machine (Ubuntu, g++) this improves performance.
    // Using vector's operator[] results in a function call.
    uint64_t *CachePtr = &CacheStore[0];

    // SetSize is the count of the uint64_t items to be tested.
    // I assume that when this is too big for a cache the data
    // will overflow to the next level.
    // Each loop it doubles in size. I've run it with smaller
    // increments too, and the behaviour is still confusing.
    for (size_t SetSize = 1; SetSize < NextCount; SetSize <<= 1)
    {
        size_t loopcount = 0;
        size_t j = NextCount / SetSize;
        auto start = std::chrono::steady_clock::now();

        // The outer loop repeats enough times so that the data
        // written by the inner loops of various sizes is
        // approximately constant.
        for (size_t k = 0; k < j; ++k)
        {
            // The inner loop modifies data within a set of words.
            for (size_t l = 0; l < SetSize; ++l)
            {
                // Read-modify-write some data. Comment this out
                // to confirm that the looping is not the cause
                // of the anomaly.
                ++CachePtr[l];

                // This counts the actual number of memory accesses.
                // Rounding errors mean that for different SetSize
                // values the count is not completely consistent.
                ++loopcount;
            }
        }

        // Work out how long the loops took in microseconds,
        // then scale to seconds.
        auto delta =
            std::chrono::duration_cast<std::chrono::microseconds>
                (std::chrono::steady_clock::now() - start).count()
            / 1e6;

        // Calculate how many bytes per second, and print.
        std::cout << "Size " << SetSize << " gave "
                  << (double)loopcount * (double)sizeof(uint64_t) / delta
                  << " bytes/second." << std::endl;
    }
    return 0;
}

Dec 21, 2023, 9:56:49 AM12/21/23

to

Vir Campestris <vir.cam...@invalid.invalid> writes:

>On 20/12/2023 18:08, Theo wrote:

>> Vir Campestris <vir.cam...@invalid.invalid> wrote:

>>> This is not the right group for this - but I don't know where is.

>>> Suggestions on a postcard please...

>>

>> I'm crossposting this to comp.arch, where they may have some ideas.

>>

><snip>

>>

>> For 'series length 8B/16B/32B' do you mean 8 bytes? ie 8B is a single 64

>> bit word transferred?

>>

>Yes. My system has a 64 bit CPU and 64 bit main memory.

Surely your main memory is just a sequence of 8-bit bytes (+ECC).

Dec 21, 2023, 10:32:15 AM12/21/23

to

AIUI I have two DIMMs, each of which has a 32-bit bus. I'm not up on
hardware these days, but it used to be if you wanted to write a byte
into your 16-bit memory you had to read both bytes, then write one back.

And

Dec 21, 2023, 11:09:38 AM12/21/23

to

DIMMs have 64-bit data buses. Both these days and previous millennium.

Now, these days DDR5 DIMM splits 64-bit data bus into pair of
independent 32-bit channels, but the total is still 64 bits.

That's a physical perspective. From logical perspective, DDR3 and DDR4

bits exchange data with controller in 512-bit chunks. On DDR5 DIMM each

channel exchanges data with controller in 512-bit chunks.

From signaling perspective, it is still possible (at least on non-ECC

gear) to tell to memory to write just 8 bits out of 512 and to ignore

the rest. In PC-class hardware this ability is used very rarely if used

at all.

Dec 21, 2023, 1:05:24 PM12/21/23

to

Vir Campestris wrote:

> Scott Lurndal wrote:

>

>> Surely your main memory is just a sequence of 8-bit bytes (+ECC).

>

> AIUI I have two DIMMs, each of which has a 32-bit bus. I'm not up on

> hardware these days, but it used to be if you wanted to write a byte

> into your 16-bit memory you had to read both bytes, then write one back.

I thought intel x64 machines work in cache lines of 64 bytes per memory

transaction ... maybe AMD processors are different?


Dec 21, 2023, 1:32:03 PM12/21/23

to

Vir Campestris wrote:

> On 20/12/2023 18:58, MitchAlsup wrote:

> >

> > Can we see the code ??

> >

> > Can you present a table of the timing results ??

> I've run this with more detailed increments on the line size, but here

> are my results for powers of 2.

{ A

> Size 1 gave 3.82242e+09 bytes/second.

> Size 2 gave 3.80533e+09 bytes/second.

> Size 4 gave 2.68017e+09 bytes/second.

> Size 8 gave 2.33751e+09 bytes/second.

> Size 16 gave 2.18424e+09 bytes/second.

> Size 32 gave 2.10243e+09 bytes/second.

> Size 64 gave 1.99371e+09 bytes/second.

> Size 128 gave 1.98475e+09 bytes/second.

> Size 256 gave 2.01653e+09 bytes/second.

> Size 512 gave 2.00884e+09 bytes/second.

> Size 1024 gave 2.02713e+09 bytes/second.

> Size 2048 gave 2.01803e+09 bytes/second.

} A

{ B

> Size 4096 gave 3.26472e+09 bytes/second.

> Size 8192 gave 3.85126e+09 bytes/second.

> Size 16384 gave 3.85377e+09 bytes/second.

> Size 32768 gave 3.85293e+09 bytes/second.

> Size 65536 gave 2.06793e+09 bytes/second.

> Size 131072 gave 2.06845e+09 bytes/second.

} B

A) Here we have a classical sequence pipelines often encounter, where
a simple loop starts off fast (4 bytes/cycle) and progressively slows
down by a factor of 2 (to 2 bytes per cycle) over some interval of
complexity (size). What is important is that factor of 2: something
that took 1 cycle early on starts taking 2 cycles later on.

B) Here we have a second classical sequence, where the performance at
some boundary (4096) reverts back to the 1-cycle pipeline performance,
only to degrade again (in basically the same sequence as A). {{Side
note: size=4096 has "flutter" in the stairstep. size={8192..32768}
has peak performance--probably something to do with sets in the cache,
and 4096 is the size of a page (TLB effects)}}.

I suspect the flutter has something to do with your buffer crossing

a page boundary.
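One way to test that hypothesis (a sketch, not taken from the posted benchmark; it assumes the same uint64_t working set and needs C++17 for std::aligned_alloc) is to force the buffer to start exactly on a page boundary and see whether the flutter at size 4096 goes away:

#include <cstddef>
#include <cstdint>
#include <cstdlib>

constexpr size_t kPage = 4096;

// Allocate 'count' 64-bit words starting on a 4096-byte page boundary.
// std::aligned_alloc requires the size to be a multiple of the alignment,
// so round the byte count up.  Release the memory with std::free().
uint64_t *alloc_page_aligned(size_t count)
{
    size_t bytes = (count * sizeof(uint64_t) + kPage - 1) / kPage * kPage;
    return static_cast<uint64_t *>(std::aligned_alloc(kPage, bytes));
}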

Dec 21, 2023, 1:40:27 PM12/21/23

to

Michael S wrote:

> On Thu, 21 Dec 2023 15:32:10 +0000

> Vir Campestris <vir.cam...@invalid.invalid> wrote:

>> On 21/12/2023 14:56, Scott Lurndal wrote:

>> > Vir Campestris <vir.cam...@invalid.invalid> writes:

>> >> On 20/12/2023 18:08, Theo wrote:

>> >>> Vir Campestris <vir.cam...@invalid.invalid> wrote:

>> >>>> This is not the right group for this - but I don't know where is.

>> >>>> Suggestions on a postcard please...

>> >>>

>> >>> I'm crossposting this to comp.arch, where they may have some

>> >>> ideas.

>> >> <snip>

>> >>>

>> >>> For 'series length 8B/16B/32B' do you mean 8 bytes? ie 8B is a

>> >>> single 64 bit word transferred?

>> >>>

>> >> Yes. My system has a 64 bit CPU and 64 bit main memory.

>> >

>> > Surely your main memory is just a sequence of 8-bit bytes (+ECC).

>> >

>>

>> AIUI I have two DIMMs, each of which has a 32-bit bus. I'm not up on

>> hardware these days, but it used to be if you wanted to write a byte

>> into your 16-bit memory you had to read both bytes, then write one

>> back.

>>

>> And

> DIMMs have 64-bit data buses. Both these days and previous millennium.

> Now, these days DDR5 DIMM splits 64-bit data bus into pair of

> independent 32-bit channels, but the total is still 64 bits.

All DDR DIMMs have 64-bit busses (200 pins on the DIMM).

Some SDR (pre 2000) DIMMs had 32-bit busses, some ancient laptop memory

carriers had 32-bit busses.

> That's a physical perspective. From logical perspective, DDR3 and DDR4

> bits exchange data with controller in 512-bit chunks. On DDR5 DIMM each

> channel exchanges data with controller in 512-bit chunks.

> From signaling perspective, it is still possible (at least on non-ECC

> gear) to tell to memory to write just 8 bits out of 512 and to ignore

> the rest. In PC-class hardware this ability is used very rarely if used

> at all.

requests can use this. AMD memory controllers hid this from the DRAMs

so we at least had the opportunity to recompute ECC on ECC carrying

DIMMs. MC would ask DRC for the line of data, check (and repair) ECC

then write the unCacheable data, and place the written data in the

outbound memory queue with its new ECC.

Dec 21, 2023, 4:21:05 PM12/21/23

to

lines split into two 64 regions? Iirc, it was on some of their first

hyperthreaded processors. A work around from intel was to offset threads

using alloca... I remember it.

Dec 21, 2023, 5:55:17 PM12/21/23

to

to clearly separate 64-bit DIMMs from 32-bit SIMMs.

> some ancient laptop

> memory carriers had 32-bit busses.

But were they called just DIMM or SO-DIMM?

The latter indeed had a 72-pin variant with a 32-bit bus.

>

> > That's a physical perspective. From logical perspective, DDR3 and

> > DDR4 bits exchange data with controller in 512-bit chunks. On DDR5

> > DIMM each channel exchanges data with controller in 512-bit chunks.

> >

>

> Note:: 512-bis is 64-bytes.

>

Which *not* co-incidentally matches cache line size of majority of x86 CPUs.

Dec 21, 2023, 6:50:28 PM12/21/23

to

Dual (as in DIMM) refers to the two sides of the board where pins make contact with the plug. They put pins on both

sides so they could get ~200 pins {Vdd, Gnd, lock, reset, control,

and data} this was in the early 1990s.

>> some ancient laptop

>> memory carriers had 32-bit busses.

> But were they called just DIMM or SO-DIMM?

> The later indeed had 72-pin variant with 32-bit bus.

>>

>> > That's a physical perspective. From logical perspective, DDR3 and

>> > DDR4 bits exchange data with controller in 512-bit chunks. On DDR5

>> > DIMM each channel exchanges data with controller in 512-bit chunks.

>> >

>>

>> Note:: 512-bis is 64-bytes.

>>

> Which *not* co-incidentally matches cache line size of majority of x86

> CPUs.

Given that x86 at the time represented 90% of all computers being sold, and that the people on the

committee wanted to keep it that way, this is unsurprising.

Dec 21, 2023, 9:28:50 PM12/21/23

to

performance.

Dec 22, 2023, 8:42:44 AM12/22/23

to

overall time is consistent regardless of how many times it goes around

the inner loop, and how many times the outer one.

But pipelines are funny things.

64k x 64 bit words is I think my L3 cache size. It's not very big on my

CPU, they had to fit a GPU on the die.
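(For scale: 65536 words × 8 bytes/word = 524288 bytes, i.e. 512 KiB.)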

I'd be interested to see the results from anyone else's computer.

Andy

Dec 22, 2023, 1:12:42 PM12/22/23

to

In comp.arch Vir Campestris <vir.cam...@invalid.invalid> wrote:

> I'd be interested to see the results from anyone else's computer.

$ neofetch --off

OS: Kubuntu 23.10 x86_64

Host: 20QBS03N00 ThinkPad X1 Titanium Gen 1

Kernel: 6.5.0-14-generic

CPU: 11th Gen Intel i5-1130G7 (8) @ 4.000GHz

GPU: Intel Tiger Lake-UP4 GT2 [Iris Xe Graphics]

Memory: 4436MiB / 15704MiB

$ gcc --version

gcc version 13.2.0 (Ubuntu 13.2.0-4ubuntu3)

$ g++ -o cache cache.c

$ ./cache

Size 1 gave 4.13643e+09 bytes/second.

Size 2 gave 4.79971e+09 bytes/second.

Size 4 gave 4.87535e+09 bytes/second.

Size 8 gave 4.8321e+09 bytes/second.

Size 16 gave 4.71703e+09 bytes/second.

Size 32 gave 3.89488e+09 bytes/second.

Size 64 gave 4.02976e+09 bytes/second.

Size 128 gave 4.15832e+09 bytes/second.

Size 256 gave 4.19562e+09 bytes/second.

Size 512 gave 4.08511e+09 bytes/second.

Size 1024 gave 4.0796e+09 bytes/second.

Size 2048 gave 4.11983e+09 bytes/second.

Size 4096 gave 4.06869e+09 bytes/second.

Size 8192 gave 4.06807e+09 bytes/second.

Size 16384 gave 4.06217e+09 bytes/second.

Size 32768 gave 4.06067e+09 bytes/second.

Size 65536 gave 4.04791e+09 bytes/second.

Size 131072 gave 4.06143e+09 bytes/second.

Size 262144 gave 4.04301e+09 bytes/second.

Size 524288 gave 4.03872e+09 bytes/second.

Size 1048576 gave 3.97715e+09 bytes/second.

Size 2097152 gave 3.97609e+09 bytes/second.

Size 4194304 gave 3.98361e+09 bytes/second.

Size 8388608 gave 3.98617e+09 bytes/second.

Size 16777216 gave 3.98376e+09 bytes/second.

Size 33554432 gave 3.98504e+09 bytes/second.

Size 67108864 gave 3.98726e+09 bytes/second.

Size 134217728 gave 3.99495e+09 bytes/second.

Dec 22, 2023, 2:36:18 PM12/22/23

to

Theo wrote:

> $ g++ -o cache cache.c

> $ ./cache

{ A

> Size 1 gave 4.13643e+09 bytes/second.

> Size 2 gave 4.79971e+09 bytes/second.

> Size 4 gave 4.87535e+09 bytes/second.

> Size 8 gave 4.8321e+09 bytes/second.

> Size 16 gave 4.71703e+09 bytes/second.

} A

{ B

> Size 32 gave 3.89488e+09 bytes/second.

> Size 64 gave 4.02976e+09 bytes/second.

> Size 128 gave 4.15832e+09 bytes/second.

> Size 256 gave 4.19562e+09 bytes/second.

> Size 512 gave 4.08511e+09 bytes/second.

> Size 1024 gave 4.0796e+09 bytes/second.

> Size 2048 gave 4.11983e+09 bytes/second.

> Size 4096 gave 4.06869e+09 bytes/second.

> Size 8192 gave 4.06807e+09 bytes/second.

> Size 16384 gave 4.06217e+09 bytes/second.

> Size 32768 gave 4.06067e+09 bytes/second.

> Size 65536 gave 4.04791e+09 bytes/second.

> Size 131072 gave 4.06143e+09 bytes/second.

> Size 262144 gave 4.04301e+09 bytes/second.

> Size 524288 gave 4.03872e+09 bytes/second.

> Size 1048576 gave 3.97715e+09 bytes/second.

> Size 2097152 gave 3.97609e+09 bytes/second.

> Size 4194304 gave 3.98361e+09 bytes/second.

> Size 8388608 gave 3.98617e+09 bytes/second.

> Size 16777216 gave 3.98376e+09 bytes/second.

> Size 33554432 gave 3.98504e+09 bytes/second.

> Size 67108864 gave 3.98726e+09 bytes/second.

> Size 134217728 gave 3.99495e+09 bytes/second.

} B

The B group are essentially 4.0 with noise.

The A group are essentially 4.8 with a startup flutter.

A 20% down or 25% up change at 32::64.

Dec 22, 2023, 4:58:20 PM12/22/23

to

In comp.arch MitchAlsup <mitch...@aol.com> wrote:

> Theo wrote:

> > $ g++ -o cache cache.c

> > $ ./cache

> { A

> > Size 1 gave 4.13643e+09 bytes/second.

> > Size 2 gave 4.79971e+09 bytes/second.

> > Size 4 gave 4.87535e+09 bytes/second.

> > Size 8 gave 4.8321e+09 bytes/second.

> > Size 16 gave 4.71703e+09 bytes/second.

> } A

> { B

> > Size 32 gave 3.89488e+09 bytes/second.

...

> > Size 134217728 gave 3.99495e+09 bytes/second.

> } B

>

> The B group are essentially 4.0 with noise.

> The A group are essentially 4.8 with a startup flutter.

> A 20% down or 25% up change at 32::64.

The nearest machine I have to the OP is this:

OS: Ubuntu 20.04.6 LTS x86_64

Host: 1U4LW-X570/2L2T

Kernel: 5.4.0-167-generic

CPU: AMD Ryzen 7 5800X (16) @ 3.800GHz

GPU: 29:00.0 ASPEED Technology, Inc. ASPEED Graphics Family

Memory: 3332MiB / 128805MiB

gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0

Size 1 gave 6.22043e+09 bytes/second.

Size 2 gave 6.35674e+09 bytes/second.

Size 4 gave 4.14766e+09 bytes/second.

Size 8 gave 5.4239e+09 bytes/second.

Size 16 gave 6.01113e+09 bytes/second.

Size 32 gave 7.75976e+09 bytes/second.

Size 64 gave 8.7972e+09 bytes/second.

Size 128 gave 9.71523e+09 bytes/second.

Size 256 gave 9.91644e+09 bytes/second.

Size 512 gave 1.00179e+10 bytes/second.

Size 1024 gave 1.0065e+10 bytes/second.

Size 2048 gave 9.78508e+09 bytes/second.

Size 4096 gave 9.76764e+09 bytes/second.

Size 8192 gave 9.86537e+09 bytes/second.

Size 16384 gave 9.90053e+09 bytes/second.

Size 32768 gave 9.91552e+09 bytes/second.

Size 65536 gave 9.84556e+09 bytes/second.

Size 131072 gave 9.78442e+09 bytes/second.

Size 262144 gave 9.80282e+09 bytes/second.

Size 524288 gave 9.81447e+09 bytes/second.

Size 1048576 gave 9.81981e+09 bytes/second.

Size 2097152 gave 9.81456e+09 bytes/second.

Size 4194304 gave 9.70057e+09 bytes/second.

Size 8388608 gave 9.55507e+09 bytes/second.

Size 16777216 gave 9.44032e+09 bytes/second.

Size 33554432 gave 9.33896e+09 bytes/second.

Size 67108864 gave 9.28529e+09 bytes/second.

Size 134217728 gave 9.25213e+09 bytes/second.

which seems to have more flutter in group A. I get similar results in a VM

running on a AMD Ryzen 9 5950X (32) @ 3.393GHz.

Theo

Dec 22, 2023, 6:25:25 PM12/22/23

to

> gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0


{ A

> Size 1 gave 6.22043e+09 bytes/second.

> Size 2 gave 6.35674e+09 bytes/second.

> Size 4 gave 4.14766e+09 bytes/second.

> Size 8 gave 5.4239e+09 bytes/second.

> Size 16 gave 6.01113e+09 bytes/second.

> Size 32 gave 7.75976e+09 bytes/second.

> Size 64 gave 8.7972e+09 bytes/second.

} A

{ B

> Size 128 gave 9.71523e+09 bytes/second.

> Size 256 gave 9.91644e+09 bytes/second.

> Size 512 gave 1.00179e+10 bytes/second.

> Size 1024 gave 1.0065e+10 bytes/second.

> Size 2048 gave 9.78508e+09 bytes/second.

> Size 4096 gave 9.76764e+09 bytes/second.

> Size 8192 gave 9.86537e+09 bytes/second.

> Size 16384 gave 9.90053e+09 bytes/second.

> Size 32768 gave 9.91552e+09 bytes/second.

> Size 65536 gave 9.84556e+09 bytes/second.

> Size 131072 gave 9.78442e+09 bytes/second.

> Size 262144 gave 9.80282e+09 bytes/second.

> Size 524288 gave 9.81447e+09 bytes/second.

> Size 1048576 gave 9.81981e+09 bytes/second.

> Size 2097152 gave 9.81456e+09 bytes/second.

} B

{ C

> Size 4194304 gave 9.70057e+09 bytes/second.

> Size 8388608 gave 9.55507e+09 bytes/second.

> Size 16777216 gave 9.44032e+09 bytes/second.

> Size 33554432 gave 9.33896e+09 bytes/second.

> Size 67108864 gave 9.28529e+09 bytes/second.

> Size 134217728 gave 9.25213e+09 bytes/second.

} C

> which seems to have more flutter in group A. I get similar results in a VM

> running on a AMD Ryzen 9 5950X (32) @ 3.393GHz.

A has some kind of startup irregularity: down then up.

B is essentially constant.

C is slowly degrading (maybe from TLB effects).

> Theo

Dec 23, 2023, 4:34:19 PM12/23/23

to

On 22/12/2023 21:58, Theo wrote:

<snip>

> which seems to have more flutter in group A. I get similar results in a VM
> running on a AMD Ryzen 9 5950X (32) @ 3.393GHz.
>
> Theo

Thank you for those. Neither of them show the odd thing I was seeing -
but I compiled with -ofast. Does that make a difference?

Andy

Dec 23, 2023, 5:50:15 PM12/23/23

to

Vir Campestris <vir.cam...@invalid.invalid> schrieb:

> but I compiled with -ofast. Does that make a difference?

That would put it in an executable named "fast" :-)

-march=native -mtune=native might make more of a difference, depending

on how good the compiler's model of your hardware is.


Dec 25, 2023, 9:56:25 AM12/25/23

to

On 23/12/2023 23:50, Thomas Koenig wrote:

> Vir Campestris <vir.cam...@invalid.invalid> schrieb:

>> On 22/12/2023 21:58, Theo wrote:

>> <snip>

>>>

>>> which seems to have more flutter in group A. I get similar results

>> in a VM

>>> running on a AMD Ryzen 9 5950X (32) @ 3.393GHz.

>>>

>>> Theo

>> Thank you for those. Neither of them show the odd thing I was seeing -

>> but I compiled with -ofast. Does that make a difference?

>

> That would put it in an executable named "fast" :-)

Some programs are really fussy about capitalisation!

>

> -march=native -mtune=native might make more of a difference, depending

> on how good the compiler's model of your hardware is.

For gcc on modern x86-64 processors (IME at least), the difference between -O0 and -O1 is often large, but the difference between -O1 and

higher levels usually makes far less difference. The processors

themselves do a good job of things like instruction scheduling and

register renaming at runtime, and are designed to be good for running

weakly optimised code. I find it makes a bigger difference on

processors that don't do as much at run-time, such as microcontroller cores.

But the "-march=native" can make a very big difference, especially if it

means the compiler can use SIMD or other advanced instructions. (You

don't need "-mtune" if you have "march", as it is implied by "-march" -

you only need both if you want to make a build that will run on many

x86-64 variants but is optimised for one particular one.) And the

"-march=native" benefits go well with some of the higher optimisations -

thus it is good to combine "-march=native" with "-Ofast" or "-O2".

In practice, things also vary dramatically according to the type of program.
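For example (an illustration only; the exact flags and output depend on the machine), building the test with native tuning and checking what "native" resolves to might look like:

$ g++ -O2 -march=native -o cache cache.cpp
$ g++ -march=native -Q --help=target | grep -- '-march='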

Dec 27, 2023, 9:20:14 AM12/27/23

to

On 25/12/2023 14:56, David Brown wrote:

> On 23/12/2023 23:50, Thomas Koenig wrote:

>> Vir Campestris <vir.cam...@invalid.invalid> schrieb:

>>> On 22/12/2023 21:58, Theo wrote:

>>> <snip>

>>>>

>>>> which seems to have more flutter in group A. I get similar results

>>> in a VM

>>>> running on a AMD Ryzen 9 5950X (32) @ 3.393GHz.

>>>>

>>>> Theo

>>> Thank you for those. Neither of them show the odd thing I was seeing -

>>> but I compiled with -ofast. Does that make a difference?

>>

>> That would put it in an executable named "fast" :-)

YKWIM :P

>

> Some programs are really fussy about capitalisation!

>

>>

>> -march=native -mtune=native might make more of a difference, depending

>> on how good the compiler's model of your hardware is.

>

> For gcc on modern x86-64 processors (IME at least), the difference

> between -O0 and -O1 is often large, but the difference between -O1 and

> higher levels usually makes far less difference. The processors

> themselves do a good job of things like instruction scheduling and

> register renaming at runtime, and are designed to be good for running

> weakly optimised code. I find it makes a bigger difference on

> processors that don't do as much at run-time, such as microcontroller

> cores.

>

> But the "-march=native" can make a very big difference, especially if it

> means the compiler can use SIMD or other advanced instructions. (You

> don't need "-mtune" if you have "march", as it is implied by "-march" -

> you only need both if you want to make a build that will run on many

> x86-64 variants but is optimised for one particular one.) And the

> "-march=native" benefits go well with some of the higher optimisations -

> thus it is good to combine "-march=native" with "-Ofast" or "-O2".

>

> In practice, things also vary dramatically according to the type of

> program.

>

mtune=native _does_ make a difference. Somewhat to my surprise it makes

it _slower_ - the biggest difference being about 5% with a size of 128k

UINT64.

I checked O0 (really slow) O1 (a lot faster) O3 (quite a bit faster too)

Ofast (mostly slightly faster than O3 when I use a capital O) and O3

march=native.

The pipeline thing made me try something different - instead of

incrementing a value I used std::copy to copy each word to the next one.
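A minimal sketch of one way to write that inner step (not necessarily the exact code used for the runs below; because the source and destination ranges overlap, std::copy_backward is the well-defined way to shift the words up by one):

#include <algorithm>
#include <cstddef>
#include <cstdint>

// Shift a block of n words up by one position, so each word is read and
// then written into its neighbour (a serial dependence through memory).
inline void shift_up_one(uint64_t *p, size_t n)
{
    if (n > 1)
        std::copy_backward(p, p + n - 1, p + n);
}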

That rose to a peak rate of about 64MB/S with a size of 32k, dropped

sharply to 45MB/s and then showed a steady decline to 40MB/s at 256k. It

then dropped sharply to 10MB/S for all larger sizes.

A much more sensible result!

Andy

Dec 27, 2023, 10:02:16 AM12/27/23

to

"-Ofast" enables the "-O3" optimisations but also relaxes the language rules somewhat. For example, it allows optimisations of stores that

might cause data races if another thread can see the same data, and it

enables "-ffast-math" which treats floating point as somewhat

approximate and always finite, rather than following IEEE rules

strictly. If you are okay with this, then "-Ofast" is fine, but you

should read the documentation to check that you are happy with it.

While you are there, you can see a number of other optimisation flags

that are not enabled by any -O sets. These may or may not help,

depending on your source code.

Dec 27, 2023, 11:38:50 AM12/27/23

to

>

> Andy

Can you tell us what are you trying to achieve?

Is it a microbenchmark intended to improve your understanding of Zen1

internals?

Or part of the code that you want to run as fast as possible?

If the latter, what exactly do you want to be done by the code?

Dec 29, 2023, 12:54:57 PM12/29/23

to

It is a microbenchmark.

I was trying to understand cache performance on my system.

Over in comp.lang.c++ you'll find a thread about the Sieve of Eratosthenes.

I and another poster are trying to optimise it. Not for any good reason

of course... but I was just curious about some of the results I was getting.

Andy

Dec 29, 2023, 1:19:18 PM12/29/23

to

Vir Campestris wrote:

> It is a microbenchmark.

> I was trying to understand cache performance on my system.

> Over in comp.lang.c++ you'll find a thread about Sieve or Eratosthenes.

>

> I and another poster are trying to optimise it. Not for any good reason

> of course... but I was just curious about some of the results I was

> getting.

A few years ago this guy popped-up on youtube, saying "I was the author

of Task Manager on Windows NT", I ignored the recommendation for quite a

while thinking he just wanted people to blow smoke up his arse,

eventually I watched and he seems a decent guy ... anyway he did a drag

racing series using sieve prime algorithm

<https://www.youtube.com/playlist?list=PLF2KJ6Gy3cZ5Er-1eF9fN1Hgw_xkoD9V1>

Others have contributed in all manner of languages

<https://github.com/PlummersSoftwareLLC/Primes>

Dec 29, 2023, 4:30:54 PM12/29/23

to

Vir Campestris wrote:

> The code I have put in a signature block; there's no point in risking

> someone posting it again. I've commented it, but no doubt not in all the

> right places! I'd be interested to know what results other people get.

Built and run under Win11 with g++ 13.2.0

on a laptop with

i7-11370H CPU (3.3GHz base, 4.28GHz turbo)

L1 d-cache 48kB per core

L2 cache 1.2MB

L3 cache 12MB

16GB memory

Size 1 gave 5.12047e+09 bytes/second.

Size 2 gave 5.76467e+09 bytes/second.

Size 4 gave 5.78979e+09 bytes/second.

Size 8 gave 5.6642e+09 bytes/second.

Size 16 gave 5.55303e+09 bytes/second.

Size 32 gave 4.4151e+09 bytes/second.

Size 64 gave 4.67783e+09 bytes/second.

Size 128 gave 4.85115e+09 bytes/second.

Size 256 gave 4.88147e+09 bytes/second.

Size 512 gave 4.5214e+09 bytes/second.

Size 1024 gave 4.64794e+09 bytes/second.

Size 2048 gave 4.69149e+09 bytes/second.

Size 4096 gave 4.69416e+09 bytes/second.

Size 8192 gave 4.73425e+09 bytes/second.

Size 16384 gave 4.76353e+09 bytes/second.

Size 32768 gave 4.70864e+09 bytes/second.

Size 65536 gave 4.75244e+09 bytes/second.

Size 131072 gave 4.76452e+09 bytes/second.

Size 262144 gave 4.69839e+09 bytes/second.

Size 524288 gave 4.71094e+09 bytes/second.

Size 1048576 gave 4.62922e+09 bytes/second.

Size 2097152 gave 4.55435e+09 bytes/second.

Size 4194304 gave 4.58315e+09 bytes/second.

Size 8388608 gave 4.55917e+09 bytes/second.

Size 16777216 gave 4.60661e+09 bytes/second.

Size 33554432 gave 4.63509e+09 bytes/second.

Size 67108864 gave 4.62349e+09 bytes/second.

Size 134217728 gave 4.62406e+09 bytes/second.

Dec 29, 2023, 5:04:19 PM12/29/23

to

and (perversely?) with -Ofast

Size 1 gave 4.64966e+09 bytes/second.

Size 2 gave 9.39481e+09 bytes/second.

Size 4 gave 1.59656e+10 bytes/second.

Size 8 gave 2.6789e+10 bytes/second.

Size 16 gave 4.42999e+10 bytes/second.

Size 32 gave 4.51076e+10 bytes/second.

Size 64 gave 4.90103e+10 bytes/second.

Size 128 gave 4.08414e+10 bytes/second.

Size 256 gave 4.36978e+10 bytes/second.

Size 512 gave 4.54119e+10 bytes/second.

Size 1024 gave 4.50594e+10 bytes/second.

Size 2048 gave 4.71622e+10 bytes/second.

Size 4096 gave 4.60261e+10 bytes/second.

Size 8192 gave 3.84764e+10 bytes/second.

Size 16384 gave 3.90878e+10 bytes/second.

Size 32768 gave 3.76718e+10 bytes/second.

Size 65536 gave 3.87339e+10 bytes/second.

Size 131072 gave 3.89058e+10 bytes/second.

Size 262144 gave 3.85662e+10 bytes/second.

Size 524288 gave 4.19209e+10 bytes/second.

Size 1048576 gave 3.00962e+10 bytes/second.

Size 2097152 gave 1.37616e+10 bytes/second.

Size 4194304 gave 1.4136e+10 bytes/second.

Size 8388608 gave 1.39413e+10 bytes/second.

Size 16777216 gave 1.35747e+10 bytes/second.

Size 33554432 gave 1.37286e+10 bytes/second.

Size 67108864 gave 1.38163e+10 bytes/second.

Size 134217728 gave 1.29987e+10 bytes/second.


Dec 29, 2023, 5:07:34 PM12/29/23

to

Andy Burns wrote:

> and (perversely?) with -Ofast

I must look at the exponent, not just the mantissa

I must look at the exponent, not just the mantissa

I must look at the exponent, not just the mantissa


Dec 29, 2023, 7:00:34 PM12/29/23

to

{ A

> Size 1 gave 4.64966e+09 bytes/second.

> Size 2 gave 9.39481e+09 bytes/second.

> Size 4 gave 1.59656e+10 bytes/second.

> Size 8 gave 2.6789e+10 bytes/second.

> Size 16 gave 4.42999e+10 bytes/second.

} A

{ B

> Size 32 gave 4.51076e+10 bytes/second.

> Size 64 gave 4.90103e+10 bytes/second.

> Size 128 gave 4.08414e+10 bytes/second.

> Size 256 gave 4.36978e+10 bytes/second.

> Size 512 gave 4.54119e+10 bytes/second.

> Size 1024 gave 4.50594e+10 bytes/second.

> Size 2048 gave 4.71622e+10 bytes/second.

> Size 4096 gave 4.60261e+10 bytes/second.

} B

{ C

> Size 8192 gave 3.84764e+10 bytes/second.

> Size 16384 gave 3.90878e+10 bytes/second.

> Size 32768 gave 3.76718e+10 bytes/second.

> Size 65536 gave 3.87339e+10 bytes/second.

> Size 131072 gave 3.89058e+10 bytes/second.

> Size 262144 gave 3.85662e+10 bytes/second.

> Size 524288 gave 4.19209e+10 bytes/second.

} C

{ D

> Size 1048576 gave 3.00962e+10 bytes/second.

> Size 2097152 gave 1.37616e+10 bytes/second.

> Size 4194304 gave 1.4136e+10 bytes/second.

> Size 8388608 gave 1.39413e+10 bytes/second.

> Size 16777216 gave 1.35747e+10 bytes/second.

> Size 33554432 gave 1.37286e+10 bytes/second.

> Size 67108864 gave 1.38163e+10 bytes/second.

> Size 134217728 gave 1.29987e+10 bytes/second.

} D

A displays BW follows access size

B displays essentially the same performance throughout

C displays a minor step down in performance due to L2 or TLB effects 5 to 4 ratio

D displays a major stepdown in performance at DIV 3 compared to C 3 to 1 ratio

Dec 30, 2023, 8:58:00 AM12/30/23

to

Andy Burns wrote:

> Vir Campestris wrote:

>

>> It is a microbenchmark.

>> I was trying to understand cache performance on my system.

>> Over in comp.lang.c++ you'll find a thread about Sieve or Eratosthenes.

>>

>> I and another poster are trying to optimise it. Not for any good

>> reason of course... but I was just curious about some of the results I

>> was getting.

>

> A few years ago this guy popped-up on youtube, saying "I was the author

> of Task Manager on Windows NT", I ignored the recommendation for quite a

> while thinking he just wanted people to blow smoke up his arse,

> eventually I watched and he seems a decent guy ... anyway he did a drag

> racing series using sieve prime algorithm

>

> <https://www.youtube.com/playlist?list=PLF2KJ6Gy3cZ5Er-1eF9fN1Hgw_xkoD9V1>

Dave is a Good Guy!

I have contributed to his "smallest windows program ever" with a small

tweak that saved 4 (or 6?) bytes.

The sieve algorithm used above is more advanced than the fixed modulo 30

program I wrote a ffew decades ago, but similar in approach.

I decided to use mod 30 because as soon as you get to 30+, there are

exactly 8 possible prime locations in each block, so it fits perfectly

in a byte map. :-)

1,7,11,13,17,19,23,29
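A minimal sketch of that layout (an illustration of the idea, not the original program): one byte covers one block of 30 numbers, one bit per admissible residue.

#include <cstddef>
#include <cstdint>
#include <vector>

// The 8 residues mod 30 that can be prime (for numbers above 5).
constexpr int kResidue[8] = {1, 7, 11, 13, 17, 19, 23, 29};

// Bit i of byte 'block' stands for the number 30 * block + kResidue[i],
// so each block of 30 numbers fits in exactly one byte of the sieve map.
inline bool is_candidate(const std::vector<uint8_t> &map, size_t block, int i)
{
    return (map[block] >> i) & 1;
}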

>

> Others have contributed in all manner of languages

>

> <https://github.com/PlummersSoftwareLLC/Primes>

Terje

--

- <Terje.Mathisen at tmsw.no>

"almost all programming can be viewed as an exercise in caching"

Dec 30, 2023, 1:24:50 PM12/30/23

to

On 12/30/2023 7:57 AM, Terje Mathisen wrote:

> Andy Burns wrote:

>> Vir Campestris wrote:

>>

>>> It is a microbenchmark.

>>> I was trying to understand cache performance on my system.

>>> Over in comp.lang.c++ you'll find a thread about Sieve or Eratosthenes.

>>>

>>> I and another poster are trying to optimise it. Not for any good

>>> reason of course... but I was just curious about some of the results

>>> I was getting.

>>

>> A few years ago this guy popped-up on youtube, saying "I was the

>> author of Task Manager on Windows NT", I ignored the recommendation

>> for quite a while thinking he just wanted people to blow smoke up his

>> arse, eventually I watched and he seems a decent guy ... anyway he did

>> a drag racing series using sieve prime algorithm

>>

>> <https://www.youtube.com/playlist?list=PLF2KJ6Gy3cZ5Er-1eF9fN1Hgw_xkoD9V1>

>

> Dave is a Good Guy!

>

Yes, and generally interesting videos on YouTube.

He did interviews with David Cutler and Raymond Chen as well, both

fairly big names at Microsoft, which were also kinda interesting.

> I have contributed to his "smallest windows program ever" with a small

> tweak that saved 4 (or 6?) bytes.

>

> The sieve algorithm used above is more advanced than the fixed modulo 30

> program I wrote a ffew decades ago, but similar in approach.

>

> I decided to use mod 30 because as soon as you get to 30+, there are

> exactly 8 possible prime locations in each block, so it fits perfectly

> in a byte map. :-)

>

> 1,7,11,13,17,19,23,29

in my compiler, and noted that the relative performance difference was

smaller than might otherwise be expected (intuitively, one would expect

dynamic types to be horridly slow, since nearly every operation is

effectively a runtime call).

Namely, I only saw around a 3x difference at the time, rather than

around a 10x+ difference, which is more what I would expect.

Granted, a prime sieve is also not exactly something that performs well

on my ISA design (but, slightly less "aggressively badly" than a the

"recursive Fibonacci number" algorithm, *).

*, IIRC:

long rfib(long x)

{ return((x<2)?1:(rfib(x-1)+rfib(x-2))); }

Which is basically a stress test of function call and return performance...
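(For scale: with that definition rfib(30) is 1346269, and computing it takes 2*1346269 - 1 = 2692537 calls, so the number of calls grows just as fast as the result does.)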

Dec 30, 2023, 5:05:55 PM12/30/23

to

BGB wrote:

> long rfib(long x)

> { return((x<2)?1:(rfib(x-1)+rfib(x-2))); }

At first I thought the following looked pretty good:

rfib:

ENTER R29,R0,#0 // just 3 preserved registers.

CMP R2,R1,#2

PGE R2,TT

MOV R1,#1 // return 1

EXIT R29,R0,#0

SUB R30,R1,#2

SUB R1,R1,#1

CALL rfib // rfib(n-1)

MOV R29,R1

MOV R1,R30

CALL rfib // rfib(n-2)

ADD R1,R1,R29

EXIT R29,R0,#0 // return rfib(n-1)+rfib(n-2)

But then I moved save and restore to the rfib() calls making the last rung

on the call tree less expensive by 6 memory references each, and by saving

restoring only 2 registers rather than 3 at the cost of a MOV, so each non-

leaf level in the call tree saves/restores only 4 registers instead of 6::

rfib:

SUB R2,R1,#2

PNLT R2,TT // (n-2) < 0

MOV R1,#1 // return 1

RET

ENTER R30,R0,#0 // just 2 preserved registers.

MOV R30,R2

SUB R1,R1,#1

CALL rfib // rfib(n-1)

MOV R2,R1

MOV R1,R30

MOV R30,R2

CALL rfib // rfib(n-2)

ADD R1,R1,R30

EXIT R30,R0,#0 // return rfib(n-1)+rfib(n-2)

But then I realized that rfib(n-2) has already been computed by rfib(n-1).

Since the size of the redundant element on the stack is constant, rfib(n-1)

makes a location available to rfib(n-2) so only 1 call is required instead

of 2::

rfib:

ENTER R0,R0,#8

STD #0,[SP+8] // setup rfib[n-2]

CALL rfib1 // recurse through rfib1

EXIT R0,R0,#8

rfib1:

SUB R2,R1,#2

PNLT R2,TT // (n-2) < 0

MOV R1,#1 // return 1

RET

ENTER R0,R0,#8 // no preserved registers just return address

SUB R1,R1,#1

CALL rfib1 // rfib(n-1)

LDD R2,[SP] // save rfib(n-2)

ADD R1,R1,R30

STD R1,[SP+16] // restore rfib(n-2)

EXIT R0,R0,#8 // return rfib(n-1)+rfib(n-2)

....

Dec 30, 2023, 6:19:19 PM12/30/23

to

On 12/30/2023 4:05 PM, MitchAlsup wrote:

> BGB wrote:

>

>> long rfib(long x)

>> { return((x<2)?1:(rfib(x-1)+rfib(x-2))); }

>

> At first I thought the following looked pretty good:

>

> rfib:

> ENTER R29,R0,#0 // just 3 preserved registers.

>

> CMP R2,R1,#2

> PGE R2,TT

> MOV R1,#1 // return 1

> EXIT R29,R0,#0

>

> SUB R30,R1,#2

> SUB R1,R1,#1

> CALL rfib // rfib(n-1)

> MOV R29,R1

> MOV R1,R30

> CALL rfib // rfib(n-2)

>

> ADD R1,R1,R29

> EXIT R29,R0,#0 // return rfib(n-1)+rfib(n-2)

>

Theoretically, could be done in my ISA something like:

rfib:

ADD -24, SP | MOV LR, R1

MOV 1, R2 | CMPGE 2, R4

BF .L0

MOV.X R12, (SP, 0)

MOV R4, R12 | MOV.Q R1, (SP, 16)

ADD R12, -1, R4

BSR rfib

MOV R2, R13 | ADD R12, -2, R4

BSR rfib

ADD R2, R13, R2

MOV.Q (SP, 16), R1

MOV.X (SP, 0), R12

.L0:

ADD 24, SP

JMP R1

But, my compiler isn't going to pull off anything like this.

Checking, actual BGBCC output:

rfib:

MOV LR, R1

BSR __prolog_0000_0000020000FC

ADD -208, SP

MOV RQ4, RQ13

// tk_shell_chksvar.c:234 { return((x<2)?1:(rfib(x-1)+rfib(x-2))); }

CMPQGE 2, RQ13

BT .L00802A62

MOV 1, RQ10

BRA .L00802A63

.L00802A62:

SUB RQ13, 1, RQ14

MOV RQ14, RQ4

BSR rfib

MOV RQ2, RQ12

SUB RQ13, 2, RQ14

MOV RQ14, RQ4

BSR rfib

MOV RQ2, RQ11

ADD RQ12, RQ11, RQ14

MOV RQ14, RQ10

.L00802A63:

MOV RQ10, RQ2

.L00C108CE:

ADD 208, SP

BRA __epilog_0000_0000020000FC

.balign 4

( Yes, this is more MOV's than "technically necessary", but eliminating

them from the compiler output is "easier said than done" given how some

parts of the compiler work. RQn/RDn are technically equivalent to Rn,

but the compiler distinguishes them as early on the idea was that the

assembler might distinguish instructions based on type, like

EAX/RAX/AX/... on x86-64, but it didn't quite happen this way. )

And, fetching the prolog and epilog:

__prolog_0000_0000020000FC:

ADD -48, SP

MOV.Q R1, (SP, 40)

MOV.X R12, (SP, 16)

MOV.Q R14, (SP, 32)

MOV.X R10, (SP, 0)

RTS

__epilog_0000_0000020000FC:

MOV.Q (SP, 40), R1

MOV.X (SP, 0), R10

MOV.X (SP, 16), R12

MOV.Q (SP, 32), R14

ADD 48, SP

JMP R1

Or, a little closer to the final machine-code (BJX2 Baseline):

rfib: //@03E27C

4911 STC R1, R1

.reloc __prolog_0000_0000020000FC 0F/RELW20_BJX

F000_D000 BSR (PC, 0)

F2F1_D330 ADD -208, SP

18D4 MOV R4, R13

F2DA_C802 CMPQGE 2, R13

.reloc .L00802A62 0F/RELW20_BJX

E000_C000 BT (PC, 0)

DA01 MOV 1, R10

.reloc .L00802A63 0F/RELW20_BJX

F000_C000 BRA (PC, 0)

3000 NOP

.L00802A62: //@03E298

F2ED_11FF ADD R13, -1, R14

F04E_1089 MOV R14, R4

.reloc rfib 0F/RELW20_BJX

F000_D000 BSR (PC, 0)

F4C2_1089 MOV R2, R12 |

F2ED_11FE ADD R13, -2, R14

F04E_1089 MOV R14, R4

.reloc rfib 0F/RELW20_BJX

F000_D000 BSR (PC, 0)

F0B2_1089 MOV R2, R11

F0EC_10B0 ADD R12, R11, R14

F0AE_1089 MOV R14, R10

.L00802A63: //@03E2C0

182A MOV R10, R2

.L00C108CE: //@03E2C2

F2F0_D0D0 ADD 208, SP

.reloc __epilog_0000_0000020000FC 0F/RELW20_BJX

F000_C000 BRA (PC, 0)

3030 BRK

defeats its use as a benchmark...

Say, because it drops from ~ O(2^n) to ~ O(n)...

Granted, There are plenty of other much more efficient ways of

calculating Fibonacci numbers...

Dec 30, 2023, 7:00:39 PM12/30/23

to

> SUB RQ13, 1, RQ14

> MOV RQ14, RQ4

> BSR rfib

> MOV RQ2, RQ12

> SUB RQ13, 2, RQ14

> MOV RQ14, RQ4

> BSR rfib

> MOV RQ2, RQ11

> ADD RQ12, RQ11, RQ14

> MOV RQ14, RQ10

> ..L00802A63:

> MOV RQ10, RQ2

> ..L00C108CE:

> F000_D000 BSR (PC, 0)

> F2F1_D330 ADD -208, SP

> 18D4 MOV R4, R13

> F2DA_C802 CMPQGE 2, R13

> ..reloc .L00802A62 0F/RELW20_BJX

> E000_C000 BT (PC, 0)

> DA01 MOV 1, R10

> ..reloc .L00802A63 0F/RELW20_BJX

> F000_C000 BRA (PC, 0)

> 3000 NOP

> ..L00802A62: //@03E298

> F2ED_11FF ADD R13, -1, R14

> F04E_1089 MOV R14, R4

> ..reloc rfib 0F/RELW20_BJX

> F000_D000 BSR (PC, 0)

> F4C2_1089 MOV R2, R12 |

> F2ED_11FE ADD R13, -2, R14

> F04E_1089 MOV R14, R4

> ..reloc rfib 0F/RELW20_BJX

> F000_D000 BSR (PC, 0)

> F0B2_1089 MOV R2, R11

> F0EC_10B0 ADD R12, R11, R14

> F0AE_1089 MOV R14, R10

> ..L00802A63: //@03E2C0

> 182A MOV R10, R2

> ..L00C108CE: //@03E2C2

> F2F0_D0D0 ADD 208, SP

> ..reloc __epilog_0000_0000020000FC 0F/RELW20_BJX

pure function and then the compiler itself performs that transformation.

{{Oh and BTW: Fibonacci should have never been a benchmark (post 1985)}}

> Say, because it drops from ~ O(2^n) to ~ O(n)...

Back in 1982±, S.E.L. got their compiler to recognize the transcendental

loop as constant, and this one transformation improved Whetstone by 3×.

> Granted, There are plenty of other much more efficient ways of

> calculating Fibonacci numbers...

BigO( const )

Dec 31, 2023, 4:12:15 AM12/31/23

to

On 30/12/2023 23:19, BGB wrote:

> Granted, There are plenty of other much more efficient ways of

> calculating Fibonacci numbers...

>

Yes. In order of effort.

1) Memoize the recursive function.

2) Return values for (n-1,n-2) as a tuple, and make it single (tail)

recursive. Which a smart compiler could theoretically convert to an

iterative loop.

3) Write an iterative loop yourself.

4) Remember college maths, and solve the recurrence relation to give a

very simple closed form solution.

In fact, given that fib grows exponentially, so we only ever need to

calculate it for small values of n, the only “bad” solution is the naive

double recursive one.
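
For concreteness, option 1 might look something like this (a sketch, using
the same fib(0)=fib(1)=1 convention as the code above; the mfib name is mine):

#include <unordered_map>

long mfib(long n)
{
    static std::unordered_map<long, long> cache;   // survives across calls
    if (n < 2) return 1;
    auto it = cache.find(n);
    if (it != cache.end()) return it->second;      // already computed
    long r = mfib(n - 1) + mfib(n - 2);
    cache[n] = r;
    return r;
}

Same doubly recursive shape, but each value is computed only once, so the
call count drops from ~2^n to ~n.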

Dec 31, 2023, 10:22:44 AM12/31/23

to

since you are effectively getting rid of the O(2^n) exponential

recursion fanout, making it O(n) instead.

The step from that to tail recursion elimination is the only thing

remaining, right?

What you've done is similar to converting

f_n = fib(n)

into

(f_n,f_n_minus_1) = fib2(n)

{

if (n < 2) return (1,1);

(f2,f1) = fib2(n-1);

return (f1+f2,f2);

}
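
In C++ terms that is roughly (a sketch, not the poster's code; C++17
structured bindings):

#include <utility>

std::pair<long, long> fib2(long n)   // returns (fib(n), fib(n-1))
{
    if (n < 2) return {1, 1};
    auto [f2, f1] = fib2(n - 1);     // f2 = fib(n-1), f1 = fib(n-2)
    return {f1 + f2, f2};
}

long fib(long n) { return fib2(n).first; }

As written the recursive call is not quite in tail position, but with the
pair threaded through an accumulator parameter it becomes a straight loop.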

Dec 31, 2023, 10:44:51 AM12/31/23

to

On 31/12/2023 09:12, Pancho wrote:

> In fact, given that fib grows exponentially, so we only ever need to

> calculate it for small values of n, the only “bad” solution is the naive

> double recursive one.

Well, I might disagree.

Given that we only ever need to calculate it for small values of n (as

you say, and if that's true) its efficiency doesn't matter. If you need

to compute it often enough that efficiency might become relevant just

put the results in a table and use that. Given that fib goes over 10^12

after about 60 terms it needn't be a very big table.

The naive doubly recursive solution has the huge merit of being

human-readable.

Unlike, say:

double fib( int n )

{

return (pow( (1+sqrt(5.0))/2, n+1 )

- pow( (1-sqrt(5.0))/2, n+1 ))

/ sqrt(5.0);

}

... which is great for fibs of big values, but not fast for small ones.

--

Cheers,

Daniel.

Dec 31, 2023, 11:40:14 AM12/31/23

to

On 31/12/2023 15:45, Daniel James wrote:

> On 31/12/2023 09:12, Pancho wrote:

>> In fact, given that fib grows exponentially, so we only ever need to

>> calculate it for small values of n, the only “bad” solution is the

>> naive double recursive one.

>

> Well, I might disagree.

>

> Given that we only ever need to calculate it for small values of n (as

> you say, and if that's true) its efficiency doesn't matter. If you need

> to compute it often enough that efficiency might become relevant just

> put the results in a table and use that. Given that fib goes over 10^12

> after about 60 terms it needn't be a very big table.

>

Memoization is effectively a lookup table, but without the thinking. In

my case, avoiding thinking massively improves code reliability.

> The naive doubly recursive solution has the huge merit of being

> human-readable.

Yes, but you'd be surprised how quickly double recursion goes bad. I

haven't tested recently, but off the top of my head, the maximum

function call depth would be something like 300, so fib(10) would be a

stack overflow. fib(60) would need a stack function call depth ~ 1e18.

> Unlike, say:

>

> double fib( int n )

> {

> return (pow( (1+sqrt(5.0))/2, n+1 )

> - pow( (1-sqrt(5.0))/2, n+1 ))

> / sqrt(5.0);

> }

>

> ... which is great for fibs of big values, but not fast for small ones.

>

I seem to recall, if statements are more costly than floating point

function calls, like exponential.

Dec 31, 2023, 11:53:03 AM12/31/23

to

On 29/12/2023 22:04, Andy Burns wrote:

> and (perversely?) with -Ofast

Thank you for those. It looks as though there are some other effects

going on apart from the cache size - but I can clearly see the drop when

you run out of cache. Within the caches? Too much noise for me to

understand!

Andy

Dec 31, 2023, 12:03:47 PM12/31/23

to

The recursive fib() function has just a single if () statement, so worst

case you'll get one or two branch mispredicts, while the 2^n recursive

calls (for fib(10 or 11)) will probably each take about as long.

An iterative O(n) version will take 2-4 clock cycles per iteration, so

for n less than about 40, you'd only have ~100 clock cycles to evaluate

the two pow() calls. (I'm assuming the compiler or programmer have

pre-calculated both (1+sqrt(5))*0.5 and (1-sqrt(5))*0.5 as well as

1/sqrt(5), so that only the two pow() calls remain!)

You need Mitch's ~20-cycle pow() to win this one at one or two-digit N.
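
For reference, the iterative O(n) version being compared against is just
something like this (same 1,1,2,... convention; the ifib name is mine):

long ifib(long n)
{
    long a = 1, b = 1;              // fib(0), fib(1)
    for (long i = 1; i < n; i++) {
        long t = a + b;             // fib(i+1)
        a = b;
        b = t;
    }
    return b;
}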

Dec 31, 2023, 4:04:07 PM12/31/23

to

On 31/12/2023 16:40, Pancho wrote:

> I

> haven't tested recently, but off the top of my head, the maximum

> function call depth would be something like 300, so fib(10) would be a

> stack overflow. fib(60) would need a stack function call depth ~ 1e18.

>

Sorry, this was a mistake: the function call depth is only ~n, not 2^n.

For fib(n), there are ~2^n function calls, but not 2^n deep on the stack.

Dec 31, 2023, 4:19:06 PM12/31/23

to

On 31/12/2023 17:03, Terje Mathisen wrote:

>

> The recursive fib() function has just a single if () statement, so worst

> case you'll get one or two branch mispredicts, while the 2^n recursive

> calls (for fib(10 or 11)) will probably each take about as long.

>

How many cycles for a branch mispredict? From real life, I have seen if

statements be more expensive than pow, but it was a long time ago, and

possibly code specific. It surprised me at the time, but ever since I

tend to ignore the cost of functions like pow().

> An iterative O(n) version will take 2-4 clock cycles per iteration, so

> for n less than about 40, you'd only have ~100 clock cycles to evaluate

> the two pow() calls. (I'm assuming the compiler or programmer have

> pre-calculated both (1+sqrt(5))*0.5 and (1-sqrt(5))*0.5 as well as

> 1/sqrt(5), so that only the two pow() calls remain!)

>

> You need Mitch's ~20-cycle pow() to win this one at one or two-digit N.

>

Premature optimisation is the root of all evil and what not. If you are

really calling it a huge number of times, a lookup is quickest.
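
Something like this, say (a sketch; the tfib name is mine, and with the
1,1,2,... numbering used here everything up to about fib(91) still fits in a
signed 64-bit value):

#include <cstdint>

std::int64_t tfib(int n)            // caller must keep 0 <= n <= 91
{
    static std::int64_t table[92];
    static bool ready = false;
    if (!ready) {                   // fill the table on first use
        table[0] = table[1] = 1;
        for (int i = 2; i < 92; ++i)
            table[i] = table[i - 1] + table[i - 2];
        ready = true;
    }
    return table[n];
}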

Jan 1, 2024, 9:59:45 AMJan 1

to

On 31/12/2023 16:40, Pancho wrote:

>> The naive doubly recursive solution has the huge merit of being

>> human-readable.

>

> Yes, but you'd be surprised how quickly double recursion goes bad. I

> haven't tested recently, but off the top of my head, the maximum

> function call depth would be something like 300, so fib(10) would be

> a stack overflow. fib(60) would need a stack function call depth ~

> 1e18.

On 64-bit gcc here on Debian (so, 64-bit ints) I can call fib(50)

without blowing the stack, but it takes minutes (on a Ryzen 7).

However, the (signed) 64-bit integer result overflows for fib(n) where

n>45, which is more of a problem.

>> Unlike, say:

>>

>> double fib( int n )

>> {

>> return (pow( (1+sqrt(5.0))/2, n+1 )

>> - pow( (1-sqrt(5.0))/2, n+1 ))

>> / sqrt(5.0);

>> }

>>

>> ... which is great for fibs of big values, but not fast for small

>> ones.

>>

>

> Why isn't it fast for small fibs?

I mean ... obviously it will be as fast for small ones as for large, but

for small ones it won't beat an iterative calculation and for very small

ones I doubt it will beat the naive recursive method.

My point, though, was that clear code is often more important than fast

execution, and while the naive recursive method for fibs gets horribly

slow for large values, even it is acceptable for small ones.

> I seem to recall, if statements are more costly than floating point

> function calls, like exponential.

jump to code that isn't in the cache ... but optimizers and CPUs are

pretty good at branch prediction and caching.

--

Cheers,

Daniel.

Jan 1, 2024, 3:20:22 PMJan 1

to

Daniel James wrote:

> Unlike, say:

> double fib( int n )

> {

> return (pow( (1+sqrt(5.0))/2, n+1 )

> - pow( (1-sqrt(5.0))/2, n+1 ))

> / sqrt(5.0);

> }

Looks fast to me::
> Unlike, say:

> double fib( int n )

> {

> return (pow( (1+sqrt(5.0))/2, n+1 )

> - pow( (1-sqrt(5.0))/2, n+1 ))

> / sqrt(5.0);

> }

fib:

ADD R2,R1,#1

POWN R3,#(1+sqrt(5.0))/2,R2

POWN R4,#(1-sqrt(5.0))/2,R2

FADD R2,R3,R4

FDIV R1,R2,#sqrt(5)

RET

Jan 1, 2024, 3:25:22 PMJan 1