Software Drag Racing: C++ vs C# vs Python vs Eiffel - Which Will Win?


Finnian Reilly

Mar 30, 2021, 8:20:28 AM
to Eiffel Users
I came across this YouTube video (https://www.youtube.com/watch?v=D3h62rgewZM) in which retired Microsoft engineer Davepl writes the same 'Primes' benchmark in Python, C#, and C++, compares and explains the differences in the code, and then races them head to head to see how the languages' performance differs.

I was curious to know how Eiffel would stack up in this comparison, so I downloaded his C++ source code (https://github.com/davepl/Primes) and compiled it on my own machine using the same optimizations you would use to finalize an Eiffel program:
g++ -O3 -std=gnu++11 -pthread PrimeCPP.cpp -oprimes

I then ported his C++ program to Eiffel with these classes:
  1. PRIMES_BENCHMARK_APP (http://www.eiffel-loop.com/benchmark/source/apps/primes_benchmark_app.html)
  2. PRIME_NUMBER_SIEVE (http://www.eiffel-loop.com/benchmark/source/support/prime_number_sieve.html)
These are the results I got when I ran them both:
[image: C++-vs-Eiffel.png]
I then added Eiffel to the graph Davepl made for his video, estimating its position from relative pixel spans measured in Gimp.
[image: software-dragrace-1024.png]
I have to say I am a little disappointed. I was expecting Eiffel to surpass C# and be within 15% of C++.
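For anyone who wants to reproduce this at home, the core of the benchmark is a plain odd-only Sieve of Eratosthenes. This is a simplified C++ sketch of the algorithm being raced (not Davepl's actual PrimeCPP.cpp; see his GitHub repo for that). Note that the prime count of 78498 below one million is the same validity check the benchmark output reports.

```cpp
#include <vector>

// Simplified sketch of the odd-only prime sieve being raced in the
// video (not Davepl's actual PrimeCPP.cpp; see his GitHub repo).
// Even numbers are never stored or visited after 2.
int count_primes(int limit) {
    std::vector<bool> is_prime(limit, true);   // index i stands for the number i
    int count = limit > 2 ? 1 : 0;             // 2 is the only even prime
    for (int factor = 3; (long long)factor * factor < limit; factor += 2) {
        if (!is_prime[factor]) continue;       // already known composite
        // Start at factor^2: smaller multiples were crossed off earlier.
        // Step by 2*factor to skip the even multiples.
        for (int n = factor * factor; n < limit; n += factor * 2)
            is_prime[n] = false;
    }
    for (int n = 3; n < limit; n += 2)
        if (is_prime[n]) ++count;
    return count;
}
```

The benchmark loop simply calls this (on a fresh flag array each pass) as many times as it can in five seconds.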

Anthony W

Mar 30, 2021, 8:29:00 AM
to eiffel...@googlegroups.com
How is this even possible?

I thought Eiffel was supposed to generate C code and compile that to create the executable - it makes no sense to me that a compiled app would run slower than an interpreted one...

Anthony

--
You received this message because you are subscribed to the Google Groups "Eiffel Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to eiffel-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/eiffel-users/1742ff2d-d237-4833-a4bc-5eeac6c83e9en%40googlegroups.com.

Finnian Reilly

Mar 30, 2021, 8:39:32 AM
to Eiffel Users
This is the finalized C code that the Eiffel compiler generates for {PRIME_NUMBER_SIEVE}.execute:
/* {PRIME_NUMBER_SIEVE}.execute */
void F1999_17736 (EIF_REFERENCE Current)
{
    GTCX
    RTEX;
    EIF_INTEGER_32 loc1 = (EIF_INTEGER_32) 0;
    EIF_INTEGER_32 loc2 = (EIF_INTEGER_32) 0;
    EIF_INTEGER_32 loc3 = (EIF_INTEGER_32) 0;
    EIF_INTEGER_32 loc4 = (EIF_INTEGER_32) 0;
    EIF_BOOLEAN loc5 = (EIF_BOOLEAN) 0;
    EIF_REFERENCE loc6 = (EIF_REFERENCE) 0;
    EIF_REFERENCE tr1 = NULL;
    EIF_REAL_32 tr4_1;
    EIF_BOOLEAN tb1;
    EIF_BOOLEAN tb2;
    RTLD;
   
    RTLI(3);
    RTLR(0,Current);
    RTLR(1,loc6);
    RTLR(2,tr1);
    RTLIU(3);
   
    RTEAA("execute", 1998, Current, 6, 0, 27204);
    RTGC;
    RTHOOK(1);
    loc4 = F1999_17739(Current);
    RTHOOK(2);
    loc6 = *(EIF_REFERENCE *)(Current);
    RTHOOK(3);
    tr4_1 = (EIF_REAL_32) (F1999_17739(Current));
    tr4_1 = (EIF_REAL_32) sqrt((double) tr4_1);
    tr1 = RTLNS(eif_new_type(1451, 0x00).id, 1451, _OBJSIZ_0_0_0_0_1_0_0_0_);
    *(EIF_REAL_32 *)tr1 = tr4_1;
    loc2 = F1450_6688(RTCW(tr1));
    RTHOOK(4);
    loc1 = (EIF_INTEGER_32) ((EIF_INTEGER_32) 3L);
    for (;;) {
        RTHOOK(5);
        if ((EIF_BOOLEAN) (loc1 > loc2)) break;
        RTHOOK(6);
        loc5 = (EIF_BOOLEAN) (EIF_BOOLEAN) 0;
        RTHOOK(7);
        loc3 = (EIF_INTEGER_32) loc1;
        for (;;) {
            RTHOOK(8);
            if ((EIF_BOOLEAN) (loc5 || (EIF_BOOLEAN) (loc3 >= loc4))) break;
            RTHOOK(9);
            /* INLINED CODE (SPECIAL.item) */
            tb2 = *((EIF_BOOLEAN *)loc6 + (loc3));
            /* END INLINED CODE */
            tb1 = tb2;
            if (tb1) {
                RTHOOK(10);
                loc1 = (EIF_INTEGER_32) loc3;
                RTHOOK(11);
                loc5 = (EIF_BOOLEAN) (EIF_BOOLEAN) 1;
            } else {
                RTHOOK(12);
                loc3 += ((EIF_INTEGER_32) 2L);
            }
        }
        RTHOOK(13);
        loc3 = (EIF_INTEGER_32) (EIF_INTEGER_32) (loc1 * loc1);
        for (;;) {
            RTHOOK(14);
            if ((EIF_BOOLEAN) (loc3 >= loc4)) break;
            RTHOOK(15);
            /* INLINED CODE (SPECIAL.put) */
            *((EIF_BOOLEAN *)loc6 + (loc3)) = (EIF_BOOLEAN) 0;
            /* END INLINED CODE */
            ;
            RTHOOK(16);
            loc3 += (EIF_INTEGER_32) (loc1 * ((EIF_INTEGER_32) 2L));
        }
        RTHOOK(17);
        loc1 += ((EIF_INTEGER_32) 2L);
    }
    RTHOOK(18);
    RTLE;
    RTEE;
}

Finnian Reilly

Mar 30, 2021, 8:42:15 AM
to Eiffel Users
I am wondering if the Eiffel SPECIAL arrays are a bit slow in comparison to C++. I might try writing another version using a MANAGED_POINTER instead to see what difference that makes.


Ulrich Windl

Mar 30, 2021, 8:56:07 AM
to eiffel...@googlegroups.com
>>> Anthony W <awei...@gmail.com> schrieb am 30.03.2021 um 14:28 in Nachricht
<CAG+gLBgr_4M+Q=Wy++w=bWWa2CGBJJ6YvNeJPzZA6Q=TN2...@mail.gmail.com>:
> How is this even possible?
>
> I thought Eiffel was supposed to generate C code and compile that to create
> the executable - it makes no sense to me that a compiled app would run
> slower than an interpreted one...

...if it does the same thing.

A purely numerical problem like computing primes does not really need the overhead of an OO language.





Finnian Reilly

Mar 30, 2021, 9:26:03 AM
to Eiffel Users
I created a second implementation, PRIME_NUMBER_SIEVE_2, which uses MANAGED_POINTER instead of TO_SPECIAL, and this is what I got:
Passes: 1365, Time: 5.003, Avg: 0.004, Limit: 1000000, Count: 78498, Valid: True
Passes: 1052, Time: 5.001, Avg: 0.005, Limit: 1000000, Count: 78498, Valid: True

The MANAGED_POINTER implementation only achieves 1052 passes in 5 seconds, so SPECIAL arrays are definitely faster in Eiffel.

Finnian Reilly

unread,
Mar 30, 2021, 11:50:25 AM3/30/21
to eiffel...@googlegroups.com
Removing the OO overhead
> I thought Eiffel was supposed to generate C code and compile that to create the executable - it makes no sense to me that a compiled app would run slower than an interpreted one...

> ...if it does the same thing.
>
> A purely numerical problem like computing primes does not really need the overhead of an OO language.

hi Ulrich

I have tested your hypothesis by removing the OO overhead. I did this by modifying PRIMES_BENCHMARK_APP to run the benchmark a second time, this time reusing the same instance of PRIME_NUMBER_SIEVE and turning garbage collection off. As you can see, it made very little difference to the results.

TO_SPECIAL implementation
Passes: 1328, Time: 5.002, Avg: 0.004, Limit: 1000000, Count: 78498, Valid: True

TO_SPECIAL implementation without GC overhead
Passes: 1369, Time: 5.003, Avg: 0.004, Limit: 1000000, Count: 78498, Valid: True

I am curious to know how Ada would perform in this test. I think I will go on an Ada group and ask an Ada guy if he would like to try.

-- Finnian

-- 
SmartDevelopersUseUnderScoresInTheirIdentifiersBecause_it_is_much_easier_to_read

Alexander Kogtenkov

Mar 30, 2021, 3:09:50 PM
to eiffel...@googlegroups.com
Both C++ and C# use a specialized version of a bit vector, not an array of Boolean values. A more realistic comparison would be to replace "BitArray" in the C# implementation with "bool []".
 
The reverse could also be modelled by packing BOOLEANs into an array of NATURAL_32 or NATURAL_64. With this implementation, the Eiffel version runs slightly slower than the C++ version, but only by a modest margin (about 7% on my machine) that can be explained by the additional dereferencing required to support a moving GC. Another source of inefficiency is the need to compute the address of the modified item twice: once to read its value and once to write the updated value back. The built-in versions of bit vectors can do the modification in place, i.e. without the double computation.
 
In theory, such a highly optimized bit vector class could be added to the Eiffel library to eliminate the mentioned 7% overhead compared to the C++ version. In practice, it’s unclear how many applications use bit vectors.
 
(Hint: using bit vectors instead of arrays of boolean values reduces the memory footprint by a factor of 8, improving CPU cache locality by the same factor. Depending on the CPU and chipset model, the whole bit vector may now fit into a higher-level cache, further improving performance.)
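The packing scheme Alexander describes can be sketched as follows (a hypothetical C++ illustration with invented names, not the actual Eiffel SPECIAL [NATURAL_64] code): 64 flags share one machine word, and a shift/mask pair selects the word and the bit within it.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of packing boolean flags into 64-bit words.
// One word holds 64 flags: an 8x footprint reduction versus one
// byte-sized boolean per flag.
struct PackedBits {
    std::vector<std::uint64_t> words;

    explicit PackedBits(std::size_t n) : words((n + 63) / 64, 0) {}

    // i >> 6 selects the word, i & 63 selects the bit within it.
    void set(std::size_t i)   { words[i >> 6] |=  std::uint64_t{1} << (i & 63); }
    // Note the read-modify-write: the word is fetched, masked, stored back.
    void clear(std::size_t i) { words[i >> 6] &= ~(std::uint64_t{1} << (i & 63)); }
    bool test(std::size_t i) const { return (words[i >> 6] >> (i & 63)) & 1; }
};
```

The double address computation mentioned above is visible in `clear`: the word must be located once to read it and the same location written again with the updated value.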
 
Alexander Kogtenkov
 
 

Anthony W

Mar 30, 2021, 3:28:12 PM
to eiffel...@googlegroups.com
Thanks for the clarification, Alexander! I'm about to take another foray into Eiffel land, and the performance discrepancy caused some concern. Fingers crossed I can make good headway using Eiffel to replace my C# trading app.

Finnian Reilly

Mar 31, 2021, 9:07:00 AM
to Eiffel Users
Hi Alexander
> In theory, such a highly optimized bit vector class can be added to the Eiffel library to improve the mentioned overhead of 7% compared to C++ version. In practice, it’s unclear how many applications are using bit vectors.
 
I tried wrapping the C++ type vector<bool> with the class EL_CPP_BOOLEAN_VECTOR, and now Eiffel is within 33% of C++ (inlining = 3).
[image: Eiffel-methods.png]

Alexander Kogtenkov

Mar 31, 2021, 2:23:34 PM
to eiffel...@googlegroups.com
Hi Finnian,
 
I’ll send my version (or, more precisely, fragments of it) to you privately because I did not use EL_* classes.
 
It turns out that the inlining factor plays an essential role in getting the best results. My version (it differs from the C++ version by a very efficient bit counting algorithm, but this affects only the final step, not the main loop), with the factor set to 15, outperforms the C++ version on my machine. The results can (and do) differ significantly with lower or higher values of the factor, which is expected with such micro-benchmarks. The results can be made more predictable, and less dependent on the factor value, by inlining two small features (one that clears a bit and one that returns a bit) by hand.
 
Note that everything is pure Eiffel code without any external features, special classes and other heavy constructs. BTW, I’m using an up-to-date version of EiffelStudio, not the one from 2016.
 
Alexander Kogtenkov
 
 

Chris Tillman

Mar 31, 2021, 2:28:18 PM
to eiffel...@googlegroups.com
Nice! Q.E.D.



--
Chris Tillman
Developer

Finnian Reilly

Mar 31, 2021, 5:13:01 PM
to eiffel...@googlegroups.com
Hi Alexander
> Hi Finnian,
>
> I’ll send my version (or, more precisely, fragments of it) to you privately because I did not use EL_* classes.
Thank you for that; it hasn't arrived yet, but I look forward to receiving it.
 
> It turns out that the inlining factor plays an essential role to get the best results.

This is a concept I am not sure I fully understand. I always assumed it to be a C compiler optimization, where function calls are replaced with the actual code, but I haven't yet been able to identify exactly which gcc switch it corresponds to. I looked in config.sh.

But this prompts the question: could the C++ program not also benefit from inlining, and would it not be cheating if we did not also apply it to the C++ example?

> My version (it differs from the C++ version by very efficient bit counting algorithm, but this affects only the final step,
I assume you mean obtaining `prime_count', but that is only called outside the benchmark timer loop, so it won't affect the result.

> not the main loop) with the factor set to 15
Do you mean an inlining factor of 15? Wow, I didn't know you could set it that high. Doesn't that lead to huge executables?

> outperforms the C++ version on my machine.

That's impressive. You should go to the comments of the Dave's Garage video and brag a bit. It would be good publicity for Eiffel.

https://www.youtube.com/watch?v=D3h62rgewZM

> The results can (and do) significantly differ with lower or higher values of the factor that is expected with such micro-benchmarks. The results can be made more predictable and less dependent on the factor value by inlining two small features (one that clears a bit and one that returns a bit) by hand.
 
> Note that everything is pure Eiffel code without any external features, special classes and other heavy constructs. BTW,
Well that is impressive if you are not using any external features.

> I’m using an up-to-date version of EiffelStudio, not the one from 2016.

But will it in theory still work on the old compiler with some modifications?

Finnian


Finnian Reilly

Apr 1, 2021, 5:25:56 AM
to eiffel...@googlegroups.com
The Art of Inlining
> This is a concept I am not sure I fully understand. I always assumed it to be a C compiler optimization where function calls are replaced with the actual code, but I haven't as yet been able to identify exactly which gcc switch it corresponds to. I looked in config.sh

I think I have answered my own question by RTFM at eiffel.org :-)

Inlining, Inlining Size: enables inlining on Eiffel features that can be inlined, i.e. whose size is less or equal to the specified size in the combo box. The size value given in parameter corresponds to the number of instructions as seen by the Eiffel compiler (for example a := b.f corresponds to 2 instructions). The inlining is very powerful since it can inline a function in all your Eiffel code, without scope limitation as found in C or C++ compilers. (C code generation mode only)
This seems to suggest that inlining modifies the C code generation and in fact is not an instruction to the C compiler. Wow, you learn something new every day.
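To make the idea concrete, here is a hypothetical C-style illustration of what feature inlining does to the generated code. The names are invented and this is not actual EiffelStudio output; it only shows the transformation the documentation describes.

```cpp
// Hypothetical illustration of feature inlining at the generated-C
// level; names are invented, this is not EiffelStudio output.

// Without inlining: every query access is a function call.
static int item(const int* area, int i) { return area[i]; }

int sum_with_calls(const int* area, int n) {
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += item(area, i);    // call overhead on every iteration
    return total;
}

// With inlining: the Eiffel compiler substitutes the feature body at
// the call site while generating the C, so the C compiler sees a plain
// array access it can optimize freely. The behavior is identical.
int sum_inlined(const int* area, int n) {
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += area[i];          // body of `item` substituted in place
    return total;
}
```

Both functions return the same result; the difference is only in the code the C compiler gets to optimize.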

I would love to see an article explaining inlining in detail, with graphs etc.

  • When is it appropriate to use, and what are the trade-offs?
  • What are its limitations, and at what point do you experience diminishing returns?
  • What caveats should you look out for?

I remember once setting inlining to 5 on a protein folding analyzer project to spectacular effect.

Finnian

-- 
SmartDevelopersUseUnderScoresInTheirIdentifiersBecause_it_is_much_easier_to_read

Hubert Cater

Apr 1, 2021, 9:17:58 AM
to Eiffel Users
I agree, I would find this very helpful as well; inlining, and how to use it effectively, remains a bit of a mystery to me.



Ulrich Windl

Apr 6, 2021, 2:05:02 AM
to eiffel...@googlegroups.com
>>> "'Alexander Kogtenkov' via Eiffel Users" <eiffel...@googlegroups.com>
schrieb am 30.03.2021 um 21:09 in Nachricht
<1617131388...@f302.i.mail.ru>:

> (Hint: using bit vectors instead of arrays of boolean values reduced the
> memory footprint by 8, improving the CPU cache locality by the same factor.
> Depending on the CPU and chipset model, the whole bit vector can now fit
> into a higher-level cache, further improving the performance.)

Hi!

AFAIR, UCSD Pascal actually used a bit vector when you declared a PACKED ARRAY OF BOOLEAN.

Regards,
Ulrich




Ulrich Windl

Apr 6, 2021, 3:53:36 AM
to eiffel...@googlegroups.com
>>> "'Alexander Kogtenkov' via Eiffel Users" <eiffel...@googlegroups.com>
schrieb am 31.03.2021 um 20:23 in Nachricht
<1617215011...@f496.i.mail.ru>:

> The results can be made more predictable and less dependent on the factor
> value by inlining two small features (one that clears a bit and one that
> returns a bit) by hand.

I thought "inlining by hand" was about as obsolete as using the "register"
attribute in C. ;-)
The other thing that makes me wonder: unless you have really large functions
that aren't "static", gcc inlines them anyway (with -O2). And with a different
C compilation procedure, gcc can even optimize across compilation units
(-combine -fwhole-program)...

Finnian Reilly

unread,
Apr 6, 2021, 7:18:55 AM4/6/21
to eiffel...@googlegroups.com
Hi Alexander

 
> It turns out that the inlining factor plays an essential role to get the best results. My version (it differs from the C++ version by very efficient bit counting algorithm, but this affects only the final step, not the main loop) with the factor set to 15 outperforms the C++ version on my machine.

I don't doubt your benchmarks, but I am having trouble understanding the theory of why it is faster than an unpacked SPECIAL array. Looking at your `clear' routine, I see that it first has to read the existing value at `index' in order to set a new value, whereas with an unpacked array you don't have this additional read. So in theory it should be slower, not faster. Do your results depend on some clever C compiler optimizations? How does it work?

feature -- Modification

   clear (a: like area; i: INTEGER)
         -- Put `False` at `i`-th bit of `a`.
      require
         valid_bit_index: 0 <= i and i < sieve_size
      local
         index: INTEGER
         v: like area.item
      do
         index := i |>> index_shift
         v := a [index] & (one |<< (i & index_mask)).bit_not
         a [index] := v
      ensure
         cleared: not item (a, i)
      end

Finnian

Alexander Kogtenkov

Apr 6, 2021, 9:38:58 AM
to eiffel...@googlegroups.com
The ability to address a single byte of memory belongs to the days of yore. For a long time now, the minimum addressable memory unit has not been a single byte but 8, 16, 32 or 64 bytes, and CPU cache lines have a similar storage capacity. CPUs still support reading and writing single bytes, so how can 1 byte be written in such an architecture?
 
After figuring out how this is done (the hardware fetches the whole unit, modifies the byte within it, and writes the unit back), we can see that if we write some data, reading at the same address is "for free".
 
The next hint is that CPUs perform memory prefetching to keep up with the CPU core. In other words, when we request data at address N, after serving it to us the hardware tries to load the data at address N + 1 in the hope that it will be requested as well. If we issue requests at a very high rate (because the address is new at every iteration, as with an array of 1-byte boolean values), the memory access rate can become the limiting factor, and we have to wait for the next portion of memory to arrive at the CPU core.
 
Using a bit-based algorithm instead of a byte-based one, we reduce the required prefetch throughput by a factor of 8, reducing the waiting time and improving CPU utilization (yes, at the cost of additional bit operations, but here the extra work is more time-effective than the extra waiting).
 
Now, why not use longer integers? We could be using 8-byte, 16-byte, etc. integers. Unfortunately, a general-purpose ALU would need pretty sophisticated (and slow) logic to support multiplication, etc. To overcome this, CPUs have special instruction sets and special registers without full-width multiplication and other costly operations, but then the compiler has to be instructed to use them instead of the general-purpose ones. Another issue is alignment: longer values require larger padding (i.e. unused memory). If the padding is not used, or the data is not aligned with the cache line size in mind, a longer (8- or 16-byte) integer may be placed in two adjacent memory units instead of a single one. If this happens, access to such an integer requires 2 hardware memory reads instead of one, and we are back to the waiting issue (and other issues caused by the need to read and update 2 cache lines instead of one).
 
The real behavior becomes even trickier thanks to synchronization (if the same (or neighboring) memory addresses are accessed by different CPU cores, or, worse, by different CPUs). Reducing memory footprint improves access locality and reduces the number of potential conflicts of concurrent accesses.
 
Talking about locality, the less memory is used the more chances to get it all in the fast level-1 cache. If not, in the slower level-2 cache. If not… — you see the point. Lower memory requirements also reduce the risk that the data will be replaced in the cache by some other application running at the same time.
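To put rough numbers on the locality argument (assuming sizeof(bool) == 1, which is typical): the one-million-flag sieve in this thread takes about 1 MB as an array of byte-sized booleans, far larger than a typical 32 to 64 KB L1 data cache, but only about 122 KB when packed one bit per flag.

```cpp
#include <cstdint>

// Footprint of n boolean flags, assuming sizeof(bool) == 1 (typical).
std::size_t bytes_as_bools(std::size_t n) {
    return n * sizeof(bool);                          // one byte per flag
}

// Footprint when packed 64 flags per 8-byte word, as in the bit-vector
// version discussed above: exactly 1/8 of the byte-array footprint.
std::size_t bytes_as_bits(std::size_t n) {
    return ((n + 63) / 64) * sizeof(std::uint64_t);   // one bit per flag
}
```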
 
Hope this helps,
Alexander Kogtenkov
 
 