A heads up ahead of our Embench meeting this coming Monday


Ray Simar

Nov 15, 2023, 8:23:10 PM
to emb...@lists.librecores.org, Jennifer Hellar, David Harris, Paolo Savini, David Patterson, jeremy....@embecosm.com
Hi all,

I wanted to just give a quick heads up before our upcoming Monday meeting.  I am very encouraged by the emails I have seen from everyone.  A special thanks to David Harris for helping to fill in some gaps with the Wally core!

I wanted to see if we might be able to consolidate our results by the end of this week.

David H, would you be able to update your Wally numbers?

Jennifer and Paolo, how are you feeling about your pieces?

Everyone use this email thread to provide your latest and greatest and let’s see where we are.

I have a sense that we may be able to pull together a better picture for Dave, so thought I would test the waters.  Thanks so much for your help!

Finally, everyone do your best to stay healthy.  I am among a bunch of students who are sick all the time; flu and Covid are making the rounds.

All the best,
Ray

David Harris

Nov 15, 2023, 8:27:35 PM
to Ray Simar, emb...@lists.librecores.org, Jennifer Hellar, Paolo Savini, David Patterson, jeremy....@embecosm.com
Ray,

I got somewhat different results on one test case earlier this week than I had recorded this summer.  There must be some microarchitectural changes that made a bit of a performance difference.

I’ll need to rerun all the tests with and without caches.  I'll see what I can do by the end of the week, but this is a crunch week with a bunch of other deadlines.

David

David Patterson

Nov 16, 2023, 10:01:19 AM
to David Harris, Ray Simar, emb...@lists.librecores.org, Jennifer Hellar, Paolo Savini, jeremy....@embecosm.com
I'd love to get an overview of everything we've collected and the relationships/overlaps among the pieces.
Best, Dave

P.S. If this email arrives outside of regular hours, I'm not expecting a reply then. I just work weird hours.

Jennifer Hellar

Nov 16, 2023, 5:52:56 PM
to David Patterson, David Harris, Ray Simar, emb...@lists.librecores.org, Paolo Savini, jeremy....@embecosm.com

Thanks for all the feedback and conversation, all!  Here is a summary of what I have generated…

 

Data: https://drive.proton.me/urls/503GADCXJM#OtYw1dINdlpP

 

Cores/Architectures:

  • OpenHW CV32E40P v1.3.2
    • RV32I
    • RV32IM
    • RV32IMC
    • RV32IMFC
  • OpenHW CV32E40X commit b658fbe0
    • RV32I
    • RV32IM
    • RV32IMC
    • RV32IMC_Zba_Zbb_Zbc

 

Toolchain: Embecosm RISC-V GCC 13.1.0

 

Optimization levels: “-msave-restore -Oz”, “-Oz”, “-msave-restore -Os”, “-Os”, “-msave-restore -O2”, “-O2”, “-O3”, “-Ofast”

 

Metrics:

  • From build:
    • object size
    • app size
    • # instructions
    • # 32b
    • # 16b
  • From simulation:
    • # cycles
    • # retired
    • # fetches
    • # loads
    • # stores

 

Best,

Jennifer

 


David Harris

Nov 17, 2023, 4:24:30 PM
to Ray Simar, emb...@lists.librecores.org, Jennifer Hellar, Paolo Savini, David Patterson, jeremy....@embecosm.com
I’ve kicked off jobs for Embench on Wally.  I think I’ll have data by tomorrow morning with and without caches.

I see both size and speed are slightly worse for certain benchmarks.  The size difference must be caused by the compiler, so maybe the speed difference is as well.  My previous results were with GCC 11, and now I’m running GCC 13.1.1.

David


Ray Simar

Nov 17, 2023, 5:13:14 PM
to David Harris, emb...@lists.librecores.org, Jennifer Hellar, Paolo Savini, David Patterson, jeremy....@embecosm.com
Great. Many thanks!

I share the crunch sentiment as we wrap up the semester. 😬

All the best,
Ray

Sent from my iPhone, forgive my terseness


David Harris

Nov 21, 2023, 1:21:01 AM
to Ray Simar, emb...@lists.librecores.org, Jennifer Hellar, Paolo Savini, David Patterson, jeremy....@embecosm.com
Team,

Here are the results I obtained on OpenHWGroup CORE-V Wally with GCC 13.1.1.  

The data all included caches.  RV32D is incompatible with the uncached version of Wally because of the 64-bit floating-point load/store operations.  The data set fits in the cache and the benchmarks warm up the cache and branch predictor, so I don’t see any reason to expect the cache would perform differently than an ideal single-cycle tightly integrated memory.

Data is collected for all benchmarks, but I calculate the geometric mean excluding md5sum, primecount, and tarfind, which are not part of Embench 1.0.

Something is fishy: certain benchmarks such as md5sum, primecount, statemate, tarfind, and wikisort run much faster on RISC-V than on ARM.  That doesn’t seem like apples-to-apples. cubic is much slower on RISC-V because long double is defined differently between the two architectures.

David


Wally Embench Results.xlsx

Roger Shepherd

Nov 22, 2023, 11:14:19 AM
to David Harris, Ray Simar, emb...@lists.librecores.org, Jennifer Hellar, Paolo Savini, David Patterson, Jeremy Bennett
David,

Great to see these results. Some comments below

On 21 Nov 2023, at 06:20, David Harris <har...@g.hmc.edu> wrote:

Team,

Here are the results I obtained on OpenHWGroup CORE-V Wally with GCC 13.1.1.  
...

Data is collected for all benchmarks, but I calculate geometric mean excluding the md5sum, primecount, and tarfind benchmarks that are not part of 1.0.

Something is fishy that certain benchmarks such as md5sum, primecount, statemate, tarfind, and wikisort run so much faster on RISC-V than ARM.  Doesn’t seem like apples-to-apples. cubic is much slower on RISC-V because the long double is defined differently between the architectures.

Huffbench is also much faster on RISC-V than ARM

I have a strong suspicion that at least some of this is a compiler effect. When someone (Embecosm?) produced results for ARM and RISC-V across a range of gcc versions, I took the data, did some gcc-version comparisons for both ARM and RISC-V, and did a same-compiler-version comparison of ARM and RISC-V. (I’ve attached a pdf of the results; this is slightly different to the version I’ve distributed before.)

Although I don’t have the results for md5sum, primecount, or tarfind, the statemate result stands out, there being a 1.74x speed-up on ARM between gcc 8.5 and gcc 11.3. RISC-V shows no such speed-up over these versions, which makes me think that in 8.5 the optimisation had been enabled for RISC-V but wasn’t enabled for ARM until gcc 11.

In the final radar plot, as well as plotting gcc 11 results for RISC-V (green) and ARM (blue), I’ve also plotted the RISC-V results scaled (grey) so that the geometric mean is the same as for ARM. I’ve done this to try to make clear how the relative performance differs between the individual benchmarks of the suite. Interestingly, statemate no longer looks different.

Roger

2023-11-23 Suite graphs.pdf

David Harris

Nov 23, 2023, 6:35:09 PM
to Roger Shepherd, Ray Simar, emb...@lists.librecores.org, Jennifer Hellar, Paolo Savini, David Patterson, Jeremy Bennett, Ross Thompson
Team (+Rose Thompson)

I’ve noticed some issues with benchmarks getting a speedup from the F/D extensions even when the code contains no floats or doubles.  I’m also getting a strange optimization in nbody that results in an apparent 1547x speedup.

1) libud.c was converted from doubles to ints per the comment at line 69 of libud.c.  However, line 171 contains a multiplication by 2.0.  This generates calls to floating-point library routines, such as __adddf3.  Thus, compiling with floating-point gives a significant speedup not expected for integer code. For Embench 2.0, I’d suggest changing 2.0 to 2 (and seeing if anything else is causing floating-point code).  It looks like that should have no logical effect on the program.

2) libwikisort also contains multiplications by 1.0 at lines 904, 910, and 934, even though there are no floating-point variables.  I haven’t tried to determine whether this is truly needed.  I think it might be possible to replace them with integer operations that pick a random number modulo 5.

3) nbody reports a speedup of 1547x for rv32imafdc_zba_zbb_zbc_zbs_zicsr, compared to 0.76 for rv32imc.  I’ve traced each flavor in spike.  With the D extension, the fsqrt instruction is called 10 times (after warmup), and a total of 1026 instructions are executed.  Without the D extension, the sqrt library function is called 1000 times (after warmup), and a total of 3,160,713 instructions are executed.

The nbody benchmark_body has a loop at lines 193-194 that calls bodies_energy 100 times and adds the energy to a running sum.  Each call to bodies_energy has two nested loops that call sqrt a total of 10 times.  The sqrt library call takes 843 instructions.

for (i = 0; i < 100; ++i)
  tot_e += bodies_energy (solar_bodies, BODIES_SIZE);

The call to bodies_energy doesn’t change anything.  The answer ends up in fa4.  The D version of the code only calls bodies_energy once, and then has a loop that adds up fa4 100 times into fa5.  The non-D version of the code calls bodies_energy 100 times.

8000027c: 06400793           li a5,100
80000280: fff78793           add a5,a5,-1
80000284: 02e7f7d3           fadd.d fa5,fa5,fa4
80000288: fe079ce3           bnez a5,80000280 <benchmark_body+0xb0>

I think it is a poor benchmark that leaves such an obvious optimization available to increase the performance 100x.  It’s not clear to me why the non-D version of the code doesn’t detect this optimization.  I believe the true speedup is about 15x from floating-point, and another 100x from detecting that the call to bodies_energy is invariant and only needs to be done once.

I would suggest for Embench 2.0 that the bodies_energy function update the position after each call so that the answer changes from call to call and the 100 iterations can’t be optimized away.  I’m not sure where the Embench flavor of nbody came from, but here’s a flavor that updates positions at line 32.


David




Roger Shepherd

Nov 24, 2023, 1:26:36 PM
to David Harris, Ray Simar, emb...@lists.librecores.org, Jennifer Hellar, Paolo Savini, David Patterson, Jeremy Bennett, Ross Thompson
David,

nbody is recognised as problematic, although I don’t think anyone has previously done this much analysis on it. I think there is a consensus to drop it from Embench 2.0. Certainly some compilers can kill this benchmark.

I have run Embench on Apple x86 and M1 cores. nbody saw an incredible speedup. In fact, it showed up an underlying problem with the way we measure speed in Embench.

Essentially we use a “fudge factor” to ensure that benchmarks run on real hardware for a “reasonable” period of time: long enough that the clock quantum does not introduce significant errors, and short enough that the benchmark runs in seconds rather than hours. The problem is that if we benchmark a processor with very different performance properties (e.g. a floating-point unit, or a compiler that can collapse a benchmark), the speedup can be such that the clock quantum is significant. This is what happened on Apple: nbody would take between 0 and 1 quanta! If I slowed down the whole suite so as to make nbody reasonable, the rest of the suite would be unacceptably slow.

I think there is a fix to this whereby each benchmark has its own fudge factor applied, but it would involve rather more work than I could manage quickly, so I’ve not made a proposal for such a change.

Roger


Pascal Gouedo

Nov 27, 2023, 2:54:27 AM
to Roger Shepherd, David Harris, Ray Simar, emb...@lists.librecores.org, Jennifer Hellar, Paolo Savini, David Patterson, Jeremy Bennett, Ross Thompson, Pascal Gouedo

Hi Roger and David,

 

There is one solution to this kind of problem: do what CoreMark does and allow each benchmark to characterize itself and automatically compute its own number of iterations.

Best regards,

Pascal.

 

