Some multiplexed derived events return negative values


Wyatt Spear

Nov 25, 2025, 8:40:24 PM (7 days ago) Nov 25
to ptools-perfapi
We ran into negative values from PAPI 7.2.0 for some branch-related events when using TAU with multiplexing.

This was our event list for TAU, though it may be possible to reproduce this with fewer events:

TAU_METRICS=TIME,PAPI_BR_CN,PAPI_BR_UCN,PAPI_BR_TKN,PAPI_BR_NTK,PAPI_BR_MSP,PAPI_BR_PRC,PAPI_NATIVE_perf::BRANCHES,PAPI_NATIVE_perf::BRANCH-MISSES

I'm linking test code that demonstrates this without involving TAU: papi_multiplex_native_test.c

For example, it looks like BR_CN is sometimes 0 when BR_MSP or BR_TKN is subtracted to derive BR_PRC or BR_NTK.

Please let me know if there's something we can do on our side to work around this, or if a PAPI-side fix may be in the works.

Thanks,
Wyatt Spear

Heike Jagode

Nov 25, 2025, 11:42:26 PM (7 days ago) Nov 25
to Wyatt Spear, ptools-perfapi
Hi Wyatt,

Thanks for reporting this. What architecture are you running on?

Thanks,
Heike

__________________________________________
Heike Jagode, Ph.D., Research Associate Professor
Innovative Computing Laboratory (ICL)
University of Tennessee Knoxville
http://icl.utk.edu/~jagode/



Nils Smeds

Nov 26, 2025, 3:18:23 AM (7 days ago) Nov 26
to Heike Jagode, Wyatt Spear, ptools-perfapi
This may not always be due to errors in the PAPI code.

You can help by submitting the output from the PAPI event test programs, in particular the ones that show how the PAPI predefined names map to the actual counter events on the platform you are running on (and how they are derived from multiple events).

The rest of this got rather long, but touches on some fundamental aspects of using performance counters that need to be repeated from time to time.

The events actually counted in a counter by a vendor are usually not well defined. They have an intent to count something, but in the real world their implementation is limited by the amount of silicon the vendor allocates to counters and the amount of money and time the vendor is willing to spend on implementing them. (The answer is: as little as the vendor needs for its own development, and then a little less than that.)

"Misleading" counts are often caused by, e.g., mispredicted branches and other speculative execution that is later thrown away but may or may not have caused counters to increment. Another case is, of course, derived Flops counters, which usually assume that the vector was full and that both the multiply and the add were "needed" in SIMD instructions.

The way a vendor event counter is implemented may vary between CPU versions of that same vendor's architecture, even when their description in the public hardware manuals is not changed - and the details needed to get the calculation "right" may be impossible to state in text form in a limited space in the documentation. 

Indeed, it is not even well defined what the "right" value is. "An add on a branch, later discarded - did that add really occur?" 

Branch counters are notorious for giving off results, in my opinion. They are useful for comparing the execution of two similar versions of code, run on the same platform, to see whether they changed significantly, but not for much beyond that.

Counter multiplexing adds yet another level of uncertainty, in that the counters used in the calculation of a derived event were either always active at the same time or never were. In the latter case, each operand is extrapolated from a different part of the run.
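
A minimal sketch of how this can turn a derived difference negative. It assumes a simple time-proportional extrapolation (the count seen while an event was active, scaled up to the whole run) and uses made-up per-slice counts; it is not PAPI's actual code.

```python
def extrapolate(per_slice_counts, active_slices):
    """Scale the counts seen in the active slices up to the whole run."""
    seen = sum(per_slice_counts[i] for i in active_slices)
    return seen * len(per_slice_counts) // len(active_slices)

# Hypothetical ground truth per time slice: taken <= any in every slice.
any_branches = [100, 900, 100, 900]
taken_branches = [60, 500, 60, 500]

# With multiplexing, each event is only counted in "its" slices.
any_est = extrapolate(any_branches, active_slices=[0, 2])      # quiet slices
taken_est = extrapolate(taken_branches, active_slices=[1, 3])  # busy slices

true_not_taken = sum(any_branches) - sum(taken_branches)
est_not_taken = any_est - taken_est
print(true_not_taken, est_not_taken)  # prints "880 -1600"
```

Every slice is individually consistent, yet the subtraction goes negative because the two operands were sampled from different phases of the run.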

If you really want to see branches, I think you have to use a hardware emulator/simulator (and hope it is not too far off from what happens in the hardware).

Limiting yourself to the branch counters that map to a single hardware event on your platform will help you get non-negative results, but they will still be fuzzy as to what the numbers "actually" represent.
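
One way to pick those counters is to filter the papi_decode output. The sketch below is an illustration only: the sample rows are copied from the Xeon output posted in this thread, and the column layout (derivation type in the second field, native events from the seventh field on) is inferred from that output rather than taken from PAPI documentation. The function name is hypothetical.

```python
import csv
import io

# Sample rows copied from the papi_decode output posted in this thread.
DECODE_OUTPUT = '''\
PAPI_BR_CN,NOT_DERIVED,,"Cond branch","Conditional branch instructions",,BR_INST_EXEC:COND
PAPI_BR_TKN,NOT_DERIVED,,"Cond branch taken","Conditional branch instructions taken",,BR_INST_EXEC:TAKEN
PAPI_BR_NTK,DERIVED_SUB,,"Cond br not taken","Conditional branch instructions not taken",,BR_INST_EXEC:ANY,BR_INST_EXEC:TAKEN
PAPI_BR_MSP,NOT_DERIVED,,"Cond br mspredictd","Conditional branch instructions mispredicted",,BR_MISP_EXEC:ANY
PAPI_BR_PRC,DERIVED_SUB,,"Cond br predicted","Conditional branch instructions correctly predicted",,BR_INST_EXEC:COND,BR_MISP_EXEC:COND
'''

def single_event_presets(decode_text):
    """Return preset names that map to exactly one native event."""
    names = []
    for row in csv.reader(io.StringIO(decode_text)):
        if len(row) < 7:
            continue  # no native event listed (assumed: preset unavailable)
        name, derivation = row[0], row[1]
        natives = [field for field in row[6:] if field]
        if derivation == "NOT_DERIVED" and len(natives) == 1:
            names.append(name)
    return names

selected = single_event_presets(DECODE_OUTPUT)
print(",".join(selected))  # prints "PAPI_BR_CN,PAPI_BR_TKN,PAPI_BR_MSP"
```

The resulting list could then be used as the basis for a reduced TAU_METRICS setting.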

Just my 2c on the matter. (But characters are obviously cheap today, so I apologize for the long note)

Nils Smeds

Nov 26, 2025, 2:23:59 PM (7 days ago) Nov 26
to Wyatt Spear, ptools-perfapi, Heike Jagode
You made my day!

I was right at least once today.

/Nils

On Wed, 26 Nov 2025, 19:31 Wyatt Spear, <wjs...@gmail.com> wrote:
I've seen this on both an Intel Xeon X5680 and an AMD 9654 Genoa.


Sorry if that is a double post but it doesn't seem to have shown up in the groups thread.


Here's the filtered papi decode output from the Xeon:

papi_decode | grep -i branch
PAPI_BRU_IDL,NOT_DERIVED,,"Branch idle cycles","Cycles branch units are idle",
PAPI_BTAC_M,NOT_DERIVED,,"Br targt addr miss","Branch target address cache misses",
PAPI_BR_UCN,NOT_DERIVED,,"Uncond branch","Unconditional branch instructions",,BR_INST_EXEC:DIRECT
PAPI_BR_CN,NOT_DERIVED,,"Cond branch","Conditional branch instructions",,BR_INST_EXEC:COND
PAPI_BR_TKN,NOT_DERIVED,,"Cond branch taken","Conditional branch instructions taken",,BR_INST_EXEC:TAKEN
PAPI_BR_NTK,DERIVED_SUB,,"Cond br not taken","Conditional branch instructions not taken",,BR_INST_EXEC:ANY,BR_INST_EXEC:TAKEN
PAPI_BR_MSP,NOT_DERIVED,,"Cond br mspredictd","Conditional branch instructions mispredicted",,BR_MISP_EXEC:ANY
PAPI_BR_PRC,DERIVED_SUB,,"Cond br predicted","Conditional branch instructions correctly predicted",,BR_INST_EXEC:COND,BR_MISP_EXEC:COND
PAPI_BR_INS,NOT_DERIVED,,"Branches","Branch instructions",,BR_INST_EXEC:ANY

And here is a test that shows the source of the negative values in the derived events and confirms it's multiplexing related.


An AI assembled that test but it looks fairly sane to me...

The output of running that code on the Xeon system:

PAPI Multiplexing Derived Event Test

PAPI 7.2. 0, GenuineIntel Intel(R) Xeon(R) CPU           X5680  @ 3.33GHz
Derived event definitions:
  PAPI_BR_CN     : NOT_DERIVED
  PAPI_BR_UCN    : NOT_DERIVED
  PAPI_BR_TKN    : NOT_DERIVED
  PAPI_BR_NTK    : DERIVED_SUB (BR_INST_EXEC:ANY - BR_INST_EXEC:TAKEN)
  PAPI_BR_MSP    : NOT_DERIVED
  PAPI_BR_PRC    : DERIVED_SUB (BR_INST_EXEC:COND - BR_MISP_EXEC:COND)

--- WITHOUT MULTIPLEXING ---
Adding events:
  PAPI_BR_CN           OK
  PAPI_BR_UCN          OK
  PAPI_BR_TKN          OK
  PAPI_BR_NTK          OK
  PAPI_BR_MSP          FAIL (Invalid argument)
  PAPI_BR_PRC          FAIL (Invalid argument)
  perf::BRANCHES       FAIL (Invalid argument)
  perf::BRANCH-MISSES  FAIL (Invalid argument)
Results: 4/8 events added, 0 negative values (0.0%)

--- WITH MULTIPLEXING ---
Adding events:
  PAPI_BR_CN           OK
  PAPI_BR_UCN          OK
  PAPI_BR_TKN          OK
  PAPI_BR_NTK          OK
  PAPI_BR_MSP          OK
  PAPI_BR_PRC          OK
  perf::BRANCHES       OK
  perf::BRANCH-MISSES  OK
  Negative: iter=3 PAPI_BR_NTK=-22232
  Negative: iter=4 PAPI_BR_NTK=-224470
  Negative: iter=5 PAPI_BR_NTK=-214256
Results: 8/8 events added, 260 negative values (6.5%)
Negative by event: PAPI_BR_NTK=75 PAPI_BR_PRC=185
Inconsistent extrapolation: CN<MSP=228, CN<TKN=133

--- SUMMARY ---
                    No Mpx    Mpx
Events added:          4       8
Negative values:       0     260

Negative values occur ONLY with multiplexing enabled.


=Wyatt

Wyatt Spear

Nov 26, 2025, 6:21:36 PM (6 days ago) Nov 26
to ptools-perfapi, Heike Jagode, ptools-perfapi, Wyatt Spear
I've seen this on both an Intel Xeon X5680 and an AMD 9654 Genoa.

=Wyatt


Phil Mucci

Nov 27, 2025, 7:43:29 AM (6 days ago) Nov 27
to Wyatt Spear, ptools-perfapi, Nils Smeds, Heike Jagode, Wyatt Spear
Hi Wyatt,

Been a long time! Hope you are well. Come work for me at AMD sometime. ;-)

Considering that the multiplexing interval is long relative to the processor frequency, I don't think the differences are big enough to conclude the formula isn't valid. However, PAPI should at least detect this case, zero the results, and warn.
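
Until something like that exists in PAPI, the same policy could be applied on the tool side after reading the counters. A minimal sketch, assuming the derived values are already available as plain integers; sanitize_derived is a hypothetical helper, not a PAPI or TAU function, and the PAPI_BR_PRC value is made up.

```python
import warnings

def sanitize_derived(name, value):
    """Clamp a negative derived count to zero and warn about it."""
    if value < 0:
        warnings.warn(f"{name}: negative derived value {value}; reporting 0")
        return 0
    return value

# PAPI_BR_NTK is a value reported in this thread; PAPI_BR_PRC is hypothetical.
readings = {"PAPI_BR_NTK": -22232, "PAPI_BR_PRC": 1500}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cleaned = {name: sanitize_derived(name, v) for name, v in readings.items()}

print(cleaned, len(caught))
```

Clamping hides the symptom only; the warning is what tells the user that the multiplexed operands were inconsistent for that interval.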

Perf events does log the number of samples it has taken per counter during an interval - so it might be in the log.

My recommendation would be to increase the runtime of that code segment. If the formula is wrong, it should be wrong all the time. If not, and the mpx intervals are the problem, then the negative values should disappear (or become less frequent) with increasing runtimes/samples.
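
A toy model of that test, assuming two events that alternate time slices and are scaled up in proportion to the time they were active. The workload numbers are invented: a steady rate plus one burst that only the subtracted counter happens to see.

```python
def multiplexed_not_taken(n_slices):
    """Estimate (any - taken) when the two events alternate time slices."""
    # Hypothetical workload: steady 100 branches / 60 taken per slice,
    # plus a single burst in slice 1, which only the 'taken' counter sees.
    any_counts = [100] * n_slices
    taken_counts = [60] * n_slices
    any_counts[1], taken_counts[1] = 1000, 900

    even = range(0, n_slices, 2)  # slices where 'any' is counted
    odd = range(1, n_slices, 2)   # slices where 'taken' is counted
    any_est = sum(any_counts[i] for i in even) * n_slices // len(even)
    taken_est = sum(taken_counts[i] for i in odd) * n_slices // len(odd)
    return any_est - taken_est

short_run = multiplexed_not_taken(4)
long_run = multiplexed_not_taken(100)
print(short_run, long_run)  # prints "-1520 2320"
```

In this model the burst dominates a short run and produces a negative result, while over a longer run it is averaged out and the difference becomes positive again. A wrong formula would not recover this way.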

There may be a couple tunables as well for multiplexing that allow one to change the interval. Perf can use time or cycle counts as the trigger.

Regards,
Phil



On Nov 27, 2025, at 00:21, Wyatt Spear <wjs...@gmail.com> wrote:

