Subject: Query regarding programmatic energy measurements and Intel RAPL availability

Aidan Dakhama

Feb 19, 2026, 6:04:11 AM
to cloudlab-users

Hi all,

I am currently planning some experiments on CloudLab and need to collect power and energy consumption metrics for my workloads. I have a couple of questions regarding the supported energy measurement capabilities on the clusters:

External Energy Measurements: I understand that some of the clusters/machines have external power monitoring capabilities. I believe you can often see this on the graphs via the web interface. Is there a supported API, CLI tool, or programmatic method for users to fetch these external energy measurements during an experiment, and if so, where can I find documentation on using it?

Intel RAPL Availability: I am also interested in using Intel RAPL. Can you confirm if RAPL is supported and enabled across Intel-based machines, and specifically on the m510 nodes? I primarily need to know if the necessary MSRs or perf_events are exposed to the OS/user space, or if they are disabled/restricted at the BIOS level.

Thank you for your time and help!

Best regards,

Aidan


ajma...@gmail.com

Feb 19, 2026, 8:09:50 PM
to cloudlab-users
Hi Aidan,

External Energy Measurements:  At one point I believe we had a project running power monitoring at one of the clusters, but that project hasn't been active in over 6 years.  Unless I'm mistaken, we do not currently collect external power measurements at any of the CloudLab sites.  The experiment status pages just report load averages from the OS and network statistics from our switch port counters.  We also do not have a way for users to fetch such metrics themselves, as that information would come from the servers' out-of-band management systems.  Depending on the scope/scale of what you need, we could potentially scrape these metrics for you over the course of an experiment and send them to you at the conclusion.

RAPL Availability:  I had to look into this more closely.  It looks like running `sudo rdmsr -X --bitfield 63:63 0x610` (following a `modprobe msr`) will print a `1` if RAPL is locked and `0` otherwise.  That command printed `0` on both an m510 and a c6620 node, so I think you're probably good to go?  If you have issues, then I can look more closely at the BIOS options.
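
For anyone who would rather script this check than install `rdmsr`, here is a minimal Python sketch of the same read. It assumes the `msr` kernel module is loaded, the script runs as root, and the node exposes `/dev/cpu/0/msr`. The function names are made up for illustration. The register number `0x610` and bit 63 come from the command above.

```python
import os
import struct

MSR_ADDRESS = 0x610  # register queried by the rdmsr command above
LOCK_BIT = 63        # bit selected by --bitfield 63:63


def extract_bit(value: int, bit: int) -> int:
    """Return the given bit (0 or 1) of a 64-bit register value."""
    return (value >> bit) & 1


def read_msr(register: int, cpu: int = 0) -> int:
    """Read one 64-bit MSR via the msr driver (needs root and `modprobe msr`)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The msr device uses the file offset as the register number.
        raw = os.pread(fd, 8, register)
    finally:
        os.close(fd)
    return struct.unpack("<Q", raw)[0]


def rapl_locked(cpu: int = 0) -> bool:
    """True if the lock bit is set, mirroring a `1` from the rdmsr command."""
    return extract_bit(read_msr(MSR_ADDRESS, cpu), LOCK_BIT) == 1
```

Calling `rapl_locked()` on a node should return `False` wherever the `rdmsr` command prints `0`.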

Best,
 - Aleks

Aidan Dakhama

May 14, 2026, 10:16:08 AM
to cloudlab-users

Hi Aleks,

Thanks again for checking the RAPL status previously.

We are preparing for our project and will definitely benefit from the external power measurements. Could you let us know which specific nodes/clusters have this available, and how we can arrange for that data to be collected and sent to us?

Additionally, we hope to evaluate our workloads across Intel nodes, and perhaps also AMD and ARM nodes. Since we will be making extensive use of Linux interfaces such as perf_event_open and the powercap sysfs tree, we need to know whether the underlying hardware features are fully exposed and enabled at the BIOS/firmware level.

Specifically, can you confirm if the following are universally enabled on your bare-metal machines, or if there are any platform-wide restrictions?

  • Intel: PMU, RAPL, PEBS, Processor Trace (PT), and LBR.

  • AMD: Core/Data Fabric PMU, RAPL (via AMD MSRs), IBS, and LBR/BRS.

  • ARM: PMUv3, AMU, SPE, and CoreSight/ETM.

  • General: Any other BIOS/firmware features required to fully utilise hardware performance counters and power telemetry.
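
To make the powercap use case concrete, here is a minimal Python sketch of the kind of measurement we have in mind. It assumes the node exposes a zone at `/sys/class/powercap/intel-rapl:0` (an assumption, since zone names and availability vary by platform and kernel) and that `energy_uj` is readable, which usually requires root on recent kernels. The helper names are hypothetical.

```python
from pathlib import Path

# Assumed zone path; the actual zones present depend on the node and kernel.
DEFAULT_ZONE = "/sys/class/powercap/intel-rapl:0"


def read_uj(path) -> int:
    """Read a microjoule value from a powercap sysfs file."""
    return int(Path(path).read_text().strip())


def energy_delta_uj(start: int, end: int, max_range: int) -> int:
    """Energy between two counter readings, allowing for one wraparound."""
    if end >= start:
        return end - start
    # Counter wrapped: treat max_energy_range_uj as the wrap point.
    return end + max_range - start


def measure_joules(workload, zone: str = DEFAULT_ZONE) -> float:
    """Run workload() and return the zone's energy use in joules."""
    zone_dir = Path(zone)
    max_range = read_uj(zone_dir / "max_energy_range_uj")
    start = read_uj(zone_dir / "energy_uj")
    workload()
    end = read_uj(zone_dir / "energy_uj")
    return energy_delta_uj(start, end, max_range) / 1e6
```

This only handles a single wraparound, so long-running workloads would need periodic sampling instead of one reading at each end.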


Thanks again for your time and support!

Best regards,

Aidan

Aleksander Maricq

May 19, 2026, 2:58:26 PM
to cloudlab-users
Hi Aidan,

Apologies for the delay in response.  Offhand, I don't know whether the various features you listed are enabled or disabled.  In general, we tend to leave the BIOS defaults as they are, changing only a relatively small subset of settings for basic functionality or security reasons.  It's quite likely that not all of the features you listed have explicit BIOS settings attached to them; oftentimes (such as on Dell machines) they're tied into wider system performance profile settings.  On Dell machines, for example, the default is typically a Maximum Performance system profile, which can sometimes lock down access to hardware features through some of these host-level interfaces.  Our policy isn't to _intentionally_ restrict access to such features, but it sometimes happens because the administrators at a particular site either deliberately leave the servers on the Maximum Performance profile (for performance reasons) or unintentionally leave them on it because it is the default.  If the profile is set to Custom, which we have done for some of our hardware types, that should open things up more.

You would likely know more about the Linux interfaces you're looking for than we would, so let me suggest a handful of hardware types to start with, and you can let us know if anything you need is missing.  For Intel-based machines, the c6620 nodes at CloudLab Utah have the performance profile set to Custom; for AMD-based machines, the c6525-25g and c6525-100g nodes, also at CloudLab Utah, have it set to Custom as well.  For ARM, you only have two choices of hardware type, period:  the two Nvidia Grace Hopper nodes at Clemson (which are often in high demand) and the low-powered m400 nodes at CloudLab Utah.
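
A quick first pass on any of these nodes could be a script that checks which interfaces the kernel has registered. The sketch below is only an illustration: the sysfs names in it are the ones commonly used by mainline Linux drivers, but they are assumptions that may differ by kernel version and platform. A missing entry can also mean a missing driver rather than a BIOS restriction.

```python
from pathlib import Path

# Commonly seen sysfs entries (assumed; names can vary by kernel and platform).
CANDIDATES = {
    "powercap RAPL zone": "/sys/class/powercap/intel-rapl:0",
    "RAPL perf PMU": "/sys/bus/event_source/devices/power",
    "Intel PT": "/sys/bus/event_source/devices/intel_pt",
    "AMD IBS fetch": "/sys/bus/event_source/devices/ibs_fetch",
    "AMD IBS op": "/sys/bus/event_source/devices/ibs_op",
    "ARM SPE": "/sys/bus/event_source/devices/arm_spe_0",
    "CoreSight ETM": "/sys/bus/event_source/devices/cs_etm",
}


def probe(candidates: dict, root: str = "/") -> dict:
    """Map each feature name to whether its sysfs path exists under root."""
    return {
        name: (Path(root) / path.lstrip("/")).exists()
        for name, path in candidates.items()
    }


if __name__ == "__main__":
    for name, present in probe(CANDIDATES).items():
        print(f"{name:20s} {'present' if present else 'missing'}")
```

Features without a dedicated sysfs entry (PEBS and LBR, for example) would still need to be tested by actually opening the relevant perf events.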

As far as gathering external power measurements is concerned, there is no cluster where we have any setup that actively gathers power measurements (though we used to do this at CloudLab Utah once upon a time).  That is to say, any such setup would be a one-off on a per-experiment basis.  The easiest way would be for us to use Dell machines and use `racadm` to poll the given iDRACs using `get System.Power`.  That output would look something like this (with some information trimmed out):
#Avg.LastDay=386 W | 1317 Btu/hr
#Avg.LastHour=428 W | 1461 Btu/hr
#Avg.LastWeek=423 W | 1444 Btu/hr
...
#Max.LastDay=738 W | 2519 Btu/hr
#Max.LastDay.Timestamp=Tue May 19 11:31:05 2026
#Max.LastHour=554 W | 1891 Btu/hr
#Max.LastHour.Timestamp=Tue May 19 12:22:05 2026
#Max.LastWeek=738 W | 2519 Btu/hr
#Max.LastWeek.Timestamp=Tue May 19 11:31:05 2026
...
#Min.LastDay=140 W | 478 Btu/hr
#Min.LastDay.Timestamp=Mon May 18 22:34:21 2026
#Min.LastHour=421 W | 1437 Btu/hr
#Min.LastHour.Timestamp=Tue May 19 12:20:35 2026
#Min.LastWeek=5 W | 17 Btu/hr
#Min.LastWeek.Timestamp=Sun May 17 12:15:55 2026
#Realtime.Amps=2.2 Amps
#Realtime.Headroom=1973 W | 6734 Btu/hr
#Realtime.Power=427 W | 1457 Btu/hr
...

The command is slow enough that per-second granularity wouldn't be possible.  To be honest, I don't even know how fast the "Realtime" fields actually update, so your granularity might only be on the order of minutes.  We would set up a script to pull this info off of the iDRAC(s) every X amount of time and dump it all into an output file per host that you could parse through.  See if you can get what you need from the host interfaces first, and if you do still need external measurements we can work it out from there.
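
As a rough idea of what parsing such an output file might involve, here is a minimal Python sketch. It assumes the file holds lines in the format shown above. The function names and the `(seconds, watts)` sample layout are hypothetical, and the energy figure is only a trapezoidal estimate whose accuracy depends on the polling interval.

```python
import re

# Matches lines such as "#Realtime.Power=427 W | 1457 Btu/hr"; timestamp and
# amperage lines do not match because they lack a "<number> W" value.
FIELD = re.compile(r"^#?(?P<key>[\w.]+)=(?P<watts>\d+(?:\.\d+)?) W\b")

SAMPLE = """\
#Avg.LastHour=428 W | 1461 Btu/hr
#Max.LastHour.Timestamp=Tue May 19 12:22:05 2026
#Realtime.Amps=2.2 Amps
#Realtime.Power=427 W | 1457 Btu/hr
"""


def parse_power_fields(text: str) -> dict:
    """Return {field name: watts} for every wattage line in the output."""
    fields = {}
    for line in text.splitlines():
        match = FIELD.match(line.strip())
        if match:
            fields[match.group("key")] = float(match.group("watts"))
    return fields


def estimate_energy_joules(samples: list) -> float:
    """Trapezoidal estimate of energy from (seconds, watts) samples."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += (t1 - t0) * (p0 + p1) / 2
    return total
```

For example, `parse_power_fields(SAMPLE)["Realtime.Power"]` picks out the instantaneous reading from one poll, and a list of such readings with their poll times can be passed to `estimate_energy_joules`.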

Hopefully this all is enough for you to at least get started.  Let us know if you have any other questions or problems.

Best,
 - Aleks
