Learning to love daemons

na...@verse.com

Nov 20, 2013, 4:50:32 AM

to likwid-d...@googlegroups.com

I've been thinking about how the counters should actually be accessed. Currently it can be done 'direct', or via an 'access daemon'.

First, background stuff Jan already knows but I'm just learning:

The per-core performance counters are read with the assembly command 'rdmsr'. 'rdmsr' is a privileged instruction that can only be issued when in "Ring 0" (kernel mode). The kernel module 'msr' creates device files (/dev/cpu/*/msr) that can be used to execute this command on behalf of "Ring 3" (user mode) processes. The whole CPU 'uncore' functions are accessed as part of the memory-mapped PCI space, which is exposed under /proc.
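To make the device-file route concrete, here is a minimal sketch of reading one MSR through the 'msr' module. The function name read_msr is made up for illustration and is not Likwid's API; the sketch assumes the module exposes each core as a file such as /dev/cpu/0/msr, where the register number is used as the file offset.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not part of Likwid): read one 64-bit MSR through
 * a device file such as /dev/cpu/0/msr. The register number is passed
 * as the file offset. Returns 0 on success, -1 on any failure. */
int read_msr(const char *path, uint32_t reg, uint64_t *value)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;                      /* e.g. no permission, module not loaded */

    ssize_t n = pread(fd, value, sizeof(*value), (off_t)reg);
    close(fd);

    /* Anything other than a full 8-byte read is treated as an error. */
    return (n == (ssize_t)sizeof(*value)) ? 0 : -1;
}
```

A call might look like read_msr("/dev/cpu/0/msr", 0x10, &v), where 0x10 is the time stamp counter MSR. Each read is a system call, and that is the cost the rest of this thread is trying to avoid.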

The direct mode is problematic for the Marker API because of the permissions required to access the MSR registers and PCI space. A thread/process needs to be running as root, and in addition needs to have CAP_SYS_RAWIO set ('man capabilities' for details). While this may be reasonable for a single program like likwid-perfctr, it is not workable for arbitrary instrumented programs run by a user without root permissions.

The alternative is to use a properly anointed daemon to do the reading on behalf of the client. This way, only a single daemon needs to be setuid root and setcap CAP_SYS_RAWIO. While still a security concern, setting up a single simple program this way is simpler and safer than setting up each program that is to be instrumented.

My current thinking is that Markers should do away with the 'direct' mode, and always use the access daemon. While at it, perhaps this can be done for the rest of the Likwid suite. This way there is only one code path that needs to be maintained, and only one program running with potentially dangerous permissions.

My main worry would be the accuracy of readings. The lag between sending a 'read counters' request and it actually happening might make it impossible to measure small segments. On the other hand, the reading is already intermediated by the kernel so perhaps this won't be much worse. One partial solution would be to do RDTSC type readings locally, so at least the total time is correct.
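As a rough sketch of the 'RDTSC locally' idea: the instrumented program could take its own time stamps around a region, independent of how the counter values are fetched. The tsc_timer type and the tsc_start/tsc_stop names are invented for this example, and it assumes GCC or Clang on x86, where __rdtsc() is provided by <x86intrin.h>.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc(); GCC/Clang on x86 only */

/* Hypothetical marker-style timer, not an existing Likwid type. */
typedef struct {
    uint64_t start;   /* TSC value at the most recent tsc_start() */
    uint64_t ticks;   /* accumulated TSC ticks over all start/stop pairs */
} tsc_timer;

static inline void tsc_start(tsc_timer *t)
{
    t->start = __rdtsc();
}

static inline void tsc_stop(tsc_timer *t)
{
    /* RDTSC is not a serializing instruction, so for very short regions
     * the result can be blurred by out-of-order execution. */
    t->ticks += __rdtsc() - t->start;
}
```

Usage would be tsc_start(&t) before the region and tsc_stop(&t) after it. The total elapsed time then stays accurate even if the counter reads themselves lag behind because they go through a daemon.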

If one did move to this approach, it might be worth using a more standard protocol for access. HTTP is supported by GLib's 'libsoup', and might work well, or 0mq/nanomsg could be used if higher performance or asynchronous connections are needed. Using a standard protocol would have the side benefit of letting measurements be taken for non-compiled languages.

Are there reasons I'm missing that direct mode is still needed?

--nate

ps. An alternative that might be worth considering is to skip the daemon and instead do it as a kernel module. Considering how close to the metal Likwid has to operate, I occasionally think this might end up being simpler. It certainly would provide a good separation between 'client' and 'server'.

pps. Enjoy Denver! I'm in the SF Bay area, so won't be able to meet you.

na...@verse.com

Nov 20, 2013, 6:24:20 AM

to likwid-d...@googlegroups.com

On Wednesday, November 20, 2013 1:50:32 AM UTC-8, na...@verse.com wrote:

The per-core performance counters are read with the assembly command 'rdmsr'. 'rdmsr' is a privileged instruction that can only be issued when in "Ring 0" (kernel mode).

Already I realize this isn't quite right. The 'rdmsr' command is indeed Ring 0 only, but the performance counters can be read with either 'rdmsr' or 'rdpmc'. 'rdpmc' can be configured to be allowed from non-privileged Ring 3 user level code:

"When in protected or virtual 8086 mode, the performance-monitoring counters enabled (PCE) flag in register CR4 restricts the use of the RDPMC instruction as follows. When the PCE flag is set, the RDPMC instruction can be executed at any privilege level; when the flag is clear, the instruction can only be executed at privilege level 0. (When in real-address mode, the RDPMC instruction is always enabled.) The performance-monitoring counters can also be read with the RDMSR instruction, when executing at privilege level 0."

This should be much faster than going through the kernel, which should yield more accurate readings. You would still need a daemon to do the setup, but the readings could be done truly directly (rather than DAEMON_AM_DIRECT, which goes through the kernel's device files). It seems like there must be a way to make the uncore PCI memory space also available to non-root...
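For illustration, a user-space RDPMC read could look roughly like the sketch below. The names read_pmc and combine_halves are made up. The sketch assumes GCC/Clang inline assembly on x86, that the PCE flag in CR4 is set, and that some privileged component has already programmed the counter being read. If PCE is clear, executing the instruction from Ring 3 faults.

```c
#include <stdint.h>

/* RDPMC returns the counter in EDX:EAX; join the halves into 64 bits. */
static inline uint64_t combine_halves(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}

#if defined(__x86_64__) || defined(__i386__)
/* Hypothetical helper: read the performance counter selected by `index`
 * (passed in ECX). On Intel, setting bit 30 of the index selects a
 * fixed-function counter instead of a general-purpose one.
 * Faults when run from user space with CR4.PCE clear. */
static inline uint64_t read_pmc(uint32_t index)
{
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(index));
    return combine_halves(lo, hi);
}
#endif
```

The read is a single instruction with no system call, which is why it should be cheaper than any path through the kernel.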

--nate

na...@verse.com

Nov 21, 2013, 7:25:12 AM

to likwid-d...@googlegroups.com

On Wednesday, November 20, 2013 3:24:20 AM UTC-8, na...@verse.com wrote:

It seems like there must be a way to make the uncore PCI memory space also available to non-root...

I tried asking for assistance here: http://stackoverflow.com/questions/20120812/how-should-i-read-intel-pci-uncore-performance-counters-on-linux-as-non-root

I also noticed the commented out section of 'RDPMC' for Xeon Phi --- perhaps you've already discovered issues with this approach?

--nate

moebiusband

Nov 21, 2013, 8:40:06 AM

to likwid-d...@googlegroups.com

Hi Nate,

I will reply to your previous posts in more detail next week.

With regard to the RDPMC instruction: this is an ancient instruction that dates back as far as the Pentium Pro. I tried it on Xeon Phi, but it did not work. As far as I remember, I could not execute it. But it is worth another try. You are right that for the marker API (at least for the core counters) this is a good alternative.

Did you try to execute it, e.g. on a SandyBridge? I will also have a look and let you know about the outcome. The instruction is documented in Intel SDM page 1362.

AMD also has a similar HPM interface that can be accessed from user space.

http://developer.amd.com/resources/archive/amd-lightweight-profiling-specification/

I once had a look at it but did not pursue it further.

Regards,

Jan

na...@verse.com

Nov 22, 2013, 7:57:36 AM

to likwid-d...@googlegroups.com

On Thursday, November 21, 2013 5:40:06 AM UTC-8, moebiusband wrote:

With regard to the RDPMC instruction: this is an ancient instruction that dates back as far as the Pentium Pro. I tried it on Xeon Phi, but it did not work. As far as I remember, I could not execute it. But it is worth another try. You are right that for the marker API (at least for the core counters) this is a good alternative.

Did you try to execute it, e.g. on a SandyBridge? I will also have a look and let you know about the outcome. The instruction is documented in Intel SDM page 1362.

I have not yet tried using RDPMC, but I don't anticipate problems. Quite a few other people have used it successfully for this purpose. Vince Weaver tested it against RDMSR here: http://web.eece.maine.edu/~vweaver/projects/perf_events/patches/

His results were that on Sandy Bridge, RDMSR is about 100 cycles, and RDPMC about 30. But these measurements were done already within a kernel module, so the actual benefit of reading from user space with RDPMC should be greater. Avoiding the switch from Ring 3 to Ring 0 should save us another couple hundred cycles per read.

--nate

moebiusband

Nov 22, 2013, 5:34:45 PM

to likwid-d...@googlegroups.com

But if I cannot execute it from user space with a standard Linux kernel, the main advantage is largely gone. I suspect that in the long term we really have to consider a lightweight kernel module. But this is what I always wanted to avoid.

Jan

Nathan Kurz

Nov 23, 2013, 2:56:14 AM

to likwid-d...@googlegroups.com

Using RDPMC should work fine from normal user space as an unprivileged
program. Vince just happened to test it from within a kernel module
for comparison to the other items. Running from userspace does
require that the PCE flag in register CR4 be set in advance to
allow it to work from user space. If this is set, it doesn't even
require root or an access server to use. I think the state of this
flag can be checked and set from
/sys/bus/event_source/devices/cpu/rdpmc
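A program could check that setting before attempting RDPMC. The sketch
below only reads an integer from the file; the function name
read_rdpmc_setting is invented for this example. How the value should be
interpreted is an assumption to verify against the running kernel's
documentation, since as far as I know the meaning of nonzero values has
changed between kernel versions.

```c
#include <stdio.h>

/* Hypothetical helper: return the integer stored in a sysfs-style file
 * such as /sys/bus/event_source/devices/cpu/rdpmc, or -1 if the file
 * cannot be opened or does not start with an integer. */
int read_rdpmc_setting(const char *path)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    int value = -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;                 /* unparsable content */

    fclose(f);
    return value;
}
```

A result of 0 would mean user-space RDPMC is disabled and the daemon or
device-file path has to be used instead. Changing the setting still
needs root, but checking it does not.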

I think Linux changed to make this the default a few years ago, but
I'm not certain it went in. Patch was here:
http://www.serverphorums.com/read.php?12,407343
The phrase to search for in the source would be X86_CR4_PCE in
https://github.com/torvalds/linux/blob/master/arch/x86/kernel/cpu/perf_event.c

I think using RDPMC directly from instrumented programs is something
Likwid should pursue. What I haven't yet figured out is a good way to
read the uncore PCI counters from userspace as a non-root user without
using an intermediary process running as (userspace) root. I hope to
figure out a way to do this also, but this is a separate issue.

Whether or not using a kernel module is a good idea, I don't think it
detracts too much from Likwid. It's not great, but since the
alternative involves installing an SUID root application server, it's
not much worse. In both cases, someone with root access needs to do
the install. I'd agree with you more if the alternative could be
installed and run on machines without prior root access.

--nate
