I've been thinking about how the counters should actually be accessed. Currently this can be done either in 'direct' mode or via an 'access daemon'.
First, background stuff Jan already knows but I'm just learning:
The per-core performance counters are read with the assembly instruction 'rdmsr'. 'rdmsr' is a privileged instruction that can only be issued in "Ring 0" (kernel mode). The kernel module 'msr' creates device files (/dev/cpu/*/msr) that can be used to execute this instruction on behalf of "Ring 3" (user mode) processes. The 'uncore' functions shared by the whole CPU are accessed as part of the memory-mapped PCI space, which is exposed under /proc.
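To make the msr interface concrete, here is a minimal sketch in C of a user-space read. It assumes the Linux 'msr' module's device files at /dev/cpu/N/msr, where the file offset selects the register address and each read returns 8 bytes. The function names msr_device_path and read_msr are my own, not Likwid's, and the register address is left to the caller.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Build the device path for one core, e.g. /dev/cpu/0/msr.
 * Returns 0 on success, -1 if the buffer is too small.
 * (Hypothetical helper name, for illustration only.) */
int msr_device_path(int cpu, char *buf, size_t len)
{
    assert(buf != NULL);
    int n = snprintf(buf, len, "/dev/cpu/%d/msr", cpu);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read one 64-bit register from an already opened msr device.
 * The file offset is the register address; the kernel module
 * issues rdmsr on the caller's behalf.
 * Returns 0 on success, -1 on a short read or error. */
int read_msr(int fd, uint32_t reg, uint64_t *value)
{
    assert(value != NULL);
    ssize_t got = pread(fd, value, sizeof *value, (off_t)reg);
    return got == (ssize_t)sizeof *value ? 0 : -1;
}
```

A caller would open the path with open(path, O_RDONLY) and pass the descriptor to read_msr. The open itself is where the permission problem discussed next shows up.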
The direct mode is problematic for the Marker API because of the permissions required to access the MSR registers and PCI space. A thread/process needs to be running as root, and in addition needs to have CAP_SYS_RAWIO set ('man capabilities' for details). While this may be reasonable for a single program like likwid-perfctr, it isn't workable for the Marker API: every instrumented program would need these privileges, and a user without root permissions cannot grant them.
The alternative is to use a properly anointed daemon to do the reading on behalf of the client. This way, only a single daemon needs to be setuid root and setcap CAP_SYS_RAWIO. While still a security concern, setting up a single simple program this way is simpler and safer than setting up each program that is to be instrumented.
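A rough sketch of the client/daemon exchange I have in mind, in C. Everything here is an assumption for illustration: the wire structs, the function names, and the stub reader are hypothetical and are not the format used by any existing Likwid daemon. The daemon side takes the privileged read as a callback so the sketch can run without root.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wire format for one counter read. */
struct counter_request { uint32_t cpu; uint32_t reg; };
struct counter_reply   { uint64_t value; int32_t status; int32_t pad; };

/* The privileged read the daemon performs on behalf of clients. */
typedef int (*counter_reader)(uint32_t cpu, uint32_t reg, uint64_t *value);

/* Stand-in reader that returns a synthetic value; a real daemon
 * would read the msr device here instead. */
int stub_reader(uint32_t cpu, uint32_t reg, uint64_t *value)
{
    *value = (uint64_t)cpu * 1000u + reg;
    return 0;
}

/* Client: ask the daemon to read one register on one core. */
int send_request(int sock, uint32_t cpu, uint32_t reg)
{
    struct counter_request req = { cpu, reg };
    return write(sock, &req, sizeof req) == (ssize_t)sizeof req ? 0 : -1;
}

/* Daemon: handle exactly one request and send back the reply. */
int serve_one(int sock, counter_reader reader)
{
    assert(reader != NULL);
    struct counter_request req;
    struct counter_reply rep;
    if (read(sock, &req, sizeof req) != (ssize_t)sizeof req)
        return -1;
    memset(&rep, 0, sizeof rep);
    rep.status = reader(req.cpu, req.reg, &rep.value);
    return write(sock, &rep, sizeof rep) == (ssize_t)sizeof rep ? 0 : -1;
}

/* Client: collect the reply; returns the daemon's status code. */
int recv_reply(int sock, uint64_t *value)
{
    struct counter_reply rep;
    if (read(sock, &rep, sizeof rep) != (ssize_t)sizeof rep)
        return -1;
    *value = rep.value;
    return rep.status;
}
```

Only the process calling serve_one would need elevated permissions. The client side is ordinary socket I/O that any unprivileged program could link in.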
My current thinking is that Markers should do away with the 'direct' mode, and always use the access daemon. While at it, perhaps this can be done for the rest of the Likwid suite. This way there is only one code path that needs to be maintained, and only one program running with potentially dangerous permissions.
My main worry is the accuracy of readings. The lag between sending a 'read counters' request and the read actually happening might make it impossible to measure short code segments. On the other hand, the reading is already mediated by the kernel, so perhaps this won't be much worse. One partial solution would be to do RDTSC-type readings locally, so at least the total time is correct.
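For the local timing idea, a minimal sketch in C, assuming GCC or Clang on x86 where the __rdtsc() intrinsic from x86intrin.h is available. The region_timing struct and function names are hypothetical, and the clock_gettime fallback is only there so the sketch builds on other architectures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#endif

/* Take a timestamp locally, with no round trip to the daemon. */
uint64_t local_ticks(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();              /* unprivileged time stamp counter read */
#else
    struct timespec ts;            /* fallback: monotonic clock in ns */
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Hypothetical per-region record kept on the client side. */
struct region_timing { uint64_t start; uint64_t stop; };

void region_start(struct region_timing *r)
{
    assert(r != NULL);
    r->start = local_ticks();
}

void region_stop(struct region_timing *r)
{
    assert(r != NULL);
    r->stop = local_ticks();
}

/* Ticks spent between region_start and region_stop. */
uint64_t region_elapsed(const struct region_timing *r)
{
    return r->stop - r->start;
}
```

The counter values would still come from the daemon with whatever lag that implies, but the elapsed ticks for the region are taken in the instrumented thread itself. Converting ticks to seconds needs the TSC frequency, which is left out here.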
If one did move to this approach, it might be worth using a more standard protocol for access. HTTP is supported by GLib's 'libsoup', and might work well, or 0mq/nanomsg could be used if higher performance or asynchronous connections are needed. Using a standard protocol would have the side benefit of letting measurements be taken for non-compiled languages.
Are there reasons I'm missing that direct mode is still needed?
ps. An alternative that might be worth considering is to skip the daemon and instead do it as a kernel module. Considering how close to the metal Likwid has to operate, I occasionally think this might end up being simpler. It certainly would provide a good separation between 'client' and 'server'.
pps. Enjoy Denver! I'm in the SF Bay area, so I won't be able to meet you there.