I've continued hacking things up to meet my particular needs, and have thought a little about how to generalize this. I've also explored the Likwid alternatives a bit more to understand how others have approached things. I realize you've been thinking about these issues a lot longer than I have, but outside perspective might be interesting.
My thought is that Likwid's primary strength is that it provides an easy and consistent means of setting and accessing performance counters across processor families regardless of protocol (MSR vs. PCI). I haven't found anything else that does this well. The ability to translate from a group name to a configurable set of counters is great. I also really like that you are striving to support the full capabilities of each processor, rather than reducing to a least common denominator supported by all of them.
I think it would be useful to try to compartmentalize (and simplify) these features by separating them more from the presentation, analysis, and benchmarking portions of Likwid. I think the best way of doing this would be to make likwid-perfctr (and the others) clients of liblikwid of the same standing as other users of the library. Obviously the library would have to be customized for them to continue working as they are, but if it can handle all of those elegantly it should be a decent public API.
This would also mean that both likwid-perfctr and an instrumented application would be able to use the same data structures and configuration info. So instead of passing the application a bit map of the counters in use, the original eventString could be set as an environment variable, and the application could 'decode' it in the same manner that likwid-perfctr did: looking up the CPU information and mapping the counter names to the hardware indexes.
I think the weakest part of Likwid right now (from a design standpoint) is the intermediate file used by the marker API and the analysis and output code that works with it. On the bright side, this seems easy to fix: instead of having the application use the library to collect information and pass it back to likwid-perfctr, have the library expose functions to help the application to access the saved information, and let the application decide whether and how it wants to print or analyze it.
In the same vein, the way you process custom formulas in the group files is heroic! At first I couldn't figure out how it was working, and then when I looked I was amazed. On the other hand, when moving to a library it would be nice to be able to modify the group files without recompiling the library. The easy path might be to have liblikwid's responsibility end with writing out the counter names and their values as CSV (one row per thread), and then use Perl/Python/Ruby in a separate application to prettify the output and do the dynamic calculations.
* likwid-perfctr will pass a bitmap to the instrumented application indicating which counters are in use. This makes it possible to read out only the counters required by the current group.
As mentioned above, I'd strongly suggest giving the client access to the names of the counters as well.
* There are counter maps for each architecture now, which simplify the marker library significantly.
Yes, this seems like a good move.
* I switched to glib for standard library stuff. In the next release I will use the hash implementation of glib.
Then you could also use it for runtime parsing of group files and the like.
* There are fences added which ensure that the marker API functions simply return if the application is run without the likwid-perfctr wrapper. This allows the instrumented binary to run with little added overhead, even on machines without a likwid setup.
Perhaps this is what you've done, but I think it would be best to have this check happen as a macro expansion rather than within the call: Likwid_Start() -> "do { if (Likwid) likwid_start() } while (0)" (or something). This would be in addition to having the macro compile to nothing unless LIKWID was defined. It's possible that function call overhead is low enough not to be an issue, but short of run-time code modification, a single correctly predicted branch is about the best you can do.
For the library approach it would be possible to specify the group you want to measure. You can pass this as an argument or environment variable at runtime. I have to think about a library API and will propose a suggestion here.
I think splitting out the eventString parsing from the rest of the initialization might be a good start for this. The client would grab it from an environment variable where it had been put either by likwid-perfctr or by the user (client doesn't know which). The client would try to initialize the counters, leaving them set as they were if it discovers they are already active.
I think it would also be nice to get all the Likwid run time globals under a single 'conf' variable. While static file-local variables work just fine, it's sometimes been difficult to figure out what information is kept where, and how to make it accessible to a client program. I've been trying some stuff on paper, and have gotten far enough to think that it should be possible. But I'm also still finding that there are whole areas that I know nothing about, so perhaps I'm overly optimistic.
Are you thinking of a central instance of libperfctr (a daemon running all the time)? This is of course another level :-). Well, I have to think about it.
Rather than a daemon, I've been thinking about a more compartmentalized approach where the counter reading is completely independent of the counter setting. The libperfctr client wouldn't care whether it sets the counters itself, or if they were set by Likwid, PMU Tools, perf, or VTune. It would just read counters, and write the results in some standard format. This output could be piped directly to an analysis program, or saved to a file.
The hard part would be coming up with an API that is both simple and flexible enough. I particularly like the goal of making likwid-perfctr just a regular client of libperfctr. It's not that this is necessary in itself, but it would mean that you have a pretty general interface, which could then be wrapped to build the user-interface portions in a more flexible scripting language.
What he lacks is a good integrated approach to counter interfaces and multiple CPUs.
--nate