Likwid Markers as a Library


na...@verse.com

Nov 2, 2013, 2:54:07 AM
to likwid-d...@googlegroups.com
Hi Jan --

I spent a few hours trying to understand how to make Likwid do what I want (mostly outputting thousands of 'regions' to a CSV file) and thought I'd write down my current understanding. Perhaps it will be useful to you or others. If I have things wrong, I'd be glad to be corrected.

The current 'markers' approach requires the coordination of likwid-perfctr and libperfctr. likwid-perfctr reads the command line and configures the counters as appropriate for the current processor. It then executes the program to be measured. If that program has been instrumented with the macros in libperfctr, it will read from those counters. Every 'region' is used as the key of a hash, and an array of (final - start) counter values makes up the bulk of the value. If a region is used more than once, the counters are accumulated. The called program writes the final contents of this hash to a temp file whose name was passed in an environment variable. likwid-perfctr then reads this file and uses it to generate the formatted output.
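A minimal sketch of the bookkeeping described above, assuming a small fixed-size table in place of the real hash. All names here (Region, region_start, region_stop, regions_write) and the output line format are hypothetical and are not taken from libperfctr.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_REGIONS  64   /* illustrative limits, not Likwid's */
#define MAX_COUNTERS 4
#define MAX_NAME     32

typedef struct {
    char     name[MAX_NAME];
    uint64_t start[MAX_COUNTERS];  /* snapshot taken at region start */
    uint64_t total[MAX_COUNTERS];  /* cumulative (final - start) */
    unsigned calls;
} Region;

static Region regions[MAX_REGIONS];
static int    region_count = 0;

/* Linear lookup stands in for the hash; creates the entry on first use.
 * Returns NULL when the table is full. */
static Region *region_get(const char *name)
{
    for (int i = 0; i < region_count; i++)
        if (strcmp(regions[i].name, name) == 0)
            return &regions[i];
    if (region_count == MAX_REGIONS)
        return NULL;
    Region *r = &regions[region_count++];
    memset(r, 0, sizeof *r);
    strncpy(r->name, name, MAX_NAME - 1);  /* stays NUL-terminated after memset */
    return r;
}

/* Remember the counter values at the start of a region. */
void region_start(const char *name, const uint64_t now[MAX_COUNTERS])
{
    Region *r = region_get(name);
    if (r)
        memcpy(r->start, now, sizeof r->start);
}

/* Add (final - start) to the running totals, so repeated use accumulates. */
void region_stop(const char *name, const uint64_t now[MAX_COUNTERS])
{
    Region *r = region_get(name);
    if (!r)
        return;
    for (int c = 0; c < MAX_COUNTERS; c++)
        r->total[c] += now[c] - r->start[c];
    r->calls++;
}

/* Write the table to the file named by the given environment variable.
 * Returns 0 on success, -1 if the variable is unset or the file cannot be opened. */
int regions_write(const char *env_var)
{
    const char *path = getenv(env_var);
    FILE *f = path ? fopen(path, "w") : NULL;
    if (!f)
        return -1;
    for (int i = 0; i < region_count; i++) {
        fprintf(f, "%s %u", regions[i].name, regions[i].calls);
        for (int c = 0; c < MAX_COUNTERS; c++)
            fprintf(f, " %" PRIu64, regions[i].total[c]);
        fputc('\n', f);
    }
    return fclose(f) == 0 ? 0 : -1;
}
```

Note that nothing in this table says what each counter column means; that knowledge lives only in the parent process.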

The trickiness (at least I find it tricky) is that the measured child has no knowledge of what each counter is measuring.  It simply assumes that the counters have been initialized by the parent.  The child also doesn't know which counters are actually in use, and thus needs to read and save the results for all possible counters in case they are being used.   In return, the parent only learns what 'regions' the child measured when it reads the temp file.  They are separate processes with no access to information the other knows. 

The alternative "library" approach would be to allow the instrumented program to configure the counters itself via libperfctr.  This approach would be considerably simpler, since the child would have access to all the information it needs to interpret the counters as well as read them.   It would reduce some duplicated counter reading code, since the child would then know which counters it needs to read.  The clumsy temp file could go away, and the child could simply be executed standalone from the command line.   

The downsides?  Mostly that it requires changing quite a bit of code.  It also might complicate the issue of having multiple programs trying to have sole control of the same system wide counters, but I think that small changes to the current locking system should be able to handle this.  You'd also want to smoothly handle the case where you still want to run an instrumented program under likwid-perfctr.  These don't seem any worse than the current situation where two competing likwid-perfctr instantiations are trying to run.   Are there other downsides? 

Overall, I think this would be a good direction to go. likwid-perfctr is already trying to do a lot of things, and would be substantially simplified if it were allowed to deal only with 'whole program' issues. It may even be possible to make likwid-perfctr just another client of libperfctr --- one that happens to measure other programs it spawns --- and have it use the same interface. Tasty dogfood! The marker code would be more straightforward too, since everything would be happening in a single address space.

--nate
  



moebiusband

Nov 4, 2013, 7:11:21 AM
to likwid-d...@googlegroups.com
Hi Nate,


On Saturday, November 2, 2013 7:54:07 AM UTC+1, na...@verse.com wrote:
Hi Jan --

I spent a few hours trying to understand how to make Likwid do what I want (mostly outputting thousands of 'regions' to a CSV file) and thought I'd write down my current understanding. Perhaps it will be useful to you or others. If I have things wrong, I'd be glad to be corrected.

The current 'markers' approach requires the coordination of likwid-perfctr and libperfctr. likwid-perfctr reads the command line and configures the counters as appropriate for the current processor. It then executes the program to be measured. If that program has been instrumented with the macros in libperfctr, it will read from those counters. Every 'region' is used as the key of a hash, and an array of (final - start) counter values makes up the bulk of the value. If a region is used more than once, the counters are accumulated. The called program writes the final contents of this hash to a temp file whose name was passed in an environment variable. likwid-perfctr then reads this file and uses it to generate the formatted output.

The trickiness (at least I find it tricky) is that the measured child has no knowledge of what each counter is measuring.  It simply assumes that the counters have been initialized by the parent.  The child also doesn't know which counters are actually in use, and thus needs to read and save the results for all possible counters in case they are being used.   In return, the parent only learns what 'regions' the child measured when it reads the temp file.  They are separate processes with no access to information the other knows. 

I could not describe it better. That is exactly how it works currently. 

The next release will bring the following changes to the marker API, most of it is already implemented:

* likwid-perfctr will pass a bitmap to the instrumented application indicating which counters are in use. This makes it possible to read out only the counters required by the current group.

* There are now counter maps for each architecture, which simplify the marker library significantly.

* Parts of the library were rewritten for lower overhead

* I switched to glib for standard library stuff. In the next release I will use the hash implementation of glib.

* Fences have been added which ensure that the marker API functions just return if the application is run without the likwid-perfctr wrapper. This allows the instrumented binary to be run with little added overhead, even on machines without a likwid setup.
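A minimal sketch of how an instrumented application might use such a bitmap, assuming it arrives as a hexadecimal string (for example from an environment variable). The function names and the fake counter array are hypothetical; the real hardware read is replaced by a stand-in so the example is self-contained. A missing string yields an empty mask, which makes the read a no-op in the spirit of the fence described in the last bullet.

```c
#include <stdint.h>
#include <stdlib.h>

#define NUM_COUNTERS 8  /* illustrative, not tied to any real architecture */

/* Stand-in for the hardware counters; a real library would read registers. */
static uint64_t fake_hw[NUM_COUNTERS] = {11, 22, 33, 44, 55, 66, 77, 88};

static uint64_t read_counter(int index)
{
    return fake_hw[index];
}

/* Parse the bitmap handed over by the wrapper. A NULL string means the
 * program was started without the wrapper, so no counter is marked active. */
uint32_t mask_from_string(const char *s)
{
    return s ? (uint32_t)strtoul(s, NULL, 16) : 0;
}

/* Read only the counters whose bit is set in mask.
 * Entries of out[] for inactive counters are left untouched.
 * Returns the number of counters actually read. */
int read_active_counters(uint32_t mask, uint64_t out[NUM_COUNTERS])
{
    int n = 0;
    for (int i = 0; i < NUM_COUNTERS; i++) {
        if (mask & (UINT32_C(1) << i)) {
            out[i] = read_counter(i);
            n++;
        }
    }
    return n;
}
```

With a mask of 0x5, only counters 0 and 2 are read; with a mask of 0 the loop does no reads at all.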
 

The alternative "library" approach would be to allow the instrumented program to configure the counters itself via libperfctr.  This approach would be considerably simpler, since the child would have access to all the information it needs to interpret the counters as well as read them.   It would reduce some duplicated counter reading code, since the child would then know which counters it needs to read.  The clumsy temp file could go away, and the child could simply be executed standalone from the command line.   

The downsides?  Mostly that it requires changing quite a bit of code.  It also might complicate the issue of having multiple programs trying to have sole control of the same system wide counters, but I think that small changes to the current locking system should be able to handle this.  You'd also want to smoothly handle the case where you still want to run an instrumented program under likwid-perfctr.  These don't seem any worse than the current situation where two competing likwid-perfctr instantiations are trying to run.   Are there other downsides? 

The issue of locking is not really addressed at the moment. There will be a simple file locking approach, with locking at core granularity, which can be integrated into any other (external) locking scheme (such as a cluster batch system).

For the library approach it would be possible to specify the group you want to measure. You can pass this as an argument or environment variable at runtime. I have to think about a library API and will propose a suggestion here. Still this will not happen for the next release.

 

Overall, I think this would be a good direction to go. likwid-perfctr is already trying to do a lot of things, and would be substantially simplified if it were allowed to deal only with 'whole program' issues. It may even be possible to make likwid-perfctr just another client of libperfctr --- one that happens to measure other programs it spawns --- and have it use the same interface. Tasty dogfood! The marker code would be more straightforward too, since everything would be happening in a single address space.



Are you thinking of a central instance of libperfctr (as a daemon running all the time)? That is of course another level :-). Well, I have to think about it.

Thanks a lot for sharing your thoughts and suggestions.

Jan

 

na...@verse.com

Nov 8, 2013, 8:22:11 AM
to likwid-d...@googlegroups.com
I've continued hacking things up to meet my particular needs, and have thought a little about how to generalize this. I've also explored the Likwid alternatives a bit more to understand how others have approached things. I realize you've been thinking about these issues a lot longer than I have, but an outside perspective might be interesting.

My thought is that Likwid's primary strength is that it provides an easy and consistent means of setting and accessing performance counters across processor families regardless of protocol (MSR vs PCI). I haven't found anything else that does this well. The ability to translate from a group name to a configurable set of counters is great. I also really like that you are striving to support the full capabilities of each processor, rather than reducing to a least common denominator supported by all of them.

I think it would be useful to try to compartmentalize (and simplify) these features by separating them more from the presentation, analysis, and benchmarking portions of Likwid. I think the best way of doing this would be to make likwid-perfctr (and the others) clients of liblikwid of the same standing as other users of the library. Obviously the library would have to be customized for them to continue working as they are, but if it can handle all of those elegantly it should be a decent public API.

This would also mean that both likwid-perfctr and an instrumented application would be able to use the same data structures and configuration info.  So instead of passing the application a bit map of the counters in use, the original eventString could be set as an environment variable, and the application could 'decode' it in the same manner that likwid-perfctr did:  looking up the CPU information and mapping the counter names to the hardware indexes.   

I think the weakest part of Likwid right now (from a design standpoint) is the intermediate file used by the marker API and the analysis and output code that works with it.  On the bright side, this seems easy to fix:  instead of having the application use the library to collect information and pass it back to likwid-perfctr, have the library expose functions to help the application to access the saved information, and let the application decide whether and how it wants to print or analyze it.  

In the same vein, the way you process custom formulas in the group files is heroic! At first I couldn't figure out how it was working, and then when I looked I was amazed. On the other hand, when moving to a library it would be nice to be able to modify the group files without recompiling the library. The easy path might be for liblikwid's responsibility to end at writing out the counter names and their values as CSV to a file (one row per thread), and then to use Perl/Python/Ruby in a separate application to prettify this and do the dynamic calculations.
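A minimal sketch of that CSV hand-off, assuming a header row of counter names followed by one row per thread. The function name write_counters_csv, its signature, and the leading "thread" column are hypothetical choices for illustration, not an existing liblikwid interface.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Write a header row of counter names, then one row per thread.
 * values is a row-major array holding nthreads * ncounters entries.
 * Returns 0 on success, -1 on any write error. */
int write_counters_csv(FILE *out, const char *const names[], int ncounters,
                       const uint64_t *values, int nthreads)
{
    if (fputs("thread", out) == EOF)
        return -1;
    for (int c = 0; c < ncounters; c++)
        if (fprintf(out, ",%s", names[c]) < 0)
            return -1;
    if (fputc('\n', out) == EOF)
        return -1;

    for (int t = 0; t < nthreads; t++) {
        if (fprintf(out, "%d", t) < 0)
            return -1;
        for (int c = 0; c < ncounters; c++) {
            uint64_t v = values[(size_t)t * (size_t)ncounters + (size_t)c];
            if (fprintf(out, ",%" PRIu64, v) < 0)
                return -1;
        }
        if (fputc('\n', out) == EOF)
            return -1;
    }
    return 0;
}
```

Because the header carries the counter names, a separate script can evaluate group formulas by name without knowing anything about the library's internals.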


* likwid-perfctr will pass a bitmap to the instrumented application indicating which counters are in use. This makes it possible to read out only the counters required by the current group.


As mentioned above, I'd strongly suggest giving the client access to the names of the counters as well.   

 
* There are counter maps for each architecture now which simplify the marker library significantly

Yes, this seems like a good move.

* I switched to glib for standard library stuff. In the next release I will use the hash implementation of glib.

An interesting alternative to consider might be embedding Lua to use its hash map instead of Glib : http://playcontrol.net/opensource/LuaHashMap/doxygen/index.html 
Then you could also use it for runtime parsing of group files and the like. 
 
* Fences have been added which ensure that the marker API functions just return if the application is run without the likwid-perfctr wrapper. This allows the instrumented binary to be run with little added overhead, even on machines without a likwid setup.

Perhaps this is what you've done, but I think it would be best to have this check happen in a macro expansion rather than within the call: Likwid_Start() -> "do { if (Likwid) likwid_start(); } while (0)" (or something). This would be in addition to having the macro compile to nothing unless LIKWID was defined. It's possible that function call overhead is low enough not to be an issue, but short of run-time modified code, a single correctly predicted branch is about the best you can do.
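A minimal sketch of the two-level guard suggested above: a compile-time switch that removes the marker entirely, plus a run-time flag tested inside the macro so that an inactive marker costs one branch and no function call. The names likwid_active, likwid_start, and LIKWID_START are hypothetical, and the call counter exists only to make the effect observable.

```c
#define LIKWID 1  /* normally supplied as -DLIKWID on the compiler command line */

/* Hypothetical run-time flag: nonzero only once counters are configured. */
static int likwid_active = 0;

/* Only here so the example can show whether the call happened. */
static int likwid_start_calls = 0;

static void likwid_start(void)
{
    likwid_start_calls++;
}

#ifdef LIKWID
/* The flag is checked at the call site, so the function call is skipped
 * entirely when measurement is off. The do/while wrapper makes the macro
 * behave like a single statement, including inside an unbraced if/else. */
#define LIKWID_START() do { if (likwid_active) likwid_start(); } while (0)
#else
/* Built without LIKWID: the marker compiles to an empty statement. */
#define LIKWID_START() do { } while (0)
#endif
```

Calling LIKWID_START() while likwid_active is zero leaves likwid_start_calls unchanged; after setting the flag, each use of the macro calls likwid_start() once.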
 
For the library approach it would be possible to specify the group you want to measure. You can pass this as an argument or environment variable at runtime. I have to think about a library API and will propose a suggestion here. 

I think splitting out the eventString parsing from the rest of the initialization might be a good start for this.  The client would grab it from an environment variable where it had been put either by likwid-perfctr or by the user (client doesn't know which).  The client would try to initialize the counters, leaving them set as they were if it discovers they are already active.  
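A minimal sketch of a stand-alone event string parser, assuming the string is a comma-separated list of EVENT:COUNTER pairs. The type EventSpec and the function parse_event_string are hypothetical names, and the mapping from counter names to hardware indexes is deliberately left out.

```c
#include <string.h>

#define MAX_EVENTS 16   /* illustrative limits */
#define MAX_NAME   64

typedef struct {
    char event[MAX_NAME];    /* name of the event to count */
    char counter[MAX_NAME];  /* name of the counter register it is assigned to */
} EventSpec;

/* Split "EVENT:COUNTER,EVENT:COUNTER,..." into specs[].
 * Returns the number of pairs parsed, or -1 on NULL or malformed input. */
int parse_event_string(const char *s, EventSpec specs[], int max_specs)
{
    int n = 0;
    if (s == NULL)
        return -1;
    while (*s != '\0') {
        size_t item_len = strcspn(s, ",");           /* length up to next comma */
        const char *colon = memchr(s, ':', item_len);
        if (colon == NULL || n == max_specs)
            return -1;
        size_t ev_len  = (size_t)(colon - s);
        size_t ctr_len = item_len - ev_len - 1;
        if (ev_len == 0 || ctr_len == 0 ||
            ev_len >= MAX_NAME || ctr_len >= MAX_NAME)
            return -1;
        memcpy(specs[n].event, s, ev_len);
        specs[n].event[ev_len] = '\0';
        memcpy(specs[n].counter, colon + 1, ctr_len);
        specs[n].counter[ctr_len] = '\0';
        n++;
        s += item_len;
        if (*s == ',')
            s++;                                      /* skip the separator */
    }
    return n;
}
```

A client could pass the result of getenv() on whatever variable name is agreed upon straight into this function; an unset variable arrives as NULL and is reported as -1, which the client can treat as "not being measured".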

I think it would also be nice to get all the Likwid run-time globals under a single 'conf' variable. While the static local variables per file work just fine, it's sometimes been difficult to figure out what information is kept where, and how to make it accessible to a client program. I've been trying some stuff on paper, and have gotten far enough to think that it should be possible. But I'm also still finding that there are whole areas that I know nothing about, so perhaps I'm overly optimistic.


Are you thinking of a central instance of libperfctr (as a daemon running all the time)? That is of course another level :-). Well, I have to think about it.

Rather than a daemon, I've been thinking about a more compartmentalized approach where the counter reading is completely independent of the counter setting.   The libperfctr client wouldn't care whether it sets the counters itself, or if they were set by Likwid, PMU Tools, perf, or VTune.  It would just read counters, and write the results in some standard format.  This output could be piped directly to an analysis program, or saved to a file.   

The hard part would be coming up with an API that is both simple and flexible enough. I particularly like the goal of trying to make likwid-perfctr be just a regular client of libperfctr. It's not that this is necessary in itself, but it would mean that you have a pretty general interface that could then be wrapped, letting you write the user interface portions in a more flexible scripting language.

This seems to be the direction that Andi Kleen is taking with his PMU Tools: https://github.com/andikleen/pmu-tools

What he lacks is a good integrated approach to counter interfaces and multiple CPUs.

--nate

moebiusband

Nov 12, 2013, 7:46:59 AM
to likwid-d...@googlegroups.com
Hi Nate,

Thank you very much for your thoughts.

Just a few remarks:



On Friday, November 8, 2013 2:22:11 PM UTC+1, na...@verse.com wrote:

I think it would be useful to try to compartmentalize (and simplify) these features by separating them more from the presentation, analysis, and benchmarking portions of Likwid. I think the best way of doing this would be to make likwid-perfctr (and the others) clients of liblikwid of the same standing as other users of the library. Obviously the library would have to be customized for them to continue working as they are, but if it can handle all of those elegantly it should be a decent public API.

Making LIKWID a library is on the road map. I like your ideas and will think about how this can be realized.
 

This would also mean that both likwid-perfctr and an instrumented application would be able to use the same data structures and configuration info.  So instead of passing the application a bit map of the counters in use, the original eventString could be set as an environment variable, and the application could 'decode' it in the same manner that likwid-perfctr did:  looking up the CPU information and mapping the counter names to the hardware indexes.   

I think the weakest part of Likwid right now (from a design standpoint) is the intermediate file used by the marker API and the analysis and output code that works with it.  On the bright side, this seems easy to fix:  instead of having the application use the library to collect information and pass it back to likwid-perfctr, have the library expose functions to help the application to access the saved information, and let the application decide whether and how it wants to print or analyze it.  

I agree with you that the current solution is not very good. At the beginning I wanted a solution that was as lightweight (and dumb) as possible. But as new architectures were added and more functionality made things more complicated, the current solution is no longer good enough. I wanted it as simple as possible, but as you said, it would be easier to give the marker API the same view of things as the wrapper application. I have no good argument against it. At some point you have to realize that a solution has reached its limits.

 

In the same vein, the way you process custom formulas in the group files is heroic! At first I couldn't figure out how it was working, and then when I looked I was amazed. On the other hand, when moving to a library it would be nice to be able to modify the group files without recompiling the library. The easy path might be for liblikwid's responsibility to end at writing out the counter names and their values as CSV to a file (one row per thread), and then to use Perl/Python/Ruby in a separate application to prettify this and do the dynamic calculations.

That is a very good suggestion. I fully agree that all the string handling and I/O is much better done in a scripting language. You may have seen that I am a great admirer of Perl.
 


* I switched to glib for standard library stuff. In the next release I will use the hash implementation of glib.

An interesting alternative to consider might be embedding Lua to use its hash map instead of Glib : http://playcontrol.net/opensource/LuaHashMap/doxygen/index.html 
Then you could also use it for runtime parsing of group files and the like. 

I have heard of Lua but do not know much about it. I will have a look. Deciding on glib really was a fight with myself. But nothing is lost, since so far I use glib in only a very few places. I would certainly prefer not to have external dependencies. Of course the solution must perform reasonably well so that it does not add overhead.
A benefit of using glib is that it is a stable, full-featured standard library that also provides other things apart from hash data structures. That is hard to find in a pure C library.
 
 
* Fences have been added which ensure that the marker API functions just return if the application is run without the likwid-perfctr wrapper. This allows the instrumented binary to be run with little added overhead, even on machines without a likwid setup.

Perhaps this is what you've done, but I think it would be best to have this check happen in a macro expansion rather than within the call: Likwid_Start() -> "do { if (Likwid) likwid_start(); } while (0)" (or something). This would be in addition to having the macro compile to nothing unless LIKWID was defined. It's possible that function call overhead is low enough not to be an issue, but short of run-time modified code, a single correctly predicted branch is about the best you can do.
I will consider that.
 

The hard part would be coming up with an API that is both simple and flexible enough. I particularly like the goal of trying to make likwid-perfctr be just a regular client of libperfctr. It's not that this is necessary in itself, but it would mean that you have a pretty general interface that could then be wrapped, letting you write the user interface portions in a more flexible scripting language.

I also like this idea. You have seen that I already tried to lay out functionality for the output filters. Still, likwid's goals of being lightweight, low-overhead, and simple to use should not be compromised by that.
 

By the way, I will be at the Supercomputing Conference in Denver next week. Maybe you happen to be there too and we could meet? Just send me a note at my gmail address.

I am working hard to get the current release out. It adds support for IvyBridge-EP (including Uncore). After that I will start to think about your suggestions. They gave me new ideas I probably would not have found myself. I will try to define a very lean core library and separate everything else from it. So thank you again!

Best Regards,

Jan