CPU usage tracking


anodos

Feb 4, 2014, 2:37:01 PM
to stream...@googlegroups.com
We are using Streamline quite extensively in our software, on both the client and server (nodejs) sides. Our server involves a cluster of worker processes, with each worker capable of handling multiple requests at once. We would like to track server-side CPU usage per client request. Almost every client request ends up calling into a streamlined function on the nodejs side. With many asynchronous client requests coming in, it is quite normal for several to be processed simultaneously in one "worker" process, with "task switching" happening during periods of I/O and Streamline trampolining.
We are using an older version of Streamline (it works fabulously, and upgrading to the newest would require work and testing), so this functionality may already be implemented and we just aren't aware of it. But it seems Streamline could provide us with a convenient way to track CPU time.

Streamline is aware of each "thread" of execution through the way it manages callbacks and trampolining, and even of the "fall through" that happens during I/O (the whole stack unwinds back to the event loop). This is what makes things like the Context API (TLS) that I just saw announced possible. If TLS is possible, then it also seems possible that Streamline could (behind an option) use the high resolution timers to track the time spent actually executing code on a per-request (aka per-"context" or per-"thread") basis.

I'm thinking in terms of our version of Streamline here, which is pre-fibers: Streamline could start a timer for a particular "thread" whenever it is about to call into a function or callback for that "thread", and stop it whenever the function/callback returns or falls through. The total time would accumulate for each request until it is completely processed, giving us a rough estimate of the CPU time used for that request (we don't need it to be exact). We would use this to monitor all sorts of things: performance hot spots, system usage per client, etc.
I'm mentioning it here mostly to make sure this hasn't already been done, and also to throw it out for a sanity check.

Bruno Jouhier

Feb 4, 2014, 6:27:11 PM
to stream...@googlegroups.com
Hello Anodos,

We are doing something similar in our server. We are monitoring two things:

* elapsed time in various code sections: at the top level of our HTTP dispatcher but also in some lower level functions that we want to monitor.
* statistics on CPU slices. The idea here is to be able to spot functions which perform CPU intensive operations and may block the event loop.

For the elapsed time, we have a very small library which uses the high res timers. We use it as follows:
function foo(_) { // the function we want to monitor
  var timing = perfmon && perfmon.start("FOO"); // creates a small object with a FOO tag and the current hr time
  // real code here (with _ calls)
  timing && timing.end(); // increments the hit count for FOO and records the time
}

If perfmon is null, nothing is recorded and we have very little overhead.
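
For reference, a stripped-down perfmon could look something like this (a sketch, not our actual code; the counter layout and the dump() helper are just examples):

var counters = {}; // tag -> { hits, totalMs }

var perfmon = {
  start: function(tag) {
    var t0 = process.hrtime(); // high res timer
    return {
      end: function() {
        var d = process.hrtime(t0); // [seconds, nanoseconds] since t0
        var c = counters[tag] || (counters[tag] = { hits: 0, totalMs: 0 });
        c.hits++;
        c.totalMs += d[0] * 1e3 + d[1] / 1e6;
      }
    };
  },
  dump: function() { return counters; } // assumed reporting helper
};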

For the CPU slices we use an undocumented API of the streamline library: if you allocate an EventEmitter and assign it to require('streamline/lib/globals').emitter, you will get a "resume" event through this emitter at the beginning of every CPU slice and a "yield" event at the end of every slice. Thanks to this we can build a histogram of the time taken by our CPU slices. Note: there is very little overhead if you do not set the emitter, just a null check on it.
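
Hooking it up looks roughly like this (sketch; the power-of-2 bucketing is just one way to build the histogram):

var EventEmitter = require('events').EventEmitter;
var globals = require('streamline/lib/globals');

var sliceStart = null;
var histogram = {}; // bucket (ms, rounded up to a power of 2) -> count

var emitter = new EventEmitter();
emitter.on('resume', function() {
  sliceStart = process.hrtime(); // a CPU slice begins
});
emitter.on('yield', function() {
  if (!sliceStart) return;
  var d = process.hrtime(sliceStart); // slice duration
  var ms = d[0] * 1e3 + d[1] / 1e6;
  var bucket = Math.pow(2, Math.max(0, Math.ceil(Math.log(ms) / Math.LN2)));
  histogram[bucket] = (histogram[bucket] || 0) + 1;
  sliceStart = null;
});
globals.emitter = emitter; // monitoring starts once this is set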

And we correlate the two when we detect big CPU slices. So we can build the list of the perfmon tags that are responsible for the longest CPU slices.

I did not publish this module because it is a bit coupled to the rest of our framework but there is nothing really fancy in it.

The only difficulty with this "soft" monitoring of CPU slices is that GC passes run synchronously, so the long CPU slices that we observe are usually related to GC passes rather than to CPU intensive computations on our side (which is not a bad sign, because it means our own slices are short).

There are no events directly hooked up to context changes but this could also be done through the emitter. Something like "contextEnter" and "contextExit" events. Probably not too difficult to implement but I haven't done it. The main difficulty is that it needs to be implemented 5 times (5 runtimes, with the fast modes) and tested.

Bruno

anodos

Feb 5, 2014, 11:34:26 AM
to stream...@googlegroups.com
The resume and yield events sound almost perfect. Would I have access to the Context API from within my resume/yield listeners? I could then accumulate time per context. Taking a quick glance at the source, it appears that I do (I just looked at the "callback" implementation since I am most familiar with it).
I could create a timing API where something like your perfmon.start('FOO') would create a timing entry in the Context, and the resume/yield listener would then accumulate time for FOO in that Context/thread. That way, if Streamline is trampolining between various parallel Contexts, each will accumulate its own time.
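Something like this, maybe (sketch; I'm assuming the context is reachable via require('streamline/lib/globals').context, and that my request handler seeds a timing slot on it):

var EventEmitter = require('events').EventEmitter;
var globals = require('streamline/lib/globals');

var current = null;    // context of the currently running slice
var sliceStart = null;

var emitter = new EventEmitter();
emitter.on('resume', function() {
  current = globals.context; // assuming the TLS context lives here
  sliceStart = process.hrtime();
});
emitter.on('yield', function() {
  if (current && current.timing && sliceStart) {
    var d = process.hrtime(sliceStart);
    current.timing.cpuMs += d[0] * 1e3 + d[1] / 1e6; // accumulate per request
  }
  current = sliceStart = null;
});
globals.emitter = emitter;

My request handler would then be responsible for doing something like globals.context.timing = { cpuMs: 0 }; when the request comes in.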

Bruno Jouhier

Feb 5, 2014, 3:06:15 PM
to stream...@googlegroups.com
Yes, the context is set before emitting the resume event. Same thing in the other modes (fibers and generators). So it should work!

Bruno

Aseem Kishore

Feb 9, 2014, 4:19:58 PM
to stream...@googlegroups.com
Very cool stuff! Thanks guys.



anodos

Mar 25, 2014, 1:57:31 PM
to stream...@googlegroups.com
OK, so I've finally gotten around to working on this, and I've noticed something interesting in the callbacks runtime. If I'm reading the code correctly, the 'yield' event is fired after the trampoline is flushed, which means my times could be inflated by the trampolined items if I time based on resume and yield events.
Here is the code in question (in the '__cb' function):
} finally {
  frame.active = false;
  __g.frame = oldFrame;
  if (--__g.depth === 0 && __g.trampoline) __g.trampoline.flush();
  __g.depth === 0 && __g.emitter && __g.emitter.emit("yield");
}
Am I incorrect in thinking that the flushing of the trampoline is, in effect, yielding to other execution 'slices' (each of which will also fire their own resume and yield)? If so, it makes sense to me to fire the 'yield' before actually yielding so that I can accumulate accurate timing.
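Concretely, I'm thinking of something like this (sketch; I haven't checked whether the resume events from the flushed items would still pair up correctly):

} finally {
  frame.active = false;
  __g.frame = oldFrame;
  // proposed: emit "yield" before the flush, so my timing doesn't
  // absorb the trampolined items (depth is still 1 here, hence the test)
  __g.depth === 1 && __g.emitter && __g.emitter.emit("yield");
  if (--__g.depth === 0 && __g.trampoline) __g.trampoline.flush();
}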

anodos

Mar 25, 2014, 2:03:25 PM
to stream...@googlegroups.com
Looking at the __g.depth === 0 && __g.emitter... maybe resume and yield are meant to indicate when Streamline is resuming from and yielding to the underlying event loop? Hmmm... I was hoping for a mechanism where each trampolined piece of code could accumulate time against its particular function invocation.

anodos

Mar 25, 2014, 2:36:36 PM
to stream...@googlegroups.com
Just to make this clearer, in case I am confusing anyone... we have code that ends up "spawning" many streamlined functions in parallel (in certain cases) and uses futures (most of the time) to manage them. This can even happen recursively. These things interleave with each other, trampolining back and forth until they either finish or need to perform IO. Our trampoline can get quite full of parallel execution paths, with every trampoline flush possibly switching between paths. This isn't a bad thing - it actually works quite well for our use case. We execute them in parallel because they do occasionally (and in unpredictable ways) need to perform IO, so the parallel execution helps make the best use of the CPU during those IO opportunities.
The only problem we have with this approach is that it becomes difficult to profile functions in our complex parallel activities. Every time a function returns (or a Streamline-created sub-function returns) there is the chance that we will hit the trampoline and execute code from another execution path.
Say, for example, you have functions A and B interleaving with each other in this way. With conventional timing in the function itself, A and B end up contributing their times to each other: both A's and B's measured times come out near the total of what A and B would take if run alone (minus maybe a little overlap). So if A normally takes 1ms and B normally takes 0.5ms, then both their times can end up being something around 1.5ms, give or take.
What we would like is the ability to start and stop high resolution timers on every trampoline (and every __func) and accumulate the time against the function (or function "fragment") that just executed. I can do this using a combination of the context and __g.frame, if I can get at them before and after every __func and __cb call into its target code.
This isn't necessarily a request for such functionality in Streamline, just a description of what I'm trying to accomplish. I recognize that adding it now has the potential to introduce a tiny performance penalty (it would mean more frequent emitter checks and possibly a couple of new event types). And it may not be functionality that you would find useful in Streamline proper.
However, I may go ahead and fork streamline to play around with this idea, and if it ends up being useful I'll at least run it by you to see what you think.

anodos

Mar 28, 2014, 11:53:40 AM
to stream...@googlegroups.com
OK, guys, here is an update. I have this working great on our system. I have a couple of benchmarks in our system, and the changes I've made to Streamline introduce no performance penalty - well, as long as you aren't actually profiling. Keep in mind that these are not micro-benchmarks: I benchmark entire functional operations of our application, so they probably won't detect performance differences in Streamline unless those differences are a little more significant.
I haven't committed anything yet as I'm still tweaking it a bit. If you are interested I can push it out when finished and do a pull request so you can check it out.

Just FYI,
I wrote a quick and dirty profiler (somewhat specific to our application) that can give me back the actual CPU time spent in each particular Streamline function executed, including "own" time and "all" time (which recursively includes time spent in any functions called), # of invocations, etc. These times do not include time spent trampolining between functions, waiting for IO, etc. In other words, they are focused on returning only the CPU time in each Streamline function as if it were the only thing running with no IO wait... so they are useful for seeing which Streamline functions need to be optimized at a pure code level.
The profiler itself does introduce overhead once enabled. My benchmarks saw throughput drop by half, and it seems to be my call to hrtime() that is the culprit (not that it is a slow function, but that it gets called so much). This isn't a problem for me, since I only profile when trying to track down a hotspot, but it is an interesting point.

Bruno Jouhier

Mar 29, 2014, 11:26:13 AM
to stream...@googlegroups.com
Hi anodos,

This makes complete sense.

I think that we need to distinguish two monitoring use cases:
  • Monitoring functions: we need to collect several timings: the function's own code; own code + called functions' code; and own code + called functions' code + I/O.
  • Monitoring CPU slices: what's important is to get the distribution of slice times and to know which functions are responsible for long slices.

The yield/resume events were designed to monitor CPU slices. The yield event is sent after processing the trampoline because control only returns to node's event loop after the trampoline has been flushed. So I would not change this, but I would introduce a separate set of events (like enter/exit) to handle the timing of functions.
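
With enter/exit events, function timing could be done with a simple shadow stack, roughly like this (sketch; the 'name' payload on the event is hypothetical):

var EventEmitter = require('events').EventEmitter;
var globals = require('streamline/lib/globals');

var emitter = globals.emitter = new EventEmitter();

var stack = [];  // shadow stack: one entry per active streamline function
var ownMs = {};  // function name -> accumulated "own" CPU time

function nowMs() {
  var t = process.hrtime();
  return t[0] * 1e3 + t[1] / 1e6;
}

emitter.on('enter', function(name) { // 'name' payload is hypothetical
  var t = nowMs();
  var top = stack[stack.length - 1];
  if (top) top.own += t - top.since; // pause the caller's own-time clock
  stack.push({ name: name, own: 0, since: t });
});

emitter.on('exit', function() {
  var t = nowMs();
  var f = stack.pop();
  f.own += t - f.since;
  ownMs[f.name] = (ownMs[f.name] || 0) + f.own;
  var top = stack[stack.length - 1];
  if (top) top.since = t; // resume the caller's clock
});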

I'm very interested in your profiling extensions (and in your profiling tool if you can share it). So please send a PR and I'll review it.

Q: did you try the other modes (fibers, generators)? If you didn't, I can help hook the new events into the other runtimes.

I'm also planning to experiment with http://concurix.com. If people have experience with concurix+streamline we should share here (start another thread).

Bruno

anodos

Mar 31, 2014, 12:21:03 PM
to stream...@googlegroups.com
The code is very simple and I was tempted to just add it to fibers and generators... but I was afraid to commit code that I would have a hard time testing (since our project doesn't use those modes).
For now I've only changed the callbacks code.
What would be the best way to get the code to you? I can do a pull request on my fork for the 'enter' and 'exit' events. I'm also willing to share my profiler code. Just let me know how you'd like me to get it to you.

I did make a couple of other changes that I'd like to get your take on... My profiler adds "all" CPU time up the stack, so that if A calls B then A will include B's time in its "all" time... but I did not want truly asynchronous call "branches" to contribute CPU time to the function hierarchy beyond the asynchronous point. For example:
function A(_) {
  // ... do stuff here, call other streamlined functions, etc. ...
  var future = B(!_); // start B without waiting for its result
  // ... do more stuff here, call some other streamlined functions ...
  var result = future(_); // handle the future somewhere in here
  return result;
}

I didn't want B's CPU time increasing A's "all" time. Also, if B calls C (not using a future), then I wanted C's time to contribute to B's "all" time but not to A's... so I needed a good way to detect asynchronous branch points.
I did that by modifying __cb to set a __streamlined flag on callbacks it creates. I also modified the future code to set a __futurecb flag on 'future' callbacks.
There are two places 'enter' is fired: during trampoline flushing, and right before body() is called in __func. In the __func case I pass along the callback that was given to the Streamline function. The event handler uses this callback to detect when a new function is being invoked, and checks the __streamlined and __futurecb flags to detect an asynchronous branch point (which can be created both by futures and by passing in a custom non-streamlined callback). If my description is confusing, it might make more sense once you see the code.
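Roughly, the branch point test in my 'enter' handler looks like this (simplified sketch):

// rough shape of my 'enter' handler (__func case); 'callback' is the
// callback that was passed to the streamlined function
function onEnter(callback) {
  var asyncBranch = callback &&
    (!callback.__streamlined || callback.__futurecb);
  if (asyncBranch) {
    // asynchronous branch point: start a fresh "all"-time root, so this
    // invocation's CPU time does not roll up into the spawning function
  } else {
    // normal call: time aggregates into the caller's "all" time as usual
  }
}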
Thanks for everything!

Bruno Jouhier

Apr 4, 2014, 6:39:14 PM
to stream...@googlegroups.com
Hi Anodos,

The best way to integrate your changes is to send me a PR with the "enter/exit" events. I'll review it and adapt the fibers/generators modes.

For the profiler, I would prefer that you create a separate project/repo. This is more decoupled and it keeps the issues/discussions separate. This is one of the reasons I separated the streams module.

I agree with the way you handle the time aggregates with futures (your A/B/C example). I get the basic gist of the two flags you introduced, but it will be easier with the code. I don't see any danger, as you are only adding flags to the callbacks that streamline creates.

I may not process the PR immediately because I'm really overbooked, but that won't mean I've forgotten it.

Bruno

anodos

Apr 10, 2014, 5:10:33 PM
to stream...@googlegroups.com
I created a pull request but haven't had time to put the profiler into its own project. I'll try to make some time for that in the near future.

Bruno Jouhier

Apr 12, 2014, 10:19:19 AM
to stream...@googlegroups.com
Hi Anodos,

Thanks for the PR. I just integrated it and I published 0.10.9.

Bruno