perf claim of ~2 cycles of overhead in disabled state

14 views
Skip to first unread message

Vlad

unread,
Nov 20, 2009, 11:45:11 AM11/20/09
to rlog-users
Hi,

I am curious about how this claim was tested. By examining rlog macros
I see that when logging is disabled the overhead reduces to checking a
memory variable and possibly skipping a chunk of code.

I would imagine that such overhead would have at least these main cost
components:

1. loading the is-enabled variable (possible cache miss)
2. cost of possible branch misprediction

In my informal testing I found it easy to reproduce the 2 cycle number
if I measure a branch that is not taken in a loop. However, this
scenario is not representative of real life usage for a couple of
reasons:

- both the is-enabled variable and the code would stay in the L1i and
L1d caches while in real life scenario the rest of the app code is
likely to cause cache interference
- for a simple loop the branch prediction for a single branch location
will work almost perfectly. But in a realistic code base such if-
checks will be scattered around the code in thousands of places and
likely will not all fit in the branch prediction cache all the time.

I am not trying to be critical, I am trying to understand. What
aspects of rlog design make it more suitable for tight loops/low
overhead than other logging libs? I would imagine nearly every such C+
+ lib will have an is-enabled check (possibly aided by branch
prediction hints), so what's rlog's extra edge?...

Regards,
Vlad

Valient Gough

unread,
Nov 23, 2009, 1:30:03 AM11/23/09
to rlog-...@googlegroups.com

Hi,

The are a few tricks used to improve the performance of disabled logging statements:

- don't evaluate logging statements if not used
- provide hint to branch predictor
- cluster log data

Not evaluating logging statements is the big one - most libraries will first evaluate the logging statement before deciding if it should be thrown away. The idiom that you run into is to manually insert a check for logging state around your log statements. I'd rather have the library do that.

The branch predictor hint does two things: causes GCC to hoist the unlikely branch code out so to minimize impact on L1i cache, and cause the processor's branch predictor to guess that the branch won't be taken. That way the processor can continue to do real work while waiting for results of the test. Compare assembly output from an rlog statement vs a manual is_enabled check to see the difference.

The log data is clustered by putting data into a separate data section. This seems to be a fair win in test code, but I don't know if there would really be much impact in practice. The idea is that if you're going to load one log variable, you may as well cache the nearby variables since they are likely to be encountered next.

That said, the first time I heard about a company that had chosen to use rlog, they asked how to switch it so that the *enabled* state was the fastest, since they intended to leave logging on all the time.. So, certainly no one-size-fits-all.

regards,
Valient

Vlad Roubtsov

unread,
Nov 23, 2009, 9:55:09 PM11/23/09
to rlog-...@googlegroups.com

On Nov 23, 2009, at 12:30 AM, Valient Gough wrote:

> The are a few tricks used to improve the performance of disabled logging statements:
>
> - don't evaluate logging statements if not used
> - provide hint to branch predictor
> - cluster log data
>
> Not evaluating logging statements is the big one - most libraries will first evaluate the logging statement before deciding if it should be thrown away. The idiom that you run into is to manually insert a check for logging state around your log statements. I'd rather have the library do that.


Thanks for the details, Valient. I am familiar with the first two tricks (e.g. log4cxx uses both). So, aside from clustering, the scheme is something like

if (UNLIKELY(is_enabled))
{
... do something, mostly likely build an arg list and call some lib function ...
}

with a compiler-specific UNLIKELY() hint. I think that most logging libs that expose themselves via macros (to hide the outside check) will choose to follow this scheme...

Regards,
Vlad

Valient Gough

unread,
Nov 23, 2009, 10:37:32 PM11/23/09
to rlog-...@googlegroups.com

On Nov 23, 2009, at 6:55 PM, Vlad Roubtsov wrote:


Thanks for the details, Valient. I am familiar with the first two tricks (e.g. log4cxx uses both). So, aside from clustering, the scheme is something like

if (UNLIKELY(is_enabled))
{
... do something, mostly likely build an arg list and call some lib function ...
}

with a compiler-specific UNLIKELY() hint. I think that most logging libs that expose themselves via macros (to hide the outside check) will choose to follow this scheme...


The technology here is a means, not an ends.  The goal wasn't to make a faster logging library, the goal was to make a dynamic logging library.  But to make a convincing argument that logging can be left in production code, it is easier if the unused logging is as cheap as possible.

Have you used any Java loggers?  A nice features is that most will allow you to set logging level based on class hierarchy.  For example, I might turn on debug level for my own hierarchy - net.arg0.*, but only warning level for org.apache.*, etc.  That's done at runtime and lets you fine-tune what you want to log.

That's what I wanted in a C++ logger.  Not static compile-time selection, but the ability to change logging level if a user passed a "--debug" option on the command line, or potentially if they connected a remote debugger.   I want to be able to dynamically enable just logging statements from one file, without enabling all debug.  One place that was using rlog had a remote logger for their server.  They could connect to it remotely, change logging (on a file by file basis - say turn debug on just for file Foo.cpp), without having to restart the service.

That sort of flexibility changes how you use logging, and the optimizations are just how you get people to try it.  That said, most work I do these days is in Java or C#, so I haven't followed what's happened with C++ loggers over the last few years..

regards,
Valient

Vlad Roubtsov

unread,
Nov 23, 2009, 11:27:59 PM11/23/09
to rlog-...@googlegroups.com

On Nov 23, 2009, at 9:37 PM, Valient Gough wrote:

> The technology here is a means, not an ends. The goal wasn't to make a faster logging library, the goal was to make a dynamic logging library. But to make a convincing argument that logging can be left in production code, it is easier if the unused logging is as cheap as possible.
>
> Have you used any Java loggers? A nice features is that most will allow you to set logging level based on class hierarchy. For example, I might turn on debug level for my own hierarchy - net.arg0.*, but only warning level for org.apache.*, etc. That's done at runtime and lets you fine-tune what you want to log.
>
> That's what I wanted in a C++ logger. Not static compile-time selection, but the ability to change logging level if a user passed a "--debug" option on the command line, or potentially if they connected a remote debugger. I want to be able to dynamically enable just logging statements from one file, without enabling all debug. One place that was using rlog had a remote logger for their server. They could connect to it remotely, change logging (on a file by file basis - say turn debug on just for file Foo.cpp), without having to restart the service.
>
> That sort of flexibility changes how you use logging, and the optimizations are just how you get people to try it. That said, most work I do these days is in Java or C#, so I haven't followed what's happened with C++ loggers over the last few years..

Understood. I have been a C++ programmer, then a Java programmer for years, and am now back in the C++ world. I am well aware of the log4j paradigms -- the hierarchical organization of loggers (that maps to the class hierarchy in common practice) is very convenient. I had used Apache's log4cxx port of log4j for a couple of years precisely because they'd ported that aspect of log4j well.

In my spare time I am tinkering with ideas for a logger that would also guarantee *latency* bounds even in the *enabled* state. More like an instrumentation tracer than a human-readable output logger. This explains my perf bias in the very first question.

Things of this kind are commonly attempted by offloading the actual message persistence to a separate thread. However, it has also been observed that such designs still result in high-ish overheads, due to extra thread context switching. But why does a "thread context switch" has a cost, really? Besides the traditional overheads of saving/restoring registers and scheduling maintenance, it is quite possible that these days the *larger* contribution comes from cache interference: thread A is doing something in one memory area, then gets preempted by thread B that will access *different* memory addresses and consequently will pollute the data cache with its own stuff. Then a switch back to thread A which might end up re-loading some of the cache lines it would have kept if not for B. Check out [1].

Now comes the interesting realization: if cache interference can dominate (100s of cycles) then it does not really matter whether there is an actual thread switch. What matters is the switch in memory access pattern. So, in rlog's case, for example, we can imagine this: the CPU is jamming along, doing what it's meant to be doing, then boom: it is asked to load one of the is-enabled memory vars for a conditional branch. If the var is far enough away from the working data set (like in your special cluster segment) this is likely to cause a cache miss. Then the CPU goes back to the "normal" app logic. You see what I mean? The real cost of those branches may be in those sporadic cache interferences. The steady stream of rlog is-enabled checks will maintain a those vars in the cache and will make the effective size of the cache available to the "normal" logic somewhat smaller.

This was on my mind when I posted the first time. The true cost of logging a la rlog/log4cxx style may only manifest itself when embedded in a sea of real code. And it's kind of spread around so it's hard to measure. It kind of seemed to me that the real test would be to take a realistic app, add rlog log macros, and then measure the total cycle counts for a piece of app work with those macros as they are and with them defined to noops. I wouldn't be surprised if the average of the extra cycles per log statement were a bit more than 2...

Cheers,
Vlad

[1] http://www.cs.rochester.edu/u/cli/research/switch.pdf
Reply all
Reply to author
Forward
0 new messages