Worthwhile getting DCHECK instrumentation from the field?


Randy Smith

Sep 13, 2011, 11:09:50 AM
to Chromium-dev, Andy Hendrickson
Andy Hendrickson came up with an idea that seemed plausible and
tempting to me, so I wanted to air it more widely. Would it be
worthwhile to have DCHECKs compiled to some form of UMA statistics in
release builds? On the plus side we'd get notification of violation
of a lot more things we thought were invariants (which might help
indirectly with stability). On the negative side, we'd need some
different type of data gathering--what you want is a
(filename,lineno)->count hash that gets uploaded on each data send,
and that doesn't fit well into the UMA stats gathering I'm currently
familiar with. And we might be drowned in data :-} :-{. But it's
hard for me to think that more data would be a bad thing. Opinions?

-- Randy
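[Editor's note: a minimal sketch of the kind of (filename, lineno) -> count recording Randy describes. All names here (UMA_DCHECK, RecordDcheckFailure, FailureCount) are hypothetical, not real Chromium APIs; the real UMA upload path is not modeled.]

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical (filename, lineno) -> count table that a release-mode
// DCHECK variant could bump on failure and upload with the next UMA ping.
using DcheckKey = std::pair<std::string, int>;
static std::map<DcheckKey, uint64_t> g_dcheck_failures;

inline void RecordDcheckFailure(const char* file, int line) {
  ++g_dcheck_failures[{file, line}];
}

// In release builds the macro records the violation instead of crashing.
#define UMA_DCHECK(cond)                                      \
  do {                                                        \
    if (!(cond)) RecordDcheckFailure(__FILE__, __LINE__);     \
  } while (0)

inline uint64_t FailureCount(const char* file, int line) {
  auto it = g_dcheck_failures.find({file, line});
  return it == g_dcheck_failures.end() ? 0 : it->second;
}
```

Note that even this sketch evaluates the condition in release builds, which is exactly the cost concern raised downthread.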

Thomas Van Lenten

Sep 13, 2011, 11:13:08 AM
to rds...@google.com, Chromium-dev, Andy Hendrickson
I think you might have to audit all the DCHECK() calls; I've seen some that call internal validation routines, and we don't want those slowing down the app for end users.  The rule has always been that the code compiles away in builds we ship to users, so changing something that fundamental probably means revisiting all the uses.

TVL
 


--
Chromium Developers mailing list: chromi...@chromium.org
View archives, change email options, or unsubscribe:
   http://groups.google.com/a/chromium.org/group/chromium-dev

Dominic Mazzoni

Sep 13, 2011, 12:04:31 PM
to thom...@chromium.org, rds...@google.com, Chromium-dev, Andy Hendrickson
On Tue, Sep 13, 2011 at 8:13 AM, Thomas Van Lenten <thom...@chromium.org> wrote:
I think you might have to audit all the DCHECK() calls; I've seen some that call internal validation routines, and we don't want those slowing down the app for end users.  The rule has always been that the code compiles away in builds we ship to users, so changing something that fundamental probably means revisiting all the uses.

Perhaps NOTREACHED() would be a good candidate to try - it doesn't normally have any side effects. I'm sure it wouldn't catch as much, but if we learned anything useful at all it might suggest creating a new variant of DCHECK that logs errors in release builds.

- Dominic

James Hawkins

Sep 13, 2011, 2:02:29 PM
to rds...@google.com, Chromium-dev, Andy Hendrickson
Bucketing on (filename, lineno) might be useful as a signal for how trustworthy certain assertions in the code are; however, without the stack trace to see how execution arrived at the DCHECK, the results might not be very actionable.

James

Randy Smith

Sep 13, 2011, 2:06:31 PM
to James Hawkins, Chromium-dev, Andy Hendrickson

My take on that is that it would be actionable in some cases and not
in others, and if you wanted more data you could change the
DCHECK->CHECK in a canary. Conceivably, we could also send back the
stack, but that starts feeling like a lot more information than we
want to cram into UMA. (I'm also not sure when privacy concerns start
to creep in here.)

-- Randy


Mark Mentovai

Sep 13, 2011, 2:12:07 PM
to thom...@chromium.org, rds...@google.com, Chromium-dev, Andy Hendrickson
Thomas Van Lenten wrote:
> I think you might have to audit all the DCHECK() calls; I've seen some
> that call internal validation routines, and we don't want those slowing
> down the app for end users.  The rule has always been that the code
> compiles away in builds we ship to users, so changing something that
> fundamental probably means revisiting all the uses.

Worse, we actually have some uses of DCHECK that aren’t really
checking anything, but are just used as a convenient way to enable
other forms of checking that shouldn’t be happening at all in official
released builds.

One recent example is the Mac Objective-C zombie catcher. Its exact
details are unimportant, but it’s new code that we don’t necessarily
trust yet in all cases, so we turn it on like this:

bool ZombieEnable() {
  // Set up a bunch of infrastructure that will LOG(FATAL) or cause
  // CHECK failures in the presence of the undead.

  return true;  // Unconditionally.
}

int main(int argc, char* argv[]) {
  DCHECK(ZombieEnable());
}

This relies on the DCHECK’s condition not executing in builds where
it’s not appropriate to test DCHECKs.

The above example may not be the best in terms of clear coding style,
and the DCHECK above actually does come with a comment in the codebase
that explains what it’s doing, but the fact remains that we do have
uses of DCHECK expressions that mask expensive logic or checks that
are otherwise unwanted in official release builds.

Let’s not forget the added bloat impact.

If we have something that we want to CHECK in release builds, it
should be a CHECK, not a DCHECK.
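[Editor's note: a simplified sketch of the property Mark's example relies on. DCHECK_DEBUG/DCHECK_RELEASE are hypothetical stand-ins for the debug and release expansions of Chromium's real DCHECK macro, shown side by side so the side-effect behavior is visible.]

```cpp
#include <cassert>

static bool g_zombie_enabled = false;

// A condition with a side effect hiding inside it, like ZombieEnable above.
bool ZombieEnable() {
  g_zombie_enabled = true;
  return true;
}

// Debug expansion: the condition runs and is asserted.
#define DCHECK_DEBUG(cond) assert(cond)

// Release expansion: the condition is type-checked but never evaluated,
// so any side effects in it are dead code.
#define DCHECK_RELEASE(cond)  \
  do {                        \
    if (false) (void)(cond);  \
  } while (0)
```

This is why turning DCHECKs into release-mode instrumentation would start evaluating conditions that today never run at all.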

Scott Hess

Sep 13, 2011, 2:27:59 PM
to rds...@google.com, Chromium-dev, Andy Hendrickson
I've thought on this before, and I think there's a real problem in
generating useful data. Given how the stats come up, you'd have to
have some sort of mapping to bins, and I think you'd probably end up
with something like a Bloom filter, where information is all smeared
together and you end up with little to no real results. People use
DCHECK() a LOT!

I think a more productive option might be to build a new something,
call it an UMA_CHECK(), which would histogram on failure. You could
probably narrow down to specific items by having multiple histograms,
perhaps one on hash(#condition), one on hash(__FUNCTION__), and one on
hash(__FILE__). [I'm assuming __LINE__ is too variable to be very
helpful.] It could devolve to DCHECK() in debug builds, though I can
see arguments either way.

A core problem with anything of this sort is that if you wake up one
day thinking "I wonder if there's a problem", and you look at the
histogram, you WILL find a problem. You just won't know when the
problem landed.

-scott
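[Editor's note: a sketch along the lines Scott describes — on failure, bump per-hash buckets for the stringified condition and the file so coarse histograms can localize a failing assertion. UMA_CHECK, RecordUmaCheckFailure, and ConditionHits are hypothetical names, not Chromium APIs, and std::hash stands in for whatever stable hash the real histograms would need.]

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>

// Per-hash failure buckets, one map per dimension (condition, file).
static std::map<std::size_t, int> g_condition_hits;
static std::map<std::size_t, int> g_file_hits;

inline void RecordUmaCheckFailure(const char* cond, const char* file) {
  std::hash<std::string> h;
  ++g_condition_hits[h(cond)];
  ++g_file_hits[h(file)];
}

// Histogram on failure only; successes are not recorded.
#define UMA_CHECK(cond)                                   \
  do {                                                    \
    if (!(cond)) RecordUmaCheckFailure(#cond, __FILE__);  \
  } while (0)

inline int ConditionHits(const char* cond) {
  std::hash<std::string> h;
  auto it = g_condition_hits.find(h(cond));
  return it == g_condition_hits.end() ? 0 : it->second;
}
```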



Stuart Morgan

Sep 14, 2011, 8:17:18 AM
to sh...@google.com, rds...@google.com, Chromium-dev, Andy Hendrickson
On September 13, 2011 at 20:27, Scott Hess <sh...@chromium.org> wrote:
> I think a more productive option might be to build a new something,
> call it an UMA_CHECK(), which would histogram on failure.

What does "histogram on failure" mean exactly? If you don't track
successes, then you don't get much information out of tracking
failures, because you have no scale. It sounds like this would just be a
boolean histogram with a unique name, say
UnexpectedCondition.<DescriptiveNameHere>, and then you'd have a
percentage of how often it's happening. No creation of new macros
necessary--or is the concern that we'd want to do too many of these to
make it convenient to give each a unique name?

-Stuart
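[Editor's note: a sketch of Stuart's alternative — a plain boolean histogram per named condition, recording both outcomes so the failure *rate* is visible. RecordBool and FailureRate are hypothetical stand-ins for UMA's real boolean-histogram machinery.]

```cpp
#include <map>
#include <string>

// Per-name counters: total samples and how many were true (success).
static std::map<std::string, int> g_true_counts;
static std::map<std::string, int> g_total_counts;

inline void RecordBool(const std::string& name, bool ok) {
  ++g_total_counts[name];
  if (ok) ++g_true_counts[name];
}

// Fraction of recorded samples that were failures (false).
inline double FailureRate(const std::string& name) {
  int total = g_total_counts[name];
  return total == 0 ? 0.0 : 1.0 - static_cast<double>(g_true_counts[name]) / total;
}
```

Because successes are recorded too, this answers "how often?" rather than just "has it ever happened?", which is the trade-off Randy weighs in his reply.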

Randy Smith

Sep 15, 2011, 11:27:40 AM
to Stuart Morgan, sh...@google.com, Chromium-dev, Andy Hendrickson
I'm disinclined to track both successes and failures.  I agree that a percentage failure rate would be a more useful metric than a count, but we are tracking invariants here, so the primary signal is "has this ever happened?" and someone who knows the code is in a better position to decide on a secondary signal after getting the primary one.  There may be invariants where *any* failure is worth committing a CHECK to canary and seeing where it comes from, and failures where more fine-grained statistics are useful.  And if we don't generate a histogram except when it fails, picking out the failures to look at will be much easier on this end.

If no one objects, I'll give this a try by creating the UMA_CHECK() macro, converting (almost) all of the DCHECKs in the download system to use it, and reporting back to this forum about the experiment.  Opinions?

-- Randy

Nico Weber

Sep 15, 2011, 12:59:13 PM
to rds...@google.com, Stuart Morgan, sh...@google.com, Chromium-dev, Andy Hendrickson

Do you have a concrete problem you're trying to solve, or is this
mostly out of curiosity? (Our Dbg test bots see DCHECKs – does the
download system not have enough unit test coverage?)

Nico

Denis Lagno

Sep 15, 2011, 1:23:22 PM
to tha...@chromium.org, rds...@google.com, Stuart Morgan, sh...@google.com, Chromium-dev, Andy Hendrickson
On Thu, Sep 15, 2011 at 8:59 PM, Nico Weber <tha...@chromium.org> wrote:
> (Our Dbg test bots see DCHECKs – does the
> download system not have enough unit test coverage?)

Just one example: we currently do not support non-UTF-8 cookies (crbug
77771). In the field this easily triggers a DCHECK violation in
base/values.
IMHO collecting such statistics on failing DCHECKs would highlight
those issues and give additional incentive to fix them.

Peter Kasting

Sep 15, 2011, 1:26:41 PM
to dil...@chromium.org, tha...@chromium.org, rds...@google.com, Stuart Morgan, sh...@google.com, Chromium-dev, Andy Hendrickson
If you know you are hitting code that DCHECKs, you need to do something to address the problem immediately, even if it's not the correct/ultimate fix.  We should never be hitting CHECKs or DCHECKs at all.

PK

Randy Smith

Sep 15, 2011, 1:31:37 PM
to Nico Weber, Stuart Morgan, sh...@google.com, Chromium-dev, Andy Hendrickson

No concrete problem I'm trying to solve, just looking for new ways to
get more of a handle on stability related issues. The downloads
system is indeed weak on test coverage--one of our action items--but
while it's weak, I don't think it's completely out of striking range
of the rest of the system (i.e. I don't actually believe that Chrome
as a whole has really good unit test coverage). My goal here is
improving overall Chrome stability--I was just offering downloads as a
possible test case.

-- Randy


Stuart Morgan

Sep 15, 2011, 1:42:23 PM
to Denis Lagno, tha...@chromium.org, rds...@google.com, sh...@google.com, Chromium-dev, Andy Hendrickson
On September 15, 2011 at 19:23, Denis Lagno <dil...@chromium.org> wrote:
> Just one example: we currently do not support non-UTF-8 cookies (crbug
> 77771).  In the field this easily triggers a DCHECK violation in
> base/values.
> IMHO collecting such statistics on failing DCHECKs would highlight
> those issues and give additional incentive to fix them.

It sounds like you already know the answer to the boolean query "is
the DCHECK being hit", which is all Randy's proposal would really tell
you. If you want to know how common something is, you should actually
add a real metric for it.

(And, as Peter said, fix the DCHECK.)

-Stuart

Alexei Svitkine

Sep 15, 2011, 2:17:49 PM
to pkas...@google.com, dil...@chromium.org, tha...@chromium.org, rds...@google.com, Stuart Morgan, sh...@google.com, Chromium-dev, Andy Hendrickson
> If you know you are hitting code that DCHECKs, you need to do something to
> address the problem immediately, even if it's not the correct/ultimate fix.
>  We should never be hitting CHECKs or DCHECKs at all.

Is there something more I can do, beyond filing a bug for DCHECKs
that I encounter, to ensure they get "addressed immediately"?
Should I be setting a priority or adding some tags to them?

-Alexei

Peter Kasting

Sep 15, 2011, 2:26:48 PM
to Alexei Svitkine, dil...@chromium.org, tha...@chromium.org, rds...@google.com, Stuart Morgan, sh...@google.com, Chromium-dev, Andy Hendrickson
We don't have any tags for these.  Normally DCHECK failures are real bugs which may be hard to track down in the field, so hopefully anyone assigned one takes it seriously.

PK