Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.

Dismiss

Fatal Assertions, Non-Fatal Assertions and Warnings, Oh My!

91 views

Skip to first unread message

Bob Clary

unread,

Jun 14, 2011, 1:57:26 AM6/14/11

This is a modified re-post of my comment to Nicholas' blog
<http://blog.mozilla.com/nnethercote/2011/06/07/what-is-the-point-of-non-fatal-assertions/>.

Much of the discussion in Nicolas' "Gecko assertions, errors, warnings,
et al" post [1] to dev.platform has centered around the mechanics of the
various means of reporting fatal and non-fatal messages. I am not in a
position to make specific recommendations about the code but given my
experience in crash testing I would like to recommend a general approach
to how we treat them.

First, let me say that I love fatal assertions as they make it clear
when I’ve found an exceptional condition which should have a bug filed
in order for it to be fixed.

Second, let me say that I hate fatal assertions as they stop the browser
and make it impossible to find or diagnose other bugs which are hidden
by the fatal assertion.

During crash testing, there have been cases where common, frequently
occurring fatal assertions have gone unfixed for significant periods of
time. These unfixed fatal assertions have the effect of hiding other
bugs from detection which may be far more serious than the condition
flagged by the unfixed fatal assertion. It has also been the case where
bugs have been filed for non-fatal assertions where it has been
determined that the condition is not a bug after all.

I would like to propose that we adopt the following procedure for
dealing with fatal assertions, non-fatal assertions and warnings.

Fatal assertions should only be used in cases where the condition
reflects an unacceptable state which must be fixed immediately. In other
words, *fatal assertions are automatically blockers*. If a fatal
assertion is not important enough to be fixed immediately, it should be
immediately converted to a non-fatal assertion, a warning or removed
altogether.

Non-fatal assertions should be only used for conditions which reflect
the existence of bugs which should be fixed but are not blockers. If a
non-fatal assertion is not considered to be a bug, then it should
immediately either be modified to fire in a more restricted case that is
considered to be a bug, converted to a warning or removed altogether.
Non-fatal assertions should not fire during unit testing when first
introduced.

Warnings should not be used to indicate the presence of a bug. If a
warning is found to be a bug, then it should be modified to be either a
non-fatal or fatal assertion.

How should this be applied to the Aurora and Beta branches?

Currently, when we fix a fatal assertion on trunk we may decide not to
fix it on the branches. For fatal assertions which are frequently
reproduced during crash testing, this has the affect of reducing the
test coverage on the branches which will be shipping soonest. I propose
that if it is decided that a fatal assertion is not important enough or
too risky to fix on the Aurora or Beta branches, that it be demoted to a
non-fatal assertion. This will help the crash and top site testing to
find other bugs on the branches which may be hidden by the fatal assertion.

I don't believe that there would be any real benefit in demoting
non-fatal assertions or warnings on the branches though if non-fatal
assertions are promoted to fatal assertions or warnings are promoted to
assertions on the trunk it may be useful to do the same on the branches.

I believe this procedure would result in fatal assertions either being
fixed or demoted quickly thereby improving test coverage during crash
and top site testing. The fixing of bugs related to non-fatal
assertions, the demotion of non-fatal assertions which are not bugs with
the addition of restricting the introduction of new non-fatal assertions
which fire during normal unit test runs will help us put the number of
failing non-fatal assertions on the decline.

[1]
http://groups.google.com/group/mozilla.dev.platform/browse_thread/thread/556cb7debb174cca

Ehsan Akhgari

unread,

Jun 14, 2011, 10:44:29 AM6/14/11

to Bob Clary, dev-pl...@lists.mozilla.org

What is the scope of this proposal? I'm assuming the Javascript engine,
because I think the premise of the severity of a runtime check to have
any significance on the priority of fixing the bugs filed about hitting
that condition is false for Gecko.

Cheers,
Ehsan

> _______________________________________________
> dev-platform mailing list
> dev-pl...@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-platform

Bob Clary

unread,

Jun 14, 2011, 10:52:40 AM6/14/11

On 6/14/11 7:44 AM, Ehsan Akhgari wrote:
> What is the scope of this proposal? I'm assuming the Javascript engine,
> because I think the premise of the severity of a runtime check to have
> any significance on the priority of fixing the bugs filed about hitting
> that condition is false for Gecko.

The proposal is not just for the JavaScript engine, but anywhere that
uses a fatal assertion.

I disagree that the severity of the runtime check should have no
significance on the priority of fixing it.

A fatal assertion stops all further testing of the offending content
until it is either fixed or demoted. If we don't think the fatal
assertion is important enough to be immediately fixed, it is not
important enough to block further testing and should be demoted to a
non-fatal assertion.

Ehsan Akhgari

unread,

Jun 14, 2011, 11:04:55 AM6/14/11

to Bob Clary, dev-pl...@lists.mozilla.org

OK, so let me reply to the original post now.

On 11-06-14 1:57 AM, Bob Clary wrote:

> Fatal assertions should only be used in cases where the condition
> reflects an unacceptable state which must be fixed immediately. In other
> words, *fatal assertions are automatically blockers*. If a fatal
> assertion is not important enough to be fixed immediately, it should be
> immediately converted to a non-fatal assertion, a warning or removed
> altogether.

You're assuming that we use fatal assertions for things which have a
high priority of fixing if they break. To my experience, this is not
how we use them in Gecko. We use them to either indicate incorrect
program state which can cause the program execution to have undefined
results if we let it continue. For example, if you're writing a set
data structure, and at some point you detect a duplicate value in the
map, it may be a good idea to abort, since this will break the
assumptions of the consumers, and cause them to behave in unexpected
ways. The other reason why we use fatal assertions (we call them
runtime aborts in this case) is to get crashstats data on a pathological
situation from users when a condition causes the program to abort.

> Non-fatal assertions should be only used for conditions which reflect
> the existence of bugs which should be fixed but are not blockers. If a
> non-fatal assertion is not considered to be a bug, then it should
> immediately either be modified to fire in a more restricted case that is
> considered to be a bug, converted to a warning or removed altogether.
> Non-fatal assertions should not fire during unit testing when first
> introduced.

Non-fatal assertions (NS_ASSERTIONs) indicate assumptions which don't
hold, but they may not be severe enough to stop the program execution.
For example, laying out a table one pixel wider than it should be could
trigger an assertion. Whether we should consider it a blocker to fix
depends on where this failure is observed, and what it leads to. For
example, if this bug causes the homepage of the Google website to be
laid out incorrectly, I'm sure we'll be treating it very seriously (but
not because the original assertion has been too serious).

> Warnings should not be used to indicate the presence of a bug. If a
> warning is found to be a bug, then it should be modified to be either a
> non-fatal or fatal assertion.

I agree, warnings should indicate things which could possibly lead into
other things going wrong, and I do believe that this is how we generally
use them in Gecko.

> How should this be applied to the Aurora and Beta branches?

Given the above, I don't think that the rest of your post applies to
Aurora and Beta.

Cheers,
Ehsan

Bob Clary

unread,

Jun 14, 2011, 12:09:59 PM6/14/11

On 6/14/11 8:04 AM, Ehsan Akhgari wrote:
> OK, so let me reply to the original post now.
>
> On 11-06-14 1:57 AM, Bob Clary wrote:
>> Fatal assertions should only be used in cases where the condition
>> reflects an unacceptable state which must be fixed immediately. In other
>> words, *fatal assertions are automatically blockers*. If a fatal
>> assertion is not important enough to be fixed immediately, it should be
>> immediately converted to a non-fatal assertion, a warning or removed
>> altogether.
>
> You're assuming that we use fatal assertions for things which have a
> high priority of fixing if they break. To my experience, this is not how
> we use them in Gecko. We use them to either indicate incorrect program
> state which can cause the program execution to have undefined results if
> we let it continue. For example, if you're writing a set data structure,
> and at some point you detect a duplicate value in the map, it may be a
> good idea to abort, since this will break the assumptions of the

No, I am not assuming that we *currently* use fatal assertions for
things which have a high priority of fixing if they break. Although I do
believe that currently a new patch which begins causing fatal assertions
during tinderbox unit test runs will either be backed out or immediately
fixed.

I am *proposing* that we use fatal assertions only for things which
should have a high priority for fixing if they break. I do not believe
that is always the case now.

Our users do not immediately see the crashes or other problems due to
the inconsistent states which are being guarded by the fatal assertion
but instead continue to happily execute with the chance that they will
run into the consequence of the inconsistency sometime later.

It is my contention that if the fatal assertion is left unfixed or
undemoted then it will hide either related bugs or unrelated bugs from

crash and top site testing.

To take a completely contrived example:

Current Scenario: We discover fatal_assertion firing frequently, file a
bug on it and then either ignore it for some time or decide that it is
either not important enough to fix or too risky to fix. The underlying
bug causes our users to crash at crash_signature with an alarming degree
of frequency. With fatal_assertion in place, the crash and top site
testing will continue to hit it but will fail to reproduce
crash_signature and the connection between fatal_assertion and
crash_signature will remain unknown.

Proposed Scenario: We discover fatal_assertion firing and file a bug on
it. If we fix fatal_assertion and it is the cause of other crashes such
as crash_signature we benefit. If we decide fatal_assertion is not
important enough to fix or too risky to fix, we demote it to a non-fatal
assertion. Crash and top site testing no longer fatally assert with
fatal_assertion on the urls where we hit fatal_assertion but do crash at
some crash_signature. If crash_signature was due to the failing fatal
assertion we have learned that the fatal assertion was important enough
to fix and it should be promoted back to fatal assertion when it is
fixed. If crash_signature is unrelated to the original fatal_assertion,
we benefit because we have found a bug which would have been otherwise
hidden from us.

It is my belief that fatal assertions exact a definite cost on crash and
top site testing and that the cost exceeds any benefit from leaving a
fatal assertion unfixed or undemoted.

> consumers, and cause them to behave in unexpected ways. The other reason
> why we use fatal assertions (we call them runtime aborts in this case)
> is to get crashstats data on a pathological situation from users when a
> condition causes the program to abort.
>

The runtime aborts which affect releases builds are a somewhat different
example in that we decided at the time it was added that it was more
important to crash a user on a release build than to allow the browser
to continue. I agree that those conditions were carefully chosen and the
cost of crashing the user weighed against the cost of allowing the
browser to continue.

Even with this example there may be cases where the runtime abort for
say an infallible allocator should not be used in particular
circumstances such as when the content controls allocation sizes.

>> Non-fatal assertions should be only used for conditions which reflect
>> the existence of bugs which should be fixed but are not blockers. If a
>> non-fatal assertion is not considered to be a bug, then it should
>> immediately either be modified to fire in a more restricted case that is
>> considered to be a bug, converted to a warning or removed altogether.
>> Non-fatal assertions should not fire during unit testing when first
>> introduced.
>
> Non-fatal assertions (NS_ASSERTIONs) indicate assumptions which don't
> hold, but they may not be severe enough to stop the program execution.
> For example, laying out a table one pixel wider than it should be could
> trigger an assertion. Whether we should consider it a blocker to fix
> depends on where this failure is observed, and what it leads to. For
> example, if this bug causes the homepage of the Google website to be
> laid out incorrectly, I'm sure we'll be treating it very seriously (but
> not because the original assertion has been too serious).
>

I did not mean that only bugs which are triggered by fatal assertions
could be made a blocker or that non fatal assertions could not in
particular cases be made a blocker.

>> Warnings should not be used to indicate the presence of a bug. If a
>> warning is found to be a bug, then it should be modified to be either a
>> non-fatal or fatal assertion.
>
> I agree, warnings should indicate things which could possibly lead into
> other things going wrong, and I do believe that this is how we generally
> use them in Gecko.
>
>> How should this be applied to the Aurora and Beta branches?
>
> Given the above, I don't think that the rest of your post applies to
> Aurora and Beta.

I disagree.

I am currently seeing frequent "Assertion failure:
obj->containsSlot(slot)" and "ABORT: aRelevantLinkVisited should only be
set when we have a separate style: 'aRulesIfVisited ||
!aRelevantLinkVisited'" on Beta.

The cost to crash and top site testing of these fatal assertions is that
interesting content which may exhibit other more important bugs which
should be fixed on Beta are going untested and potentially unfixed on Beta.

Other examples are "ABORT: What is the next frame supposed to be?:
'haveFullNextFrame || (mDecoder && mFrames.Length() >
mDecoder->GetCompleteFrameCount())'" and "ABORT: Can't be done decoding
if we're mid-frame!: '!mInFrame'" are very frequent on Nightly, Aurora
and Beta. They cost testing on all branches. Even if they are fixed or
demoted on Nightly but remain fatal on Aurora and Beta they continue to
cost our testing efforts to find other bugs on the branches closest
being released to our users. Are these fatal assertions so important
that the cost to our testing of Aurora and Beta is not important?

Let me clarify something about my proposal and Beta. Beta is special in
that it is the branch closest to being released to our users. As such,
we are very interested in fixing important bugs but are also very
interested in not creating too much risk or churn too close to the
release. I am not proposing that the fixing or demotion of fatal
assertions be done very late in the release cycle unless there are good
reasons to do so. For example, I am not proposing any changes to the
currently shipping Firefox 5 beta. However if this proposal takes
effect, then a natural by product will be that the reproducible fatal
assertions which were either fixed or demoted on Nightly will "trickle
down" to the Aurora and Beta during the earlier stages of the release cycle.

Ehsan Akhgari

unread,

Jun 14, 2011, 3:51:45 PM6/14/11

to Bob Clary, dev-pl...@lists.mozilla.org

I was not 100% sure that I understand what problem you're trying to
solve after reading your first post, and reading this post made me doubt
my understanding even more. So let me try to see if I'm understanding
you correctly.

On 11-06-14 12:09 PM, Bob Clary wrote:
> No, I am not assuming that we *currently* use fatal assertions for
> things which have a high priority of fixing if they break.

That's true.

> Although I do
> believe that currently a new patch which begins causing fatal assertions
> during tinderbox unit test runs will either be backed out or immediately
> fixed.

If the assertion is in one of the test suites which can detect it, then
yes (basically, crashtest* and *reftest*).

> I am *proposing* that we use fatal assertions only for things which
> should have a high priority for fixing if they break. I do not believe
> that is always the case now.

It's not. I was trying to convey that it's not always possible to tell
what the severity of an assumption not holding would be. To reiterate
my examples, finding duplicate items in a set implementation is pretty
serious. Layout out an element 1 pixel wider than what we should may be
disastrous, or it may not be. We won't know unless the bug happens.

> Our users do not immediately see the crashes or other problems due to
> the inconsistent states which are being guarded by the fatal assertion
> but instead continue to happily execute with the chance that they will
> run into the consequence of the inconsistency sometime later.

This is only relevant for "runtime aborts", since our users are not
running debug builds, right?

> It is my contention that if the fatal assertion is left unfixed or
> undemoted then it will hide either related bugs or unrelated bugs from
> crash and top site testing.

I'm assuming that you mean a bug which includes hitting a fatal
assertion, right? Such a bug would seem pretty serious to me, and I
think we take it very seriously as things are today.

> To take a completely contrived example:
>
> Current Scenario: We discover fatal_assertion firing frequently, file a
> bug on it and then either ignore it for some time or decide that it is
> either not important enough to fix or too risky to fix. The underlying
> bug causes our users to crash at crash_signature with an alarming degree
> of frequency. With fatal_assertion in place, the crash and top site
> testing will continue to hit it but will fail to reproduce
> crash_signature and the connection between fatal_assertion and
> crash_signature will remain unknown.

Do you know of real examples of runtime abort bugs which have not been
taken seriously?

> Proposed Scenario: We discover fatal_assertion firing and file a bug on
> it. If we fix fatal_assertion and it is the cause of other crashes such
> as crash_signature we benefit. If we decide fatal_assertion is not
> important enough to fix or too risky to fix, we demote it to a non-fatal
> assertion. Crash and top site testing no longer fatally assert with
> fatal_assertion on the urls where we hit fatal_assertion but do crash at
> some crash_signature. If crash_signature was due to the failing fatal
> assertion we have learned that the fatal assertion was important enough
> to fix and it should be promoted back to fatal assertion when it is
> fixed. If crash_signature is unrelated to the original fatal_assertion,
> we benefit because we have found a bug which would have been otherwise
> hidden from us.

If the runtime abort is happening because of corrupted program state
(for example, duplicate items in a set), then whatever bugs we see after
that could be a result of broken promises in the code (for example, set
consumers assuming that they won't see the same item twice in a set),
right? How would you filter out these invalid bugs when you "demote" a
runtime abort? (FWIW, I'm assuming that there's a way to demote a
runtime abort, when in reality, that's the only level of runtime check
we have for non-debug builds, to the best of my knowledge).

> It is my belief that fatal assertions exact a definite cost on crash and
> top site testing and that the cost exceeds any benefit from leaving a
> fatal assertion unfixed or undemoted.

You're basing this conclusion based on two premises:

1. We leave runtime abort bugs unfixed, and treat them as not very
important.
2. The extra bug reports that we would get by demoting (effectively
taking out) runtime aborts would be helpful, because they indicate bugs
that are not caused by incorrect program state.

I think we should see some evidence that each of these premises hold in
order to be able to draw this conclusion.

> The runtime aborts which affect releases builds are a somewhat different
> example in that we decided at the time it was added that it was more
> important to crash a user on a release build than to allow the browser
> to continue. I agree that those conditions were carefully chosen and the
> cost of crashing the user weighed against the cost of allowing the
> browser to continue.

Now this really confuses me. Our users don't get assertions at all, so
aren't runtime aborts what we've been discussing all along?

> Even with this example there may be cases where the runtime abort for
> say an infallible allocator should not be used in particular
> circumstances such as when the content controls allocation sizes.

Right. In those cases, the caller should be corrected (by calling the
fallible allocator). I don't think that taking out runtime aborts
because of this reason is the right thing to do.

> I am currently seeing frequent "Assertion failure:
> obj->containsSlot(slot)" and "ABORT: aRelevantLinkVisited should only be
> set when we have a separate style: 'aRulesIfVisited ||
> !aRelevantLinkVisited'" on Beta.

Are any of these runtime aborts? Are there bugs filed for them? Why is
the usual Beta branch strategy (trivial fixes or backouts) not enough to
address these bugs?

> The cost to crash and top site testing of these fatal assertions is that
> interesting content which may exhibit other more important bugs which
> should be fixed on Beta are going untested and potentially unfixed on Beta.
>
> Other examples are "ABORT: What is the next frame supposed to be?:
> 'haveFullNextFrame || (mDecoder && mFrames.Length() >
> mDecoder->GetCompleteFrameCount())'" and "ABORT: Can't be done decoding
> if we're mid-frame!: '!mInFrame'" are very frequent on Nightly, Aurora
> and Beta. They cost testing on all branches. Even if they are fixed or
> demoted on Nightly but remain fatal on Aurora and Beta they continue to
> cost our testing efforts to find other bugs on the branches closest
> being released to our users. Are these fatal assertions so important
> that the cost to our testing of Aurora and Beta is not important?

I don't think that anybody can answer these questions without having
more information and analysis. Given the assertion X, we can't judge
how severe it is all the time (if we know that it's a show-stopper, it
should be an abort in the first place)...

> Let me clarify something about my proposal and Beta. Beta is special in
> that it is the branch closest to being released to our users. As such,
> we are very interested in fixing important bugs but are also very
> interested in not creating too much risk or churn too close to the
> release. I am not proposing that the fixing or demotion of fatal
> assertions be done very late in the release cycle unless there are good
> reasons to do so. For example, I am not proposing any changes to the
> currently shipping Firefox 5 beta. However if this proposal takes
> effect, then a natural by product will be that the reproducible fatal
> assertions which were either fixed or demoted on Nightly will "trickle
> down" to the Aurora and Beta during the earlier stages of the release
> cycle.

I think that we should first decide that the demotion plan is what we
really want to do before we try to figure out how that would work on
Aurora and Beta. So far I'm not convinced that this is something which
we need to do.

Cheers,
Ehsan

Bob Clary

unread,

Jun 14, 2011, 4:27:48 PM6/14/11

On 6/14/11 12:51 PM, Ehsan Akhgari wrote:
> I was not 100% sure that I understand what problem you're trying to
> solve after reading your first post, and reading this post made me doubt
> my understanding even more. So let me try to see if I'm understanding
> you correctly.
>
> On 11-06-14 12:09 PM, Bob Clary wrote:
>> No, I am not assuming that we *currently* use fatal assertions for
>> things which have a high priority of fixing if they break.
>
> That's true.
>
> > Although I do
>> believe that currently a new patch which begins causing fatal assertions
>> during tinderbox unit test runs will either be backed out or immediately
>> fixed.
>
> If the assertion is in one of the test suites which can detect it, then
> yes (basically, crashtest* and *reftest*).
>
>> I am *proposing* that we use fatal assertions only for things which
>> should have a high priority for fixing if they break. I do not believe
>> that is always the case now.
>
> It's not. I was trying to convey that it's not always possible to tell
> what the severity of an assumption not holding would be. To reiterate my
> examples, finding duplicate items in a set implementation is pretty
> serious. Layout out an element 1 pixel wider than what we should may be
> disastrous, or it may not be. We won't know unless the bug happens.

What is your point with these examples?

Finding duplicate items in a set may be very serious. I could imagine
that it could lead to deleting or accessing already freed memory which
would definitely qualify as disastrous. Depending on the situation this
may very well be a good case for a fatal assertion. However, if it is
decided that the situation does not call for an immediate fix then why
should it remain a fatal assertion? If it is not to be fixed on Aurora
or Beta what are the benefits for leaving it as a fatal assertion on
those branches?

Laying out an element 1 pixel wider may be a serious bug that affects
user experience and may even be considered a blocker but I would hardly
classify it as *disastrous*. Why would we want to block further testing
of content simply due to a 1 pixel difference ?

>
>> Our users do not immediately see the crashes or other problems due to
>> the inconsistent states which are being guarded by the fatal assertion
>> but instead continue to happily execute with the chance that they will
>> run into the consequence of the inconsistency sometime later.
>
> This is only relevant for "runtime aborts", since our users are not
> running debug builds, right?
>

Wrong.

When I say "fatal assertion" I am mostly talking about debug builds
which we run during testing.

>> It is my contention that if the fatal assertion is left unfixed or
>> undemoted then it will hide either related bugs or unrelated bugs from
>> crash and top site testing.
>
> I'm assuming that you mean a bug which includes hitting a fatal
> assertion, right? Such a bug would seem pretty serious to me, and I
> think we take it very seriously as things are today.
>

We take them pretty seriously but there are cases where they remain for
significant periods of time. An example is:

ABORT: aRelevantLinkVisited should only be set when we have a separate style

https://bugzilla.mozilla.org/show_bug.cgi?id=611922
filed by Jesse 2010-11-12
flagged by me in crash testing on 2011-03-17
fixed on mozilla-central on 2011-05-21

>> To take a completely contrived example:
>>
>> Current Scenario: We discover fatal_assertion firing frequently, file a
>> bug on it and then either ignore it for some time or decide that it is
>> either not important enough to fix or too risky to fix. The underlying
>> bug causes our users to crash at crash_signature with an alarming degree
>> of frequency. With fatal_assertion in place, the crash and top site
>> testing will continue to hit it but will fail to reproduce
>> crash_signature and the connection between fatal_assertion and
>> crash_signature will remain unknown.
>
> Do you know of real examples of runtime abort bugs which have not been
> taken seriously?
>

No, but you are incorrectly conflating my discussion of fatal assertions
with runtime aborts in release builds.

>> Proposed Scenario: We discover fatal_assertion firing and file a bug on
>> it. If we fix fatal_assertion and it is the cause of other crashes such
>> as crash_signature we benefit. If we decide fatal_assertion is not
>> important enough to fix or too risky to fix, we demote it to a non-fatal
>> assertion. Crash and top site testing no longer fatally assert with
>> fatal_assertion on the urls where we hit fatal_assertion but do crash at
>> some crash_signature. If crash_signature was due to the failing fatal
>> assertion we have learned that the fatal assertion was important enough
>> to fix and it should be promoted back to fatal assertion when it is
>> fixed. If crash_signature is unrelated to the original fatal_assertion,
>> we benefit because we have found a bug which would have been otherwise
>> hidden from us.
>
> If the runtime abort is happening because of corrupted program state
> (for example, duplicate items in a set), then whatever bugs we see after
> that could be a result of broken promises in the code (for example, set
> consumers assuming that they won't see the same item twice in a set),
> right? How would you filter out these invalid bugs when you "demote" a
> runtime abort? (FWIW, I'm assuming that there's a way to demote a
> runtime abort, when in reality, that's the only level of runtime check
> we have for non-debug builds, to the best of my knowledge).
>

Again you conflate fatal assertions and runtime aborts in release builds.

>> It is my belief that fatal assertions exact a definite cost on crash and
>> top site testing and that the cost exceeds any benefit from leaving a
>> fatal assertion unfixed or undemoted.
>
> You're basing this conclusion based on two premises:
>
> 1. We leave runtime abort bugs unfixed, and treat them as not very
> important.

False. I do not make this premise. Reread the thread again.

> 2. The extra bug reports that we would get by demoting (effectively
> taking out) runtime aborts would be helpful, because they indicate bugs
> that are not caused by incorrect program state.
>

False. Again you are conflating fatal assertions and runtime aborts in
release builds.

> I think we should see some evidence that each of these premises hold in
> order to be able to draw this conclusion.
>

I think you need to reread this thread again.

>> The runtime aborts which affect releases builds are a somewhat different
>> example in that we decided at the time it was added that it was more
>> important to crash a user on a release build than to allow the browser
>> to continue. I agree that those conditions were carefully chosen and the
>> cost of crashing the user weighed against the cost of allowing the
>> browser to continue.
>
> Now this really confuses me. Our users don't get assertions at all, so
> aren't runtime aborts what we've been discussing all along?
>

No.

>> Even with this example there may be cases where the runtime abort for
>> say an infallible allocator should not be used in particular
>> circumstances such as when the content controls allocation sizes.
>
> Right. In those cases, the caller should be corrected (by calling the
> fallible allocator). I don't think that taking out runtime aborts
> because of this reason is the right thing to do.
>

I am not proposing to do that.

>> I am currently seeing frequent "Assertion failure:
>> obj->containsSlot(slot)" and "ABORT: aRelevantLinkVisited should only be
>> set when we have a separate style: 'aRulesIfVisited ||
>> !aRelevantLinkVisited'" on Beta.
>
> Are any of these runtime aborts? Are there bugs filed for them? Why is
> the usual Beta branch strategy (trivial fixes or backouts) not enough to
> address these bugs?
>

Yes bugs have been filed for many if not all of them. I do not see a
strategy in affect to deal with fatal assertions which are fixed on
Nightly but not in Aurora or Beta or fixed in Nightly and Aurora but not
Beta.

>> The cost to crash and top site testing of these fatal assertions is that
>> interesting content which may exhibit other more important bugs which
>> should be fixed on Beta are going untested and potentially unfixed on
>> Beta.
>>
>> Other examples are "ABORT: What is the next frame supposed to be?:
>> 'haveFullNextFrame || (mDecoder && mFrames.Length() >
>> mDecoder->GetCompleteFrameCount())'" and "ABORT: Can't be done decoding
>> if we're mid-frame!: '!mInFrame'" are very frequent on Nightly, Aurora
>> and Beta. They cost testing on all branches. Even if they are fixed or
>> demoted on Nightly but remain fatal on Aurora and Beta they continue to
>> cost our testing efforts to find other bugs on the branches closest
>> being released to our users. Are these fatal assertions so important
>> that the cost to our testing of Aurora and Beta is not important?
>
> I don't think that anybody can answer these questions without having
> more information and analysis. Given the assertion X, we can't judge how
> severe it is all the time (if we know that it's a show-stopper, it
> should be an abort in the first place)...
>

That is why my proposal does not recommend going through every single
fatal assertion in the tree and making the call apriori but does
recommend that when we do see them, then we make the call.

Take https://bugzilla.mozilla.org/show_bug.cgi?id=611922 for example.
When Jesse first discovered it, it was the result of a fuzzer. There may
or may not have been enough information at the time to determine if this
was hit by significant real content on the real web. At that point I
don't think we had enough information to classify it as a blocker nor to
demote it to an assertion. Once it was discovered on the real web, we
did have enough information to make that call. If this policy was in
affect at the time I first saw it during crash testing I would have been
able to ask for a determination whether the fatal assertion was
important enough to block or if it could be demoted to a non-fatal
assertion.

>> Let me clarify something about my proposal and Beta. Beta is special in

Ehsan Akhgari

unread,

Jun 14, 2011, 6:06:40 PM6/14/11

to dev-pl...@lists.mozilla.org

On 11-06-14 4:27 PM, Bob Clary wrote:
> When I say "fatal assertion" I am mostly talking about debug builds
> which we run during testing.

Oh, OK. I was confused because you were talking about our users hitting
these assertions...

>> I'm assuming that you mean a bug which includes hitting a fatal
>> assertion, right? Such a bug would seem pretty serious to me, and I
>> think we take it very seriously as things are today.
>>
>
> We take them pretty seriously but there are cases where they remain for
> significant periods of time. An example is:
>
> ABORT: aRelevantLinkVisited should only be set when we have a separate
> style
> https://bugzilla.mozilla.org/show_bug.cgi?id=611922
> filed by Jesse 2010-11-12
> flagged by me in crash testing on 2011-03-17
> fixed on mozilla-central on 2011-05-21

OK, good example.

>> 2. The extra bug reports that we would get by demoting (effectively
>> taking out) runtime aborts would be helpful, because they indicate bugs
>> that are not caused by incorrect program state.
>>
>
> False. Again you are conflating fatal assertions and runtime aborts in
> release builds.

Actually, this is also true if we're talking about debug mode aborts.
Converting them to non-fatal assertions doesn't necessarily mean that
we'll get more meaningful bug reports (unless there is not a very good
reason for the abort to exist, in which case I agree that we should take
it out).

>>> I am currently seeing frequent "Assertion failure:
>>> obj->containsSlot(slot)" and "ABORT: aRelevantLinkVisited should only be
>>> set when we have a separate style: 'aRulesIfVisited ||
>>> !aRelevantLinkVisited'" on Beta.
>>
>> Are any of these runtime aborts? Are there bugs filed for them? Why is
>> the usual Beta branch strategy (trivial fixes or backouts) not enough to
>> address these bugs?
>>
>
> Yes bugs have been filed for many if not all of them. I do not see a
> strategy in affect to deal with fatal assertions which are fixed on
> Nightly but not in Aurora or Beta or fixed in Nightly and Aurora but not
> Beta.

Why is the current approval strategy for Aurora and Beta not sufficient?
You could just attach a patch, request approval and provide
justification for it.

> Take https://bugzilla.mozilla.org/show_bug.cgi?id=611922 for example.
> When Jesse first discovered it, it was the result of a fuzzer. There may
> or may not have been enough information at the time to determine if this
> was hit by significant real content on the real web. At that point I
> don't think we had enough information to classify it as a blocker nor to
> demote it to an assertion. Once it was discovered on the real web, we
> did have enough information to make that call. If this policy was in
> affect at the time I first saw it during crash testing I would have been
> able to ask for a determination whether the fatal assertion was
> important enough to block or if it could be demoted to a non-fatal
> assertion.

In this case, I think what I would personally do is to convert the abort
into an assertion locally and seeing what difference it makes on those
two websites. I wouldn't want to land that patch on m-c though, since
it causes the failure to be ignored on other sites, where it might be
happening for different reasons. That would hide those possible bugs,
isn't that correct?

Cheers,
Ehsan

Bob Clary

unread,

Jun 14, 2011, 7:19:55 PM6/14/11

On 6/14/11 3:06 PM, Ehsan Akhgari wrote:
> On 11-06-14 4:27 PM, Bob Clary wrote:
>> When I say "fatal assertion" I am mostly talking about debug builds
>> which we run during testing.
>
> Oh, OK. I was confused because you were talking about our users hitting
> these assertions...
>

No I didn't. I spoke of our users hitting the consequences of the
inconsistent states guarded by the fatal assertions. The user does not
crash at the fatal assertion (I'm not talking about runtime aborts in
release builds here), but they do sometimes eventually crash somewhere
else due to the inconsistent state of the browser which occurred because
the condition of the fatal assertion failed. Other times our users may
crash at locations and conditions which are completely unrelated to the
fatal assertion.

>>> I'm assuming that you mean a bug which includes hitting a fatal
>>> assertion, right? Such a bug would seem pretty serious to me, and I
>>> think we take it very seriously as things are today.
>>>
>>
>> We take them pretty seriously but there are cases where they remain for
>> significant periods of time. An example is:
>>
>> ABORT: aRelevantLinkVisited should only be set when we have a separate
>> style
>> https://bugzilla.mozilla.org/show_bug.cgi?id=611922
>> filed by Jesse 2010-11-12
>> flagged by me in crash testing on 2011-03-17
>> fixed on mozilla-central on 2011-05-21
>
> OK, good example.
>
>>> 2. The extra bug reports that we would get by demoting (effectively
>>> taking out) runtime aborts would be helpful, because they indicate bugs
>>> that are not caused by incorrect program state.
>>>
>>
>> False. Again you are conflating fatal assertions and runtime aborts in
>> release builds.
>
> Actually, this is also true if we're talking about debug mode aborts.
> Converting them to non-fatal assertions doesn't necessarily mean that
> we'll get more meaningful bug reports (unless there is not a very good
> reason for the abort to exist, in which case I agree that we should take
> it out).
>

When we first hit the fatal assertion and file a bug on it, we get the
first bit of important information that the fatal assertion can fire
under some circumstances. At that point we should look at the fatal
assertion with a critical eye to determine does this really qualify to
be a fatal assertion. We may or may not be able to make such a
determination right away for a variety of reasons up to and including
the developer is busy on other issues. If however the crash or top site
testing finds that this fatal assertion is very common in our crash url
reports from Socorro or on top sites we have the next piece of important
information about the fatal assertion. Our users are visiting sites
which violate the fatal assertion's condition (Not that they are
experiencing the fatal assertion directly).

At this point we have a choice to make. Is this fatal assertion really
something important or was it originally just made a fatal assertion in
the belief that the condition would never be violated. If it is really
important... for example we are deleting already freed memory or the
vtable contains deleted memory or we are executing javascript with the
wrong principle then it really should be a blocker and be fixed.

If on the other hand it was originally made a fatal assertion but is not
that important to be fixed relatively soon, there is no real benefit to
keeping it as a fatal assertion while there is a very real cost to
keeping as fatal assertion.

The first cost is related to the difficulty in mapping user crashes
which are reported to Socorro with fatal assertions which occur when
testing the Socorro urls with debug builds. If we fix the fatal
assertion relatively soon and see the user crashes in Socorro disappear
then all is well. But if we leave the fatal assertion as is then the
automated crash testing will always fail at the fatal assertion and will
never reproduce the crash signature seen by the user and reported in
Socorro.

The second cost is the fact that the fatal assertion acts as a guard on
the execution path for the debug builds used in crash testing and it
prevents us from testing the code paths which the user traveled. It may
be that we find a crash which is not related to the one originally
experienced by the user and which has absolutely nothing to do with the
fatal assertion's condition. These are still important and may actually
be more important than the failure of the original fatal assertion's
condition.

>>>> I am currently seeing frequent "Assertion failure:
>>>> obj->containsSlot(slot)" and "ABORT: aRelevantLinkVisited should
>>>> only be
>>>> set when we have a separate style: 'aRulesIfVisited ||
>>>> !aRelevantLinkVisited'" on Beta.
>>>
>>> Are any of these runtime aborts? Are there bugs filed for them? Why is
>>> the usual Beta branch strategy (trivial fixes or backouts) not enough to
>>> address these bugs?
>>>
>>
>> Yes bugs have been filed for many if not all of them. I do not see a
>> strategy in affect to deal with fatal assertions which are fixed on
>> Nightly but not in Aurora or Beta or fixed in Nightly and Aurora but not
>> Beta.
>
> Why is the current approval strategy for Aurora and Beta not sufficient?
> You could just attach a patch, request approval and provide
> justification for it.
>

It is difficult enough to get traction on mozilla-central without a
clear policy. With a clear policy it would be easier to do just that on
mozilla-central, mozilla-aurora and mozilla-beta without having to
fight the same battles over and over again.

I am asking for us to agree on a clear policy that myself, Tomcat and
others can point to when we make such requests.

>> Take https://bugzilla.mozilla.org/show_bug.cgi?id=611922 for example.
>> When Jesse first discovered it, it was the result of a fuzzer. There may
>> or may not have been enough information at the time to determine if this
>> was hit by significant real content on the real web. At that point I
>> don't think we had enough information to classify it as a blocker nor to
>> demote it to an assertion. Once it was discovered on the real web, we
>> did have enough information to make that call. If this policy was in
>> affect at the time I first saw it during crash testing I would have been
>> able to ask for a determination whether the fatal assertion was
>> important enough to block or if it could be demoted to a non-fatal
>> assertion.
>
> In this case, I think what I would personally do is to convert the abort
> into an assertion locally and seeing what difference it makes on those
> two websites. I wouldn't want to land that patch on m-c though, since it
> causes the failure to be ignored on other sites, where it might be
> happening for different reasons. That would hide those possible bugs,
> isn't that correct?
>

I would think the developer might do something very similar initially.
But if the fatal assertion is occurring on dozens or hundreds of sites
then it may also be hiding other possible bugs.

Converting the fatal assertion to a non-fatal assertion would not hide
the bug nor necessarily cause the failure to be ignored so long as we
keep track that the non-fatal assertion is still occurring at the
original site as well as the other sites. It would simply convert it to
a non-fatal assertion which would still be reported in the log. The
actual bugzilla bug would be about the condition failure which was
originally a fatal assertion but is now a non-fatal one.

> Cheers,
> Ehsan

Robert O'Callahan

unread,

Jun 14, 2011, 8:38:13 PM6/14/11

to Ehsan Akhgari, dev-pl...@lists.mozilla.org

On Wed, Jun 15, 2011 at 10:06 AM, Ehsan Akhgari <ehsan....@gmail.com>wrote:

>
> ABORT: aRelevantLinkVisited should only be set when we have a separate
>> style
>> https://bugzilla.mozilla.org/show_bug.cgi?id=611922
>> filed by Jesse 2010-11-12
>> flagged by me in crash testing on 2011-03-17
>> fixed on mozilla-central on 2011-05-21
>>
>

> OK, good example.

That is a good example of something that should not be a fatal assertion
IMHO.

Rob
--
"Now the Bereans were of more noble character than the Thessalonians, for
they received the message with great eagerness and examined the Scriptures
every day to see if what Paul said was true." [Acts 17:11]

Nicholas Nethercote

unread,

Jun 14, 2011, 8:45:18 PM6/14/11

to Bob Clary, dev-pl...@lists.mozilla.org

On Tue, Jun 14, 2011 at 3:57 PM, Bob Clary <bcl...@bclary.com> wrote:
>
> given my
> experience in crash testing I would like to recommend a general approach to
> how we treat them.

This all sounds pretty sensible to me. Furthermore, bclary has a ton
of experience in this area, so if he thinks it's a good idea I'm
inclined to believe him.

If we were to do as suggested, what would the next step be? Write it
down as official policy on a wiki, so that when bclary encounters a
failing assertion, he can tell the relevant person "please fix this or
change it's severity"?

Nick

Ehsan Akhgari

unread,

Jun 15, 2011, 12:39:03 AM6/15/11

to Bob Clary, dev-pl...@lists.mozilla.org

How about this alternative proposal: we can start supporting an
environment variable which controls the abort behavior, similar to
XPCOM_DEBUG_BREAK. Let's call it XPCOM_ABORT_BREAK. If the value is
"abort" (or if it's not specified), we can proceed with the default
behavior (aborting). If the value is "assert", we treat it as a regular
assertion, and if the value is "warn", we treat it as a warning.

Would that address your concerns?

Cheers,
Ehsan

Bob Clary

unread,

Jun 15, 2011, 1:21:32 PM6/15/11

So, basically your idea is to just introduce a feature to turn all fatal
assertions into non-fatal assertions to address my concern that fatal
assertions act as code path guards?

I think this would lose the imperative to fix incorrect uses of fatal
assertions, non-fatal assertions and warnings through iterative
promotion and demotion as they are identified.

If this is not for all of the different fatal assertions in the tree and
all of the branches then I don't see this as addressing the issue as
there would still be classes of fatal assertions and still no clear policy.

The messages emitted to the log could not look like normal non-fatal
assertions as that would lose their distinctiveness.

They would have to produce stacks which can be accurately processed
across the platforms and branches.

I think this would lose some of the information about the fatal
assertion from the processing of the minidumps.

I don't think I would use it in crash testing.

Considering your active and continuing opposition and the lukewarm
reception my proposal has received, I withdraw it. You win.

Clint Talbert

unread,

Jun 15, 2011, 2:27:06 PM6/15/11

to Ehsan Akhgari, Bob Clary, dev-pl...@lists.mozilla.org

No, this doesn't work at all, Ehsan. I don't think it's entirely clear
what's at stake here. This is fundamentally a problem of *granularity*.
By having these fatal assertions remain fatal in the codebase for a
long period of time, we reduce the granularity of our automated crash
testing and we *cannot* provide developers with actionable intelligence
on reproduced crashes.

Bob's goal with Bughunter (the crashtest/topsite testing system) has
always been to provide actionable intelligence on reproduced crashes in
a way that gives someone a starting point to fix a crash. Fundamentally,
it's all about saving people time.

Here's what we do:
1. Users Crash on some website, we get the socorro reports and the URLs
from the crash reporter
2. Our crashtest automation reads those dumps, runs machines close to
the user's specifications (OS etc) and tries to reproduce the crash by
browsing to the reported URL.

There are several outcomes of this process that inform Bob's proposal.

Outcome 1 (happens sometimes):
We crash with the same stack - we've reproduced it, we file bugs.
Developers have a quick place to look and see that the bug was
reproduced, tied to a specific socorro signature and we have an
actionable quantity. Life is good.

Outcome 2 (happens more often):
We crash on a fatal assertion while browsing the URL (the user doesn't
see the fatal assertion because the user isn't running a debug build).
This is incredibly informative. It tells us that whatever condition the
assertion was guarding against failed, the user blew through that point
in the code and crashed in some unrecognized state down the road.

Once more, our strategy here is to file a bug, provide the amended stack
the developer will see when working on the bug, and let them know that
this is equivalent to socorro crash X. That way, the developers know
they have a reproducible case and they have an exact place to start
looking for a fix.

If we do some kind of XPCOM_ABORT_BREAK, we wouldn't see the assertion,
we'd either crash the same (or more likely) a different way than the
user and we wouldn't have any smoking gun to point developers at. We
wouldn't save anyone time. And that is our goal -- *saving developers
time*.

Now, outcome 2 has a corollary:
We report the bug from outcome 2. The bug isn't fixed -- lots of good
reasons things aren't fixed immediately, that's fine. But, our
automation will still hit that fatal assertion while trying to reproduce
various crash signatures from many different URLs. Once the automation
hits any fatal assertion, it stops and collects data for a bug report.
It cannot test past them. This keeps us at a course granularity of testing.

Now, if we demoted the assertion once folks decided that it was
unimportant, that change would roll down into Aurora and Beta. And it
would allow the automation to test at a higher granularity on those
branches. Potentially then, the automation would fail on the next fatal
assertion it finds. And that deeper assertion may point us to
conclusions that lead us to reconsider the importance of the reported
crash. Even if the deeper assertion doesn't lead us to reconsider the
importance of the crash, we would at least have better data for crashes
that our users are experiencing on Aurora/Beta.

By demoting unimportant fatal assertions, we *improve* the granularity
of our testing on our about-to-be released builds. This is what's at
stake. Without implementing a targeted, considered, plan like Bob's to
demote unimportant fatal assertions, we are willfully tying a blindfold
around our eyes and slamming the gas pedal to the floor, praying we
don't hit anything. We can do better. And I argue, we have to do better.

Clint

Boris Zbarsky

unread,

Jun 15, 2011, 2:32:11 PM6/15/11

On 6/15/11 2:27 PM, Clint Talbert wrote:
> If we do some kind of XPCOM_ABORT_BREAK, we wouldn't see the assertion,

Why not? Our test automation runs with nonfatal NS_ASSERTION but most
definitely sees assertions and reports them as test failures...

Is there a reason your automation only looks for fatal assertions right now?

-Boris

Clint Talbert

unread,

Jun 28, 2011, 1:51:24 PM6/28/11

Mostly because we're trying to reproduce crashes. When I spoke of
"seeing" I should have been more clear and said "crashing on". We want
to try to crash on the call that the user crashed on originally. But if
we crash early on some assertion that we've collectively decided isn't a
bug then we're not providing useful data.

Clint

0 new messages