Idea for getting a better total crashes per day baseline

Ozten

Nov 18, 2009, 1:30:04 PM

During the last CrashKill meeting we were talking a lot about the
baseline for ADU and the crashes per 100 users metric.

Crazy Idea:
When we send ADU, can we also send
- total number of crashes since last ADU ping

From a privacy perspective this seems very benign: it would immediately
get aggregated as crashes per day. I'm sending this idea to a wider
audience to vet the privacy implications of collecting this data.

Why?
This would resolve the issue where we can't know the total number of
crashes due to users opting out, throttling, and other unknown effects
(like virus protection, etc.).

Ozten

Nov 23, 2009, 7:40:37 PM

Please reply to give feedback.

We've also created a wiki page with more information (it also has a
discussion tab).

https://wiki.mozilla.org/Extension_Blocklisting:New_Attribute_Crashes_Since_Last_Request

Axel Hecht

Nov 23, 2009, 8:18:32 PM

Two things:

- The data to send is probably MTBF and not crashes in the last 24 hours.

- I'm not necessarily OK with gathering that data.

To the latter, we're all keen to understand better how we're doing in
fighting crashes, but gathering user data should not only correspond to
the privacy policy, it should also follow the principle of only
collecting data that one needs to do the job.

The argument why we'd need to piggyback additional data (we didn't do
our homework when playing with the ratios, and oh, some folks might hit
a checkbox) doesn't convince me.

Axel

Robert Strong

Nov 23, 2009, 8:32:29 PM

to dev-pl...@lists.mozilla.org

Seems to me like something that should be added to the data sent via
the crash reporter, especially since it is crash-related data, rather
than added to (and overloading) the blocklist.

Robert

Mike Beltzner

Nov 24, 2009, 9:09:42 AM

to rst...@mozilla.com, dev-pl...@lists.mozilla.org

I've proposed this idea before, and believe that I even filed a bug on it,
but generally agree that the time to submit data about stability is when we
specifically ask users if they want to send us that data (i.e., the crash
reporter).

There are several pieces of data we can send along with a crash that are not
personally identifiable (though some are what's considered potentially
personally identifiable):

- time since last crash
- total number of crashes experienced
- total number of crashes experienced in last 10 minutes, hour, 12 hours
and 24 hours
- crash report UUIDs of last N crashes*
- plugins and add-ons registered / loaded

(* this is potentially most controversial, as it basically gives us the
possibility to track a Firefox install across crashes, but if we process the
data properly, we can use it to generate a "duplicate crash" count)

cheers,
mike

Ozten

Nov 24, 2009, 11:36:19 AM

On Nov 23, 5:18 pm, Axel Hecht <a...@pike.org> wrote:
> On 24.11.09 01:40, Ozten wrote:
>
> > Please reply to give feedback.
>
> > We've also created a wiki page with more information (it also has a
> > discussion tab).
>
> > https://wiki.mozilla.org/Extension_Blocklisting:New_Attribute_Crashes...
>
> Two things:
>
> - The data to send is probably MTBF and not crashes in the last 24 hours.

A crash currently submits uptime. This proposed baseline allows us to
calculate a metric like "number of crashes per hundred users per day"
which could be referenced across releases.
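
As a rough illustration of that metric, here is a minimal Python sketch. It assumes each ADU ping carries one hypothetical integer field, the number of crashes since the previous ping; the function name and the sample numbers are made up for illustration and are not part of any existing ping format.

```python
def crashes_per_100_users(pings):
    """Aggregate one day's ADU pings into crashes per 100 users.

    `pings` is a list of per-install crash counts (hypothetical field:
    crashes since the previous ADU ping). Each ping counts as one user.
    """
    if not pings:
        raise ValueError("need at least one ping")
    return 100.0 * sum(pings) / len(pings)


# Made-up sample: eight installs with no crashes, one with 1, one with 3.
sample_day = [0, 0, 0, 0, 0, 0, 0, 0, 1, 3]
rate = crashes_per_100_users(sample_day)
print(f"{rate:.1f} crashes per 100 users per day")
```

Because the count rides on a ping that every reporting install sends, installs with zero crashes still enter the denominator, which is what makes the number comparable across releases.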

>
> - I'm not necessarily OK with gathering that data.
>
> To the latter, we're all keen to understand better how we're doing in
> fighting crashes, but gathering user data should not only correspond to
> the privacy policy, it should also follow the principle of only
> collecting data that one needs to do the job.
>
> The argument why we'd need to piggyback additional data (we didn't do
> our homework when playing with the ratios, and oh, some folks might hit
> a checkbox) doesn't convince me.

I agree that ADU might not be the right channel,
but I didn't state clearly in the wiki page why crash submission is
too late to fix any bias.

Old:
ADU is a shaky baseline for the following reasons
Updated:
Our current "total crashes" numbers are a shaky baseline for the
following reasons
...
By piggybacking on ADU we eliminate these opt-out issues. ADU, on the
other hand, is also opt-out. The reasoning behind this number being
more trustworthy is:

* ease of opting out in the crash reporter versus opting out of the
blocklist
* stress of "just wanting to restart"
* changes between opt-in and opt-out skewing this rate across
product/version

Updated in https://wiki.mozilla.org/Extension_Blocklisting:New_Attribute_Crashes...

Austin

>
> Axel

Daniel Veditz

Nov 24, 2009, 1:25:50 PM

On 11/23/09 5:18 PM, Axel Hecht wrote:
>
> The argument why we'd need to piggyback additional data (we didn't do
> our homework when playing with the ratios, and oh, some folks might hit
> a checkbox) doesn't convince me.

Even when throttling is completely disabled the current data gives us no
clue to the distribution of crashes. Is everyone suffering equally? What
are the peaks and valleys?

Mostly we want to know this about particular crashes, so piggybacking on
the blocklist is too broad. To do that kind of analysis we'd need UUIDs
with the crash data, which we used to have and rejected on privacy grounds.

Sending this data through a different channel does have one useful
feature for reliability measures, though -- everyone will report it.
With the current system we can calculate an average uptime, but that's
not the same as MTBF. If 99.9% of our users were rock solid and never
crashed but 0.1% crashed every hour, our calculated uptime of 60 minutes
would have no relation to overall stability. Although I guess it is an
accurate reflection of how unhappy the crashy people are.

Crashes per user isn't MTBF either except maybe in broad calendar terms.
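
The gap between average reported uptime and fleet-wide MTBF in that 99.9% / 0.1% example can be sketched in a few lines of Python. The population size (1000 users) and the 24 hours of use per user are assumptions added here to make the arithmetic concrete; only the percentages come from the message above.

```python
def fleet_stats(users):
    """users: list of (hours_run, crash_count) pairs, one per user.

    Returns (average uptime seen in crash reports, pooled MTBF), in hours.
    """
    total_hours = sum(h for h, _ in users)
    total_crashes = sum(c for _, c in users)
    if total_crashes == 0:
        raise ValueError("no crashes: neither figure is defined")
    # Crash reports only exist for users who crashed. Assume a crashing
    # user's hours are split evenly across that user's crashes.
    crash_hours = sum(h for h, c in users if c > 0)
    reported_avg_uptime = crash_hours / total_crashes
    pooled_mtbf = total_hours / total_crashes
    return reported_avg_uptime, pooled_mtbf


# Assumed scenario: 1000 users each run the browser for 24 hours;
# 999 never crash and 1 user (0.1%) crashes once an hour.
users = [(24.0, 0)] * 999 + [(24.0, 24)]
avg_uptime, mtbf = fleet_stats(users)
print(f"average reported uptime: {avg_uptime} h, pooled MTBF: {mtbf} h")
```

Under these assumptions the crash reports alone suggest a failure every hour, while total running time divided by total crashes is three orders of magnitude larger.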

Jonas Sicking

Jan 8, 2010, 5:06:39 PM

On 11/24/2009 8:36, Ozten wrote:
> On Nov 23, 5:18 pm, Axel Hecht<a...@pike.org> wrote:
>> On 24.11.09 01:40, Ozten wrote:
>>
>>> Please reply to give feedback.
>>
>>> We've also created a wiki page with more information (it also has a
>>> discussion tab).
>>
>>> https://wiki.mozilla.org/Extension_Blocklisting:New_Attribute_Crashes...
>>
>> Two things:
>>
>> - The data to send is probably MTBF and not crashes in the last 24 hours.
>
> A crash currently submits uptime. This proposed baseline allows us to
> calculate a metric like "number of crashes per hundred users per day"
> which could be referenced across releases.

This is not enough information to calculate MTBF. You also need to
account for times when the user started firefox, ran it for a long time,
and then shut it down, without ever crashing.

As an extreme example, imagine a bug where we occasionally crash on
startup, but never during a run. This would mean that every time we
submit crash data, uptime is on the order of a few seconds, so MTBF
would look like it was just a few seconds. In reality, however, the
user usually runs for several hours or days without crashing.

I *think* this is currently the biggest problem we have with calculating
MTBF.
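
The startup-crash bias can be illustrated with a small Python sketch. The `Session` type and the session counts and durations below are hypothetical, chosen only to mirror the extreme example; the point is that crash reports never see the sessions that ended cleanly.

```python
from dataclasses import dataclass


@dataclass
class Session:
    uptime_hours: float
    crashed: bool


def mtbf_from_crash_reports(sessions):
    """Mean uptime over crashed sessions only (all that crash data shows)."""
    crashed = [s.uptime_hours for s in sessions if s.crashed]
    return sum(crashed) / len(crashed)


def mtbf_from_all_sessions(sessions):
    """Total uptime across every session divided by the number of crashes."""
    crashes = sum(1 for s in sessions if s.crashed)
    return sum(s.uptime_hours for s in sessions) / crashes


# Assumed numbers: 2 startup crashes after about 5 seconds each,
# plus 8 clean 10-hour runs that never produce a crash report.
sessions = [Session(5 / 3600, True)] * 2 + [Session(10.0, False)] * 8

print(f"from crash reports: {mtbf_from_crash_reports(sessions) * 3600:.0f} s")
print(f"from all sessions:  {mtbf_from_all_sessions(sessions):.1f} h")
```

With these made-up sessions the crash-report view gives about 5 seconds, while counting the clean runs gives roughly 40 hours between failures.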

/ Jonas

Cheng Wang

Jan 11, 2010, 4:50:17 PM

to Jonas Sicking

I'm with Jonas, if we're submitting num_crashes with ADU pings, I'd say
we also have to submit uptime hours and/or num_starts. Unfortunately, I
think both these stats are more privacy-sensitive so I don't know what
the best course of action is.

Axel, I'd say the most compelling argument in favor of this idea is that
we have no idea how many crashes don't trigger the crash reporter. We may or
may not have a Mac Flash crash that at least the majority of the time
doesn't trigger crash reporter. We know nothing about these users and
assume that they're just chugging along with their ADUs without
crashing. Yes this is a problem with the crash reporter but we won't
know how much of a problem this is until we get some kind of data from
these users.

Cww.

Ted Mielczarek

Jan 11, 2010, 5:05:48 PM

to Cheng Wang, dev-pl...@lists.mozilla.org

On Mon, Jan 11, 2010 at 4:50 PM, Cheng Wang <c...@mozilla.com> wrote:

>
> Axel, I'd say the most compelling argument in favor of this idea is that we
> have no idea how many crashes don't trigger the crash reporter. We may or may not
> have a Mac Flash crash that at least the majority of the time doesn't
> trigger crash reporter. We know nothing about these users and assume that
> they're just chugging along with their ADUs without crashing. Yes this is a
> problem with the crash reporter but we won't know how much of a problem this
> is until we get some kind of data from these users.
>
>

We won't get this data from submitting "num_crashes" or anything with ADU
anyway. If we're not handling the crash, we're not recording anything about
it, so how can we report anything about it?

-Ted

Robert O'Callahan

Jan 12, 2010, 3:52:43 PM

We could record starts and shutdowns, and submit some kind of report
when the browser starts and we notice that we did not observe a
shutdown, nor did we observe a crash report.

Of course, this would include things like power-off and force-quits.
Maybe we could detect whether there was a reboot since the previous
browser session started.
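
A minimal Python sketch of the start/shutdown bookkeeping described above. Everything here is hypothetical: the `SessionTracker` class, the `session_state.json` file name, and the idea of keeping the state in the profile directory are assumptions for illustration, not how the browser actually stores this. Reboot detection is left out.

```python
import json
import tempfile
from pathlib import Path


class SessionTracker:
    """Tracks whether the previous session ended with a recorded
    shutdown or crash report (hypothetical JSON state file)."""

    def __init__(self, profile_dir):
        self.state_file = Path(profile_dir) / "session_state.json"

    def _load(self):
        if self.state_file.exists():
            return json.loads(self.state_file.read_text())
        return {"running": False, "crash_reported": False,
                "unexplained_exits": 0}

    def _save(self, state):
        self.state_file.write_text(json.dumps(state))

    def on_start(self):
        """Call at startup; returns the unexplained-exit count so far."""
        state = self._load()
        # The previous session recorded neither a shutdown nor a crash.
        if state["running"] and not state["crash_reported"]:
            state["unexplained_exits"] += 1
        state["running"] = True
        state["crash_reported"] = False
        self._save(state)
        return state["unexplained_exits"]

    def on_crash_report(self):
        state = self._load()
        state["crash_reported"] = True
        self._save(state)

    def on_clean_shutdown(self):
        state = self._load()
        state["running"] = False
        self._save(state)


with tempfile.TemporaryDirectory() as profile:
    tracker = SessionTracker(profile)
    tracker.on_start()                # session 1 starts
    tracker.on_clean_shutdown()       # and exits normally
    tracker.on_start()                # session 2 starts, then dies silently
    unexplained = tracker.on_start()  # session 3 notices the missing shutdown
    print(unexplained)
```

As noted above, this counter cannot tell an unhandled crash from a power-off or a force-quit; it only records that a session ended without leaving a trace.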

Rob

Mike Beltzner

Jan 12, 2010, 6:11:14 PM

to Robert O'Callahan, dev-pl...@lists.mozilla.org

On 2010-01-12, at 3:52 PM, Robert O'Callahan wrote:

> We could record starts and shutdowns, and submit some kind of report when the browser starts and we notice that we did not observe a shutdown, nor did we observe a crash report.

I'm confused as to why we'd report the raw data instead of having the client calculate an MTBF value and just submit that.

cheers,
mike

Mike Shaver

Jan 12, 2010, 10:12:21 PM

to Mike Beltzner, dev-pl...@lists.mozilla.org, Robert O'Callahan

On Tue, Jan 12, 2010 at 6:11 PM, Mike Beltzner <belt...@mozilla.com> wrote:
> On 2010-01-12, at 3:52 PM, Robert O'Callahan wrote:
>

>> We could record starts and shutdowns, and submit some kind of report when the browser starts and we notice that we did not observe a shutdown, nor did we observe a crash report.
>

> I'm confused as to why we'd report the raw data instead of having the client calculate an MTBF value and just submit that.

Can we get from per-user MTBF to aggregate MTBF? It seems like we
need to know the total time and failures if we're going to
appropriately weight:

- me seeing an MTBF of 10 seconds because I started it once, it
crashed, and I stopped using it (10 seconds, 1 crash)
- you seeing an MTBF of 200 hours, because you leave the browser up
all the time and get this one crash every now and then on
secretsofbugzillamaster.com/forums (4000 hours, 20 crashes)

If we see the total time and crashes, we get an MTBF of about 190
hours (4000 hours of use, 21 crashes). If we just average the two
MTBFs, for a 2-user base, we get an MTBF of about 100 hours.

But there may be math tricks available to someone who actually knows
math, meaning not me.
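
The two-user arithmetic can be checked with a short Python sketch. The function names are illustrative. It also shows one candidate "trick": weighting each per-user MTBF by that user's crash count gives back total time over total crashes, which means the client would still have to send a crash count alongside its MTBF.

```python
def pooled_mtbf(users):
    """users: list of (total_hours, crashes). Total time / total failures."""
    return sum(h for h, _ in users) / sum(c for _, c in users)


def mean_of_mtbfs(users):
    """Unweighted average of each user's own MTBF (needs crashes > 0)."""
    return sum(h / c for h, c in users) / len(users)


def crash_weighted_mean(users):
    """Average of per-user MTBFs, weighted by each user's crash count."""
    weighted = sum((h / c) * c for h, c in users)
    return weighted / sum(c for _, c in users)


# The two users from the example: (10 seconds, 1 crash) and
# (4000 hours, 20 crashes).
users = [(10 / 3600, 1), (4000.0, 20)]

print(f"pooled:         {pooled_mtbf(users):.1f} h")
print(f"mean of MTBFs:  {mean_of_mtbfs(users):.1f} h")
print(f"crash-weighted: {crash_weighted_mean(users):.1f} h")
```

Note that a per-user MTBF is undefined for anyone who has never crashed, so this weighting only covers users with at least one crash.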

Mike

Boris Zbarsky

Jan 12, 2010, 10:51:59 PM

On 1/12/10 10:12 PM, Mike Shaver wrote:
> Can we get from per-user MTBF to aggregate MTBF? It seems like we
> need to know the total time and failures if we're going to
> appropriately weight
>
> - me seeing an MTBF of 10 seconds because I started it once, it
> crashed, and I stopped using it (10 seconds, 1 crash)
> - you seeing an MTBF of 200 hours, because you leave the browser up
> all the time and get this one crash every now and then on
> secretsofbugzillamaster.com/forums (4000 hours, 20 crashes)
>
> If we see the total time and crashes, we get an MTBF of about 190
> hours (4000 hours of use, 21 crashes). If we just average the two
> MTBFs, for a 2-user base, we get an MTBF of about 100 hours.

It really depends on how we want to define "MTBF"....

I suspect that for any sort of sane definition, we do need to know at
least the MTBF and the total time (or the total time and number of
crashes, or anything equivalent to those two data points) for each
user. Just directly averaging user MTBF values will, as shaver points
out, over-weight the users who use the browser less.

On the other hand, I do wonder what it is we're trying to measure. If
we have one user who's crashing every 10 seconds and only using the
browser an hour a week and one user who is using the browser 15 hours a
day and never crashing.... what number do we actually want to get out of
our MTBF-calculation process?
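
To make that question concrete, here is a small Python sketch of the two-user scenario. The one-week window and the resulting crash count (one crash every 10 seconds over an hour of use is 360 crashes) are assumptions derived from the figures above, and the names are illustrative.

```python
import math


def user_mtbf(hours, crashes):
    """Per-user MTBF in hours; infinite for a user who never crashed."""
    return hours / crashes if crashes else math.inf


# One assumed week of use, as (hours_run, crashes):
#   crashy user: 1 hour a week, crashing every 10 seconds -> 360 crashes
#   stable user: 15 hours a day for 7 days -> 105 hours, no crashes
week = {"crashy": (1.0, 360), "stable": (15.0 * 7, 0)}

total_hours = sum(h for h, _ in week.values())
total_crashes = sum(c for _, c in week.values())
pooled_minutes = 60 * total_hours / total_crashes

for name, (hours, crashes) in week.items():
    print(name, user_mtbf(hours, crashes))
print(f"pooled MTBF: {pooled_minutes:.1f} minutes")
```

Under these assumptions the pooled figure comes out near 18 minutes, a number that describes neither user's actual experience.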

Fundamentally, the concept of MTBF somewhat assumes that failures are
randomly distributed and that the distributions are the same for all
users. Both assumptions are, of course, false (except maybe for the
crashes triggered by flash ads).

-Boris