
Anonymous metrics collection from Firefox


Benjamin Smedberg

Feb 6, 2012, 1:56:53 PM
to mozilla.dev.planning group
There has been a project being worked on for some time to collect
metrics from Firefox installations in an "on by default" manner. This is
different from off-by-default telemetry. I became aware of this project
recently when I was asked to review some implementation code, and I have
some concerns about our privacy stance in this feature. Because the bugs
are getting a bit out of hand, I wanted to move the discussion to the
proper newsgroup.

For background, the feature page (not strictly a feature page) is here:
https://wiki.mozilla.org/MetricsDataPing

Note that this page contains data from several different authors and
isn't a coherent proposal page any more. See the wiki history for
context if necessary.

The tracking bug is https://bugzilla.mozilla.org/show_bug.cgi?id=718066
from which several other bugs (core implementation, preference UI) are
available.

I understand that this opt-out data collection is vastly superior to
telemetry in terms of collecting a representative sample and controlling
for bias. But it's not clear to me why that makes it "ok" from a privacy
perspective, compared with telemetry, to make this opt-out instead of
opt-in. Just from my personal experience, I would be surprised by any
data submitted by Firefox to Mozilla which was not part of regular
Firefox functionality (app update seems pretty straightforward,
extension update also, crash submission is opt-in). It seems that if
this data submission contains any information which is potentially
personally identifying, then it would be a "surprise". As already
identified in the bug, there are so many different ways in which data
can be potentially identifying:

* unique sets of themes (theme collection was removed)
* unique sets of addons (addon collection is still proposed)
* the unique IDs used to keep track of particular installations can
potentially track data back to users (note that the UUID proposal has
changed somewhat due to privacy concerns, but that there is still a
local ID -> remote data mapping)

A fair bit of the proposal is focused on how we would be protecting and
anonymizing the data. But if we're not actually collecting personally
identifiable data, why couldn't we make the entire server system public
and queryable? It seems that any system that requires server-side
anonymization to meet user privacy expectations is an unexpected privacy
risk. Might it also open up our users to potential tracking via court
order (search warrants) from both U.S. courts and whatever countries we
put data centers in?

It seems as if we are saying that since we already collect most of this
data via various product features, that makes it ok to also collect this
data in a central place and attach an ID to it. Or, that because we
*need* this data in order to make the product better, it's ok to collect
it. This makes me intensely uncomfortable. At this point I think we'd be
better off either collecting only the data which cannot be used to track
individual installs, or not implementing this feature at all.

Note that while Ben Bucksch has also brought up legal concerns about
whether German or European law forbids this kind of data collection, I'm
not particularly interested in that portion of the discussion because very
few of us in the project are legal experts who can have an informed
opinion. So please let's avoid ratholing on those legal issues and stay
focused on the basic privacy issue.

--BDS

Dao

Feb 6, 2012, 2:24:55 PM
to
I was just going to post this to bug 718066, now commenting here instead:

(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #54)
> (In reply to Dão Gottwald [:dao] from comment #52)
> > I'd consider add-ons problematic, partly because the IDs alone can let you
> > track down a person, partly because the use of some add-ons could be illegal
> > in some countries. I also second Ben's view that IP addresses + GUIDs need
> > to be considered personally identifiable information. You say you don't
> > store IP addresses, but this just brings us back to good intentions vs.
> > systems that inherently protect privacy by just not sending out problematic
> > data.
>
> Based on your feedback, we removed persona and theme IDs from the list of
> data submitted. We also implemented the honoring of the setting that an
> add-on developer can put into the manifest to prevent submitting the add-on
> ID to Mozilla services. That preference was originally set up as part of
> the services.addons.mozilla.org features that support the Add-on manager.

There's no direct link between the use of an add-on being illegal in
some country and the developer setting that pref. In general, I wouldn't
count on people setting that pref.

> > The client has the list of installed add-ons, knows about crashes and could
> > be told what to consider "slow". Providing it with a list of add-ons that
> > generally tend to be problematic would probably cover 99.9+%. It's unclear
> why this requires fine-grained data from hundreds of millions of users.
>
> That presumes that we can know with accuracy what add-ons tend to be
> problematic for most of our users. If we don't collect data from the
> general usage base, the best we could ever hope to know is what AMO hosted
> add-ons cause problems on our own specific test machines and what add-ons
> people have told us cause problems for them.

No, there's also telemetry, which I think we haven't fully utilized yet.
I don't see how some user selection bias would hinder linking add-ons
with performance and stability problems.

(In reply to Blake Cutler from comment #57)
> 2) I didn't say Mozilla is going to die. I implied it's headed toward
> irrelevance. Let's look at the numbers:
> * Webkit's market share is already 10 points higher than Gecko's.
> * Gecko is losing .5% market share per month and has no meaningful presence
> on mobile devices.
> * Webkit is gaining over 1% market share per month and dominates mobile
> browsing.
> * Mobile browsing is rapidly overtaking desktop browsing (gaining nearly 1%
> share per month)

It's unclear how the proposed Metrics Data Ping would change this. See
again the questions I asked in
<https://bugzilla.mozilla.org/show_bug.cgi?id=718066#c35>.

Ben Bucksch

Feb 6, 2012, 2:47:27 PM
to
On 06.02.2012 19:56, Benjamin Smedberg wrote:
> There has been a project being worked on for some time to collect metrics from Firefox installations in an "on by default" manner. This is different from off-by-default telemetry. I became aware of this project recently when I was asked to review some implementation code, and I have some concerns about our privacy stance in this feature. Because the bugs are getting a bit out of hand, I wanted to move the discussion to the proper newsgroup.
>
> For background, the feature page (not strictly a feature page) is here: https://wiki.mozilla.org/MetricsDataPing
>
> Note that this page contains data from several different authors and isn't a coherent proposal page any more. See the wiki history for context if necessary.
>
> The tracking bug is https://bugzilla.mozilla.org/show_bug.cgi?id=718066 from which several other bugs (core implementation, preference UI) are available.
>
> I understand that this opt-out data collection is vastly superior than telemetry in terms of collecting a representative sample and controlling for bias. But it's not clear to me why that makes it "ok" from a privacy perspective, compared with telemetry, to make this opt-out instead of opt-in. Just from my personal experience, I would be surprised by any data submitted by Firefox to Mozilla which was not part of regular Firefox functionality (app update seems pretty straightforward, extension update also, crash submission is opt-in). It seems that if this data submission contains any information which is potentially personally identifying, then it would be a "surprise". As already identified in the bug, there are so many different ways in which data can be potentially identifying:
>
> * unique sets of themes (theme collection was removed)
> * unique sets of addons (addon collection is still proposed)
> * the unique IDs used to keep track of particular installations can potentially track data back to users (note that the UUID proposal has changed somewhat due to privacy concerns, but that there is still a local ID -> remote data mapping)

Thanks, Benjamin.

A few additions:
  • Fingerprinting: The data we submit under the current proposal from the Metrics group is highly fingerprintable. For example, it has not only the list of addons (which in many cases will already be unique in its combination, or even pinpoint company association with custom addons), but also the install date of each addon.
  • UUID: The "document UUID" proposal (actually simply a submission ID) sends the previous submission ID as well, which allows the server to trivially connect them together and still have a server-side UUID. The submission ID may have some advantages in some cases, but it doesn't remove the ability to track individual users.


On fingerprinting: I doubt that we really critically need all of that data to answer the most pressing questions. More data can always be nice and justified somehow, but it's not necessarily critical.

On the UUID: I also think that there are solutions that don't track individual users. I proposed one, one that even allows us to see when users stopped using Firefox. See https://wiki.mozilla.org/MetricsDataPing#Anonymous_alternative

---

Another way to limit the privacy impact is to take only a representative sample. Instead of collecting the data from all 200,000,000 users, we pick a random (!) sample of about 10,000. Concretely: if (!pref.userSet()) pref.set(Math.random() * 20000 < 1), which is true with probability 1/20,000. If true, submit; otherwise, no data collection. Given that the sample is random, it is statistically representative.
It makes a huge difference whether you collect data from 200,000,000 people or just 10,000.
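
As a rough sketch (the pref wrapper below is hypothetical; the real
preferences service calls may differ), the selection could look like this:

  // Tiny in-memory stand-in for the preferences service (illustration only).
  const prefs = {
    values: new Map(),
    userSet(name) { return this.values.has(name); },
    set(name, value) { this.values.set(name, value); },
    get(name) { return this.values.get(name); },
  };

  const SAMPLE_ONE_IN = 20000; // roughly 10,000 out of ~200,000,000 users

  function maybeEnableMetrics() {
    if (!prefs.userSet("metrics.sample.enabled")) {
      // Decided once, with probability 1/20,000, then persisted.
      prefs.set("metrics.sample.enabled", Math.random() * SAMPLE_ONE_IN < 1);
    }
    return prefs.get("metrics.sample.enabled");
  }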

Again, you can find arguments why it's better to get a lot more data, but when you consider the user interest of privacy, I think that's a fair balance of needs.

---

I would like to add that this feature has serious potential to actively decrease Firefox's market share. Firefox is biggest in Europe, and from what I know it still has the largest market share there. The reason people here in Europe use Firefox is mostly philosophical, including privacy. It is not so much pure technical merit that wins users; that is only the second priority. Now, if users get the idea that Firefox is not dramatically and fundamentally different from, say, Google Chrome, then they see no reason to be loyal to Firefox and will switch to Chrome.

This project will make for very bad news, that is almost certain. The Telemetry question already gave a bad impression.

This project has a very real risk of actively decreasing the market share that it is trying to preserve.

----

There are other ways to get the needed data without offending users. I propose to 1) remove the UUID and use the algorithm I proposed, which still allows us to gather the critically needed data without tracking users, 2) remove any data which has a high likelihood of being unique for fingerprinting purposes, and 3) reduce the collection to a random sample of 10,000.

If all 3 are done, I would have a good conscience that this is a good balance between the need for data for product decisions and users' interest in privacy, and I'd even be OK with an opt-out. But only if the tracking of individual users is removed and the sample is limited to 10,000.

Ben

Ben Bucksch

Feb 6, 2012, 2:50:05 PM
to
On 06.02.2012 20:47, Ben Bucksch wrote:
> To UUID: I also think that there are solutions without tracking
> individual users. I proposed one, one that even allows to see when
> users stopped using Firefox. See
> https://wiki.mozilla.org/MetricsDataPing#Anonymous_alternative

Sorry for posting this again, but Usenet lives a lot longer than Wikis
or bugzilla, so here's a copy for future reference.


Anonymous alternative

The following is an alternative approach, proposed by Ben Bucksch:

For simplicity, I will take the number of crashes (e.g. in the last week
or overall) as the data point that you want to gather. The data itself is
anonymous and cannot (apart from fingerprinting, more on that later)
identify a single user.


Avoiding UUID

You wanted to know which profiles are not used anymore (dormant,
retention problem) and which characteristics they have. This is
inherently difficult without tracking individual users (installations),
but it is possible with the following algorithm:

The client submits:

* Date of last submission - e.g. 2012-01-18
* Current date (from client perspective) - only date, not time - e.g.
2012-01-20
* Age of profile (Firefox installation) in days - e.g. 500
* (Last submitted age is implied or explicit - e.g. 498 )
* Number of crashes - e.g. 15
* Number of crashes submitted last time - e.g. 10

Then, on the server, you write that information in a database, as such:

Date of submission | Age of installation | Crash count | Number of users
2012-01-20 | 500 | 15 | 100000

Any additional user who also submits the same combination today ("age 500,
crash count 15") increases the "number of users" column by 1; the new value
is 100001. Also, you look up the row for that user's last submission, namely

2012-01-18 | 498 | 10 | 20000

and decrease the number of users by 1, new value is 19999.

If the user decides later that day that there were too many crashes and
switches to Chrome, he will now be stranded on the row

2012-01-20 | 500 | 15 | 5000

while other users who have continued to use FF have been subtracted
after a while. So, you can say with certainty that there were 5000 users
who used Firefox for the last time on 2012-01-20, after having used Firefox
for 500 days, and they had 15 crashes (per day/week/total, whatever you
submit) when they stopped using Firefox.

That is exactly the information you are so desperately seeking. There
you have it. Without tracking any individual user: it's completely
anonymous.
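
As a rough sketch of that server-side bookkeeping (an in-memory map stands
in for the real database; all names are illustrative):

  // Key "submissionDate|profileAgeDays|crashCount" -> number of users on that row.
  const rows = new Map();

  function bump(date, age, crashes, delta) {
    const key = `${date}|${age}|${crashes}`;
    rows.set(key, (rows.get(key) || 0) + delta);
  }

  // One submission: count the user on today's row and remove them from the
  // row they reported last time, so each installation is counted exactly once.
  function handleSubmission(s) {
    bump(s.currentDate, s.profileAgeDays, s.crashCount, +1);
    if (s.lastSubmissionDate !== null) {
      bump(s.lastSubmissionDate, s.lastProfileAgeDays, s.lastCrashCount, -1);
    }
  }

  // The example from above:
  handleSubmission({
    currentDate: "2012-01-20", profileAgeDays: 500, crashCount: 15,
    lastSubmissionDate: "2012-01-18", lastProfileAgeDays: 498, lastCrashCount: 10,
  });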


Avoiding Fingerprinting

Now, what about all the other information that you need: startup times,
addons, etc.? If we just add all that information to the same table and
row, it would allow fingerprinting. But that is not necessary. You
merely make one table per atomic piece of information, i.e.

Table A
Date of submission | Age of installation | Crash count | Number of users
Table B
Date of submission | Age of installation | Startup time | Number of users

or of course whatever other database schema you want, as long as each
value is separate. That takes care of the fingerprinting.

At least on the server side, not on the submission side. I would have to
trust you, and anything between you and me. It would be possible to
separate the calls and submit each value separately, but I think that
would be overdoing it.


beltzner

Feb 6, 2012, 3:12:22 PM
to Benjamin Smedberg, mozilla.dev.planning group
The wiki page is pretty clear about goals for the feature ("ability to
measure adoption, retention, stability, performance, and aggregated
search counts by engine") as well as requirements for success. What
it's lacking, other than caveats and warnings scattered throughout
the documentation, is a statement of which privacy principles those
requirements must be evaluated against.

Recently Ben Adida posted on the Mozilla Privacy Blog
(http://blog.mozilla.com/privacy/2012/01/13/mozilla-to-offer-new-user-centric-services-in-2012/)
outlining a series of design guidelines to use when designing new
features, and committing Mozilla to a basic policy of "no surprises,
real choices, sensible settings, limited data, and user control." I
think that the Data Safety Team he outlines in that post should
evaluate the proposal (once it reaches a final stage, see below!)
using those guidelines and make a judgement on whether or not it
meets the plain-language policy as stated.

The other thing the wiki page is lacking is an understanding of who is
running the project aside from the "metrics team." A clear project
owner should be identified, I think, so that we can better know what's
in plan, in flux, etc. Once there's a final proposal about what's to
happen, it can be judged and evaluated from a privacy perspective.

Our shared goal should be to try and design a system by which we
accomplish the laudable goals and requirements of the metrics team
(plainly: better understanding our product, its users, and how it's
being used) in a way that meets our high standards for data
sovereignty and privacy. We must build a better mousetrap. I suggest
people look to the Crash Stats efforts to this end, as they have long
avoided privacy-invasive actions (at non-trivial cost) while still
mining the available data to gain significant understanding of our
broad user base's experience with the browser.

Finally, and my own personal $0.02 on the issue: I think there are
ways of pre-cleaning data so that you get the benefit of aggregate
data collection (double-blinding, binning and grouping, etc), and the
easiest way to figure those out is to begin with the question:
what is the end state we're trying to get to? No data should be
collected without understanding exactly how that data will be
presented to its consumers; that way you can be sure to only collect
the minimum amount of data required to answer the question.
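
To give just one illustration of that kind of pre-cleaning (the bucket
edges here are invented, not a proposal), a value like startup time could
be binned on the client before anything is sent:

  // Report a coarse bucket instead of an exact startup time in milliseconds.
  // Bucket edges are made up purely for illustration.
  const BUCKETS = [1000, 2000, 5000, 10000];

  function binStartupTime(ms) {
    for (const edge of BUCKETS) {
      if (ms < edge) return "<" + edge + "ms";
    }
    return ">=" + BUCKETS[BUCKETS.length - 1] + "ms";
  }

  // binStartupTime(3200) -> "<5000ms"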

cheers,
mike

ps: let's remember that we're all on the same team here, and all want
what's best for Firefox and its users; think carefully before writing,
and always assume the best of your colleagues and community members
when participating in this discussion!

Ben Bucksch

Feb 6, 2012, 3:30:39 PM
to
Blake Cutler wrote on
https://bugzilla.mozilla.org/show_bug.cgi?id=718066#c57 :
> The short answer is that correlation is not causation.

How do you want to get causation, and *not* by correlation, from the
data delivered by your proposal? I think that's impossible, but maybe
I'm missing something. (If so, maybe I can improve my proposal.)

E.g. you may see that all users of a certain addon stop using Firefox.
But maybe that's just a custom internal addon that a company created,
and the CEO decided to switch to MSIE because he played golf with
somebody. The cause is a bribe, not a technical one.

Also see the case of a government agency recommending Google Chrome that
you mentioned yourself. The agency *said* in the announcement what the
reason was, and it was only one: Chrome's sandbox, which is better than that
of competitors and leads to an inherently more secure browser. So,
users switch to Chrome as a result of that recommendation (or it is one of
their reasons). You will never get that cause from metrics.

---

I think: If you want the cause, just 1) listen to people when they
scream at you, and 2) ask them with surveys (small random set, free-form
answers, not multiple choice), that's the only way. Mozilla has been a
bit too stubborn recently, and more metrics data is not going to turn
the ship around. Listening to users is.

Justin Lebar

Feb 6, 2012, 3:58:03 PM
to dev-pl...@lists.mozilla.org
Daniel Einspanjer wrote in comment 26
(https://bugzilla.mozilla.org/show_bug.cgi?id=718066#c26):

> I just need to make one small clarification. I am happy to have more people looking
> at the problem and challenge, and I would love to see a mechanism that provides a
> feasible alternative to the current ID-centric solution. The only thing I can honestly
> promise is to collaborate on the thinking of, and consider such a solution if
> presented, and if it meets the stated needs of the project

I think this is the wrong way of looking at this discussion.

The question must not be "is there a better way", but rather "is this
way acceptable"? We need to be careful not to take this project as a
fait accompli.

Yeah, it sucks that we can't tell why people stop using Firefox. But
our principles are more important than that.

To that end, the discussion shouldn't center on why these metrics are
important or difficult to obtain another way. The discussion is about
whether we can at once collect the proposed metrics and stay true to
our values. If we can't, then we can't collect the data, no matter
how important it may be.

If the current proposal is in violation of our values, it's up to the
metrics team (and whoever wants to help) to come up with an
alternative. It is explicitly *not* up to those of us opposing the
current proposal to propose an alternative.

I think bsmedberg laid out a good case for why the proposal is
troubling. I'm curious to hear the metrics team respond to his
points, again *without* referencing the critical need for the data.

-Justin
> _______________________________________________
> dev-planning mailing list
> dev-pl...@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-planning

blake....@gmail.com

Feb 6, 2012, 4:04:18 PM
to
We can better determine causation because we can build models that account for multiple product and usage dimensions at once.

i.e. retention = # crashes + startup speed + sync use + # add-ons + ...
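
Purely as an illustration of what such a model might look like (the
functional form, features, and coefficients here are placeholders, not
the actual analysis):

  // A toy linear model of retention as a function of per-installation
  // features. The weights (w) would come from fitting real data.
  function retentionScore(f, w) {
    return w.intercept +
           w.crashes * f.crashes +
           w.startupMs * f.startupMs +
           w.usesSync * (f.usesSync ? 1 : 0) +
           w.addonCount * f.addonCount;
  }

  // Example with made-up numbers:
  // retentionScore({ crashes: 2, startupMs: 3200, usesSync: true, addonCount: 5 },
  //                { intercept: 1, crashes: -0.2, startupMs: -0.0001,
  //                  usesSync: 0.3, addonCount: -0.05 })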

Collecting data this way is not sufficient to turn Firefox growth around, but I believe it is necessary. For the first time, Mozilla will have concrete answers to important, long-standing questions. Answers that Mozilla's competitors already have.

It's not about gathering more metrics. It's about collecting and analyzing metrics correctly.

That's not to say there isn't room for improvement. I like your ideas on sampling, for example.

Joshua Cranmer

Feb 6, 2012, 4:18:56 PM
to
On 2/6/2012 12:56 PM, Benjamin Smedberg wrote:
> I understand that this opt-out data collection is vastly superior than
> telemetry in terms of collecting a representative sample and
> controlling for bias.

This part troubles me a bit. I do realize that opt-in data collection
does have a bias, but do we have any reason to expect that any data we
would collect opt-out would be affected by this bias to a degree that it
would change decision making processes?

Opt-out is a scary thing, especially when Mozilla has a brand reputation
for user privacy. As a bit of a data junkie myself, I can definitely see
the temptation to want data just to answer questions. But I think
any data that would have to be collected opt-out needs to satisfy these
guidelines:

1. The data needs to be useful in answering a specific question.
2. This question needs to be identified as one whose answer matters: the
answer needs to be crucial to some active policy discussion (like "do we
drop support for X feature?")
3. The data cannot be reliably collected or estimated from other means.
It shouldn't be a case of "we suspect that this is most likely the case,
but we need confirmation first"; it needs to be "we have no idea".
4. Collecting opt-in would create serious bias that cannot be overcome.

Reading the page and skimming the bug has started to give me the
impression that the data being collected is more oriented toward "just in
case" or "we want hard numbers to back up what we know", which is
definitely not the kind of collection we want to be encouraging.

Daniel E

Feb 6, 2012, 5:16:45 PM
to
On Feb 6, 1:56 pm, Benjamin Smedberg <benja...@smedbergs.us> wrote:
> I understand that this opt-out data collection is vastly superior than
> telemetry in terms of collecting a representative sample and controlling
> for bias. But it's not clear to me why that makes it "ok" from a privacy
> perspective, compared with telemetry, to make this opt-out instead of
> opt-in. Just from my personal experience, I would be surprised by any
> data submitted by Firefox to Mozilla which was not part of regular
> Firefox functionality (app update seems pretty straightforward,
> extension update also, crash submission is opt-in). It seems that if
> this data submission contains any information which is potentially
> personally identifying, then it would be a "surprise". As already
> identified in the bug, there are so many different ways in which data
> can be potentially identifying:
>
> * unique sets of themes (theme collection was removed)
> * unique sets of addons (addon collection is still proposed)
> * the unique IDs used to keep track of particular installations can
> potentially track data back to users (note that the UUID proposal has
> changed somewhat due to privacy concerns, but that there is still a
> local ID -> remote data mapping)

It is an unfortunate fact that even in the other data available to us
today, there are occasional ways in which a user can modify their
system or browser such that some private information is leaked out.
One of the best examples I can give of that is the ability to change
variables that are used in the update or blocklist checks. There are
requests to those systems that have an e-mail address in the place of
the product name ("Firefox"). There are systems that have a changeset
or bug number or username in the channel or distribution name.
Obviously these are rare cases, but we have seen them. That is why we
instituted early aggregation of the data before it goes into our data
warehouse so we can filter those identifying long tail requests out.
I would still qualify it as a surprise to some unsuspecting developer
though.

That is actually one of the things that I hope could be improved by
this system. Unlike AUS or Blocklist, this proposal has a user-facing
component that allows a user to easily see the data being sent in.
It provides actual value to the user by letting them look at statistics
about their browser and compare them to aggregates from other
installations. If a developer went into about:metrics and saw their
username in the channel field, they could take immediate action. They
could delete the data from our data warehouse, and they could change
the config of their profile so it isn't there anymore. On our end, we
would continue to do what we have always done which is to attempt to
aggregate that data and drop long tail values which we have no value
in seeing anyway.
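
As a toy sketch of that kind of long-tail filtering (the threshold and
the shape of the data here are assumptions, not our actual pipeline):

  // Drop any reported field value seen in fewer than MIN_COUNT installations,
  // e.g. a channel name that accidentally contains a developer's username.
  const MIN_COUNT = 100; // illustrative threshold only

  function filterLongTail(counts) {
    const kept = new Map();
    for (const [value, count] of counts) {
      if (count >= MIN_COUNT) kept.set(value, count);
    }
    return kept;
  }

  // filterLongTail(new Map([["release", 500000], ["jsmith-test", 1]]))
  //   -> Map { "release" => 500000 }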

> A fair bit of the proposal is focused on how we would be protecting and
> anonymizing the data. But if we're not actually collecting personally
> identifyable data, why couldn't we make the entire server system public
> and queryable? It seems that any system that requires server-side
> anonymization to meet user privacy expectations is an unexpected privacy
> risk. Might it also open up our users to potential tracking via court
> order (search warrants) from both U.S. courts and whatever countries we
> put data centers in?

It was critical for us when we proposed this system to have data
collection that was focused on the browser installation rather than
any attempt to learn anything about an individual person. If there
was any reasonable way we could get the information without using TCP/
IP and having an IP address, I would jump on trying to use that.
Since we don't have that, we have made sure that part of the proposal
was a commitment not to store the IP address with the data and we have
taken several extra steps with how we propose the data is stored and
used so that if another party were to have access to the data, it
would not be of any interest because it would have only information in
it about browser metrics and not PII.

blake....@gmail.com

Feb 6, 2012, 5:29:51 PM
to
On Monday, February 6, 2012 1:18:56 PM UTC-8, Joshua Cranmer wrote:
> Reading the page and skimming the bug has started to lead me to the
> impression that the data being collected is more oriented in "just in
> case" or "we want hard numbers to back up what we know", which is
> definitely not the kind of collection we want to be encouraging.

I understand where you're coming from, but this data isn't being collected "just in case." We need this data to 1) calculate Firefox's retention rate and 2) identify factors that drive retention.

It's hard to overstate how important these questions are right now. Firefox is rapidly losing market share everywhere in the world, Europe included.

Nicholas Nethercote

Feb 6, 2012, 6:15:13 PM
to Ben Bucksch, dev-pl...@lists.mozilla.org
On Mon, Feb 6, 2012 at 11:47 AM, Ben Bucksch
<ben.buck...@beonex.com> wrote:
>
> This project will make very bad news, that is almost certain. The Telemetry
> question already gave a bad impression.

Can you give more details about this? I haven't heard anything about it.

Nick

Ben Bucksch

Feb 6, 2012, 6:26:52 PM
to
On 07.02.2012 00:15, Nicholas Nethercote wrote:
>> The Telemetry question already gave a bad impression.
> Can you give more details about this? I haven't heard anything about it.

FYI: It's the question that comes up at the top of the browser window
when you start Firefox the second time (with a new profile). It asks you
whether you want to submit performance data etc.

It makes a bad impression on *me*, because Mozilla wants to collect data
from me. Other companies have abused "anonymous" data collection so badly
that any such question is now suspicious to me. I think that many users feel the
same. (Obviously, not asking is even worse.)

As for hard numbers about Telemetry for other people and in general, I
can't speak about that. Somebody else would need to give that information.

David E. Ross

Feb 6, 2012, 8:02:58 PM
to
On 2/6/12 10:56 AM, Benjamin Smedberg wrote [in part]:
>
> Note that while Ben Bucksch has also brought up legal concerns about
> whether German or European law forbids this kind of data collection, I'm
> not particular interested in that portion of the discussion because very
> few of us in the project are legal experts who can have an informed
> opinion. So please let's avoid ratholing on those legal issues instead
> of the basic privacy issue.

I think you have this backwards.

An enterprise the size of Mozilla must surely have attorneys on staff or
retainer. You should find out if what is proposed is legal before
expending any efforts to implement it. Besides Germany, there might be
other nations with laws impacting on this concept.

Furthermore, where such laws do not exist, Mozilla needs to have a firm
policy on how the organization would respond to a warrant or subpoena
for the data. That policy must be in place before the data collection
begins and should address not only a government's request for the data
but also a request resulting from a civil lawsuit.

--

David E. Ross
<http://www.rossde.com/>.

Anyone who thinks government owns a monopoly on inefficient, obstructive
bureaucracy has obviously never worked for a large corporation.
© 1997 by David E. Ross

beltzner

Feb 6, 2012, 8:12:09 PM
to Nicholas Nethercote, Ben Bucksch, dev-pl...@lists.mozilla.org
On Mon, Feb 6, 2012 at 6:15 PM, Nicholas Nethercote
<n.neth...@gmail.com> wrote:
> On Mon, Feb 6, 2012 at 11:47 AM, Ben Bucksch
> <ben.buck...@beonex.com> wrote:
>>
>> This project will make very bad news, that is almost certain. The Telemetry
>> question already gave a bad impression.
>
> Can you give more details about this?  I haven't heard anything about it.

I'm not sure that it's really germane to the discussion at hand - I
don't think our choices here should be governed significantly by our
fear of bad press, or our belief that the issue will not garner
significant public notice at all. We should be making our choices
based on:

- an actual need for the information (how will we use it to better
the product?)
- our ability to design a feature that meets the stated goals while
still meeting our strict stance on privacy

cheers,
mike

Justin Wood (Callek)

Feb 7, 2012, 1:28:28 AM
to mozilla.de...@googlegroups.com
blake....@gmail.com wrote:
> On Monday, February 6, 2012 1:18:56 PM UTC-8, Joshua Cranmer wrote:
>> Reading the page and skimming the bug has started to lead me to the
>> impression that the data being collected is more oriented in "just in
>> case" or "we want hard numbers to back up what we know", which is
>> definitely not the kind of collection we want to be encouraging.
>
> I understand where you're coming from, but this data isn't being collected "just in case." We need this data to 1) calculate Firefox's retention rate and 2) identify factors that drive retention.

We need to make sure the metrics data issues properly reflect our
privacy policies/plans, and are not justified with just a "we need this",
as has been said elsewhere.

> It's hard to overstate how important these questions are right now. Firefox is rapidly losing market share everywhere in the world, Europe included.

Using this logic, SeaMonkey should gather all the data about all users we
possibly can, because we have been losing market share heavily ever
since we became SeaMonkey from "the Mozilla Suite".

From where I sit, the largest cause of our market share loss is the fact
that Google has heavy brand awareness and is doing LOTS of expensive,
mostly well-done advertising campaigns. So "Google Chrome"
is interesting to users who aren't computer-savvy.

Also Microsoft is (Finally) developing a Sane IE, which means less
reason for people to install a different web browser on Windows.

Lastly Apple has a lead on Mobile in general, and we can't even offer a
Firefox for mobile, and instead we are stuck with doing a Firefox Home
to share bookmarks, while the default webkit-based browser[s] are
pulling ahead there, given the iPhone/iPad proliferation.

Now I admit my observations are not based on concrete data I can cite
right now, but are based on sporadic news research I have done, as well
as hours of TV and Internet use over the past years.

--
~Justin Wood (Callek)

Henri Sivonen

Feb 7, 2012, 3:25:54 AM
to mozilla.dev.planning group
On Mon, Feb 6, 2012 at 8:56 PM, Benjamin Smedberg <benj...@smedbergs.us> wrote:
> I became aware of this project recently when
> I was asked to review some implementation code, and I have some concerns
> about our privacy stance in this feature.
...
> I understand that this opt-out data collection is vastly superior than
> telemetry in terms of collecting a representative sample and controlling for
> bias. But it's not clear to me why that makes it "ok" from a privacy
> perspective, compared with telemetry, to make this opt-out instead of
> opt-in.

Thanks for posting this.

This reminds me of Sync and Fennec Native. First, Sync was very
carefully designed to have privacy characteristics that suit Mozilla's
stated privacy principles and those characteristics were bragged about
(which is good since the characteristics were special in the
industry). Then (as I understand it) another team than the one who had
designed the feature suggested that Fennec Native write (part of) the
Sync data to storage that could get synced to Google without the same
privacy characteristics and suggested that crypto characteristics
could be weakened in the name of ease of use (without even
demonstrating that losing the crypto would have been the key to making
the setup flow better).

Now Telemetry has been very carefully designed to have privacy
characteristics that suit Mozilla's stated privacy principles and
those characteristics have been bragged about. And then another team
comes along, treats that design as a bug, and wants to send a per-user ID
to enable longitudinal study. If doing what this metrics feature
suggests to be done was OK, surely Telemetry would already have UUIDs
and support for "longitudinal study".

It bothers me that this scenario repeats. While in general discussing
various ideas is good, having this scenario repeat makes it look like
Mozilla's privacy principles are constantly on the verge of getting
overturned instead of being something that users can trust on the long
term. (Fortunately, the Fennec Native situation turned out OK. Fennec
Native now has its own data store and the crypto flow is what it used
to be.)

As for the Germany/EU aspect: (Note the rest of this paragraph says
nothing about law. I'm not trying to play a lawyer here.) Even if
sending a UUID had no real privacy impact, sending a UUID would be
bad publicity in Europe. The usage share of Firefox is in decline.
Europe in general and Germany in particular is a place where the usage
share of Firefox is high. It seems like a bad idea to hurt that market
share in order to study metrics related to it.

--
Henri Sivonen
hsiv...@iki.fi
http://hsivonen.iki.fi/

Gervase Markham

Feb 7, 2012, 6:19:15 AM
to Daniel E
On 06/02/12 22:16, Daniel E wrote:
> It is an unfortunate fact that even in the other data available to us
> today, there are occasional ways in which a user can modify their
> system or browser such that some private information is leaked out.
> One of the best examples I can give of that is the ability to change
> variables that are used in the update or blocklist checks. There are
> requests to those systems that have an e-mail address in the place of
> the product name ("Firefox"). There are systems that have a changeset
> or bug number or username in the channel or distribution name.

I have no reason to doubt you that this happens, but there is a big
difference between designing your system to request particular data, and
accidentally receiving some of it because a user misconfigures their
browser.

If I have a web "contact me" form, and someone pastes their entire
medical history into it and hits Submit, I probably want to delete the
data - but I don't have to engineer my data handling process for content
coming from that form so that it's robust for handling medical data!

> That is actually one of the things that I hope could be improved by
> this system. Unlike AUS or Blocklist, this proposal has a user facing
> component that can allow a user to easily see the data being sent in.
> It provides an actual value to the user to let them look at statistics
> about their browser and compare them to aggregates from other
> installations. If a developer went in to about:metrics and saw their
> username in the channel field, they could take immediate action. They
> could delete the data from our data warehouse, and they could change
> the config of their profile so it isn't there anymore. On our end, we
> would continue to do what we have always done which is to attempt to
> aggregate that data and drop long tail values which we have no value
> in seeing anyway.

These sound like excellent ideas, but they don't seem to have a bearing
on the question of opt-in or the question of a unique identifier.

> It was critical for us when we proposed this system to have data
> collection that was focused on the browser installation rather than
> any attempt to learn anything about an individual person.

I'm not sure that's a distinction we can make. I am the only user of my
browser, and I'm sure that's true of lots of other people too. What can
you tell about me from my list of installed add-ons? I won't give you
the full list, but I suspect you could tell:

- I do web development of RESTful services using JSON
- I work for Mozilla
- I care about my privacy

Gerv

beltzner

Feb 7, 2012, 9:40:48 AM
to mozilla.dev.planning group
On Mon, Feb 6, 2012 at 4:04 PM, <blake....@gmail.com> wrote:
> Collecting data this way is not sufficient to turn Firefox growth around, but I believe it is necessary. For the first time, Mozilla will have concrete answers to important, long-standing questions. Answers that Mozilla's competitors already have.

That's a laudable and excellent goal; the wiki page should specify
exactly what those questions are, and how the data will be used to
answer them, *before* any action is taken to collect the data. If the
questions are indeed important and long-standing, it shouldn't be hard
to generate that list!

> It's not about gathering more metrics. It about collecting and analyzing metrics correctly.

I'm very comforted to hear that, as it implies that things are being
thought of in terms of "what do we need to know? what is the minimal
amount of data that can be collected to answer those questions?" which
is the right way to go about things.

I don't think anyone here is questioning the motives of the metrics
team, or indeed the benefit of being able to answer those questions.
Instead, we're trying to review the proposal so that we can do this
better than any other company, and prove that the questions can be
answered in a way that is sensitive to individual privacy and data
collection processes.

cheers,
mike

Daniel E

Feb 7, 2012, 9:32:48 AM
to
On Feb 6, 8:02 pm, "David E. Ross" <nob...@nowhere.invalid> wrote:
> An enterprise the size of Mozilla must surely have attorneys on staff or
> retainer.  You should find out if what is proposed is legal before
> expending any efforts to implement it.  Besides Germany, there might be
> other nations with laws impacting on this concept.
>
> Furthermore, where such laws do not exist, Mozilla needs to have a firm
> policy on how the organization would respond to a warrant or subpoena
> for the data.  That policy must be in place before the data collection
> begins and should address not only a government's request for the data
> but also a request resulting from a civil lawsuit.
>

We do have a legal team; we also engaged outside legal counsel
specifically on the question of European and German law for this
project. We have asked the legal and privacy teams to share the
results of their reviews.




On Feb 7, 1:28 am, "Justin Wood (Callek)" <Cal...@gmail.com> wrote:
> Using this logic, SeaMonkey should gather all data about all users, we
> possibly can, because we have been losing market share heavily every
> since we became SeaMonkey from "the Mozilla Suite".
>

Please reconsider the phrase "should gather all data about all users
we possibly can". This project is not about gathering all data
possible. It has a very specific list of the minimal data that was
determined to be required to answer the questions we decided we needed
to answer. There has been a lot of information shared about
what those questions are and the justifications for most of the data
points in other venues such as the bugs and the wiki. I am happy to
continue to work toward sharing justifications and considerations for
any of the data listed. It is right for Mozilla and the community to
ask for those explanations. It is difficult to maintain a productive
discussion where everyone has a clear picture of the facts when
exaggerated phrases are used, though.

> From where I sit, the largest fault of our market share is the fact
> that Google has heavy brand awareness, and is doing LOTS of expensive
> advertising campaigns, and well-done in most cases. So "Google Chrome"
> is interesting to the ignorant-of-computer users.
>
> Also Microsoft is (Finally) developing a Sane IE, which means less
> reason for people to install a different web browser on Windows.
>

Both of these are great concerns that tie in to this project. These
changes in the market are significant and primarily involve
a large class of mainstream users who are under-represented in our
current understanding. These other companies are focusing a lot of
attention on understanding how the browser is used by mainstream
users. We are striving to improve our own understanding.

We don't want to just do things the same way as others though. We
have tried to develop a project that can analyze usage without
collecting personally identifying information. We have worked with
the privacy and legal teams to propose policies to mitigate the
unavoidable PII such as ensuring that IP addresses are never tied to
the data and that we don't leave any easy way to associate identifying
information such as an e-mail address or name with the data. We have
also put into the project a set of goals around giving the users
visibility, functionality, and control of the data generated by their
browser.




On Feb 7, 3:25 am, Henri Sivonen <hsivo...@iki.fi> wrote:
> ...
> Now Telemetry has been very carefully designed to have privacy
> characteristics that suit Mozilla's stated privacy principles and
> those characteristics have been bragged about. And then another team
> comes along, treats that design as a bug wants to send a per-user ID
> to enable longitudinal study. If doing what this metrics feature
> suggests to be done was OK, surely Telemetry would already have UUIDs
> and support for "longitudinal study".

We definitely spent a lot of time looking at Telemetry and working
with that team. The data that Telemetry collects and the purpose that
it exists for are different, though. Telemetry was designed to enable
developers to understand the performance characteristics of individual
features or code paths "in the wild". It does not require retention
or the same sort of longitudinal data that MDP proposes to meet those
requirements. Putting those characteristics into Telemetry would be
doing the very thing that several people have spoken out against,
adding data to a system that is not directly needed by that system.

There is a significant value in judiciously partitioning data by
purpose. It enables better policy governing the data. It allows
finer control over what data is collected and how it is reviewed. It
allows walls to be put up to prevent associations from being made
where the organization does not wish them to be made (for instance
tying usage data directly to crash reports).


> As for the Germany/EU aspect: (Note the rest of this paragraph says
> nothing about law. I'm not trying to play a lawyer here.) Even if
> sending an UUID had no real privacy impact, sending an UUID would be
> bad publicity in Europe. The usage share of Firefox is in the decline.
> Europe in general and Germany in particular is a place where the usage
> share of Firefox is high. It seems like a bad idea to hurt that market
> share in order to study metrics related to it.

I just want to clarify precisely what is being discussed when we say
"sending an UUID". MDP is generating cumulative data on the client
and submitting that data as a document. That document is given a new
UUID and the client retains that document ID. Every time a new
submission is made, it will have a new document identifier. It is
even possible for the identifier to not be part of the URL (which is
sent using SSL). If the user wishes to delete the usage data for
their installation, the browser submits a delete request with the last
submitted ID. When a new document is generated on another day and
submitted, the client also sends the old document ID to be deleted so
that there are not two copies of the data on the server. This allows
us to look at retention. If a document is older than N days, we know
that there have been no further submissions from that installation.
This implementation does still require policy and trust. It requires
that we not record IP addresses with the data set. It requires that
we do not longitudinally track location. There might be further ways
we can make it easier to follow those policies.
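
As a sketch of that client-side flow (the helper functions, endpoint
handling, and storage here are placeholders, not the actual
implementation):

  // submit() and requestDelete() stand in for SSL POSTs to hypothetical
  // endpoints; store is some local persistent storage for the retained ID.
  async function submitMetricsDocument(store, payload, submit, requestDelete) {
    const newId = crypto.randomUUID();       // fresh ID for this document
    const previousId = store.lastDocumentId; // ID retained from last submission

    // Send the new document, then ask the server to drop the old one so
    // only a single copy of this installation's data exists server-side.
    await submit(newId, payload);
    if (previousId) {
      await requestDelete(previousId);
    }
    store.lastDocumentId = newId;            // keep only the latest ID locally
  }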



On Feb 7, 6:19 am, Gervase Markham <g...@mozilla.org> wrote:
> On 06/02/12 22:16, Daniel E wrote:
>
> > It is an unfortunate fact that even in the other data available to us
> > today, there are occasional ways in which a user can modify their
> > system or browser such that some private information is leaked out.
> > One of the best examples I can give of that is the ability to change
> > variables that are used in the update or blocklist checks. There are
> > requests to those systems that have an e-mail address in the place of
> > the product name ("Firefox"). There are systems that have a changeset
> > or bug number or username in the channel or distribution name.
>
> I have no reason to doubt you that this happens, but there is a big
> difference between designing your system to request particular data, and
> accidentally receiving some of it because a user mis-configures their
> browser.
>
> If I have a web "contact me" form, and someone pastes their entire
> medical history into it and hits Submit, I probably want to delete the
> data - but I don't have to engineer my data handling process for content
> coming from that form so that it's robust for handling medical data!
>

We need the legitimate data that is expected to be in those
variables. We are designing the system to be able to use that data.
We do not want to be burdened by illegitimate data that is available
as the result of a mistake on the part of a developer or user, so we
have made sure that the system has checks and features to restrict and
eliminate that data easily.


> > It was critical for us when we proposed this system to have data
> > collection that was focused on the browser installation rather than
> > any attempt to learn anything about an individual person.
>
> I'm not sure that's a distinction we can make. I am the only user of my
> browser, and I'm sure that's true of lots of other people too. What can
> you tell about me from my list of installed add-ons? I won't give you
> the full list, but I suspect you could tell:
>
> - I do web development of RESTful services using JSON
> - I work for Mozilla
> - I care about my privacy

I believe that it is important to consider even the worst cases, but
please keep in mind that this is not a normal case. The system is
designed such that it would have no way of telling that Gerv is a web
developer who works for Mozilla and cares about privacy. There are
specific policies and features put in place to prevent the system from
ever being able to associate those conclusions with a person. We
don't keep IP addresses with the data to prevent the possibility of
using that IP address to identify the person using an installation.
We use a document identifier so that even if one document ID were ever
leaked or shared by you (say via an e-mail), the ID would change at
the next submission so we would not be able to use that ID to look up
the data from your installation next month and see if you still care
about privacy.


Justin Lebar

Feb 7, 2012, 10:51:53 AM
to Daniel E, dev-pl...@lists.mozilla.org
>> Now Telemetry has been very carefully designed to have privacy
>> characteristics that suit Mozilla's stated privacy principles and
>> those characteristics have been bragged about. And then another team
>> comes along, treats that design as a bug wants to send a per-user ID
>> to enable longitudinal study. If doing what this metrics feature
>> suggests to be done was OK, surely Telemetry would already have UUIDs
>> and support for "longitudinal study".
>
>We definitely spent a lot of time looking at Telemetry and working
>with that team. The data that Telemetry collects and the purpose that
>it exists for is different though. Telemetry was designed to enable
>developers to understand the performance characteristics of individual
>features or code paths "in the wild". It does not require retention
>or the same sort of longitudinal data that MDP proposes to meet those
>requirements. Putting those characteristics into Telemetry would be
>doing the very thing that several people have spoken out against,
>adding data to a system that is not directly needed by that system.
>
>There is a significant value in judiciously partitioning data by
>purpose. It enables better policy governing the data. It allows
>finer control over what data is collected and how it is reviewed. It
>allows walls to be put up to prevent associations from being made
>where the organization does not wish them to be made (for instance
>tying usage data directly to crash reports).

I feel like we're talking past each other. This does not address
Henri's point, nor my earlier point.

When Telemetry was started, we explicitly decided not to include a
UUID, not because it wouldn't be useful -- in fact, it would have been
extremely useful! -- but because we decided that doing so would have
violated our principals.

So what Henri was saying is, we decided that a UUID is not acceptable
in a ping-type thing. This metrics ping is a ping-type thing. And
yet we did not apply our earlier decisions about ping-type things to
the metrics ping. Why is that?

Again, this has nothing to do with what you're going to use the
metrics ping data for, or what Telemetry was designed for. It has
nothing to do with partitioning data. It has nothing to do with
metrics' requirements for MDP. And most importantly, it has nothing
to do with what is or isn't needed in order to generate useful data.

As a point of comparison, when telemetry started sending the list of
add-ons, Sid insisted we re-prompt every user who had agreed to
telemetry with new text explicitly saying we were sending the list of
add-ons. So again, here we have a decision made about sending the
list of add-ons in a ping-type thing, that we cannot do it without
explicit permission, even for people who already opted in to data
collection. Henri's point is that we're reversing this decision, yet
not explicitly acknowledging that we're doing so.

Opt-out, UUID, and every other piece of data in this proposed
telemetry ping needs to be shown to be consistent with our privacy
principles, absent any appeal to why it's needed. So far, nobody has
attempted to do so in this thread.

-Justin

Gervase Markham

Feb 7, 2012, 11:35:39 AM
to
On 06/02/12 18:56, Benjamin Smedberg wrote:
> There has been a project being worked on for some time to collect
> metrics from Firefox installations in an "on by default" manner. This is
<snip>

http://www.theregister.co.uk/2012/02/07/mozilla_telemetry_controversy/

Gerv

Boris Zbarsky

Feb 7, 2012, 11:46:37 AM
to
On 2/7/12 9:32 AM, Daniel E wrote:
> When a new document is generated on another day and
> submitted, the client also sends the old document ID to be deleted so
> that there are not two copies of the data on the server. This allows
> us to look at retention. If a document is older than N days, we know
> that there have been no further submissions from that installation.

A question.

Would the concerns some people have about sending the old ID and new one
together be at all alleviated if the sending of the delete request and
the new report were somewhat decorrelated? That is, if the delete
request were sent some random amount of time after the new report? If
so, is that setup reasonable?
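
Something along these lines, purely as a sketch (the delay window and the
requestDelete helper are made up):

  // Schedule the delete of the previous document a random 0-6 hours after
  // the new submission, so the two requests are harder to correlate.
  function scheduleDecorrelatedDelete(previousId, requestDelete) {
    const delayMs = Math.random() * 6 * 60 * 60 * 1000;
    setTimeout(() => requestDelete(previousId), delayMs);
  }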

> This implementation does still require policy and trust. It requires
> that we not record IP addresses with the data set. It requires that
> we do not longitudinally track location. There might be further ways
> we can make it easier to follow those policies.

One problem is that some people will assume that if data is being sent
then it's being used, no matter what we actually do with it and say we
do with it. So if we _can_ design things such that we couldn't misuse
them even if we were to want to, we should. I understand that in
general this is pretty difficult....

-Boris

beltzner

Feb 7, 2012, 11:57:19 AM
to Boris Zbarsky, dev-pl...@lists.mozilla.org
On Tue, Feb 7, 2012 at 11:46 AM, Boris Zbarsky <bzba...@mit.edu> wrote:
> Would the concerns some people have about sending the old id and new one
> together be at all alleviated if the sending of the delete request and the
> new report were somewhat decorrelated?  That is, if the delete request were
> sent some random amount of time after the new report?  If so, is that setup
> reasonable?

Indeed, I've been wondering similar things, but in a more general
sense. Part of the issue here is what data the client sends, but
there's also design work we can do in terms of how we receive and
store that information. If we strip identifying data the instant it
comes in, essentially double-blinding our data storage system, then we
can protect user privacy by getting at the derivative data and not
storing any personally identifiable information. (We, unlike anyone else,
can be trusted in this regard because we will publish the code of
the server receiving the data for community inspection.)

But I'm no expert. This is just the sort of thinking I believe we need
to be following in order to, as I keep saying, build a better
mousetrap for solving this age old tension between a product
organization's thirst for data and understanding and our own mission's
dedication to user control and privacy.

> One problem is that some people will assume that if data is being sent then
> it's being used, no matter what we actually do with it and say we do with
> it.  So if we _can_ design things such that we couldn't misuse them even if
> we were to want to, we should.  I understand that in general this is pretty
> difficult....

I believe we can resolve this by being 100% transparent (as in,
allowing user inspection) about the data being sent, and then about the data as
it's collected and used internally. It's the only way, truly, to build
the required trust for those interested.

cheers,
mike

Gian-Carlo Pascutto

unread,
Feb 7, 2012, 12:15:53 PM2/7/12
to
The author seems to mix up Telemetry with the proposed Metrics Data
Ping. Rather strange thing to do given that Benjamin's post, which started
the discussion here, begins by outlining the differences between the two.

--
GCP

Robert Kaiser

unread,
Feb 7, 2012, 1:32:08 PM2/7/12
to
Justin Wood (Callek) schrieb:
> Using this logic, SeaMonkey should gather all data about all users, we
> possibly can, because we have been losing market share heavily ever
> since we became SeaMonkey from "the Mozilla Suite".

Be careful there. SeaMonkey's active daily installations have been
growing faster than the Internet population recently, which doesn't
sound like "losing market share". ;-)

> Lastly Apple has a lead on Mobile in general, and we can't even offer a
> Firefox for mobile, and instead we are stuck with doing a Firefox Home
> to share bookmarks, while the default webkit-based browser[s] are
> pulling ahead there, given the iPhone/iPad proliferation.

For one thing, Firefox Home seems to not even be actively developed
nowadays - but even more importantly, mobile is not equal to iOS,
actually IIRC Android has more sales than iOS now, and even overtook
Apple in market share in some regions (probably not the US though).

There is some truth in what you are saying, though - and from reading
webmaster@m.o email, I know that there are a couple small-looking things
that massively annoy some users, like us having some kind of problem in
first-run detection and showing the first-run page on every start for
some people, or the request for telemetry opt-in. And there are others.

Also, our lack of sandboxing for website content is something that many
play up as a major security argument against Firefox.

And if we lose our image as caring more about privacy than anyone else,
that would probably make us lose more market share than this data can
bring us back.

Robert Kaiser

beltzner

unread,
Feb 7, 2012, 1:39:17 PM2/7/12
to dev-pl...@lists.mozilla.org
IMO, all of this discussion about the necessity and value of the data
is not only off-topic, but not really all that contentious. With more
data we can (potentially) make more informed decisions; with less data
we cannot.

Let's keep the discussion focused on ways of obtaining the data that
are consistent with our privacy values. I think it will be more
productive, and lead us down fewer ratholes.

cheers,
mike

Robert Kaiser

unread,
Feb 7, 2012, 1:56:40 PM2/7/12
to
beltzner schrieb:
> We must build a better mousetrap. I suggest
> people look to the Crash Stats efforts to this end, as they have long
> avoided privacy-invasive actions (at non-trivial cost) while still
> mining the available data to gain significant understanding of our
> broad user base's experience with the browser.

I wouldn't put crash reports/stats up as the thing to follow, for two
reasons:
1) We know that there's data in there that indeed is very much
fingerprintable, such as the full list of libraries loaded into the
process and add-ons installed.
2) Sending crash reports is opt-in and the proposed data ping would be
opt-out.

Robert Kaiser

Daniel Cater

unread,
Feb 7, 2012, 1:37:27 PM2/7/12
to
I don't know if the article has changed since you read it (it may well have done as the author obviously reads these newsgroups) but I didn't get that impression.

I see: "However, unlike the Telemetry Project, the proposed MetricsDataPing project will be opt out..." as a quite clear indication that it's different.

The author is entirely welcome to read these groups and to publish articles about them. When those articles are accurate I'm often glad that issues have been publicised to a wider audience. My main concern when articles like this are published is that Mozilla employees will take more of their discussions to internal e-mail to keep them out of the newsgroups. That is even more un-Mozilla than an opt-out, uniquely identifiable data collection system.

Daniel E

unread,
Feb 7, 2012, 12:47:44 PM2/7/12
to
I was thinking about this implementation yesterday. It would be
possible, and it does provide one more small technical hurdle in the
path of trying to link together the datasets. The downsides are
fairly minimal. It isn't difficult to implement, and it doesn't seem
likely to significantly degrade the quality of the metrics, other than
forcing a delay before we can look at retention until enough time has
passed for the documents to be deleted and replaced. It is slightly
more complicated, however, which would make it harder for all
interested parties to understand.

The biggest question I have is whether it is a significantly useful
change. It doesn't prevent the possibility of fingerprinting, which
means it still ultimately relies on policy and would eventually fail
if we systematically violated that policy. If it were determined to
be a significantly useful change by the privacy, security, and legal
teams, then I would be happy to work on implementing it. At the
moment, that value is questionable to me, so we haven't considered
modifying our proposal to include it.

Justin Lebar's statements still apply, though. We should definitely
keep our focus on reviewing the determinations of our privacy and legal
teams on the existing proposal, and after that we can begin looking at
further improvements.

Daniel Cater

unread,
Feb 7, 2012, 2:22:37 PM2/7/12
to mozilla.dev.planning group
In terms of extensions, I would be very surprised if there is even one other user with the same set of extensions that I have installed. I wasn't particularly happy that it was even added to the opt-in Telemetry (see https://bugzilla.mozilla.org/show_bug.cgi?id=668392#c22). It was mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=661881#c1 that users should be able to opt out of specific probes, but that hasn't yet happened.

Daniel E

unread,
Feb 7, 2012, 2:24:28 PM2/7/12
to
On Feb 7, 1:37 pm, Daniel Cater <djca...@gmail.com> wrote:
> On Tuesday, 7 February 2012 17:15:53 UTC, Gian-Carlo Pascutto  wrote:
> > The author seems to mix up Telemetry with the proposed Metrics Data
> > Ping. Rather strange thing to do given that Benjamin's post, which started
> > the discussion here, begins by outlining the differences between the two.
>
> I don't know if the article has changed since you read it (it may well have done as the author obviously reads these newsgroups) but I didn't get that impression.
>
> I see: "However, unlike the Telemetry Project, the proposed MetricsDataPing project will be opt out..." as a quite clear indication that it's different.
>
> The author is entirely welcome to read these groups and to publish articles about them. When those articles are accurate I'm often glad that issues have been publicised to a wider audience. My main concern when articles like this are published is that Mozilla employees will take more of their discussions to internal e-mail to keep them out of the newsgroups. That is even more un-Mozilla than an opt-out, uniquely identifiable data collection system.

Fortunately the author very quickly responded to feedback and
corrected some of the confusion in the original article that tied
pieces of this proposal to Telemetry.

Blake

unread,
Feb 7, 2012, 2:48:35 PM2/7/12
to
On Feb 7, 6:40 am, beltzner <mbeltz...@gmail.com> wrote:

> That's a laudable and excellent goal; the wiki page should specify
> exactly what those questions are, and how the data will be used to
> answer them, *before* any action is taken to collect the data. If the
> questions are indeed important and long-standing, it shouldn't be hard
> to generate that list!

Thanks for the feedback, Mike. In my mind, the two most important
questions to answer are:
1) What is Firefox's retention rate?
2) What factors drive retention?

Do you feel that these questions were not adequately identified in
the requirements section of the wiki?
(https://wiki.mozilla.org/MetricsDataPing#Requirements)

If Mozilla doesn't collect longitudinal user data, it cannot answer
these questions. As a Firefox user and a community member, I want
Mozilla to have the data it needs to build a better product. Mozilla
needs to rely more on science and less on gut instinct.

Blake

unread,
Feb 7, 2012, 2:55:56 PM2/7/12
to
On Feb 6, 11:24 am, Dao <d...@design-noir.de> wrote:
> I was just going to post this to bug 718066, now commenting here instead:
>
> (In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #54)
>
> > (In reply to Dão Gottwald [:dao] from comment #52)
> > > I'd consider add-ons problematic, partly because the IDs alone can let you
> > > track down a person, partly because the use of some add-ons could be illegal
> > > in some countries. I also second Ben's view that IP addresses + GUIDs need
> > > to be considered personally identifiable information. You say you don't
> > > store IP addresses, but this just brings us back to good intentions vs.
> > > systems that inherently protect privacy by just not sending out problematic
> > > data.
>
> > Based on your feedback, we removed persona and theme IDs from the list of
> > data submitted.  We also implemented the honoring of the setting that an
> > add-on developer can put into the manifest to prevent submitting the add-on
> > ID to Mozilla services.  That preference was originally set up as part of
> > the services.addons.mozilla.org features that support the Add-on manager.
>
> There's no direct link between the use of an add-on being illegal in
> some country and the developer setting that pref. In general, I wouldn't
> count on people setting that pref.
>
> > > The client has the list of installed add-ons, knows about crashes and could
> > > be told what to consider "slow". Providing it with a list of add-ons that
> > > generally tend to be problematic would probably cover 99.9+%. It's unclear
> > > why this requires fine-grained data from hundreds of millions of users.
>
> > That presumes that we can know with accuracy what add-ons tend to be
> > problematic for most of our users.  If we don't collect data from the
> > general usage base, the best we could ever hope to know is what AMO hosted
> > add-ons cause problems on our own specific test machines and what add-ons
> > people have told us cause problems for them.
>
> No, there's also telemetry, which I think we haven't fully utilized yet.
> I don't see how some user selection bias would hinder linking add-ons
> with performance and stability problems.
>
> (In reply to Blake Cutler from comment #57)
>
> > 2) I didn't say Mozilla is going to die. I implied it's headed toward
> > irrelevance. Let's look at the numbers:
> > * Webkit's market share is already 10 points higher than Gecko's.
> > * Gecko is losing .5% market share per month and has no meaningful presence
> > on mobile devices.
> > * Webkit is gaining over 1% market share per month and dominates mobile
> > browsing.
> > * Mobile browsing is rapidly overtaking desktop browsing (gaining nearly 1%
> > share per month)
>
> It's unclear how the proposed Metrics Data Ping would change this. See
> again the questions I asked in
> <https://bugzilla.mozilla.org/show_bug.cgi?id=718066#c35>.

The Metrics Data Ping is an attempt to apply scientific principles to
product design and development. Mozilla relies too much on gut
decisions, which directly translates to poor product decisions.
Firefox analytics are stuck in the dark ages. It shows.

blake....@gmail.com

unread,
Feb 7, 2012, 3:23:23 PM2/7/12
to dev-pl...@lists.mozilla.org
On Tuesday, February 7, 2012 10:39:17 AM UTC-8, beltzner wrote:

> IMO, all of this discussion about the necessity and value of the data
> is not only off-topic, but not really all that contentious. With more
> data we can (potentially) make more informed decisions; with less data
> we cannot.

This distinction is not entirely correct. I imagine the Metrics team would be happy with less data if that data was properly collected.

To build informative statistical models, the data needs to be:
* Relatively unbiased (unlike telemetry)
* Longitudinal

This is the same way data is collected on every single Mozilla website.

Joshua Cranmer

unread,
Feb 7, 2012, 4:08:12 PM2/7/12
to
On 2/6/2012 4:29 PM, blake....@gmail.com wrote:
> On Monday, February 6, 2012 1:18:56 PM UTC-8, Joshua Cranmer wrote:
>> Reading the page and skimming the bug has started to lead me to the
>> impression that the data being collected is more oriented in "just in
>> case" or "we want hard numbers to back up what we know", which is
>> definitely not the kind of collection we want to be encouraging.
> I understand where you're coming from, but this data isn't being collected "just in case." We need this data to 1) calculate Firefox's retention rate and 2) identify factors that drive retention.

I do not want to dwell on this, but I would like to point out that
Firefox is a brand, and a brand's image is not driven by the objective
truth as much as it is driven by people's beliefs about the brand. The
data you collect would only record the objective truth, and I want to be
assured that the people looking at it are fully aware that people
might be switching due to reasons other than what this data collection
would be able to show.

David E. Ross

unread,
Feb 7, 2012, 4:50:44 PM2/7/12
to
There are a large number of potential investigations where applying
scientific principles is unethical or entirely illegal. This is true,
of course, in human physiology and psychology. But it is also true in
sociology, which encompasses how people use tools.

Gian-Carlo Pascutto

unread,
Feb 7, 2012, 5:56:49 PM2/7/12
to
On 7/02/2012 19:37, Daniel Cater wrote:

> I don't know if the article has changed since you read it (it may
> well have done as the author obviously reads these newsgroups) but I
> didn't get that impression.

The article was in fact edited after I posted. There are a few traces of
the original mixup remaining, e.g.:

"Bucksch is concerned because a proposal has been put forward for
Telemetry to include a universally unique identifier (UUID) for
longitudinal analysis"

"his claims proved that the proposed feature for Telemetry was "illegal""

Also, the actual link to the article (which is probably a permalink
that's trickier to fix) still refers to Telemetry.

--
GCP

Justin Wood (Callek)

unread,
Feb 7, 2012, 6:20:27 PM2/7/12
to
Robert Kaiser wrote:
> Justin Wood (Callek) schrieb:
>> Using this logic, SeaMonkey should gather all data about all users, we
>> possibly can, because we have been losing market share heavily ever
>> since we became SeaMonkey from "the Mozilla Suite".
>
> Be careful there. SeaMonkey's active daily installations have been
> growing faster than the Internet population recently, which doesn't
> sound like "losing market share". ;-)

Well, I see I didn't word it the way I meant it. I meant more that we
"lost lots of market share vs. what we had when we were the Mozilla
Suite -- and have not regained the majority of that yet". But yes, we
are GAINING [relatively] well lately.

--
~Justin Wood (Callek)

Justin Wood (Callek)

unread,
Feb 7, 2012, 6:22:32 PM2/7/12
to
Robert Kaiser wrote:
>> Lastly Apple has a lead on Mobile in general, and we can't even offer a
>> Firefox for mobile, and instead we are stuck with doing a Firefox Home
>> to share bookmarks, while the default webkit-based browser[s] are
>> pulling ahead there, given the iPhone/iPad proliferation.
>
> For one thing, Firefox Home seems to not even be actively developed
> nowadays - but even more importantly, mobile is not equal to iOS,
> actually IIRC Android has more sales than iOS now, and even overtook
> Apple in market share in some regions (probably not the US though).

IMO, part of the Android market share increase is e-reader-specific
devices that have locked stores. The Barnes and Noble Nook, which I
can't install Firefox Mobile on, is Android; for the Amazon Kindle, we
just recently attempted to deploy a Firefox to JUST their "Fire"
version, but last I saw we had to retract that choice due to Amazon
changing what we deployed.

--
~Justin Wood (Callek)

Ken Saunders

unread,
Feb 8, 2012, 9:53:57 AM2/8/12
to Nicholas Nethercote, Ben Bucksch, dev-pl...@lists.mozilla.org

> I'm not sure that it's really germane to the discussion at hand - I
> don't think our choices here should be governed significantly by our
> fear of bad press, or our belief that the issue will not garner
> significant public notice at all...
> cheers,
> mike

All true, and I understand that this is one of the tough parts of working in the open, but (as Gerv pointed out) the bad press has already started, and those who aren't as pro-Mozilla as we are will more than likely try to exploit this, especially with such great headline elements as "Mozilla considering collecting personal data", regardless of what the facts may be.

Then you start getting people from all over who know very little to nothing about the topic muddying up your discussions here, like I am now.
I don't remember the topic, but that very thing happened sometime last year (I believe) in a bug.

I'm just saying that it's all something to keep in mind.

Ken

Robert Kaiser

unread,
Feb 8, 2012, 12:59:15 PM2/8/12
to
Blake schrieb:
> Thanks for the feedback, Mike. In my mind, the two most important
> questions to answer are:
> 1) What is Firefox's retention rate?
> 2) What factors drive retention?

Do we need the list of add-ons including a lot of detail on them for
answering those questions? It looks to me like a significant factor in
controversy here is the fingerprintability of that data (correct me if
I'm wrong) and I wonder if we really need that data to answer the core
questions we have or if it's just a "would be nice to answer those
questions as well" point to include that info.

Robert Kaiser

Daniel E

unread,
Feb 8, 2012, 4:45:51 PM2/8/12
to
On Feb 8, 12:59 pm, Robert Kaiser <ka...@kairo.at> wrote:

> Do we need the list of add-ons including a lot of detail on them for
> answering those questions? It looks to me like a significant factor in
> controversy here is the fingerprintability of that data (correct me if
> I'm wrong) and I wonder if we really need that data to answer the core
> questions we have or if it's just a "would be nice to answer those
> questions as well" point to include that info.
>
> Robert Kaiser

We are working with the Privacy team to put together a comprehensive
set of information that covers each data point and the questions that
require them.
I can go ahead and answer this one right now though. Note that these
are questions that specifically center around add-ons. It is by no
means the complete set of questions.



For the user looking at about:metrics and their own local data, the
questions are:
Has the performance or stability of my installation changed since I
installed this add-on?
Has the performance or stability of my installation changed since I
updated this add-on to this version?
How does my performance or stability compare to other installations
with the same add-on, version, or set of add-ons?
Did the performance or stability of installations with this add-on
change recently? (i.e. a problem with the add-on's interaction with
the web)

Obviously, not everyone will know or wish to go and look here. That
is part of why we are trying to get this data so we can understand it
at an aggregate level too. That said, I believe that empowering users
to discover the answer to these questions is a good thing. It provides
another tool for the SUMO community to help people. It also gives
people a factual basis to communicate issues -- "I have this chart
that shows the add-on caused a problem for me" instead of intuition
and approximations -- "I think it started getting a bit crashier after
I updated a couple of add-ons".

For Mozilla looking at the data on our side, the questions are:
What add-ons significantly contribute to a degradation of performance
or stability for the majority of installations?
Do certain versions of those add-ons increase or decrease their
impact?
Is the amount of impact constant (e.g. a 1000ms sleep on start) or does
it have a high variance? (It affects some installations severely,
others not at all. This is a likely indication of a confounding
variable such as conflict with another add-on).
Do certain sets of add-ons significantly contribute to a degradation
of performance or stability for the majority of installations?
Do add-ons which use binary components significantly contribute to a
degradation of performance or stability for the majority of
installations?
Are there "dark-matter" add-ons or plug-ins that significantly
contribute to a degradation of performance or stability for the
majority of installations? (Dark-matter add-ons are ones that are not
hosted on AMO.)

Specific retention analysis questions:
Are installations with few add-ons more likely to be abandoned than
installations with a lot?
Do certain sets of add-ons contribute significantly to the continued
use of Firefox?
Do certain add-ons or versions of add-ons significantly contribute to
the abandonment of installations?
Do certain add-ons or versions of add-ons significantly contribute to
the delay of updates of Firefox?
Is the combination of add-ons plus usage frequency statistically
significant?


All of these questions are things that have been asked over the past
several years and they are determined to be critically important as a
part of understanding usage and improving Firefox. In the past, we
have tried to answer some of these questions through debugging,
analyzing crash-stats, analyzing SAMO data, and looking at Telemetry.
Each one of these data sources doesn't provide a clear enough picture
though. Mostly because they can't show a trend over time (i.e. the
installation was good before add-on version X and bad after it) and
they can't show abandoned installations (i.e. retention).

Some people feel that dark-matter add-ons are a significant cause of
difficult-to-understand performance and stability problems. It is
difficult for us to find and investigate those since they aren't
available on AMO and we don't have many metrics on them. It is easy
to look at some crash data and decide that an add-on causes crashes.
In many cases, that analysis will be right, but without knowing how
many installations were already having trouble before that add-on was
installed, or how many installations are running fine with it
installed, we aren't making the best decisions we could be making.

Jim

unread,
Feb 8, 2012, 4:55:36 PM2/8/12
to
On 02/06/2012 12:56 PM, Benjamin Smedberg wrote:
> For background, the feature page (not strictly a feature page) is here:
> https://wiki.mozilla.org/MetricsDataPing

Is there a reason this couldn't be handled with Test Pilot? The way
Thunderbird does it is to install Test Pilot by default on all channels
(including release), but the studies themselves will prompt the user
before submission. Could Firefox do the same? It seems strange to have
so many different metrics-gathering features for one project.

- Jim

Daniel E

unread,
Feb 8, 2012, 5:33:27 PM2/8/12
to
Yes. Test Pilot was primarily designed for focused one-shot
experiments. It doesn't have any framework for being able to perform
retention analysis. The population of users opting in to any given
study is subject to self-selection bias and is heavily composed of
technically savvy, engaged collaborators rather than representing the
hundreds of millions of Firefox users who are focused on browsing a
web page more than reading about a study and opting in to it.

We could attempt to change some or all of those things to make Test
Pilot more suitable, but then it would become less suitable for what
it was primarily designed to be.

I agree that four major platforms for metrics gathering is a lot, but
I believe there is significant value in keeping each of them focused
to a particular domain and isolating the data from each other to
prevent potential privacy leaks:

Crash Stats -- Diagnostic data about a specific crash. Suitable for
developers to study the crash and attempt to fix code problems.

Test Pilot -- Short term experimentation platform. Suitable for an in-
depth study on a particular facet of product user experience, possibly
modifying the UI to evaluate the difference.

Telemetry -- Single point in time performance data for specific code
areas. Suitable for developers to implement and verify performance
improvements to specific parts of code and test for and prevent
regressions.

Metrics Data Ping -- Product usage, performance, and stability
trending. Suitable for the end user or Mozilla to understand
retention characteristics and things that increase or decrease the
performance, stability, and usage of the product over a period of time.

Jim

unread,
Feb 8, 2012, 6:31:20 PM2/8/12
to
On 02/08/2012 04:33 PM, Daniel E wrote:
> Yes. Test Pilot was primarily designed for focused one-shot
> experiments. It doesn't have any framework for being able to perform
> retention analysis. The population of users opting in to any given
> study is subject to self-selection bias and is heavily composed of
> technically savvy, engaged collaborators rather than representing the
> hundreds of millions of Firefox users who are focused on browsing a
> web page more than reading about a study and opting in to it.

Technically, wouldn't even an opt-out system impose self-selection bias?
Specifically, you'd be selecting against privacy-conscious users, and not
only in the metrics gathering: for the more privacy-conscious
(paranoid?) users, you may even be pushing them away from Firefox
entirely, which would go unmeasured.

I'm not sure how important those users are, statistically speaking, but
I expect that they are more important in a community sense than
"ordinary" users, if only because the savvier users can and do influence
other users. Anecdotally, I'm sure most of the people reading this have
evangelized Firefox to family members. (Granted, I'm being handwavy
here, which is an unfortunate thing to do when talking about statistical
validity.)

An opt-out metrics system probably has a pretty good cost/validity
ratio, but the privacy concerns are pretty significant. I won't
reiterate them here, since others have (and will, I'm sure) touch on
them in this thread, but I do want to mention that even with a good
faith effort to avoid PII, computers are a hell of a thing, and I think
there's a good chance that people could derive more information than
intended from the metrics data. While I trust Mozilla not to do anything
nefarious, I don't trust the entire toolchain to be 100% free of
security vulnerabilities.

Has the metrics team considered other ways of getting some
representative data for the "average" Firefox user? If you could get
that data, it should be possible to use it to subsample opt-in data and
get something statistically valid. I'm not sure of the best way to get
that initial data, though, which is a problem.

Of course, some of this is about the perception of evil rather than the
actual existence of evil, since there's been a lot of privacy-related
bad news lately (Facebook, Google, etc). Still, that perception does
have real-world effects.

In closing, here's a post from Jono of the Test Pilot team about how
they've tried to deal with selection bias:
<http://jonoscript.wordpress.com/2010/10/09/test-pilot-self-selection-bias-and-how-to-compensate-for-it/>.
A relevant quote: "I think informed user consent is the ethical way to
collect data and I stand behind the way we’ve run our studies so far."

- Jim

Robert O'Callahan

unread,
Feb 8, 2012, 9:35:10 PM2/8/12
to Daniel E, dev-pl...@lists.mozilla.org
On Thu, Feb 9, 2012 at 10:45 AM, Daniel E <deins...@gmail.com> wrote:

> All of these questions are things that have been asked over the past
> several years and they are determined to be critically important as a
> part of understanding usage and improving Firefox. In the past, we
> have tried to answer some of these questions through debugging,
> analyzing crash-stats, analyzing SAMO data, and looking at Telemetry.
> Each one of these data sources doesn't provide a clear enough picture
> though. Mostly because they can't show a trend over time (i.e. the
> installation was good before add-on version X and bad after it) and
> they can't show abandoned installations (i.e. retention).
>

Solving those problems only requires UID matching. It doesn't require
abandoning the opt-in model.

It seems to me the main motivation for adopting an opt-out model is to
avoid sampling bias. Has any work been done to measure bias introduced by
our opt-in data collection channels? E.g., average number of addons used by
Telemetry-enabled users vs average number of addons used by all users?

Rob
--
"If we claim to be without sin, we deceive ourselves and the truth is not
in us. If we confess our sins, he is faithful and just and will forgive us
our sins and purify us from all unrighteousness. If we claim we have not
sinned, we make him out to be a liar and his word is not in us." [1 John
1:8-10]

David E. Ross

unread,
Feb 8, 2012, 10:44:29 PM2/8/12
to
On 2/8/12 6:35 PM, Robert O'Callahan wrote:
> On Thu, Feb 9, 2012 at 10:45 AM, Daniel E <deins...@gmail.com> wrote:
>
>> All of these questions are things that have been asked over the past
>> several years and they are determined to be critically important as a
>> part of understanding usage and improving Firefox. In the past, we
>> have tried to answer some of these questions through debugging,
>> analyzing crash-stats, analyzing SAMO data, and looking at Telemetry.
>> Each one of these data sources doesn't provide a clear enough picture
>> though. Mostly because they can't show a trend over time (i.e. the
>> installation was good before add-on version X and bad after it) and
>> they can't show abandoned installations (i.e. retention).
>>
>
> Solving those problems only requires UID matching. It doesn't require
> abandoning the opt-in model.
>
> It seems to me the main motivation for adopting an opt-out model is to
> avoid sampling bias. Has any work been done to measure bias introduced by
> our opt-in data collection channels? E.g., average number of addons used by
> Telemetry-enabled users vs average number of addons used by all users?
>
> Rob

Both opt-in and opt-out have each their own biases. Opt-in results
would likely be biased against naive users who do not understand what is
requested and against experienced users who are concerned about privacy.
Opt-out results would likely be biased against experienced users who
are concerned about privacy. It seems to me (an amateur on this issue)
that either opt-in or opt-out would provide results that are biased
against the very users that could provide the most valuable data.

Justin Lebar

unread,
Feb 9, 2012, 1:03:10 AM2/9/12
to dev-pl...@lists.mozilla.org
> Both opt-in and opt-out have each their own biases. [snip]

We're getting a low opt-in rate with Telemetry. It's likely that we'd
get a similarly low opt-out rate with MDP.

So yes, there's bias regardless of whether MDP is opt-out or opt-in. But I think
it's fair to expect that opt-out would significantly decrease the
amount of bias.

But I think this discussion about biases in the data is missing the
point. I'm still waiting to hear someone address Benjamin's original
concerns about privacy using the framework I laid out. That is,
nobody has yet explained how opt-out and the specific pieces of data
in the proposed MDP are consistent with our privacy policy and our
mission. This justification, I've argued, must not rely on the
necessity of these analytics, because we value principles first.

If people from the privacy team are lurking and preparing this
explanation, can you give us an ETA?

-Justin

JP Rosevear

unread,
Feb 9, 2012, 7:19:37 AM2/9/12
to rob...@ocallahan.org, Daniel E, dev-pl...@lists.mozilla.org
On Thu, 2012-02-09 at 15:35 +1300, Robert O'Callahan wrote:
> On Thu, Feb 9, 2012 at 10:45 AM, Daniel E <deins...@gmail.com> wrote:
>
> > All of these questions are things that have been asked over the past
> > several years and they are determined to be critically important as a
> > part of understanding usage and improving Firefox. In the past, we
> > have tried to answer some of these questions through debugging,
> > analyzing crash-stats, analyzing SAMO data, and looking at Telemetry.
> > Each one of these data sources doesn't provide a clear enough picture
> > though. Mostly because they can't show a trend over time (i.e. the
> > installation was good before add-on version X and bad after it) and
> > they can't show abandoned installations (i.e. retention).
> >
>
> Solving those problems only requires UID matching. It doesn't require
> abandoning the opt-in model.
>
> It seems to me the main motivation for adopting an opt-out model is to
> avoid sampling bias. Has any work been done to measure bias introduced by
> our opt-in data collection channels? E.g., average number of addons used by
> Telemetry-enabled users vs average number of addons used by all users?

My understanding is that this was checked for telemetry for a single
measure, start up time, by the metrics team and in that case telemetry
was found to be representative of the general population. Can you
confirm that Daniel?

-JP
--
JP Rosevear <j...@mozilla.com>
Mozilla

Robert Kaiser

unread,
Feb 9, 2012, 10:55:12 AM2/9/12
to
Justin Lebar schrieb:
>> Both opt-in and opt-out have each their own biases. [snip]
>
> We're getting a low opt-in rate with Telemetry.

I'm hearing this all the time, but AFAIK we don't even know if the rate
we do get is statistically relevant and can actually tell us enough
about our user population. As long as we don't have conclusive research
on that, I'm inclined to not accept this to be an argument at all.

Robert Kaiser

Robert Kaiser

unread,
Feb 9, 2012, 11:05:59 AM2/9/12
to
Daniel E schrieb:
> Has the performance or stability of my installation changed since I
> installed this add-on?
> Has the performance or stability of my installation changed since I
> updated this add-on to this version?

We don't need to send anything to the server to answer that, we can do
that safely with only locally stored data - and that wouldn't be as much
of a privacy risk.
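
As a rough sketch of what a purely local computation could look like
(the data layout here is made up for illustration; Python):

from datetime import date

def change_since_install(history, installed_on):
    """Compare mean startup time before and after an add-on was
    installed, using only locally stored measurements.
    history is a list of (date, startup_ms) tuples."""
    before = [ms for day, ms in history if day < installed_on]
    after = [ms for day, ms in history if day >= installed_on]
    if not before or not after:
        return 0.0                  # not enough local data to compare
    return sum(after) / len(after) - sum(before) / len(before)

# Example with made-up local measurements: 900 ms before, 1400 ms after
print(change_since_install([(date(2012, 2, 1), 900), (date(2012, 2, 8), 1400)],
                           date(2012, 2, 5)))   # prints 500.0

Nothing in that calculation has to leave the machine.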

> How does my performance or stability compare to other installations
> with the same add-on, version, or set of add-ons?

Difficult to measure due to the wide spread of hardware, but yes,
probably needs us to store server-side data in this way.

> Did the performance or stability of installations with this add-on
> change recently? (i.e. a problem with the add-on's interaction with
> the web)

Hard to tell even with the data you are proposing to collect, as we
don't store the history of installation and deinstallation of add-ons
and send it to the server as well, do we? (And if we did, it would
already sound quite 1984 of us.)

> What add-ons significantly contribute to a degradation of performance
> or stability for the majority of installations?
> Do certain versions of those add-ons increase or decrease their
> impact?

Can't we already tell a lot of that through the data we are gathering
with AMO pings?

> Some people feel that dark-matter add-ons are a significant cause of
> difficult to understand performance and stability problems.

I thought we would already be collecting some info about so-called
"dark-matter add-ons" through AMO?

Robert Kaiser

David E. Ross

unread,
Feb 9, 2012, 2:10:26 PM2/9/12
to
On 2/6/12 10:56 AM, Benjamin Smedberg wrote:
> There has been a project being worked on for some time to collect
> metrics from Firefox installations in an "on by default" manner. This is
> different from off-by-default telemetry.
<snip>

Doesn't the use of a UUID in this concept conflict with bug #572650,
which is the tracking bug for reducing the fingerprint of Mozilla
products? The justification for that bug includes "good for user
privacy". See <https://bugzilla.mozilla.org/show_bug.cgi?id=572650>.

Robert Kaiser

unread,
Feb 9, 2012, 3:11:56 PM2/9/12
to
David E. Ross schrieb:
> Doesn't the use of a UUID in this concept conflict with bug #572650,
> which is the tracking bug for reducing the fingerprint of Mozilla
> products?

No. For one thing, the "UUID" in this MDP concept is not what is usually
meant by a UUID in tracking/surveillance contexts, as it's not even
constant for a user or even an installation.
For another, bug 572650 is about the HTTP headers we send to _every_
website out there, while the MDP in this proposal is about Mozilla
collecting a small amount of data to help improve Firefox. This is not
sent to everyone out there, but only to Mozilla ourselves.

Therefore, there's a big difference and those are two different matters.

There's still the question of how far we can go there in terms of
privacy, as one of our goals is to fight for user privacy and so we
can't implement things that actually compromise it. On the other hand,
we want to gather data so we can improve Firefox for all our users and
actually know what problems we need to fix as well as be able to check
if our fixes do work. This is the discussion we need to have, and these
are the questions we need to find answers to.

Robert Kaiser

Robert Strong

unread,
Feb 9, 2012, 3:17:58 PM2/9/12
to dev-pl...@lists.mozilla.org
I think there is a set of data that is either the same as or very close
to the existing ping data we get, and that this data does not lend
itself to fingerprinting users. There is also an additional set of data
that IMO could be provided by a subset of users who opt in to providing
it to Mozilla, and that could potentially be used to fingerprint users.
For example, I don't think we need to get the list of add-ons from
everyone who doesn't opt out to be able to tell that add-on X is causing
a problem with startup time, etc.
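
Purely as an illustration of the kind of split being suggested (the
field names and the opt-in flag are made up, not taken from the
proposal), in Python:

CORE_FIELDS = {"app_version", "os", "locale", "days_since_last_ping"}
EXTENDED_FIELDS = {"addon_ids", "addon_versions"}   # potentially fingerprintable

def build_payload(profile_data, extended_opt_in):
    """Send a low-risk core set by default; include the extended set
    only when the user has explicitly opted in to it."""
    fields = CORE_FIELDS | (EXTENDED_FIELDS if extended_opt_in else set())
    return {key: value for key, value in profile_data.items() if key in fields}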

Would it be possible to separate the data sets in such a manner and
still accomplish the goals of gathering metrics?

Robert

Daniel Cater

unread,
Feb 10, 2012, 9:12:08 AM2/10/12
to mozilla.de...@googlegroups.com, Daniel E, dev-pl...@lists.mozilla.org, rob...@ocallahan.org
It was blogged about on the metrics blog.

http://blog.mozilla.com/metrics/2011/12/13/comparing-the-bias-in-telemetry-data-vs-the-typical-firefox-user/

"good news! Insofar start up time is concerned, Telemetry is representative of SAMO [services.addons.mozilla.org]"

Daniel E

unread,
Feb 10, 2012, 10:23:49 AM2/10/12
to
On Feb 9, 11:05 am, Robert Kaiser <ka...@kairo.at> wrote:
> Daniel E schrieb:
>
> > Has the performance or stability of my installation changed since I
> > installed this add-on?
> > Has the performance or stability of my installation changed since I
> > updated this add-on to this version?
>
> We don't need to send anything to the server to answer that, we can do
> that safely with only locally stored data - and that wouldn't be as much
> of a privacy risk.

We don't need to send server data to answer that question just for the
user, but the primary goal for MDP is for Mozilla to be able to
understand these questions at an aggregate level for the majority of
our typical installations. The MDP proposal was built in such a way
that the data needed to answer those questions could reside on the
client and also deliver a direct benefit to individual users without
having to rely on whether the results from our aggregate view lined up
well with their unique configuration.

> > How does my performance or stability compare to other installations
> > with the same add-on, version, or set of add-ons?
>
> Difficult to measure due to the wide spread of hardware, but yes,
> probably needs us to store server-side data in this way.

Yes, we could definitely measure it better by collecting more data,
but one of the directives of the project was to collect the minimum
number of data points necessary to answer the most critical questions --
ones we know how to answer but currently lack the data for.

> > Did the performance or stability of installations with this add-on
> > change recently? (i.e. a problem with the add-on's interaction with
> > the web)
>
> Hard to tell even with the data you are proposing to collect, as we
> don't store the history of installation and deinstallation of add-ons
> and send it to the server as well, do we? (And if we did, it would
> already sound quite 1984 of us.)

Same point as above. We do store the installation and last update
date of each add-on. That is good enough to answer the current
questions we are asking. If it turns out that there are more
questions to be asked on the client side or the server side in the
future, then we'll have to evaluate possible future data points, weigh
the value against the cost of collecting more data, review it with the
User Data Council, and make a public proposal to add more.

> > What add-ons significantly contribute to a degradation of performance
> > or stability for the majority of installations?
> > Do certain versions of those add-ons increase or decrease their
> > impact?
>
> Can't we already tell a lot of that through the data we are gathering
> with AMO pings?

Since we don't have longitudinal data, we don't know what the
performance or stability was like before the add-on was installed or
updated. This means our analysis is lacking critical data to make
sure we have determined the most likely causal factors for the
correlations we might see.

> > Some people feel that dark-matter add-ons are a significant cause of
> > difficult to understand performance and stability problems.
>
> I thought we would already be collecting some info about so-called
> "dark-matter add-ons" through AMO?

The only data we get about dark-matter add-ons is through the SAMO
ping, and my points in the previous question apply to this data.




On Feb 10, 9:12 am, Daniel Cater <djca...@gmail.com> wrote:
> On Thursday, 9 February 2012 12:19:37 UTC, JP Rosevear wrote:
> > On Thu, 2012-02-09 at 15:35 +1300, Robert O'Callahan wrote:
> > > On Thu, Feb 9, 2012 at 10:45 AM, Daniel E <deinspan...@gmail.com> wrote:
> > > It seems to me the main motivation for adopting an opt-out model is to
> > > avoid sampling bias. Has any work been done to measure bias introduced by
> > > our opt-in data collection channels? E.g., average number of addons used by
> > > Telemetry-enabled users vs average number of addons used by all users?
>
> > My understanding is that this was checked for telemetry for a single
> > measure, start up time, by the metrics team and in that case telemetry
> > was found to be representative of the general population. Can you
> > confirm that Daniel?
>
> It was blogged about on the metrics blog.
>
> http://blog.mozilla.com/metrics/2011/12/13/comparing-the-bias-in-tele...
>
> "good news! Insofar start up time is concerned, Telemetry is representative of SAMO [services.addons.mozilla.org]"

Yep, that study found that while there was visible variance in the
distribution of startup time alone, it was still close enough to be
representative.

We are currently putting together information on a few other studies
we've done that look at other data points, such as the installation of
particular add-ons. There are very few areas in which we can currently
do this sort of comparative analysis, because we have only a few
sources of data that are sent by default for most installations. Each
of those sources is also very specific and confined to just a couple of
data points, which makes it impossible to study the bias for
multivariate analysis.

The metrics team should have that information ready within the next
couple of business days.

-Daniel

Justin Dolske

unread,
Feb 10, 2012, 3:08:11 PM2/10/12
to
On 2/6/12 10:56 AM, Benjamin Smedberg wrote:

> It seems as if we are saying that since we already collect most of this
> data via various product features, that makes it ok to also collect this
> data in a central place and attach an ID to it.

I'd like to suggest splitting the "metrics collection" work up into a
few separate pieces. It seems like there are multiple goals, with
varying levels of controversy, and so untangling things seems like a way
to both make some progress and make it easier to consider parts
independently.

Specifically:

1) Implement a "metrics ping", using only the data that is already
present in the blocklist update requests. I don't think there are any
privacy concerns here, since it's essentially an implementation detail.

[Goals: simplify the blocklist code/service, and _improve_ privacy
footing using a separate ping, so it's possible to disable it without
disabling blocklist checking.]


2) Expand the set of data contained in the metrics ping (not to include
any unique identifier). This should be subject to the same privacy
reviews and discussion as has been the norm in the past (e.g., telemetry).
This shouldn't be very controversial, as I think the main objection to
the current patches was the since-removed addon/theme ID reporting.

[Goal: understand our users better, with minimal privacy impact.]


3) Improve quality of collected data, balanced against privacy concerns.
Maybe this means an anonymous ID, or some kind of alternate approach.
The discussion going on here should continue. Another possibility is
that we find no middle ground, and do nothing.

[Goals: Major, even radical, improvement to understanding our users and
data usability, but needs to be balanced against perceived privacy
risk/impact.]


One small catch comes to mind: it still feels like there is a lot of
overlap with what Telemetry does, and so it would seem sensible to try
to converge on a single "telemetry and metrics" service. Or, even if
they're separate implementations, combine them in the UI, with a single
opt-in/opt-out.

Justin

Justin Dolske

unread,
Feb 10, 2012, 3:51:06 PM2/10/12
to
On 2/6/12 10:56 AM, Benjamin Smedberg wrote:

> * the unique IDs used to keep track of particular installations can
> potentially track data back to users (note that the UUID proposal has
> changed somewhat due to privacy concerns, but that there is still a
> local ID -> remote data mapping)

Some random thoughts from mulling over how to lessen the impact of a
unique ID...

I am no stats wizard (thankfully others around here are ;), but surely we
don't need to have each and every one of our 500 million users uniquely
identified.

One starting point would be to only collect data + ID from whatever
percentage of users gives us sufficient confidence in extrapolating
results. Say, 15% of the user base? (The browser would randomly decide
at install time to participate or not, avoiding sample bias.) This
doesn't directly address objections about having IDs, but the point is
that its impact -- and perception -- can be lessened by not tagging
_everyone_.

Another thought would be to periodically rotate IDs. Say, on the 1st
of every month anyone with an ID gets a new random ID. This prevents --
on the *client* -- tracking beyond 1 month.

Combine the two, and every month have the browser re-roll to determine
if it's in the 15% sample pool or not along with rotating the ID. That
would be... a 2% chance of participating 2 months in a row, and you'd
have a new ID anyway.
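
A minimal sketch of how the client side of that could work (Python; the
15% figure is just the illustrative value from above, and the state
record is made up):

import random
import uuid
from datetime import date

SAMPLE_RATE = 0.15   # illustrative value, not a real pref

def reroll_for_month(today, state):
    """On the 1st of each month, re-decide participation and rotate the
    submission ID, so IDs cannot be linked across months. With a 15%
    rate, the chance of being sampled two months in a row is 0.15 ** 2
    (about 2%), and even then the ID differs."""
    if today.day == 1 or "participating" not in state:
        state["participating"] = random.random() < SAMPLE_RATE
        state["submission_id"] = str(uuid.uuid4()) if state["participating"] else None
    return state

# Example: re-roll a fresh profile on the first of the month
print(reroll_for_month(date(2012, 3, 1), {}))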

Complex variations come to mind too... AIUI, even just short-term IDs
would be greatly helpful. (For example, with crashes we can't really
tell how frequent crashes are for users that hit them -- multiple times a
day? Once a day every day? Only rarely?) So, I could imagine having most
of the sampled users rotating IDs weekly, some rotating monthly, and
perhaps even a small number on a longer interval (if useful).

[Of course, even if the above was all opt-out, users who opted out would
not have to opt out again at every rotation. Also, all the numbers I used were
rough guesses, which would need refinement.]

Justin

Justin Lebar

unread,
Feb 10, 2012, 4:18:05 PM2/10/12
to Justin Dolske, dev-pl...@lists.mozilla.org
> I'd like to suggest splitting the "metrics collection" work up into a few separate pieces.

This seems prudent to me.

> I am no stats wizard (thankfully other around here are ;), but surely we
> don't need to have each and every one of our 500 millions users uniquely
> identified.

Yes, taking a random sample of even 1% of our users would likely be
more than sufficient. But the sample has to be as random as possible
-- the objection to opt-in is that it's not an unbiased sample of our
users.

> One starting point would be to only collect data + ID from whatever
> percentage of users gives us sufficient confidence in extrapolating results.
> Say, 15% of the user base?

Even 1% of our user-base is 4 million people. So the essential
question is, why is it OK to collect information from 4 million
people, but not from 400 million?

Similarly, why is it OK to track users for a month if it's not OK to
track them for a year?

I don't think it's the scale of the proposed data collection that
people object to so much as the fact that we'd be doing it at all.
I'd have concerns about tracking 10 people for a day without their
explicit opt-in consent.

-Justin L.

pierreh...@gmail.com

unread,
Feb 11, 2012, 7:05:29 AM2/11/12
to mozilla.dev.planning group
Are you guys trying to add so much tracking that the browser will slow to a snail's pace or something?

Why don't you spend your time on serious things instead?

Robert Kaiser

unread,
Feb 13, 2012, 4:21:11 PM2/13/12
to
pierreh...@gmail.com schrieb:
> Are you guys trying to add so much tracking that the browser will slow to a snail's pace or something?
>
> Why don't you spend your time on serious things instead?

Actually, our metrics team is trying to do the serious job of learning
where the issues lie for our users and how best to improve Firefox, while
at the same time not adding anything that really "tracks" users.

And the fact that we are openly discussing this here, well before it even
reaches development versions, is one major way in which Mozilla differs
from our competitors in this respect; some of them just add tracking
technology without even talking about it in public.

We value privacy, and we value making Firefox better for users. That
includes making it faster, more responsive and using less memory than
previous versions. That's the reason why we are discussing this matter.

Robert Kaiser

Gervase Markham

unread,
Feb 14, 2012, 9:41:10 AM2/14/12
to Justin Dolske
On 10/02/12 20:08, Justin Dolske wrote:
> I'd like to suggest splitting the "metrics collection" work up into a
> few separate pieces. It seems like there are multiple goals, with
> varying levels of controversy, and so untangling things seems like a way
> to both make some progress and make it easier to consider parts
> independently.

Great idea.

It seems that most of the concern is around the unique user identifier.
Splitting it up also lets people who want to work on mathematical or
cryptographic ways of preserving privacy while correlating submission
events focus on just that piece.

> One small catch comes to mind: it still feels like there is a lot of
> overlap with what Telemetry does, and so it would seem sensible to try
> to converge on a single "telemetry and metrics" service. Or, even if
> they're separate implementations, combine them in the UI, with a single
> opt-in/opt-out.

Certainly they should be combined in the UI. IMO it would be a very
nuanced privacy-conscious individual who was OK with Mozilla getting
Telemetry data but not OK with what Metrics are after.

Gerv

Gervase Markham

unread,
Feb 14, 2012, 9:50:13 AM2/14/12
to Justin Lebar, Justin Dolske
On 10/02/12 21:18, Justin Lebar wrote:
> Even 1% of our user-base is 4 million people. So the essential
> question is, why is it OK to collect information from 4 million
> people, but not from 400 million?
>
> Similarly, why is it OK to track users for a month if it's not OK to
> track them for a year?

Particularly if the data collection were switched on and off, and if each
period of collection were un-correlatable with the others, I would have
a better gut feeling about this if it were me giving the data.

> I don't think it's the scale of the proposed data collection that
> people object to so much as the fact that we'd be doing it at all.
> I'd have concerns about tracking 10 people for a day without their
> explicit opt-in consent.

"Tracking", I would say, is building up a profile on a particular
individual over time. Therefore, "tracking someone for a day" doesn't
make all that much sense as a concept.

In addition, I feel that the creepiness of "tracking" also comes from
data correlation - advertising companies build up a profile on my
interests by looking at 20 sites I've visited. If the "unique
identifier" we use is uncorrelatable with any other information anyone
else holds about the individual concerned, is that better? I would
suggest it is, a bit.

Gerv

Mitchell Baker

unread,
Feb 14, 2012, 3:12:57 PM2/14/12
to
After letting this sit for a few days I still feel the need to respond.


> If the current proposal is in violation of our values, it's up to the
> metrics team (and whoever wants to help) to come up with an
> alternative. It is explicitly *not* up to those of us opposing the
> current proposal to propose an alternative.

I disagree with this statement profoundly.

First of all, the statement has the "in violation of our values" clause,
which implicitly suggests that merely asserting a "violation of values"
is enough to make it one.

But more important, let's take it as given that *you* believe this is in
violation of the Mozilla value of privacy. It is also *your* job to try
to internalize the problem presented and whether there are alternatives.
You may end up deciding that "stop," "no" and "I don't care about
anything but this single issue" is your ultimate view.

Asserting that anyone can just say this, then sit back and say "prove
it, fix it for me" is unhealthy.


Mitchell

Gervase Markham

unread,
Feb 15, 2012, 6:44:19 AM2/15/12
to Mitchell Baker
Mitchell, I think you have misunderstood what Justin is saying. He is
not _asserting_ that this action is in violation of our values, and that
therefore the metrics team must come up with something else, and then
sitting there with arms folded waiting for them to dance to his tune.

He is saying that _if_ it is (and that is what we are discussing), then
the responsibility for coming up with something better falls to the
Metrics team. In other words, he thinks it's unreasonable for Daniel to
say, as he did further up the thread, "if other people can come up with
a feasible alternative to this form of data collection, fine, otherwise
we are going ahead with it".

In that, I think he is right. We should work out whether this is in
violation of our values or not and, if it is, it's up to the Metrics
team to propose another way of doing it if they still want the data.

Gerv

Mitchell Baker

unread,
Feb 15, 2012, 4:33:02 PM2/15/12
to
Thanks Gerv, this is a helpful discussion. Some comments inline.
Actually, I still disagree with both the positions asserted. If there
is an issue related to our values, then it is *everyone's* job to do a
few things:

1. understand the other person's goal, perspective and issues.
2. try to figure out a way to meet those goals*


In this case for example, one goal is to understand more when people
give us a very clear message: " I'm not using your product as my {main}
browser anymore."

(*One could end up disagreeing on the goals. For example, perhaps some
would say we don't need to know why people drop Firefox for another
browser, it's better to do whatever we think is right and not be
concerned with relevance. If so, we should have that discussion directly.)
>
> In that, I think he is right. We should work out whether this is in
> violation of our values or not and, if it is, it's up to the Metrics
> team to propose another way of doing it if they still want the data.
>
Yes and no, in my mind. Metrics are the metrics experts, so we can't
have a metrics proposal without their leadership. But others are
product experts and can have a lot of valuable input and leadership.
Indeed, we're seeing some of that here. So the specific parts of this
statement that I disagree with are the ideas that:

-- any specific team is on its own when trying to solve a thorny
problem
-- others can stop the work but not have responsibility for solving the
problem
-- the tenor of the phrase "if they still want the data." This
assumes that the problem is "theirs" not "ours."

It's easy to end up in this mode of thinking. I think it's a slow and
painful death for us if we don't build strong ways of working
productively together to solve thorny problems. There will be more and
more of them.

Mitchell

Justin Lebar

unread,
Feb 15, 2012, 5:19:16 PM2/15/12
to Mitchell Baker, dev-pl...@lists.mozilla.org
>> He is saying that _if_ it is (and that is what we are discussing), then
>> the responsibility for coming up with something better falls to the
>> Metrics team. In other words, he thinks it's unreasonable for Daniel to
>> say, as he did further up the thread, "if other people can come up with
>> a feasible alternative to this form of data collection, fine, otherwise
>> we are going ahead with it".

I think Gerv's second rewording matches most closely with what I'm
trying to say. A third way of putting it is that I'd like to separate
the question of "should we go forward with the current proposal?" from
the question of "how can we improve the current proposal?".

Although it's conceivable that we might be able to tweak the proposal
so as to improve its privacy aspect without harming its
data-collection aspect, I think that's unlikely (and in any case, such
a change would be uncontroversial). More likely, an "improvement" in
one area would be a regression in the other.

So before we try to trade off between privacy and data-collection, I'd
like us to understand whether the current proposal is acceptable at
all from a privacy perspective. If it is, then we wouldn't need to
change the proposal to increase privacy (at the expense of data
accuracy).

> Actually, I still disagree with both the positions asserted.  If there is an
> issue related to our values, then it is *everyone's* job to do a few things:
>
>    1.  understand the other person's goal, perspective and issues.
>    2.  try to figure out a way to meet those goals*

From the perspective of my obligations as a Mozillian, I agree that I
should help the metrics team achieve its goals, rather than adopting
an us-versus-them attitude.

(I don't think this is necessarily inconsistent with Gerv's "it's up
to the Metrics team to propose another way of doing it if they still
want the data," in the sense that "the Metrics team" might be more
than just the people below "metrics" in the MoCo org chart. In a way,
anyone can be part of the Metrics team by constructively participating in
this hypothetical rework of MDP, so Gerv's statement is tautological.
But this is a distraction from the main point.)

All I'm saying is, one should be able to argue that MDP is not
consistent with our values without, at the same time, proposing an
alternative to MDP. The question of whether MDP as proposed is
consistent with our values is separate from the question of whether
and how we're going to learn the things MDP wants to help us learn.

Thankfully we've stopped conflating these two questions in this
thread, so I have no further complaints at this time! :)

I'm sorry for speaking so strongly on this issue without speaking clearly.

-Justin


Daniel E

unread,
Feb 15, 2012, 5:46:48 PM2/15/12
to
On Feb 14, 9:41 am, Gervase Markham <g...@mozilla.org> wrote:
> On 10/02/12 20:08, Justin Dolske wrote:
>
> > I'd like to suggest splitting the "metrics collection" work up into a
> > few separate pieces. It seems like there are multiple goals, with
> > varying levels of controversy, and so untangling things seems like a way
> > to both make some progress and make it easier to consider parts
> > independently.
>
> Great idea.

We do have a few key goals that are laid out as requirements in the
proposal on the wiki. However, the assessment of the metrics team is
that the proposal we offered supports all the requirements without
introducing a high risk of failure due to complexity or data quality
issues. If we instead break it up into four or more separate
mechanisms, we still have the original set of requirements (unless the
organization decides to change those as well), and the mechanisms share
several of the same technical needs, but now we have a higher risk of
reduced effectiveness or complete failure in meeting one or more of
those goals, plus a lot of shared code and additional surface area that
the user needs to be aware of.

>
> It seems that most of the concern is around the unique user identifier.
> Splitting it up also helps people who want to work on mathematical or
> cryptographic ways to preserve privacy while correlating submission
> events to work only on that piece.

The majority of the concern that I have seen so far is not that we have a
unique user identifier; we are explicitly trying to avoid that. The
concern that I have seen is whether the technical approaches (such as
the document identifier strategy and user-controlled data removal) and
the policy approaches (such as permanently divorcing the IP address
and any longitudinal geographic location from the data and providing
source code for client and server and completely transparent view of
the data being stored) are sufficient.

If people have the opinion that these measures are not enough and that
it could be technically possible at some point in the future to morph
this into a user identification and tracking system, contrary to our
current privacy standards and efforts to prevent such action, we
should at least be specific about that. When we talk about it as if
it already is such a system, it just leads to more confusion.

>
> > One small catch comes to mind: it still feels like there is a lot of
> > overlap with what Telemetry does, and so it would seem sensible to try
> > to converge on a single "telemetry and metrics" service. Or, even if
> > they're separate implementations, combine them when handled in the UI
> > and a single optin/optout.
>
> Certainly they should be combined in the UI. IMO it would be a very
> nuanced privacy-conscious individual who was OK with Mozilla getting
> Telemetry data but not OK with what Metrics are after.

Telemetry collects snapshots of lots of fine-grained performance
data. MDP collects a much higher-level set of product usage trends.
Whether we want to ask, in a single question, if it is okay to collect
both is something that could be considered separately, but the systems
differ enough in their purposes that I fear it would be a very complex
task to try to merge them without creating more confusion.

Mitchell Baker

unread,
Feb 15, 2012, 8:19:52 PM2/15/12
to
On 2/15/12 2:19 PM, Justin Lebar wrote:
>>> He is saying that _if_ it is (and that is what we are discussing), then
>>> the responsibility for coming up with something better falls to the
>>> Metrics team. In other words, he thinks it's unreasonable for Daniel to
>>> say, as he did further up the thread, "if other people can come up with
>>> a feasible alternative to this form of data collection, fine, otherwise
>>> we are going ahead with it".
>
> I think Gerv's second rewording matches most closely with what I'm
> trying to say. A third way of putting it is that I'd like to separate
> the question of "should we go forward with the current proposal?" from
> the question of "how can we improve the current proposal?".

Hmm, this feels odd to me, but I think it's a lot of language and nuance
parsing at this point and probably not the best use of time.
>
> Although it's conceivable that we might be able to tweak the proposal
> so as to improve its privacy aspect without harming its
> data-collection aspect, I think that's unlikely (and in any case, such
> a change would be uncontroversial). More likely, an "improvement" in
> one area would be a regression in the other.
>
> So before we try to trade off between privacy and data-collection, I'd
> like us to understand whether the current proposal is acceptable at
> all from a privacy perspective. If it is, then we wouldn't need to
> change the proposal to increase privacy (at the expense of data
> accuracy).
>
Maybe if I saw the world as having clear, sharp edges, where something is
either on or off, I might agree. Even if a policy were acceptable, we might
be able to make it better. And if there's a reasonable way to improve a
policy, then fighting over whether the original one was acceptable or
not may not be worth the time. But I think we're discussing meta topics
here, so I'll stop.


>> Actually, I still disagree with both the positions asserted. If there is an
>> issue related to our values, then it is *everyone's* job to do a few things:
>>
>> 1. understand the other person's goal, perspective and issues.
>> 2. try to figure out a way to meet those goals*
>
> From the perspective of my obligations as a Mozillian, I agree that I
> should help the metrics team achieve its goals, rather than adopting
> an us-versus-them attitude.
OK, great. This is key.
>
> (I don't think this is necessarily inconsistent with Gerv's "it's up
> to the Metrics team to propose another way of doing it if they still
> want the data," in the sense that "the Metrics team" might be more
> than just the people below "metrics" in the MoCo org chart. In a way,
> anyone can be of the Metrics team by constructively participating in
> this hypothetical rework of MDP, so Gerv's statement is tautological.
> But this is a distraction from the main point.)
>
> All I'm saying is, one should be able to argue that MDP is not
> consistent with our values without, at the same time, proposing an
> alternative to MDP. The question of whether MDP as proposed is
> consistent with our values is separate from the question of whether
> and how we're going to learn the things MDP wants to help us learn.
>
> Thankfully we've stopped conflating these two questions in this
> thread, so I have no further complaints at this time! :)
>
> I'm sorry for speaking so strongly on this issue without speaking clearly.
>
I probably responded strongly too; thanks for the replies.

Gervase Markham

unread,
Feb 16, 2012, 6:07:03 AM2/16/12
to
On 15/02/12 21:33, Mitchell Baker wrote:
> Actually, I still disagree with both the positions asserted. If there is
> an issue related to our values, then it is *everyone's* job to do a few
> things:
>
> 1. understand the other person's goal, perspective and issues.
> 2. try to figure out a way to meet those goals*

No argument there :-)

> In this case for example, one goal is to understand more when people
> give us a very clear message: " I'm not using your product as my {main}
> browser anymore."

To nuance that: my understanding is that we are reasonably sure what
reasons people give for doing that ("it's slow / it's bloated etc."),
but we need to find out _why_ they are having that experience at a
technical level.

I certainly agree that we should do our utmost to find that out, in a
way consistent with our privacy principles. But I think everyone agrees
on that :-) The question is about whether certain actions are consistent
with them, or can be made consistent with them by doing some work or by
notifying users or by having the right written privacy policy or some
other mechanism.

And I think the reason there is disagreement is that people have
differing opinions on how our privacy principles apply in particular
concrete cases.

>> In that, I think he is right. We should work out whether this is in
>> violation of our values or not and, if it is, it's up to the Metrics
>> team to propose another way of doing it if they still want the data.
>>
> Yes and no, in my mind. Metrics are the metrics experts, so we can't
> have a metrics proposal without their leadership. But others are product
> experts and can have a lot of valuable input and leadership. Indeed,
> we're seeing some of that here. So the specific parts of this statement
> that I disagree with are the ideas that:
>
> -- any specific team is on its own when trying to solve a thorny
> problem
> -- others can stop the work but not have responsibility for solving the
> problem
> -- the tenor of the phrase "if they still want the data." This assumes
> that the problem is "theirs" not "ours."

Both fair points.

> It's easy to end up in this mode of thinking. I think it's a slow and
> painful death for us if we don't build strong ways of working
> productively together to solve thorny problems. There will be more and
> more of them.

I've been encouraged to see people (from various teams) coming up with
creative cryptographic and statistical suggestions for how we might be
able to get this data with reduced privacy impact. More of that :-))

Gerv

Henri Sivonen

unread,
Feb 20, 2012, 6:27:35 AM2/20/12
to dev-pl...@lists.mozilla.org
On Thu, Feb 16, 2012 at 12:46 AM, Daniel E <deins...@gmail.com> wrote:
> The
> concern that I have seen is whether the technical approaches (such as
> the document identifier strategy and user-controlled data removal) and
> the policy approaches (such as permanently divorcing the IP address
> and any longitudinal geographic location from the data and providing
> source code for client and server and completely transparent view of
> the data being stored) are sufficient.

Why does the longitudinal study of performance effects need to happen
on the server?

Would it not be feasible to have the browser perform longitudinal
analytics locally and send the conclusions to the server? (E.g. have
the browser locally conclude that memory usage went bad after
extension Foo got installed and then send an "extension Foo is bad for
memory usage" message to the metrics server.)

--
Henri Sivonen
hsiv...@iki.fi
http://hsivonen.iki.fi/

Daniel E

unread,
Feb 20, 2012, 5:58:37 PM2/20/12
to
On Feb 20, 6:27 am, Henri Sivonen <hsivo...@iki.fi> wrote:
> Why does the longitudinal study of performance effects need to happen
> on the server?
>
> Would it not be feasible to have the browser perform longitudinal
> analytics locally and send the conclusions to the server? (E.g. have
> the browser locally conclude that memory usage went bad after
> extension Foo got installed and then send an "extension Foo is bad for
> memory usage" message to the metrics server.)

It might be possible to eventually develop an automated diagnostic
feature, and the about:metrics portion of MDP is all about being able
to surface such information. It would be very difficult to get such a
feature right without having enough data to test and develop the
model. Retention metrics are one area that would not be feasible
without data being sent to us on a frequent basis.

Henri Sivonen

unread,
Feb 21, 2012, 2:04:30 AM2/21/12
to dev-pl...@lists.mozilla.org
Could the model be developed with opt-in data first?

Could retention be tallied by having the clients send "I've been an
active Firefox for length of time X", where X is made coarse-grained
enough not to uniquely identify an install time? If the time were
reported at month granularity and you see n Firefoxes that have been
active for N months at month_0 and m Firefoxes that have been active
for N+1 months at month_0+1, you can conclude that n-m installations
were lost.
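
[A toy illustration, not part of the original post, of the cohort
arithmetic Henri describes; the numbers are made up.]

    # Hypothetical: clients report only a coarse "active for N months" value;
    # the server compares adjacent months to estimate lost installations.
    def lost_installations(n_at_month0, m_at_month1):
        """n_at_month0: installs reporting "active N months" in month_0.
        m_at_month1: installs reporting "active N+1 months" in month_0+1.
        Roughly n - m installations were lost in between."""
        return n_at_month0 - m_at_month1

    # Made-up numbers: 500000 six-month-old installs seen in March,
    # 460000 seven-month-old installs seen in April.
    print(lost_installations(500000, 460000))  # -> 40000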

John Hopkins

unread,
Feb 21, 2012, 10:10:29 AM2/21/12
to dev-pl...@lists.mozilla.org
On 12-02-08 04:45 PM, Daniel E wrote:
> Are installations with few add-ons more likely to be abandoned than
> installations with a lot?
> Do certain sets of add-ons contribute significantly to the continued
> use of Firefox?

Some thoughts:

1. Can the product simply treat addons as 'untrustworthy' and limit CPU
and memory usage? If we can prevent addons from affecting the core
browser experience, that would eliminate a whole class of problems.
People might complain "addon XYZ is slow" instead of blaming the browser
itself.

2. Separate from whether or not we collect profile data, have we
considered using a Test Pilot study to ask users for their input?
- how likely are you to use Firefox 6 months from now?
- what is your favourite Firefox feature?
- what is one thing you would change about Firefox?
- etc.

3. Do we have any data on the impact that competing browser
advertisements have had on Firefox usage? How do we interpret profile
correlation results when we know there are external influences like this?

4. Should we codify our data collection policies? For example, in a
table like this:
| POLICY | DATA TYPE |
| opt-in | email address |
| opt-out | raw performance data, list of addons |
| aggregate | websites visited per day |
| never | passwords |


John

On 12-02-08 04:45 PM, Daniel E wrote:
> On Feb 8, 12:59 pm, Robert Kaiser<ka...@kairo.at> wrote:
>
>> Do we need the list of add-ons including a lot of detail on them for
>> answering those questions? It looks to me like a significant factor in
>> controversy here is the fingerprintability of that data (correct me if
>> I'm wrong) and I wonder if we really need that data to answer the core
>> questions we have or if it's just a "would be nice to answer those
>> questions as well" point to include that info.
>>
>> Robert Kaiser
>
> We are working with the Privacy team to put together a comprehensive
> set of information that covers each data point and the questions that
> require them.
> I can go ahead and answer this one right now though. Note that these
> are questions that specifically center around add-ons. It is by no
> means the complete set of questions.
>
> For the user looking at about:metrics and their own local data, the
> questions are:
> Has the performance or stability of my installation changed since I
> installed this add-on?
> Has the performance or stability of my installation changed since I
> updated this add-on to this version?
> How does my performance or stability compare to other installations
> with the same add-on, version, or set of add-ons?
> Did the performance or stability of installations with this add-on
> change recently? (i.e. a problem with the add-on's interaction with
> the web)
>
> Obviously, not everyone will know or wish to go and look here. That
> is part of why we are trying to get this data so we can understand it
> at an aggregate level too. That said, I believe that empowering users
> to discover the answer to these questions is a good thing. It provides
> another tool for the SUMO community to help people. It also gives
> people a factual basis to communicate issues -- "I have this chart
> that shows the add-on caused a problem for me" instead of intuition
> and approximations -- "I think it started getting a bit crashier after
> I updated a couple of add-ons".
>
> For Mozilla looking at the data on our side, the questions are:
> What add-ons significantly contribute to a degradation of performance
> or stability for the majority of installations?
> Do certain versions of those add-ons increase or decrease their
> impact?
> Is the amount of impact constant (i.e. 1000ms sleep on start) or does
> it have a high variance? (It affects some installations severely,
> others not at all. This is a likely indication of a confounding
> variable such as conflict with another add-on).
> Do certain sets of add-ons significantly contribute to a degradation
> of performance or stability for the majority of installations?
> Do add-ons which use binary components significantly contribute to a
> degradation of performance or stability for the majority of
> installations?
> Are there "dark-matter" add-ons or plug-ins that significantly
> contribute to a degradation of performance or stability for the
> majority of installations? (Dark-matter add-ons are ones that are not
> hosted on AMO.)
>
> Specific retention analysis questions:
> Are installations with few add-ons more likely to be abandoned than
> installations with a lot?
> Do certain sets of add-ons contribute significantly to the continued
> use of Firefox?
> Do certain add-ons or versions of add-ons significantly contribute to
> the abandonment of installations?
> Do certain add-ons or versions of add-ons significantly contribute to
> the delay of updates of Firefox?
> Is the combination of add-ons plus usage frequency statistically
> significant?
>
>
> All of these questions are things that have been asked over the past
> several years and they are determined to be critically important as a
> part of understanding usage and improving Firefox. In the past, we
> have tried to answer some of these questions through debugging,
> analyzing crash-stats, analyzing SAMO data, and looking at Telemetry.
> Each one of these data sources doesn't provide a clear enough picture
> though. Mostly because they can't show a trend over time (i.e. the
> installation was good before add-on version X and bad after it) and
> they can't show abandoned installations (i.e. retention).
>
> Some people feel that dark-matter add-ons are a significant cause of
> difficult to understand performance and stability problems. It is
> difficult for us to find and investigate those since they aren't
> available on AMO and we don't have many metrics on them. It is easy
> to look at some crash data and decide that an add-on causes crashes.
> In many cases, that analysis will be right, but without knowing how
> many installations were already having trouble before that add-on was
> installed, or how many installations are running fine with it
> installed, we aren't making the best decisions we could be making.

Boris Zbarsky

unread,
Feb 21, 2012, 1:49:01 PM2/21/12
to
On 2/21/12 10:10 AM, John Hopkins wrote:
> 1. Can the product simply treat addons as 'untrustworthy' and limit CPU
> and memory usage?

Addons effectively inject new parts of the product, so no. As in, addon
script looks no different from all other Firefox UI script and shares a
heap with it....

Of course if we completely redesign how addons work, we could do that.
All addons would need to be substantially rewritten to make that work.

-Boris

Ben Bucksch

unread,
Feb 22, 2012, 8:39:34 AM2/22/12
to
On 21.02.2012 19:49, Boris Zbarsky wrote:
>
> Of course if we completely redesign how addons work, we could do that.
> All addons would need to be substantially rewritten to make that work.

Jetpack / Addon-SDK is exactly that, and designed to make that possible.

George S

unread,
Mar 20, 2012, 6:44:27 AM3/20/12
to
Hello. I hope you will tolerate my tardy response to this thread.
I have only just recently learned of this issue and I also wanted
to review all previous messages and information before commenting.

I see what appears to be conflicting info on connection security.
The MetricsDataPing Wiki page says "The server will return an HTTP
response to the client indicating success of both the deletion of
the old document and storage of the new document." However, a
post by Daniel E mentions "Every time a new submission is made, it
will have a new document identifier. It is even possible for the
identifier to not be part of the URL (which is sent using SSL)."
Will ALL communications related to metrics reporting be carried out
via SSL?
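
[For illustration only: a rough sketch of the submit-and-delete flow as
the quoted wiki text describes it. The function names and the transport
callback are assumptions, not the actual MDP client code.]

    # Hypothetical: each submission carries a fresh document ID and asks
    # the server to delete the previously stored document.
    import uuid

    def submit_metrics(payload, previous_doc_id, post):
        """post(new_id, payload, delete_id) is an HTTPS transport supplied
        by the caller; it returns True when the server confirms both the
        storage of the new document and the deletion of the old one."""
        new_doc_id = str(uuid.uuid4())  # fresh identifier per submission
        if post(new_doc_id, payload, delete_id=previous_doc_id):
            # Only the newest ID is kept locally; if it is ever lost
            # (disk crash, reinstall), the last stored document can no
            # longer be deleted on request -- the concern raised below.
            return new_doc_id
        return previous_doc_id  # keep the old ID and retry later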

The MetricsDataPing Wiki page says "In the future this response might
also include instructions to the client for things such as changing
timing or MetricsDataPing configuration." What are we talking about
here? The ability to increase the reporting period to a client-side,
bounds-checked value of >= 24? The ability to (silently) change what
information is being gathered and reported?!?

Will add-ons or extensions be able to access local metrics data and/or
document ID?

If the most recently accepted document ID must be submitted to the
server in order for the server to locate & delete the latest document,
then the latest document ID becoming inaccessible (hard drive crash,
something else requiring a clean reinstall, whatever) would eliminate
the ability to delete the latest stored document. In many cases, I
think, not even a restore from backup would result in the latest
document ID being restored and accessible. One would have no option
but to wait until the "documents older than 6 months are deleted"
policy takes effect?

Will an enabled->disabled transition in the client result in both
local and remote information being deleted?

Will reported information be captured by backups that are not subject
to the deletions requested by users?

It appears this is being enabled by default and the setting to control
it will be found in the Advanced General tab. I get the impression
that the intent is for there to be no disclosure plus express consent
cycle and there will be no ability to disable this reporting until
after the software is installed and up and running (and reporting?).
I'd like to confirm that though, if I may, for both new installs and
updates.

Thanks

Ben Bucksch

unread,
Jan 27, 2014, 1:06:01 PM1/27/14
to
Update, 2 years later:

https://groups.google.com/d/msg/mozilla.dev.security/3LB_8JSHqyQ/gReKuUdt090J
Subject: "Snowden ARD/DasErste interview in English"

> Snowden: "I can build what’s called a fingerprint which is network
> activity unique to you which means anywhere you go in the world"
>
> He didn't elaborate on what this "fingerprint" is based on, but it
> could be:
> ...
> * Unique IDs in "phone home", e.g "Firefox Health Services". This is
> precisely why I opposed Metrics, I wrote in bug 718066 comment 2:
> "Having a UUID would allow, for example, to track all my dynamic IP
> addresses over time, and allow to build a profile, when combined
> with access logs. If I have a notebook or mobile browser, it would
> even allow to track the places where I go based on IP geolocation /
> whois data."
> and ... now Snowden confirms that they are doing precisely that.

and totalitarian states as well, probably.