Firefox Hello new data collection

Romain Testard

Apr 4, 2016, 5:02:03 AM
to dev-pl...@lists.mozilla.org, Ian Bicking, Marshall Erwin, Winston Bowden
Hi all,


We wanted to let you know about new data collection that we will be doing
for Firefox Hello starting with the Firefox 46 launch on April 19th, and the
steps we took to prevent it from collecting personal identification. We want
to collect more data about the websites that people share with Hello, to help
optimize the product UX, understand what people use our new tab sharing
feature for, and prioritize features accordingly. The product features and
UX can be very different if we decide to optimize for “Shopping
together” use cases as opposed to “Playing online games together”, just as
examples.


We did a lot of diligence for this and explored several options for getting
the data. The approach described below is the one we settled on. It
prevents personal identification and gets us the data we need to build the
best tool we can while being sensitive to our users. This involves
collecting the domain names for tabs shared on Firefox Hello on our own
servers.


How we collect the data


We plan to put in place a data collection solution that prevents personal
identification. The technical approach to doing this through the use of
client-side whitelisting is outlined here:



- Data will go to our servers and will be stored with our other server
  metrics. We are aggregating domain names, and are not storing session
  histories. These are submitted at the end of the session, so exact
  timestamps of any visit are not included.

- Users who have disabled Health Reports will also not submit this data.

- We would use a whitelist client-side to only collect domains that are
  part of the top 2000 domains (Alexa list of top domains). This prevents
  personal identification based on obscure domain usage. We would subtract
  the sites from the Adult
  <http://www.alexa.com/topsites/category/Top/Adult> category and add all
  the subdomains of:

  - google.com
    <http://www.labnol.org/internet/popular-google-subdomains/5888/> (e.g.,
    drive.google.com)
  - yahoo.com (e.g., games.yahoo.com)
  - developer.mozilla.org, bugzilla.mozilla.org, wiki.mozilla.org (this
    helps us understand how much our user base is Mozillians)
  - tunes.apple.com

  You can see the exact list here: DomainWhitelist.jsm
  <https://github.com/mozilla/loop/blob/master/add-on/chrome/modules/DomainWhitelist.jsm>

- The data will only be kept for 6 months and we plan to revisit this
  collection in 6 months. We’ll evaluate at the end of this period if we
  should carry on collecting the data (the data is still useful and will help
  further shape the product) or just stop.
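
The client-side whitelisting described in the list above could look roughly
like the sketch below. It is written in Python purely for illustration (the
add-on itself is JavaScript, and DomainWhitelist.jsm remains the authoritative
list). The names WHITELIST, SUBDOMAIN_PARENTS and filter_domain are
hypothetical, and the entries are a tiny stand-in for the real list.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical stand-in for the real whitelist (top domains minus adult sites).
WHITELIST = {"google.com", "yahoo.com", "wikipedia.org"}

# Domains whose subdomains are also reported, as in the list above.
SUBDOMAIN_PARENTS = {"google.com", "yahoo.com"}


def filter_domain(url: str) -> Optional[str]:
    """Return the hostname to report, or None if it must not be collected."""
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host in WHITELIST:
        return host
    # Accept e.g. drive.google.com because google.com allows subdomains.
    for parent in SUBDOMAIN_PARENTS:
        if host.endswith("." + parent):
            return host
    # Anything else never leaves the client.
    return None
```

Under these assumptions, filter_domain("https://drive.google.com/file/1")
yields "drive.google.com", while a host outside the list yields None and is
simply not recorded.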


This e-mail is intended to make everyone aware of the data we’re collecting
in Hello in an effort to be as transparent as possible. We want to make sure
people get the full picture of what we are trying to achieve and what we’re
putting in place to protect our users.


Let me know if you have any questions.



Implementation bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1211542

Technical documentation:
https://github.com/mozilla/loop/blob/master/docs/DataCollection.md


-Romain

Gijs Kruitbosch

Apr 4, 2016, 5:23:23 AM
Hi,

It's very concerning to me that you have not answered the obvious
question: what domains are collected? All of the ones visited while the
browser is running? The ones visited while Hello is open? The ones
visited while shared through Hello? What about the ones that someone
shared with you through Hello, rather than that you shared with someone
else?

What about Private Browsing mode, have you disabled collection there?

On 04/04/2016 10:01, Romain Testard wrote:
> We would use a whitelist client-side to only collect domains that are
> part of the top 2000 domains (Alexa list of top domains). This prevents
> personal identification based on obscure domain usage.

Mathematically, the combination of a set of (popular) domains shared
could still be uniquely identifying, especially as, AIUI, you will get
the counts of each domain and in what sequence they were visited / which
ones were visited in which session. It all depends on the number of
unique users and the number of domains they visit / share (not clear:
see above). Because the total number of Hello users compared with the
number of Firefox users is quite low, this still seems somewhat
concerning to me. Have you tried to remedy this in any way?
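
To make the concern concrete, here is a small Python sketch that measures how
many reports in a population are unique by the set of domains they contain.
The data is invented for illustration and the function name is hypothetical;
it says nothing about actual Hello usage.

```python
from collections import Counter
from typing import FrozenSet, Iterable, List


def unique_fraction(reports: Iterable[Iterable[str]]) -> float:
    """Fraction of reports whose set of domains appears exactly once."""
    sets: List[FrozenSet[str]] = [frozenset(r) for r in reports]
    if not sets:
        return 0.0
    counts = Counter(sets)
    unique = sum(1 for s in sets if counts[s] == 1)
    return unique / len(sets)


# Invented example: every domain is individually popular, yet two of the
# four combinations occur only once.
reports = [
    ["google.com", "youtube.com"],
    ["google.com", "youtube.com"],
    ["google.com", "amazon.com", "ebay.com"],
    ["youtube.com", "wikipedia.org"],
]
print(unique_fraction(reports))
```

The smaller the population and the longer each list, the closer this fraction
tends to get to 1, even when every individual domain is on a whitelist.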

The beginning of your message mentioned that you were interested in
different "types" of sites. I don't think it would be necessary to
optimize Hello for one shopping site over another, or for one search
engine over another, or for one news site over another. So, why don't
you categorize the domains in the whitelist according to broad
categories ("news", "search", "shopping", "games", or something like
this) on the client side, and then send that information instead? If the
set of domains is limited (which it is) then this should not take that
long, and get you exactly the information you want, and limit the
privacy invasion that the current collection scheme represents.
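
A minimal sketch of that idea, in Python for illustration: the client maps
each whitelisted domain to a broad category and submits only per-category
counts. The mapping and the names CATEGORY_BY_DOMAIN and category_counts are
hypothetical, and the categories are hand-assigned rather than taken from
Alexa.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical hand-made mapping; a real one would cover the whole whitelist.
CATEGORY_BY_DOMAIN: Dict[str, str] = {
    "amazon.com": "shopping",
    "ebay.com": "shopping",
    "bbc.co.uk": "news",
    "steampowered.com": "games",
}


def category_counts(shared_domains: Iterable[str]) -> Dict[str, int]:
    """Count shared tabs per category; unmapped domains are dropped."""
    counts: Counter = Counter()
    for domain in shared_domains:
        category = CATEGORY_BY_DOMAIN.get(domain)
        if category is not None:
            counts[category] += 1
    # Only these counts would be submitted, never the domains themselves.
    return dict(counts)
```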

6 months also seems incredibly long. You should be able to aggregate the
data and keep that ("60% of users share on sites of type X") and throw
away the raw data much sooner than that.
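
As a rough sketch of "aggregate, then throw away" (Python, with a
hypothetical function name and invented data), the server-side step could
reduce raw reports to percentages and clear the raw list in the same pass:

```python
from collections import Counter
from typing import Dict, List, Set


def summarize_and_discard(raw_reports: List[Set[str]]) -> Dict[str, float]:
    """Return the percentage of reports touching each category, then drop the raw data."""
    total = len(raw_reports)
    summary: Dict[str, float] = {}
    if total:
        counts = Counter(c for report in raw_reports for c in report)
        summary = {c: 100.0 * n / total for c, n in counts.items()}
    raw_reports.clear()  # only the aggregate survives
    return summary


# Invented data: three of five reports involve shopping.
raw = [{"shopping"}, {"shopping", "news"}, {"games"}, {"shopping"}, {"news"}]
summary = summarize_and_discard(raw)
```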

Finally, I am surprised that you're sharing this 2 weeks before we're
releasing Firefox 46. Hasn't this been tested and verified on Nightly
and/or other channels? Why was no privacy update made at/before that time?

~ Gijs

Gijs Kruitbosch

Apr 4, 2016, 5:25:31 AM
On 04/04/2016 10:01, Romain Testard wrote:
> Implementation bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1211542

Because this bug does not link to it: where is the bug for the privacy
review of this collection? Judging by the people you CC'd I assume you
got one, but where is it?

~ Gijs

ad...@imgland.xyz

Apr 4, 2016, 5:44:52 AM
to Romain Testard, dev-pl...@lists.mozilla.org, Ian Bicking, Marshall Erwin, Winston Bowden
This isn't technically about the data collection, but it would be better if there were some sort of API that web developers could implement on sites like games, so that instead of regular chat, things like co-op and game events could be streamlined into Hello itself.

04.04.2016, 10:02, "Romain Testard" <rom...@mozilla.com>:
>    We would use a whitelist client-side to only collect domains that are
>    part of the top 2000 domains (Alexa list of top domains). This prevents
> _______________________________________________
> dev-platform mailing list
> dev-pl...@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-platform

Romain Testard

Apr 4, 2016, 6:01:49 AM
to Gijs Kruitbosch, dev-pl...@lists.mozilla.org
The privacy review bug is
https://bugzilla.mozilla.org/show_bug.cgi?id=1261467.
More details added below.

On Mon, Apr 4, 2016 at 11:23 AM, Gijs Kruitbosch <gijskru...@gmail.com>
wrote:

> Hi,
>
> It's very concerning to me that you have not answered the obvious
> question: what domains are collected? All of the ones visited while the
> browser is running? The ones visited while Hello is open? The ones visited
> while shared through Hello? What about the ones that someone shared with
> you through Hello, rather than that you shared with someone else?
>

We only collect domains browsed whilst sharing your tabs on Firefox Hello
(link generator side).

>
> What about Private Browsing mode, have you disabled collection there?


Firefox Hello cannot be used with private browsing mode.

>
>
> On 04/04/2016 10:01, Romain Testard wrote:
>
>> We would use a whitelist client-side to only collect domains that are
>> part of the top 2000 domains (Alexa list of top domains). This
>> prevents
>> personal identification based on obscure domain usage.
>>
>
> Mathematically, the combination of a set of (popular) domains shared could
> still be uniquely identifying, especially as, AIUI, you will get the counts
> of each domain and in what sequence they were visited / which ones were
> visited in which session. It all depends on the number of unique users and
> the number of domains they visit / share (not clear: see above). Because
> the total number of Hello users compared with the number of Firefox users
> is quite low, this still seems somewhat concerning to me. Have you tried to
> remedy this in any way?
>

We are aggregating domain names, and are not storing session histories.
These are submitted at the end of the session, so exact timestamps of any
visit are not included.

> The beginning of your message mentioned that you were interested in
> different "types" of sites. I don't think it would be necessary to optimize
> Hello for one shopping site over another, or for one search engine over
> another, or for one news site over another. So, why don't you categorize
> the domains in the whitelist according to broad categories ("news",
> "search", "shopping", "games", or something like this) on the client side,
> and then send that information instead? If the set of domains is limited
> (which it is) then this should not take that long, and get you exactly the
> information you want, and limit the privacy invasion that the current
> collection scheme represents.
>
We looked into this approach originally, but found that we'd lose a level of
granularity that could be important. We may find that Hello gets used a lot
with a specific website for a specific reason, and using client-side
categories would prevent us from learning this. Also, Alexa website
categories are far from perfect, which would add another level of complexity
to understanding the collected data.


> 6 months also seems incredibly long. You should be able to aggregate the
> data and keep that ("60% of users share on sites of type X") and throw away
> the raw data much sooner than that.
>
Yes, agreed. We'll look into the minimum amount of time required to process
the data and extract the useful information. I agree we should try to make
this shorter - we'll learn from being on Beta and will adjust this
accordingly.

>
> Finally, I am surprised that you're sharing this 2 weeks before we're
> releasing Firefox 46. Hasn't this been tested and verified on Nightly
> and/or other channels? Why was no privacy update made at/before that time?
>

We are shipping Hello through Go Faster. The Go Faster process allows us to
uplift directly to Beta 46 since we're a system add-on
(development was done about 2 weeks ago).
Firefox Hello has its own privacy notice (details here
<https://www.mozilla.org/en-US/privacy/firefox-hello/>).

>
> ~ Gijs

Chris Hofmann

Apr 4, 2016, 11:22:16 AM
to Romain Testard, dev-platform, Gijs Kruitbosch
It also seems like you haven't explored other alternatives to get the data
you are after, or formed theories about what results you might expect and
what possible outcomes will be pursued once you get the data.

Have you looked at other studies like this and many more that tell about
general browsing habits?
http://www.adweek.com/socialtimes/online-time/463670

Have you looked at just doing a simple survey to ask people to tell you
what kinds of activities they most often do when sharing sites with Hello?

If the survey or data collection results tell you that some people play
games against each other *and* some people shop together what will you do
then?

-chofmann

On Mon, Apr 4, 2016 at 3:01 AM, Romain Testard <rom...@mozilla.com> wrote:

> The privacy review bug is
> https://bugzilla.mozilla.org/show_bug.cgi?id=1261467.
> More details added below.
>
> On Mon, Apr 4, 2016 at 11:23 AM, Gijs Kruitbosch <gijskru...@gmail.com
> >
> wrote:
>
> > Hi,
> >
> > It's very concerning to me that you have not answered the obvious
> > question: what domains are collected? All of the ones visited while the
> > browser is running? The ones visited while Hello is open? The ones
> visited
> > while shared through Hello? What about the ones that someone shared with
> > you through Hello, rather than that you shared with someone else?
> >
>
> We only collect domains browsed whilst sharing your tabs on Firefox Hello
> (link generator side).
>
> >
> > What about Private Browsing mode, have you disabled collection there?
>
>
> Firefox Hello cannot be used with private browsing mode.
>
> >
> >
> > On 04/04/2016 10:01, Romain Testard wrote:
> >
> >> We would use a whitelist client-side to only collect domains that
> are
> >> part of the top 2000 domains (Alexa list of top domains). This
> >> prevents
> >> personal identification based on obscure domain usage.
> >>
> >
> > Mathematically, the combination of a set of (popular) domains shared
> could
> > still be uniquely identifying, especially as, AIUI, you will get the
> counts
> > of each domain and in what sequence they were visited / which ones were
> > visited in which session. It all depends on the number of unique users
> and
> > the number of domains they visit / share (not clear: see above). Because
> > the total number of Hello users compared with the number of Firefox users
> > is quite low, this still seems somewhat concerning to me. Have you tried
> to
> > remedy this in any way?
> >
>
> We are aggregating domain names, and are not storing session histories.
> These are submitted at the end of the session, so exact timestamps of any
> visit are not included.
>
> The beginning of your message mentioned that you were interested in
> > different "types" of sites. I don't think it would be necessary to
> optimize
> > Hello for one shopping site over another, or for one search engine over
> > another, or for one news site over another. So, why don't you categorize
> > the domains in the whitelist according to broad categories ("news",
> > "search", "shopping", "games", or something like this) on the client
> side,
> > and then send that information instead? If the set of domains is limited
> > (which it is) then this should not take that long, and get you exactly
> the
> > information you want, and limit the privacy invasion that the current
> > collection scheme represents.
> >
> > We looked into this approach originally although we found that we'd lose
> a
> level of granularity that can have an importance. We may find that Hello
> gets used a lot with a specific Website for a specific reason and using
> client side categories would prevent us from learning this. Also Alexa
> website categories are far from perfect which would add another level of
> complexity to understand the collected data.
>
>
> > 6 months also seems incredibly long. You should be able to aggregate the
> > data and keep that ("60% of users share on sites of type X") and throw
> away
> > the raw data much sooner than that.
> >
> Yes agreed, we'll look into what's the most optimal amount of time required
> to process the data and extract the useful information. I agree we should
> try to make this shorter - we'll learn from being on Beta and will adjust
> this accordingly.
>
> >
> > Finally, I am surprised that you're sharing this 2 weeks before we're
> > releasing Firefox 46. Hasn't this been tested and verified on Nightly
> > and/or other channels? Why was no privacy update made at/before that
> time?
> >
>

ad...@imgland.xyz

Apr 4, 2016, 11:25:04 AM
to Chris Hofmann, Romain Testard, dev-platform, Gijs Kruitbosch
I agree with chofmann that a simple survey request when users open Hello would probably work, since Mozilla is trusted by a lot of people.

04.04.2016, 16:22, "Chris Hofmann" <chof...@mozilla.com>:
> It also seems like you haven't explored other alternatives to get the data
> you are after, have some theories around what results you might expect, and
> what possible out comes will be pursed once you get the data.
>
> Have you looked at other studies like this and many more that tell about
> general browsing habits?
> http://www.adweek.com/socialtimes/online-time/463670
>
> Have you looked at just doing a simple survey to ask people to tell you
> what kinds of activities they most use when sharing sites with hello?
>
> If the survey or data collection results tell you that some people play
> games against each other *and* some people shop together what will you do
> then?
>
> -chofmann
>
> On Mon, Apr 4, 2016 at 3:01 AM, Romain Testard <rom...@mozilla.com> wrote:
>
>>  The privacy review bug is
>>  https://bugzilla.mozilla.org/show_bug.cgi?id=1261467.
>>  More details added below.
>>
>>  On Mon, Apr 4, 2016 at 11:23 AM, Gijs Kruitbosch <gijskru...@gmail.com
>>  >
>>  wrote:
>>
>>  > Hi,
>>  >
>>  > It's very concerning to me that you have not answered the obvious
>>  > question: what domains are collected? All of the ones visited while the
>>  > browser is running? The ones visited while Hello is open? The ones
>>  visited
>>  > while shared through Hello? What about the ones that someone shared with
>>  > you through Hello, rather than that you shared with someone else?
>>  >
>>
>>  We only collect domains browsed whilst sharing your tabs on Firefox Hello
>>  (link generator side).
>>
>>  >
>>  > What about Private Browsing mode, have you disabled collection there?
>>
>>  Firefox Hello cannot be used with private browsing mode.
>>
>>  >
>>  >
>>  > On 04/04/2016 10:01, Romain Testard wrote:
>>  >
>>  >> We would use a whitelist client-side to only collect domains that
>>  are
>>  >> part of the top 2000 domains (Alexa list of top domains). This
>>  >> prevents
>>  >> personal identification based on obscure domain usage.
>>  >>
>>  >
>>  > Mathematically, the combination of a set of (popular) domains shared
>>  could
>>  > still be uniquely identifying, especially as, AIUI, you will get the
>>  counts
>>  > of each domain and in what sequence they were visited / which ones were
>>  > visited in which session. It all depends on the number of unique users
>>  and
>>  > the number of domains they visit / share (not clear: see above). Because
>>  > the total number of Hello users compared with the number of Firefox users
>>  > is quite low, this still seems somewhat concerning to me. Have you tried
>>  to
>>  > remedy this in any way?
>>  >
>>
>>  We are aggregating domain names, and are not storing session histories.
>>  These are submitted at the end of the session, so exact timestamps of any
>>  visit are not included.
>>
>>  The beginning of your message mentioned that you were interested in
>>  > different "types" of sites. I don't think it would be necessary to
>>  optimize
>>  > Hello for one shopping site over another, or for one search engine over
>>  > another, or for one news site over another. So, why don't you categorize
>>  > the domains in the whitelist according to broad categories ("news",
>>  > "search", "shopping", "games", or something like this) on the client
>>  side,
>>  > and then send that information instead? If the set of domains is limited
>>  > (which it is) then this should not take that long, and get you exactly
>>  the
>>  > information you want, and limit the privacy invasion that the current
>>  > collection scheme represents.
>>  >
>>  > We looked into this approach originally although we found that we'd lose
>>  a
>>  level of granularity that can have an importance. We may find that Hello
>>  gets used a lot with a specific Website for a specific reason and using
>>  client side categories would prevent us from learning this. Also Alexa
>>  website categories are far from perfect which would add another level of
>>  complexity to understand the collected data.
>>
>>  > 6 months also seems incredibly long. You should be able to aggregate the
>>  > data and keep that ("60% of users share on sites of type X") and throw
>>  away
>>  > the raw data much sooner than that.
>>  >
>>  Yes agreed, we'll look into what's the most optimal amount of time required
>>  to process the data and extract the useful information. I agree we should
>>  try to make this shorter - we'll learn from being on Beta and will adjust
>>  this accordingly.
>>
>>  >
>>  > Finally, I am surprised that you're sharing this 2 weeks before we're
>>  > releasing Firefox 46. Hasn't this been tested and verified on Nightly
>>  > and/or other channels? Why was no privacy update made at/before that
>>  time?
>>  >
>>

Gijs Kruitbosch

Apr 4, 2016, 11:44:22 AM
to Romain Testard
On 04/04/2016 11:01, Romain Testard wrote:
> The privacy review bug is
> https://bugzilla.mozilla.org/show_bug.cgi?id=1261467.
> More details added below.

See response at the bottom.

> On Mon, Apr 4, 2016 at 11:23 AM, Gijs Kruitbosch <gijskru...@gmail.com>
> wrote:
>> On 04/04/2016 10:01, Romain Testard wrote:
>>
>>> We would use a whitelist client-side to only collect domains that are
>>> part of the top 2000 domains (Alexa list of top domains). This
>>> prevents
>>> personal identification based on obscure domain usage.
>>>
>>
>> Mathematically, the combination of a set of (popular) domains shared could
>> still be uniquely identifying, especially as, AIUI, you will get the counts
>> of each domain and in what sequence they were visited / which ones were
>> visited in which session. It all depends on the number of unique users and
>> the number of domains they visit / share (not clear: see above). Because
>> the total number of Hello users compared with the number of Firefox users
>> is quite low, this still seems somewhat concerning to me. Have you tried to
>> remedy this in any way?
>>
>
> We are aggregating domain names, and are not storing session histories.
> These are submitted at the end of the session, so exact timestamps of any
> visit are not included.

But both Firefox and Hello sessions are commonly relatively short (<1d)
and numerous. That means lots of data points, which will likely be
enough to uniquely identify people even without exact timestamps of
their visits. (FWIW, from a technical perspective, there is no reason
why the submission time implies ("so") that exact timestamps of visits
are not included.)

>> We looked into this approach originally although we found that we'd lose a
> level of granularity that can have an importance. We may find that Hello
> gets used a lot with a specific Website for a specific reason and using
> client side categories would prevent us from learning this.

This was explicitly not in your original motivation, so you're moving
the goalposts here. If the goal is about separate categories or separate
sites then those are pretty distinct goals that require different
approaches. If the real point is "we have no idea, so we figured we'd
just get the data and then go from there", why not be upfront about it?
But in that case, yeah, why not consider a survey or something less
intrusive, like asking people explicitly what type of site they were
using, or asking if Mozilla can use the domain in question?

> Also Alexa
> website categories are far from perfect which would add another level of
> complexity to understand the collected data.

At no point did I say I expected you to use their categorization,
whatever that is. Categorize as you see fit, rather than as Alexa does it.

Conversely, if their categorization is questionable, then your scrubbing
of the Adult category sounds like it might need auditing? Also, why not
other categories like "Banking" or "Medical" (NB: no idea what
categorization Alexa employs, but these seem like categories that ought
to be scrubbed, too)?


>> 6 months also seems incredibly long. You should be able to aggregate the
>> data and keep that ("60% of users share on sites of type X") and throw away
>> the raw data much sooner than that.
>>
> Yes agreed, we'll look into what's the most optimal amount of time required
> to process the data and extract the useful information. I agree we should
> try to make this shorter - we'll learn from being on Beta and will adjust
> this accordingly.

Well, why not make it 1 week to start with, and make it longer if you
don't get enough information from beta (with a rationale as to why that
is the case)?

>> Finally, I am surprised that you're sharing this 2 weeks before we're
>> releasing Firefox 46. Hasn't this been tested and verified on Nightly
>> and/or other channels? Why was no privacy update made at/before that time?
>>
>
> We are shipping Hello through Go Faster. The Go Faster process allows us to
> uplift directly to Beta 46 directly since we're a system add-on
> (development was done about 2 weeks ago).
> Firefox Hello has its own privacy notice (details here
> <https://www.mozilla.org/en-US/privacy/firefox-hello/>).

But shipping through Go Faster does not absolve you from adequately
testing changes and getting feedback on them. Is the add-on not getting
tested on nightly at all? Or at the same time as it goes to beta? When
will it be used on release - when 46 ships as release, or earlier, or later?

It also seems like you filed the privacy review after the functionality
was implemented and is now shipping, which per
https://wiki.mozilla.org/Privacy/Reviews seems like it is too late to
incorporate meaningful feedback. I'm not on the privacy team, but that
order looks wrong to me.

Finally, that privacy policy at no point says anything about Mozilla
having access to visited/shared domains and thereby potentially to
personally identifying information.

~ Gijs

ad...@imgland.xyz

Apr 4, 2016, 11:50:27 AM
to Gijs Kruitbosch, dev-pl...@lists.mozilla.org
I don't know much about Mozilla's privacy policies, but in my opinion this needs to be removed from Firefox immediately and a new beta build pushed.

04.04.2016, 16:45, "Gijs Kruitbosch" <gijskru...@gmail.com>:

Georg Fritzsche

Apr 4, 2016, 1:14:34 PM
to Gijs Kruitbosch, dev-pl...@lists.mozilla.org, Romain Testard
On Mon, Apr 4, 2016 at 5:44 PM, Gijs Kruitbosch <gijskru...@gmail.com>
wrote:

> It also seems like you filed the privacy review after the functionality was
> implemented and is now shipping, which per
> https://wiki.mozilla.org/Privacy/Reviews seems like it is too late to
> incorporate meaningful feedback. I'm not on the privacy team, but that
> order looks wrong to me.
>

We have a common data collection review process for Firefox now (with
additional, more intensive privacy reviews where needed):
https://wiki.mozilla.org/Firefox/Data_Collection

The idea is definitely to request approval before landing new data
collections.
For more complex collections it is helpful to start communications in the
design phase to catch problems before implementation (maybe this happened
through other channels here?).

Georg

Mark Banner

Apr 4, 2016, 2:25:16 PM
On 04/04/2016 16:49, ad...@imgland.xyz wrote:
> I don't know much about Mozilla's privacy but in my opinion feel the
> need to immediately remove it from Firefox and push a new beta build

I can understand your concern, however, please understand that this
logging functionality is currently disabled by default - see the
"loop.logDomains" preference.

We won't be enabling it until the privacy review is completed.

If you wish to inspect and validate my assertion, you are quite welcome to.

Here's a link to the code so that you can see what is currently on beta

http://mxr.mozilla.org/mozilla-beta/search?string=loop.logDomains&find=&findi=&filter=%5E%5B%5E%5C0%5D*%24&hitlimit=&tree=mozilla-beta

You can also see from the test file there, that we have a test to check
that nothing is logged if the pref is set to false.
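
The gating described above can be sketched as follows. This is Python for
illustration only, not the add-on's JavaScript; the preference name
loop.logDomains comes from this message, while the prefs dictionary, the
DomainLogger class and its method names are hypothetical.

```python
from typing import Dict, List


class DomainLogger:
    """Records domains only while the gating preference is true."""

    PREF_NAME = "loop.logDomains"  # name taken from the thread

    def __init__(self, prefs: Dict[str, bool]) -> None:
        self._prefs = prefs
        self.logged: List[str] = []

    def log(self, domain: str) -> bool:
        # Default to False so a missing preference means "do not log".
        if not self._prefs.get(self.PREF_NAME, False):
            return False
        self.logged.append(domain)
        return True


# Mirrors the kind of check mentioned above: nothing is logged
# when the preference is set to false.
logger = DomainLogger({"loop.logDomains": False})
logger.log("google.com")
assert logger.logged == []
```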

Mark.

ad...@imgland.xyz

Apr 4, 2016, 3:48:48 PM
to Mark Banner, dev-pl...@lists.mozilla.org

However, it is concerning to have code in an open source project that:
1. Is mostly undocumented
2. Could be confusing to privacy-aware users
3. Harvests data without proper privacy notices
4. Has been added prematurely without proper documentation and serves no purpose in the way that, say, dom.webcomponents.enabled does in making the browser more standards-compliant or up to date with an early W3C specification

In my opinion, this sort of feature should be held back from any mainstream release until it is clear to the end user exactly what it does. Judging from your messages, you still do not appear to have a real focus for it, which I find very worrying. I find that clear documentation is important, and its absence is my only gripe with a lot of products right now.

04.04.2016, 19:30, "Mark Banner" <mba...@mozilla.com>:

Ian Bicking

Apr 4, 2016, 4:35:48 PM
to Gijs Kruitbosch, dev-pl...@lists.mozilla.org
On Mon, Apr 4, 2016 at 10:44 AM, Gijs Kruitbosch <gijskru...@gmail.com>
Yes, if an attacker has access to cross-domain tracking for several sites
that a user visits, and that attacker can access the reporting in transit,
it may be possible to correlate, thus finding the rest of the whitelisted
history, and some associated Firefox Hello data. But that's only in the
case of an attack. The actual data sent to the logging pipeline is
immediately pulled out of a session list and submitted as individual items,
and all other data (e.g., IP address) is left out of this logging.
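
As a rough illustration of that step (a Python sketch under assumed field
names, not the actual pipeline code), splitting a session report into
unlinked per-domain items might look like this. The keys "session_id", "ip",
"domains", "domain" and "count" and the function name to_log_items are all
hypothetical.

```python
from typing import Any, Dict, List


def to_log_items(session_report: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Split one session report into per-domain items with no shared identifier."""
    items: List[Dict[str, Any]] = []
    for entry in session_report.get("domains", []):
        # Copy only the domain and its count; the session id, IP address
        # and any other fields are deliberately not carried over.
        items.append({"domain": entry["domain"], "count": entry["count"]})
    return items


report = {
    "session_id": "abc123",
    "ip": "203.0.113.7",
    "domains": [
        {"domain": "google.com", "count": 2},
        {"domain": "yahoo.com", "count": 1},
    ],
}
items = to_log_items(report)
```

Each resulting item carries nothing that ties it back to the other items from
the same session.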


>
> We looked into this approach originally although we found that we'd lose a
>>>
>> level of granularity that can have an importance. We may find that Hello
>> gets used a lot with a specific Website for a specific reason and using
>> client side categories would prevent us from learning this.
>>
>
> This was explicitly not in your original motivation, so you're moving the
> goalposts here. If the goal is about separate categories or separate sites
> then those are pretty distinct goals that require different approaches. If
> the real point is "we have no idea, so we figured we'd just get the data
> and then go from there", why not be upfront about it?


We are looking for clues about how people are using Hello, and using
domains as one way to understand this. So yes, it is exploratory, and we
are looking for insight we have not yet received, rather than a more binary
signal such as do people use Hello for shopping or not.

For example, two domains that are on the whitelist: steampowered.com and
steamcommunity.com – these would both typically be categorized as "gaming",
but they represent very different use cases (store vs. discussion). Or
aa.com, tripadvisor.com, and expedia.com are all travel sites, but
represent different (but overlapping) use cases.

But in that case, yeah, why not consider a survey or something less
> intrusive, like asking people explicitly what type of site they were using,
> or asking if Mozilla can use the domain in question ?


Asking people what site they were using seems challenging. Do we suggest
types? Will people acknowledge the full path of sites they used? How much
do we have to annoy people with questions in order to get a large enough
sample? Will it ever be a representative sample? Even if we do work to
address these, how can we tell if we have done so if we don't have real
usage data to compare to?

As fully implemented, including the backend collection which further
aggregates the information, I believe we are not collecting private or
personally revealing information. If we ask users to opt-in to collection
I don't think we can accurately explain to users the limits of what we are
collecting (especially at that moment when we are interrupting what they
are doing), and I think it will make it appear that we are trying to
collect personal information that we are not.


>
> Also Alexa
>> website categories are far from perfect which would add another level of
>> complexity to understand the collected data.
>>
>
> At no point did I say I expected you to use their categorization, whatever
> that is. Categorize as you see fit, rather than as Alexa does it.
>
> Conversely, if their categorization is questionable, then your scrubbing
> of the Adult category sounds like it might need auditing? Also, why not
> other categories like "Banking" or "Medical" (NB: no idea what
> categorization Alexa employs, but these seem like categories that ought to
> be scrubbed, too)?
>

For filtering out adult sites we used a well-maintained blacklist. Alexa
categorization seems to be based on dmoz, which is very out of date –
browsing the categories feels like being sent back in time to a younger
internet. It seemed reasonable to add items to the list based on that
categorization, but it's otherwise a very poor categorization.


> 6 months also seems incredibly long. You should be able to aggregate the
>>> data and keep that ("60% of users share on sites of type X") and throw
>>> away
>>> the raw data much sooner than that.
>>>
>>> Yes agreed, we'll look into what's the most optimal amount of time
>> required
>> to process the data and extract the useful information. I agree we should
>> try to make this shorter - we'll learn from being on Beta and will adjust
>> this accordingly.
>>
>
> Well, why not make it 1 week to start with, and make it longer if you
> don't get enough information from beta (with a rationale as to why that is
> the case) ?
>

The way tab sharing now works in Hello is a new experience, and we don't
expect that it has found its niche yet, or that people have decided
how they want to use Hello. Capturing it for 1 week now is unlikely to
show us how people successfully use Hello, and seeing when the
data settles around particular use cases requires us to track
it over time.


>
> Finally, I am surprised that you're sharing this 2 weeks before we're
>>> releasing Firefox 46. Hasn't this been tested and verified on Nightly
>>> and/or other channels? Why was no privacy update made at/before that
>>> time?
>>>
>>>
>> We are shipping Hello through Go Faster. The Go Faster process allows us
>> to uplift directly to Beta 46 since we're a system add-on
>> (development was done about 2 weeks ago).
>> Firefox Hello has its own privacy notice (details here
>> <https://www.mozilla.org/en-US/privacy/firefox-hello/>).
>>
>
> But shipping through go faster does not absolve you from adequately
> testing changes and getting feedback on them. Is the add-on not getting
> tested on nightly at all? Or at the same time as it goes to beta? When will
> it be used on release - when 46 ships as release, or earlier, or later?
>
> It also seems like you filed the privacy review after the functionality
> was implemented and is now shipping, which per
> https://wiki.mozilla.org/Privacy/Reviews seems like it is too late to
> incorporate meaningful feedback. I'm not on the privacy team, but that
> order looks wrong to me.
>

This is my fault – we began discussion of this collection many months ago
with people from data stewardship and legal through less formal channels,
and I didn't follow up with a formal privacy review bug. I agree it is not
the correct order.

Note that while implemented, the functionality is currently pref'd off.

--
Ian Bicking | Engineering Manager | Hello | Mozilla

Chris Hofmann

Apr 4, 2016, 9:33:12 PM
to Ian Bicking, dev-platform, Gijs Kruitbosch
On Mon, Apr 4, 2016 at 1:35 PM, Ian Bicking <ibic...@mozilla.com> wrote:

> On Mon, Apr 4, 2016 at 10:44 AM, Gijs Kruitbosch <gijskru...@gmail.com
> >
> wrote:
>
> <snip>

I put some comments about data bias against international users
and the .edu domains and possible data leaks that could result directly in
the bug.


> >
> > We looked into this approach originally although we found that we'd lose
> a
> >>>
> >> level of granularity that can have an importance. We may find that Hello
> >> gets used a lot with a specific Website for a specific reason and using
> >> client side categories would prevent us from learning this.
> >>
> >
> > This was explicitly not in your original motivation, so you're moving the
> > goalposts here. If the goal is about separate categories or separate
> sites
> > then those are pretty distinct goals that require different approaches.
> If
> > the real point is "we have no idea, so we figured we'd just get the data
> > and then go from there", why not be upfront about it?
>
>
> We are looking for clues about how people are using Hello, and using
> domains as one way to understand this. So yes, it is exploratory, and we
> are looking for insight we have not yet received, rather than a more binary
> signal such as do people use Hello for shopping or not.
>
> For example, two domains that are on the whitelist: steampowered.com and
> steamcommunity.com – these would both typically be categorized as
> "gaming",
> but they represent very different use cases (store vs. discussion). Or
> aa.com, tripadvisor.com, and expedia.com are all travel sites, but
> represent different (but overlapping) use cases.
>
> But in that case, yeah, why not consider a survey or something less
> > intrusive, like asking people explicitly what type of site they were
> using,
> > or asking if Mozilla can use the domain in question ?
>
>
> Asking people what site they were using seems challenging.


Yeah, but if the intent is to gather insight into how people are using, or
could use, Hello for shared browsing, that seems an indirect and possibly
error-prone set of data to gather without the context of what users are
trying to accomplish.


> Do we suggest
> types? Will people acknowledge the full path of sites they used? How much
> do we have to annoy people with questions in order to get a large enough
> sample? Will it ever be a representative sample? Even if we do work to
> address these, how can we tell if we have done so if we don't have real
> usage data to compare to?
>

Simply ask users what type of task they are trying to accomplish.

Maybe give them a set of hints about the standard things that people do
on-line.

With that you could start to understand some interesting use cases that
hello could be made to address in a better way, or some possible use cases
that might be in conflict and may need multiple approaches to support.
(e.g. your game playing against v. shopping together use case)

You really don't want to know particular sites people visit, you want to
understand what it is they are trying to accomplish, right?

>
> As fully implemented, including the backend collection which further
> aggregates the information, I believe we are not collecting private or
> personally revealing information. If we ask users to opt-in to collection
> I don't think we can accurately explain to users the limits of what we are
> collecting (especially at that moment when we are interrupting what they
> are doing), and I think it will make it appear that we are trying to
> collect personal information that we are not.
>

Yes, it's easy to understand and believe that it's our intention not to
collect personal information.

But that intention avoids the underlying question of whether browsing history
is personal data, or is considered by our users to be such.

As it's been said in the past, the road to leaking user personal data is
paved with good intentions.

The survey approach gets us a more direct path to the data we need for
designing a better co-browsing experience or experiences, since we
would never really design a feature against a particular site or URL, but
are looking for a common behavior pattern that lends itself to
streamlining across many sites and applications.


>
>
> >
> > Also Alexa
> >> website categories are far from perfect which would add another level of
> >> complexity to understand the collected data.
> >>
> >
> > At no point did I say I expected you to use their categorization,
> whatever
> > that is. Categorize as you see fit, rather than as Alexa does it.
> >
> > Conversely, if their categorization is questionable, then your scrubbing
> > of the Adult category sounds like it might need auditing? Also, why not
> > other categories like "Banking" or "Medical" (NB: no idea what
> > categorization Alexa employs, but these seem like categories that ought
> to
> > be scrubbed, too)?
> >
>
> For filtering out adult sites we used a well-maintained blacklist. Alexa
> categorization seems to be based on dmoz, which is very out of date –
> browsing the categories feels like being sent back in time to a younger
> internet. It seemed reasonable to add items to the list based on that
> categorization, but it's otherwise a very poor categorization.
>
>
Yeah, but the question remains: why are we scrubbing adult sites?

Is it because we expect this not to be a particular co-browsing behavior
that Hello users might be participating in?

That seems the only valid reason.

Sure, collecting, storing, and potentially leaking that data opens up a set
of problems that we don't want to deal with, but other similar categories
within data collection could, and are likely to, exist in the whitelist.


>
> > 6 months also seems incredibly long. You should be able to aggregate the
> >>> data and keep that ("60% of users share on sites of type X") and throw
> >>> away
> >>> the raw data much sooner than that.
> >>>
> >>> Yes agreed, we'll look into what's the most optimal amount of time
> >> required
> >> to process the data and extract the useful information. I agree we
> should
> >> try to make this shorter - we'll learn from being on Beta and will
> adjust
> >> this accordingly.
> >>
> >
> > Well, why not make it 1 week to start with, and make it longer if you
> > don't get enough information from beta (with a rationale as to why that
> is
> > the case) ?
> >
>
> The way tab sharing now works in Hello is a new experience, and we both
> don't expect that it has found its niche yet, nor that people have decided
> how they want to use Hello. Capturing it for 1 week now is unlikely to
> show us how people successfully use Hello, and in order to see when the
> data seems to be settling around particular use cases requires us to track
> it over time.
>

Sounds like still another source of bias in the data.

If we understand that hello is in an early state where we observe lots of
user churn we could expect to get lots of URLs where people might be
experimenting with sending to others rather than well developed use cases
where a frequent user is persistent in the type of activity that they've
found useful.

To find persistent users vs. one-time experimental users you will need to
profile the usage and the users.

That puts you back at associating site visits with individuals to really
understand some interesting long term situations to try to optimize for.

Better to just ask if they use hello a lot, and ask what kinds of things
seem useful to do with it if you are looking for early adopter feedback
that might stand the test of time and be useful to mainstream users of the
future.

-chofmann

Randell Jesup

Apr 5, 2016, 11:28:07 AM
>The privacy review bug is
>https://bugzilla.mozilla.org/show_bug.cgi?id=1261467.
>More details added below.
>> On 04/04/2016 10:01, Romain Testard wrote:
>>
>>> We would use a whitelist client-side to only collect domains that are
>>> part of the top 2000 domains (Alexa list of top domains). This
>>> prevents
>>> personal identification based on obscure domain usage.
>>>
>>
>> Mathematically, the combination of a set of (popular) domains shared could
>> still be uniquely identifying, especially as, AIUI, you will get the counts
>> of each domain and in what sequence they were visited / which ones were
>> visited in which session. It all depends on the number of unique users and
>> the number of domains they visit / share (not clear: see above). Because
>> the total number of Hello users compared with the number of Firefox users
>> is quite low, this still seems somewhat concerning to me. Have you tried to
>> remedy this in any way?
>>
>
>We are aggregating domain names, and are not storing session histories.
>These are submitted at the end of the session, so exact timestamps of any
>visit are not included.

There's been a bunch of surprises over the last few years where
"anonymized" data turned out to be de-anonymizable. This is the sort of
data that feels like it could lead to surprises. I think this would
need more looks by someone who actually understands that and where those
risks come from (not me).

There are added risks if you include the case of someone using our data
*and* data from one or more 3rd-party sites, and that's not easy to
reason about, which is why this needs careful consideration.

>> Finally, I am surprised that you're sharing this 2 weeks before we're
>> releasing Firefox 46. Hasn't this been tested and verified on Nightly
>> and/or other channels? Why was no privacy update made at/before that time?
>>
>
>We are shipping Hello through Go Faster. The Go Faster process allows us to
>uplift directly to Beta 46 since we're a system add-on
>(development was done about 2 weeks ago).
>Firefox Hello has its own privacy notice (details here
><https://www.mozilla.org/en-US/privacy/firefox-hello/>).

Since the collection is not enabled currently anywhere, how known-stable
is it for beta? Having the code in a disabled state safely is one
thing; having it known to be safe to turn on is another.

--
Randell Jesup, Mozilla Corp
remove "news" for personal email

Chris Hofmann

Apr 5, 2016, 12:05:50 PM
to Romain Testard, dev-platform, Gijs Kruitbosch
On Mon, Apr 4, 2016 at 3:01 AM, Romain Testard <rom...@mozilla.com> wrote:

>
> Firefox Hello has its own privacy notice (details here
> <https://www.mozilla.org/en-US/privacy/firefox-hello/>).
>
>
It's unclear to me after reading the follow-through link to the
TokBox Privacy Policy. -> https://tokbox.com/support/privacy-policy

Does TokBox already have access to the contents of the messages and URLs
that might have been shared?

The TokBox policy says:

The types of information collected include your name, e-mail address, and
any other data you actively choose to provide.

and leaves it vague about the definition of "other data you actively
provide." Does that include shared URLs and message content?

This passage in https://www.mozilla.org/en-US/privacy/firefox-hello/ also
would lead me to believe that the contents of my communication with another
user (including shared URLs) are encrypted (and would be private).

We've just invested heavily in making this point and trying to make the
association that encryption means strong privacy and vice versa.
https://blog.mozilla.org/blog/2016/03/30/everyday-internet-users-can-stand-up-for-encryption-heres-how/

How are we going to address the possible take away that some will have that
we've just created a backdoor for parts (shared urls that are part of the
message content) of the hello encrypted message channel if we turn this
change on?

ad...@imgland.xyz

Apr 5, 2016, 1:09:47 PM
to Chris Hofmann, Romain Testard, dev-platform, Gijs Kruitbosch
I think this should be abandoned in favour of an optional survey for Hello Users

05.04.2016, 17:06, "Chris Hofmann" <chof...@mozilla.com>:

Joseph Lorenzo Hall

Apr 5, 2016, 2:51:40 PM
to Chris Hofmann, dev-platform, Romain Testard, Gijs Kruitbosch
On Tue, Apr 5, 2016 at 12:05 PM, Chris Hofmann <chof...@mozilla.com> wrote:
> This passage in https://www.mozilla.org/en-US/privacy/firefox-hello/ also
> would lead me to believe that the contents of my communication with another
> user (including shared URLs) are encrypted (and would be private).
>
> We've just invested heavily in making this point and trying to make the
> association that encryption means strong privacy and vice versa.
> https://blog.mozilla.org/blog/2016/03/30/everyday-internet-users-can-stand-up-for-encryption-heres-how/

As an outside lurker on dev-platform but a big fan of Mozilla's data
stewardship folks, this is the core of the issue for me. WebRTC
conversations should be assumed to be highly private and any
exfiltration on the client without explicit opt-in seems very
dangerous. I'm not saying it should never be done but it should be
very very important and done very very carefully. I don't get the
sense that this data is that crucial to innovative Hello features. You
could opt-in folks to the study just-in-time using tab sharing. I know
that clobbers the UX but if it's that important I think you need to
take that hit given the sensitivity of real-time comms.

--
Joseph Lorenzo Hall
Chief Technologist, Center for Democracy & Technology [https://www.cdt.org]
e: j...@cdt.org, p: 202.407.8825, pgp: https://josephhall.org/gpg-key
Fingerprint: 3CA2 8D7B 9F6D DBD3 4B10 1607 5F86 6987 40A9 A871

CDT's annual dinner, Tech Prom, is April 6, 2016! https://cdt.org/annual-dinner