Re: Enumerated values and locales in telemetry

Brian Smith

Dec 26, 2011, 5:54:10 AM
to Henri Sivonen, dev-platform
Henri Sivonen wrote:
> I suspect some of our localizations have inappropriate defaults for
> the fallback character encoding. On the conceptual level, telemetry
> could be used to discover
> a) how often pages end up relying on the fallback character encoding
> b) what the fallback encoding is in those cases (i.e. has the user
> changed it and to what)
> c) how often users override the character encoding on a per-page basis
>
> It turns out there are two problems here:
> 1) Telemetry doesn't seem to have nice ready-made tools for dealing
> with enumerations
> 2) These metrics would only make sense to measure on a
> per-localization basis and there doesn't appear to be a way to do this

Regarding privacy, would we even want to include these two pieces of information in the telemetry data?:

* what the fallback encoding was and/or what it was changed to
* what the locale was

This is the kind of data that, in combination with other such data, would seem to cause trouble along the same lines as the Netflix Prize [1]. For example, if the locale is Romanian, then you have reduced the search space for an individual from 7,000,000,000 to 25,000,000. Let's say the platform is Android. Then, you've probably narrowed the search space down to less than 1,000 people. With just these two factors, you aren't far from isolating individual users from their telemetry data, even before you combine it with other factors. At least in theory, this could be cross-referenced with other datasets, such as Romanian-language tweets sent from the Twitter for Android app at times close to the times the Telemetry data was sent, to identify telemetry-providing users by name with a high degree of accuracy.

That doesn't seem to fit well with the privacy expectations we've set regarding Telemetry.

Cheers,
Brian

[1] http://en.wikipedia.org/wiki/Netflix_Prize#Privacy_concerns

Doug Turner

Dec 26, 2011, 12:30:47 PM
to Brian Smith, Sid Stamm, Henri Sivonen, dev-platform
Brian, you are right. Sid's teams should audit these closely. Our
privacy policy says:

"""
Beginning with version 7, Firefox includes functionality that is
turned off by default to send to Mozilla non-personal usage,
performance, and responsiveness statistics about user interface
features, memory, and hardware configuration.
"""

The quick list of many of the attributes we send is here:

http://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/TelemetryHistograms.h#58

We want to make sure we have reasons for every single thing we send.
If in doubt, we must not send that kind of data.

Doug

Henri Sivonen

Jan 16, 2013, 7:25:25 AM
to dev-platform
On Mon, Dec 26, 2011 at 12:54 PM, Brian Smith <bsm...@mozilla.com> wrote:
> Henri Sivonen wrote:
>> I suspect some of our localizations have inappropriate defaults for
>> the fallback character encoding. On the conceptual level, telemetry
>> could be used to discover
>> a) how often pages end up relying on the fallback character encoding
>> b) what the fallback encoding is in those cases (i.e. has the user
>> changed it and to what)
>> c) how often users override the character encoding on a per-page basis
>>
>> It turns out there are two problems here:
>> 1) Telemetry doesn't seem to have nice ready-made tools for dealing
>> with enumerations
>> 2) These metrics would only make sense to measure on a
>> per-localization basis and there doesn't appear to be a way to do this
>
> Regarding privacy, would we even want to include these two pieces of information in the telemetry data?:
>
> * what the fallback encoding was and/or what it was changed to
> * what the locale was

How do you suggest we answer the question: “Do we have the most
successful encoding defaults for all our localizations?”

> This is the kind of data that, in combination with other such data, would seem to cause trouble along the same lines as the Netflix Prize [1]. For example, if the locale is Romanian, then you have reduced the search space for an individual from 7,000,000,000 to 25,000,000. Let's say the platform is Android. Then, you've probably narrowed the search space down to less than 1,000 people. With just these two factors, you aren't far from isolating individual users from their telemetry data, even before you combine it with other factors. At least in theory, this could be cross-referenced with other datasets, such as Romanian-language tweets sent from the Twitter for Android app at times close to the times the Telemetry data was sent, to identify telemetry-providing users by name with a high degree of accuracy.

Do I understand correctly that the problem is that even if we stored
the locale-related telemetry data separately from other telemetry data
and threw out IP address and time stamp right away, we couldn’t prove
that to users? Actual ability to correlate the data could be removed
by discarding the originating IP address, the exact time and the
association with other telemetry data, right?
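
A minimal sketch of the separation described above, assuming a
hypothetical report format. The field names (ip, timestamp,
session_id, locale, fallback_encoding) are made up for illustration
and are not taken from the actual Telemetry payload. It drops the
linkable fields and keeps only aggregate counts:

```python
from collections import Counter

# Hypothetical field names; none of these come from the real Telemetry payload.
IDENTIFYING_FIELDS = {"ip", "timestamp", "session_id"}


def scrub(report):
    """Return a copy of the report without fields that could link it to other data."""
    return {k: v for k, v in report.items() if k not in IDENTIFYING_FIELDS}


def aggregate(reports):
    """Count (locale, fallback_encoding) pairs; only the counts are retained."""
    counts = Counter()
    for report in reports:
        clean = scrub(report)
        counts[(clean["locale"], clean["fallback_encoding"])] += 1
    return counts
```

This only illustrates the data flow; it does nothing to address
whether users could verify that the discarding actually happens.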

On Mon, Dec 26, 2011 at 7:30 PM, Doug Turner <doug....@gmail.com> wrote:
> Brian, you are right. Sid's teams should audit these closely. Our
> privacy policy says:
>
> """
> Beginning with version 7, Firefox includes functionality that is
> turned off by default to send to Mozilla non-personal usage,
> performance, and responsiveness statistics about user interface
> features, memory, and hardware configuration.
> """

The data I’m interested in would fall under “user interface features”
first and “performance” second.

I'd like to have answers to these questions, and I think telemetry
might be able to provide the answers:

1) Do we have the most successful encoding defaults for our locales?
The reason I want to know: by inspection, I suspect that the locales
turned up by the search
http://mxr.mozilla.org/l10n-central/search?string=charset.default\s*%3D\s*UTF-8&regexp=1&find=\.properties%24&findi=&filter=^[^\0]*%24&hitlimit=&tree=l10n-central
have inappropriate defaults, because the default encoding exists for
misauthored legacy content that predates UTF-8. Inappropriate defaults
may lead to user frustration or choosing another browser over Firefox.

2) Instead of having locale-specific defaults, could we decide the
fallback encoding based on the top-level domain name of the site?
The reason I want to know: Currently, the Web-exposed behavior of
Firefox depends on the localization. It's bad that the way sites work
depends on the UI language of the browser. In principle, you should be
able to read e.g. Russian-language sites as successfully with e.g.
Estonian-language Firefox as with a Russian-language Firefox.

3) Could we use a pan-Chinese encoding detector?
Currently, it appears that our zh-TW localization turns on the
universal detector. The universal detector has various problems (it
isn't actually universal), so if the use case is that Taiwanese users
often read both Traditional Chinese and Simplified Chinese legacy
content while also reading some English content, maybe a detector that
doesn't try to detect stuff as Cyrillic encodings could be more
successful and perform faster.

4) Do users actually use the Character Encoding menu enough to warrant
keeping that UI around? Can we get rid of the menu already?
Reasons why it would be nice to get rid of the menu:
* We shouldn't be signaling to Web authors that they have the option
to leave this problem to users instead of getting their authoring act
together.
* Less opportunity for users to introduce data corruption by using
the menu and then submitting a form.
* Less code complexity.
* Less code to maintain and fix. (I just wrote a fix in this area.
Even though the code changes weren't that difficult, writing the unit
tests was quite time-consuming.)
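
For question 2 above, a rough sketch of what a TLD-based fallback
could look like. The mapping table, the default, and the function name
are assumptions made for illustration; they are not proposed defaults
and not existing Gecko code:

```python
from urllib.parse import urlsplit

# Illustrative mapping only; these pairs are assumptions, not proposed defaults.
TLD_FALLBACK = {"ru": "windows-1251", "jp": "Shift_JIS", "tw": "Big5"}
DEFAULT_FALLBACK = "windows-1252"


def fallback_encoding_for(url):
    """Pick a fallback encoding from the top-level domain instead of the UI locale."""
    host = (urlsplit(url).hostname or "").rstrip(".")
    # The last dot-separated label is the TLD; an empty host yields "".
    tld = host.rsplit(".", 1)[-1]
    return TLD_FALLBACK.get(tld, DEFAULT_FALLBACK)
```

With this approach the result depends only on the site, so an
Estonian-language build and a Russian-language build would treat an
unlabeled .ru page the same way.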

Can answers to any of these questions be pursued with telemetry given
our policies? Has collecting impression data from the sin of the service
or the Metrics Data Ping changed the thinking on whether it's okay to
measure usage of Firefox features that correlate with language and/or
geography?

Alternatives to telemetry that I can think of:

For #1: Instead of measuring our success, seeing what the most popular
browser in each locale does and doing the same.

For #2: Doing a massive Web crawl.

For #3: Shipping a pan-Chinese detector and seeing if anyone complains.

For #4: Measuring the menu usage frequency without measuring what gets
overridden and to what.

--
Henri Sivonen
hsiv...@iki.fi
http://hsivonen.iki.fi/