On Mon, Apr 4, 2016 at 1:35 PM, Ian Bicking <ibic...@mozilla.com> wrote:
> On Mon, Apr 4, 2016 at 10:44 AM, Gijs Kruitbosch <gijskru...@gmail.com>
> wrote:
>
> > <snip> I put some comments about data bias against international users
> > and the .edu domains, and possible data leaks that could result,
> > directly in the bug.
> >
> >> We looked into this approach originally, although we found that we'd
> >> lose a level of granularity that can be important. We may find that
> >> Hello gets used a lot with a specific website for a specific reason,
> >> and using client-side categories would prevent us from learning this.
> >
> > This was explicitly not in your original motivation, so you're moving
> > the goalposts here. If the goal is about separate categories or
> > separate sites, then those are pretty distinct goals that require
> > different approaches. If the real point is "we have no idea, so we
> > figured we'd just get the data and then go from there", why not be
> > upfront about it?
>
>
> We are looking for clues about how people are using Hello, and using
> domains as one way to understand this. So yes, it is exploratory, and we
> are looking for insight we have not yet received, rather than a more
> binary signal such as whether people use Hello for shopping or not.
>
> For example, two domains that are on the whitelist: steampowered.com and
> steamcommunity.com – these would both typically be categorized as
> "gaming", but they represent very different use cases (store vs.
> discussion). Or aa.com, tripadvisor.com, and expedia.com are all travel
> sites, but represent different (but overlapping) use cases.
>
> > But in that case, yeah, why not consider a survey or something less
> > intrusive, like asking people explicitly what type of site they were
> > using, or asking if Mozilla can use the domain in question?
>
>
> Asking people what site they were using seems challenging.
Yeah, but if the intent is to gather insight around how people are, or
could be, using Hello for shared browsing, that seems an indirect and
possibly error-prone set of data to gather without context of what users
are trying to accomplish.
> Do we suggest
> types? Will people acknowledge the full path of sites they used? How much
> do we have to annoy people with questions in order to get a large enough
> sample? Will it ever be a representative sample? Even if we do work to
> address these, how can we tell if we have done so if we don't have real
> usage data to compare to?
>
Simply ask users what type of task they are trying to accomplish.
Maybe give them a set of hints about the standard things that people do
online.
With that you could start to understand some interesting use cases that
Hello could be made to address in a better way, or some possible use cases
that might be in conflict and may need multiple approaches to support
(e.g. your playing-a-game-against v. shopping-together use case).
You really don't want to know the particular sites people visit; you want
to understand what it is they are trying to accomplish, right?
>
> As fully implemented, including the backend collection which further
> aggregates the information, I believe we are not collecting private or
> personally revealing information. If we ask users to opt-in to collection
> I don't think we can accurately explain to users the limits of what we are
> collecting (especially at that moment when we are interrupting what they
> are doing), and I think it will make it appear that we are trying to
> collect personal information that we are not.
>
Yes, it's easy to understand and believe that it's our intention not to
collect personal information.
But that intention avoids the underlying question of whether browsing
history is personal data, or is considered by our users to be such.
As it's been said in the past, the road to leaking user personal data is
paved with good intentions.
The survey approach gets us a more direct path to the data we need for
designing a better co-browsing experience or experiences, since we would
never really design a feature against a particular site or URL, but are
really looking for a common behavior pattern that lends itself to
streamlining across many sites and applications.
>
>
> >
> >> Also, Alexa website categories are far from perfect, which would add
> >> another level of complexity to understanding the collected data.
> >
> > At no point did I say I expected you to use their categorization,
> > whatever that is. Categorize as you see fit, rather than as Alexa does
> > it.
> >
> > Conversely, if their categorization is questionable, then your
> > scrubbing of the Adult category sounds like it might need auditing?
> > Also, why not other categories like "Banking" or "Medical" (NB: no
> > idea what categorization Alexa employs, but these seem like categories
> > that ought to be scrubbed, too)?
> >
>
> For filtering out adult sites we used a well-maintained blacklist. Alexa
> categorization seems to be based on dmoz, which is very out of date –
> browsing the categories feels like being sent back in time to a younger
> internet. It seemed reasonable to add items to the list based on that
> categorization, but it's otherwise a very poor categorization.
>
>
Yeah, but the question remains: why are we scrubbing adult sites?
Is it because we expect this not to be a particular co-browsing behavior
that Hello users might be participating in?
That seems the only valid reason.
Sure, collecting, storing, and potentially leaking that data opens up a
set of problems that we don't want to deal with, but other similarly
sensitive categories could, and are likely to, exist in the whitelist.
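To make that concern concrete, here's a minimal sketch of what auditing
the whitelist against several sensitive-category lists (not just the
adult one) might look like. All domains and category lists below are
invented for illustration, not the actual lists in use:

```python
# Hypothetical sketch: audit a domain whitelist against multiple
# sensitive-category lists, not only the "adult" one.
# Every domain and category here is made up for illustration.

SENSITIVE = {
    "banking": {"examplebank.com", "examplecreditunion.org"},
    "medical": {"examplehealth.com", "exampleclinic.org"},
}

def flag_sensitive(whitelist):
    """Return {domain: category} for whitelisted domains that also
    appear in a sensitive-category list."""
    flagged = {}
    for domain in whitelist:
        for category, listed in SENSITIVE.items():
            if domain in listed:
                flagged[domain] = category
    return flagged

whitelist = ["steampowered.com", "examplebank.com", "tripadvisor.com"]
print(flag_sensitive(whitelist))  # {'examplebank.com': 'banking'}
```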
>
> >>> 6 months also seems incredibly long. You should be able to aggregate
> >>> the data and keep that ("60% of users share on sites of type X") and
> >>> throw away the raw data much sooner than that.
> >>
> >> Yes, agreed, we'll look into the optimal amount of time required to
> >> process the data and extract the useful information. I agree we should
> >> try to make this shorter - we'll learn from being on Beta and will
> >> adjust this accordingly.
> >
> > Well, why not make it 1 week to start with, and make it longer if you
> > don't get enough information from beta (with a rationale as to why that
> > is the case)?
> >
>
> The way tab sharing now works in Hello is a new experience, and we expect
> neither that it has found its niche yet nor that people have decided how
> they want to use Hello. Capturing it for 1 week now is unlikely to show
> us how people successfully use Hello, and seeing when the data settles
> around particular use cases requires us to track it over time.
>
Sounds like still another source of bias in the data.
If we understand that Hello is in an early state where we observe lots of
user churn, we could expect to get lots of URLs where people are
experimenting with sending to others, rather than well-developed use
cases where a frequent user is persistent in the type of activity they've
found useful.
To find persistent users v. one-time experimental users you will need to
profile the usage and the users.
That puts you back at associating site visits with individuals to really
understand some interesting long-term situations to try to optimize for.
Better to just ask if they use Hello a lot, and ask what kinds of things
seem useful to do with it, if you are looking for early-adopter feedback
that might stand the test of time and be useful to mainstream users of
the future.
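For what it's worth, the aggregate-then-discard approach Gijs suggested
could be as simple as this sketch: raw shared-domain events are collapsed
into per-category proportions, after which the raw list can be thrown
away. The category mapping here is invented for illustration; a real one
would need curation:

```python
from collections import Counter

# Hypothetical domain -> category mapping, invented for illustration.
CATEGORY = {
    "steampowered.com": "gaming store",
    "steamcommunity.com": "gaming discussion",
    "tripadvisor.com": "travel",
    "expedia.com": "travel",
}

def aggregate(shared_domains):
    """Collapse raw shared-domain events into per-category proportions,
    so the raw domain list can be discarded quickly."""
    counts = Counter(CATEGORY.get(d, "other") for d in shared_domains)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

raw = ["tripadvisor.com", "expedia.com", "steampowered.com", "example.org"]
print(aggregate(raw))
# {'travel': 0.5, 'gaming store': 0.25, 'other': 0.25}
```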
-chofmann