Approximating the median

314 views
Skip to first unread message

Rick Viscomi

unread,
Feb 1, 2018, 4:11:11 PM2/1/18
to chrome-ux-report
CrUX provides the entire distribution so it's relatively straightforward to write queries to answer questions like "what percent of FCP are under 1 second?"

#standardSQL
SELECT
  SUM(bin.density) AS fast_fcp
FROM
  `chrome-ux-report.all.201712`,
  UNNEST(first_contentful_paint.histogram.bin) AS bin
WHERE
  origin = 'https://example.com' AND
  bin.start < 1000

Someone recently asked me how to approximate the median. This is still possible but slightly less approachable. So here's how I would query for the median: https://bigquery.cloud.google.com/savedquery/920398604589:5cd19bea92d347fb92a6d8d109e6ba71

#standardSQL
SELECT
  origin,
  ect,
  MIN(start) AS median_fcp
FROM (
  SELECT
    origin,
    bin.start,
    bin.density,
    SUM(bin.density) OVER (PARTITION BY origin, effective_connection_type.name ORDER BY bin.start) AS cdf,
    SUM(bin.density) OVER (PARTITION BY origin, effective_connection_type.name) AS total
  FROM
    `chrome-ux-report.all.201712`,
    UNNEST(first_contentful_paint.histogram.bin) AS bin
  WHERE
    NET.REG_DOMAIN(origin) = 'example.com')
WHERE
  cdf / total >= 0.5
GROUP BY
  origin,
  ect
ORDER BY
  median_fcp


It's a bit of overkill because it shows medians for multiple origins and ECTs, so you could simplify as needed. For example, if you don't need a variable origin or ECT the query is shorter:

#standardSQL
SELECT
  MIN(start) AS median_fcp
FROM (
  SELECT
    bin.start,
    SUM(bin.density) OVER (ORDER BY bin.start) AS cdf
  FROM
    `chrome-ux-report.all.201712`,
    UNNEST(first_contentful_paint.histogram.bin) AS bin
  WHERE
    origin = 'https://example.com')
WHERE
  cdf >= 0.5

If any SQL experts know of an even cleaner way, I'd love to learn how you'd do it!
Message has been deleted

Rick Viscomi

unread,
Jul 2, 2018, 11:14:48 AM7/2/18
to chrome-ux-report
Exactly!

On Monday, July 2, 2018 at 11:04:53 AM UTC-4, joe....@mavens.co.uk wrote:
Hi Rick

This is very cool. If I understand the query correctly, to get a different percentile, such as the 80th, would I just change ">= 0.5" to ">= 0.8"?

Regards,

Joe

Vinicius Fortuna

unread,
Jan 19, 2019, 2:26:03 PM1/19/19
to chrome-ux-report
That's very useful, thanks!

Is it safe to aggregate densities across ects and form factors?
I thought the density would add up to 1 for a fixed (table, origin, effective connection type, form factor) tuple.

Vinicius Fortuna

Rick Viscomi

unread,
Feb 4, 2019, 4:26:50 PM2/4/19
to chrome-ux-report
Hi Vinicius,

Yes, it's safe to aggregate densities across ECT/device grouped by origin. The sum of all ECT/device tuples for each origin is 1.
Message has been deleted

Rick Viscomi

unread,
May 2, 2019, 11:39:30 AM5/2/19
to chrome-ux-report
Hi Joe,

The gotcha is that you've removed the origin from the WHERE clause, so now you're aggregating distributions for all origins. To find the median in a distribution of distributions we need to aggregate each origin somehow. So in this query' I take the "average" distribution by adding up all of the densities for each bin and dividing those by the total density to compute the pdf. Then I accumulate those into the cdf and take the smallest bin where the cdf is greater than 50%:

#standardSQL
SELECT
  MIN(start)
FROM (
  SELECT
    start,
    SUM(pdf) OVER (ORDER BY start) AS cdf
  FROM (
    SELECT
      start,
      n / SUM(n) OVER () AS pdf
    FROM (
      SELECT
        bin.start,
        SUM(bin.density) AS n
      FROM
        `chrome-ux-report.country_in.201903`,
        UNNEST(onload.histogram.bin) AS bin
      WHERE
        form_factor.name = 'phone'
      GROUP BY
        start)))
WHERE
  cdf >= 0.5


The result is 4600. Hope that helps.


Rick


On Friday, April 26, 2019 at 9:24:42 AM UTC-4, joe....@mavens.co.uk wrote:
Hi Rick

I can't see a way around a problem. I want to determine the median load time for specific devices in specific countries. The query I'm using is set out below, but I get an error message about resource usage. Is there a way around the problem?

QUERY:

#standardSQL

SELECT

  MIN(start) AS Load

FROM (

  SELECT

    bin.start,

    SUM(bin.density) OVER (ORDER BY bin.start) AS cdf

  FROM

    `chrome-ux-report.country_in.201903`,

    UNNEST(onLoad.histogram.bin) AS bin

    WHERE form_factor.name = "phone"

)

WHERE

  cdf >= 0.5


ERROR MESSAGE:
Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 115% of limit. Top memory consumer(s): sort operations used for analytic OVER() clauses: 98% other/unattributed: 2%


On Thursday, 1 February 2018 21:11:11 UTC, Rick Viscomi wrote:
Message has been deleted

Rick Viscomi

unread,
May 13, 2019, 1:29:15 PM5/13/19
to chrome-ux-report
Hi Joe,

The density field in each bin represents the percent of all experiences on the origin. To get a percent of experiences using only the phone form factor, you'll need to divide each density by the sum of all phone densities.

For example, if a bin is 0.02, that's 2% of all experiences. If the phone form factor as a whole makes up 10% of the origin, then we can normalize the bin's density to 0.02 / 0.1 = 0.2 or 20% of phone experiences.

The earlier query is a good starting point, we just need to add the origin to the WHERE clause and change the country dataset:

#standardSQL
SELECT
  MIN(start)
FROM (
  SELECT
    start,
    SUM(pdf) OVER (ORDER BY start) AS cdf
  FROM (
    SELECT
      start,
      n / SUM(n) OVER () AS pdf
    FROM (
      SELECT
        bin.start,
        SUM(bin.density) AS n
      FROM
        `chrome-ux-report.country_au.201903`,
        UNNEST(onload.histogram.bin) AS bin
      WHERE
        origin = 'https://www.dove.com' AND
        form_factor.name = 'phone'
      GROUP BY
        start)))
WHERE
  cdf >= 0.5

I get a result of 3900.


On Monday, May 6, 2019 at 9:54:41 AM UTC-4, joe....@mavens.co.uk wrote:
Related to my previous query...when running a query to find a specific speed percentile, in a specific country for a specific origin and specific device type using the query below, I sometimes get a null result. If I change the percentile, I often get a result in those cases. I've also tried a different table date and I can use that to get a result sometimes. Is the problem a lack of data? is there a way to overcome the problem, such as rewrite the query to include a range of tables? If so, how should the query be written. Your help is much appreciated.

#standardSQL
SELECT
  MIN(start) AS median_load
FROM (
  SELECT
    bin.start,
    SUM(bin.density) OVER (ORDER BY bin.start) AS cdf
  FROM
    `chrome-ux-report.country_au.201903`,
    UNNEST(onLoad.histogram.bin) AS bin
  WHERE
    origin = "https://www.dove.com" and form_factor.name = "phone")
WHERE
  cdf >= 0.5
Reply all
Reply to author
Forward
0 new messages