Approximating the median

Rick Viscomi

unread,

Feb 1, 2018, 4:11:11 PM2/1/18

to chrome-ux-report

CrUX provides the entire distribution so it's relatively straightforward to write queries to answer questions like "what percent of FCP are under 1 second?"

#standardSQL

SELECT

SUM(bin.density) AS fast_fcp

FROM

`chrome-ux-report.all.201712`,

UNNEST(first_contentful_paint.histogram.bin) AS bin

WHERE

origin = 'https://example.com' AND

bin.start < 1000

Someone recently asked me how to approximate the median. This is still possible but slightly less approachable. So here's how I would query for the median: https://bigquery.cloud.google.com/savedquery/920398604589:5cd19bea92d347fb92a6d8d109e6ba71

#standardSQL

SELECT

origin,

ect,

MIN(start) AS median_fcp

FROM (

SELECT

origin,

effective_connection_type.name AS ect,

bin.start,

bin.density,

SUM(bin.density) OVER (PARTITION BY origin, effective_connection_type.name ORDER BY bin.start) AS cdf,

SUM(bin.density) OVER (PARTITION BY origin, effective_connection_type.name) AS total

FROM

`chrome-ux-report.all.201712`,

UNNEST(first_contentful_paint.histogram.bin) AS bin

WHERE

NET.REG_DOMAIN(origin) = 'example.com')

WHERE

cdf / total >= 0.5

GROUP BY

origin,

ect

ORDER BY

median_fcp

It's a bit of overkill because it shows medians for multiple origins and ECTs, so you could simplify as needed. For example, if you don't need a variable origin or ECT the query is shorter:

#standardSQL

SELECT

MIN(start) AS median_fcp

FROM (

SELECT

bin.start,

SUM(bin.density) OVER (ORDER BY bin.start) AS cdf

FROM

`chrome-ux-report.all.201712`,

UNNEST(first_contentful_paint.histogram.bin) AS bin

WHERE

origin = 'https://example.com')

WHERE

cdf >= 0.5

If any SQL experts know of an even cleaner way, I'd love to learn how you'd do it!

Message has been deleted

Rick Viscomi

unread,

Jul 2, 2018, 11:14:48 AM7/2/18

to chrome-ux-report

Exactly!

On Monday, July 2, 2018 at 11:04:53 AM UTC-4, joe....@mavens.co.uk wrote:

Hi Rick

This is very cool. If I understand the query correctly, to get a different percentile, such as the 80th, would I just change ">= 0.5" to ">= 0.8"?

Regards,

Joe

Vinicius Fortuna

unread,

Jan 19, 2019, 2:26:03 PM1/19/19

to chrome-ux-report

That's very useful, thanks!

Is it safe to aggregate densities across ects and form factors?

I thought the density would add up to 1 for a fixed (table, origin, effective connection type, form factor) tuple.

Vinicius Fortuna

Rick Viscomi

unread,

Feb 4, 2019, 4:26:50 PM2/4/19

to chrome-ux-report

Hi Vinicius,

Yes, it's safe to aggregate densities across ECT/device grouped by origin. The sum of all ECT/device tuples for each origin is 1.

Message has been deleted

Rick Viscomi

unread,

May 2, 2019, 11:39:30 AM5/2/19

to chrome-ux-report

Hi Joe,

The gotcha is that you've removed the origin from the WHERE clause, so now you're aggregating distributions for all origins. To find the median in a distribution of distributions we need to aggregate each origin somehow. So in this query' I take the "average" distribution by adding up all of the densities for each bin and dividing those by the total density to compute the pdf. Then I accumulate those into the cdf and take the smallest bin where the cdf is greater than 50%:

#standardSQL
SELECT
  MIN(start)
FROM (
  SELECT
    start,
    SUM(pdf) OVER (ORDER BY start) AS cdf
  FROM (
    SELECT
      start,
      n / SUM(n) OVER () AS pdf
    FROM (
      SELECT
        bin.start,
        SUM(bin.density) AS n
      FROM
        `chrome-ux-report.country_in.201903`,
        UNNEST(onload.histogram.bin) AS bin
      WHERE
        form_factor.name = 'phone'
      GROUP BY
        start)))
WHERE
  cdf >= 0.5

The result is 4600. Hope that helps.

Rick

On Friday, April 26, 2019 at 9:24:42 AM UTC-4, joe....@mavens.co.uk wrote:

Hi Rick

I can't see a way around a problem. I want to determine the median load time for specific devices in specific countries. The query I'm using is set out below, but I get an error message about resource usage. Is there a way around the problem?

QUERY:

#standardSQL

SELECT

MIN(start) AS Load

FROM (

SELECT

bin.start,

SUM(bin.density) OVER (ORDER BY bin.start) AS cdf

FROM

`chrome-ux-report.country_in.201903`,

UNNEST(onLoad.histogram.bin) AS bin

WHERE form_factor.name = "phone"

)

WHERE

cdf >= 0.5

ERROR MESSAGE:

Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 115% of limit. Top memory consumer(s): sort operations used for analytic OVER() clauses: 98% other/unattributed: 2%

On Thursday, 1 February 2018 21:11:11 UTC, Rick Viscomi wrote:

Message has been deleted

Rick Viscomi

unread,

May 13, 2019, 1:29:15 PM5/13/19

to chrome-ux-report

Hi Joe,

The density field in each bin represents the percent of all experiences on the origin. To get a percent of experiences using only the phone form factor, you'll need to divide each density by the sum of all phone densities.

For example, if a bin is 0.02, that's 2% of all experiences. If the phone form factor as a whole makes up 10% of the origin, then we can normalize the bin's density to 0.02 / 0.1 = 0.2 or 20% of phone experiences.

The earlier query is a good starting point, we just need to add the origin to the WHERE clause and change the country dataset:

#standardSQL
SELECT
  MIN(start)
FROM (
  SELECT
    start,
    SUM(pdf) OVER (ORDER BY start) AS cdf
  FROM (
    SELECT
      start,
      n / SUM(n) OVER () AS pdf
    FROM (
      SELECT
        bin.start,
        SUM(bin.density) AS n
      FROM

        `chrome-ux-report.country_au.201903`,
        UNNEST(onload.histogram.bin) AS bin
      WHERE
        origin = 'https://www.dove.com' AND

        form_factor.name = 'phone'
      GROUP BY
        start)))
WHERE
  cdf >= 0.5

I get a result of 3900.

On Monday, May 6, 2019 at 9:54:41 AM UTC-4, joe....@mavens.co.uk wrote:

Related to my previous query...when running a query to find a specific speed percentile, in a specific country for a specific origin and specific device type using the query below, I sometimes get a null result. If I change the percentile, I often get a result in those cases. I've also tried a different table date and I can use that to get a result sometimes. Is the problem a lack of data? is there a way to overcome the problem, such as rewrite the query to include a range of tables? If so, how should the query be written. Your help is much appreciated.

#standardSQL
SELECT
MIN(start) AS median_load

FROM (
SELECT
bin.start,
SUM(bin.density) OVER (ORDER BY bin.start) AS cdf
FROM

`chrome-ux-report.country_au.201903`,
UNNEST(onLoad.histogram.bin) AS bin
WHERE
origin = "https://www.dove.com" and form_factor.name = "phone")
WHERE
cdf >= 0.5

Reply all

Reply to author

Forward