Hey everyone,
I've been working on a more efficient way to query CrUX data. Every release in the `all` dataset has been collated into a single table named
chrome-ux-report.experimental.global. Each release has its own "partition" with "clustering", which are BigQuery terms for more efficiently storing the data to reduce query costs.
For example, this query computes the % of fast LCP experiences for a given origin:
Query complete (0.8 sec elapsed, 97.8 GB processed)
Now compare that with a query over the experimental dataset:
SELECT
SUM(lcp.density) AS fast_lcp
FROM
`chrome-ux-report.experimental.global`,
UNNEST(largest_contentful_paint.histogram.bin) AS lcp
WHERE
yyyymm = 202002 AND
origin = '
https://example.com' AND
form_factor.name = 'phone' AND
effective_connection_type.name = '4G' AND
lcp.start < 2500
Query complete (1.5 sec elapsed, 39.4 MB processed)
It's a bit slower but processes 0.04% as much data, which is crucial for BigQuery users on the free tier.
Note that the second query has a WHERE clause of `yyyymm = 202002`. All queries over the experimental.global table require a condition like this to prevent accidentally incurring the full cost of the dataset. If you forget to include it, you'll get an error like this:
Cannot query over table 'chrome-ux-report.experimental.global' without a filter over column(s) 'yyyymm' that can be used for partition elimination
If you want to intentionally query all data, you can silence the error by adding ` WHERE yyyymm > 0`, which includes all monthly releases. For example:
Query complete (6.5 sec elapsed, 321.5 MB processed)
This query calculates the % of fast FID experiences for a given origin over time. This kind of query was
difficult if not impossible using the existing `chrome-ux-report.all` dataset because the FID metric has been graduated out of the experimental dataset, causing naming inconsistencies across releases. Using the partitioned dataset and its consistent metric naming schema, it can find an answer using less than a third of a gigabyte of data.
Known limitations:
- Not available for country datasets yet, but you can imagine similarly having a single `chrome-ux-report.experimental.country` table with a `country` field including all country data.
- Metrics that used to be in the `experimental` namespace are now collated under their latest name. For example `experimental.first_input_delay` can now be found under `first_input.delay`. This naming scheme may obscure when data was considered experimental and liable to change.
Please give the experimental data a try and let me know what you think. If it proves useful it may become the default way to extract CrUX data on BigQuery.
Rick