non-aggregated NDT data from M-Lab’s BigQuery


Lizaveta Radzevich

Aug 3, 2021, 10:46:03 AM
to discuss

Hi,

I’m trying to access non-aggregated NDT data from M-Lab’s BigQuery. I’ve tried public datasets before (set up a project, created a service account and key), but it seems like I’m missing some steps to be able to get the NDT data. I think you need to add my service account to your access group. You can use my account from the discussion group: lizaveta...@gmail.com. I’ll attach my code and the error message I’m getting, so it would be great if you could let me know whether my theory about the service account is wrong!

Thank you,

Lizaveta
[Attachment: Screenshot 2021-08-03 073738.png]

Chris Ritzo

Aug 3, 2021, 10:52:53 AM
to discuss, lizaveta...@gmail.com
Hi Lizaveta,
Thanks for writing about this issue. While your membership on this group provides access to the BigQuery datasets using this email, using a service account requires an additional step. As you surmised, that step is getting your service account added to the access group.

To add the service account, you will need to look for its ID within your project or in the service account key file itself. It looks like an email address, similar to:
<project>@<project>.iam.gserviceaccount.com

You can email that privately to sup...@measurementlab.net and we will add the account to the access group. After that, the error you shared should go away and you can use the service account in your application.

Best,
Chris - M-Lab Support

Bradley Kalgovas

Aug 3, 2021, 5:41:02 PM
to discuss, Chris Ritzo, lizaveta...@gmail.com
Hi Chris,

We want to obtain the data for each test from the `measurement-lab.ndt.unified_downloads` table. However, when we try to get the data out, we get errors that the extract size is too big to load into Google Drive. Would it be possible to have 15 minutes on the phone to talk about how to extract it without having to pay $2k for slots? Thanks!

Bradley

Chris Ritzo

Aug 4, 2021, 9:13:57 AM
to discuss, b.ra...@gmail.com, lizaveta...@gmail.com
Hi Bradley,

If you need to pull individual tests from our datasets, you will likely need to incur some costs; M-Lab subsidizes the queries only. However, there are some techniques I can recommend that might also be helpful to others on the group.

My first suggestion is to limit the number of individual test results you export in some way, for example by date range or by specific geographies. Even if you already have a specific focus, the data may still be too large to export to Google Drive.

Another option is to save your query results to a BigQuery table in your Google Cloud Project, then export the table contents to Cloud Storage as CSVs, and you can then download the CSVs to work with in your analyses. This link provides some overview of options for extracting data: https://cloud.google.com/bigquery/docs/exporting-data
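As a sketch of that table-to-Cloud-Storage route, BigQuery's `EXPORT DATA` statement can write query results straight to a bucket as sharded CSVs. The project, bucket, fields, and date filter below are placeholders, not a recommendation of what to export:

```sql
-- Sketch only: 'my_bucket' and the SELECT list are placeholders.
-- EXPORT DATA writes the query results to Cloud Storage as sharded CSV files
-- (the '*' in the URI is required when output may exceed 1 GB).
EXPORT DATA OPTIONS(
  uri = 'gs://my_bucket/ndt/downloads-*.csv',
  format = 'CSV',
  header = true,
  overwrite = true
) AS
SELECT
  date,
  a.MeanThroughputMbps,
  a.MinRTT,
  client.Geo.CountryCode
FROM `measurement-lab.ndt.unified_downloads`
WHERE date BETWEEN '2021-01-01' AND '2021-01-31'
  AND client.Geo.CountryCode = 'DE';
```

This avoids the intermediate step of saving to a table first, though saving to a table and then using the console's export option works just as well.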

You might also consider working with a subset of individual test rows in a series of test cases to refine your workflow. An initial, exploratory analysis could give you a sense of the individual fields and what you want to do with them. If that includes specific analyses that could be done within your query, you could then return to BigQuery and do that analysis in SQL to limit the final export you need to a smaller size.

However, if you are using other analysis tools and want to work with individual rows, then you may need to incur some costs to do the export. Some analysis tools, RStudio for example, have support for BigQuery and could be used to extract data.

I hope this is helpful.

Best regards,
Chris

Lizaveta Radzevich

Aug 9, 2021, 6:12:13 PM
to discuss, Chris Ritzo, Lizaveta Radzevich
Hi Chris,

So far Bradley and I have been able to aggregate the NDT data down to an exportable size. Now I'm having trouble getting the median value (since BigQuery SQL does not have a direct median function). Can you help me spot the mistake in the following query? I'm trying to get a result set grouped by NUTS 3 code and date. BigQuery shows a syntax error on the WITHIN GROUP clause, but I don't know another way to get median values.

SELECT 
    date,
    avg(a.MeanThroughputMbps) as Mbps, 
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY a.MeanThroughputMbps) OVER (PARTITION BY LEVL_CODE, date) AS MedianMbps,
    max(a.MeanThroughputMbps) as maxMbps,
    min(a.MeanThroughputMbps) as minMbps,
    count(a.MeanThroughputMbps) as number_speed_tests,
    avg(a.MinRTT) as latency,
    count(client.Network.ASNumber) as number_ISP,
    LEVL_CODE, 
FROM `measurement-lab.ndt.unified_downloads` as unified_downloads
JOIN `measurement-lab.geographies.eu_NUTS_3_2021_01m` as geo
ON ST_Within(ST_GeogPoint(client.Geo.Longitude, client.Geo.Latitude),geo.geometry)
WHERE date >= "2021-01-01" and client.Geo.CountryCode = "DE" 
group by LEVL_CODE, date

Thank you!

Chris Ritzo

Aug 9, 2021, 6:18:11 PM
to discuss, lizaveta...@gmail.com
Hi Lizaveta,

Median aggregation is achievable in BigQuery using the APPROX_QUANTILES function, and you can get other quantiles/percentiles with options to the same function. The sub-queries that calculate per-day statistics in our statistics pipeline service are instructive on the use of APPROX_QUANTILES. See this line for an example of median.
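As an illustration of how the earlier query might be adapted, here is a sketch: `APPROX_QUANTILES(x, 100)` returns 101 quantile boundaries, so `[OFFSET(50)]` approximates the median, and it works as an ordinary aggregate alongside AVG/MAX/MIN, so the window-function syntax is no longer needed. Field and table names follow the query above; note the ASN count is changed to `COUNT(DISTINCT ...)` on the assumption that distinct networks is the intent:

```sql
SELECT
  date,
  LEVL_CODE,
  AVG(a.MeanThroughputMbps) AS Mbps,
  -- APPROX_QUANTILES returns an array of 101 boundaries; OFFSET(50) is the median.
  APPROX_QUANTILES(a.MeanThroughputMbps, 100)[OFFSET(50)] AS MedianMbps,
  MAX(a.MeanThroughputMbps) AS maxMbps,
  MIN(a.MeanThroughputMbps) AS minMbps,
  COUNT(a.MeanThroughputMbps) AS number_speed_tests,
  AVG(a.MinRTT) AS latency,
  COUNT(DISTINCT client.Network.ASNumber) AS number_ISP
FROM `measurement-lab.ndt.unified_downloads`
JOIN `measurement-lab.geographies.eu_NUTS_3_2021_01m` AS geo
  ON ST_WITHIN(ST_GEOGPOINT(client.Geo.Longitude, client.Geo.Latitude), geo.geometry)
WHERE date >= '2021-01-01' AND client.Geo.CountryCode = 'DE'
GROUP BY LEVL_CODE, date
```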

I hope this is helpful.

Best regards,
Chris

Gregory Russell

Aug 10, 2021, 11:02:56 AM
to Lizaveta Radzevich, discuss, Chris Ritzo
Looks like PERCENTILE_CONT is also supported, but perhaps not with the syntax in your query.


Lizaveta Radzevich

Aug 11, 2021, 1:47:29 PM
to discuss, gfr1...@gmail.com, discuss, Chris Ritzo, Lizaveta Radzevich
Hi,

Now I'm trying to get the same data as in the query I shared, but per US county (the previous one was for NUTS 3). In short, I need the average download speed, max download speed, number of tests, and number of distinct ASNs per US county. I can get the first three from the `measurement-lab.statistics.v0_us_counties` table, but it has no data on ASNs. I could use the non-aggregated NDT table, but I don't see any county-level table in the geographies dataset (that's how I mapped the non-aggregated NDT data to NUTS 3). Is there a workaround to get the number of ASNs per county?

Chris Ritzo

Aug 11, 2021, 2:04:23 PM
to discuss, lizaveta...@gmail.com, discuss
Hello!

The tables for US County rely on the table, `bigquery-public-data.geo_us_boundaries.counties`, provided by Google's Public Datasets program.

At each geographic level in the statistics API and tables, we also provide aggregation by ASN. See this document for complete details, but in short, you can query the table: `measurement-lab.statistics.v0_us_counties_asn`.
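For the distinct-ASN count, a sketch against that ASN-level table might look like the following. The column names (`GEOID`, `asn`, `date`) are assumptions on my part, so check the table schema in the BigQuery console before running:

```sql
-- Sketch only: column names are assumed; inspect the schema of
-- `measurement-lab.statistics.v0_us_counties_asn` first.
SELECT
  GEOID,
  COUNT(DISTINCT asn) AS number_ISP
FROM `measurement-lab.statistics.v0_us_counties_asn`
WHERE date >= '2021-01-01'
GROUP BY GEOID
```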

One additional note about the geography tables in `measurement-lab.geographies`: these have been published temporarily for specific internal needs or collaborations and may eventually be removed. If you need tables such as eu_NUTS_3_2021_01m, we would advise creating them or copying them into your own Google Cloud Project.

Chris - M-Lab Support

Lizaveta Radzevich

Aug 23, 2021, 1:37:54 PM
to discuss, Chris Ritzo, Lizaveta Radzevich, discuss
Hi Chris,

Quick question regarding the US GIDs in the global_gadm36_2 table. I want to map those codes to FIPS; is there any way to do it?

Chris Ritzo

Aug 23, 2021, 5:23:59 PM
to discuss, lizaveta...@gmail.com, discuss
Hi Lizaveta,
I'm not very familiar with the GIDs in the gadm.org tables. However, MaxMind provides a CSV that maps FIPS to ISO 3166 on this page: https://dev.maxmind.com/geoip/whats-new-in-geoip2

M-Lab annotates NDT test data with the ISO 3166 standard now, but older vintages were annotated with FIPS. See this post for more details: https://www.measurementlab.net/blog/evolution-of-annotations/

Best, Chris