A new Google cluster trace of 8 clusters from May 2019 is now available


john wilkes

Apr 1, 2020, 7:07:59 PM
to googlecluste...@googlegroups.com
We are pleased to announce the release of a new Google cluster trace.  Please kick the tires, and let us know what you think - especially if you have any difficulty accessing the data.

The data is made available on Google BigQuery, which requires a Google account. (Our apologies in advance for people for whom this may cause issues.)  Be careful about making queries across the entire dataset to avoid racking up too many Cloud charges.
  john wilkes

_______________________________________________________________

Google’s Borg cluster management system supports our computational fleet, and underpins almost every Google service.  For example, the machines that host the Google Doc used for drafting this post are managed by Borg, as are those that run Google’s cloud computing products. That makes the Borg system, as well as its workload, of great interest to researchers and practitioners.


Eight years ago Google published a 29-day trace of a Google Borg compute cluster from May 2011.  The trace included a record for every job submission, scheduling decision, and resource usage data for all the jobs in that cluster. That trace has enabled a wide range of research on advancing the state-of-the-art for cluster schedulers and cloud computing, and has been used to generate hundreds of analyses and studies. But in the years since the 2011 trace was made available, machines and software have evolved, workloads have changed, and the importance of workload variance has become even clearer.


To help researchers explore these changes themselves, we are releasing a new trace dataset for the month of May 2019 covering eight Google compute clusters. This new dataset is both larger and more extensive than the 2011 one, and now includes:
  • CPU usage information histograms for each 5-minute period, not just a point sample;
  • information about alloc sets (shared resource reservations used by jobs);
  • job-parent information for master/worker relationships such as MapReduce jobs.
Just like the last trace, these new ones focus on resource requests and usage, and contain no information about end users, their data, or access patterns to storage systems and other services. 


In addition to providing a downloadable format, we are also making the trace data available via Google BigQuery so that sophisticated analyses can be performed without requiring local resources. This site provides access instructions and a detailed description of what the traces contain.


A first-pass analysis of differences between the 2011 and 2019 traces will appear in this paper: 

Muhammad Tirmazi, Adam Barker, Nan Deng, Md E. Haque, Zhijing Gene Qin, Steven Hand, Mor Harchol-Balter, and John Wilkes. Borg: the Next Generation. In Fifteenth European Conference on Computer Systems (EuroSys ’20), April 27–30, 2020, Heraklion, Greece. ACM, New York, NY, USA. https://doi.org/10.1145/3342195.3387517  

We hope this data will facilitate even more research into cluster management. Do let us know if you find it useful, publish papers that use it, develop tools that analyze it, or have suggestions for how to improve it.

Acknowledgements: I’d especially like to thank our intern Muhammad Tirmazi, and my colleagues Nan Deng, Md Ehtesam Haque, Zhijing Gene Qin, Steve Hand and Visiting Researcher Adam Barker for doing the heavy lifting of preparing the new trace set.

jiangcm CHUNMAO

Apr 2, 2020, 11:11:04 AM
to Google cluster data - discussions

Hi there,
I am interested in the data, but I don't know how to download it the way I could with cluster-data_2011.
Thanks

On Wednesday, April 1, 2020 at 5:07:59 PM UTC-6, john wilkes wrote:

john wilkes

Apr 2, 2020, 2:41:16 PM
to googlecluste...@googlegroups.com
Hi. The new traces are about 2.8 TiB in size, so we don't think that downloading them is the best option.  Please see the trace documentation for more info.
  john

--
You received this message because you are subscribed to the "Google cluster data - discussions" group. To post to this group, send email to googlecluste...@googlegroups.com. To unsubscribe from this group, send email to googleclusterdata-...@googlegroups.com. For more options, visit this group at https://groups.google.com/d/forum/googleclusterdata-discuss?hl=en-US.
---
You received this message because you are subscribed to the Google Groups "Google cluster data - discussions" group.
To unsubscribe from this group and stop receiving emails from it, send an email to googleclusterdata-...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/googleclusterdata-discuss/bf24529d-4e55-4441-bb78-b5061d5b3590%40googlegroups.com.

jiangcm CHUNMAO

Apr 3, 2020, 6:27:24 PM
to 'john wilkes' via Google cluster data - discussions
Thanks, but I really do need to download it, even though I know it is large.
Google cannot be accessed from China!
Thanks again.



john wilkes

Apr 3, 2020, 6:30:03 PM
to Google cluster data - discussions
Hi.  Apologies, but that isn't possible right now.  Feel free to take advantage of the 2011 traces.
  john

EHT-E-SHAM Sham

Apr 4, 2020, 5:11:20 AM
to Google cluster data - discussions
Hi Team, please publish a step-by-step procedure for downloading the v3 traces. We are having trouble working out how to download them. I have a Google BigQuery account, but I am having trouble writing a query to extract information from Google Cluster Trace v3.

EHT-E-SHAM Sham

Apr 4, 2020, 11:17:52 AM
to Google cluster data - discussions
@John wilkes please share the scripts, or just a hint on how to download it.

john wilkes

Apr 4, 2020, 6:17:46 PM
to googlecluste...@googlegroups.com
Hi.  This trace is not designed to be downloaded.  Please use and explore it in BigQuery.  If that's not a good fit for your use case, do feel free to take advantage of the 2011 trace.
  john


Nan Deng

Apr 6, 2020, 1:51:08 AM
to googlecluste...@googlegroups.com
Sham, would you share the query you used against the dataset? A screenshot would make it easier for us to help you with the problem.

Jiangcm, I'm sorry that you cannot access Google services from mainland China. Technically, you can export BigQuery results into CSV or other formats: https://cloud.google.com/bigquery/docs/exporting-data But be careful with this operation, because the results could be very large. I know accessing BigQuery may not be possible from China, and I'm not sure whether there is any technical and legal way to overcome this problem (e.g. a VPN?). At the end of the day, we are from Google and we need to host the data on a Google server. If all Google servers are inaccessible from China, there is really not much we can do on our end.
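For readers who can reach BigQuery, one way to save a bounded query result as CSV is to call the bq tool from Python. This is only a sketch: the helper names build_export_command and export_to_csv are hypothetical, and it assumes bq is installed and authenticated. It uses bq's global --format=csv flag and the query command's --max_rows flag to cap how many rows are printed.

```python
import subprocess


def build_export_command(sql, max_rows=1000):
    """Build a `bq` call that prints query results as CSV (hypothetical helper)."""
    return [
        "bq",
        "--format=csv",            # global flag: print CSV instead of a pretty table
        "query",
        "--use_legacy_sql=false",
        f"--max_rows={max_rows}",  # cap on rows printed, guards against huge results
        sql,
    ]


def export_to_csv(sql, path, max_rows=1000, run=subprocess.run):
    """Run the query and save the CSV text to `path`; assumes `bq` is installed."""
    result = run(build_export_command(sql, max_rows),
                 check=True, capture_output=True, text=True)
    with open(path, "w", newline="") as f:
        f.write(result.stdout)
    return path
```

The row cap only limits what is printed, so adding a LIMIT or a narrow WHERE clause to the query itself is still advisable. For very large results, the export mechanisms described at the link above are likely a better fit than capturing printed output.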


morooi

Apr 21, 2020, 4:01:18 AM
to Google cluster data - discussions
I'm having a problem using BigQuery as described in the documentation. When I try to read this data set, the prompt is as follows:

bq show google.com:google-cluster-data:2019-05-a.machine_events

BigQuery error in show operation: Not found: Dataset google.com:google-cluster-data:2019-05-a


This is my first time using BigQuery. How should I read the new Google cluster trace?

On Thursday, April 2, 2020 at 7:07:59 AM UTC+8, john wilkes wrote:

Christophe Maudoux

Apr 22, 2020, 5:26:08 AM
to Google cluster data - discussions
Hello,

I have been facing the same issue for two days...

Nan Deng

Apr 22, 2020, 4:53:33 PM
to googlecluste...@googlegroups.com
Morooi,

I believe you got the table name wrong. The following command works when I use my personal account:

 bq show google.com:google-cluster-data:clusterdata_2019_a.machine_events
 Table google.com:google-cluster-data:clusterdata_2019_a.machine_events

Note that it's an underscore (_), not a hyphen (-), in the table name. I'm sorry; the document might be wrong.



Nan Deng

Apr 22, 2020, 5:33:47 PM
to googlecluste...@googlegroups.com
I believe it's better to provide some example queries.

You can use the bq command-line tool for this purpose. If you want to deal with larger query results, I highly recommend using Google Colab to post-process the results: https://colab.research.google.com/

Counting the number of unique machines in cluster a:
bq query --use_legacy_sql=false 'SELECT COUNT(DISTINCT machine_id) FROM `google.com:google-cluster-data`.clusterdata_2019_a.machine_events'
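Given the announcement's caution about Cloud charges, it may be worth checking how much data a query would scan before running it. The sketch below, written in Python since that is what Colab notebooks use, builds the same bq query call with the --dry_run flag added; a dry run validates the query and reports the bytes it would process without executing it. The helper name build_bq_command is hypothetical.

```python
import shlex

# Table reference as used in this thread (Standard SQL, project id in backticks).
MACHINE_EVENTS = "`google.com:google-cluster-data`.clusterdata_2019_a.machine_events"


def build_bq_command(sql, dry_run=True):
    """Return the argument list for a `bq query` call (hypothetical helper)."""
    cmd = ["bq", "query", "--use_legacy_sql=false"]
    if dry_run:
        # --dry_run validates the query and reports bytes processed without running it.
        cmd.append("--dry_run")
    cmd.append(sql)
    return cmd


sql = f"SELECT COUNT(DISTINCT machine_id) FROM {MACHINE_EVENTS}"
dry_cmd = build_bq_command(sql)
# Print a shell-quoted command that can be pasted into a terminal.
print(shlex.join(dry_cmd))
```

Once the reported size looks acceptable, calling build_bq_command(sql, dry_run=False) gives the command that actually runs the query.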


In principle, each machine id should have a single capacity (i.e. a machine should not suddenly "grow" a new CPU core, or lose a memory chip). In the real world, however, such things do happen, though rarely. Let's check the number of machines that have more than one capacity. The query is a bit complicated because we need to nest queries, but I hope it is easy to understand. Here is the query:

-- Outer query: count machines with more than one distinct CPU or memory capacity.
SELECT COUNT(DISTINCT machine_id) AS num_abnormal_machines
FROM (
  -- Inner query: per machine, count the distinct positive capacities seen.
  SELECT machine_id,
         COUNT(DISTINCT capacity.cpus) AS num_distinct_cpu_cap,
         COUNT(DISTINCT capacity.memory) AS num_distinct_memory_cap
  FROM `google.com:google-cluster-data`.clusterdata_2019_a.machine_events
  WHERE capacity.cpus > 0 AND capacity.memory > 0
  GROUP BY 1)
WHERE num_distinct_cpu_cap > 1 OR num_distinct_memory_cap > 1

Again, you can use bq to run this query:

bq query --use_legacy_sql=false \
'SELECT COUNT(DISTINCT machine_id) AS num_abnormal_machines
FROM (
  SELECT machine_id,
         COUNT(DISTINCT capacity.cpus) AS num_distinct_cpu_cap,
         COUNT(DISTINCT capacity.memory) AS num_distinct_memory_cap
  FROM `google.com:google-cluster-data`.clusterdata_2019_a.machine_events
  WHERE capacity.cpus > 0 AND capacity.memory > 0
  GROUP BY 1)
WHERE num_distinct_cpu_cap > 1 OR num_distinct_memory_cap > 1'

From now on, I will just use the query itself without mentioning bq. You can pick your favorite front end to run the query.

It should return one, meaning we found one machine that has more than one capacity. So which machine is it? With a little change of the previous query, you can find out the machine id:

SELECT machine_id
FROM (
  SELECT machine_id,
         COUNT(DISTINCT capacity.cpus) AS num_distinct_cpu_cap,
         COUNT(DISTINCT capacity.memory) AS num_distinct_memory_cap
  FROM `google.com:google-cluster-data`.clusterdata_2019_a.machine_events
  WHERE capacity.cpus > 0 AND capacity.memory > 0
  GROUP BY 1)
WHERE num_distinct_cpu_cap > 1 OR num_distinct_memory_cap > 1

It should return a single machine with machine_id=35872531884. Let's find what's going on with this machine:

SELECT time, type, capacity.cpus, capacity.memory
FROM `google.com:google-cluster-data`.clusterdata_2019_a.machine_events
WHERE machine_id=35872531884
ORDER BY 1

You will see that the machine starts with a memory capacity of 0.33349609375, is later updated to a new memory capacity of 0.25, and is then changed back to 0.33349609375 by another update event. This tells us that part of the machine's memory might have been damaged or removed for a brief period. Not an extremely exciting discovery, but it is the kind of thing that happens to machines. Fortunately, this happens so rarely in the trace data that we can almost always assume machines don't change their capacity.
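For anyone post-processing exported rows in Colab, the logic of the nested query can be mirrored in a few lines of plain Python. This is only a sketch over made-up rows: the two memory values for machine 35872531884 come from this post, while the CPU value 0.5 and the second machine are invented for illustration, and the function name is hypothetical.

```python
from collections import defaultdict


def machines_with_changed_capacity(events):
    """Return ids of machines reporting more than one distinct CPU or memory capacity.

    Mirrors the nested query: keep rows with positive capacities, then count
    the distinct values seen per machine.
    """
    cpus = defaultdict(set)
    memory = defaultdict(set)
    for machine_id, cpu_cap, mem_cap in events:
        if cpu_cap is None or mem_cap is None:
            continue  # a SQL comparison with NULL is not true, so such rows drop out
        if cpu_cap > 0 and mem_cap > 0:
            cpus[machine_id].add(cpu_cap)
            memory[machine_id].add(mem_cap)
    return sorted(m for m in cpus if len(cpus[m]) > 1 or len(memory[m]) > 1)


# Rows are (machine_id, capacity.cpus, capacity.memory) in time order.
events = [
    (35872531884, 0.5, 0.33349609375),
    (35872531884, 0.5, 0.25),           # memory capacity drops
    (35872531884, 0.5, 0.33349609375),  # and is later restored
    (11111111111, 0.5, 0.25),           # invented machine with a stable capacity
    (11111111111, 0.5, 0.25),
]
print(machines_with_changed_capacity(events))  # prints [35872531884]
```

Doing the aggregation in BigQuery, as in the queries above, is still preferable for the full trace; a local version like this is mainly useful for checking small exported samples.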




john wilkes

Apr 22, 2020, 10:54:16 PM
to googlecluste...@googlegroups.com
Hi all.  Sorry about the mistaken table names in the v3 document.  A new version with corrected values has been pushed out, plus a couple of simple queries; you can find it linked to from the 2019 trace documentation page.
  john

morooi

Apr 22, 2020, 11:55:24 PM
to Google cluster data - discussions
Thank you very much, I have successfully queried the data according to the latest document you provided and there are no errors.