Update on the cluster trace v3


Nan Deng

Aug 10, 2020, 7:42:03 PM
to Google cluster data - discussions
Hi all,

I would like to announce some updates to the cluster trace v3, which was published early this year along with a paper at EuroSys 2020. Specifically, we fixed some minor bugs in the data and also provided a way for people to download the trace without using the BigQuery APIs. We will update our document soon to reflect the changes. Here is a list of the changes we made:

1. Added the "user" field in the collection_events table.

According to the document, the collection_events table should contain a string field called "user" to indicate the user who runs the collection. However, due to a bug in our program, this field was left empty. We have updated the data so that this field now identifies the owner of the collection, as a string containing the user name after obfuscation.

2. All timestamps are represented using int64, not uint64

Although BigQuery internally supports the uint64 type, it is not available in many external data formats. Specifically, Apache Avro only supports int64. Changing all timestamps from uint64 to int64 should not break anyone, as we still keep the values unchanged and int64 is sufficient to represent all the numbers we need. The benefit is that you can now directly export the data and/or query results to Avro format, which can later be analyzed by systems like Apache Drill, Apache Beam, etc.
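
To illustrate why the values are unaffected, here is a minimal sketch (not part of the trace tooling; the helper name is made up for illustration). It reinterprets the bytes of an unsigned 64-bit value as a signed one and shows that anything up to 2^63 - 1 comes back unchanged. The timestamp used is the "time" value from the machine_events sample row in this post.

```python
import struct

INT64_MAX = 2**63 - 1


def reinterpret_uint64_as_int64(value: int) -> int:
    """Pack value as an unsigned 64-bit integer and read the same bytes back as signed."""
    raw = struct.pack("<Q", value)
    return struct.unpack("<q", raw)[0]


# "time" value taken from the machine_events sample row in this post.
timestamp = 89703182129

# The value fits in int64, so the signed reading is identical to the unsigned one.
fits = timestamp <= INT64_MAX
unchanged = reinterpret_uint64_as_int64(timestamp) == timestamp

# Only values above INT64_MAX would change meaning (they wrap to negative numbers).
wrapped = reinterpret_uint64_as_int64(2**63)

print(fits, unchanged, wrapped)
```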

3. You can now download the data in .json format

We learned that querying BigQuery, although convenient, is expensive for many people. With the new change, the trace is now available for download in JSON format from Google Cloud Storage (GCS). Each cell's data is stored in its own bucket, whose name follows the pattern clusterdata_2019_${CELL}. For example, the data of cell a is stored in the bucket clusterdata_2019_a. Inside each bucket, there are 5 tables: collection_events, instance_events, machine_events, machine_attributes and instance_usage. Each table is sharded into one or more files. Each file contains newline-separated JSON strings compressed with gzip. The file names follow the pattern ${TABLE_NAME}-[0-9]+.json.gz, where the number following the table name is the shard's ID. Combining this with the bucket name gives the path of a given table. For example, the instance usage data for cell a is stored at gs://clusterdata_2019_a/instance_usage-*.json.gz, where * is a wildcard.

To download the data, one can use gsutil:


gsutil cp gs://clusterdata_2019_a/instance_usage-*.json.gz <destination dir>
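
As a rough sketch of the naming scheme described above, the following Python builds the wildcard path for a cell and table and extracts the shard ID from a file name. It uses the .json.gz extension confirmed later in this thread. The function names and the regular expression are illustrative assumptions, not part of any official tooling.

```python
import re

# The five tables listed in the announcement.
TABLES = (
    "collection_events",
    "instance_events",
    "machine_events",
    "machine_attributes",
    "instance_usage",
)

# Hypothetical pattern for one shard file: ${TABLE_NAME}-[0-9]+.json.gz
SHARD_RE = re.compile(r"^(?P<table>[a-z_]+)-(?P<shard>[0-9]+)\.json\.gz$")


def table_pattern(cell: str, table: str) -> str:
    """Return the GCS wildcard path covering every shard of a table in a cell."""
    if table not in TABLES:
        raise ValueError(f"unknown table: {table}")
    return f"gs://clusterdata_2019_{cell}/{table}-*.json.gz"


def shard_id(file_name: str) -> int:
    """Extract the numeric shard ID from a shard file name."""
    match = SHARD_RE.match(file_name)
    if match is None:
        raise ValueError(f"not a shard file name: {file_name}")
    return int(match.group("shard"))


print(table_pattern("a", "instance_usage"))
print(shard_id("instance_usage-000000000000.json.gz"))
```

The string returned by table_pattern can be passed to gsutil cp as the source argument.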


To inspect the content of a file, one first needs to decompress the data, e.g. using the gunzip command:


gunzip instance_usage-000000000000.json.gz


The file contains a set of JSON strings separated by newlines; each line contains a JSON object representing a row in the table. Note that because JSON does not support the int64 type, any int64 field (e.g. timestamps, collection id, machine id) is represented as a string in JSON format.


Here is an example of one row of data from the machine_events table:


{"time":"89703182129","machine_id":"375997113395","type":"1","switch_id":"0kdfKLeqkk1sN8xXnJ8f63bMq+ciUu2ztSu53+pf1HM=","platform_id":"JQ1tVQBMHBAIISU1gUNXk2powhYumYA+4cB3KzU29l8="}
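
Since int64 fields arrive as strings, a reader will usually want to convert them back to integers. The sketch below parses rows from a gzip-compressed, newline-delimited JSON file with the Python standard library. Which fields are int64 is an assumption here, inferred from the sample row above; the authoritative list is the schema for each table. The function names are illustrative.

```python
import gzip
import json
from typing import Iterator, Sequence

# Assumption: these machine_events fields are int64 and therefore arrive as strings.
INT64_FIELDS = ("time", "machine_id", "type")


def parse_row(line: str, int_fields: Sequence[str] = INT64_FIELDS) -> dict:
    """Parse one JSON line and convert the listed string fields to int."""
    row = json.loads(line)
    for name in int_fields:
        if row.get(name) is not None:
            row[name] = int(row[name])
    return row


def read_rows(path: str, int_fields: Sequence[str] = INT64_FIELDS) -> Iterator[dict]:
    """Yield parsed rows from a gzip-compressed, newline-delimited JSON file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines, if any
                yield parse_row(line, int_fields)


# The sample row from this post.
sample = (
    '{"time":"89703182129","machine_id":"375997113395","type":"1",'
    '"switch_id":"0kdfKLeqkk1sN8xXnJ8f63bMq+ciUu2ztSu53+pf1HM=",'
    '"platform_id":"JQ1tVQBMHBAIISU1gUNXk2powhYumYA+4cB3KzU29l8="}'
)
row = parse_row(sample)
print(row["time"], row["machine_id"], row["type"])
```

Reading directly with gzip.open avoids having to run gunzip first, which can matter because the uncompressed files are much larger.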


The schema of each table is stored in another GCS bucket named clusterdata_2019_schema. Inside the bucket, there are 5 files, each describing the schema of its corresponding table in JSON.



Regards,

-Nan


john wilkes

Aug 11, 2020, 6:41:14 PM
to googlecluste...@googlegroups.com
Hi.  The documentation has also been updated to reflect these changes.  It also includes some updated information about Borg's priority values.

Many thanks to Nan for doing the work to make the JSON available - we hope you will find it useful!
  john

--
You received this message because you are subscribed to the "Google cluster data - discussions" group. To post to this group, send email to googlecluste...@googlegroups.com. To unsubscribe from this group, send email to googleclusterdata-...@googlegroups.com. For more options, visit this group at https://groups.google.com/d/forum/googleclusterdata-discuss?hl=en-US.
---
You received this message because you are subscribed to the Google Groups "Google cluster data - discussions" group.
To unsubscribe from this group and stop receiving emails from it, send an email to googleclusterdata-...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/googleclusterdata-discuss/df145ca3-5aea-4ab8-b8fc-1c1a883a7c60o%40googlegroups.com.

Dimuth Lasantha Senarath Pathirane Rajapakshalage

Aug 16, 2020, 11:08:49 PM
to Google cluster data - discussions
Hi,

Hope you are all having a great day. I need to clarify one thing about the URL for downloading the data set.

According to the document the download URL is - gsutil cp gs://clusterdata_2019_a/instance_usage-*.json.gzip <destination dir>  
I used it and it gave me an error (CommandException: No URLs matched: gs://clusterdata_2019_a/instance_events-000000000000.json.gzip).

Then I changed the URL as - gsutil cp gs://clusterdata_2019_a/instance_usage-*.json.gz <destination dir> 
Then it worked fine.

I would like to request that the document be updated if the URL is wrong, or otherwise that additional information be provided.

Cheers,
-Dimuth

Nan Deng

Aug 17, 2020, 3:32:00 PM
to Google cluster data - discussions
Thank you, Dimuth. You are right. It should be .json.gz.

Sorry for the confusion. I'll update the document accordingly.

Dimuth Lasantha Senarath Pathirane Rajapakshalage

Aug 17, 2020, 8:46:17 PM
to googlecluste...@googlegroups.com
Dear Nan,

That is fine and thanks for your attention. 

Cheers,
-Dimuth 

john wilkes

Aug 19, 2020, 12:28:36 AM
to googlecluste...@googlegroups.com
The documentation has been updated.  Sorry for any confusion.
  john

Ph.D. Student

Aug 29, 2020, 5:25:34 PM
to Google cluster data - discussions
Hi Nan,

Thanks for making the data available to download in JSON format. I also want to highlight one point: if you could provide this data in .csv format, as with the 2011 trace, its size would shrink dramatically, and processing and storage would be much faster and more practical for many people. JSON files use too much space because of the key-value format, and they grow very large when uncompressed.

Thanks


Nan Deng

Aug 31, 2020, 7:53:03 PM
to Google cluster data - discussions
We cannot make CSV files out of the trace because there are many repeated fields in this version of the trace, which cannot easily be converted to CSV in a universally acceptable way. An alternative to JSON is Avro, which should give us a more compact representation. But Avro is a binary format, while JSON is plain text. I feel that introducing any binary format would create an additional burden for users processing the data, even if the format is an open standard.
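
To make the difficulty with repeated fields concrete, here is a small sketch showing two equally defensible ways to flatten one row into CSV. The row, its field names and its values are hypothetical, chosen only to stand in for a repeated field; they are not taken from the trace schema.

```python
import csv
import io

# Hypothetical row: "samples" stands in for a repeated field; names and values are made up.
row = {"machine_id": 1, "samples": [0.1, 0.2, 0.3]}


def explode(row: dict, field: str) -> list:
    """Option 1: emit one CSV row per element of the repeated field."""
    base = {key: value for key, value in row.items() if key != field}
    return [{**base, field: item} for item in row[field]]


def pack(row: dict, field: str, sep: str = ";") -> list:
    """Option 2: keep one CSV row and join the repeated values into a single cell."""
    return [{**row, field: sep.join(str(item) for item in row[field])}]


def to_csv(rows: list) -> str:
    """Serialize a list of flat dicts to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


exploded_csv = to_csv(explode(row, "samples"))
packed_csv = to_csv(pack(row, "samples"))
print(exploded_csv)
print(packed_csv)
```

The first layout multiplies the row count and duplicates the other columns; the second needs a custom separator that every consumer must know how to parse. Neither is obviously the right choice, and rows with several repeated fields make the trade-off harder still.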

john wilkes

Sep 1, 2020, 12:59:41 PM
to googlecluste...@googlegroups.com
+1 to Nan's observations.

The other thing I was going to say was that if you still felt a .csv version is helpful (given the complications Nan pointed out), feel free to make one - and then help everybody else by sharing the code required to do it!  The trace is a gift to the community; we encourage the community to give back :-).


adam....@gmail.com

Sep 1, 2020, 6:04:34 PM
to Google cluster data - discussions
On the topic of community: I'd also encourage users to share their colab / analysis notebooks so that the rest of the community can build on your findings. It's good practice if you are publishing a paper on the data set to also make your code available, so we can reproduce your analysis.

R

Sep 27, 2020, 2:56:19 PM
to Google cluster data - discussions
Thank you very much for making the JSONs available. Very helpful.

Regards,

huiyan...@gmail.com

Feb 16, 2021, 2:00:53 PM
to Google cluster data - discussions
Hi Nan,

I checked the definition of "user" in the document, which is "the obfuscated name of the “user” (person or system) that submitted the collection".
I can easily understand the "person" part, but I have some questions about the "system" part. Is it like an app or server that many customers can access?
For example, in a search system, different customers run searches on the system, but in the cluster do they all have the same user name (the search system)? Please correct me if I am wrong. Thanks.
Huiyang

Nan Deng

Feb 19, 2021, 7:10:13 PM
to Google cluster data - discussions
Huiyang,

Please note that "user" in the trace has NO relationship with Google accounts. It is a Google-internal user name, similar to the concept of users and groups on UNIX-like systems. You can run a Borg job using your own account (e.g. dengnan, which is my username) or using a group name that belongs to you (e.g. fancy-google-search, which is a made-up name). Using a different user or group name affects the accounting, i.e. how much to charge each user or group, and whether the user or group has enough quota to run the job in the first place.