To download the data, one can use gsutil:
gsutil cp gs://clusterdata_2019_a/instance_usage-*.json.gzip <destination dir>
To inspect the content of a file, first decompress it with GZIP, e.g. using the gunzip command:
gunzip instance_usage-000000000000.json.gz
Each file contains a set of JSON strings separated by newlines; each line is a JSON object representing one row in the table. Note that because JSON has no int64 type, any int64 field (e.g. timestamps, collection id, machine id) is represented as a string in the JSON format.
Here is an example of one row of data from the machine_events table:
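As an alternative to running gunzip first, the shards can also be read while still compressed. The following is only a minimal sketch using the Python standard library; the function name iter_rows is made up for illustration, and it assumes each non-empty line holds exactly one JSON object, as described above.

```python
import gzip
import json
from typing import Iterator


def iter_rows(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a gzip-compressed,
    newline-delimited JSON file (hypothetical helper)."""
    # "rt" opens the compressed stream in text mode, so lines arrive as str.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline
                yield json.loads(line)
```

For example, `for row in iter_rows("instance_usage-000000000000.json.gz"): ...` would walk one shard row by row without loading the whole file into memory.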
{"time":"89703182129","machine_id":"375997113395","type":"1","switch_id":"0kdfKLeqkk1sN8xXnJ8f63bMq+ciUu2ztSu53+pf1HM=","platform_id":"JQ1tVQBMHBAIISU1gUNXk2powhYumYA+4cB3KzU29l8="}
The schema of each table is stored in another GCS bucket named clusterdata_2019_schema. Inside the bucket, there are 5 files, each describing the schema of its corresponding table in JSON.
Regards,
-Nan
--
You received this message because you are subscribed to the "Google cluster data - discussions" group. To post to this group, send email to googlecluste...@googlegroups.com. To unsubscribe from this group, send email to googleclusterdata-...@googlegroups.com. For more options, visit this group at https://groups.google.com/d/forum/googleclusterdata-discuss?hl=en-US.
---
You received this message because you are subscribed to the Google Groups "Google cluster data - discussions" group.
To unsubscribe from this group and stop receiving emails from it, send an email to googleclusterdata-...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/googleclusterdata-discuss/df145ca3-5aea-4ab8-b8fc-1c1a883a7c60o%40googlegroups.com.
Hi,
Hope you all are having a great day. I need to clarify one thing about the URL for downloading the data set.
According to the document, the download command is:
gsutil cp gs://clusterdata_2019_a/instance_usage-*.json.gzip <destination dir>
I used it and it gave me an error:
CommandException: No URLs matched: gs://clusterdata_2019_a/instance_events-000000000000.json.gzip
Then I changed the command to:
gsutil cp gs://clusterdata_2019_a/instance_usage-*.json.gz <destination dir>
Then it worked fine. I would like to request that the document be updated if the URL is wrong; otherwise, please provide additional information.
Cheers,
-Dimuth
On Wednesday, 12 August 2020 at 08:41:14 UTC+10 john wilkes wrote:
Hi.
The documentation has also been updated to reflect these changes. It also includes some updated information about Borg's priority values.
Many thanks to Nan for doing the work to make the JSON available - we hope you will find it useful!
john
On Mon, Aug 10, 2020 at 4:42 PM 'Nan Deng' via Google cluster data - discussions wrote:
Hi all,
I would like to announce some updates on the cluster trace v3, which was published early this year along with a paper at EuroSys 2020. Specifically, we fixed some minor bugs in the data and also provided a way for people to download the trace without using the BigQuery APIs. We will update our document soon to reflect the changes. Here is a list of the changes we made:
1. Added the "user" field in the collection_events table.
According to the document, the collection_events table should contain a string field called "user" to indicate the user who runs the collection. However, due to a bug in our program, this field was left empty. We have updated the data so that this field now contains the owner of the collection, as a string holding the user name after obfuscation.
2. All timestamps are represented using int64, not uint64.
Although BigQuery internally supports the uint64 type, it is not available in many external data formats. Specifically, Apache Avro only supports int64. Changing all timestamps from uint64 to int64 should not break anyone, as we still keep the values unchanged and int64 is sufficient to represent all the numbers we need. The benefit is that you can now directly export the data and/or query results to Avro format, which can later be analyzed by systems like Apache Drill, Apache Beam, etc.
3. You can now download the data in .json format.
We learned that querying BigQuery, although convenient, is expensive for many people. With the new change, the trace is now available for download in JSON format from Google Cloud Storage (GCS). Each cell's data is stored in its own bucket, whose name follows the pattern clusterdata_2019_${CELL}. For example, the data of cell a is stored in the bucket clusterdata_2019_a.
Inside each bucket, there are 5 tables: collection_events, instance_events, machine_events, machine_attributes and instance_usage. Each table is sharded into one or more files. Each file contains newline-separated JSON strings compressed by GZIP. The names of the files follow the pattern ${TABLE_NAME}-[0-9]+.json.gzip, where the number following the table name is the shard's id. Combined with the bucket name, this gives the path of a given table. For example, the instance usage data for cell a is stored at gs://clusterdata_2019_a/instance_usage-*.json.gzip, where * is a wildcard.
To download the data, one can use gsutil:
gsutil cp gs://clusterdata_2019_a/instance_usage-*.json.gzip <destination dir>
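The bucket and file naming scheme described above can be sketched as a small helper that builds the wildcard path for a given cell and table. This is only an illustration: the function name table_pattern and the TABLES tuple are hypothetical, and the file suffix is left as a parameter because this thread reports .json.gz as the extension that actually matched, rather than .json.gzip.

```python
# The five table names listed in the announcement.
TABLES = (
    "collection_events",
    "instance_events",
    "machine_events",
    "machine_attributes",
    "instance_usage",
)


def table_pattern(cell: str, table: str, suffix: str = ".json.gz") -> str:
    """Build the wildcard GCS path for one table of one cell.

    Assumption: the default suffix follows the correction reported in this
    thread; pass suffix=".json.gzip" to reproduce the announced pattern.
    """
    if table not in TABLES:
        raise ValueError(f"unknown table: {table!r}")
    return f"gs://clusterdata_2019_{cell}/{table}-*{suffix}"


print(table_pattern("a", "instance_usage"))
```

The resulting string can be passed to gsutil cp as the source argument.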
To inspect the content of a file, first decompress it with GZIP, e.g. using the gunzip command:
gunzip instance_usage-000000000000.json.gz
Each file contains a set of JSON strings separated by newlines; each line is a JSON object representing one row in the table. Note that because JSON has no int64 type, any int64 field (e.g. timestamps, collection id, machine id) is represented as a string in the JSON format.
Here is an example of one row of data from the machine_events table:
{"time":"89703182129","machine_id":"375997113395","type":"1","switch_id":"0kdfKLeqkk1sN8xXnJ8f63bMq+ciUu2ztSu53+pf1HM=","platform_id":"JQ1tVQBMHBAIISU1gUNXk2powhYumYA+4cB3KzU29l8="}
The schema of each table is stored in another GCS bucket named clusterdata_2019_schema. Inside the bucket, there are 5 files, each describing the schema of its corresponding table in JSON.
Regards,
-Nan