Request: Google cluster trace


raha abbasi

Dec 2, 2020, 12:52:25 PM
to Google cluster data - discussions

Hello,

I am a PhD student and I am going to use the Google cluster trace for my project.
Is there any free version of the Google cluster trace?
Does the Google cluster trace have information including memory usage, network info, and CPU usage?


Best regards
Raha

john wilkes

Dec 2, 2020, 12:58:40 PM
to googlecluste...@googlegroups.com
On Wed, Dec 2, 2020 at 9:52 AM raha abbasi <raha.ab...@gmail.com> wrote:

> Hello,
>
> I am a PhD student and I am going to use the Google cluster trace for my project.

Hi, and welcome to the community.

> Is there any free version of the Google cluster trace?

Yes - please read the documentation at https://github.com/google/cluster-data

> Does the Google cluster trace have information including memory usage, network info, and CPU usage?

Yes, no, yes - please read the documentation at https://github.com/google/cluster-data

> Best regards
> Raha


raha abbasi

Dec 2, 2020, 8:11:36 PM
to Google cluster data - discussions
Thank you so much for your reply.
I tried to download the free version based on the instructions in the document.
It redirected me to the Google Cloud page.
After selecting "Get started for free", it redirected me to a payment page asking for $300.
Would it be possible to guide me on how I can download the free version?
I am a PhD student at Concordia University in Canada. I think I need a Google Cloud account to download it. Is it possible to get an account for Concordia students?

Best regards
Raha

Asif Ejaz

Dec 3, 2020, 1:02:16 AM
to googlecluste...@googlegroups.com
You need to create a Google Cloud Platform account, and you will be required to add your credit card details. You will get a $300 credit for using cloud services on a new account; you only have to pay for cloud services after you use up that $300.
You can also ask your university whether it provides Google Cloud accounts, or whether accounts dedicated to particular research labs have been purchased.
For just downloading the dataset, nothing will be charged against that $300, because the dataset is free. But if you store the data in a cloud service like BigQuery, you will be charged.
You may want to consider working in the cloud, because the 2019 dataset has 8 different clusters and each cluster's data is around 1 TB in compressed format, roughly 8 times that uncompressed.
If you can manage storage that large and fast, then you can think about downloading it.


Prof. Manoel Campos

Dec 3, 2020, 6:43:10 AM
to googlecluste...@googlegroups.com
For the 2011 version, you can simply try this script:

https://github.com/manoelcampos/cloudsim-plus/blob/master/script/download-google-cluster-data.sh

Manoel Campos da Silva Filho
Software Engineer
Computer Science and Engineering Ph.D. Student at University of Beira Interior (Portugal)
Professor at Federal Institute of Education, Science and Technology of Tocantins (Brazil)
http://manoelcampos.com

raha abbasi

Dec 3, 2020, 12:01:16 PM
to Google cluster data - discussions
Thank you for your reply.
- Should I install Ubuntu and then run this bash file on it?
- I am going to use this data for fault detection in the cloud.
- Is this version suitable for fault detection?

Best regards
Raha

Prof. Manoel Campos

Dec 3, 2020, 1:50:25 PM
to googlecluste...@googlegroups.com
If you have Windows Subsystem for Linux (WSL), you can run the script.
Whether the traces are suitable for fault detection is a subjective question that depends on your experiments.
They don't provide any data about failures, but you can use them to create simulation experiments
and inject faults using a pseudo-random number generator following a statistical distribution.
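
As a rough illustration of that idea (this is not CloudSim Plus's actual fault module), here is a minimal Python sketch; the trace span and mean time between failures are assumptions chosen only for the example:

#!/usr/bin/env python3
# Hedged sketch: synthetic fault injection over a trace's time span.
# Faults are modeled as a Poisson process, so inter-failure times are
# drawn from an exponential distribution. The span and MTBF below are
# illustrative assumptions, not values taken from the trace.
import random

TRACE_SPAN_US = 29 * 24 * 3600 * 10**6  # ~29 days in microseconds
MTBF_US = 12 * 3600 * 10**6             # assumed: one fault every ~12 hours

def fault_times(span_us, mtbf_us, seed=42):
    # Return synthetic fault timestamps (microseconds) within the span.
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(1.0 / mtbf_us)  # exponential inter-failure gap
        if t >= span_us:
            return times
        times.append(int(t))

print(fault_times(TRACE_SPAN_US, MTBF_US)[:5])  # first few injected fault times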

If you want to create simulations in Java, check CloudSim Plus.
It has support for the Google Traces and has a fault injection module that
does exactly what I just explained.

The official website is http://cloudsimplus.org

Manoel Campos da Silva Filho
Software Engineer
Computer Science and Engineering Ph.D. Student at University of Beira Interior (Portugal)
Professor at Federal Institute of Education, Science and Technology of Tocantins (Brazil)
http://manoelcampos.com

raha abbasi

Dec 3, 2020, 6:19:51 PM
to Google cluster data - discussions

Thank you so much for your reply.

I installed Ubuntu and ran download-google-cluster-data.sh.

As you mentioned, I collected normal data from the 2011 version. For fault data, I should inject faults statistically.

Does the 2019 version include both fault data and normal data?

I tried to download it, but it seems I have to pay $300 for a Google account.

Before paying for this, I need to make sure that the 2019 version has fault data samples.


Best regards

Raha

Prof. Manoel Campos

Dec 4, 2020, 10:58:50 AM
to googlecluste...@googlegroups.com

> Does the 2019 version include both fault data and normal data?

I don't know.

> I tried to download it, but it seems I have to pay $300 for a Google account.

You don't have to pay that. When you create a Google Cloud Platform account, you receive this value in free credits.
You just have to provide a credit card in case you exceed that limit.

Bamdad Mousavi

Dec 4, 2020, 1:16:41 PM
to googlecluste...@googlegroups.com
Hello Raha,

I would like to clarify the earlier reply you received from Asif Ejaz. You do NOT have to pay $300 to download the 2019 version of the dataset. When you create a new Google Cloud account, you receive
$300 in credits from Google to try the different services that Google Cloud provides. The reason Google asks for your credit card information is to charge you after you have used your $300 worth of credit. You can use this credit if you decide to process the dataset using BigQuery or any other Google Cloud service, and you will only be charged after you reach your credit limit. Google publishes these datasets for free for academic researchers like us, and it does not charge you for downloading the 2019 version of the dataset.

Regards,
Bamdad Mousavi

Nan Deng

Dec 9, 2020, 5:06:02 PM
to Google cluster data - discussions
There are so many misunderstandings in this thread. Let me make sure that everyone is on the same page.

First and foremost: both the 2011 and the 2019 traces are provided to the public for free. You can use them however you want, and you DO NOT need to pay anyone for using the data. The only thing we ask is that you properly cite the work if you use the data in your publications.

Second, you DO NOT need to register a Google account (or Google Cloud account, for that matter) to download and/or use the trace. You can download the 2019 trace in JSON format using the plain HTTPS protocol. You don't need an account on any website to download the data as long as you have internet access. The cloud credit mentioned in this thread has nothing to do with the trace; it's just a promotion from Google Cloud to encourage people to try Google Cloud products. Again, the trace data has nothing to do with the promotion, just like a 30-day free Netflix trial has nothing to do with our trace data.

Now let's talk about how to access the data. I'll only talk about the 2019 trace (version 3) because that's the most recent data. If you are still using the 2011 trace, we strongly encourage you to switch to the 2019 trace, which better reflects the reality of our workload.

Before reading the rest of this email, please read the document we published about the 2019 trace, specifically, this one: https://drive.google.com/file/d/10r6cnJ5cJ89fPWCgj7j4LtLBqYN9RiI9/view

Please read the section "Accessing the trace data", and feel free to search the Internet for any terms you are not familiar with.

If you have read the document and understand the necessary information about how the data is shared, you will know:

1. The data is available on BigQuery and Google Cloud Storage.
2. You can use BigQuery to do a lot of analysis work without downloading the data. It's provided as public data, like many other public datasets available on BigQuery: https://cloud.google.com/bigquery/public-data
3. If you want to download the data onto your own disk and run your own programs to analyze it, you can use the copy on Google Cloud Storage.

The confusion in the thread is mostly about point 3. Accessing data on Google Cloud Storage can be done with gsutil, a command-line tool from Google Cloud that we used in the document to show one way of downloading the data. However, it is NOT the only way. The document already describes the bucket names and file names of the datasets stored on Google Cloud Storage:

"Each cell's data is stored in its own bucket whose name follows the pattern clusterdata_2019_${CELL}. (E.g., the data of cell a is stored in the bucket clusterdata_2019_a.) Inside each bucket, there are 5 tables: collection_events, instance_events, machine_events, machine_attributes and instance_usage. Each table is sharded into one or more files, whose names follow the pattern ${TABLE_NAME}-[0-9]+.json.gz, where the number following the table name represents the shard's id."

If you do a little searching about Google Cloud Storage, you will see that it is a storage service where anyone can access public data using the plain old HTTP protocol. Please read this document to see how to access data using different tools, including gsutil or plain HTTP: https://cloud.google.com/storage/docs/access-public-data#api-link

The reason we chose gsutil in the document is simply that it supports wildcards, so a one-liner can download the data of a whole table. Nothing prevents you from downloading files in your browser using a link like https://storage.googleapis.com/clusterdata_2019_a/collection_events-000000000000.json.gz. You would just need to repeat yourself thousands of times to download the whole dataset, because it is sharded into small files.

You can also use a script to download the dataset. Here is one I wrote that downloads one table from one cell:

#!/usr/bin/env bash

CELL="a"
TABLE="collection_events"

trace_file_url() {
  local cell=$1
  local table=$2
  local index=$3
  # The shard index in the URL is always a 12-digit number, left-padded with zeros
  printf 'https://storage.googleapis.com/clusterdata_2019_%s/%s-%012d.json.gz' "$cell" "$table" "$index"
}

idx=0
while true; do
  if ! wget "$(trace_file_url "$CELL" "$TABLE" "$idx")"; then
    # When we reach a nonexistent shard, wget fails and we stop the script.
    # Ideally, we would check that the error is NOT FOUND rather than a
    # network issue or the like.
    exit 0
  fi
  idx=$((idx + 1))
done

Feel free to change the script to make it smarter: running downloads in parallel, passing parameters as command-line arguments, randomly sampling the data, etc.
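
For instance, here is a hedged Python sketch of the parallel variant, reusing the same URL pattern; the batch size is an arbitrary choice:

#!/usr/bin/env python3
# Hedged sketch: download the shards of one table in parallel batches,
# reusing the public URL pattern shown above. Shard counts vary per
# table, so we keep fetching batches until one batch hits a 404.
import concurrent.futures
import urllib.error
import urllib.request

CELL = "a"
TABLE = "collection_events"
BATCH = 8  # arbitrary: number of shards fetched concurrently

def shard_url(cell, table, index):
    # The shard index in the URL is a 12-digit, zero-padded number.
    return ("https://storage.googleapis.com/clusterdata_2019_%s/%s-%012d.json.gz"
            % (cell, table, index))

def fetch(index):
    # Download one shard; return False once we run past the last shard.
    url = shard_url(CELL, TABLE, index)
    try:
        urllib.request.urlretrieve(url, url.rsplit("/", 1)[-1])
        return True
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # real network/server errors should surface, not end the loop

idx = 0
while True:
    with concurrent.futures.ThreadPoolExecutor(max_workers=BATCH) as pool:
        results = list(pool.map(fetch, range(idx, idx + BATCH)))
    if not all(results):
        break
    idx += BATCH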

It is worth mentioning that the whole dataset is very large. One cell's data is around several hundred GiB, and the full 2019 trace, which contains 8 cells of data, is around several TiB. Make sure you have enough storage before downloading it.

In case you are wondering, here are some questions that I imagine you may ask:

Q: Why do you use BigQuery?
A: We found BigQuery extremely useful for analyzing a dataset of this size. We can run pretty complicated analyses with a SQL query, within seconds or minutes, against several TiB of data. Almost all the analysis for our EuroSys 2020 paper, Borg: The Next Generation, was done using BigQuery. That said, you are free to use whatever tool you want to analyze the data. We just find BigQuery sufficient for most of the analysis work we need to do.
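
If you want to try this yourself, here is a minimal sketch using the google-cloud-bigquery Python client; the table path is a placeholder, so take the real project/dataset/table names from the trace documentation:

# Hedged sketch: one aggregate query against a trace table on BigQuery.
# Requires `pip install google-cloud-bigquery` and configured GCP credentials.
from google.cloud import bigquery

# Placeholder path -- substitute the actual project/dataset/table names
# given in the trace documentation.
TABLE = "your-project.clusterdata_2019_a.machine_events"

client = bigquery.Client()
query = """
    SELECT type, COUNT(*) AS n
    FROM `%s`
    GROUP BY type
    ORDER BY n DESC
""" % TABLE
for row in client.query(query).result():  # runs the query and waits for it
    print(row.type, row.n)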

Q: Why do you store data in json format? It used to be CSV.
A: CSV is good for data with a flat structure, like a table with a pre-defined number of columns. The 2019 trace introduced some nested data structures, such as the CPU usage histogram, that have no standard representation in CSV. We could either normalize our data and introduce additional tables to store the relationships, or use a better representation of the data. We ended up with the latter option because it is conceptually easier to understand. JSON lets us store lists and maps, which is sufficient for our use case, and it is already widely used.

Q: In the JSON dataset, there are integers stored as strings. Why?
A: Because JSON does not support an int64 type, int64 fields (e.g. timestamps, collection_id, machine_id) are represented as strings in the JSON format.
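
As a concrete illustration (the record below is made up, not an actual trace row), parsing in Python then looks like this:

# Hedged sketch: int64 fields arrive as JSON strings, so convert after parsing.
import json

raw = '{"time": "86400000000", "machine_id": "4155835272", "type": 3}'
record = json.loads(raw)                # the record itself is a fabricated example

time_us = int(record["time"])           # timestamp in microseconds
machine_id = int(record["machine_id"])  # int64 id, shipped as a string
print(time_us, machine_id, record["type"])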

I hope this answers most of your questions. If not, feel free to reply in this thread. I'll read through each email in the thread and reply individually in the following email(s).

Regards,
Nan