There are so many misunderstandings in this thread. Let me make sure that everyone is on the same page.
First and foremost: The trace, both the 2011 and the 2019 versions, is provided to the public for free. You can use it however you want and you DO NOT need to pay anyone for using the data. The only thing we ask is that you properly cite the work if you use the data in your publications.
Second, you DO NOT need to register a Google account (or a Google Cloud account, for that matter) to download and/or use the trace. You can download the 2019 trace in JSON format using the plain HTTPS protocol. You don't need an account on any website to download the data as long as you have internet access. The cloud-credit thing mentioned in this thread has nothing to do with the trace. It's just a promotion from Google Cloud to encourage people to try Google Cloud products. Again, the trace data has nothing to do with that promotion, just like a 30-day free Netflix trial has nothing to do with our trace data.
Now let's talk about how to access the data. I'll only talk about the 2019 trace (or version 3) because that's the most recent data. If you are still using the 2011 trace, we strongly encourage you to switch to the 2019 trace, which better reflects the reality of our workload.
Please read the section Accessing the trace data and feel free to search the Internet for any terms you are not familiar with.
If you have read the document and understood the necessary information about how the data is shared, you will know:
1. The data is available on BigQuery and Google Cloud Storage.
2. If you want to download the data onto your own disk and run your own program to analyze it, you can use the data available on Google Cloud Storage.
The confusion in the thread is mostly about point 2. Accessing data on Google Cloud Storage can be done with gsutil, a command-line tool from Google Cloud, and we used it in the document to show one of the ways of downloading the data. However, it is NOT the only way. In the document, we have already given the bucket names and file names of the datasets stored on Google Cloud Storage:
"Each cell's data is stored in its own bucket whose name follows the pattern clusterdata_2019_${CELL}. (E.g., the data of cell a is stored in the bucket clusterdata_2019_a.) Inside each bucket, there are 5 tables: collection_events, instance_events, machine_events, machine_attributes and instance_usage. Each table is sharded into one or more files, whose names follow the pattern ${TABLE_NAME}-[0-9]+.json.gz, where the number following the table name represents the shard's id."
If you do a little searching about Google Cloud Storage, you will find that it is a storage service where public data can be accessed using the plain old HTTP protocol. Please read this document about Google Cloud Storage to see how to access public data using different tools, including gsutil and plain HTTP:
https://cloud.google.com/storage/docs/access-public-data#api-link
The reason we chose gsutil in the document is simply that it supports wildcards, so we can download the data of a whole table with a one-liner. Nothing prevents you from downloading files with your browser using a link like:
https://storage.googleapis.com/clusterdata_2019_a/collection_events-000000000000.json.gz. You would just have to repeat that thousands of times to download the whole dataset, because the dataset is sharded into small files.
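For example, with gsutil installed, a wildcard one-liner like the following downloads a whole table into the current directory (the bucket and file names just follow the pattern quoted above):

# List the shards of one table to see the naming pattern:
gsutil ls 'gs://clusterdata_2019_a/collection_events-*.json.gz'

# Download all shards of that table in parallel (-m flag):
gsutil -m cp 'gs://clusterdata_2019_a/collection_events-*.json.gz' .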
You can also use a script to download the dataset. Here is one I wrote that downloads one table from one cell:
#!/usr/bin/env bash
CELL="a"
TABLE="collection_events"

trace_file_url() {
  local cell=$1
  local table=$2
  local index=$3
  # The shard index in the URL is always a 12-digit number, left-padded with zeros.
  printf 'https://storage.googleapis.com/clusterdata_2019_%s/%s-%012d.json.gz' \
    "$cell" "$table" "$index"
}

idx=0
while true; do
  if ! wget "$(trace_file_url "$CELL" "$TABLE" "$idx")"; then
    # Stop once wget fails, which happens when we reach a shard that does not exist.
    # Ideally, we would check that the error is NOT FOUND rather than a network issue or whatnot.
    exit 0
  fi
  idx=$((idx + 1))
done
Feel free to change the script to make it smarter: run downloads in parallel, pass parameters through command-line arguments, randomly sample the data, etc.
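For instance, here is a rough sketch of such a variant that takes the cell, table, and an assumed shard count from the command line and fetches shards in parallel with xargs. The default shard count of 100 is a placeholder, not the real number of shards; shards that don't exist simply fail and are skipped.

#!/usr/bin/env bash
# Usage: ./download_table.sh <cell> <table> <num_shards>
CELL="${1:-a}"
TABLE="${2:-collection_events}"
SHARDS="${3:-100}"   # placeholder; over-estimate and let missing shards fail with 404

# Generate zero-padded shard indices and fetch up to 8 files at a time.
seq -f '%012g' 0 $((SHARDS - 1)) \
  | xargs -P 8 -I{} \
      wget -q "https://storage.googleapis.com/clusterdata_2019_${CELL}/${TABLE}-{}.json.gz"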
It is worth mentioning that the whole dataset is very large. One cell's data is around several hundred GiB, and the full 2019 trace, which contains data from 8 cells, is several TiB. Make sure you have enough storage before downloading it.
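If you want to check the size before committing the disk space, gsutil can report it per bucket, e.g.:

# Print the total size of one cell's bucket in human-readable units.
gsutil du -sh gs://clusterdata_2019_a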
In case you are wondering, here are some questions that I imagine you may ask:
Q: Why do you use BigQuery?
A: We found BigQuery extremely useful for analyzing a dataset of this size. We can run fairly complicated analyses with a single SQL query, within seconds or minutes, against several TiB of data. Almost all of the analysis for the EuroSys 2020 paper, Borg: The Next Generation, was done using BigQuery. That said, you are free to use whatever tool you want to analyze the data. We just find BigQuery sufficient for most of the analysis work we need to do.
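As a taste of what that looks like, the bq command-line tool can run standard SQL directly against the public tables. The fully-qualified table name below is only my guess at the pattern, so take the exact project and dataset names from the trace documentation:

# Count machine events per event type in cell a.
# NOTE: verify the table path and column names against the documentation;
# they are written here from memory.
bq query --use_legacy_sql=false '
  SELECT type, COUNT(*) AS n
  FROM `google.com:google-cluster-data.clusterdata_2019_a.machine_events`
  GROUP BY type
  ORDER BY n DESC'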
Q: Why do you store data in json format? It used to be CSV.
A: CSV is good for data with a flat structure, like a table with a pre-defined number of columns. The 2019 trace introduced some nested data structures that have no standard representation in CSV, e.g. the CPU usage histogram. We could either normalize our data and introduce additional tables to store the relationships, or use a richer representation of the data. We ended up with the latter option because it is conceptually easier to understand. JSON allows us to store lists and maps, which is sufficient for our use case, and it is already widely used.
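To make "nested" concrete, here is how you could peek at those structures with jq on one downloaded shard. The field names (.average_usage.cpus, .cpu_usage_distribution) are written from the schema description, so verify them against the documentation:

# Pretty-print a few records to see the nested layout (one JSON object per line).
zcat instance_usage-000000000000.json.gz | head -n 3 | jq '.'

# Pull a nested scalar and the histogram array out of each record.
zcat instance_usage-000000000000.json.gz \
  | jq -c '{cid: .collection_id, avg_cpu: .average_usage.cpus, hist: .cpu_usage_distribution}' \
  | head -n 5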
Q: In the JSON dataset, there are integers stored as strings. Why?
A: Because JSON does not support an int64 type, every int64 field (e.g. timestamps, collection_id, machine_id) is represented as a string in the JSON format.
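In practice that means you have to convert those fields back to numbers yourself. For example, with jq (keeping in mind that jq stores numbers as doubles, so extremely large IDs can lose precision):

# Convert string-encoded int64 fields back to numbers while reading machine_events.
zcat machine_events-000000000000.json.gz \
  | jq -c '{time: (.time | tonumber), machine_id: (.machine_id | tonumber)}' \
  | head -n 5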
I hope this answers most of your questions. If not, feel free to reply in this thread. I'll read through each email in the thread and reply to them individually in the following email(s).
Regards,
Nan