CPU usage normalization and analysis - Google cluster data 2011


Andrea Morichetta

Jan 23, 2021, 11:48:59 AM
to Google cluster data - discussions

Hi everyone,

First of all, I am glad to share my question on a platform so rich in researchers. I started inspecting the "2011 Google cluster data" trace, and I ran into something that has me a bit confused.


Question about feature normalization:

In the paragraph "Resource units" on page 4, it is written that "For each of the foregoing [measurements], we compute separate normalization. The normalization is a scaling relative to the largest capacity on any machine in the trace (which is 1.0)."

However, looking at the "Task usage" dataset, I found CPU usage values well above 1.

Indeed, in the "Resource Usage" section, the document states that CPU usage is expressed in core-seconds per second: "If a task is using two cores all the time, it will be reflected as a usage of 2.0 core-s/s".

This is excellent information, but I would also like to know how much CPU they are using compared to the requested amount. How can I normalize the CPU usage since I don't know the maximum capacity? I thought about comparing it with the requested resources, i.e., the task events table. However, those report only fractional sizes relative to the most powerful machine in the trace. Did I get something wrong?

Best,

Andrea





Nan Deng

Feb 2, 2021, 1:52:37 PM
to Google cluster data - discussions
Hi Andrea

First, I would highly recommend trying our 2019 trace, which is much more up to date and contains more data, both in volume (8 cells rather than 1) and in the number of metrics (Autopilot information, CPU usage histograms within 5-minute windows, etc.).

To answer your questions, could you be more specific about your findings? Which column(s) did you use? What computations have you done? Could you give an example (e.g. at timestamp=X, machine_id=Y, job_id=Z, the average CPU usage is >1.0)? It's easier for us to inspect individual rows than to search the whole dataset.

More about CPU usage and normalization:

As described in the document you quoted in your mail, CPU usage is measured in seconds: it is how much time a task (a set of processes) has spent running on CPUs. We use cgroups to isolate and monitor tasks. Each task consists of a set of processes. For each CPU cgroup, the Linux kernel maintains a counter that keeps increasing: whenever a process running in a cgroup (i.e. a task) uses the CPUs, the OS increases the cgroup's counter by the amount of time the task has been running on the CPUs. So, underneath Borg, the CPU usage of a task is maintained by the Linux kernel as a monotonically increasing integer. Every second, Borg reads the counter for every task and takes the difference between two consecutive measurements. That difference gives the amount of time the task has been running on CPUs between the two measurements (in Borg, roughly 1 second apart).
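
To make the counter-diff idea concrete, here is a minimal sketch (not the actual Borglet code) of reading the cgroup v1 cpuacct counter for one task and turning it into a core-seconds-per-second rate; the cgroup path is an illustrative assumption:

import time

# Illustrative path; the real cgroup path depends on how the task was set up.
USAGE_FILE = "/sys/fs/cgroup/cpuacct/some_task/cpuacct.usage"

def read_counter():
    # cpuacct.usage holds the cgroup's cumulative CPU time, in nanoseconds.
    with open(USAGE_FILE) as f:
        return int(f.read())

prev_counter, prev_time = read_counter(), time.monotonic()
time.sleep(1)
counter, now = read_counter(), time.monotonic()

# Counter delta over elapsed wall-clock time = core-seconds per second.
rate = (counter - prev_counter) / 1e9 / (now - prev_time)
print(f"CPU rate: {rate:.2f} core-s/s")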

For a given machine, the CPU capacity is also measured in CPU-seconds per second. If a machine has only one core, its capacity is 1 s/s, because a task running on the machine can use at most one CPU-second per second. For a machine with N cores, the capacity is N s/s.

To normalize the CPU usage (or limit) of a task, we find the machine with the maximum capacity in the dataset (think: the machine with the most cores) and divide the CPU usage by that maximum capacity. Because CPU usage measured on a real machine cannot exceed the underlying machine's capacity, the normalized usage should always be smaller than 1.0. If this process is still a bit confusing, just think of it as picking a single, large enough number and dividing every metric whose unit is CPU-seconds per second by that number. Because the number is large enough, every normalized value should be smaller than 1.0.

The reason we chose this normalization method is that users can do normal arithmetic on the normalized numbers: adding usages, subtracting, dividing, multiplying, etc. That's because everything is normalized by the same single number. Most of the time, you can treat the normalized numbers as if they were raw measurements and apply whatever operations you would apply to raw measurements.
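
As a small illustration of that arithmetic (the numbers below are made up, not taken from the trace):

MAX_CAPACITY = 32.0   # assumed core count of the largest machine in the trace (made up)

raw_usage = 2.0       # core-seconds per second
raw_limit = 4.0       # requested cores

norm_usage = raw_usage / MAX_CAPACITY   # 0.0625, as it would appear in the trace
norm_limit = raw_limit / MAX_CAPACITY   # 0.125

# Sums and ratios behave as they would on the raw values;
# the common scaling factor cancels out in the ratio.
assert norm_usage + norm_limit == (raw_usage + raw_limit) / MAX_CAPACITY
assert norm_usage / norm_limit == raw_usage / raw_limit   # both equal 0.5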

To answer some of your questions:

> how much CPU they are using compared to the requested amount. 
It's available in the task events table, in the requested resources field. I believe you've already figured that out yourself.

> How can I normalize the CPU usage since I don't know the maximum capacity?
There are many ways to "normalize" CPU usage. I assume you are talking about normalizing a task's usage to its limit (i.e. the requested amount). You can simply divide the usage (available in the task usage table) by the limit (available in the task events table) and you are good. Because both usage and limit are normalized by the same number (the maximum machine capacity), you can safely divide usage by limit and the normalization factor cancels out.
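
Here is a rough pandas sketch of that division, assuming the 2011 CSV layout (no header row); the file names and column positions are taken from my reading of the schema document, so please double-check them before relying on this:

import pandas as pd

# task_usage: column 2 = job ID, 3 = task index, 5 = CPU rate (normalized).
usage = pd.read_csv("task_usage/part-00000-of-00500.csv.gz",
                    header=None, usecols=[2, 3, 5])
usage.columns = ["job_id", "task_index", "cpu_rate"]

# task_events: column 2 = job ID, 3 = task index, 9 = CPU request (normalized).
events = pd.read_csv("task_events/part-00000-of-00500.csv.gz",
                     header=None, usecols=[2, 3, 9])
events.columns = ["job_id", "task_index", "cpu_request"]

# Keep one request value per task (requests can change across events).
requests = (events.dropna()
                  .groupby(["job_id", "task_index"], as_index=False)
                  .last())

merged = usage.merge(requests, on=["job_id", "task_index"])
# The common normalization factor cancels here, leaving usage relative to the request.
merged["utilization"] = merged["cpu_rate"] / merged["cpu_request"]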

> I thought about comparing it with the requested resources, i.e., the task events table. However, those report only fractional sizes relative to the most powerful machine in the trace. Did I get something wrong?
I did not quite get your question here, but I hope my answers to the other questions give you enough information.

Nan Deng

Feb 2, 2021, 2:12:00 PM
to Google cluster data - discussions
To be more architecturally correct: whenever I say "cores", it actually means Hyperthreads on Intel architectures.

In case you want to know more about cgroup, which is the underlying technology we used to isolate our tasks, here is a document from IBM I found through a quick Google search: https://www.ibm.com/support/knowledgecenter/SSZUMP_7.3.0/management_sym/cgroup_subsystems.html

There are also plenty of documents about cgroups and Linux containers; feel free to take advantage of those resources if you are curious about the technology.

Andrea Morichetta

Feb 21, 2021, 12:36:46 PM
to Google cluster data - discussions
Dear Nan,

First of all, thank you very much for your precise information. 
My problem with the normalization was that I had not understood that the machine capacity and the requested amount are both expressed in CPU-seconds per second.
In my analysis, I tried to compute aggregated information per scheduling class, e.g., taking the maximum CPU usage value for that scheduling class every five minutes. However, I was wondering how to obtain the rate of CPU usage relative to the total requested, and your clarification helped me a lot in this direction.
As shown in the screenshot attached to this message, I don't understand why the CPU rate (which should coincide with CPU usage), column 6, is larger than 1 for many tasks. In this example, I considered the first 50 task usage CSV files and filtered for rows with CPU rate values greater than 1.
Does CPU rate mean something different? How can this behavior be explained, given that CPU usage is normalized by the maximum capacity?

Screen Shot 2021-02-21 at 6.32.35 PM.png

Nan Deng

Feb 24, 2021, 1:01:30 AM
to Google cluster data - discussions
A disclaimer: I did not participate in the work of releasing the 2011 trace (I had not joined Google back then). I did, however, work on preparing and publishing the 2019 trace. I do not know many details about the 2011 trace, so some of the descriptions below are my speculation.

It is indeed abnormal to see the CPU rate or maximum CPU rate columns having values >1.0. To deal with such abnormal data, the first question I would ask is: how often does it happen? Through a quick query on BigQuery (yes, we also imported the 2011 trace into BigQuery, which is a huge productivity boost), I found that in the task_usage table there are 583 rows out of 1232799308 with cpu_rate > 1.0. That should be considered really rare (on the order of 10^-7), and it is likely some noise introduced during data collection.

Just looking at the samples you provided, I noticed that the duration of each sample (end_time - start_time) is usually less than 5 minutes. This means very little data was actually collected for those samples by the node agent in Borg, which we call the Borglet. If I add this condition to my query, to see how many rows have cpu_rate > 1.0 AND (end_time - start_time) >= 300000000, there are actually only 4 rows satisfying it.

If we change the filter to see how many rows have maximum_cpu_rate > 1.0, the numbers are larger: 800845 out of 1232799308 samples. If we also add the condition to only inspect samples with a duration >= 5 min, the number drops to 765805. That is still only about 0.06%, which should be considered rare.
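
If you are not using BigQuery, roughly the same counts can be reproduced over the CSV files, for example with pandas (a sketch only; the column positions are again assumptions based on the schema document, and timestamps are in microseconds):

import glob
import pandas as pd

total = over_one = over_one_full_window = 0
for path in sorted(glob.glob("task_usage/part-*.csv.gz")):
    # Column 0 = start time, 1 = end time, 5 = CPU rate.
    df = pd.read_csv(path, header=None, usecols=[0, 1, 5])
    df.columns = ["start_time", "end_time", "cpu_rate"]
    total += len(df)
    over_one += (df["cpu_rate"] > 1.0).sum()
    full_window = (df["end_time"] - df["start_time"]) >= 300_000_000  # 5 minutes in microseconds
    over_one_full_window += ((df["cpu_rate"] > 1.0) & full_window).sum()

print(total, over_one, over_one_full_window)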

Given how infrequently this happens, it could be caused by lots of things: normal noise introduced by the data collection procedure, the node agent (the Borglet) being unresponsive, kernel bugs, bad hardware, etc.

Regarding the noise introduced by data collection, here is a possible scenario that could make this happen:

Recall the counter maintained by the kernel that we use to collect CPU usage data; to quote from my previous email:

> We use cgroups to isolate and monitor tasks. Each task consists of a set of processes. For each CPU cgroup, the Linux kernel maintains a counter that keeps increasing: whenever a process running in a cgroup (i.e. a task) uses the CPUs, the OS increases the cgroup's counter by the amount of time the task has been running on the CPUs. So, underneath Borg, the CPU usage of a task is maintained by the Linux kernel as a monotonically increasing integer. Every second, Borg reads the counter for every task and takes the difference between two consecutive measurements. That difference gives the amount of time the task has been running on CPUs between the two measurements (in Borg, roughly 1 second apart).

Just imagine how you would implement such a data collection procedure. In an ideal world where you want truly accurate data, you would pause the whole system, meaning no one except the data collection program can use the CPU. Then you read the counter of each task, take the difference between two consecutive reads of each counter, divide it by the duration between the two pauses, and you get the CPU rate. However, in a real system, you can never pause every process every second simply for the sake of slightly higher-quality data. What we actually do is read the counters sequentially (with some level of parallelism, of course) while the tasks are running, and also read the current time. This means you never know exactly when the counter was read: between the moment you read the time and the moment you read the counter, the counter may have changed. So when you compute the rate by dividing by the duration, it is never exact, because the duration you use is not the duration between two reads of the kernel's CPU usage counter; it is just the duration between two reads of the clock. Pseudocode for this logic could be as follows:

import time

N = 4  # assume there are always N tasks running on the machine

def read_task_cpu_usage(i):
    # Placeholder for reading task i's cumulative CPU time (in seconds) from
    # the kernel, e.g. the task's cgroup cpuacct.usage counter (nanoseconds);
    # the path below is illustrative only.
    with open(f"/sys/fs/cgroup/cpuacct/task{i}/cpuacct.usage") as f:
        return int(f.read()) / 1e9

last_read_time = time.monotonic()
last_cpu_usage_counter_value = [0.0] * N
cpu_rate = [0.0] * N
while True:
    # Sleep for roughly one second between collection rounds.
    time.sleep(1)
    # Timestamp in seconds.
    current_time = time.monotonic()
    # Imagine what would happen if the program stalled here:
    # the other tasks are still running, so their CPU usage
    # counters keep increasing.
    duration = current_time - last_read_time
    last_read_time = current_time
    for i in range(N):
        current_cpu_usage_counter = read_task_cpu_usage(i)
        cpu_rate[i] = (current_cpu_usage_counter - last_cpu_usage_counter_value[i]) / duration
        last_cpu_usage_counter_value[i] = current_cpu_usage_counter

For simplicity, I just assume there are always N tasks running on the machine, so we can keep a fixed-length array of properties for each task. Because tasks keep running after the duration is calculated in each iteration of the while loop, cpu_rate can come out larger than it should be. One may argue that we can solve this by placing the call to now() more cleverly and/or by maintaining a separate last_read_time for each task. But as long as time elapses between calling now() and reading a task's CPU usage counter, there is always a possibility that the calculated duration is smaller than the actual duration between two reads of the counter. In such cases, the calculated CPU rate is inflated. In most real scenarios, such noise is negligible. But in rare cases, for example when the data collection agent stalls after taking the current time but before reading the counters, the noise can be very large. In some cases, it may make the CPU rate higher than the machine's underlying capacity.
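
As a made-up numeric example of that inflation: suppose a task is genuinely using 1.0 core-s/s, the agent's two clock reads are 1.0 second apart, but the agent stalls for 0.5 seconds between reading the clock and reading the counter. The counter then advances by roughly 1.5 core-seconds while the computed duration is still 1.0 second, so the reported rate is 1.5 / 1.0 = 1.5 core-s/s even though the task never used more than one core.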

The reason maximum_cpu_rate has more abnormal rows than cpu_rate is simply that cpu_rate is averaged across 5 minutes, while maximum_cpu_rate is the maximum of the per-second CPU rates collected within those 5 minutes.

I hope the explanation is helpful. In short: it's rare enough that it can easily be considered noise.

Andrea Morichetta

Feb 24, 2021, 3:37:43 AM
to Google cluster data - discussions
Dear Nan,

Thank you for your thorough explanation. I had been wondering whether I should consider these values noise or not, and I wanted to be sure I had correctly grasped how they were collected. With your description, everything is much clearer, thanks! I will also consider switching to BigQuery for further analyses :)

Best,
Andrea
