Should Alloc Set and Job Resource Usage be Calculated Together? Why is Disk and Network Resource Usage Missing?


Ifan Luan

Apr 10, 2023, 11:05:34 AM
to Google cluster data - discussions
I recently ran into some issues while studying the Google Cluster Trace v3 dataset. First, collections in v3 include both jobs and alloc sets, and I am unsure whether to count them together when analyzing resource usage. When I include alloc sets in the calculation, the cluster CPU and memory utilization comes out much higher than reported in "Borg: The Next Generation", whereas counting only job resource usage produces similar results. An example of this calculation for cell e is shown below.
[Attached image: 图片1.png]

Additionally, I noticed that v3 appears to lack information on disk and network resource usage compared to v2, and I haven't found any articles that clearly explain the reason behind this. I am curious about the rationale behind this approach.

I would greatly appreciate any help or insights you can offer. If you have any thoughts or suggestions, please do not hesitate to share them.

Thank you!

Ifan Luan

Apr 10, 2023, 11:10:53 AM
to Google cluster data - discussions
I have noticed similar questions being raised before: in v2, jobs and alloc sets were counted together because they were mixed together in the trace. However, the source code of some v3 resource usage analysis papers only considers job resource usage, which leads me to believe the authors may have had similar concerns.

Nan Deng

Apr 28, 2023, 7:10:17 PM
to Google cluster data - discussions
About alloc sets and jobs: some jobs do not belong to any alloc set, so you need to count them separately. An alloc set's usage already includes the usage of its children.

What you need to do is include every collection that has no parent, i.e. alloc sets plus jobs that do not belong to any alloc set. These are what we call top-level entities. Take a look at what we did here: https://github.com/google/cluster-data/blob/master/clusterdata_analysis_colab.ipynb.
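As a rough illustration of the top-level rule (this is not the code from the linked notebook), the sketch below sums usage over alloc sets and parentless jobs only. The record layout is hypothetical: the field names `collection_type`, `alloc_collection_id`, `start_time` and `cpu_usage` are modeled on the v3 schema but simplified, and treating `None` or `0` as "no parent" is an assumption to verify against the real tables.

```python
from collections import defaultdict

# Assumed enum values for collection_type; check against the trace schema.
JOB, ALLOC_SET = 0, 1


def is_top_level(row):
    """True for alloc sets and for jobs that have no parent alloc set."""
    if row["collection_type"] == ALLOC_SET:
        return True
    # Assumption: a missing or zero alloc_collection_id means "no parent".
    return row.get("alloc_collection_id") in (None, 0)


def top_level_usage(usage_rows):
    """Sum CPU usage per time window over top-level entities only."""
    totals = defaultdict(float)
    for row in usage_rows:
        if is_top_level(row):
            totals[row["start_time"]] += row["cpu_usage"]
    return dict(totals)


# Made-up rows for a single time window, purely for illustration.
rows = [
    # An alloc set; its usage already covers the job running inside it.
    {"collection_id": 100, "collection_type": ALLOC_SET,
     "alloc_collection_id": None, "start_time": 0, "cpu_usage": 0.5},
    # A job inside alloc set 100; counting it again would double count.
    {"collection_id": 200, "collection_type": JOB,
     "alloc_collection_id": 100, "start_time": 0, "cpu_usage": 0.25},
    # A standalone job with no parent alloc set.
    {"collection_id": 300, "collection_type": JOB,
     "alloc_collection_id": None, "start_time": 0, "cpu_usage": 0.25},
]

top_level = top_level_usage(rows)
naive = sum(r["cpu_usage"] for r in rows)  # alloc sets and all jobs together
```

With these made-up rows, the naive sum counts job 200 twice (once directly and once through its alloc set), which is the kind of inflation described in the original question.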

About network and disk: Google has gone diskless, see https://sre.google/resources/practices-and-processes/infrastructure-change-management/, case study 2. This move means disks and other storage devices are mostly managed by D, which is explained in the linked doc.

For network, yeah, currently, the data doesn't include network.

Ifan Luan

May 13, 2023, 1:57:41 AM
to Google cluster data - discussions
Thank you very much for your answer. As the analysis progresses, I have some other questions:
1. I found that the resource usage of an alloc set differs from the combined resource usage of the instances it contains. Is this difference caused by the overhead of maintaining the alloc set's resource allocation, by monitoring error, or by other factors?
2. I found that the numbers of submit, schedule, and terminate events (EVICT, FAIL, FINISH, KILL, LOST) for many retried instances do not match, and in some cases there are fewer schedule events than submit and terminate events. I also observed the sequence ...->SCHEDULE->UPDATE_RUNNING->SUBMIT. This leaves me confused about the instance retry mechanism and makes it difficult to obtain the scheduling time and execution time of an instance.
In summary, the three questions are:
1) What are the possible reasons for the difference in resource usage between an alloc set and the instances it contains?
2) What could cause the inconsistent numbers of submit, schedule, and terminate events for retried instances, and how should such event sequences be interpreted to understand instance retries?
3) Is it feasible to calculate the scheduling time and execution time of instances?
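
To make the event-sequence question concrete, here is a minimal sketch of one possible way to split an instance's events into attempts and derive queueing and run durations. It is not an official interpretation of the trace: the rule that each SUBMIT opens a new attempt (closing any attempt that never saw a terminal event) is an assumption, and `split_attempts`, `durations` and the sample timestamps are hypothetical.

```python
TERMINAL = {"EVICT", "FAIL", "FINISH", "KILL", "LOST"}


def _new_attempt(submit=None):
    return {"submit": submit, "schedule": None, "end": None, "end_type": None}


def split_attempts(events):
    """Split one instance's (timestamp, event_type) pairs into attempts.

    `events` must be sorted by timestamp. Assumption: every SUBMIT starts a
    new attempt, even if the previous one has no terminal event in the trace.
    """
    attempts, current = [], None
    for ts, etype in events:
        if etype == "SUBMIT":
            if current is not None:
                attempts.append(current)  # previous attempt left unterminated
            current = _new_attempt(submit=ts)
        elif etype == "SCHEDULE":
            if current is None:
                current = _new_attempt()  # SCHEDULE seen without a SUBMIT
            if current["schedule"] is None:
                current["schedule"] = ts  # keep the first SCHEDULE only
        elif etype in TERMINAL:
            if current is None:
                current = _new_attempt()  # terminal event without a SUBMIT
            current["end"], current["end_type"] = ts, etype
            attempts.append(current)
            current = None
        # Other event types (e.g. UPDATE_RUNNING) do not change the attempt.
    if current is not None:
        attempts.append(current)
    return attempts


def durations(attempt):
    """Return (queue_time, run_time); None where an endpoint is missing."""
    sub, sch, end = attempt["submit"], attempt["schedule"], attempt["end"]
    queue = sch - sub if sub is not None and sch is not None else None
    run = end - sch if sch is not None and end is not None else None
    return queue, run


# Made-up sequence that includes the SCHEDULE->UPDATE_RUNNING->SUBMIT pattern.
events = [
    (0, "SUBMIT"), (5, "SCHEDULE"), (20, "EVICT"),
    (21, "SUBMIT"), (30, "SCHEDULE"), (31, "UPDATE_RUNNING"),
    (40, "SUBMIT"), (45, "SCHEDULE"), (60, "FINISH"),
]
attempts = split_attempts(events)
per_attempt = [durations(a) for a in attempts]
```

Under this assumed rule, an attempt with no terminal event still yields a queueing time but no run time, so mismatched event counts show up as `None` entries instead of being silently dropped.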