Review on our understanding of joining instance usage + event?

67 views
Skip to first unread message

William Nixon

unread,
Jun 3, 2023, 4:22:09 PM6/3/23
to Google cluster data - discussions
Dear community,

We kindly request assistance in verifying our understanding of the dataset for our project. We aim to use machine learning to predict the actual CPU and memory utilization of a job based on its request information. To achieve this, we are looking to extract relevant variables from the dataset.

Our current approach involves:
- Gathering comprehensive information about the requested job.
- Obtaining details about the actual job usage.

To accomplish this, we have developed the following query, joining instance usage and event:

SELECT *
FROM (
  SELECT
    ...
    ROW_NUMBER() OVER (PARTITION BY start_time, u.machine_id, u.collection_id, u.instance_index ORDER BY e.time DESC) AS rn
  FROM `google.com:google-cluster-data.clusterdata_2019_e.instance_usage` AS u
  LEFT JOIN `google.com:google-cluster-data.clusterdata_2019_e.instance_events` AS e
  ON (
    u.collection_id = e.collection_id AND u.machine_id = e.machine_id
    AND u.instance_index = e.instance_index
    AND u.alloc_collection_id = e.alloc_collection_id
    AND u.alloc_instance_index = e.alloc_instance_index
    AND e.time < u.start_time
  )
  WHERE (u.alloc_collection_id IS NULL OR u.alloc_collection_id = 0)
  AND (end_time - start_time) >= (5 * 60 * 1e6)
  AND u.machine_id = 30211749330
)
WHERE rn = 1

This query accomplishes the following:
- Filters jobs with no parent, focusing only on jobs that run for a duration of at least 5 minutes (where alloc_collection_id is 0 or null) [1][2].
- Joins the instance_events table using collection_id (and alloc) and instance_index (and alloc) to obtain additional information about job request constraints. We assume that the combination of collection ID and index is unique for each machine [3].
- Uses the ROW_NUMBER() function to partition and order the results, selecting only the row with the most recent instance event update. This ensures we have the most recent resource request information when the parameters change (indicated by rn = 1).
- Filters the results for a specific machine_id to prevent the table from becoming excessively large.

We would greatly appreciate it if someone could review our query to provide peace of mind :D. 

P.S. if you believe there are other tables with useful attributes that we can leverage, please do tell. :D

Thank you, and have a wonderful day.

Best regards,

William
Bandung Institute of Technology

References:
[1] https://groups.google.com/g/googleclusterdata-discuss/c/6CZYMAqq0NU
[2] https://github.com/google/cluster-data/blob/master/clusterdata_analysis_colab.ipynb
[3] https://groups.google.com/g/googleclusterdata-discuss/c/TzI_qE4uIlU/m/iKZrj8pLCQAJ

William Nixon

unread,
Jun 6, 2023, 2:01:45 PM6/6/23
to Google cluster data - discussions
P.S. If a machine has auto scaling enabled, will constraint / requested resources change be logged somewhere, i.e. instance events?  Can an instance not have any information about it's planned resources / constraint (we won't have enough information to predict its utilization)?
Reply all
Reply to author
Forward
0 new messages