We kindly request assistance in verifying our understanding of the dataset for our project. We aim to use machine learning to predict the actual CPU and memory utilization of a job based on its request information. To achieve this, we are looking to extract relevant variables from the dataset.
- Gathering comprehensive information about the requested job.
- Obtaining details about the actual job usage.
To accomplish this, we have developed the following query, joining instance usage and event:
SELECT *
FROM (
SELECT
...
ROW_NUMBER() OVER (PARTITION BY start_time, u.machine_id, u.collection_id, u.instance_index ORDER BY e.time DESC) AS rn
FROM `google.com:google-cluster-data.clusterdata_2019_e.instance_usage` AS u
LEFT JOIN `google.com:google-cluster-data.clusterdata_2019_e.instance_events` AS e
ON (
u.collection_id = e.collection_id AND u.machine_id = e.machine_id
AND u.instance_index = e.instance_index
AND u.alloc_collection_id = e.alloc_collection_id
AND u.alloc_instance_index = e.alloc_instance_index
AND e.time < u.start_time
)
WHERE (u.alloc_collection_id IS NULL OR u.alloc_collection_id = 0)
AND (end_time - start_time) >= (5 * 60 * 1e6)
AND u.machine_id = 30211749330
)
WHERE rn = 1
This query accomplishes the following:
- Filters jobs with no parent, focusing only on jobs that run for a duration of at least 5 minutes (where alloc_collection_id is 0 or null) [1][2].
- Joins the instance_events table using collection_id (and alloc) and instance_index (and alloc) to obtain additional information about job request constraints. We assume that the combination of collection ID and index is unique for each machine [3].
- Uses the ROW_NUMBER() function to partition and order the results, selecting only the row with the most recent instance event update. This ensures we have the most recent resource request information when the parameters change (indicated by rn = 1).
- Filters the results for a specific machine_id to prevent the table from becoming excessively large.
We would greatly appreciate it if someone could review our query to provide peace of mind :D.