I'm glad to discuss this from different angles. Before detailing this particular use-case with seldon-core further, I want to say that we have a few other use-cases where we may also be interested in horizontal scaling based on usage metrics, so finding a way to implement this (even if it's not the final solution we reach) has a lot of value.
To answer the first question, "How many requests can one pod handle in parallel?": in most cases, just one request. This is because, for these requests, the input isn't a single instance that we'd like to run inference on (which is how it usually is for a lot of ML systems), but rather a bulk of instances.
To perform the computation on this bulk, we usually split it into batches, where the batch size depends on the model we're serving, the GPU type, etc. Inference is then executed sequentially, batch after batch. Because we work like this, a single request will usually drive GPU utilization to nearly 100% until the whole process is done.
Now, to be frank, this way of using an ML model server is unusual: if requests take minutes or hours to process, a workflow (argo, kubeflow, ...) that requests the needed resources on the spot and releases them once done is a better fit. That's how I initially implemented older versions of such systems.
The issue is that, alongside the majority of requests that take hours to process, there are requests that take less than a second, and for those, having a model server like seldon-core (or kfserving, ...) makes sense.
Ideally, we'd have both methods implemented and deployed: small requests would be forwarded to the always-on seldon-core model server (without worrying much about autoscaling, because a few requests having to wait an extra second or two is not a big issue for us), while big requests would trigger asynchronous workflows, each requesting its own resources and using them exclusively.
Because the current load on our system is very low, I decided that, to make the best use of expensive cloud Nvidia GPUs, I could use just the synchronous seldon-core model server to handle both types. If a big request happens to fully consume the server's resources for a long time (longer than 30 seconds, for example), a new replica would be created so it's ready for any potential future requests.
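To make that idea concrete, the condition I have in mind would look roughly like the following PromQL (just a sketch: the dcgm_gpu_utilization gauge comes from the DCGM exporter we're experimenting with, and the 80% threshold and 10s resolution are placeholders, not values we've settled on):

# Sketch: keep only the nodes whose average GPU utilization stayed above ~80%
# at every evaluation point during the last 30s.
min_over_time(
  (avg by (node) (dcgm_gpu_utilization))[30s:10s]
) > 80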
I hope the use-case makes a bit more sense now.
I'll try to look into the default seldon-core metrics, because if they can indicate that the server has been under load for the last 30 seconds or so (i.e. one or more requests are being processed), we can use that. Still, I hope you agree that there is merit in figuring out how to autoscale based on hardware resource usage (GPUs, TPUs, etc.).
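For reference, the kind of check I'd hope to build from those default metrics would look something like this (a sketch only: I'm assuming the executor exposes a request-duration histogram named something like seldon_api_executor_server_requests_seconds, which I still need to confirm against our deployment's metrics endpoint):

# Sketch: approximate "busy fraction" per pod over the last 30s, i.e. seconds
# spent processing requests per second of wall-clock time; a value close to 1
# means the pod spent essentially the whole window handling requests.
sum by (pod) (
  rate(seldon_api_executor_server_requests_seconds_sum[30s])
)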
Understood. Thank you for the explanation!
I tried the expression and I got the following error:
Error executing query: multiple matches for labels: grouping labels must ensure unique matches
The first part of the query returns:
{node="gke-test-hpa-gpu-nodes-0f879509-jrx0"} 4
{node="gke-test-hpa-gpu-nodes-0f879509-z7p2"} 0
Which looks like the expected result. Note that I have a single-replica GPU deployment running (hence the 4% usage).
As for the second part, this is the output (a bit long so I placed it in a pastebin): https://pastebin.com/KH8VCGB4
I think this is perhaps because we're not filtering for pods that belong to our targeted deployment?
I modified the query to this:
avg by(node) (dcgm_gpu_utilization)
* on(node) group_right(pod, namespace)
max by(node, pod, namespace) (kube_pod_info{pod=~"cuda-test-.*"})
And it seems to give the right value:
{node="gke-test-hpa-gpu-nodes-0f879509-jrx0"} 4
Please let me know if this is the desired result. Many thanks!
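As a follow-up, my next step will probably be to collapse this into a single series for the whole deployment before exposing it to the HPA, roughly like this (a sketch using the same metric and label names as above; whether averaging across pods is the right aggregation is still an open question on my side):

# Sketch: one value for the deployment, averaging GPU utilization across all
# pods matching cuda-test-*.
avg(
  avg by (node) (dcgm_gpu_utilization)
  * on (node) group_right
  max by (node, pod, namespace) (kube_pod_info{pod=~"cuda-test-.*"})
)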
--Data Engineer - InstaDeep Ltd