I'm glad to discuss this from different angles. Before detailing this particular use-case with seldon-core further, I want to say that we have a few other use-cases where we may also be interested in horizontal scaling based on usage metrics, so finding a way to implement this (even if it's not the final solution we reach) has a lot of value.
To answer the first question, "How many requests can one pod handle in parallel?": in most cases, just one request. This is because, for these requests, the input isn't a single instance that we'd like to run inference on (which is how it usually is for a lot of ML systems), but rather a bulk of instances.
To perform the computation on this bulk, we usually split it into batches, where the batch size depends on the model we're serving, the GPU type, etc. Inference is then executed sequentially, batch after batch. Because we work like this, a single request will usually drive GPU utilization to nearly 100% until the whole process is done.
Now, to be frank, this way of using an ML model server is unusual: if requests take minutes or hours to process, a workflow (argo, kubeflow, ...) that requests the needed resources on the spot and releases them once done is a better fit. That's how I initially implemented older versions of such systems.
The issue is that, alongside the majority of requests that take hours to process, there are requests that take less than a second, and for those, having a model server like seldon-core (or kfserving, ...) makes sense.
Ideally, we'd have both methods implemented and deployed: small requests would be forwarded to the always-on seldon-core model server (without worrying much about autoscaling, because a few requests having to wait an extra second or two is not a big issue for us), while big requests would trigger asynchronous workflows, each requesting its own resources and using them exclusively.
Because the current load on our system is very low, I decided that, to make the best use of expensive cloud Nvidia GPUs, I could use just the synchronous seldon-core model server to handle both types. If a big request happens to fully consume the server's resources for a long time (longer than 30 seconds, for example), a new replica would be created so it's ready for any potential future requests.
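To make that idea concrete, the condition I have in mind would look roughly like the following PromQL (just a sketch: the dcgm_gpu_utilization gauge comes from the DCGM exporter we're experimenting with, and the 80% threshold and 10s resolution are placeholders, not values we've settled on):

# Sketch: keep only the nodes whose average GPU utilization stayed above ~80%
# at every evaluation point during the last 30s.
min_over_time(
  (avg by (node) (dcgm_gpu_utilization))[30s:10s]
) > 80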
I hope the use-case makes a bit more sense now.
I'll try to look into the default seldon-core metrics, because if they can indicate that the server has been under load for the last 30 seconds or so (i.e. one or more requests are being processed), we can use that. Still, I hope you agree that there is merit in figuring out how to autoscale based on hardware resource usage (GPUs, TPUs, etc.).
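For reference, the kind of check I'd hope to build from those default metrics would look something like this (a sketch only: I'm assuming the executor exposes a request-duration histogram named something like seldon_api_executor_server_requests_seconds, which I still need to confirm against our deployment's metrics endpoint):

# Sketch: approximate "busy fraction" per pod over the last 30s, i.e. seconds
# spent processing requests per second of wall-clock time; a value close to 1
# means the pod spent essentially the whole window handling requests.
sum by (pod) (
  rate(seldon_api_executor_server_requests_seconds_sum[30s])
)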
Understood. Thank you for the explanation!
I tried the expression and I got the following error:
Error executing query: multiple matches for labels: grouping labels must ensure unique matches
The first part of the query returns:
{node="gke-test-hpa-gpu-nodes-0f879509-jrx0"} 4
{node="gke-test-hpa-gpu-nodes-0f879509-z7p2"} 0
Which looks like the expected result. Note that I have a single-replica GPU deployment running (hence the 4% usage).
As for the second part, this is the output (a bit long so I placed it in a pastebin): https://pastebin.com/KH8VCGB4
I think this is perhaps because we're not filtering for pods that belong to our targeted deployment?
I modified the query to this:
avg by(node) (dcgm_gpu_utilization)
* on(node) group_right(pod, namespace)
max by(node, pod, namespace) (kube_pod_info{pod=~"cuda-test-.*"})
And it seems to give the right value:
{node="gke-test-hpa-gpu-nodes-0f879509-jrx0"} 4
Please let me know if this is the desired result. Many thanks!
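As a follow-up, my next step will probably be to collapse this into a single series for the whole deployment before exposing it to the HPA, roughly like this (a sketch using the same metric and label names as above; whether averaging across pods is the right aggregation is still an open question on my side):

# Sketch: one value for the deployment, averaging GPU utilization across all
# pods matching cuda-test-*.
avg(
  avg by (node) (dcgm_gpu_utilization)
  * on (node) group_right
  max by (node, pod, namespace) (kube_pod_info{pod=~"cuda-test-.*"})
)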
--Data Engineer - InstaDeep Ltd