I am working on a machine learning project where we use the Openscoring REST API to turn our models into inference servers. We now need to handle a large number of models (thousands) to support multitenancy in our application, which we achieved by running multiple Openscoring REST API nodes and grouping a certain number of models on each node. The hardest problem arises when requests for a particular model exceed its node's computational capacity and we want to scale out serving for that specific model.
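To make the setup concrete, here is a minimal sketch of the kind of routing layer this implies: a map from model ID to the set of Openscoring nodes hosting a replica of that model, with round-robin selection so a hot model's traffic spreads across its replicas. The node URLs and model IDs are made up for illustration, and the replica map would in practice come from whatever orchestration you use.

```python
import itertools

# Hypothetical inventory: which Openscoring nodes host which models.
# Node URLs and model IDs here are placeholders, not real endpoints.
MODEL_REPLICAS = {
    "model-a": ["http://node-1:8080", "http://node-2:8080"],  # hot model, 2 replicas
    "model-b": ["http://node-1:8080"],                        # cold model, 1 replica
}

# One round-robin cursor per model, so consecutive requests for the
# same model rotate through all nodes that host a replica of it.
_cursors = {m: itertools.cycle(urls) for m, urls in MODEL_REPLICAS.items()}

def route(model_id: str) -> str:
    """Return the base URL of the next replica that should receive a
    request for `model_id`."""
    return next(_cursors[model_id])
```

Scaling a hot model then amounts to deploying it to one more node and adding that node's URL to its replica list; the router starts spreading traffic immediately.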
The options we have been able to think of are:
Do you have any advice on how to overcome this problem?
Thank you!