Hello,
I am working on a plan to replace our Azure IoT Hub with a RabbitMQ cluster, which we would host ourselves in Azure.
I have done a usability proof of concept, and also a benchmark with the anticipated traffic.
First, a description of what we have:
For evaluation purposes, I set up the cluster myself as 3 nodes on VMs with 4 CPUs and 16 GB RAM each, and premium SSDs for the Mnesia DB. I considered all points in the production checklist, such as socket limits, TCP buffer sizes, etc.
It can handle the traffic I expect with some margin.
For reference, our expected setup:
- 50k IoT devices, each having its own queue and user credentials
- each device sends 4 msgs per minute, 2-8kB payloads.
- a couple of queues for services handling these messages (under 100), using mostly topic exchanges for message routing.
- occasional message from service to device - for this I configured a direct exchange with each IoT device queue bound to it (= this exchange has 50k bindings)
- if we add more devices, we would add more nodes, so that queues and TCP connections are distributed (each client prefers to connect to the node hosting its queue)
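To put numbers on the setup above, here is a rough back-of-the-envelope estimate in Python. It is only a sketch: the function name `estimate_load` is mine, it counts device-to-service messages only, it ignores protocol overhead, and it assumes 1 MB = 1000 kB.

```python
# Rough load estimate from the figures listed above.
# Assumptions (mine, not measured): payload only, no AMQP/TCP overhead,
# device-to-service traffic only, 1 MB = 1000 kB.
MSGS_PER_DEVICE_PER_MIN = 4
PAYLOAD_KB_RANGE = (2, 8)  # smallest and largest expected payload in kB


def estimate_load(devices: int) -> dict:
    """Return message rates and the payload bandwidth range for a device count."""
    msgs_per_min = devices * MSGS_PER_DEVICE_PER_MIN
    msgs_per_sec = msgs_per_min / 60
    msgs_per_day = msgs_per_min * 60 * 24
    low_kb, high_kb = PAYLOAD_KB_RANGE
    return {
        "msgs_per_sec": msgs_per_sec,
        "msgs_per_day": msgs_per_day,
        "mb_per_sec_low": msgs_per_sec * low_kb / 1000,
        "mb_per_sec_high": msgs_per_sec * high_kb / 1000,
    }


if __name__ == "__main__":
    for devices in (50_000, 100_000):
        load = estimate_load(devices)
        print(
            f"{devices} devices: "
            f"{load['msgs_per_sec']:.0f} msgs/s, "
            f"{load['msgs_per_day']:,} msgs/day, "
            f"{load['mb_per_sec_low']:.1f}-{load['mb_per_sec_high']:.1f} MB/s"
        )
```

With these inputs the daily volume comes out at 288 million messages for 50k devices and 576 million for 100k devices.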
Finally, the question
Disclaimer: I don't know that much about Kubernetes.
My architect does not want us to run RabbitMQ directly on VMs, and I don't blame him, as somebody would have to maintain this (probably me anyway). What he wants is to run this on a Kubernetes cluster, so that we can migrate elsewhere at the click of a button. I am not against this idea, but I am worried about the scalability of this approach.
When going through the documentation, I stumbled upon a sentence saying that the CPU should not be shared with other services in heavy-load scenarios. I brought this up with the architect, and he replied that we can create a separate k8s cluster just for RabbitMQ.
At this point I started to think: does that even make sense? Are there performance considerations when RabbitMQ runs in k8s? What I am worried about is that with this number of clients (50k now, 100k in a couple of years) we would have connectivity problems when each connection has to go through an ingress. With my VM setup that is not the case.
So I am worried that by running it in Kubernetes we lose too much control (or at least I stop understanding what exactly is happening), and that there might be some overhead that prevents this from scaling well.
TL;DR
We want to migrate our IoT Hub of 50-100k devices (up to ca. 600 million messages per day at 100k devices) to RabbitMQ, and tests show we could. Our architect is strongly against installing RabbitMQ on VMs and wants it to run on Kubernetes instead. Does that make sense, considering the traffic and the number of concurrent connections (100k) we need?
With regards,
Marek