I saw other threads that sounded related, but I think this one is slightly different.
I observed some weird behavior in a couple of production environments that I could not replicate in QA (not surprisingly). I wanted to check whether anyone has experienced a similar problem or has an explanation for it.
- The Job Queue (mirrored) is shared between 2 competing consumers (on separate machines) and is already filled with more than 1000 messages (I know that's not ideal).
- When I start both consumers, each consumer fetches the number of messages specified when the queue was configured on the bus (prefetch = 8), and 8 messages are consumed in parallel on each consumer as expected.
At this point I can see the Unacked count for the Job Queue is 16 in the RabbitMQ admin UI, which is perfectly fine.
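For reference, here is a minimal sketch of what I understand the prefetch setup to boil down to, assuming the bus wraps the standard RabbitMQ .NET client (the host name and the exact wiring are my assumptions, not the actual bus code):

```csharp
using RabbitMQ.Client;

var factory = new ConnectionFactory { HostName = "rabbit-host" }; // hypothetical host
using var connection = factory.CreateConnection();
using var channel = connection.CreateModel();

// prefetchCount = 8: the broker keeps at most 8 unacked messages
// in flight per consumer on this channel, so two consumers should
// show Unacked = 16 while both are busy.
channel.BasicQos(prefetchSize: 0, prefetchCount: 8, global: false);
```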
- Quick info about the subscription handler used in the consumers: each message (job request) is handled in a separate .NET process for historical reasons. The code is simply this:
// Launch the job in its own .NET process (historical design decision)
Process process = JobUtil.GetProcess(jobId);
process.Start();
Thread.Sleep(50);        // Yield briefly so the process can spin up
process.WaitForExit();   // Block the handler thread until the job finishes
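For context, I believe the handler above effectively sits inside a consume callback roughly like this, with manual acks (the delivery-tag/ack wiring and payload format are my assumptions about what the bus does under the hood, not code I can see):

```csharp
using System.Diagnostics;
using System.Text;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

var consumer = new EventingBasicConsumer(channel);
consumer.Received += (sender, ea) =>
{
    var jobId = Encoding.UTF8.GetString(ea.Body.ToArray()); // hypothetical payload format
    Process process = JobUtil.GetProcess(jobId);            // hypothetical helper from my codebase
    process.Start();
    process.WaitForExit(); // blocks this callback until the job finishes

    // The ack should go out here, once per finished job, releasing one
    // prefetch slot so the broker can push the next message immediately.
    channel.BasicAck(ea.DeliveryTag, multiple: false);
};
channel.BasicConsume(queue: "JobQueue", autoAck: false, consumer: consumer);
```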
- Now, the problem: after a certain amount of time, while these jobs are running, the Unacked count in RabbitMQ drops to 0, and when I look at the queue in the RabbitMQ admin UI, I see no consumers attached to the Job Queue. But there are still plenty of messages waiting in the Job Queue, and I know the consumers are busy processing existing jobs, because I can watch the processes running on the consumer machines.
- After a job (process) finishes, RabbitMQ does not seem to push another message, and the Unacked count for the Job Queue remains 0. Normally this should always stay in sync with my prefetch count as long as there are messages in the queue, right?
As a result, the number of processes running on each consumer gradually drops from 8 to 1, and only when the consumer finishes executing the very last process does RabbitMQ push another 8 messages. The scenario is the same for the second consumer.
This causes a performance problem, since each consumer always waits for the slowest job to finish before receiving 8 more jobs.
- I am trying to understand what the problem might be here. I think if this were a heartbeat problem (heartbeats should be enabled by default in 2.10 anyway), the consumer would not be able to get 8 new messages after finishing the last job.
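In case it matters, this is roughly how heartbeats would be configured on the .NET client's connection factory; a sketch only (the 60-second value is just an illustration, and the property type differs between client versions):

```csharp
using RabbitMQ.Client;

var factory = new ConnectionFactory
{
    HostName = "rabbit-host",                      // hypothetical host
    RequestedHeartbeat = TimeSpan.FromSeconds(60), // client 6.x; older clients take a ushort in seconds
};
// If heartbeats were being missed, the whole connection would be closed,
// so the consumer could not receive 8 new messages afterwards without
// reconnecting -- which is why I doubt this is a heartbeat issue.
using var connection = factory.CreateConnection();
```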
- It looks like my consumer sends multiple acknowledgements only after finishing the last job, but I cannot see any reason why it would do so. I also have no explanation for why the Job Queue's Unacked count in the admin UI does not reflect the number of messages being processed in my consumers. And again, the same setup in the QA environment works like a charm: Unacked is always aligned with the prefetch, and consumers keep processing multiple jobs at a time, unlike in production.
Sorry for the long explanation, but I tried to give as much information as I could. I am not sure whether the problem is related to running a separate process in the subscription handler and blocking on it.
Please share your thoughts, since this is a critical performance problem in production right now.