I'd still highly recommend using perf-test to reproduce the workload, so that you can investigate it in another environment and play with different scenarios.
When you say ~6k "rps", do you mean 6k messages published (coming into RabbitMQ) per second? If so, given what you described, you actually have 6k * 4 ≈ 24k msg/s or more going into the queues, depending on the topics published and the bindings. If you expect to publish 40k msg/s, and that gets multiplied by 4 or more, then quorum queues most likely can't handle it currently, unless your cluster is large enough to spread the load better. Quorum queues on a given node share a write-ahead log, which can generally handle tens of thousands of messages per second. Streams could be an option though: you could deliver the messages to a single stream and have 4 consumers read from it (no need for the fanout).
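To make the multiplication concrete, here is a rough back-of-the-envelope sketch. The function name `effective_queue_rate` is made up for illustration, and the factor of 4 is an assumption taken from your description (each message routed to 4 queues); it is a lower bound if some messages match more bindings.

```python
def effective_queue_rate(publish_rate: int, queues_per_message: int) -> int:
    """Messages per second that queues must absorb in total.

    Each published message is written once per queue it is routed to,
    so the queue-side rate is the publish rate times the fan-out factor.
    """
    return publish_rate * queues_per_message


# Assumption: every message is routed to 4 queues (a lower bound).
FANOUT = 4

current = effective_queue_rate(6_000, FANOUT)   # today's publish rate
target = effective_queue_rate(40_000, FANOUT)   # the rate you hope to reach

print(f"current: {current} msg/s into queues")
print(f"target:  {target} msg/s into queues")
```

With these assumed numbers the target works out to 160k msg/s into the queues, which is why I doubt quorum queues on a small cluster will keep up.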
I'm not sure if you are adding publishers because you have to or because you think it will speed things up. If the latter, it's most likely counterproductive. More publishers mean:
* more Erlang processes running on your CPUs (each connection requires some processes and resources in general)
* more work for the queue to send confirmations
Run `rabbitmq-diagnostics observer` and take some screenshots - this will give us an idea of what's going on.
Best,