You may want to use distributed stress testing tools. Tools running on a single machine are seldom useful, because most MQTT brokers can handle more load than a single machine can produce. Also, a single machine with one network interface can only simulate ~64510 client connections, since every outgoing connection to the broker needs its own source port (and even that only works if you don't use a library which spawns a thread per connection). And of course you should make sure to run the tools and the brokers on separate, dedicated hardware.
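To put rough numbers on the connection limit, here is a back-of-the-envelope sketch in Python. The function names are made up for illustration, and the port ranges are assumptions: 1025-65535 is the theoretical ceiling of unprivileged ports, while many Linux systems default to an ephemeral range of 32768-60999 unless you widen it.

```python
import math


def max_outbound_connections(port_low: int = 1025,
                             port_high: int = 65535,
                             source_ips: int = 1) -> int:
    """Upper bound on connections to ONE broker address and port.

    Each connection needs a distinct (source IP, source port) pair,
    so the bound is the number of usable ports times the number of
    source IP addresses on the load generator.
    """
    return source_ips * (port_high - port_low + 1)


def machines_needed(target_connections: int, per_machine: int) -> int:
    """How many load generators a target connection count requires."""
    return math.ceil(target_connections / per_machine)


# Theoretical ceiling with all unprivileged ports available.
ceiling = max_outbound_connections()

# Assumed Linux default ephemeral range (net.ipv4.ip_local_port_range).
linux_default = max_outbound_connections(32768, 60999)

print(ceiling, machines_needed(1_000_000, ceiling))
print(linux_default, machines_needed(1_000_000, linux_default))
```

In practice you will probably hit file descriptor limits, memory or CPU before you hit the port limit, so treat these numbers as a best case.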
I have heard that people find mqtt-malaria useful for this kind of test. For testing your exact load distribution, though, you will probably have to roll your own tool.
Another caveat: benchmarks don't always test what you would expect, and correct benchmarks are very rare. Most benchmarks test artificial conditions you won't face in production systems.
That said, I would recommend the following:
* mqtt-malaria
* mqttbench (I wasn't able to produce more than 20k msg/s on my machine, though)
* A simple implementation of your own, based on non-blocking, async libraries (libuv, netty, grizzly, ...)
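If you go the roll-your-own route, a minimal sketch could look like the following. It uses Python's asyncio as a stand-in for the async libraries listed above and hand-builds an MQTT 3.1.1 CONNECT packet, so it needs no third-party client. The function names, the `load-N` client IDs and the broker address are all hypothetical, and it only opens connections: no TLS, no auth, no publish/subscribe traffic, no keepalive pings.

```python
import asyncio
import struct


def encode_remaining_length(n: int) -> bytes:
    """Encode MQTT's variable-length 'remaining length' field."""
    out = bytearray()
    while True:
        n, digit = divmod(n, 128)
        if n > 0:
            digit |= 0x80  # continuation bit: more bytes follow
        out.append(digit)
        if n == 0:
            return bytes(out)


def connect_packet(client_id: str, keepalive: int = 60) -> bytes:
    """Build an MQTT 3.1.1 CONNECT packet with the clean-session flag set."""
    cid = client_id.encode("utf-8")
    variable_header = (
        b"\x00\x04MQTT"                  # protocol name
        + b"\x04"                        # protocol level 4 (MQTT 3.1.1)
        + b"\x02"                        # connect flags: clean session
        + struct.pack("!H", keepalive)   # keepalive in seconds
    )
    payload = struct.pack("!H", len(cid)) + cid
    body = variable_header + payload
    return b"\x10" + encode_remaining_length(len(body)) + body


async def connect_one(host: str, port: int, client_id: str):
    """Open one connection; return the writer if the broker accepted it."""
    reader, writer = await asyncio.open_connection(host, port)
    writer.write(connect_packet(client_id))
    await writer.drain()
    connack = await reader.readexactly(4)  # CONNACK is always 4 bytes
    if connack[0] != 0x20 or connack[3] != 0x00:
        writer.close()
        raise ConnectionError(f"{client_id}: broker refused: {connack!r}")
    return writer


async def open_many(host: str, port: int, count: int) -> int:
    """Open `count` connections concurrently; return how many succeeded."""
    results = await asyncio.gather(
        *(connect_one(host, port, f"load-{i}") for i in range(count)),
        return_exceptions=True,
    )
    writers = [r for r in results if not isinstance(r, BaseException)]
    for w in writers:  # a real test would hold these open and send traffic
        w.close()
    return len(writers)
```

You would start it with something like `asyncio.run(open_many("localhost", 1883, 1000))`, assuming a broker is listening there without authentication. Because everything runs on one event loop there is no thread per connection, which is the point of picking a non-blocking library for this.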
Hope this helps!
Dominik