With two similar GPUs, it is best to use the 'roundrobin' backend with 3 or 4 threads. Four threads will give you more nodes per second, but because of the way MCTS works, some of those extra computing cycles may be wasted, so three may be the best option.
If you have more than two similar GPUs, you should use the 'demux' backend with only two threads; there is no need for more than two.
If your GPUs are dissimilar, it is best to use the 'multiplexing' backend, because with both the 'roundrobin' and 'demux' backends all GPUs have to wait until the weakest GPU is done.
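
The rules above can be summed up as a small decision helper. This is only a sketch: `choose_backend` and `BackendChoice` are hypothetical names made up for illustration and are not part of the engine. It defaults to three threads for the two-GPU case (four is the alternative mentioned above), and it returns no thread count for dissimilar GPUs because the guidance does not give one.

```python
from typing import NamedTuple, Optional


class BackendChoice(NamedTuple):
    backend: str
    # None means the guidance above does not recommend a thread count.
    threads: Optional[int]


def choose_backend(gpu_count: int, similar: bool) -> BackendChoice:
    """Map a GPU setup to the backend and thread count suggested above."""
    if gpu_count < 2:
        # Assumption: single-GPU setups are outside the scope of this guidance.
        raise ValueError("this guidance only covers two or more GPUs")
    if not similar:
        # With dissimilar GPUs, the other two backends stall on the weakest one.
        return BackendChoice("multiplexing", None)
    if gpu_count == 2:
        # Three threads by default; four gives more nodes per second
        # but may waste some of the extra work.
        return BackendChoice("roundrobin", 3)
    # More than two similar GPUs: two threads are enough.
    return BackendChoice("demux", 2)


if __name__ == "__main__":
    for count, alike in [(2, True), (4, True), (2, False)]:
        print(count, alike, choose_backend(count, alike))
```

Treat the returned values as a starting point and measure nodes per second on your own hardware before settling on a configuration.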