Hi Gareth,
Some comments re your questions...
When using multi-tenancy, Cockpit is able to swap between tenants, but this swaps logical engines within the cluster, not physical nodes across the cluster.
Q1 - Keep in mind that there may not be a one-to-one affinity between a node and a process instance. Consider a process with two sequential asynchronous continuations. The first continuation may be executed by the job executor on node two, the next part of the process by the job executor on node one. Hence there may not be affinity. (You could force affinity with a heterogeneous deployment, but then you don't really have a cluster...)
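To make that concrete, here is a sketch of what such a process might look like in BPMN 2.0 XML (a fragment only - the camunda namespace declaration is omitted, and the element IDs and delegate class names are purely illustrative). Each camunda:asyncBefore creates a transaction boundary and queues a job, and whichever node's job executor acquires that job runs the continuation:

```xml
<process id="sampleProcess" isExecutable="true">
  <startEvent id="start"/>
  <sequenceFlow id="flow1" sourceRef="start" targetRef="task1"/>
  <!-- asyncBefore creates an asynchronous continuation: a job is queued
       and may be acquired by the job executor on ANY node in the cluster -->
  <serviceTask id="task1" camunda:asyncBefore="true"
               camunda:class="org.example.Step1Delegate"/>
  <sequenceFlow id="flow2" sourceRef="task1" targetRef="task2"/>
  <!-- the second continuation may well be acquired by a different node -->
  <serviceTask id="task2" camunda:asyncBefore="true"
               camunda:class="org.example.Step2Delegate"/>
  <sequenceFlow id="flow3" sourceRef="task2" targetRef="end"/>
  <endEvent id="end"/>
</process>
```

So even within a single process instance, execution can hop between nodes at every continuation boundary.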
Q2 - I would recommend thinking about a different view of the taxonomy. In your Tomcat cluster you may have many Tomcat nodes, however conceptually you have one stateless engine distributed across the cluster nodes. If you use multi-tenancy, you are creating multiple logical process engines, again distributed across multiple cluster nodes.
Q3 - the default deployment sets the history level to full (everything is recorded). This kills the DB under load. Change the history level to activity or none and you will see a considerable improvement. Then there is DB-specific tuning...
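For example, assuming the shared-engine distribution, the history level can be set in the bpm-platform.xml deployment descriptor (adjust to suit if you configure the engine via camunda.cfg.xml or Spring instead; the engine name here is just the default):

```xml
<process-engine name="default">
  <properties>
    <!-- valid levels: full, audit, activity, none -->
    <property name="history">activity</property>
  </properties>
</process-engine>
```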
In practice, the BPM engine would not typically be your bottleneck. The BPM engine should be orchestrating external services and tasks rather than performing any heavy lifting. Hence the BPM orchestrations are usually much lighter than the task processes, and thus the limiting factor will be wherever the tasks are actually executed...
Performance of a BPM engine is a little different to what you may be familiar with from, say, a web server. In terms of performance, you really want to be clear on what your requirements are: are you more concerned with the response time to instantiate a process instance, or with process throughput? For example, if process instantiation time is your main objective, then make the start of the process asynchronous. Hence as soon as the process instance is created, the client thread returns. I can easily load a dual-CPU Amazon instance with a sustained 90tps and a response time in the 100ms range. The consequence is that I sacrifice process throughput, as the processes are executed entirely in the background by the job executor, and this adds DB overhead. Alternately, if I want more throughput and I'm prepared to sacrifice response time, the engine could borrow the client thread for longer and thus perform more processing in the client thread context with lower DB overhead. An alternate metric is process liveness, but that's a much larger discussion...
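As a sketch of the first option, marking the start event asyncBefore means the client call returns as soon as the instance is created, and everything downstream runs in the job executor (again a fragment with the namespace declaration omitted and illustrative IDs):

```xml
<process id="asyncStartProcess" isExecutable="true">
  <!-- the client's start call returns once the instance and the job
       are committed; the rest of the process runs in the job executor -->
  <startEvent id="start" camunda:asyncBefore="true"/>
  <!-- ... rest of the process ... -->
</process>
```

Removing the asyncBefore flag gives you the opposite trade-off: the client thread carries execution up to the first wait state, so you get fewer DB round trips per instance at the cost of a longer response time.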
The great thing about option 1 (the asynchronous start) is that if you have a usage pattern with massive load during the day but a trough at night, the engine can absorb huge numbers of process instantiations during the day with little impact on user-perceived performance, buffer them, and catch up on the processing during the slack times...
And yes - AWS is Amazon Web Services. With RDS I can have a high-performance database setup, with synchronous data replication across availability zones and backups, all ready to go in about 15 minutes...