Hello Eddy,
Here is a more detailed answer. First, a little general background.
In a transient circuit simulation, there are two sources of computational expense. One is the device evaluations, and the other is the linear solve. The device evaluations produce all the values that are put into the linear system prior to that solve.
Device evaluations tend to dominate the runtime cost in small circuits. As the problem size gets bigger, however, the two expenses scale differently. Device evaluations scale linearly, while the linear solve cost scales superlinearly. So, eventually, you will reach problem sizes where the linear solve is more expensive than device evaluations.
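As a rough sketch (my own illustration, with made-up coefficients, not measured Xyce numbers), you can think of the cost per time step as something like

    cost(N) ~ a*N (device evaluations) + b*N^p, with p > 1 (linear solve)

For small N the a*N term dominates, but as N grows the superlinear term eventually takes over, no matter how small b is.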
Also, there are many different methods for solving the linear system. Broadly speaking, there are two types of linear solver: direct and iterative. Direct solvers are more robust, and also more "idiot proof", and for smaller problems they are much faster. Iterative solvers are slower and much more temperamental, sensitive to solver options, etc. However, iterative solvers have the benefit of being easier to parallelize, and they can scale much better than direct solvers. So, for small problems, direct solvers will nearly always be a better choice than iterative ones, even if the direct solver is serial and the iterative solver is parallel.
In Xyce, the parallelism of the device evaluation is completely separate from the parallelism of the linear solver. There is a communication layer between the two operations. If you are running a parallel build of Xyce, the device evaluation is always done in parallel. It doesn't require much communication, and the distribution strategy can be simple and still scale. However, the linear solve can be (and often is) done in serial. So, for smaller problems, it is usually best to use what we call the "parallel load, serial solve" strategy, where the linear solve is performed on proc 0 using KLU (a serial direct solver).
Using iterative solvers in parallel really only becomes a "win" once the problem size is fairly large, and the problem sizes you are running (20-40K unknowns) are not large enough for the iterative solvers to be a better choice. You'll want to run with the direct solver. If you aren't already, you should be setting the linear solver to KLU using ".options linsol type=klu".
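For example, near the top of your netlist:

    .options linsol type=klu

and then, if you are using a parallel build, a run along the lines of

    mpirun -np 4 Xyce mycircuit.cir

("mycircuit.cir" is just a placeholder name) should give you the parallel device load combined with the serial KLU solve. Check the Xyce Users' Guide for the exact MPI launcher invocation on your system.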
One drawback of this approach is that it limits your parallel speedup, per Amdahl's law (https://en.wikipedia.org/wiki/Amdahl%27s_law): if you only parallelize half of the work, there is a hard limit on how much faster you can possibly get.
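To put a number on it, Amdahl's law says the best possible speedup on N processors is

    speedup(N) = 1 / ((1 - p) + p/N)

where p is the fraction of the work that is parallelized. With p = 1/2 (only the device loads in parallel), speedup(N) is always less than 2, no matter how many processors you throw at it.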
All the circuits in the paper you referenced are much larger; most of them have over 1M unknowns. 1M is large enough that a direct solver chokes and dies, so a parallel iterative solver is the only feasible choice.
Regarding AWS, one of my colleagues told me that latency can sometimes be an issue there. It is also very important to use a contiguous block of processors (i.e., don't have them spread across lots of separate servers, or anything like that). Xyce could do a better job of minimizing its communication volume; the current communication pattern can make latency a more significant issue than it needs to be. We're working on improving that.
One other comment I'll make is that we have a parallel direct solver called BASKER under development. It has not been included in our formal code releases yet, but it is possible to build the development branch of Xyce to use it. Like KLU, it is a direct solver, but because it is threaded it is usually a little faster than KLU.
thanks,
Eric