Xyce on AWS questions


Eddy Wu

Apr 5, 2023, 12:54:57 PM
to xyce-users
Hi,
We've gotten Xyce running in parallel on AWS clusters using EC2 and EFA (Elastic Fabric Adapter). In our initial tests, we don't see an appreciable speedup for circuits with 20-40K unknowns once we scale the compute cluster beyond 16 cores. As we go to 64 or even 128 cores (2 cores per node), we do not see the performance improvements shown in the paper linked below. In all the circuits we ran experiments on, the performance increase per node tapers off around 16-32 nodes.

Question 1: Are there guidelines for setting up and scaling parallel simulations? We've looked at reports such as https://www.osti.gov/servlets/purl/1595909, which led us to expect roughly a square-root-of-N performance increase, but we cannot achieve this.

Question 2: Has anyone else gotten Xyce up and running on AWS?

We are working with an AWS team to verify that we have set up the clusters correctly, and in particular whether EFA is configured properly. There may be an issue in the interaction between MPI and EFA, although others run MPI workloads over EFA as well.
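
For anyone trying to reproduce this, one basic sanity check for EFA (independent of Xyce; the exact tooling may differ with the libfabric version in the AMI) is to ask libfabric whether the EFA provider is visible on each node:

    fi_info -p efa

If that prints nothing on a node, MPI cannot be using EFA there and is falling back to another transport such as TCP.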

Once this is sorted out, we expect to share the AMI (Amazon Machine Image) and scripts we developed with the Xyce community. As things stand, there is no cost advantage to using AWS beyond 16 cores, since costs keep increasing linearly while performance does not. We had hoped to light up hundreds of cores.

xyce-users

Apr 5, 2023, 1:33:21 PM
to xyce-users

The main documented guidance for running in parallel is in chapter 10 of the user guide, which you can find here:  https://xyce.sandia.gov/files/xyce/Xyce_Users_Guide_7.6.pdf
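
The very short version of that chapter, in case it helps before then: a parallel run is just the parallel Xyce binary launched through mpirun, along the lines of

    mpirun -np 16 Xyce circuit.cir

with the binary path and process count adjusted for your installation.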

To answer your question about AWS, the quick answer is that yes, other people have done this.  We (at Sandia) have not, however, as we have a lot of large machines here.

I'm in the middle of a meeting at the moment but I'll provide a more detailed answer later today.

thanks,
Eric

Cristiano Calligaro

Apr 6, 2023, 4:55:13 AM
to xyce-users
Hi Eddy,
I am running parallel transient simulations on SRAMs (single and dual port) on a single Linux server, using 1, 2, 4, 8, and 16 cores. Take a look at the attached graphs: in my case, 8 and 16 cores perform about the same. Perhaps with 100k or 200k transistors I would see a difference between 8 and 16 cores.
Hope this helps.
Ciao
Cristiano
Xyce_parallel_performance.png

xyce-users

Apr 6, 2023, 12:48:10 PM
to xyce-users

Hello Eddy,

Here is a more detailed answer. First, a little general background.

In a transient circuit simulation, there are two sources of computational expense.  One is the device evaluations, and the other is the linear solve.  The device evaluations produce all the values that are put into the linear system prior to that solve.

Device evaluations tend to dominate the runtime cost in small circuits.  As the problem size gets bigger, however, the two expenses scale differently.  Device evaluations scale linearly, while the linear solve cost scales superlinearly.  So, eventually, you will reach problem sizes where the linear solve is more expensive than device evaluations.

Also, there are many different methods for solving the linear system. Broadly speaking, there are two types of linear solver: direct and iterative. Direct solvers are more robust, more "idiot proof", and much faster for smaller problems. Iterative solvers are slower and much more temperamental, sensitive to solver options, etc. However, iterative solvers have the benefit of being easier to parallelize, and they can scale much better than direct solvers. So, for small problems, direct solvers will nearly always be a better choice than iterative ones, even if the direct solver is serial and the iterative solver is parallel.

In Xyce, the parallelism of the device evaluation is completely separate from the parallelism of the linear solver. There is a communication layer in between the two operations. If you are running a parallel build of Xyce, the device evaluation is always done in parallel. It doesn't require much communication, and the distribution strategy can be simple and still scale. However, the linear solve can be (and often is) done in serial. So, for smaller problems, it is usually best to use what we call the "parallel load, serial solve" strategy, where the linear solve is performed on proc 0 using KLU (a serial direct solver).

Using iterative solvers in parallel really only becomes a "win" if the problem size is pretty big. The problems you are running (20-40K unknowns) are not large enough for the iterative solvers to be a better choice, so you'll want to run with the direct solver. If you aren't already, you should set the linear solver to KLU using ".options linsol type=klu".
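
Concretely, the "parallel load, serial solve" setup is just that netlist option plus an ordinary MPI launch, something like:

    * in the netlist:
    .options linsol type=klu

    mpirun -np 16 Xyce circuit.cir

The device loads are then spread across the 16 MPI processes, while the KLU solve runs on proc 0.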

One drawback of this approach is that it limits your parallel speedup, per Amdahl's law (https://en.wikipedia.org/wiki/Amdahl%27s_law): if you only parallelize half of the problem, there is a hard limit on how much faster you can possibly get.
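
As a rough back-of-the-envelope illustration: Amdahl's law gives a speedup of S(N) = 1 / ((1 - p) + p/N) on N processors, where p is the fraction of the runtime that parallelizes. If the serial solve were half of the runtime (p = 0.5), 16 processes would buy you about 1 / (0.5 + 0.5/16) ≈ 1.9x, and even infinitely many processes top out at 2x. The actual fraction depends on the circuit, but that is the flavor of the limit.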

The circuits in the paper you referenced are all much larger; most of them have over 1M unknowns. At that size a direct solver chokes and dies, so a parallel iterative solver is the only feasible choice.

Regarding AWS, one of my colleagues told me that latency can sometimes be an issue there. It is also very important to run on a contiguous block of processors (i.e., don't have them scattered across lots of separate servers, or anything like that). Xyce could do a better job of minimizing its communication volume, which makes latency a more significant issue than it needs to be; we're working on improving that.
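
I'm not an AWS expert, but my understanding is that the usual way to get a contiguous block on EC2 is a "cluster" placement group, created with something like

    aws ec2 create-placement-group --group-name xyce-run --strategy cluster

(the group name is just an example) and then launching all the instances into that group.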

One other comment I'll make is that we have a parallel direct solver called BASKER under development. It has not been included in our formal code releases yet, but it is possible to build the development branch of Xyce to use it. Like KLU, BASKER is a direct solver, but since it is threaded it is usually a little faster.

thanks,
Eric

Kevin Cameron

Apr 10, 2023, 12:03:48 PM
to Eddy Wu, xyce-users
Parallel processing with simulation is tricky: it depends a lot on processor-to-processor latency, and MPI isn't particularly good for low-latency communication.

I'm looking at building a low-latency machine for some AI applications; let me know if you are interested in running Xyce on it.

Kev.
