Joshua:
Thank you for your interest in Xyce and your example circuit. Your
question doesn't really have a quick answer, as the problem you
present has many layers.
The first thing to note is that the time it takes to run your linear
circuit under AC analysis will be dominated by the linear solver. As
such, there is not very much you can do to speed up this relatively
small problem by throwing more processors at it.
The circuit you posted has only around 14,400 unknowns, which is not
particularly large. For a problem this size (per section 10.3.1 of
the Xyce 6.12 Users Guide), the recommended parallelization choice is
"parallel load, serial solve", in which the device equation loads are
spread across processors but the actual linear solve is done by a
serial, direct linear solver. For nonlinear problems in transient or
DC analysis this can often lead to a substantial reduction in runtime
for circuits with between 1,000 and 100,000 unknowns. But in AC analysis
with only linear devices present, the device loads are only done once
(because the matrix doesn't change), and so the device load phase is
an insignificant part of the run time and will not benefit much from
parallelization. The time-consuming part, the linear solve, is still
getting done in serial.
However, when run in parallel, Xyce only defaults to "parallel load,
serial solve" for problems from 1 up to 9999 unknowns. When given a
problem as large as yours (14,404) in parallel, Xyce will default to
using its iterative linear solvers, which generally require a good
preconditioner to perform well (and choosing the right preconditioner
is not trivial -- see section 10.4.6 of the Users Guide for some
documentation).
So not only are you not benefiting from the parallelization of device
loads, you are also paying for the poor performance of iterative
solvers without a good preconditioner. There are options to force a
problem of this size to use "parallel load, serial solve", but as
noted above, even that isn't going to help you here, because your
device loads are not the bottleneck; the linear solve is.
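For reference (and in case it is useful for some other, nonlinear
problem of similar size), forcing that configuration is done with a
linear solver options line in the netlist, which should look
something like this:

  * force the serial KLU direct solver even in a parallel build, so
  * the device loads are distributed but the solve stays serial
  .options linsol type=klu
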
For a circuit this size, parallelism probably just slows it down, and
so you are better off using a serial build. The question then becomes
why this circuit takes so long even in serial, when other simulators
seem to have no trouble with it.
Xyce's default serial direct solver is KLU, which is often a good
choice. However, for your problem it turns out to be the slower of
our direct solvers in the AC analysis phase. Therefore, for this
specific netlist we find that using KLU for the DC operating point and
Kundert's sparse solver ("ksparse") for AC analysis is a better
choice. This can be selected by adding the line ".options linsol-ac
type=ksparse" to your netlist. Ksparse is the linear solver used in
most SPICE-derived tools, and you will find that for this circuit a
serial run of Xyce with Ksparse for the AC analysis will perform
roughly on par with what you see in LTSpice.
Unfortunately, while every analysis type has a way of specifying which
solver to use in that phase, when "linsol-ac" was added to the code
the developer who implemented it forgot to add it to the Reference
Guide. We will be fixing that in the next release. The syntax of
linsol-ac is the same as for all the other "linsol-" options that are
documented.
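To make that concrete, the relevant portion of the netlist would look
something like the following (the .ac sweep values and the node name
in the .print line are placeholders, not taken from your circuit):

  * Kundert's sparse solver for the AC analysis solves; the DC
  * operating point still uses the default serial KLU solver
  .options linsol-ac type=ksparse
  .ac dec 10 1 100meg
  .print ac v(out)
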
But the issues with your circuit don't end there. Your circuit has
14,404 unknowns, but nearly 9,000 of them are voltage nodes with no
DC path to ground. Generally speaking, voltage nodes in SPICE
netlists should always have a DC path to ground, and not satisfying
this condition can lead to ill-behaved simulations. Most simulators,
Xyce included, have some way of avoiding the worst consequences of
this type of connectivity error, but those workarounds don't always
produce ideal behavior. The best fix is to construct a circuit in
which every node can trace a path to ground at DC (where all
capacitors are treated as open circuits).
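As a minimal illustration (the node names and component values here
are made up, not taken from your netlist), a node that connects only
to capacitor plates has no DC path to ground, and a large resistor
restores one:

  * node 2 sits between two series capacitors and has no DC path to
  * ground, because capacitors are open circuits at DC
  C1 1 2 1u
  C2 2 0 1u
  * a 1 GOhm resistor from node 2 to ground restores a DC path while
  * leaving the AC behavior essentially unchanged
  R2 2 0 1G
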
Xyce has a preprocessing option that can emit a new netlist with
resistors added to all such nodes to force them to have a DC path to
ground. This is done by adding ".preprocess addresistors nodcpath
1G", which will cause Xyce to emit a new netlist with 1 GOhm resistors
connected directly between each problematic node and ground. Using
only this option (and running the netlist it emits, which gets a
"_xyce.cir" suffix) while keeping KLU as the linear solver gives a
modest improvement in run time over the original (though nowhere near
as large as making both this change and using Ksparse for the AC
analysis).
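Spelled out as a workflow: add the following to the original netlist,
run Xyce on it once, and then run the emitted copy (which will have a
"_xyce.cir" suffix on its name) for the actual AC analysis:

  * emit a repaired copy of this netlist, with 1 GOhm resistors added
  * to every node that lacks a DC path to ground
  .preprocess addresistors nodcpath 1G
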
None of this answers your question about what Xyce could do on a still
larger version of your circuit where serial direct solvers are no
longer appropriate. This is a harder question, because parallel,
iterative solvers usually require an appropriate preconditioner to
perform well, and there may be some experimentation required to find
one that helps the AC analysis linear solves. Simply distributing the
work of device loads is unlikely to help you here, either, because
that work is negligible for a linear circuit and your simulation will
still be dominated by the linear solver.
As for your observation that Xyce "uses all threads fully all the
time," we are not sure what you mean. Xyce is not actually a
multithreaded application, although our Linux and OS X binaries do
link against the threaded Intel Math Kernel Library (MKL), which can
cause those binaries to use more than one thread, even for a serial
build of Xyce. We have also found that on rare occasions this library
can "run away" and cause Xyce to use too many threads and bog down the
machine. If this is happening, you can restrict MKL's maximum thread
use by setting the environment variable "MKL_NUM_THREADS" to something
small (1 or 2). But this will only happen with the Linux and OS X
binaries we provide, which link against that threaded library. It will
not happen in our Windows binary (which is linked only against a
non-threaded MKL) or a binary you create yourself from our source
code.
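If you do need to rein MKL in, it looks something like this on Linux
with a bash-style shell (assuming the Xyce binary is on your PATH;
the netlist name is a placeholder):

  export MKL_NUM_THREADS=2
  Xyce mycircuit.cir
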
If you are referring to a parallel build of Xyce apparently using many
cores of your machine fully, this is not technically multithreading,
as multiple Xyce processes are running separately and communicating
with each other through the MPI library (as opposed to threads, which
are less independent). In a case like the netlist you posted, it
would not be surprising for all processes to be spinning their wheels
but not really speeding anything up over a serial run, especially
because the iterative solvers are being used and are likely not
performing well.
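For reference, that kind of parallel run is launched through MPI,
with each MPI rank being a separate Xyce process; the launch command
looks something like this (the process count and netlist name are
placeholders):

  mpirun -np 4 Xyce mycircuit.cir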