General guidelines for long transient simulations


Eddy Wu

Jan 5, 2024, 8:31:37 PM
to xyce-users
Hello,
I have a CMOS circuit with a clock period of 25 ns, and I need to simulate up to 32 ms.
The circuit has ~17K unknowns. I've roughly followed the optimizations in the Xyce reference manual and in this forum:
- no floating nodes
- reduced timeint tolerances
- linsol=klu
- integration method = trap

The majority of the simulation time is in the transient analysis. I am using an Intel Xeon E7-8880 V4, and 32 slots in Xyce parallel (tried from 2-32 slots and serial build). 

In the best case, I am only able to achieve 12 us of simulated transient time per hour of wall-clock time.
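To put that rate in perspective, here is a minimal Python sketch that extrapolates the wall-clock time for the full run. It assumes the rate stays constant over the whole simulation; the variable and function names are illustrative only, and the numbers are the ones quoted in this post.

```python
# Extrapolate total wall-clock time from the throughput quoted in the post.
T_STOP = 32e-3          # required simulated time, in seconds
CLOCK_PERIOD = 25e-9    # clock period, in seconds
SIM_PER_HOUR = 12e-6    # simulated seconds achieved per wall-clock hour


def wall_clock_hours(t_stop, sim_per_hour):
    """Hours of wall-clock time needed, assuming a constant simulation rate."""
    return t_stop / sim_per_hour


hours = wall_clock_hours(T_STOP, SIM_PER_HOUR)
cycles_per_hour = SIM_PER_HOUR / CLOCK_PERIOD

print(f"{hours:.0f} hours (about {hours / 24:.0f} days)")
print(f"{cycles_per_hour:.0f} clock cycles simulated per hour")
```

At a constant rate this works out to roughly 2,700 hours, a bit over 110 days, which is why the remaining replies focus on cost per step and on the number of steps per clock cycle.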

Are there any other options I can try to improve simulation performance? 

Thanks,
Eddy

Marcel Hendrix

Jan 7, 2024, 7:04:17 AM
to xyce-users
With ~17k nodes/eqs, the time needed to output to disk could be very high, unless there is a .SAVE statement.

The other thing is that with your timing parameters, assuming 10 intermediary steps and 3 internal nodes
per unknown, no .SAVE, I calculate 52e12 bytes of needed diskspace (52 TB). Doing something useful
with that data will be enormously time consuming.

I would personally try to segment the simulation both in time and space: test functional blocks separately, 
then replace as many as possible of them by behavioral / vastly simplified models. Given that a full 
simulation takes 125 days, and you will need multiple revisions, that development 'overhead' might be
a good idea anyway.

-marcel

xyce-users

Jan 8, 2024, 6:30:08 PM
to xyce-users

Hello Eddy,

I can't tell from your post if you have tried the "parallel load, serial solve" option with Xyce.   If you haven't, I recommend you give that a shot.  In any circuit simulation the device evaluation phase and the linear solve phase are the 2 most computationally demanding aspects of the simulation.     The linear solve is difficult to scale well in parallel, so it often is best to keep that aspect of the calculation serial (as you are doing with klu).   However, the device evaluation phase doesn't involve very much communication, so it scales almost trivially in parallel.    So, like I said, if you haven't tried this, give it a shot.

Also, to follow up on Marcel's comment.  It is true that Xyce output can be time consuming, especially if you have a really large number of time steps.   Unlike most circuit simulators, Xyce by default outputs the .PRINT line signals at every single time step.    For really long transients, this can be expensive, and also can lead to really large output files.  

There is a way around this however.  In Xyce you can use the command ".options output initial_interval=##", where "##" is whatever output step you want to use.  Xyce will then just output interpolated values at times corresponding to that interval.  If you put multiple "options output" commands in the file, you can have it use different intervals at different phases of the calculation.   Even if this doesn't speed up your calculation very much, it will give  you more manageable output files.
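As a rough illustration of what fixed-interval output means, here is a minimal Python sketch that linearly interpolates non-uniform solver time points onto a regular grid. The function name `resample` and the sample data are hypothetical, and the interpolation Xyce performs internally may differ; the sketch only shows why the output shrinks, and why a sharp edge that falls between grid points can get smoothed.

```python
from bisect import bisect_right


def resample(times, values, interval):
    """Linearly interpolate (times, values) onto a grid spaced by `interval`."""
    if len(times) != len(values) or len(times) < 2:
        raise ValueError("need at least two matching samples")
    out = []
    n = 0
    t = times[0]
    while t <= times[-1]:
        # Index of the solver step whose span [times[i], times[i+1]] contains t.
        i = min(max(bisect_right(times, t) - 1, 0), len(times) - 2)
        frac = (t - times[i]) / (times[i + 1] - times[i])
        out.append((t, values[i] + frac * (values[i + 1] - values[i])))
        n += 1
        t = times[0] + n * interval  # avoid accumulating rounding error
    return out


# Non-uniform "solver" steps: dense around an edge, sparse elsewhere.
solver_t = [0.0, 1.0, 4.0, 10.0]
solver_v = [0.0, 1.0, 4.0, 10.0]
for t, v in resample(solver_t, solver_v, 2.5):
    print(f"t={t:5.2f}  v={v:5.2f}")
```

The output interval is independent of the solver's own step size, so choosing it well below the maximum transient step produces more points than the solver actually computed.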

thanks,
Eric

weiji

Jan 8, 2024, 10:29:43 PM
to xyce-users

If you compile the software yourself, you can try using the MUMPS solver to solve linear equations in parallel. From our experience, MUMPS has pretty good solving efficiency when running on a single computer node. However, by default, Xyce does not enable MUMPS support. First, you need to have a Trilinos library that supports MUMPS, then recompile Xyce and manually enable MUMPS support. This can be a challenging job. Good luck to you!

Eddy Wu

Jan 9, 2024, 3:10:50 PM
to weiji, xyce-users
hi all,

Regarding Marcel's comments, I am only saving a few nodes so I am not going to be memory limited. I'd ideally like to run my entire block in simulation at least once without reductions. 

Eric, I am doing parallel load, serial solve. 99% of the time is spent in the "Beginning Transient Calculation ..." phase.
- I believe I am doing parallel load, serial solve. I am doing an mpirun of xyce with .options linsol type=klu set. 
- I've tried the .options output initial_interval; however, that will either truncate my signal, or generate more data than leaving it unset. I lose some accuracy on the clocked edges, but I can probably manage. I would typically set this option to <0.5 max transient step. I'm managing file size by only printing out a few nodes. 
- QUESTION 1) when you say you can have multiple "options output", can those intervals overlap or is the behavior as described on page 157 of the user guide? 
- QUESTION 2) is there an options output where you can set a high number and the simulator decides whether it should save value or toss value based on whether the value has significantly changed or not. For signals that only have a little bit of activity over a large transient, that would be helpful. 

Weiji,
I am interested in trying out the MUMPS solver. Do you have any comparisons between MUMPS and other solvers. How many unknowns was your problem size? 



Mehmet Cirit

Jan 9, 2024, 5:34:13 PM
to Eddy Wu, weiji, xyce-users
One more idea. I suppose this is a digital design built out of std cells. If it is schematic based, that is, without the full stack of instance parameters, I have seen long simulation times as well. If that is what you are using, I suggest that you use
extracted netlists but no parasitic resistors in them. That will not change the size of the simulation, but if the defaults of instance parameters are causing a problem, this should get around it. It is counterintuitive, but it happens.



--

Dr. Mehmet A. Cirit                    Phone:  (408) 647-6025
Library Technologies, Inc.        Cell:       (408) 647-6025
19959 Lanark Lane                   http://www.libtech.com
Saratoga, CA 95070                 Email: m...@libtech.com

xyce-users

Jan 10, 2024, 3:35:46 PM
to xyce-users
Hello Eddy,

Thanks for the information.   It does sound like you are doing the "parallel load serial solve" option with type=klu.

Regarding the "multiple options" for output intervals, they are not supposed to overlap.  Basically you can set different output intervals for different windows of time.  So, if you know that certain windows don't have much happening, you could set a coarse output and for others which have more activity set a finer output.  And, yes, I'm just referring to the options described on page 157 of the guide.    I misspoke a bit about setting up multiple "options output" statements, since you can set multiple intervals on a single line.

BTW, speaking of max time steps, you can also set different max time steps at different time windows as well, using the "schedule" command on the tran line.  It looks something like this:   .tran 0 2.0e-3  {schedule( 0.5e-3, 0, 1.0e-3, 1.0e-6, 1.5e-3, 1.0e-4, 2.0e-3, 0 )}
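To make the windowing idea concrete, here is a small Python sketch of how such a schedule could be read. It rests on assumptions that are not stated in this post and should be checked against the Xyce Reference Guide: the arguments are taken as (time, max step) pairs, each max step is assumed to apply from its time until the next listed time, and a value of 0 is assumed to mean "no limit". The function name `max_step_at` is made up for the illustration.

```python
def max_step_at(t, schedule):
    """Return the max-step limit active at time t, or None for 'no limit'.

    `schedule` is the flat argument list (t0, dt0, t1, dt1, ...).
    Assumed semantics: dt_i applies for t_i <= t < t_(i+1); 0 means no limit;
    before t0 no limit applies.
    """
    if len(schedule) % 2:
        raise ValueError("schedule needs (time, step) pairs")
    limit = None
    for start, dt in zip(schedule[0::2], schedule[1::2]):
        if t >= start:
            limit = dt if dt > 0 else None
        else:
            break
    return limit


# The argument list from the .tran line quoted above.
SCHEDULE = (0.5e-3, 0, 1.0e-3, 1.0e-6, 1.5e-3, 1.0e-4, 2.0e-3, 0)

for t in (0.2e-3, 0.7e-3, 1.2e-3, 1.7e-3):
    print(f"t={t:.1e}  max step={max_step_at(t, SCHEDULE)}")
```

Under these assumptions, only the window from 1.0 ms to 1.5 ms is forced to small steps, and the solver is free to take larger steps elsewhere.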


Regarding solvers like MUMPS, I would be curious to see some timing comparisons as well.  I've been aware of MUMPS for a long time, but the last time we tried it (a long time ago) it didn't perform that well.  But possibly it has improved.

More broadly, there are more solvers available (including MUMPS, but also others) if you are willing to build Trilinos and Xyce yourself.   One example is the Pardiso solver (a threaded direct solver), which has worked well for us on some very large circuits (much larger than 17k).  Another possibility is the BASKER solver which is a Sandia-developed threaded direct solver.  I generally wouldn't expect any of these solvers to perform better than KLU for a problem size this small.  By parallel computing standards, 17k is not very large.  We usually only bother with parallel solvers (of any variety) when the problem size is in the 100s of thousands of unknowns or more.


If the main problem with your calculation is the DCOP phase (and it sounds like it might be), then you could save (via .SAVE) the DCOP result from one calculation, and then use it for subsequent calculations.


Another thing you could do to speed up the simulation would be to use a different step-size selection strategy.   I generally don't recommend using this unless you are desperate, but you can get much larger time steps by setting .OPTIONS TIMEINT ERROPTION=1 .  Setting this turns off the local truncation error based time step size control, and replaces it with a step size strategy based on the success/failure of the nonlinear solver.  You can read about this on page 77 of the guide.  Doing this (ERROPTION=1) will almost certainly result in a less accurate transient calculation.  But, the transient phase of a calculation will usually be much faster as the stepsizes will be larger.


thanks,
Eric

xyce-users

Jan 10, 2024, 3:53:49 PM
to xyce-users

Hi again Eddy,

I just realized I didn't answer your second question.  The quick answer is no, Xyce doesn't currently have an option like that.

thanks,
Eric

XYCE

Jan 10, 2024, 6:25:20 PM
to xyce-users

Hi Eddy,

It is hard to give suggestions without knowing some simulation details of your circuit. Is your circuit running slowly because Xyce is taking too many steps, because each step is very expensive, or because Xyce output is taking too much time? You can shorten your simulation and look at the timing statistics that Xyce prints at the end of the run. They tell you how many accepted/failed steps Xyce takes, and how much time Xyce spends on device evaluation, the linear solver, etc.

How many time steps are taken for each clock cycle in your simulation? If it is too many, look at the waveform to see if it indicates any stability issues (for example, artificial point-to-point ringing). If so, and if your circuit is not autonomous (e.g., an oscillator), you can try the more stable Gear time integrator by using:

.options timeint method=gear

Or, if you need an even more stable method (backward Euler, BE), use:

.options timeint maxord=1

These options would normally take care of stability-related performance issues. See Xyce user’s guide 7.3.4 for details.

In addition to Eric’s suggestion to turn off local truncation error (LTE) control, you can also try a different local truncation error control by setting newlte in the timeint options. Some newlte settings are more aggressive and are well suited for digital circuits. See Xyce user’s guide 7.3.5.

If Xyce takes a reasonable number of steps per clock cycle, see if any part, such as device evaluation or the linear solver, is taking a large part of the run time.

You mentioned that most of the time is spent in the beginning transient calculation phase. That suggests that Xyce is taking a lot of time steps at the beginning of the simulation (relative to the length of the simulation).

If you would like us to take a look at your circuit and simulation, please feel free to send the circuit for us to debug.  Please send it to xyce-de...@sandia.gov. Not all of us get emails from the Google group.

Thanks,
Ting

Eddy Wu

Jan 12, 2024, 2:39:27 PM
to XYCE, xyce-users
Hi all,
Thanks again for all the feedback.

@mehmet: with or without parasitic RC values does impact the simulation, but I didn't notice an appreciable difference. I also do not include any small RC values.
Question 1) what do you mean full stack of instance parameters? Are you referring to the transistor library files?

@Eric: Thanks for the description of transient schedule. I'm looking into trying other direct solvers in the trilinos package.
Question 2)  Are there any guidelines on adding solvers that are not in the example reconfigure file? (ex add -DTrilinos_ENABLE_<solver>=ON or -DAmesos_ENABLE_<solver>=ON)
- with regards to the DCOP, the majority of the time is spent in Transient stepping. A small portion of the time is spent in DCOP. Below is a summary section of a shorter simulation
- I tried the .options timeint erroption=1 method with mixed results. For this particular simulation I didn't notice an appreciable difference.

@ting:
- I think each step is too expensive. 
- Question 3) Would a CPU with a higher clock frequency improve the performance of a simulation like this (specifically, one with a low number of unknowns)?
- I've attached a snapshot of one clock cycle of the simulation. 
- I've found the options timeint method=gear gave slower output in general. To my untrained eye, it doesn't appear to have significant ringing per the snapshot below. 
- I've looked through section 7.3.5 and trying out .options newlte=2 made the simulation slightly slower. Note: I am also setting abstol reltol=1e-2 on the same line.
- Thank you for your offer for help. I'm looking into how I can best send you a file. I am looking to see if I can recreate this issue with a different library, one which we would both have access to. 

***** Problem read in and set up time: 1.23625 seconds
 ***** DCOP time: 17.8228 seconds.  Breakdown follows:
        Number Successful Steps Taken:          1
        Number Failed Steps Attempted:          0
        Number Jacobians Evaluated:             1568
        Number Linear Solves:                   1568
        Number Failed Linear Solves:            0
        Number Residual Evaluations:            1607
        Number Nonlinear Convergence Failures:  0
        Total Residual Load Time:               9.64767 seconds
        Total Jacobian Load Time:               2.62472 seconds
        Total Linear Solution Time:             4.42902 seconds

 ***** Transient Stepping time: 1171.27 seconds.  Breakdown follows:
        Number Successful Steps Taken:          14563
        Number Failed Steps Attempted:          478
        Number Jacobians Evaluated:             86929
        Number Linear Solves:                   86929
        Number Failed Linear Solves:            0
        Number Residual Evaluations:            101970
        Number Nonlinear Convergence Failures:  471
        Total Residual Load Time:               685.049 seconds
        Total Jacobian Load Time:               160.085 seconds
        Total Linear Solution Time:             255.508 seconds


***** Solution Summary *****
        Number Successful Steps Taken:          14564
        Number Failed Steps Attempted:          478
        Number Jacobians Evaluated:             88497
        Number Linear Solves:                   88497
        Number Failed Linear Solves:            0
        Number Residual Evaluations:            103577
        Number Nonlinear Convergence Failures:  471
        Total Residual Load Time:               694.697 seconds
        Total Jacobian Load Time:               162.709 seconds
        Total Linear Solution Time:             259.937 seconds


***** Total Simulation Solvers Run Time: 1189.09 seconds
***** Total Elapsed Run Time:            1190.33 seconds
*****
***** End of Xyce(TM) Simulation
*****

Timing summary of processor 0
                 Stats                   Count        CPU Time              Wall Time
---------------------------------------- ------ --------------------- ---------------------
Xyce                                          1    19:44.645 (100.0%)    19:50.599 (100.0%)
  Analysis                                    1    19:43.433 (99.90%)    19:49.090 (99.87%)
    Transient                                 1    19:43.432 (99.90%)    19:49.090 (99.87%)
      Nonlinear Solve                     15042    18:58.688 (96.12%)    19:02.271 (95.94%)
        Residual                         103577    11:32.958 (58.49%)    11:35.302 (58.40%)
        Jacobian                          88497     2:42.049 (13.68%)     2:42.937 (13.69%)
        Linear Solve                      88497     4:19.536 (21.91%)     4:20.128 (21.85%)
      Successful DCOP Steps                   1        0.041 (<0.01%)        0.191 ( 0.02%)
      Successful Step                     14563       39.852 ( 3.36%)       41.748 ( 3.51%)
      Failed Steps                          478        0.002 (<0.01%)        0.002 (<0.01%)
        Nonlinear Failure                   471        0.000 (<0.01%)        0.000 (<0.01%)
  Netlist Import                              1        0.926 ( 0.08%)        0.962 ( 0.08%)
    Parse Context                             1        0.162 ( 0.01%)        0.165 ( 0.01%)
    Distribute Devices                        1        0.706 ( 0.06%)        0.729 ( 0.06%)
    Verify Devices                            1        0.000 (<0.01%)        0.000 (<0.01%)
    Instantiate                               1        0.021 (<0.01%)        0.026 (<0.01%)
  Late Initialization                         1        0.254 ( 0.02%)        0.261 ( 0.02%)
    Global Indices                            1        0.034 (<0.01%)        0.034 (<0.01%)
  Setup Matrix Structure                      1        0.015 (<0.01%)        0.017 (<0.01%)


Timing summary of 8 processors
                                          CPU Time        Wall Time
                 Stats                      Max       Max (% of System)
---------------------------------------- ---------- ---------------------
Xyce                                        19:46.9    19:50.600 (12.50%)
  Analysis                                  19:45.7    19:49.090 (12.48%)
    Transient                               19:45.7    19:49.090 (12.48%)
      Nonlinear Solve                       18:59.2    19:02.302 (11.99%)
        Residual                            11:33.5    11:35.566 ( 7.30%)
        Jacobian                             2:42.0     2:42.937 ( 1.71%)
        Linear Solve                         4:21.3     4:21.994 ( 2.75%)
      Successful DCOP Steps                     0.1        0.191 (<0.01%)
      Successful Step                          39.8       41.748 ( 0.44%)
      Failed Steps                              0.0        0.002 (<0.01%)
        Nonlinear Failure                       0.0        0.000 (<0.01%)
  Netlist Import                                0.9        0.962 ( 0.01%)
    Parse Context                               0.1        0.165 (<0.01%)
    Distribute Devices                          0.7        0.729 (<0.01%)
    Verify Devices                              0.0        0.000 (<0.01%)
    Instantiate                                 0.0        0.026 (<0.01%)
  Late Initialization                           0.2        0.267 (<0.01%)
    Global Indices                              0.0        0.040 (<0.01%)
  Setup Matrix Structure                        0.0        0.018 (<0.01%)

ringing_num_steps.png

xyce-users

Jan 12, 2024, 3:06:32 PM
to xyce-users

Eddy,

To use other linear solvers I recommend compiling Xyce against a more recent version of Trilinos.  So, not 12.12.1.    In order to do that you'll need to build Xyce with CMake.  We're transitioning away from the autotools, and the configure-related files are not set up to handle versions of Trilinos more recent than 12.12.1.

There is a file, "INSTALL.md" in the top-level directory of the Xyce source tree that contains CMake instructions.

thanks,
Eric

xyce-users

Jan 12, 2024, 3:34:37 PM
to xyce-users
Eddy,

Thanks so much for including the timing summary of the simulation, that gives us the opportunity to see if there are other potential performance issues here. 

For instance, I'm noticing that the residual and Jacobian load times, combined, are more than 3x the linear solve time.  If I am following the thread, this is for a circuit with ~17k unknowns on 8 processors.  It feels like the residual and Jacobian load time might be improved with a better distribution of devices across processors.  This is covered in Section 11.6 of the Xyce Users Guide, where it discusses a couple of alternate approaches that have been useful for some circuits.  In particular, the device balanced approach can be enabled by adding '.options dist strategy=2' to the netlist, which evenly divides up all the device types over the number of processors.  This provides a load balance for devices which improves scaling for residual and Jacobian loads, at the cost of some additional communication.  It's less ideal if you are using iterative solvers, but it looks like direct solvers are better for this size circuit, so you might want to give it a try.
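The ratio can be checked directly from the "Transient Stepping" breakdown posted earlier in the thread. The Python sketch below only re-does that arithmetic; the dictionary keys are illustrative names, and the numbers are copied from the posted summary.

```python
# Figures copied from the "Transient Stepping time" breakdown in the thread.
transient = {
    "total_s": 1171.27,
    "residual_s": 685.049,
    "jacobian_s": 160.085,
    "linear_s": 255.508,
    "steps_ok": 14563,
    "steps_failed": 478,
    "jacobians": 86929,
}

load_s = transient["residual_s"] + transient["jacobian_s"]
load_vs_linear = load_s / transient["linear_s"]
load_fraction = load_s / transient["total_s"]
linear_fraction = transient["linear_s"] / transient["total_s"]
seconds_per_step = transient["total_s"] / transient["steps_ok"]
# Jacobian evaluations per attempted step approximate Newton iterations per step.
newton_per_step = transient["jacobians"] / (
    transient["steps_ok"] + transient["steps_failed"]
)

print(f"device load / linear solve : {load_vs_linear:.2f}x")
print(f"device load share          : {load_fraction:.1%}")
print(f"linear solve share         : {linear_fraction:.1%}")
print(f"wall time per accepted step: {seconds_per_step * 1e3:.1f} ms")
print(f"Newton iterations per step : {newton_per_step:.2f}")
```

Device evaluation dominating by this margin is what motivates trying a different device distribution before changing the linear solver.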

Cheers,
Heidi

Eddy Wu

Jan 12, 2024, 5:26:06 PM
to xyce-users
Hello Heidi,

Thanks for the quick response. I ran with .options dist strategy=2 (all other parameters the same) and found the transient stepping time to be about 5% better. The Xyce summary output is below:

***** Problem read in and set up time: 1.57195 seconds
 ***** DCOP time: 17.9227 seconds.  Breakdown follows:

        Number Successful Steps Taken:          1
        Number Failed Steps Attempted:          0
        Number Jacobians Evaluated:             1568
        Number Linear Solves:                   1568
        Number Failed Linear Solves:            0
        Number Residual Evaluations:            1607
        Number Nonlinear Convergence Failures:  0
        Total Residual Load Time:               9.62607 seconds
        Total Jacobian Load Time:               2.64314 seconds
        Total Linear Solution Time:             4.58896 seconds

 ***** Transient Stepping time: 1100.66 seconds.  Breakdown follows:

        Number Successful Steps Taken:          14563
        Number Failed Steps Attempted:          478
        Number Jacobians Evaluated:             86929
        Number Linear Solves:                   86929
        Number Failed Linear Solves:            0
        Number Residual Evaluations:            101970
        Number Nonlinear Convergence Failures:  471
        Total Residual Load Time:               643.525 seconds
        Total Jacobian Load Time:               149.585 seconds
        Total Linear Solution Time:             243.285 seconds



***** Solution Summary *****
        Number Successful Steps Taken:          14564
        Number Failed Steps Attempted:          478
        Number Jacobians Evaluated:             88497
        Number Linear Solves:                   88497
        Number Failed Linear Solves:            0
        Number Residual Evaluations:            103577
        Number Nonlinear Convergence Failures:  471
        Total Residual Load Time:               653.151 seconds
        Total Jacobian Load Time:               152.228 seconds
        Total Linear Solution Time:             247.874 seconds


***** Total Simulation Solvers Run Time: 1118.59 seconds
***** Total Elapsed Run Time:            1120.16 seconds

*****
***** End of Xyce(TM) Simulation
*****


Timing summary of processor 0
                 Stats                   Count        CPU Time              Wall Time
---------------------------------------- ------ --------------------- ---------------------
Xyce                                          1    18:35.496 (100.0%)    18:40.413 (100.0%)
  Analysis                                    1    18:33.934 (99.86%)    18:38.586 (99.84%)
    Transient                                 1    18:33.934 (99.86%)    18:38.586 (99.84%)
      Nonlinear Solve                     15042    17:52.286 (96.13%)    17:55.055 (95.95%)
        Residual                         103577    10:51.834 (58.43%)    10:53.686 (58.34%)
        Jacobian                          88497     2:31.782 (13.61%)     2:32.450 (13.61%)
        Linear Solve                      88497     4:07.512 (22.19%)     4:08.008 (22.14%)
      Successful DCOP Steps                   1        0.041 (<0.01%)        0.209 ( 0.02%)
      Successful Step                     14563       37.061 ( 3.32%)       38.754 ( 3.46%)

      Failed Steps                          478        0.002 (<0.01%)        0.002 (<0.01%)
        Nonlinear Failure                   471        0.000 (<0.01%)        0.000 (<0.01%)
  Netlist Import                              1        1.362 ( 0.12%)        1.389 ( 0.12%)
    Parse Context                             1        0.136 ( 0.01%)        0.137 ( 0.01%)
    Distribute Devices                        1        1.169 ( 0.10%)        1.192 ( 0.11%)

    Verify Devices                            1        0.000 (<0.01%)        0.000 (<0.01%)
    Instantiate                               1        0.015 (<0.01%)        0.017 (<0.01%)
  Late Initialization                         1        0.170 ( 0.02%)        0.175 ( 0.02%)
    Global Indices                            1        0.028 (<0.01%)        0.028 (<0.01%)
  Setup Matrix Structure                      1        0.012 (<0.01%)        0.012 (<0.01%)



Timing summary of 8 processors
                                                      CPU Time              CPU Time              CPU Time              Wall Time             Wall Time             Wall Time
                 Stats                   Count    Sum (% of System)     Min (% of System)     Max (% of System)     Sum (% of System)     Min (% of System)     Max (% of System)
---------------------------------------- ------ --------------------- --------------------- --------------------- --------------------- --------------------- ---------------------
Xyce                                          8  2:28:54.731 (100.0%)    18:35.497 (12.48%)    18:37.234 (12.50%)  2:29:23.275 (100.0%)    18:40.406 (12.50%)    18:40.414 (12.50%)
  Analysis                                    8  2:28:42.121 (99.86%)    18:33.934 (12.47%)    18:35.652 (12.49%)  2:29:08.688 (99.84%)    18:38.586 (12.48%)    18:38.586 (12.48%)
    Transient                                 8  2:28:42.119 (99.86%)    18:33.934 (12.47%)    18:35.652 (12.49%)  2:29:08.687 (99.84%)    18:38.586 (12.48%)    18:38.586 (12.48%)
      Nonlinear Solve                    120336  2:22:57.820 (96.01%)    17:51.687 (11.99%)    17:52.492 (12.00%)  2:23:20.230 (95.95%)    17:54.867 (11.99%)    17:55.099 (11.99%)
        Residual                         828616  1:26:54.963 (58.37%)    10:51.528 ( 7.29%)    10:52.033 ( 7.30%)  1:27:09.629 (58.35%)    10:53.656 ( 7.29%)    10:53.753 ( 7.29%)
        Jacobian                         707976    20:07.461 (13.51%)     2:29.162 ( 1.67%)     2:31.957 ( 1.70%)    20:13.145 (13.53%)     2:29.850 ( 1.67%)     2:32.793 ( 1.70%)
        Linear Solve                     707976    33:06.263 (22.23%)     4:07.257 ( 2.77%)     4:09.939 ( 2.80%)    33:10.374 (22.21%)     4:07.700 ( 2.76%)     4:10.634 ( 2.80%)
      Successful DCOP Steps                   8        1.501 ( 0.02%)        0.041 (<0.01%)        0.209 (<0.01%)        1.673 ( 0.02%)        0.209 (<0.01%)        0.209 (<0.01%)
      Successful Step                    116504     4:44.171 ( 3.18%)       34.690 ( 0.39%)       37.061 ( 0.41%)     4:48.028 ( 3.21%)       34.972 ( 0.39%)       38.754 ( 0.43%)
      Failed Steps                         3824        0.018 (<0.01%)        0.002 (<0.01%)        0.002 (<0.01%)        0.017 (<0.01%)        0.002 (<0.01%)        0.002 (<0.01%)
        Nonlinear Failure                  3768        0.002 (<0.01%)        0.000 (<0.01%)        0.000 (<0.01%)        0.002 (<0.01%)        0.000 (<0.01%)        0.000 (<0.01%)
  Netlist Import                              8       11.024 ( 0.12%)        1.362 ( 0.02%)        1.384 ( 0.02%)       11.112 ( 0.12%)        1.389 ( 0.02%)        1.389 ( 0.02%)
    Parse Context                             8        1.091 ( 0.01%)        0.135 (<0.01%)        0.137 (<0.01%)        1.093 ( 0.01%)        0.136 (<0.01%)        0.137 (<0.01%)
    Distribute Devices                        8        9.499 ( 0.11%)        1.169 ( 0.01%)        1.191 ( 0.01%)        9.538 ( 0.11%)        1.192 ( 0.01%)        1.192 ( 0.01%)
    Verify Devices                            8        0.000 (<0.01%)        0.000 (<0.01%)        0.000 (<0.01%)        0.000 (<0.01%)        0.000 (<0.01%)        0.000 (<0.01%)
    Instantiate                               8        0.116 (<0.01%)        0.012 (<0.01%)        0.018 (<0.01%)        0.141 (<0.01%)        0.017 (<0.01%)        0.019 (<0.01%)
  Late Initialization                         8        1.390 ( 0.02%)        0.170 (<0.01%)        0.176 (<0.01%)        1.413 ( 0.02%)        0.175 (<0.01%)        0.177 (<0.01%)
    Global Indices                            8        0.235 (<0.01%)        0.028 (<0.01%)        0.030 (<0.01%)        0.236 (<0.01%)        0.028 (<0.01%)        0.030 (<0.01%)
  Setup Matrix Structure                      8        0.096 (<0.01%)        0.012 (<0.01%)        0.012 (<0.01%)        0.097 (<0.01%)        0.012 (<0.01%)        0.012 (<0.01%)

XYCE

Jan 15, 2024, 2:02:03 PM
to Eddy Wu, XYCE, xyce-users

Hi Eddy,

Thanks for providing the timing statistics and snapshot of some waveforms. It gives more information about the simulation.

The 3rd picture shows typical point-to-point ringing during the sharp transition (red waveform). This indicates a stability issue. If Gear does not fix the problem, try the more stable BE method (maxord=1). It should quickly damp out the overshoot. How many time steps does Xyce take on average per clock cycle in your simulation?

BTW, loosening the tolerance may not always be a good way to speed up a simulation. It can work in some cases, but it can also make the artificial point-to-point ringing worse for the Trap method. The ringing is due to the marginal stability of the Trap method: small numerical errors can cause ringing, and a loose tolerance can make it worse. In general, we should be careful when loosening reltol in a simulation.
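The marginal stability mentioned above can be seen on the standard scalar test equation y' = -lambda*y. The Python sketch below is a textbook illustration, not a Xyce run: it compares the per-step gain of the trapezoidal rule with that of backward Euler when the step h is much larger than the time constant 1/lambda. The value of `H_LAMBDA` is an arbitrary choice for the illustration.

```python
def trap_factor(h_lambda):
    """Per-step gain of the trapezoidal rule on y' = -lambda*y."""
    return (1 - h_lambda / 2) / (1 + h_lambda / 2)


def be_factor(h_lambda):
    """Per-step gain of backward Euler on y' = -lambda*y."""
    return 1 / (1 + h_lambda)


def march(factor, steps, y0=1.0):
    """Apply a constant per-step gain `steps` times, starting from y0."""
    ys = [y0]
    for _ in range(steps):
        ys.append(ys[-1] * factor)
    return ys


H_LAMBDA = 50.0  # step size times decay rate: a step far larger than 1/lambda

trap = march(trap_factor(H_LAMBDA), 6)
be = march(be_factor(H_LAMBDA), 6)

print("trap:", [f"{y:+.3f}" for y in trap])
print("BE  :", [f"{y:+.3e}" for y in be])
```

The trapezoidal gain approaches -1 as h*lambda grows, so an error introduced at a sharp edge flips sign on every step and decays slowly, which shows up as point-to-point ringing. The backward Euler gain approaches 0, so the same error is damped almost immediately, at the cost of lower accuracy.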

 

From the timing statistics, you can see that the majority of the time is spent on device evaluation (residual and Jacobian), which takes more than 70% of the simulation time. The linear solver only takes about 22%. What device models are you using? You can get this information from the Device Count Summary at the beginning of the simulation. We have previously seen efficiency problems in device evaluation for certain devices, for different reasons, and have fixed some of them. Please let us know and we will see if there is any efficiency problem in device evaluation. If you cannot send us the netlist, could you also provide all the options lines you used in the simulation?

Also, with a fast clock cycle and a long simulation length, the simulation has about 1.28 million cycles. From the current timing statistics, we can estimate how much time each step takes. Even with a reasonable number of time steps per cycle, the simulation is going to take many hours. Why do you need such a long simulation length? Does the circuit have some slow behavior that is orders of magnitude slower than the clock speed? If so, a plain transient simulation is not well suited for this case. Advanced simulation methods, such as the envelope methods normally available in RF simulators (e.g., Spectre RF), would be much faster in this case.
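The estimate described above can be sketched in a few lines of Python. The per-step cost is taken from the transient summary posted earlier in the thread (1171.27 s for 14563 accepted steps); the steps-per-cycle values passed in are purely illustrative inputs, and the function name is made up for this example. It also assumes the per-step cost stays constant over the whole run.

```python
T_STOP = 32e-3            # simulated time requested, in seconds
CLOCK_PERIOD = 25e-9      # clock period, in seconds
SEC_PER_STEP = 1171.27 / 14563  # wall seconds per accepted step (posted summary)

cycles = T_STOP / CLOCK_PERIOD


def estimate_hours(steps_per_cycle):
    """Wall-clock hours for the full run at a constant per-step cost."""
    return cycles * steps_per_cycle * SEC_PER_STEP / 3600


print(f"clock cycles: {cycles:,.0f}")
for steps in (10, 50, 100):  # illustrative steps-per-cycle values
    print(f"{steps:4d} steps/cycle -> {estimate_hours(steps):7.0f} hours")
```

Even at 10 steps per cycle the estimate is close to 290 hours, so reducing the step count alone cannot bring a 32 ms run down to an overnight job at this per-step cost.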

 

Thanks,

 

Ting

 

From: Eddy Wu <ed...@imagica.technology>
Sent: Friday, January 12, 2024 12:39 PM
To: XYCE <XY...@sandia.gov>
Cc: xyce-users <xyce-...@googlegroups.com>
Subject: [EXTERNAL] Re: [xyce-users] Re: General guidelines for long transient simulations

 

Eddy Wu

unread,
Jan 17, 2024, 12:47:05 PMJan 17
to XYCE, xyce-users
Hi @Ting,

1) I've tried simulations with different options for reltol and abstol to minimize ringing.
Just changing the tolerances to 1e-3 reduces ringing, but it is computationally more expensive. The same is true of using Gear and Backward Euler.

2) I am using BSIM3 models. Here is an excerpt from the Xyce configuration in my .spice file. This is one configuration I tried among many others. Commented lines were uncommented in different runs.
.options linsol type=klu
.options timeint reltol=1e-2 abstol=1e-2 ; tried different tolerances
.options dist strategy=2
*.option method=trap maxord=1
.TRAN 100n 5u
.op
.PREPROCESS REPLACEGROUND TRUE
.PREPROCESS ADDRESISTORS NODCPATH 100G
.PREPROCESS ADDRESISTORS ONETERMINAL 100G
*.OPTIONS OUTPUT INITIAL_INTERVAL= 25n 1 25n 1u 25n
.print tran format=raw V(*) ; tried with printing fewer steps too

3) I have about 55 points per clock when the tolerances are set to 1e-2, whereas I have ~130 points per clock when they are set to 1e-3. The run with tolerances at 1e-2 is faster.

4) We are running co-simulations of digital and analog circuits, as well as testing full-custom digital circuits in the analog simulator.
Since we are testing full-custom digital functionality, some of these transitions and operations occur over very long time periods, as they are designed to interact with humans.
We are also running co-simulations of multiple analog blocks, and would prefer to control them as in the normal use case.
In the normal use case, some operations also occur over extended periods of time.

XYCE

unread,
Jan 17, 2024, 1:45:55 PMJan 17
to Eddy Wu, xyce-users

 

Hi Eddy,

 

Thanks for the update. Yes, using a tighter tolerance with the Trap method reduces artificial ringing. Using a more stable method (Gear, BE) is another way to reduce ringing. With the same tolerance, a more stable method should show less ringing. This is briefly mentioned in the Xyce Users' Guide, and Kundert's paper has a more detailed explanation.


According to the new information you provided, the number of time steps per cycle is already reasonable. The artificial ringing led to some unnecessary steps at the sharp transition, but it damps out.

 

I assume that you are using the level 9 MOSFET in Xyce, which is BSIM3. It is a hand-written model, and we have not previously seen performance issues in device evaluation for this model. What other devices are used in the simulation? I wonder why >70% of the time is spent in device evaluation even with dist strategy=2.

 

Most simulation options are fine. Some minor things: the abstol of 1e-2 is too loose. The default is 1e-6. But abstol has much less effect on the local truncation error (LTE) check than reltol. Do you have a lot of floating nodes? The two ADDRESISTORS preprocess lines have no effect on the run of this netlist itself. Instead, these commands generate another netlist with the name _xyce.cir, which contains extra resistors added to the floating nodes. You will need to run _xyce.cir if you would like these extra resistors.

 

Thanks for explaining why you need a long simulation length. With 1.28 million cycles, the simulation will produce a lot of output. I did a quick estimate with 100 time steps per cycle: your simulation is going to take a few months according to the current timing statistics.
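
To give a feel for the output volume, here is a rough sketch of the arithmetic. It assumes that every one of the ~17K unknowns is written at every time step and that each value takes 8 bytes (double precision). Both are assumptions for illustration only: `V(*)` prints node voltages rather than all unknowns, and the actual size depends on the raw file encoding.

```python
cycles = 1_280_000        # 32 ms / 25 ns
steps_per_cycle = 100     # the quick-estimate figure used above
unknowns = 17_000         # approximate size of the system from the original post
bytes_per_value = 8       # assumption: one double-precision number per value

total_steps = cycles * steps_per_cycle
total_bytes = total_steps * unknowns * bytes_per_value

print(f"time steps : {total_steps:,}")
print(f"output size: {total_bytes / 1e12:.1f} TB")
```

Under these assumptions the run produces 128 million time points and on the order of 17 TB of output, which is why restricting the printed signals or the output interval matters for a run of this length.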

 

Since the number of time steps per cycle is reasonable in your simulation, the main way to improve performance is to see whether the cost per step can be reduced. As I mentioned before, device evaluation takes >70% of the run time and the linear solver takes 22%. Improving either would help, but device evaluation would have the larger impact. It is also possible that there is no real performance problem and the simulation is slow simply because there are so many cycles. That is a limitation of plain transient simulation for this circuit.
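
The relative impact of the two phases can be sketched with a simple Amdahl-style estimate. The fractions below round the reported figures to 70% device evaluation and 22% linear solve, with the remaining 8% treated as everything else. The 2x speedups are hypothetical and only illustrate the comparison.

```python
def overall_speedup(fractions, speedups):
    # fractions: share of run time per phase (should sum to 1.0)
    # speedups: factor by which each named phase gets faster (default 1.0)
    new_time = sum(frac / speedups.get(name, 1.0) for name, frac in fractions.items())
    return 1.0 / new_time

# Approximate shares taken from the timing statistics discussed above.
fractions = {"device_eval": 0.70, "linear_solve": 0.22, "other": 0.08}

dev2x = overall_speedup(fractions, {"device_eval": 2.0})    # hypothetical 2x faster device evaluation
lin2x = overall_speedup(fractions, {"linear_solve": 2.0})   # hypothetical 2x faster linear solve

print(f"2x device evaluation: {dev2x:.2f}x overall")
print(f"2x linear solve     : {lin2x:.2f}x overall")
```

Halving the device evaluation time gives roughly a 1.5x overall speedup, while halving the linear solve time gives only about 1.1x. Neither changes the projected run time by an order of magnitude.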
