parallel-build and MPI related issue

Rich Zhang

Feb 25, 2021, 2:55:42 AM
to xyce-users
Hi there,
Every time I run the parallel version, I see the full log, but afterwards there is still something on the screen and it seems like the process has not finished. This is the end of the output:
Timing summary of 4 processors
                                                      CPU Time              CPU Time              CPU Time              Wall Time             Wall Time             Wall Time
                 Stats                   Count    Sum (% of System)     Min (% of System)     Max (% of System)     Sum (% of System)     Min (% of System)     Max (% of System)
---------------------------------------- ------ --------------------- --------------------- --------------------- --------------------- --------------------- ---------------------
Xyce                                          4  3:46:14.705 (100.0%)    56:33.459 (25.00%)    56:34.035 (25.00%)  3:46:32.849 (100.0%)    56:38.210 (25.00%)    56:38.214 (25.00%)
  Analysis                                    4  3:46:12.165 (99.98%)    56:32.827 (24.99%)    56:33.395 (25.00%)  3:46:29.274 (99.97%)    56:37.319 (24.99%)    56:37.319 (24.99%)
    Transient                                 4  3:46:12.164 (99.98%)    56:32.827 (24.99%)    56:33.395 (25.00%)  3:46:29.273 (99.97%)    56:37.318 (24.99%)    56:37.318 (24.99%)
      Nonlinear Solve                    126784  3:45:04.512 (99.48%)    56:15.977 (24.87%)    56:16.439 (24.87%)  3:45:21.073 (99.47%)    56:20.198 (24.87%)    56:20.293 (24.87%)
        Residual                         903696    51:48.479 (22.90%)    12:54.502 ( 5.71%)    13:01.339 ( 5.76%)    51:54.256 (22.91%)    12:55.996 ( 5.71%)    13:02.571 ( 5.76%)
        Jacobian                         776856    30:35.931 (13.52%)     7:32.314 ( 3.33%)     7:55.267 ( 3.50%)    30:38.978 (13.53%)     7:33.146 ( 3.33%)     7:56.034 ( 3.50%)
        Linear Solve                     776856  2:21:12.620 (62.41%)    35:03.664 (15.50%)    35:27.364 (15.67%)  2:21:21.770 (62.40%)    35:05.793 (15.49%)    35:29.696 (15.67%)
      Successful DCOP Steps                   4        0.040 (<0.01%)        0.010 (<0.01%)        0.010 (<0.01%)        0.044 (<0.01%)        0.011 (<0.01%)        0.011 (<0.01%)
      Successful Step                     97104       43.709 ( 0.32%)       10.737 ( 0.08%)       11.450 ( 0.08%)       44.117 ( 0.32%)       10.780 ( 0.08%)       11.742 ( 0.09%)
      Failed Steps                        29676        0.108 (<0.01%)        0.025 (<0.01%)        0.028 (<0.01%)        0.105 (<0.01%)        0.024 (<0.01%)        0.028 (<0.01%)
        Nonlinear Failure                 29480        0.011 (<0.01%)        0.002 (<0.01%)        0.003 (<0.01%)        0.010 (<0.01%)        0.002 (<0.01%)        0.003 (<0.01%)
  Netlist Import                              4        1.900 ( 0.01%)        0.466 (<0.01%)        0.485 (<0.01%)        1.996 ( 0.01%)        0.495 (<0.01%)        0.503 (<0.01%)
    Parse Context                             4        0.068 (<0.01%)        0.016 (<0.01%)        0.017 (<0.01%)        0.069 (<0.01%)        0.017 (<0.01%)        0.017 (<0.01%)
    Distribute Devices                        4        1.615 ( 0.01%)        0.390 (<0.01%)        0.414 (<0.01%)        1.680 ( 0.01%)        0.420 (<0.01%)        0.420 (<0.01%)
    Verify Devices                            4        0.000 (<0.01%)        0.000 (<0.01%)        0.000 (<0.01%)        0.000 (<0.01%)        0.000 (<0.01%)        0.000 (<0.01%)
    Instantiate                               4        0.152 (<0.01%)        0.037 (<0.01%)        0.040 (<0.01%)        0.171 (<0.01%)        0.041 (<0.01%)        0.044 (<0.01%)
  Late Initialization                         4        0.565 (<0.01%)        0.139 (<0.01%)        0.142 (<0.01%)        0.589 (<0.01%)        0.147 (<0.01%)        0.147 (<0.01%)
    Global Indices                            4        0.254 (<0.01%)        0.063 (<0.01%)        0.064 (<0.01%)        0.255 (<0.01%)        0.063 (<0.01%)        0.064 (<0.01%)
  Setup Matrix Structure                      4        0.170 (<0.01%)        0.042 (<0.01%)        0.043 (<0.01%)        0.180 (<0.01%)        0.045 (<0.01%)        0.045 (<0.01%)

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

Any idea what might cause that?  Thank you guys!!!

xyce-users

Mar 8, 2021, 12:55:32 PM
to xyce-users
Rich,

I assume you had to manually kill the run (e.g., with Ctrl-C). Correct? Also, it is unclear from what you posted whether the failure occurred during the run or after it completed. That information would be somewhere above the timing output. Could you post the full output log (preferably as a file) or, at least, the last bit of the output before the timers?

Finally, could you let us know what operating system you are using?

Thanks,
The Xyce Team

Liqian Zhang

Mar 8, 2021, 10:35:16 PM
to xyce-users
Yes, I have to use Ctrl+C to kill it. Sorry, I kept forgetting to post my system info: CentOS Linux release 7.8.2003 (Core)

gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/data1/tools/gcc10-fortran/libexec/gcc/x86_64-pc-linux-gnu/10.2.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ./configure --enable-checking=release --enable-languages=c,c++,fortran --disable-multilib --prefix=/data1/tools/gcc10-fortran/
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 10.2.0 (GCC)

You’ll find the output log below. I have also attached the benchmark netlist. Again, thanks for the help.
image.png


log.txt
45nm_HP.pm
c7552.net

xyce-users

Mar 12, 2021, 6:17:19 PM
to xyce-users
Rich,

The hanging you are seeing is a known issue with Red Hat Linux version 7.0, so CentOS 7 has the exact same issue. Basically, the Open MPI supplied with version 7 has a bug that causes random hangs with Xyce. To avoid the hanging issue, you would have to install Open MPI from source (version 2 or later), and then build and install Xyce according to the Building Guide.
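
Before rebuilding, it may help to confirm which Open MPI is actually first on the PATH. The following is a minimal sketch, assuming that `mpirun --version` prints a line of the form `mpirun (Open MPI) X.Y.Z`; the helper names `parse_open_mpi_version` and `is_recommended` are made up for this illustration and are not part of Xyce or Open MPI.

```python
import re
import subprocess


def parse_open_mpi_version(text):
    """Return (major, minor, patch) parsed from `mpirun --version` output.

    Assumes a line such as "mpirun (Open MPI) 1.10.7"; returns None if no
    such line is found (e.g. a different MPI implementation is installed).
    """
    match = re.search(r"\(Open MPI\)\s+(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        return None
    # A missing patch component is treated as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def is_recommended(version):
    """True if the version is 2.x or later, as suggested in this thread."""
    return version is not None and version[0] >= 2


if __name__ == "__main__":
    try:
        result = subprocess.run(
            ["mpirun", "--version"], capture_output=True, text=True
        )
        output = result.stdout + result.stderr
    except FileNotFoundError:
        output = ""  # mpirun is not on the PATH

    version = parse_open_mpi_version(output)
    if version is None:
        print("Could not detect an Open MPI version from mpirun.")
    elif is_recommended(version):
        print("Open MPI %d.%d.%d: version 2 or later." % version)
    else:
        print("Open MPI %d.%d.%d: older than version 2." % version)
```

If the script reports a version older than 2, the `mpirun` being picked up is probably the one supplied with the operating system rather than one built from source.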

As to the simulation failure, Xyce was unable to get a DC Operating Point. There were a lot of warnings about nodes that had no DC path to ground, so that could be a place to start when troubleshooting.
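
To get a list of the affected nodes out of a long log, something like the sketch below may be a starting point. It assumes each warning is on one line, names the node in parentheses, and contains the phrase "DC path to ground"; the exact wording in your log may differ, so treat the pattern and the sample lines as assumptions and adjust them to match the actual output.

```python
import re
from collections import Counter

# Assumed warning shape: "... (NODE_NAME) ... DC path to ground".
# This pattern is a guess at the log format, not taken from Xyce itself.
WARNING_PATTERN = re.compile(
    r"\((?P<node>[^)]+)\).*DC path to ground", re.IGNORECASE
)


def nodes_without_dc_path(log_text):
    """Count, per node name, the warnings that mention a missing DC path."""
    counts = Counter()
    for line in log_text.splitlines():
        match = WARNING_PATTERN.search(line)
        if match:
            counts[match.group("node")] += 1
    return counts


def nodes_without_dc_path_in_file(path):
    """Same as nodes_without_dc_path, but reads the log from a file."""
    with open(path, "r", errors="replace") as handle:
        return nodes_without_dc_path(handle.read())


# Hypothetical sample lines, for illustration only.
sample_log = """\
Netlist warning: Voltage Node (N12) does not have a DC path to ground
Netlist warning: Voltage Node (N47) does not have a DC path to ground
Some unrelated line
"""

for node, count in sorted(nodes_without_dc_path(sample_log).items()):
    print(node, count)
```

Calling `nodes_without_dc_path_in_file("log.txt")` on the saved output would give the nodes to inspect first in the netlist.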

The Xyce Team

Liqian Zhang

Mar 14, 2021, 9:27:00 PM
to xyce-users
Awesome!!!! Thanks for the help. 