nrn_timeout on some large and long simulations


Ernest Ho

Nov 9, 2021, 2:46:37 AM11/9/21
to NetPyNE Q&A forum
Hi NetPyNE development group,

I noticed that when I perform long simulations (~2000 seconds) with multiple cores, I sometimes encounter nrn_timeout, and the simulation fails from that point on. Here is an example error output:

Setting h global variables ...
Setting h global variables ...
Setting h global variables ...
Setting h global variables ...
Setting h global variables ...
Setting h global variables ...
Setting h global variables ...
h.celsius = 36.4
h.v_init = -65
h.celsius = 36.4
h.v_init = -65
h.celsius = 36.4
h.v_init = -65
h.clamp_resist = 0.001
h.tstop = 2000000.0
h.clamp_resist = 0.001
h.tstop = 2000000.0
h.celsius = 36.4
h.v_init = -65
h.clamp_resist = 0.001
h.tstop = 2000000.0
h.celsius = 36.4
h.v_init = -65
h.clamp_resist = 0.001
h.tstop = 2000000.0
h.celsius = 36.4
h.v_init = -65
h.clamp_resist = 0.001
h.tstop = 2000000.0
h.clamp_resist = 0.001
h.tstop = 2000000.0
h.celsius = 36.4
h.v_init = -65
h.clamp_resist = 0.001
h.tstop = 2000000.0
Setting h global variables ...
h.celsius = 36.4
h.v_init = -65
h.clamp_resist = 0.001
h.tstop = 2000000.0
Minimum delay (time-step for queue exchange) is 2.08
Running simulation for 2000000.0 ms...
nrn_timeout t=289743
nrn_timeout t=289743

I noticed that it probably has something to do with NEURON's method of integration using MPI. I am just wondering if there is anything we can do to get around this nrn_timeout issue and complete the simulation? In the example above, the run did not even get beyond 15% of the full 2000 s integration.

(It also looks to me like this error is core- and/or machine-dependent...)
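
For concreteness, this is the kind of workaround I had in mind, assuming this NEURON build exposes ParallelContext.timeout via NetPyNE's sim.pc (I have not verified that it actually avoids the problem):

from netpyne import sim

# Build the network first so sim.pc (NEURON's ParallelContext) exists,
# then lengthen the MPI watchdog before running. netParams/simConfig
# come from the model as usual.
sim.create(netParams=netParams, simConfig=simConfig)
sim.pc.timeout(200)    # seconds; pc.timeout(0) would disable the watchdog entirely
sim.simulate()
sim.analyze()

This would only relax the watchdog, of course; if a rank is genuinely crashing, the run would presumably still fail.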

Thanks for your help,
Ernest


Ernest Ho

Nov 9, 2021, 3:14:32 AM11/9/21
to NetPyNE Q&A forum
Hi,

It certainly looks to me like a segmentation fault was encountered in the middle of the sim, and that might have caused the nrn_timeout...

It is only a simulation of ~300 cells; I am only recording LFPs at 2 locations and the extracellular K+ concentration (K_o) for every cell (both every 10 ms), plus spike times. Each of the 8 cores (#SBATCH -n 8) had 18G of memory (#SBATCH --mem-per-cpu 18g), so there was a total of 144G of RAM, I believe. The .pkl output files from other successful sims of similar size and length were ~1G per sim. If memory is an issue here, I don't see how I can increase it without incurring the wrath of the system administrator...
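
For reference, here is a rough sketch of the recording/output settings I described above (NetPyNE simConfig names; the electrode coordinates and the K+ trace spec are placeholders, since the real variable name depends on the mod files, e.g. ikpump.mod / kbalance.mod):

from netpyne import specs

cfg = specs.SimConfig()
cfg.duration = 2000 * 1e3                  # 2000 s, expressed in ms
cfg.recordStep = 10                        # sample every 10 ms
cfg.recordLFP = [[100, 100, 100],          # two LFP electrode locations (placeholder x,y,z in um)
                 [300, 100, 100]]
cfg.recordTraces = {'ko': {'sec': 'soma', 'loc': 0.5, 'var': 'ko'}}   # extracellular K+ per cell (variable name is a placeholder)
cfg.savePickle = True                      # gives the ~1G .pkl output mentioned above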

Thanks,
Ernest


Additional mechanisms from files
 "iext.mod" "ikCa.mod" "ikleak.mod" "ikpump.mod" "ileak.mod" "ipotassium.mod" "isodium.mod" "kbalance.mod"
[cdr719:04470] *** Process received signal ***
[cdr719:04470] Signal: Segmentation fault (11)
[cdr719:04470] Signal code:  (150601864)
[cdr719:04470] Failing at address: 0x8fa0084
[cdr719:04470] [ 0] /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc.so.6(+0x38980)[0x2ae7065a6980]
[cdr719:04470] [ 1] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc9/openmpi/4.0.3/lib/libopen-pal.so.40(opal_show_help_yylex+0x91)[0x2ae706956b31]
[cdr719:04470] [ 2] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc9/openmpi/4.0.3/lib/libopen-pal.so.40(opal_show_help_vstring+0x188)[0x2ae706956098]
[cdr719:04470] [ 3] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc9/openmpi/4.0.3/lib/libopen-rte.so.40(orte_show_help+0xce)[0x2ae7067df48e]
[cdr719:04470] [ 4] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc9/openmpi/4.0.3/lib/libmpi.so.40(PMPI_Abort+0x6c)[0x2ae705f9f6dc]
[cdr719:04470] [ 5] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/neuron/8.0.0/bin/../lib/libnrniv.so(nrnmpi_abort+0x2b)[0x2ae7059f166b]
[cdr719:04470] [ 6] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/neuron/8.0.0/bin/../lib/libnrniv.so(+0x18bea5)[0x2ae705924ea5]
[cdr719:04470] [ 7] /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libpthread.so.0(+0x130f0)[0x2ae7065610f0]
[cdr719:04470] [ 8] /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc.so.6(__poll+0x4f)[0x2ae70665d2cf]
[cdr719:04470] [ 9] /cvmfs/soft.computecanada.ca/gentoo/2020/usr/lib64/libevent-2.1.so.6(+0x2b9fd)[0x2ae706ba99fd]
[cdr719:04470] [10] /cvmfs/soft.computecanada.ca/gentoo/2020/usr/lib64/libevent-2.1.so.6(event_base_loop+0x2a5)[0x2ae706ba0525]
[cdr719:04470] [11] /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Compiler/gcc9/openmpi/4.0.3/lib/libopen-pal.so.40(+0x5c11e)[0x2ae70692411e]
[cdr719:04470] [12] /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libpthread.so.0(+0x7f27)[0x2ae706555f27]
[cdr719:04470] [13] /cvmfs/soft.computecanada.ca/gentoo/2020/lib64/libc.so.6(clone+0x3f)[0x2ae70666787f]
[cdr719:04470] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------