Error during Reconstruct

Stefan Petrovic

Apr 17, 2024, 3:05:08 AM
to spIsoNet
Hi All,

First of all, thank you for making this great piece of software available. I have just installed it on our machine, which has 4xA1000 GPUs and runs CentOS 7 with CUDA 11.8, and I am running into the following error. Could you look into what might be going wrong?

""""
(spisonet) [spetrovi@laue J331]$ spisonet.py reconstruct J331_007_volume_map_half_A.mrc J331_007_volume_map_half_B.mrc --aniso_file FSC3D.mrc --mask J331_007_volume_mask_fsc_auto.mrc --limit_res 2.66 --epochs 30 --alpha 1 --beta 0.5 --output_dir isonet_maps --gpuID 0,1,2,3 --acc_batches 2
04-16 23:20:33, INFO     voxel_size 0.8327999711036682
04-16 23:20:34, INFO     spIsoNet correction until resolution 2.66A!
                     Information beyond 2.66A remains unchanged
04-16 23:21:01, INFO     Start preparing subvolumes!
04-16 23:21:38, INFO     Done preparing subvolumes!
04-16 23:21:38, INFO     Start training!
04-16 23:21:43, INFO     Port number: 50745
learning rate 0.0003
['isonet_maps/J331_007_volume_map_half_A_data', 'isonet_maps/J331_007_volume_map_half_B_data']
  0%|                                                                                | 0/125 [00:00<?, ?batch/s][rank0]:[2024-04-16 23:22:28,961] [0/0] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank3]:[2024-04-16 23:22:28,973] [0/0] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank2]:[2024-04-16 23:22:29,175] [0/0] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank1]:[2024-04-16 23:22:29,313] [0/0] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
/home/spetrovi/miniconda3/envs/spisonet/bin/python: relocation error: /home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8: symbol _Z20traceback_iretf_implPKcRKN5cudnn16InternalStatus_tEb, version libcudnn_ops_infer.so.8 not defined in file libcudnn_ops_infer.so.8 with link time reference
/home/spetrovi/miniconda3/envs/spisonet/bin/python: relocation error: /home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8: symbol _Z20traceback_iretf_implPKcRKN5cudnn16InternalStatus_tEb, version libcudnn_ops_infer.so.8 not defined in file libcudnn_ops_infer.so.8 with link time reference
/home/spetrovi/miniconda3/envs/spisonet/bin/python: relocation error: /home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8: symbol _Z20traceback_iretf_implPKcRKN5cudnn16InternalStatus_tEb, version libcudnn_ops_infer.so.8 not defined in file libcudnn_ops_infer.so.8 with link time reference
/home/spetrovi/miniconda3/envs/spisonet/bin/python: relocation error: /home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8: symbol _Z20traceback_iretf_implPKcRKN5cudnn16InternalStatus_tEb, version libcudnn_ops_infer.so.8 not defined in file libcudnn_ops_infer.so.8 with link time reference
Traceback (most recent call last):
  File "/home/spetrovi/miniconda3/envs/spisonet/bin/spisonet.py", line 8, in <module>
    sys.exit(main())
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 549, in main
    fire.Fire(ISONET)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 182, in reconstruct
    map_refine_n2n(halfmap1,halfmap2, mask_vol, fsc3d, alpha = alpha,beta=beta,  voxel_size=voxel_size, output_dir=output_dir,
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/map_refine.py", line 145, in map_refine_n2n
    network.train([data_dir_1,data_dir_2], output_dir, alpha=alpha,beta=beta, output_base=output_base0, batch_size=batch_size, epochs = epochs, steps_per_epoch = 1000,
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 265, in train
    mp.spawn(ddp_train, args=(self.world_size, self.port_number, self.model,alpha,beta,
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 148, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 3 terminated with exit code 127
(spisonet) [spetrovi@laue J331]$ /home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 84 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
^C
(spisonet) [spetrovi@laue J331]$ htop
(spisonet) [spetrovi@laue J331]$ spisonet.py reconstruct J331_007_volume_map_half_A.mrc J331_007_volume_map_half_B.mrc --aniso_file FSC3D.mrc --mask J331_007_volume_mask_fsc_auto.mrc --limit_res 2.66 --epochs 30 --alpha 1 --beta 0.5 --output_dir isonet_maps --gpuID 0,1,2,3 --acc_batches 2
04-16 23:24:53, INFO     The isonet_maps folder already exists, outputs will write into this folder
04-16 23:24:54, INFO     voxel_size 0.8327999711036682
04-16 23:24:55, WARNING  The isonet_maps/J331_007_volume_map_half_A_data folder already exists. The old isonet_maps/J331_007_volume_map_half_A_data folder will be moved to isonet_maps/J331_007_volume_map_half_A_data~
04-16 23:24:55, WARNING  The isonet_maps/J331_007_volume_map_half_B_data folder already exists. The old isonet_maps/J331_007_volume_map_half_B_data folder will be moved to isonet_maps/J331_007_volume_map_half_B_data~
04-16 23:24:55, INFO     spIsoNet correction until resolution 2.66A!
                     Information beyond 2.66A remains unchanged
04-16 23:25:21, INFO     Start preparing subvolumes!
04-16 23:25:59, INFO     Done preparing subvolumes!
04-16 23:25:59, INFO     Start training!
04-16 23:26:02, INFO     Port number: 37913
learning rate 0.0003
['isonet_maps/J331_007_volume_map_half_A_data', 'isonet_maps/J331_007_volume_map_half_B_data']
  0%|                                                                                | 0/125 [00:00<?, ?batch/s][rank1]:[2024-04-16 23:26:46,658] [0/0] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank2]:[2024-04-16 23:26:46,770] [0/0] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank0]:[2024-04-16 23:26:46,803] [0/0] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank3]:[2024-04-16 23:26:46,903] [0/0] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
/home/spetrovi/miniconda3/envs/spisonet/bin/python: relocation error: /home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8: symbol _Z20traceback_iretf_implPKcRKN5cudnn16InternalStatus_tEb, version libcudnn_ops_infer.so.8 not defined in file libcudnn_ops_infer.so.8 with link time reference
/home/spetrovi/miniconda3/envs/spisonet/bin/python: relocation error: /home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8: symbol _Z20traceback_iretf_implPKcRKN5cudnn16InternalStatus_tEb, version libcudnn_ops_infer.so.8 not defined in file libcudnn_ops_infer.so.8 with link time reference
/home/spetrovi/miniconda3/envs/spisonet/bin/python: relocation error: /home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8: symbol _Z20traceback_iretf_implPKcRKN5cudnn16InternalStatus_tEb, version libcudnn_ops_infer.so.8 not defined in file libcudnn_ops_infer.so.8 with link time reference
/home/spetrovi/miniconda3/envs/spisonet/bin/python: relocation error: /home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/../../nvidia/cudnn/lib/libcudnn_cnn_infer.so.8: symbol _Z20traceback_iretf_implPKcRKN5cudnn16InternalStatus_tEb, version libcudnn_ops_infer.so.8 not defined in file libcudnn_ops_infer.so.8 with link time reference
Traceback (most recent call last):
  File "/home/spetrovi/miniconda3/envs/spisonet/bin/spisonet.py", line 8, in <module>
    sys.exit(main())
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 549, in main
    fire.Fire(ISONET)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 182, in reconstruct
    map_refine_n2n(halfmap1,halfmap2, mask_vol, fsc3d, alpha = alpha,beta=beta,  voxel_size=voxel_size, output_dir=output_dir,
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/map_refine.py", line 145, in map_refine_n2n
    network.train([data_dir_1,data_dir_2], output_dir, alpha=alpha,beta=beta, output_base=output_base0, batch_size=batch_size, epochs = epochs, steps_per_epoch = 1000,
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 265, in train
    mp.spawn(ddp_train, args=(self.world_size, self.port_number, self.model,alpha,beta,
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 148, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with exit code 127
(spisonet) [spetrovi@laue J331]$ /home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 84 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
""""

YUNTAO LIU

Apr 17, 2024, 8:03:10 PM
to Stefan Petrovic, spIsoNet
Hi Stefan,

I suspect this problem is caused by insufficient GPU resources. How much VRAM does an A1000 GPU have? The following picture shows the approximate VRAM needed for spIsoNet.

[image.png: approximate VRAM needed for spIsoNet]


--
Best Regards,
Yuntao Liu,  Postdoc.

California NanoSystem Institute
University of California Los Angeles

Stefan Petrovic

Apr 17, 2024, 9:03:59 PM
to spIsoNet
Hi Yuntao,

Thank you for the fast reply!
(Un)fortunately, I made a typo: I'm running spIsoNet on 4xNVIDIA RTX A6000, each with 48GB of VRAM.

Does it look like a memory-related error message?

Cheers,
Stefan

YUNTAO LIU

Apr 17, 2024, 9:19:08 PM
to Stefan Petrovic, spIsoNet
Hi Stefan,

Then 48GB of VRAM is more than sufficient, so we should look at other possible causes.

I found this discussion of a cuDNN version incompatibility that produces a similar error: https://discuss.pytorch.org/t/could-not-load-library-libcudnn-cnn-train-so-8-while-training-convnet/171334
It is probably worth verifying that the versions of cuDNN, CUDA and torch match.
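
A relocation error between two libcudnn files can happen when a second copy of cuDNN (for example a system-wide install on LD_LIBRARY_PATH) is picked up alongside the copy bundled with the torch package. Whether that is the cause here is an assumption, not something confirmed in this thread. A minimal sketch for listing the cuDNN libraries visible on the library path; the helper name find_cudnn_on_library_path is hypothetical:

```python
import glob
import os


def find_cudnn_on_library_path(env=None):
    """Return libcudnn* files found in the directories of LD_LIBRARY_PATH."""
    env = os.environ if env is None else env
    hits = []
    for directory in env.get("LD_LIBRARY_PATH", "").split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing ':'
        hits.extend(sorted(glob.glob(os.path.join(directory, "libcudnn*"))))
    return hits


if __name__ == "__main__":
    # Print every cuDNN library reachable through LD_LIBRARY_PATH.
    for path in find_cudnn_on_library_path():
        print(path)
```

If any printed path lies outside the conda environment, it may be worth retrying with that directory removed from LD_LIBRARY_PATH.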



Stefan Petrovic

Apr 18, 2024, 5:56:22 PM
to spIsoNet
Hi Yuntao,

Thank you for the great feedback. Indeed, my cuDNN, CUDA and torch versions did not match. For anyone else running into the same problem, the compatibility table is useful: https://github.com/pytorch/pytorch/blob/main/RELEASE.md#release-compatibility-matrix

So, I am running cuda 11.8:
(spisonet) [spetrovi@laue J331]$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

And running cudnn 8.7 / torch 2.2.2:
(spisonet) [spetrovi@laue J331]$ python
Python 3.10.14 (main, Mar 21 2024, 16:24:04) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__version__)
2.2.2
>>> print(torch.backends.cudnn.version())
8700
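
The interactive checks above can be gathered into one script, so the three versions can be compared against the compatibility matrix in one go. This is a minimal sketch: decode_cudnn_version and report_versions are hypothetical helper names, and the decoding assumes the usual cuDNN integer encoding (major*1000 + minor*100 + patch up to cuDNN 8, major*10000 + minor*100 + patch from cuDNN 9).

```python
def decode_cudnn_version(value):
    """Split the integer from torch.backends.cudnn.version() into (major, minor, patch)."""
    if value >= 10000:  # assumed cuDNN 9+ encoding: major*10000 + minor*100 + patch
        major, rest = divmod(value, 10000)
    else:  # cuDNN 8 and earlier: major*1000 + minor*100 + patch
        major, rest = divmod(value, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def report_versions():
    """Return the torch, CUDA and cuDNN versions as seen by PyTorch."""
    try:
        import torch
    except ImportError:
        return {"torch": None, "cuda": None, "cudnn": None}
    cudnn = torch.backends.cudnn.version()  # None when cuDNN is unavailable
    return {
        "torch": torch.__version__,
        "cuda": torch.version.cuda,  # CUDA version torch was built against
        "cudnn": decode_cudnn_version(cudnn) if cudnn is not None else None,
    }


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name}: {version}")
```

Note that torch.version.cuda reports the CUDA version torch was built against, which can differ from the toolkit version printed by nvcc.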

I now get the following error:
['isonet_maps/J331_007_volume_map_half_A_data', 'isonet_maps/J331_007_volume_map_half_B_data']
  0%|                                                                                | 0/125 [00:00<?, ?batch/s][rank3]:[2024-04-18 14:34:22,274] [0/0] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank0]:[2024-04-18 14:34:22,618] [0/0] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank2]:[2024-04-18 14:34:22,899] [0/0] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank1]:[2024-04-18 14:34:23,284] [0/0] torch._dynamo.variables.torch: [WARNING] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
/tmp/tmpr6j127vt/main.c: In function ‘list_to_cuuint64_array’:
/tmp/tmp1l43fx85/main.c: In function ‘list_to_cuuint64_array’:
/tmp/tmprczwac99/main.c: In function ‘list_to_cuuint64_array’:
/tmp/tmpr6j127vt/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
   ^
/tmp/tmpu9tb6e5m/main.c: In function ‘list_to_cuuint64_array’:
/tmp/tmp1l43fx85/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
   ^
/tmp/tmpr6j127vt/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code
/tmp/tmprczwac99/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
   ^
/tmp/tmp1l43fx85/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code
/tmp/tmprczwac99/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code
/tmp/tmpr6j127vt/main.c: In function ‘list_to_cuuint32_array’:
/tmp/tmp1l43fx85/main.c: In function ‘list_to_cuuint32_array’:
/tmp/tmprczwac99/main.c: In function ‘list_to_cuuint32_array’:
/tmp/tmpu9tb6e5m/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
   ^
/tmp/tmpr6j127vt/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
   ^
/tmp/tmp1l43fx85/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
   ^
/tmp/tmpu9tb6e5m/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code
/tmp/tmprczwac99/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
   ^
/tmp/tmpu9tb6e5m/main.c: In function ‘list_to_cuuint32_array’:
/tmp/tmpu9tb6e5m/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
   ^
/tmp/tmp3isx_ban/main.c: In function ‘list_to_cuuint64_array’:
/tmp/tmpv0o0ry1q/main.c: In function ‘list_to_cuuint64_array’:
/tmp/tmp3isx_ban/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
   ^
/tmp/tmp3isx_ban/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code
/tmp/tmp3isx_ban/main.c: In function ‘list_to_cuuint32_array’:
/tmp/tmpv0o0ry1q/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
   ^
/tmp/tmpv0o0ry1q/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code
/tmp/tmp3isx_ban/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
   ^
/tmp/tmpv0o0ry1q/main.c: In function ‘list_to_cuuint32_array’:
/tmp/tmpv0o0ry1q/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
   ^
/tmp/tmp__1zeucj/main.c: In function ‘list_to_cuuint64_array’:
/tmp/tmppyr76ww_/main.c: In function ‘list_to_cuuint64_array’:
/tmp/tmp__1zeucj/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
   ^
/tmp/tmppyr76ww_/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
   ^
/tmp/tmp__1zeucj/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code
/tmp/tmppyr76ww_/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code
/tmp/tmp__1zeucj/main.c: In function ‘list_to_cuuint32_array’:
/tmp/tmppyr76ww_/main.c: In function ‘list_to_cuuint32_array’:
/tmp/tmp__1zeucj/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
   ^
/tmp/tmppyr76ww_/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
   ^
/tmp/tmp24x5boe3/main.c: In function ‘list_to_cuuint64_array’:
/tmp/tmp7fkgcdz3/main.c: In function ‘list_to_cuuint64_array’:
/tmp/tmp24x5boe3/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
   ^
/tmp/tmp24x5boe3/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code
/tmp/tmp7fkgcdz3/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
   ^
/tmp/tmp7fkgcdz3/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code
/tmp/tmp24x5boe3/main.c: In function ‘list_to_cuuint32_array’:
/tmp/tmp7fkgcdz3/main.c: In function ‘list_to_cuuint32_array’:
/tmp/tmp24x5boe3/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
   ^
/tmp/tmp7fkgcdz3/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
   ^
/tmp/tmpb0c_5uek/main.c: In function ‘list_to_cuuint64_array’:
/tmp/tmpb0c_5uek/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
   ^
/tmp/tmpb0c_5uek/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code
/tmp/tmpb0c_5uek/main.c: In function ‘list_to_cuuint32_array’:
/tmp/tmpb0c_5uek/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
   ^
/tmp/tmpxdg_2cjx/main.c: In function ‘list_to_cuuint64_array’:
/tmp/tmpxdg_2cjx/main.c:354:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
   ^
/tmp/tmpxdg_2cjx/main.c:354:3: note: use option -std=c99 or -std=gnu99 to compile your code
/tmp/tmpxdg_2cjx/main.c: In function ‘list_to_cuuint32_array’:
/tmp/tmpxdg_2cjx/main.c:365:3: error: ‘for’ loop initial declarations are only allowed in C99 mode
   for (Py_ssize_t i = 0; i < len; i++) {
   ^
  0%|                                                                                | 0/125 [00:34<?, ?batch/s]

Traceback (most recent call last):
  File "/home/spetrovi/miniconda3/envs/spisonet/bin/spisonet.py", line 8, in <module>
    sys.exit(main())
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 549, in main
    fire.Fire(ISONET)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 182, in reconstruct
    map_refine_n2n(halfmap1,halfmap2, mask_vol, fsc3d, alpha = alpha,beta=beta,  voxel_size=voxel_size, output_dir=output_dir,
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/map_refine.py", line 145, in map_refine_n2n
    network.train([data_dir_1,data_dir_2], output_dir, alpha=alpha,beta=beta, output_base=output_base0, batch_size=batch_size, epochs = epochs, steps_per_epoch = 1000,
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 265, in train
    mp.spawn(ddp_train, args=(self.world_size, self.port_number, self.model,alpha,beta,
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 158, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 2 terminated with the following error:

Traceback (most recent call last):
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 68, in _wrap
    fn(i, *args)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 116, in ddp_train
    preds = model(x1)# + noise.cuda())
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 489, in _fn
    return fn(*args, **kwargs)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1523, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/unet.py", line 97, in forward
    x, down_sampling_features = self.encoder(x)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/unet.py", line 98, in resume_in_forward
    x = self.decoder(x, down_sampling_features)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 652, in catch_errors
    return hijacked_callback(frame, cache_entry, hooks, frame_state)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 727, in _convert_frame
    result = inner_convert(frame, cache_entry, hooks, frame_state)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 383, in _convert_frame_assert
    compiled_product = _compile(
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 646, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 244, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 562, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/bytecode_transformation.py", line 1033, in transform_code_object
    transformations(instructions, code_options)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 151, in _fn
    return fn(*args, **kwargs)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py", line 527, in transform
    tracer.run()
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2128, in run
    super().run()
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 818, in run
    and self.step()
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 781, in step
    getattr(self, inst.opname)(inst)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/symbolic_convert.py", line 2243, in RETURN_VALUE
    self.output.compile_subgraph(
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 919, in compile_subgraph
    self.compile_and_call_fx_graph(tx, list(reversed(stack_values)), root)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1087, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 244, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1159, in call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/output_graph.py", line 1140, in call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/backends/distributed.py", line 312, in compile_fn
    return self.backend_compile_fn(gm, example_inputs)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/repro/after_dynamo.py", line 117, in debug_wrapper
    compiled_gm = compiler_fn(gm, example_inputs)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/__init__.py", line 1668, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1168, in compile_fx
    return aot_autograd(
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/backends/common.py", line 55, in compiler_fn
    cg = aot_module_simplified(gm, example_inputs, **kwargs)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 887, in aot_module_simplified
    compiled_fn = create_aot_dispatcher_function(
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 244, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 600, in create_aot_dispatcher_function
    compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config, fw_metadata=fw_metadata)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 425, in aot_wrapper_dedupe
    return compiler_fn(flat_fn, leaf_flat_args, aot_config, fw_metadata=fw_metadata)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 630, in aot_wrapper_synthetic_base
    return compiler_fn(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 295, in aot_dispatch_autograd
    compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 244, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1100, in fw_compiler_base
    return inner_compile(
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/repro/after_aot.py", line 83, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/debug.py", line 305, in inner
    return fn(*args, **kwargs)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 320, in compile_fx_inner
    compiled_graph = fx_codegen_and_compile(
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 550, in fx_codegen_and_compile
    compiled_fn = graph.compile_to_fn()
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/graph.py", line 1116, in compile_to_fn
    return self.compile_to_module().call
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_dynamo/utils.py", line 244, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/graph.py", line 1070, in compile_to_module
    mod = PyCodeCache.load_by_key_path(
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 1892, in load_by_key_path
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/torchinductor_spetrovi/u6/cu6kxoskl3txa5ptho7vjkz6zh7csuhwxngfjk3opdwi3ab6wbhv.py", line 67, in <module>
    async_compile.wait(globals())
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 2486, in wait
    scope[key] = result.result()
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 2330, in result
    kernel = self.kernel = _load_kernel(self.kernel_name, self.source_code)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 2306, in _load_kernel
    kernel.precompile()
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/triton_heuristics.py", line 188, in precompile
    compiled_binary, launcher = self._precompile_config(
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/_inductor/triton_heuristics.py", line 308, in _precompile_config
    binary._init_handles()
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/triton/compiler/compiler.py", line 670, in _init_handles
    bin_path = {driver.HIP: "hsaco_path", driver.CUDA: "cubin"}[driver.backend]
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/triton/runtime/driver.py", line 157, in __getattr__
    self._initialize_obj()
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/triton/runtime/driver.py", line 154, in _initialize_obj
    self._obj = self._init_fn()
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/triton/runtime/driver.py", line 187, in initialize_driver
    return CudaDriver()
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/triton/runtime/driver.py", line 77, in __init__
    self.utils = CudaUtils()
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/triton/runtime/driver.py", line 47, in __init__
    so = _build("cuda_utils", src_path, tmpdir)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/triton/common/build.py", line 106, in _build
    ret = subprocess.check_call(cc_cmd)
  File "/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
torch._dynamo.exc.BackendCompilerFailed: backend='compile_fn' raised:
CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmp24x5boe3/main.c', '-O3', '-I/home/spetrovi/miniconda3/envs/spisonet/lib/python3.10/site-packages/triton/common/../third_party/cuda/include', '-I/home/spetrovi/miniconda3/envs/spisonet/include/python3.10', '-I/tmp/tmp24x5boe3', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmp24x5boe3/cuda_utils.cpython-310-x86_64-linux-gnu.so', '-L/lib64', '-L/lib', '-L/lib64', '-L/lib']' returned non-zero exit status 1.

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

Ryan Notti

Apr 18, 2024, 10:59:21 PM4/18/24
to spIsoNet
Hi all,

I'm getting the same error as Stefan. I'm also running CentOS 7 with the same CUDA/cuDNN/PyTorch versions. I'd appreciate any help!

Ryan Notti
Rockefeller University

YUNTAO LIU

Apr 19, 2024, 6:05:43 PM4/19/24
to Ryan Notti, spIsoNet
Hi Ryan and Stefan,

Please check this issue on GitHub: https://github.com/IsoNet-cryoET/spIsoNet/issues/6

The user there reports that they "appear to have solved it by switching from gcc 4.8.5 to gcc 7.X".
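
A quick way to see whether the `gcc` on your PATH is older than the 7.x mentioned in that issue is sketched below. This is only a minimal check, not part of spIsoNet: the helper names `parse_gcc_major` and `gcc_is_new_enough` are made up for illustration, and the threshold of 7 is taken from the issue report, not from any documented minimum.

```python
import shutil
import subprocess


def parse_gcc_major(dumpversion_output: str) -> int:
    """Return the major version from `gcc -dumpversion` output, e.g. '4.8.5' or '7'."""
    return int(dumpversion_output.strip().split(".")[0])


def gcc_is_new_enough(minimum_major: int = 7) -> bool:
    """Check the first gcc found on PATH (assumed threshold: 7, per the GitHub issue)."""
    gcc = shutil.which("gcc")
    if gcc is None:
        # No gcc on PATH at all.
        return False
    result = subprocess.run(
        [gcc, "-dumpversion"], capture_output=True, text=True, check=True
    )
    return parse_gcc_major(result.stdout) >= minimum_major


if __name__ == "__main__":
    print("gcc on PATH is >= 7:", gcc_is_new_enough())
```

Run it inside the activated spisonet environment so that it sees the same PATH as the training job.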



Ryan Notti

Apr 19, 2024, 6:39:42 PM4/19/24
to YUNTAO LIU, spIsoNet

Thanks Yuntao! Seems to have worked for me!

Ryan

Stefan Petrovic

Apr 20, 2024, 1:27:52 PM4/20/24
to spIsoNet
Thanks Yuntao!
Updating gcc and reinstalling via "option 2" fixed the issue.

Bryce Brownfield

Apr 30, 2024, 11:40:11 AM4/30/24
to spIsoNet
Thanks all! I ended up fixing this error by installing the gcc_linux-64 package within conda: 

conda install gcc_linux-64

This might be useful for users setting this up without sudo permissions.
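
To confirm which compiler will actually be picked up after the conda install, something like the sketch below may help. It assumes, from my reading of the Triton version in the traceback, that Triton uses the `CC` environment variable if it is set and otherwise falls back to the first `gcc` (then `clang`) on PATH. The function name `resolve_compiler` is hypothetical, so treat the lookup order as an assumption to verify against your installed `triton/common/build.py`.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def resolve_compiler(
    env: Mapping[str, str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Mimic the assumed lookup order: $CC first, then gcc, then clang on PATH."""
    cc = env.get("CC")
    if cc is not None:
        return cc
    gcc = which("gcc")
    return gcc if gcc is not None else which("clang")


if __name__ == "__main__":
    # Run inside the activated conda environment.
    print("Compiler that would be used:", resolve_compiler(os.environ))
```

If this still prints /usr/bin/gcc, re-activating the environment (so that the conda compiler package can set `CC`) is worth trying before anything else.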