Train neural network does not produce the right output


dacalf2...@gmail.com

Oct 17, 2023, 7:17:16 PM
to EMAN2
Dear EMAN2 developers,

We are using EMAN2 2.99.47.

Here is the output of e2version.py:
EMAN 2.99.47 ( GITHUB: 2023-03-04 19:33 - commit: NOT-INSTALLED-FROM-GIT-REPO )
Your EMAN2 is running on: Linux-5.15.0-76-generic-x86_64-with-glibc2.37 5.15.0-76-generic
Your Python version is: 3.9.16

When we run "Train the neural network", we always get a map with completely white density. See the attached microtubule picture.

Please give us some suggestions for dealing with this issue.

My computer is a recent Linux installation with CUDA 12.2. I tried both GPU and CPU modes; both gave a blank output during the "Train the neural network" step.

Thanks,
Victor



trainout_weired.png

Ludtke, Steven J.

Oct 17, 2023, 7:30:17 PM
to em...@googlegroups.com
I don't have any immediate guesses (we'll see if Muyuan replies). The next thing to do would be to extract, say, the first 18 images from the output you showed a screenshot of, and post them (or send them to me) so we can take a look at the actual values in the file.

e2proc2d.py myoutput.hdf for_steve.hdf --last 18

and send/post for_steve.hdf. It should be small enough for email.
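If you want to eyeball the numbers yourself first, a quick per-image statistics check does much the same job. This is just an illustrative numpy sketch (the helper name stack_stats is mine, not an EMAN2 tool), assuming you already have the images loaded as a list or stack of 2-D arrays, e.g. via h5py or after converting the stack with e2proc2d.py:

```python
import numpy as np

def stack_stats(images):
    """Print and return per-image (index, min, max, mean, std) for a stack.

    A normalized training set should show mean ~0 and std ~1 per image;
    an all-white map shows up as a very large mean with almost no contrast.
    """
    rows = []
    for i, img in enumerate(images):
        a = np.asarray(img, dtype=np.float64)
        stats = (i, float(a.min()), float(a.max()), float(a.mean()), float(a.std()))
        rows.append(stats)
        print(f"img {i:3d}: min={stats[1]:.4g} max={stats[2]:.4g} "
              f"mean={stats[3]:.4g} std={stats[4]:.4g}")
    return rows
```

Values wildly outside the mean ≈ 0, std ≈ 1 range would point straight at a normalization problem rather than at the network itself.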

---
Steven Ludtke, Ph.D. <slu...@bcm.edu>                      Baylor College of Medicine
Charles C. Bell Jr., Professor of Structural Biology        Dept. of Biochemistry 
Deputy Director, Advanced Technology Cores                  and Molecular Pharmacology
Academic Director, CryoEM Core
Co-Director CIBR Center


--
You received this message because you are subscribed to the Google
Groups "EMAN2" group.
To post to this group, send email to em...@googlegroups.com
To unsubscribe from this group, send email to eman2+un...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/eman2


dacalf2...@gmail.com

Oct 17, 2023, 8:07:09 PM
to EMAN2
Dear Steven,
Thanks for the quick response. See the attached .hdf file.
for_steve.hdf

Ludtke, Steven J.

Oct 17, 2023, 10:12:37 PM
to em...@googlegroups.com
Looks like you were trying to train on un-normalized clips. Did you go through the "Import or Preprocess" step in the tutorial?
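For reference, "normalized" here means roughly zero mean and unit standard deviation per clip. A minimal numpy sketch of that kind of normalization follows; the helper name is made up for illustration, and this is not EMAN2's actual preprocessing code:

```python
import numpy as np

def normalize_clip(clip, eps=1e-8):
    """Zero-mean, unit-std normalization of a single image/volume clip.

    eps guards against division by zero on constant (featureless) clips.
    Illustrative only: this mirrors what a preprocessing step is generally
    expected to produce, not EMAN2's internal implementation.
    """
    clip = np.asarray(clip, dtype=np.float32)
    return (clip - clip.mean()) / (clip.std() + eps)
```

If the clips going into training are on an arbitrary raw intensity scale instead, the network sees inputs far outside its expected range and its outputs can easily saturate.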


dacalf2...@gmail.com

Oct 18, 2023, 10:02:55 AM
to EMAN2
Hi Steven,
Thank you for the reply.
Yes, I did the preprocessing. I had a .mrc file generated by the IMOD newstack function, then used "Preprocess tomograms" inside EMAN2/Segmentation. I checked the tomograms folder generated by EMAN2; it contains a __preproc.hdf output.
Victor


dacalf2...@gmail.com

Oct 18, 2023, 11:48:48 AM
to EMAN2
Hi Steven,
I imported the tomogram and did the preprocessing again.
Here is how for_steve_1.hdf was generated:

e2proc2d.py trainout_nnet_save__mt0.01.hdf for_steve_1.hdf --last 24

Can you get any other hints from the .hdf? Why is the training output blank white even though EMAN2 can segment the microtubules? See the attached for_steve_1.hdf.

Thanks,
Victor
for_steve_1.hdf

dacalf2...@gmail.com

Oct 20, 2023, 12:37:26 PM
to EMAN2
Hi Steve,
I tried both GPU and CPU modes; they both give the blank training output.
Here is some information printed before the neural network training starts.

This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
loading particles...
1600 particles loaded, 1600 in training set, 0 in validation set
(1600, 64, 64)
Std of particles:  0.8805271
Setting up model...
Training...



What does the cost mean? After iteration 1, nothing changes.
I attempted to reinstall EMAN 2.99; there were no error messages, but the training output always has the same problem. What else do you recommend for checking the functionality of EMAN 2.99?
I appreciate any help/advice you can provide.
Victor.

Muyuan Chen

Oct 20, 2023, 1:03:07 PM
to em...@googlegroups.com
I would guess there is some issue with the tensorflow/CUDA installation, but it is quite hard to imagine what exactly went wrong. The input seems ok but the neural network is throwing out ridiculously large numbers. Maybe it is some odd input parameter setting, or some installation issue. Can you provide the full input/output from the command line? You can find the input command in the “command” tab in project manager. 

Also, maybe run "test_tensorflow.py" in the examples folder and show us the results? I saw Steve changed that recently, but I assume it still works?

Muyuan

Steve Ludtke

Oct 20, 2023, 1:14:45 PM
to em...@googlegroups.com
Sorry, got behind on a bunch of deadlines. Yes, test_tensorflow should still work fine. The newer version adds some additional benchmarking tests at the end.

The full output of e2version.py would also be useful to see. Clearly something isn’t working right. I agree with Muyuan that something with tensorflow version or installation is the most likely culprit...


dacalf2...@gmail.com

Oct 20, 2023, 2:13:51 PM
to EMAN2
Hi Muyuan and Steve,
Thanks for the suggestions.
Here is the output of test_tensorflow.py:

python ./test_tensorflow.py
2023-10-20 13:58:09.285796: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA

To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Testing basic operations...
2023-10-20 13:58:10.127359: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[... repeated NUMA node messages omitted ...]
2023-10-20 13:58:10.140961: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA

To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[... repeated NUMA node messages omitted ...]
2023-10-20 13:58:10.270440: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 22276 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 4090, pci bus id: 0000:41:00.0, compute capability: 8.9
2023-10-20 13:58:10.270711: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-20 13:58:10.270818: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 21385 MB memory:  -> device: 1, name: NVIDIA GeForce RTX 4090, pci bus id: 0000:61:00.0, compute capability: 8.9
1 + 1 = 2
Testing matrix multiplication...
2023-10-20 13:58:10.414032: I tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:630] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
tf.Tensor(
[[ 2. 22. 16.]
 [ 6. 23. 17.]
 [ 0. 12.  4.]], shape=(3, 3), dtype=float32)
Testing convolution...
2023-10-20 13:58:10.425522: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:428] Loaded cuDNN version 8800
tf.Tensor(
[[18. 28. 27. 14.]
 [17. 39. 27.  5.]
 [25. 31. 23.  5.]
 [12. 18.  9.  3.]], shape=(4, 4), dtype=float32)
Testing training set...
tf.Tensor([ 0.04707954 -0.12480973  0.15579893], shape=(3,), dtype=float32)
tf.Tensor([ 0.14167915 -0.05361223  0.06201855], shape=(3,), dtype=float32)
tf.Tensor([ 0.1726551  -0.11010747 -0.55970323], shape=(3,), dtype=float32)
tf.Tensor([-0.12425488  0.173307   -0.06009084], shape=(3,), dtype=float32)
4
Testing training...
2023-10-20 13:58:10.623839: I tensorflow/compiler/xla/service/service.cc:173] XLA service 0x55d0318e3e80 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-10-20 13:58:10.623861: I tensorflow/compiler/xla/service/service.cc:181]   StreamExecutor device (0): NVIDIA GeForce RTX 4090, Compute Capability 8.9
2023-10-20 13:58:10.623866: I tensorflow/compiler/xla/service/service.cc:181]   StreamExecutor device (1): NVIDIA GeForce RTX 4090, Compute Capability 8.9
2023-10-20 13:58:10.626268: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-10-20 13:58:10.690681: I tensorflow/compiler/jit/xla_compilation_cache.cc:477] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.
  iter 0, loss 26.09, mean grad 18.62
  iter 5, loss 11.54, mean grad 11.93
  iter 10, loss 3.57, mean grad 5.71
  iter 15, loss 1.02, mean grad 2.29
  iter 20, loss 1.20, mean grad 2.62
  iter 25, loss 1.47, mean grad 3.74
  iter 30, loss 1.03, mean grad 3.16
  iter 35, loss 0.41, mean grad 1.68
  iter 40, loss 0.11, mean grad 0.72
  iter 45, loss 0.11, mean grad 0.94
  iter 50, loss 0.12, mean grad 1.27
Truth:
[[3. 3.]
 [2. 4.]]
Estimate:
[[2.969382  2.8553815]
 [2.0285828 3.9924746]]


As for the current EMAN2 I am using: I installed it yesterday with the standard binary installer (previously I had installed EMAN2 via Miniconda).
EMAN 2.99.47 ( GITHUB: 2023-03-04 13:31 - commit: 3f313008c3185410fe859663e763dffb9c0b6fcc )

Your EMAN2 is running on: Linux-5.15.0-76-generic-x86_64-with-glibc2.37 5.15.0-76-generic
Your Python version is: 3.9.16

For CUDA, I installed 11.8 because I found that EMAN2 might need CUDA 11.8 during installation of the binary package. Please correct me if I used the wrong CUDA.
The nvidia-smi output is:
 Oct 20 14:10:22 2023      
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |

### CUDA 11.8 is still compatible with the driver, because I checked that MotionCor2 and AreTomo are functional with CUDA 11.8.
Again, because CPU mode also generates a blank training output, I guess the issue is not related to the CUDA version.

Regarding the commands for training (the tomogram has been preprocessed by EMAN2):

e2tomoseg_convnet.py --trainset=particles/L29_002_tomo3d_xyz_trimed_preproc__mt_trainset.hdf --nettag=mt --learnrate=0.01 --niter=10 --ncopy=1 --batch=20 --nkernel=40,40,1 --ksize=15,15,15 --poolsz=2,1,1 --trainout --training --device=cpu

e2tomoseg_convnet.py --trainset=particles/L29_002_tomo3d_xyz_trimed_preproc__mt_trainset.hdf --nettag=mt --learnrate=0.01 --niter=10 --ncopy=1 --batch=20 --nkernel=40,40,1 --ksize=15,15,15 --poolsz=2,1,1 --trainout --training --device=gpu

### These are just the GPU and CPU cases; both outputs are blank (attached in the previous email as for_steve_1.hdf).
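One way to connect the blank-white symptom to the earlier normalization comment: a sigmoid-style output layer saturates to 1.0 everywhere when its inputs are far too large, so un-normalized data produces an all-white map on CPU and GPU alike. A small numpy illustration with toy numbers (not EMAN2 code):

```python
import numpy as np

def sigmoid(z):
    """Logistic function, as used for per-pixel segmentation outputs."""
    return 1.0 / (1.0 + np.exp(-z))

# Toy input whose values sum to exactly zero (i.e. roughly normalized)...
normalized = np.array([-1.5, -0.5, 0.0, 0.25, 1.25, -0.75, 0.875, 0.375] * 2)
# ...and the same data on a raw, un-normalized intensity scale.
raw_scale = normalized * 1000 + 5000

w = np.full(16, 0.5)                 # toy weight vector

out_norm = sigmoid(w @ normalized)   # mid-range output, ~0.5
out_raw  = sigmoid(w @ raw_scale)    # saturates to 1.0: renders all white
```

The same weights that give a mid-range response on normalized data drive the sigmoid to 1.0 everywhere on raw-scale data, which renders as a uniformly white map regardless of which device did the arithmetic.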


Much appreciated,

Victor
