Problem with Theano on GPU


Ruben Dario Fonnegra Tarazona

Nov 22, 2017, 12:32:10 PM
to theano-dev
Hi.



I'm having problems executing code in Theano. I installed the dev version and it runs perfectly on the CPU. However, when I try to run anything on the GPU (even LeNet with MNIST), the model doesn't run at all and the only message is "Segmentation fault (core dumped)". I also tried to verify the Theano installation with the command THEANO_FLAGS='';python -c "import theano; theano.test()", and it did not work either. I tried several things but couldn't solve the problem. I attach a file with the output of that command, and my theanorc file. I'm on Kubuntu 16.04 with CUDA 8 and cuDNN 6 (Theano reports version 6021) on a Quadro M4000 (8 GB; the problem is not memory allocation). I hope you can help me solve the problem. Thanks in advance; I'll be watching for your answer.



What I've tried:
- Installed the stable, bleeding-edge and dev Theano versions, each with its corresponding libgpuarray library (same issue)
- Manually compiled OpenBLAS and installed it
- Set the environment variable CUDA_LAUNCH_BLOCKING to 1
- Used several floatX types (float16, float32, float64)
- Used the device=cpu and device=cuda0 flags

Note: it might be the CUDA drivers; however, TensorFlow runs on the GPU without any problem.
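Theano also reads options from the THEANO_FLAGS environment variable, a comma-separated list of key=value pairs that takes precedence over the theanorc file. In a shell, THEANO_FLAGS='...' python ... sets it for that one command, whereas THEANO_FLAGS=''; python ... (with a semicolon) only reaches Python if the variable was already exported. Below is a minimal Python 3 sketch, not part of Theano, for launching each configuration listed above in a fresh process; the helper names theano_env and run_theano_tests are hypothetical.

```python
import os
import subprocess
import sys


def theano_env(flags, cuda_launch_blocking=False, base=None):
    """Return a copy of the environment with THEANO_FLAGS set.

    `flags` maps Theano option names (e.g. "device", "floatX", "dnn.enabled")
    to values; they are joined as comma-separated key=value pairs.
    """
    env = dict(os.environ if base is None else base)
    env["THEANO_FLAGS"] = ",".join(f"{key}={value}" for key, value in flags.items())
    if cuda_launch_blocking:
        # Make CUDA kernel launches synchronous so errors surface at the launching call.
        env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_theano_tests(flags, cuda_launch_blocking=False):
    """Run theano.test() in a child process and return its exit status."""
    env = theano_env(flags, cuda_launch_blocking)
    cmd = [sys.executable, "-c", "import theano; theano.test()"]
    # On POSIX, a negative status -N means the child was killed by signal N;
    # -11 is SIGSEGV on Linux, i.e. a "Segmentation fault".
    return subprocess.call(cmd, env=env)
```

For example, run_theano_tests({"device": "cuda0", "floatX": "float32"}, cuda_launch_blocking=True) runs the suite once on the GPU with synchronous launches, without touching the shell's own environment.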




output of theano.test() ---------------------------------------------------------------------

HP-Z840-Workstation:~/Data$ THEANO_FLAGS=''; python -c "import theano; theano.test()"

Theano version 1.0.0

theano is installed in /home/bluegum1/Theano/theano
NumPy version 1.13.3

NumPy relaxed strides checking option: True
NumPy is installed in /home/bluegum1/.local/lib/python2.7/site-packages/numpy

Python version 2.7.12 (default, Nov 19 2016, 06:48:10) [GCC 5.4.0 20160609]

nose version 1.3.7

Using cuDNN version 6021 on context None

Mapped name None to device cuda: Quadro M4000 (0000:04:00.0)

..........................................ERROR (theano.gof.opt): Optimization failure due to: insert_bad_dtype
ERROR (theano.gof.opt): node: Elemwise{add,no_inplace}(<TensorType(float64, vector)>, <TensorType(float64, vector)>)

ERROR (theano.gof.opt): TRACEBACK:
ERROR (theano.gof.opt): Traceback (most recent call last):
  File "/home/bluegum1/Theano/theano/gof/opt.py", line 2059, in process_node
    remove=remove)
  File "/home/bluegum1/Theano/theano/gof/toolbox.py", line 569, in replace_all_validate_remove
    chk = fgraph.replace_all_validate(replacements, reason)
  File "/home/bluegum1/Theano/theano/gof/toolbox.py", line 518, in replace_all_validate
    fgraph.replace(r, new_r, reason=reason, verbose=False)
  File "/home/bluegum1/Theano/theano/gof/fg.py", line 486, in replace
    ". The type of the replacement must be the same.", old, new)

BadOptimization: BadOptimization Error 
  Variable: id 140331434464144 Elemwise{Cast{float32}}.0
  Op Elemwise{Cast{float32}}(Elemwise{add,no_inplace}.0)
  Value Type: <type 'NoneType'>
  Old Value:  None
  New Value:  None
  Reason:  insert_bad_dtype. The type of the replacement must be the same.
  Old Graph:
  Elemwise{add,no_inplace} [id A] <TensorType(float64, vector)> ''   
   |<TensorType(float64, vector)> [id B] <TensorType(float64, vector)>
   |<TensorType(float64, vector)> [id C] <TensorType(float64, vector)>

 
New Graph:
  Elemwise{Cast{float32}} [id D] <TensorType(float32, vector)> ''   
   |Elemwise{add,no_inplace} [id A] <TensorType(float64, vector)> ''   



Hint: relax the tolerance by setting tensor.cmp_sloppy=1
  or even tensor.cmp_sloppy=2 for less-strict comparison


......................................S............................./home/bluegum1/Theano/theano/compile/nanguardmode.py:150: RuntimeWarning: All-NaN slice encountered
  return np.isinf(np.nanmax(arr)) or np.isinf(np.nanmin(arr))
/home/bluegum1/Theano/theano/compile/nanguardmode.py:150: RuntimeWarning: All-NaN axis encountered
  return np.isinf(np.nanmax(arr)) or np.isinf(np.nanmin(arr))

............................................../home/bluegum1/Theano/theano/gof/vm.py:886: UserWarning: CVM does not support memory profile, using Stack VM.
  'CVM does not support memory profile, using Stack VM.')
............/home/bluegum1/Theano/theano/compile/profiling.py:283: UserWarning: You are running the Theano profiler with CUDA enabled. Theano GPU ops execution is asynchronous by default. So by default, the profile is useless. You must set the environment variable CUDA_LAUNCH_BLOCKING to 1 to tell the CUDA driver to synchronize the execution to get a meaningful profile.
  warnings.warn(msg)

....................0.0581646137166

0.0581646137166

0.0581646137166

0.0581646137166

.................................................................................................................................................................../home/bluegum1/Theano/theano/gof/vm.py:889: UserWarning: LoopGC does not support partial evaluation, using Stack VM.
  'LoopGC does not support partial evaluation, '
..........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................Violación de segmento (`core' generado)

---------------------------------------------------------------------

theanorc file ---------------------------------------------------------------------



[global]
floatX = float32
#device = cuda0
#device = cpu
#optimizer=fast_run
#optimizer=fast_compile # Disables the GPU
#optimizer=None

[cuda]
root = /usr/local/cuda-8.0

[nvcc]
fastmath = True

[lib]
cnmem = 1.0
.theanorc
log.txt

Frédéric Bastien

Nov 22, 2017, 1:37:16 PM
to thean...@googlegroups.com
fastmath=True isn't used anymore in the new back-end. It doesn't give a significant speed-up and generates more NaNs.
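Since fastmath is ignored by the new back-end, the [nvcc] section can simply be deleted from .theanorc. A minimal Python 3 sketch of doing that with the standard-library configparser module; the helper drop_obsolete_options is hypothetical, and configparser drops comment lines when it rewrites a file, so keep a backup of the original.

```python
import configparser
import io


def drop_obsolete_options(rc_text, obsolete=(("nvcc", "fastmath"),)):
    """Return theanorc text with the given (section, option) pairs removed.

    Sections left with no options are dropped as well.
    """
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names case-sensitive, e.g. "floatX"
    parser.read_string(rc_text)
    for section, option in obsolete:
        if parser.has_option(section, option):
            parser.remove_option(section, option)
        if parser.has_section(section) and not parser.options(section):
            parser.remove_section(section)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()
```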

The error in the log can be ignored. It is caught by the test: that test makes sure we raise this error. We should update the test so it doesn't print it, though.

For the segfault, can you run the tests like this, so we know which test causes it?

python -c "import theano; theano.test(verbose=2)"
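In verbose mode the test runner normally prints each test's name before running it and its status after it finishes, so the last name in the output is the test that was running when the process died. A minimal sketch for pulling that name out of a saved log, assuming the usual "<test> ... <status>" line format; the function last_started_test is hypothetical.

```python
MARKER = " ... "


def last_started_test(log_text):
    """Return the name of the last test that started, or None if none did."""
    last = None
    for line in log_text.splitlines():
        # A test that crashed leaves its line ending in " ..." with no status,
        # so pad with a space before looking for the separator.
        padded = line + " "
        if MARKER in padded:
            last = padded.split(MARKER, 1)[0].strip()
    return last
```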

Can you also run it in gdb and give us the backtrace? On the command line, execute:

gdb --args python -c "import theano; theano.test(verbose=2)"

It should give you a new prompt when the segfault happens. Execute the command "bt" there and send me its output.




Ruben Dario Fonnegra Tarazona

Nov 22, 2017, 5:32:43 PM
to theano-dev

Hi Frederic.


Thank you in advance for your quick response. Here is what I just did: I commented out the nvcc section in my theanorc file, including the fastmath line, and then executed both commands you suggested. The output of the python -c "import theano; theano.test(verbose=2)" command is in the file 1st_verbose.txt, and the output of the gdb --args python -c "import theano; theano.test(verbose=2)" command is in 2nd_bt.txt. I look forward to your response.
1st_verbose.txt
2nd_bt.txt

Frédéric Bastien

Nov 23, 2017, 10:39:35 AM
to thean...@googlegroups.com
For the second one, I missed a step. Can you execute the command again? Then, at the gdb prompt, execute "r"; the program will segfault, drop into the debugger at the segfault point, and give you a prompt again. There, execute "bt".
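The same "r" and "bt" steps can be handed to gdb up front, so it runs without interaction and just prints the backtrace. A minimal sketch, assuming gdb is on PATH; -batch, -ex and --args are standard gdb options, while the helpers gdb_backtrace_cmd and format_cmd are hypothetical.

```python
import shlex


def gdb_backtrace_cmd(python="python", code="import theano; theano.test(verbose=2)"):
    """Build an argv that runs `python -c code` under gdb and prints a backtrace.

    -batch makes gdb exit once its commands are done; each -ex runs in order:
    "run" starts the program (the "r" step) and "bt" prints the backtrace
    after it stops at the segfault. --args must be the last gdb option.
    """
    return ["gdb", "-batch", "-ex", "run", "-ex", "bt", "--args", python, "-c", code]


def format_cmd(argv):
    """Quote an argv list so it can be pasted into a POSIX shell."""
    return " ".join(shlex.quote(part) for part in argv)


if __name__ == "__main__":
    # Pass the list to subprocess.call() instead to run it directly.
    print(format_cmd(gdb_backtrace_cmd()))
```

Running the script prints gdb -batch -ex run -ex bt --args python -c 'import theano; theano.test(verbose=2)', which can be pasted into a terminal and its output redirected to a file.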

The problem seems related to cuDNN. If you want to start using Theano now, you can disable cuDNN completely with this flag:

dnn.enabled=False
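In .theanorc, a dotted flag name such as dnn.enabled maps to option "enabled" in section [dnn]; undotted names go in [global]. A minimal Python 3 sketch that applies such a flag to theanorc text with configparser; the helper set_theano_option is hypothetical, and comment lines are not preserved when the text is rewritten.

```python
import configparser
import io


def set_theano_option(rc_text, dotted_name, value):
    """Set a Theano option in theanorc text and return the new text."""
    # "dnn.enabled" -> section "dnn", option "enabled"; "floatX" -> [global].
    section, _, option = dotted_name.rpartition(".")
    section = section or "global"
    parser = configparser.ConfigParser()
    parser.optionxform = str  # preserve case, e.g. "floatX"
    parser.read_string(rc_text)
    if not parser.has_section(section):
        parser.add_section(section)
    parser.set(section, option, str(value))
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()
```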

Ruben Dario Fonnegra Tarazona

Nov 23, 2017, 11:19:39 AM
to theano-dev
Hi Frederic.

Here is the output of the r and bt commands. Additionally, I added a dnn section with enabled=False to the theanorc file and tried to run a model, and it worked. However, I still want to know what is causing the problem and how I can get Theano to work with cuDNN. Thanks in advance.
2nd_r_bt.txt

Frédéric Bastien

Nov 24, 2017, 5:02:11 PM
to thean...@googlegroups.com
Can you uninstall cuDNN and install cuDNN version 7? The segfault is inside cuDNN, and there have been some fixes since v6. This could fix your problem.
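Theano prints the cuDNN version as a single integer ("Using cuDNN version 6021" in the log above). A small sketch for checking whether an upgrade took effect, assuming the encoding major*1000 + minor*100 + patch that cudnn.h uses for cuDNN 6 and 7; both helper names are hypothetical.

```python
import re

_VERSION_LINE = re.compile(r"Using cuDNN version (\d+)")


def cudnn_version_from_log(text):
    """Return the integer cuDNN version Theano reported, or None if absent."""
    match = _VERSION_LINE.search(text)
    return int(match.group(1)) if match else None


def decode_cudnn_version(version):
    """Split an integer cuDNN version into (major, minor, patch)."""
    return version // 1000, (version % 1000) // 100, version % 100
```

With this encoding, 6021 is cuDNN 6.0.21, and any reported value of 7000 or more means a version 7 library is the one being loaded.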


Frédéric Bastien

Nov 24, 2017, 5:03:01 PM
to thean...@googlegroups.com
If that doesn't fix the problem, can you give us a way to reproduce it? Are you able to reproduce it on a different computer?

Ruben Dario Fonnegra Tarazona

Nov 26, 2017, 1:57:12 PM
to theano-dev
Hi Frederic. I just installed cuDNN 7 and everything worked perfectly. Thank you very much for your help and support. (Y)