float16 issues


Fabian Isensee

Jun 21, 2017, 3:19:36 AM
to lasagne-users
Hey there,
I recently wanted to switch to float16 because I have serious problems with GPU memory (I am working on 3D medical image data). However, I ran into some issues that I don't know how to solve. I created a short standalone Python script to demonstrate them (it uses the iris dataset and a simple fully connected neural network, see attached; a rough sketch of its structure is below). If I run it with THEANO_FLAGS=floatX=float32 it returns the output shown after the sketch:
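(Condensed sketch of the script, just so the post is self-contained -- layer sizes, the data handling and the training loop here are illustrative; the attached float16_support.py is the authoritative version.)

import numpy as np
import theano
import theano.tensor as T
import lasagne
from sklearn.datasets import load_iris

floatX = theano.config.floatX

# iris: 150 samples, 4 features, 3 classes (train/test split omitted here)
data = load_iris()
X = data.data.astype(floatX)
y = data.target.astype('int32')

input_var = T.matrix('inputs')      # dtype follows floatX (float32 or float16)
target_var = T.ivector('targets')

# simple fully connected net with a Gaussian noise layer
net = lasagne.layers.InputLayer((None, 4), input_var)
net = lasagne.layers.GaussianNoiseLayer(net, sigma=0.1)
net = lasagne.layers.DenseLayer(net, num_units=32)
output_layer = lasagne.layers.DenseLayer(net, num_units=3,
                                         nonlinearity=lasagne.nonlinearities.softmax)

prediction = lasagne.layers.get_output(output_layer)
loss_tr = lasagne.objectives.categorical_crossentropy(prediction, target_var).mean()
updates = lasagne.updates.adam(loss_tr,
                               lasagne.layers.get_all_params(output_layer, trainable=True),
                               0.0005)
train_fn = theano.function([input_var, target_var], loss_tr, updates=updates)
# (epoch loop with the train/test loss and accuracy printout omitted)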

 
epoch 0: train loss: 1.375314 train acc 0.265000 test loss 1.171191 test acc 0.585000
epoch 1: train loss: 1.223649 train acc 0.375000 test loss 1.055289 test acc 0.595000
epoch 2: train loss: 1.116125 train acc 0.465000 test loss 0.956632 test acc 0.555000
epoch 3: train loss: 0.997714 train acc 0.530000 test loss 0.870735 test acc 0.605000
epoch 4: train loss: 1.016216 train acc 0.535000 test loss 0.786455 test acc 0.700000
epoch 5: train loss: 0.858704 train acc 0.670000 test loss 0.708278 test acc 0.765000
epoch 6: train loss: 0.910774 train acc 0.600000 test loss 0.690680 test acc 0.780000
epoch 7: train loss: 0.806082 train acc 0.620000 test loss 0.621082 test acc 0.840000
epoch 8: train loss: 0.744155 train acc 0.700000 test loss 0.598704 test acc 0.825000
epoch 9: train loss: 0.676767 train acc 0.720000 test loss 0.556982 test acc 0.885000
epoch 10: train loss: 0.674348 train acc 0.735000 test loss 0.532380 test acc 0.880000
epoch 11: train loss: 0.650707 train acc 0.730000 test loss 0.528952 test acc 0.885000
epoch 12: train loss: 0.563217 train acc 0.775000 test loss 0.461938 test acc 0.905000
epoch 13: train loss: 0.596845 train acc 0.770000 test loss 0.441529 test acc 0.875000
epoch 14: train loss: 0.572269 train acc 0.765000 test loss 0.475156 test acc 0.845000
epoch 15: train loss: 0.588880 train acc 0.755000 test loss 0.444637 test acc 0.865000
epoch 16: train loss: 0.586087 train acc 0.760000 test loss 0.403933 test acc 0.880000
epoch 17: train loss: 0.496034 train acc 0.800000 test loss 0.414725 test acc 0.915000
epoch 18: train loss: 0.472452 train acc 0.820000 test loss 0.402316 test acc 0.910000
epoch 19: train loss: 0.528015 train acc 0.790000 test loss 0.428631 test acc 0.835000
epoch 20: train loss: 0.474193 train acc 0.825000 test loss 0.388237 test acc 0.880000
epoch 21: train loss: 0.517822 train acc 0.765000 test loss 0.345394 test acc 0.905000
epoch 22: train loss: 0.438922 train acc 0.830000 test loss 0.339430 test acc 0.900000
epoch 23: train loss: 0.467753 train acc 0.800000 test loss 0.332779 test acc 0.910000
epoch 24: train loss: 0.422628 train acc 0.800000 test loss 0.319525 test acc 0.880000
epoch 25: train loss: 0.382614 train acc 0.875000 test loss 0.311163 test acc 0.865000
epoch 26: train loss: 0.403976 train acc 0.820000 test loss 0.374219 test acc 0.825000
epoch 27: train loss: 0.369395 train acc 0.860000 test loss 0.303532 test acc 0.905000
epoch 28: train loss: 0.401725 train acc 0.835000 test loss 0.310684 test acc 0.870000
epoch 29: train loss: 0.342000 train acc 0.895000 test loss 0.303024 test acc 0.890000

So it's working as expected. When running with float16, however, I do get a lot of error messages like these:

Disabling C code for Elemwise{mul,no_inplace} due to unsupported float16
Disabling C code for Elemwise{mul,no_inplace} due to unsupported float16
Disabling C code for Elemwise{Cast{float32}} due to unsupported float16
ERROR (theano.gof.opt): Optimization failure due to: local_gpu_elemwise_careduce
ERROR (theano.gof.opt): node: GpuCAReduceCuda{add}{0}(GpuElemwise{sqr,no_inplace}.0)
ERROR (theano.gof.opt): TRACEBACK:
ERROR (theano.gof.opt): Traceback (most recent call last):
  File "/home/fabian/deeplearning_venv/local/lib/python2.7/site-packages/theano/gof/opt.py", line 2036, in process_node
    remove=remove)
  File "/home/fabian/deeplearning_venv/local/lib/python2.7/site-packages/theano/gof/toolbox.py", line 569, in replace_all_validate_remove
    chk = fgraph.replace_all_validate(replacements, reason)
  File "/home/fabian/deeplearning_venv/local/lib/python2.7/site-packages/theano/gof/toolbox.py", line 518, in replace_all_validate
    fgraph.replace(r, new_r, reason=reason, verbose=False)
  File "/home/fabian/deeplearning_venv/local/lib/python2.7/site-packages/theano/gof/fg.py", line 486, in replace
    ". The type of the replacement must be the same.", old, new)
BadOptimization: BadOptimization Error 
  Variable: id 140574082265808 GpuCAReduceCuda{pre=sqr,red=add}{0}.0
  Op GpuCAReduceCuda{pre=sqr,red=add}{0}(GpuElemwise{sub,no_inplace}.0)
  Value Type: <type 'NoneType'>
  Old Value:  None
  New Value:  None
  Reason:  local_gpu_elemwise_careduce. The type of the replacement must be the same.
  Old Graph:
  GpuCAReduceCuda{add}{0} [id A] <GpuArrayType<None>(float32, vector)> ''   
   |GpuElemwise{sqr,no_inplace} [id B] <GpuArrayType<None>(float16, matrix)> ''   
     |GpuElemwise{sub,no_inplace} [id C] <GpuArrayType<None>(float16, matrix)> ''   
       |GpuElemwise{add,no_inplace} [id D] <GpuArrayType<None>(float16, matrix)> ''   
       | |GpuDot22 [id E] <GpuArrayType<None>(float16, matrix)> ''   
       | | |GpuElemwise{add,no_inplace} [id F] <GpuArrayType<None>(float16, matrix)> ''   
       | | |W [id G] <GpuArrayType<None>(float16, matrix)>
       | |InplaceGpuDimShuffle{x,0} [id H] <GpuArrayType<None>(float16, row)> ''   
       |   |b [id I] <GpuArrayType<None>(float16, vector)>
       |GpuElemwise{Cast{float16}}[]<gpuarray> [id J] <GpuArrayType<None>(float16, row)> ''   
         |GpuElemwise{true_div,no_inplace} [id K] <GpuArrayType<None>(float32, row)> ''   
           |InplaceGpuDimShuffle{x,0} [id L] <GpuArrayType<None>(float32, row)> ''   
           |GpuFromHost<None> [id M] <GpuArrayType<None>(float32, (True, True))> ''   
  New Graph:
  GpuCAReduceCuda{pre=sqr,red=add}{0} [id N] <GpuArrayType<None>(float16, vector)> ''   
   |GpuElemwise{sub,no_inplace} [id C] <GpuArrayType<None>(float16, matrix)> ''   


Hint: relax the tolerance by setting tensor.cmp_sloppy=1
  or even tensor.cmp_sloppy=2 for less-strict comparison

 
I am not worried by the "Disabling C code for Elemwise{mul,no_inplace} due to unsupported float16" messages, since those only fall back to (slower) Python (numpy?) implementations. However, the optimization errors bug me, and more importantly, the network no longer trains properly:

epoch 0: train loss: nan train acc 0.390000 test loss nan test acc 0.175000
epoch 1: train loss: nan train acc 0.375000 test loss nan test acc 0.215000
epoch 2: train loss: nan train acc 0.360000 test loss nan test acc 0.140000
epoch 3: train loss: nan train acc 0.390000 test loss nan test acc 0.240000
epoch 4: train loss: nan train acc 0.370000 test loss nan test acc 0.215000
epoch 5: train loss: nan train acc 0.390000 test loss nan test acc 0.190000
epoch 6: train loss: nan train acc 0.345000 test loss nan test acc 0.200000
epoch 7: train loss: nan train acc 0.345000 test loss nan test acc 0.230000
epoch 8: train loss: nan train acc 0.430000 test loss nan test acc 0.210000
epoch 9: train loss: nan train acc 0.375000 test loss nan test acc 0.190000
epoch 10: train loss: nan train acc 0.390000 test loss nan test acc 0.190000
epoch 11: train loss: nan train acc 0.340000 test loss nan test acc 0.160000
epoch 12: train loss: nan train acc 0.410000 test loss nan test acc 0.265000
epoch 13: train loss: nan train acc 0.360000 test loss nan test acc 0.225000
epoch 14: train loss: nan train acc 0.445000 test loss nan test acc 0.165000
epoch 15: train loss: nan train acc 0.350000 test loss nan test acc 0.245000
epoch 16: train loss: nan train acc 0.345000 test loss nan test acc 0.225000
epoch 17: train loss: nan train acc 0.410000 test loss nan test acc 0.185000
epoch 18: train loss: nan train acc 0.420000 test loss nan test acc 0.195000
epoch 19: train loss: nan train acc 0.335000 test loss nan test acc 0.185000
epoch 20: train loss: nan train acc 0.360000 test loss nan test acc 0.185000
epoch 21: train loss: nan train acc 0.335000 test loss nan test acc 0.265000
epoch 22: train loss: nan train acc 0.360000 test loss nan test acc 0.220000
epoch 23: train loss: nan train acc 0.345000 test loss nan test acc 0.175000
epoch 24: train loss: nan train acc 0.350000 test loss nan test acc 0.255000
epoch 25: train loss: nan train acc 0.355000 test loss nan test acc 0.235000
epoch 26: train loss: nan train acc 0.350000 test loss nan test acc 0.185000
epoch 27: train loss: nan train acc 0.370000 test loss nan test acc 0.165000
epoch 28: train loss: nan train acc 0.365000 test loss nan test acc 0.255000
epoch 29: train loss: nan train acc 0.420000 test loss nan test acc 0.160000

I attached the complete output of the float16 run to this post as well. Any help would be very much appreciated!

Cheers,

Fabian

PS: I also tried running with tensor.cmp_sloppy=2 (as hinted by one of the error messages), but that did not help. I set the variable in .theanorc under [tensor] and verified the value in IPython (In [4]: theano.config.tensor.cmp_sloppy Out[4]: 2), but the exact same error suggesting cmp_sloppy=1 or 2 reappeared. Strange.
float16_support.py
float_16_output.txt

Frédéric Bastien

Jun 21, 2017, 9:05:52 AM
to lasagne-users

We have fixed float16 problems since the last release. Update to the dev version of Theano and update libgpuarray to 0.6.6.



Fabian Isensee

Jun 21, 2017, 9:24:50 AM
to lasagne-users
Hi Frédéric,
I just updated everything. Still the same problem. I have attached a file showing the output when I run my script in IPython. At the end I check the Theano and Lasagne versions as well as the values used for floatX and tensor.cmp_sloppy.
Did you try running my script on your machine? Does it give the same results?
Cheers,
Fabian
float_16_output.txt

Fabian Isensee

Jun 28, 2017, 6:50:11 AM
to lasagne-users
Hi Frédéric,
did you have the opportunity to run my script on your machine in the meantime? It would be great if we could solve this issue! Let me know if there is anything I can do.
Cheers,
Fabian

Frédéric Bastien

Jun 29, 2017, 9:37:48 AM
to lasagn...@googlegroups.com
Hi,

The optimization errors can be ignored; Theano just skips them. They are more a warning that Theano needs some fixes than an actual error. Since Theano simply skips the problematic optimization, your code should be fine.

I'm able to run your script, so I'll be able to track down the optimization error.

So the real problem is that it doesn't train anymore. This could be caused by a bug in Theano, or simply by float16 storage not working with your model.

I would guess it is the second case. NVIDIA found that just moving to float16 causes too many gradients to be truncated to 0 in many models. They use a simple trick: scale the cost by a factor of 256 to 2048, then adjust the learning rate accordingly. This keeps the computed gradients from being truncated to zero.
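Something along these lines (a sketch only -- the names loss, output_layer, input_var and target_var stand for whatever your script already defines, and 1024 is just one value in the suggested range):

# loss-scaling trick: multiply the cost so small gradients survive float16,
# then divide the learning rate by the same factor to keep the step size unchanged
scale = 1024.0
base_lr = 0.001
params = lasagne.layers.get_all_params(output_layer, trainable=True)
updates = lasagne.updates.sgd(loss * scale, params, learning_rate=base_lr / scale)
train_fn = theano.function([input_var, target_var], loss, updates=updates)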

Can you try that?


Jan Schlüter

Jun 29, 2017, 10:52:57 AM
to lasagne-users
They use a simple trick: scale the cost by a factor of 256 to 2048, then adjust the learning rate accordingly.

Or if you use ADAM, leave the learning rate as it is -- ADAM is invariant to the scale of the cost (and gradients).

Nice trick!

Fabian Isensee

Jun 29, 2017, 11:09:03 AM
to lasagne-users
Hi,
thank you very much for your replies! Unfortunately, I am already using adam
updates = lasagne.updates.adam(loss_tr, lasagne.layers.get_all_params(output_layer, trainable=True), 0.0005)
which is why I am a bit confused about the network still not training properly. I will try to scale the cost and learning rate and get back to you as soon as I figure something out.
Cheers,
Fabian
Message has been deleted

Fabian Isensee

Jun 29, 2017, 11:25:19 AM
to lasagne-users
Hi,
I re-uploaded my script. It now uses Nesterov momentum instead of adam, and I multiply the loss by 1024 and divide the learning rate by 1024. The network trains with float32, but still does not train with float16 (still NaNs in the loss).
Since Theano, Lasagne and libgpuarray are up to date, maybe my driver or CUDA version is the problem?

driver: Driver Version: 367.44

cuda: nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61

Your help is very much appreciated!!
Cheers,
Fabian

float16_support.py

Frédéric Bastien

Jun 29, 2017, 12:02:56 PM
to lasagne-users
Hi,

I have a PR that fixes the opt warning, but it won't fix your non-training problem: https://github.com/Theano/Theano/pull/6088

I think the problem is the model itself, not the GPU code, driver, or CUDA version (provided you use up-to-date Theano and libgpuarray versions).

Try NanGuardMode. It will help you find where the model generates NaNs, and that can help you fix the problem. Some code uses an eps that gets rounded to 0 in float16, for example.
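For example (a sketch -- reuse whatever inputs, outputs and updates your training function already has):

from theano.compile.nanguardmode import NanGuardMode

# every node's outputs are checked for NaN, Inf and very large values at run time
train_fn = theano.function(
    [input_var, target_var], loss_tr, updates=updates,
    mode=NanGuardMode(nan_is_error=True, inf_is_error=True, big_is_error=True))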

Fred


Fabian Isensee

Jun 29, 2017, 12:49:31 PM
to lasagne-users
Hi everyone,
thank you very much for your replies. Thanks to NanGuardMode and selectively disabling components of the network, I boiled the problem down to the GaussianNoiseLayer. From there I looked into the implementation and at the error message (see below), which indicates that the error happens in self._srng.normal() (of MRG_RandomStreams). I tracked it down to line 1067 of rng_mrg.py, where the following happens: sqrt_ln_U1 = sqrt(-2.0 * log(U1)). It seems like U1 (which may be the product of the values along a specific axis) gets rounded to zero. Maybe we need an epsilon here? (A small numpy illustration is below the trace.)

AssertionError: Inf detected
Big value detected
NanGuardMode found an error in the output of a node in this variable:
GpuElemwise{Composite{sqrt((i0 * log(i1)))}}[]<gpuarray> [id A] ''   
 |GpuArrayConstant{[-2.]} [id B]
 |GpuSubtensor{:int64:} [id C] ''   
   |GPUA_mrg_uniform{GpuArrayType<None>(float16, vector),inplace}.1 [id D] ''   
   | |<GpuArrayType<None>(int32, matrix)> [id E]
   | |MakeVector{dtype='int64'} [id F] ''   
   |   |Elemwise{Composite{(i0 + (i0 % i1))}}[(0, 0)] [id G] ''   
   |     |Prod{axis=None, dtype='int64', acc_dtype='int64'} [id H] ''   
   |     | |MakeVector{dtype='int64'} [id I] ''   
   |     |   |Shape_i{0} [id J] ''   
   |     |   | |<TensorType(float16, matrix)> [id K]
   |     |   |Shape_i{1} [id L] ''   
   |     |     |W [id M]
   |     |TensorConstant{2} [id N]
   |ScalarFromTensor [id O] ''   
     |Elemwise{IntDiv}[(0, 0)] [id P] ''   
       |Prod{axis=None, dtype='int64', acc_dtype='int64'} [id Q] ''   
       | |MakeVector{dtype='int64'} [id R] ''   
       |   |Shape_i{0} [id S] ''   
       |     |HostFromGpu(gpuarray) [id T] ''   
       |       |GPUA_mrg_uniform{GpuArrayType<None>(float16, vector),inplace}.1 [id D] ''   
       |TensorConstant{2} [id N]
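To illustrate the suspected effect in isolation (plain numpy, not the Theano code path; 1e-8 is just an arbitrary value below float16's smallest subnormal):

import numpy as np

u1 = np.float16(1e-8)                # rounds to 0.0 in float16
print(u1)                            # 0.0
print(np.sqrt(-2.0 * np.log(u1)))    # inf -- exactly what NanGuardMode flags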


Cheers,

Fabian

Jan Schlüter

Jun 29, 2017, 3:18:38 PM
to lasagne-users
thank you very much for your replies! Unfortunately, I am already using adam
updates = lasagne.updates.adam(loss_tr, lasagne.layers.get_all_params(output_layer, trainable=True), 0.0005)
which is why I am a bit confused about the network still not training properly. I will try to scale the cost and learning rate

Just to clarify (although it turned out not to be the issue): With ADAM, you'd only scale the cost, and leave the learning rate as it is. This would still avoid gradients becoming too small to be expressed in float16 precision.
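In code that would be roughly (a sketch reusing the names from your snippet; the factor is arbitrary within the suggested range):

scale = 512.0   # illustrative
updates = lasagne.updates.adam(loss_tr * scale,
                               lasagne.layers.get_all_params(output_layer, trainable=True),
                               0.0005)   # learning rate left unchanged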

Good detective work with the NanGuardMode!

Fabian Isensee

Jun 30, 2017, 7:59:42 AM
to lasagne-users
Hi Jan,
thanks for getting back to me. Yeah I may have misunderstood what you said - my bad!
Do you think we should add an epsilon to that specific line? It should not influence fp32 networks but may help others run fp16 in the future.
Cheers,
Fabian

Jan Schlüter

Jun 30, 2017, 9:26:39 AM
to lasagne-users
thanks for getting back to me. Yeah I may have misunderstood what you said - my bad!
Do you think we should add an epsilon to that specific line? It should not influence fp32 networks but may help others run fp16 in the future.

I'm not familiar with that code, so I can't tell whether that's the best solution. In any case, "it should not influence fp32 networks" is not enough; any change would affect backwards compatibility (i.e., experiments could not be exactly reproduced) -- if at all, this would need to be limited to float16. You can try in your local Theano installation whether this fixes the problem for you, or have a look at the surrounding code and try to understand what it is trying to accomplish in the first place.

Best, Jan

Fabian Isensee

Jun 30, 2017, 10:05:53 AM
to lasagne-users
Hi Jan,
you are right, that would probably not have been a very good idea without testing the consequences thoroughly.
After being able to run the simple fully connected network, I got overly optimistic about running my actual stuff in float16, with the result that neither segmentation nor classification would work (getting nan's again). Therefore I (again) prepared a little standalone script to demonstrate my problem. This will run cifar10 on a very simple network. On float32 it will train (slowly), but on float16 NanGuardMode instantly raises an assertion during the first backpropagation. Unfortunately, I was unable to make sense of the error message, and everything I have tried to get it to run has failed so far. Do you have any idea what the problem could be?
I appreciate your help very much!
Best,
Fabian
float16_cifar.py
float16_cifar_output.txt
architecture.png

Jan Schlüter

Jun 30, 2017, 11:39:14 AM
to lasagne-users
After being able to run the simple fully connected network, I got overly optimistic about running my actual stuff in float16, with the result that neither segmentation nor classification would work (getting nan's again). Therefore I (again) prepared a little standalone script to demonstrate my problem. This will run cifar10 on a very simple network.

Just by looking at the graph -- did you try the same without batch normalization? (Using an initialization and hyperparameters that are known to work reasonably well on CIFAR10, you may need to check some papers and experiment with float32 first.)

Best, Jan

Frédéric Bastien

Jul 1, 2017, 12:13:04 PM
to lasagne-users
I don't think it is as easy as just adding an epsilon; that would break the interface. The real fix is more complicated. I see 2 ways to fix the real cause:

1) Modify our uniform sampler to never return 0. This is what is done in the basic implementation on Wikipedia, but we don't do it in Theano: https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform#Implementation
2) Find another algorithm to convert the uniform samples to normal ones.

In the meantime, if you don't care too much about the quality of the random numbers, you could just use an epsilon in your code.
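For example, as a purely local workaround (not an upstream fix -- the eps value is arbitrary, it only needs to stay nonzero in float16), the line you quoted from rng_mrg.py could be patched along these lines:

# local change in theano/sandbox/rng_mrg.py, around the quoted line;
# clamp U1 away from 0 so log() cannot return -inf
eps = 1e-6
sqrt_ln_U1 = sqrt(-2.0 * log(U1 + eps))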

Fred


Frédéric Bastien

Jul 1, 2017, 12:20:28 PM
to lasagne-users
I got it wrong: our MRG uniform sampler should exclude the lower bound (0), so 0 should never have been generated.

Currently, from what I understand of the code, we just cast the generated values to float16. I think we also do that for float32! In float16 we end up casting to 0 too frequently; this is probably why we didn't find this before.
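A small numpy illustration of what that cast does (the values are just examples):

import numpy as np

u = np.float32(2.0 ** -26)       # a perfectly valid nonzero float32 uniform sample
print(u)                         # ~1.49e-08
print(u.astype(np.float16))      # 0.0 -- anything below ~3e-08 rounds to zero in float16
print(np.float16(2.0 ** -24))    # ~6e-08, the smallest nonzero (subnormal) float16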

Fred

Frédéric Bastien

Jul 1, 2017, 12:26:13 PM
to lasagne-users