NaNs in lrn layer during the GPU forward pass

Alex Binder

Nov 5, 2015, 7:51:11 AM
to Caffe Users
Hi all,

Specifically, I am talking about scale[index] in

template <typename Dtype>
__global__ void LRNComputeOutput(const int nthreads, const Dtype* const in,
    const Dtype* const scale, const Dtype negative_beta, Dtype* const out)


in lrn_layer.cu, but the same problem also appears when using the CPU.
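
To make concrete what that kernel boils down to per element (the CPU path in lrn_layer.cpp does the equivalent): out[index] = in[index] * pow(scale[index], negative_beta). A minimal standalone C++ sketch of that step, with my own names rather than the actual Caffe code:

// Standalone sketch of the element-wise LRN output step (my own names, not
// the actual Caffe code): out[index] = in[index] * scale[index]^negative_beta.
#include <cmath>
#include <cstdio>

void lrn_compute_output(int nthreads, const float* in, const float* scale,
                        float negative_beta, float* out) {
  for (int index = 0; index < nthreads; ++index) {
    // std::pow with a negative base and a non-integer exponent returns NaN,
    // so anything that pushes scale[index] below zero poisons the output.
    out[index] = in[index] * std::pow(scale[index], negative_beta);
  }
}

int main() {
  const float in[3] = {1.0f, 2.0f, 3.0f};
  const float scale[3] = {1.0f, 1.5f, 2.0f};
  float out[3];
  lrn_compute_output(3, in, scale, -0.75f, out);
  std::printf("%f %f %f\n", out[0], out[1], out[2]);
  return 0;
}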



With custom data I observe NaNs after as few as 4 iterations; they occur during the forward pass in the LRN layer with across_channels normalization. It is definitely the forward pass and definitely the LRN layer; I can code in C++ (for CPUs, no experience with GPU programming) and have traced it back.

It happens as early as iteration 4 with base_lr = 0.01 and momentum = 0.9.

I am doing retraining, so switching to within_channel normalization is not an option for me. I am also under time pressure, which makes learning GPU programming in 3 hours a bit hard.

I am using a Caffe master from 3 October 2015. I cannot switch to a newer Caffe master because my version of Caffe computes extra quantities in the backward pass (LRP, Bach et al., PLOS ONE 2015).

My question is: has anybody had similar trouble?
How did you fix it?


I have checked: my mean file does get used in data_transformer.cpp, so that is not the problem.

k_ in lrn_layer.cpp is set to 1.

I have found the problem: scale[index] becomes negative, on the order of -10^3. Of course, taking a pow with beta = 0.75 then yields funny NaNs: what is (-1000)^{0.75}? :)

My question is: should scale[index] in lrn_layer.cu ever become negative, or not?
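
For reference on how I understand the across-channel case: scale is accumulated as k plus an alpha-weighted (per window size) sum of squared inputs over the local channel window, so with k_ = 1 and a non-negative alpha it should stay at or above 1. A small sketch under that assumption, together with the pow behaviour that produces the NaN (my own names, not the Caffe code):

// What I understand the across-channel scale to be (LRN definition, my own
// names, not the Caffe code): scale = k + (alpha / n) * sum(x^2) over the
// local channel window, which cannot drop below k when alpha >= 0.
#include <cmath>
#include <cstdio>

float lrn_scale(const float* x, int n, float k, float alpha) {
  float sum_sq = 0.0f;
  for (int i = 0; i < n; ++i) sum_sq += x[i] * x[i];
  return k + (alpha / n) * sum_sq;  // >= k whenever alpha >= 0
}

int main() {
  const float window[5] = {0.5f, -1.0f, 2.0f, 0.0f, 1.5f};
  std::printf("scale for a sane window: %f\n",
              lrn_scale(window, 5, 1.0f, 0.0001f));
  // The "funny NaN": a negative base with a non-integer exponent.
  std::printf("(-1000)^0.75 = %f\n", std::pow(-1000.0, 0.75));  // not finite
  return 0;
}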

Best, Alex


Youssef

Jun 30, 2016, 7:41:51 AM
to Caffe Users
Hello Alex,

I'm running into a similar problem when running examples/cifar10/cifar10_full_solver.prototxt with the net examples/cifar10/cifar10_full_train_test.prototxt.
Randomly, regardless of how long I've been training, the norm1 layer produces one or more "inf" elements during the forward pass. This results in the weights taking on NaN values. For all iterations that follow, the loss stays fixed at 87.3365 and the accuracy is fixed at 0.1 (1 of 10 classes).

I'm unable to reproduce it systematically; it occurs at random iterations during training. I only know that an "inf" is produced by the norm1 layer (an LRN layer). Even starting training from a snapshot directly preceding the failure does not guarantee another failure; sometimes the training just proceeds normally and possibly fails at a later iteration.
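
A generic non-finite scan over a float buffer is one way to catch the exact iteration and layer where it happens; in Caffe it could be pointed at the norm1 top blob's cpu_data()/count(), though that wiring is left out here. A minimal sketch:

// Generic scan of a float buffer for non-finite values, e.g. a layer's top
// data after the forward pass (the Caffe wiring via cpu_data()/count() is
// assumed, not shown).
#include <cmath>
#include <cstdio>

int count_nonfinite(const float* data, int n) {
  int bad = 0;
  for (int i = 0; i < n; ++i) {
    if (!std::isfinite(data[i])) ++bad;  // catches both inf and NaN
  }
  return bad;
}

int main() {
  const float sample[4] = {0.0f, 1.0f, INFINITY, NAN};
  std::printf("non-finite elements: %d\n", count_nonfinite(sample, 4));  // 2
  return 0;
}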

I suspected a division by zero somewhere. The norm1 layer receives its input from a ReLU, so I thought maybe a local input patch of zeros was causing this. I set up a dummy network with only an LRN layer and found that feeding it an all-zero input produces an all-zero output, so a zero input wasn't the problem.
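
The same check can be done directly against the LRN formula rather than through a dummy prototxt (my own reconstruction, not the network I actually ran): with an all-zero window the scale collapses to k and the output is exactly zero, so there is no division by zero.

// Zero-input check against the LRN formula itself (my own reconstruction,
// not the dummy prototxt): with all-zero input, scale stays at k and the
// output is 0 * k^(-beta) = 0, so a zero patch cannot be the culprit.
#include <cmath>
#include <cstdio>

int main() {
  const int n = 5;                                      // local window size
  const float k = 1.0f, alpha = 0.0001f, beta = 0.75f;
  const float x[n] = {0.0f, 0.0f, 0.0f, 0.0f, 0.0f};    // all-zero window
  float sum_sq = 0.0f;
  for (int i = 0; i < n; ++i) sum_sq += x[i] * x[i];
  const float scale = k + (alpha / n) * sum_sq;         // stays at k = 1
  const float out = x[n / 2] * std::pow(scale, -beta);  // exactly 0
  std::printf("scale = %f, out = %f\n", scale, out);
  return 0;
}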

I'm very concerned about this randomness.

I will check whether scale[index] takes on a similar value to the one you observed.

Best,

Youssef