Loss Function Query Relating to Futile Training

Ross Andrew Donnachie

May 29, 2020, 3:18:13 PM
to knet-users
Good day,

I am looking to create a detection-localisation CNN. I am starting with a pretty crude architecture, so suffice it to say that it is merely conv4, pool and FC layers. In order to have it learn localisation, I sought to implement a loss function based on intersection with the groundtruth mask (inspired by the common IoU metric). It is defined as IntersectionLoss as follows:

```
IntersectionLoss(x, g) = sum(abs.(g .- x))    # L1 difference between prediction x and ground truth g
(c::Chain)(x, y) = IntersectionLoss(c(x), y)  # calling the Chain with (x, y) returns the loss

cnn(first(dtst)[1], first(dtst)[2])           # sanity check: loss on the first test minibatch
```
The output is obviously a scalar, e.g. 538.0f0.

The trainresults function is taken straight from the excellent MNIST-CNN example. Training results in absolutely no change in the model's mean loss, across various learning rates.
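For reference, the training call in that tutorial boils down to something like the sketch below. This is a rough sketch, assuming the Adam optimiser, 10 epochs, and that `dtrn` is the training minibatch iterator; it is not the tutorial's `trainresults` verbatim:

```
using Knet: adam, progress!

# Rough sketch of the training loop: each step calls cnn(x, y), i.e. the IntersectionLoss above.
progress!(adam(cnn, repeat(dtrn, 10)))   # 10 epochs of Adam with default hyper-parameters
```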

I started thinking that the loss function actually has to define, for each cell in the output, how wrong it is. But when I changed the loss function to reflect that idea, the following error occurred:

```
AssertionError: Only scalar valued functions supported.

Stacktrace:
 [1] differentiate(::Function; o::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at C:\Users\Ross\.julia\packages\AutoGrad\6QsMu\src\core.jl:153
```

I am looking for a little pointer on how to get further with this idea, please!

Kind Regards,
Ross

Iulian-Vasile Cioarca

May 30, 2020, 3:45:21 AM
to knet-users
It's hard to point you in the right direction without seeing the implementation of the network and the loss function. Do you use a cell-grid approach?
You can have a look at how the YOLO loss was implemented:
https://github.com/Ybakman/YoloV2-Trainable/blob/master/YoloLoss.jl

Ross Andrew Donnachie

May 30, 2020, 4:11:41 AM
to knet-users
Upon further investigation it seems that ```@diff cnn(first(dtrn)[1], first(dtrn)[2])``` yields all zeros; that is to say, the gradient of the model with this loss function is zero. Obviously that leads to no changes in a training cycle.

@Iulian-Vasile I understand! I didn't want to do an unsolicited code dump, especially because I find this platform pretty bad at presenting code (or I haven't figured it out yet). I will look into your shared link, thanks! If I find a solution I will share it; otherwise I will share more code in the hope that someone can take the time to help.

Until then!

Ross Andrew Donnachie

May 30, 2020, 4:26:06 AM
to knet-users
To add more information, without code:

The CNN has the following inter-layer data sizes:

```
6-element Array{Any,1}:
 (128, 160, 1, 1)
 (40, 56, 20, 1)
 (18, 26, 50, 1)
 (14, 22, 60, 1)
 (12, 20, 1, 1)
 (1280, 1)
```

The result is a flat 2D array, with each value representing a 4x4 pixel window in an equivalent mask (1280 = 32*40 = 128/4 * 160/4). The groundtruth binary mask (I only have one detection class) is collapsed to this representative form with

```
reshape(pool(mask; window=(4,4), stride=(4,4)), (1280, 1))
```
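To make the shape bookkeeping concrete, here is a small sketch that collapses a dummy 128x160 mask the same way. The random `mask` is purely illustrative, and it assumes Knet's `pool` accepts a plain CPU `Array` here (on GPU it would be a `KnetArray`):

```
using Knet: pool

# Dummy mask, only to verify the shapes described above (128/4 * 160/4 = 32*40 = 1280).
mask = reshape(Float32.(rand(Bool, 128, 160)), 128, 160, 1, 1)  # pool expects a 4-D (W,H,C,N) array
pooled = pool(mask; window=(4,4), stride=(4,4))                 # max-pool: 1 if any pixel in a 4x4 window is set
@assert size(pooled) == (32, 40, 1, 1)
g = reshape(pooled, 1280, 1)                                    # matches the (1280, 1) network output above
@assert size(g) == (1280, 1)
```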


Iulian-Vasile Cioarca

May 30, 2020, 4:26:33 AM
to knet-users
I had a similar issue with zero gradients once. It turned out I had forgotten to track one of the layers.
Another time was when I tried to train the YOLO network from the link above from scratch. The YOLO pad function always preallocated some zeros, and that broke the chain of differentiation, giving zero gradients.
Another thing to keep in mind during training is to constantly check for exploding gradients, which lead to weights being updated to NaN. Since I don't know whether Knet has a built-in mechanism for that, I usually have a function check all the weights every few epochs during the training loop (a minimal sketch of such a check follows below).
Checking for Inf is also useful, since it might indicate an error in your loss function (divide by zero) or (as in the case of YOLO) exponentiating large values from the forward pass when estimating the bounding-box dimensions.
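The sketch below is one possible shape for such a check; `check_weights` is a hypothetical helper name, not a Knet built-in, and it only assumes Knet's `params` and `value` accessors:

```
using Knet: params, value

# Hypothetical helper (not part of Knet): scan every parameter for NaN/Inf entries.
function check_weights(model)
    for (i, p) in enumerate(params(model))
        w = Array(value(p))    # copy to CPU so the scan also works for KnetArray parameters
        any(isnan, w) && @warn "NaN in parameter $i of size $(size(w))"
        any(isinf, w) && @warn "Inf in parameter $i of size $(size(w))"
    end
end

# e.g. call check_weights(cnn) every few epochs inside the training loop
```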

Ross Andrew Donnachie

May 30, 2020, 5:10:47 AM
to knet-users
I think a similar pitfall has happened to me. 

The gradient is all zeros after training; before training it is non-trivial. I believe that, with this loss function, the network has found the local minimum of always outputting an all-zero mask.
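For what it's worth, a common way around that kind of collapse is to make the loss a ratio rather than an absolute count, e.g. a soft-IoU loss. The sketch below shows that standard idea (it is not code from this thread) and assumes the raw network output is squashed to (0, 1) with `sigm` first:

```
using Knet: sigm

# Soft-IoU loss (standard idea, not from this thread): an all-zero prediction scores close to 1
# whenever the ground-truth mask g is non-empty, unlike the plain L1 sum used above.
function SoftIoULoss(x, g; eps=1f-6)
    p = sigm.(x)                       # squash raw scores to (0, 1)
    inter = sum(p .* g)
    uni = sum(p) + sum(g) - inter
    return 1 - (inter + eps) / (uni + eps)
end
```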

For future reference, the gradient is calculated as in an initial Knet tutorial:


```
using Knet: Knet, AutoGrad, dir, Data, Param, @diff, value, params, grad, progress, progress!

# Compute gradients on the loss function:
J = @diff cnn(first(dtrn)[1], first(dtrn)[2])
# J is a struct; to get the actual loss value from J:
@show value(J)
# params(J) returns an iterator of the Params J depends on (i.e. model.w, model.b):
@show params(J) |> collect

# To get the gradient of a parameter from J:
w = [grad(J, l.w) for l in cnn.layers]
# Note that each gradient has the same size and shape as the corresponding parameter:
b = [grad(J, l.b) for l in cnn.layers]
[maximum(Array(w)) for w in w]   # largest entry of each weight gradient
```
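As a quick follow-up, one way to confirm the all-zero observation across every parameter at once (a sketch; `grad` returns `nothing` for Params that were not used on the tape):

```
# true if every parameter's gradient on tape J is zero (or the parameter was unused)
all_zero = all(p -> (g = grad(J, p); g === nothing || all(iszero, Array(g))), params(cnn))
@show all_zero
```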

Thanks for being a sounding board, @Iulian-Vasile!

