Custom loss function in Keras combining cross-entropy loss and MAE loss


adit bhardwaj

Jul 30, 2018, 4:31:54 PM
to keras...@googlegroups.com

Hi,

I am trying to perform a multi-class classification. Ideally I would use a cross-entropy loss to train my neural network. However, my classes are ordinal variables, so I would like my loss function to enforce some sort of order in the predictions: for example, if y_true = 2, I would prefer y_predict = 3 over y_predict = 4. For this, I am thinking of using a custom loss function that combines cross-entropy loss and mean absolute error after a softmax layer:

from keras import backend as K
from keras import losses

# relative weights for the two loss terms
loss_weight = [1, 0.0001]
loss_weight_tensor = K.variable(value=loss_weight)

def custom_loss(y_true, y_pred):
  # cross-entropy on the softmax output (y_true holds integer class labels)
  l1 = K.sparse_categorical_crossentropy(y_true, y_pred)
  # take the predicted class index from the softmax output
  y_pred_argmax = K.cast(K.argmax(y_pred, axis=1), dtype='float32')
  # absolute distance between the predicted and true class index
  l2 = losses.mean_absolute_error(y_pred_argmax, y_true)
  return l1 * loss_weight_tensor[0] + l2 * loss_weight_tensor[1]

Is there a fallacy in my thinking or in the construction of this loss function? Does it look like a valid loss function (piecewise differentiable, etc.) given that I am using argmax? Do you think the TensorFlow backend will compute a valid gradient? Or are there better alternatives for achieving ordinal classification?


Thanks,

Adit


Sergey O.

Jul 30, 2018, 5:47:10 PM
to adit bhardwaj, Keras-users
Hi Adit,
argmax is not differentiable; you'd want to work with the softmax output directly. mean_absolute_error could also be problematic (not differentiable at zero), so it's better to use mean_squared_error.

Assuming y_pred is categorical with softmax activation, something like this should work:

let's say you have:

y_true = 3
y_pred = (0,0,0,0,1,0) 

(0,0,0,0,1,0)  * (0,1,2,3,4,5) = (0,0,0,0,4,0)
sum((0,0,0,0,4,0)) = 4

(y_true - 4)^2 = 1

Implementation:

def cat_loss(y_true, y_pred):
  # class indices 0 .. num_classes-1, matching the last softmax dimension
  y_range = K.arange(0, K.shape(y_pred)[-1], dtype='float32')
  # expected class index under the softmax distribution
  y_p = K.sum(y_pred * y_range, -1, keepdims=True)
  # squared distance between the true class index and the expected index
  loss = K.sum(K.square(y_true - y_p), axis=1)
  return loss
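
A usage sketch, assuming y_true is fed as an integer class index (not one-hot) and the model ends in a softmax over the classes:

model.compile(optimizer='adam', loss=cat_loss)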

-Sergey


adit bhardwaj

Aug 2, 2018, 2:38:13 PM
to Sergey O., Keras-users
Hi Sergey,

Thanks for your suggestion. However, your explanation starts from the assumption that y_pred is a softmax output, but the example

y_pred = (0,0,0,0,1,0) contradicts that assumption, since (0,0,0,0,1,0) is the output of a hard-max, not a softmax, and in fact can only be obtained with the non-differentiable argmax.

In your code implementation, the loss will not yield what we expect (a higher penalty for a larger difference between the predicted and true class). For example, for a softmax prediction
y_pred = (0.05,0.1,0.3,0.1,0.4,0.05)
sum(y_pred * (0,1,2,3,4,5)) = 2.85, not 4

I did find a way to penalize the loss more when the predicted class is far from the true class: scale the cross-entropy loss differently for each (true class, predicted class) pair.
For example, with
true_y = i
pred_y = j

cross-entropy loss = C_ij * ( -Sum_d( y^_d * log(y_d) ) )

where y^_d is the dth component of the one-hot vector of true_y,
y_d is the dth component of the softmax output, and
C_ij is the scaling factor applied when true_y = i but the model predicts pred_y = j.
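
A rough sketch of how I might implement this in Keras (assuming 6 classes as in the earlier example, y_true arriving one-hot encoded, and a made-up distance-based cost matrix C; weighted_xent is just an illustrative name):

import numpy as np
from keras import backend as K

num_classes = 6  # assumption: 6 ordinal classes, as in the earlier example
# hypothetical cost matrix: C[i, j] grows with the distance between true class i and predicted class j
C = np.abs(np.arange(num_classes)[:, None] - np.arange(num_classes)[None, :]).astype('float32') + 1.0
C_tensor = K.constant(C)

def weighted_xent(y_true, y_pred):
  # y_true: one-hot (batch, num_classes); y_pred: softmax output (batch, num_classes)
  y_pred = K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon())
  true_idx = K.argmax(y_true, axis=-1)   # i
  pred_idx = K.argmax(y_pred, axis=-1)   # j (used only to look up a weight, so no gradient is needed through it)
  c_ij = K.gather(K.flatten(C_tensor), true_idx * num_classes + pred_idx)
  xent = -K.sum(y_true * K.log(y_pred), axis=-1)
  return c_ij * xent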

Thanks,
Adit


Sergey O.

Aug 2, 2018, 6:18:11 PM
to adit bhardwaj, Keras-users
If you minimize the loss, the hope is that (0.17,0.17,0.17,0.17,0.17,0.17) would become (0,0,0,0,1,0). With the initial random weights it, of course, will not be!

adit bhardwaj

Aug 2, 2018, 7:21:40 PM
to Sergey O., Keras-users
Hi Sergey,

Thanks for the prompt reply. I do get the intuition behind the trick. However, I still see one challenge.
The problem
f(y_pred) = sum(y_pred * (0,1,2,3,4,5)) subject to the constraint sum(y_pred) = 1

does not have a unique solution. For example, for y_true = 3,
one trivial solution is y_pred = (0,0,0,1,0,0), which is what we hope to reach from, say, the uniform softmax state (0.17,0.17,0.17,0.17,0.17,0.17).

But there are several other solutions for y_pred:
1. (0,0,.5,0,.5,0)*(0,1,2,3,4,5) = 2*.5 + 4*.5 = 3
2. (0,.5,0,0,0,.5)*(0,1,2,3,4,5) = 1*.5 + 5*.5 = 3

So training can get stuck at any of these solutions, unlike cross-entropy or other standard loss functions, which are strictly positive for any y_pred other than the one-hot encoding of y_true.

Regards,
Adit  

Sergey O.

Aug 2, 2018, 8:06:22 PM
to adit bhardwaj, Keras-users
Good point! I wonder if you can add (1-sum(x^2)) activation regularization to the softmax layer to favor a sparse output.
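
Something along these lines might work, assuming a Dense softmax output layer and the model / num_cat from earlier in the thread (the name sparsity_reg and the 0.01 weight are just illustrative):

from keras import backend as K
from keras.layers import Dense

def sparsity_reg(weight=0.01):
  # 1 - sum(p^2) is smallest when the softmax output is one-hot,
  # so penalizing it nudges the layer toward peaked (sparse) outputs
  def reg(activations):
    return weight * K.sum(1.0 - K.sum(K.square(activations), axis=-1))
  return reg

model.add(Dense(num_cat, activation='softmax',
                activity_regularizer=sparsity_reg(0.01)))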


Sergey O.

Aug 2, 2018, 9:58:16 PM
to adit bhardwaj, Keras-users
This might be silly but...
why even bother predicting categories if they are in order?
why not just predict a continuous scalar value (Dense(1)) with "mean_squared_error" as the loss function and then just take np.round(model.predict(input)) to select a category?

You can even use a sigmoid activation to force the scalar value to lie between 0 and num_cat - 1:
model.add(Dense(1))
model.add(Lambda(lambda x: K.sigmoid(x) * (num_cat - 1)))

model.compile("adam","mean_squared_error")
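
Then at inference time (x_test is just a placeholder for your input data):

import numpy as np

# the sigmoid-scaled output lies in (0, num_cat - 1), so rounding maps it back to a class index
pred_class = np.round(model.predict(x_test)).astype(int).ravel()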



adit bhardwaj

Aug 6, 2018, 6:28:00 PM
to Sergey O., Keras-users
The solution with cross-entropy weighted per (true_y, pred_y) pair works to some extent, in the sense that I can penalize some (true_y, pred_y) pairs more than others and hence roughly enforce the ordering. In practice, I did see an improvement in what I was looking for, but not to the extent I want.

In the past, I did try np.round(model.predict(input)) with a simple mean_squared_error regression, but the outputs could be out of range and sometimes terribly wrong.
I hadn't thought about using a sigmoid output. I think it could be smart, since it will keep the output within range. Going to try it next.

Another solution I found is https://www.cs.waikato.ac.nz/~eibe/pubs/ordinal_tech_report.pdf, and it looks promising.
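As I understand it, that report turns a k-class ordinal problem into k-1 binary questions of the form "is the target greater than class i?". A rough sketch of that decomposition in Keras (num_cat, input_dim, and the helper ordinal_targets are illustrative assumptions, not from the paper's code):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

num_cat = 6      # assumption: 6 ordinal classes
input_dim = 20   # assumption: placeholder feature dimension

def ordinal_targets(y, num_cat):
  # encode an integer label y as num_cat-1 binary targets: t[i] = 1 if y > i
  return (y[:, None] > np.arange(num_cat - 1)[None, :]).astype('float32')

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(input_dim,)))
model.add(Dense(num_cat - 1, activation='sigmoid'))
model.compile('adam', 'binary_crossentropy')

# at inference, the predicted class is the number of thresholds passed:
# pred_class = np.sum(model.predict(x) > 0.5, axis=-1)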

Thanks
Adit   