On the topic of how to handle very unbalanced data sets:
All the answers I have seen seem to assume that the targets are class indices (0, 1, ...), but I am wondering how to apply them when the target Y values are one-hot encoded vectors.
Say I have ten categories that are encoded as
(1,0,0,0,0,0,0,0,0,0)
(0,1,0,0,0,0,0,0,0,0)
etc.
and the first one accounts for >99% of the training and test samples. Now assume we want to provide a custom class_weights dictionary that assigns a lower weight to the dominant 99% class and relatively higher weights to the other classes. But it seems this is not accepted: a one-hot vector cannot serve as a dictionary key (the targets are NumPy arrays, which are not hashable), and Keras expects class indices as keys anyway.
So something like
class_weights = {(1,0,0,0,0,0,0,0,0,0): 0.1,
                 (0,1,0,0,0,0,0,0,0,0): 1.0,
                 (0,0,1,0,0,0,0,0,0,0): 1.0,
                 ...etc...}
is NOT allowed.
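As far as I can tell, what class_weight does accept is integer class indices as keys, which could be recovered from the one-hot vectors with argmax. A quick sketch (the 0.1/1.0 weights are placeholders):

import numpy as np

# class_weight maps integer class indices to weights,
# so the one-hot targets have to be mapped back to indices first
class_weights = {i: 1.0 for i in range(1, 10)}  # the nine rare classes
class_weights[0] = 0.1                          # the dominant default class

# the index encoded by a one-hot vector is simply its argmax position
y_onehot = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0])
class_index = int(np.argmax(y_onehot))          # -> 1

But if I understand correctly, class_weight weights whole samples by their class, and for my sequence (3D) targets Keras seems to reject class_weight altogether, so this alone does not solve the problem.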
Therefore: is there any way to set class_weight (in Keras/TensorFlow) for cases where the targets are one-hot encoded vectors? I found one post (https://stackoverflow.com/questions/43481490/keras-class-weights-class-weight-for-one-hot-encoding) that suggests using sample_weight instead (apparently suggesting that samples containing the rare event are weighted higher). But this won't work for my data either: my samples are long sequences in which one out of 100 tokens (or fewer) belongs to a rare non-default class, yet almost every sample contains at least one such rare token. So weighting whole samples differently does not help; I'd really need a way to assign a higher weight to the rare classes (tokens) within my sequences.
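To make concrete what I'm after: the closest mechanism I have found is Keras's per-timestep weighting, where compiling with sample_weight_mode='temporal' lets fit() take a 2D weight matrix of shape (num_samples, sequence_length), i.e. one weight per token. A minimal sketch of what I mean (the toy model, shapes, and 0.1/1.0 weights are placeholders; newer Keras versions handle a 2D sample_weight without the compile flag and may reject the flag itself):

import numpy as np
from tensorflow import keras

num_samples, seq_len, num_features, num_classes = 1000, 100, 8, 10

# toy sequence model: one softmax prediction per token
model = keras.Sequential([
    keras.layers.Input(shape=(seq_len, num_features)),
    keras.layers.LSTM(32, return_sequences=True),
    keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(loss="categorical_crossentropy",
              optimizer="adam",
              sample_weight_mode="temporal")  # per-timestep weights (older Keras)

# random stand-in data: one-hot targets of shape (samples, timesteps, classes)
x = np.random.rand(num_samples, seq_len, num_features)
y = keras.utils.to_categorical(
    np.random.randint(num_classes, size=(num_samples, seq_len)), num_classes)

# one weight per token: 0.1 for the dominant class 0, 1.0 for the rare ones
token_classes = y.argmax(axis=-1)  # shape (num_samples, seq_len)
weights = np.where(token_classes == 0, 0.1, 1.0)

model.fit(x, y, sample_weight=weights, epochs=1)

This would at least weight individual tokens rather than whole samples, which is exactly the distinction that matters for my data, but I am not sure it is the intended approach.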
Any suggestions?