Hi,
I'm doing binary classification of a big set of documents. Each document is classified by the user as either 0 (not interesting) or 1 (interesting).
I've come up with a lot of different classifiers (mostly Nak-based with different configs, plus some other approaches) that have been trained on the documents the user has classified so far.
I'm now looking for a way to combine the classifiers, with weights that represent each user's preferences. For example, one of the classifiers focuses on geographic info, while most of the others focus on text content. I don't want a simple average, as that would almost negate the geographic classifier, which could be really important for some users.
I want the classifiers to be weighted according to how good they are at predicting the correct class; however, every classifier should keep some weight.
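Concretely, once I have a weight per classifier, the plan is that the combined score for a new document is just the weighted average of the M classifier outputs. combine below is only a sketch of that idea, not code I have:

import breeze.linalg._

// Sketch of the combination step: the combined score for one document is the
// weighted average of the M classifier outputs. With the weights summing to 1
// and each output in [0, 1], the combined score also stays in [0, 1].
def combine(outputs: DenseVector[Double], weights: DenseVector[Double]): Double =
  outputs dot weights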
I've come up with an implementation, but I'm not entirely sure that my line of reasoning is sound.
import breeze.linalg._

// 1) predictions: N labeled documents (rows) x M classifiers (cols);
//    labels: the N correct labels (0.0 or 1.0).
def weighLabeledPredictions(predictions: DenseMatrix[Double], labels: DenseVector[Double]): DenseVector[Double] = {
  // 2) Append the labels as an extra column, then map every prediction to its
  //    squared error against that document's label.
  val sqErrs: DenseMatrix[Double] =
    DenseMatrix.horzcat(predictions, labels.asDenseMatrix.t).apply(*, ::).map { row =>
      val l: Double = row(-1)                    // last element of the row: the label
      val ps: DenseVector[Double] = row(0 to -2) // the M predictions (negative indices count from the end)
      ps.map(p => math.pow(l - p, 2))
    }
  // 3) Sum the squared errors column-wise: one error sum per classifier.
  val sumSqErrs = sqErrs(::, *).map(sum(_)).inner
  // 4) Normalize (Breeze's normalize divides by the L2 norm by default).
  val normSumSqErrs = normalize(sumSqErrs)
  // 5) Invert each entry so that 1 is good and 0 is bad.
  val invNormSumSqErrs = normSumSqErrs.map(1d - _)
  // 6) Rescale so the weights sum to 1.
  val normInvNormSumSqErrs = {
    val sm = sum(invNormSumSqErrs)
    invNormSumSqErrs.map(_ / sm)
  }
  normInvNormSumSqErrs
}
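Called on a tiny made-up example (two labeled documents, two classifiers), it behaves like this:

val predictions = DenseMatrix(
  (0.9, 0.6),  // document 1: classifier 1 says 0.9, classifier 2 says 0.6
  (0.1, 0.5))  // document 2
val labels = DenseVector(1.0, 0.0)  // document 1 is interesting, document 2 is not
val weights = weighLabeledPredictions(predictions, labels)
// Total squared errors: classifier 1 -> 0.02, classifier 2 -> 0.41, so
// classifier 1 ends up with nearly all of the weight; the weights sum to 1.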
Explanation:
- The method takes a predictions matrix where row n holds the M classifier outputs for document n (N labeled documents in total), and a labels vector with the correct label (0 or 1) the user assigned to each document.
- Each prediction is mapped to the squared error between the prediction and the actual label.
- The squared errors for classifier m are then summed over the N documents, resulting in a vector of M squared-error sums, one per classifier.
- This sum-of-squared-errors vector is normalized (Breeze's normalize divides by the L2 norm by default) so that each entry lands between 0 and 1, where 0 is good and 1 is bad. This is the step I'm least sure about; see the traced example after this list.
- Each entry is inverted (1 - x) so that 1 is good and 0 is bad.
- The vector is rescaled so the weights sum to 1.
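To make the last three steps concrete, here is the arithmetic traced by hand on a made-up sum-of-squared-errors vector for two classifiers:

// sumSqErrs        = [3.0, 4.0]            (classifier 2 has the larger error)
// L2 norm          = sqrt(3*3 + 4*4) = 5
// normSumSqErrs    = [0.6, 0.8]            (step 4)
// invNormSumSqErrs = [0.4, 0.2]            (step 5)
// weights          = [0.4/0.6, 0.2/0.6] ≈ [0.667, 0.333]  (step 6: sums to 1)

The worse classifier keeps a clearly non-zero weight here, which is what I want, but I haven't convinced myself that this holds when one classifier's error dominates the vector.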
I'm looking for a "peer review" of this implementation. Any comments are very welcome.
Regards, Eirik