Is it sound to compare scores that come from different models? As a fictitious example (there are better ways to solve this particular problem, but it illustrates the question), suppose I have labelled data with two classes, C1 and C2. I train one model per class: one on all the text of C1 and another on all the text of C2. If LogProb(doc|class) is the output of getLogProb from the model for that class, would it then make sense to say that a new document is a mixture of both classes according to the distribution [LogProb(doc|C1), LogProb(doc|C2)]? I think the core of the question is whether the log probs are normalized, so that scores from the two models are on the same scale.
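
To make the idea concrete, here is a minimal sketch of how the two scores could be turned into a distribution. It rests on assumptions that may not hold: that both models score the document over the same event space (same tokenization and vocabulary), and that the two classes have equal priors. The numbers are made up, and `normalize_log_scores` is a hypothetical helper, not part of any library.

```python
import math

def normalize_log_scores(log_scores):
    """Convert per-class log scores into weights that sum to 1.

    Subtracting the maximum before exponentiating avoids underflow,
    since document log probabilities are typically large and negative.
    """
    m = max(log_scores)
    exps = [math.exp(s - m) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up stand-ins for LogProb(doc|C1) and LogProb(doc|C2).
log_probs = [-120.0, -123.0]

# Assumes equal class priors; otherwise add log P(class) to each score first.
weights = normalize_log_scores(log_probs)
print(weights)
```

The raw log probs themselves are negative and do not sum to 1, so some normalization of this kind would be needed before reading them as mixture weights. Whether the result is meaningful still depends on the two models' scores being comparable in the first place.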