Kenneth Heafield
Jul 6, 2013, 12:12:11 PM
to jane-...@googlegroups.com
Dear Jane,
I'm writing about this piece of code:
Cost returnCost;
if (word != unknownWord_) {
  lm::ngram::Model::State outState;
  lm::WordIndex *ebuffer = new lm::WordIndex[kenLm_->Order()];
  lm::WordIndex *sbuffer = ebuffer;
  // We have to check whether the context only contains symbols of the internal alphabet.
  // However, this is just a workaround.
  // LanguageModel.cc should provide a clean and correct context.
  while ((contextBegin < contextEnd) && (*contextBegin < internalAlphabet_->size()))
    *ebuffer++ = *contextBegin++;
  returnCost = kenLm_->FullScoreForgotState(sbuffer, ebuffer, word, outState).prob;
  delete[] sbuffer;
  if (returnCost == LogP_Zero)
    returnCost = -unknownCost_;
} else
  returnCost = -unknownCost_;
return -returnCost;
My first comment is that new/delete is slow, and you don't want to pay that cost on every LM query. Stack-allocating lm::WordIndex[KENLM_MAX_ORDER] would be much faster. Also, I think the buffer can hold Order() - 1 entries, since an n-gram model conditions on at most Order() - 1 words of history.
My second comment is that there appears to be no mechanism to disable the unknownCost_ override. Note that p(<unk> | foo) = b(foo)p(<unk>) != b(bar)p(<unk>) = p(<unk> | bar). Because unknown words cause the language model to back off, the language model can help place unknown words where open-class words are liable to appear. This is impossible under your scheme, because unknown words are forced to have a constant cost.
Kenneth