Inefficient new/delete cycle; let the LM decide unknownWord_


Kenneth Heafield

Jul 6, 2013, 12:12:11 PM
to jane-...@googlegroups.com
Dear Jane,

    I'm writing about this piece of code. 

    Cost returnCost;
    if (word != unknownWord_) {
        lm::ngram::Model::State outState;

        lm::WordIndex *ebuffer = new lm::WordIndex[kenLm_->Order()];
        lm::WordIndex *sbuffer = ebuffer;

        // We have to check, whether the context only contains symbols of the internal alphabet.
        // However, this is just a workaround.
        // LanguageModel.cc should provide a clean and correct context.

        while ((contextBegin < contextEnd) && (*contextBegin < internalAlphabet_->size()))
            *ebuffer++ = *contextBegin++;

        returnCost = kenLm_->FullScoreForgotState(sbuffer, ebuffer, word, outState).prob;

        delete[] sbuffer;

        if (returnCost == LogP_Zero)
            returnCost = -unknownCost_;
    } else
        returnCost = -unknownCost_;

    return -returnCost;

First comment: new/delete is really slow, and you don't want to do that on every LM query.  Stack-allocating lm::WordIndex[KENLM_MAX_ORDER] would be much faster.  Also, I think the buffer only needs Order() - 1 entries, since an n-gram model of order N conditions on at most N - 1 context words.
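To illustrate the stack-allocation suggestion, here is a minimal sketch that compiles without KenLM. WordIndex, kMaxOrderSketch, FillContext and ContextLength are stand-ins invented for this example, not names from Jane or KenLM. The extra bound of order - 1 on the copy loop is also an assumption of the sketch; the quoted code does not have it.

```cpp
#include <cassert>
#include <cstddef>

// Stand-ins for lm::WordIndex and KENLM_MAX_ORDER (assumptions, so that the
// sketch is self-contained).
typedef unsigned int WordIndex;
const std::size_t kMaxOrderSketch = 6;

// Copies context words into `buffer` until the context ends, a word falls
// outside the internal alphabet, or order - 1 words have been copied.
// Returns one past the last word written, like ebuffer in the quoted code.
WordIndex *FillContext(const WordIndex *contextBegin,
                       const WordIndex *contextEnd,
                       std::size_t alphabetSize,
                       std::size_t order,
                       WordIndex *buffer) {
  assert(order >= 1 && order <= kMaxOrderSketch);
  WordIndex *out = buffer;
  WordIndex *const limit = buffer + (order - 1);
  while (contextBegin < contextEnd && out < limit &&
         *contextBegin < alphabetSize) {
    *out++ = *contextBegin++;
  }
  return out;
}

// Hypothetical caller: the buffer lives on the stack, so no new/delete
// happens per query. Returns how many context words were kept.
std::size_t ContextLength(const WordIndex *contextBegin,
                          const WordIndex *contextEnd,
                          std::size_t alphabetSize,
                          std::size_t order) {
  WordIndex buffer[kMaxOrderSketch - 1];
  WordIndex *end = FillContext(contextBegin, contextEnd, alphabetSize, order,
                               buffer);
  return static_cast<std::size_t>(end - buffer);
}
```

In the real code, the pair (buffer, end) would be passed to FullScoreForgotState in place of (sbuffer, ebuffer), and the delete[] would disappear.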

Second comment: there appears to be no mechanism to disable the unknownCost_ override.  Note that p(<unk> | foo) = b(foo)p(<unk>) != b(bar)p(<unk>) = p(<unk> | bar).  Because unknown words cause the language model to back off, the language model can help place unknown words where open-class words are liable to appear.  This is impossible under your scheme because unknown words are forced to have a constant cost.
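The back-off identity above can be checked with a few lines of arithmetic in log10 space, where the product b(context)p(<unk>) becomes a sum. This is only a sketch: UnkLogProb and ToProbability are hypothetical helpers, and the numbers in the usage note are made up for illustration, not taken from any model.

```cpp
#include <cassert>
#include <cmath>

// log10 p(<unk> | context) = log10 b(context) + log10 p(<unk>).
// The back-off weight may exceed 1, so only the unigram term is checked.
double UnkLogProb(double logBackoff, double logUnigramUnk) {
  assert(logUnigramUnk <= 0.0);
  return logBackoff + logUnigramUnk;
}

// Converts a log10 probability back to a plain probability.
double ToProbability(double log10p) {
  return std::pow(10.0, log10p);
}
```

For example, with an invented log10 p(<unk>) of -5.0 and invented back-off weights of -0.2 for foo and -1.1 for bar, UnkLogProb(-0.2, -5.0) and UnkLogProb(-1.1, -5.0) differ, so the model prefers <unk> after foo. A constant unknownCost_ cannot express that preference.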

Kenneth

Kenneth Heafield

Jul 6, 2013, 12:14:35 PM
to jane-...@googlegroups.com
Er, make that kMaxOrder from lm/max_order.hh, because your version of KenLM is out of date.

Kenneth Heafield

Jul 6, 2013, 1:46:47 PM
to jane-...@googlegroups.com
Furthermore, Jane appears to assume that a phrase table OOV will be a language model OOV.  Here's an example: "Voldemort" was not in the parallel data but it is in the monolingual data. 

If I query the LM directly,
log p("who is Voldemort ?") = -17.386
log p(" who is fjkdsjfkdfjfkl ? ") = -12.5494

where log p(Voldemort | <s> who is) = -7.97847. 

This is what Jane does normally:

7 0 # " wer ist <unknown-word> ? " # " who is Voldemort ? " # " who is Voldemort ? " # janecosts 101.942 phraseFeature0 1.62269 phraseFeature1 4.90987         phraseFeature2 1.73611 phraseFeature3 3.00514 phraseFeature4 -1.99979 wordPenalty 2.60577 passthrough -1 glue 0 LM 9.40753

It completely skips over Voldemort even though it's in the language model.  Apparently all phrase-table OOVs are mapped to unknownWord_, triggering the if statement. 

If I remove the if statement as recommended, then I get <unk>.

7 0 # " wer ist <unknown-word> ? " # " who is Voldemort ? " # " who is Voldemort ? " # janecosts 102.579 phraseFeature0 1.62269 phraseFeature1 4.90987         phraseFeature2 1.73611 phraseFeature3 3.00514 phraseFeature4 -1.99979 wordPenalty 2.60577 passthrough -1 glue 0 LM 12.5494

But what I really want is LM 17.386, which is the correct score in this case, since log p("who is Voldemort ?") = -17.386.
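One way to get there is to pick the LM word from the target surface string instead of from phrase-table coverage. The sketch below uses a toy vocabulary rather than KenLM; ToyVocabulary, LmWordFromPhraseTable and LmWordFromSurface are hypothetical names, and reserving index 0 for <unk> is an assumption of the sketch. It is not Jane's actual code, only an illustration of the two behaviors described in this message.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

typedef unsigned int WordIndex;
const WordIndex kUnkSketch = 0;  // assumed index of <unk> in the toy vocabulary

// Toy stand-in for a language model vocabulary.
struct ToyVocabulary {
  std::unordered_map<std::string, WordIndex> ids;

  // Returns the word's index, or kUnkSketch if the LM has never seen it.
  WordIndex Index(const std::string &word) const {
    auto it = ids.find(word);
    return it == ids.end() ? kUnkSketch : it->second;
  }
};

// Behavior described above: any phrase-table OOV is scored as unknown,
// even if the LM knows the word.
WordIndex LmWordFromPhraseTable(const ToyVocabulary &lmVocab,
                                const std::string &surface,
                                bool inPhraseTable) {
  assert(!surface.empty());
  return inPhraseTable ? lmVocab.Index(surface) : kUnkSketch;
}

// Desired behavior: always ask the LM vocabulary, so a passed-through word
// such as "Voldemort" keeps its own LM index.
WordIndex LmWordFromSurface(const ToyVocabulary &lmVocab,
                            const std::string &surface) {
  assert(!surface.empty());
  return lmVocab.Index(surface);
}
```

With "Voldemort" in the toy vocabulary, LmWordFromSurface returns its real index while LmWordFromPhraseTable with inPhraseTable set to false returns the <unk> index. A string like "fjkdsjfkdfjfkl" maps to <unk> either way.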

Stephan Peitz

Jul 8, 2013, 4:58:21 AM
to jane-...@googlegroups.com
Fixed. Thank you again!

Kenneth Heafield

Jul 8, 2013, 10:34:00 AM
to jane-...@googlegroups.com
Does that cover both issues or just the first one?  I'm blocking on the OOV handling in the sense that it's preventing you from appearing in the same graph as Moses, cdec, and Joshua. 

Stephan Peitz

Jul 9, 2013, 1:52:28 AM
to jane-...@googlegroups.com
Sorry, it covers just the first issue. The second one is more complex due to our separation of decoder and language model.
However, I am working on that! Stay tuned!

Stephan Peitz

Jul 24, 2013, 5:43:16 AM
to jane-...@googlegroups.com
Second issue is also fixed now!

Kenneth Heafield

Jul 24, 2013, 5:46:52 AM
to jane-...@googlegroups.com
Great.  How do I download it?  http://www-i6.informatik.rwth-aachen.de/jane/ still gives me the same tarball when I enter a fake name. 