Inefficient new/delete cycle; let the LM decide unknownWord_

Kenneth Heafield

Jul 6, 2013, 12:12:11 PM

to jane-...@googlegroups.com

Dear Jane,

    I'm writing about this piece of code.

            Cost returnCost;
            if (word != unknownWord_ ) {
                lm::ngram::Model::State outState;

                lm::WordIndex *ebuffer = new lm::WordIndex[kenLm_->Order()];
                lm::WordIndex *sbuffer = ebuffer;

                // We have to check, whether the context only contains symbols of the internal alphabet.
                // However, this is just a workaround.
                // LanguageModel.cc should provide a clean and correct context.

                while ((contextBegin < contextEnd) && (*contextBegin < internalAlphabet_->size()))
                    *ebuffer++ = *contextBegin++;

                returnCost = kenLm_->FullScoreForgotState(sbuffer, ebuffer, word, outState).prob;

                delete[] sbuffer;

                if (returnCost == LogP_Zero)
                    returnCost = -unknownCost_;
            } else
                returnCost = -unknownCost_;

            return -returnCost;

First comment: new/delete is really slow, and you don't want to do that on every LM query. Stack-allocating lm::WordIndex[KENLM_MAX_ORDER] would be much faster. I also think the buffer only needs Order() - 1 entries.
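A minimal sketch of the stack-allocated version, written without KenLM so it compiles on its own. WordIndex and kSketchMaxOrder are hypothetical stand-ins for lm::WordIndex and the compile-time order bound, and the explicit cap at Order() - 1 words is an assumption added here, not something the quoted code does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins, NOT the KenLM definitions.
typedef std::uint32_t WordIndex;          // plays the role of lm::WordIndex
const std::size_t kSketchMaxOrder = 6;    // plays the role of the compile-time maximum order

// Copies the leading run of context words whose ids are below alphabetSize
// into out, stopping after maxWords entries. Returns the number copied.
std::size_t CopyContext(const WordIndex *contextBegin, const WordIndex *contextEnd,
                        WordIndex alphabetSize, WordIndex *out, std::size_t maxWords) {
    std::size_t n = 0;
    while (contextBegin < contextEnd && n < maxWords && *contextBegin < alphabetSize)
        out[n++] = *contextBegin++;
    return n;
}

// Builds the context in a fixed-size stack buffer: no new/delete per query.
std::size_t ContextLengthForQuery(const WordIndex *contextBegin, const WordIndex *contextEnd,
                                  WordIndex alphabetSize, std::size_t order) {
    assert(order >= 1 && order <= kSketchMaxOrder);
    WordIndex buffer[kSketchMaxOrder];
    // An n-gram model of the given order conditions on at most order - 1 words.
    std::size_t n = CopyContext(contextBegin, contextEnd, alphabetSize, buffer, order - 1);
    // The quoted code would now score with the range [buffer, buffer + n).
    return n;
}
```

With a context of {3, 1, 4, 100, 2} and an alphabet size of 10, an order-6 model keeps the first three words, and an order-3 model keeps only two.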

Second comment: there appears to be no mechanism to disable the unknownCost_ override. Note that, in general, p(<unk> | foo) = b(foo)p(<unk>) != b(bar)p(<unk>) = p(<unk> | bar). Because unknown words cause the language model to back off, the language model can help place unknown words where open-class words are liable to appear. This is impossible under your scheme because unknown words are forced to have a constant cost.
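To make the backoff identity concrete, here is a toy sketch in log space with a one-word context. The function name and any values passed to it are invented for illustration and do not come from a real model; it also assumes <unk> appears only as a unigram, so every context backs off all the way.

```cpp
#include <map>
#include <string>

// log p(<unk> | context) = log b(context) + log p(<unk>).
// logBackoff maps a context word to its log backoff weight; a context
// without an entry gets log b = 0, i.e. b = 1.
double LogProbUnkGivenContext(const std::map<std::string, double> &logBackoff,
                              double logProbUnk, const std::string &context) {
    std::map<std::string, double>::const_iterator it = logBackoff.find(context);
    double logB = (it == logBackoff.end()) ? 0.0 : it->second;
    return logB + logProbUnk;
}
```

With made-up weights log b(foo) = -0.5, log b(bar) = -0.25 and log p(<unk>) = -4.0, the two contexts give -4.5 and -4.25, so the model prefers the unknown word after "bar". A constant unknownCost_ erases that difference.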

Kenneth

Kenneth Heafield

Jul 6, 2013, 12:14:35 PM

to jane-...@googlegroups.com

Er, make that kMaxOrder from lm/max_order.hh rather than KENLM_MAX_ORDER, because your version of KenLM is out of date.

Kenneth Heafield

Jul 6, 2013, 1:46:47 PM

to jane-...@googlegroups.com

Furthermore, Jane appears to assume that a phrase table OOV will be a language model OOV. Here's an example: "Voldemort" was not in the parallel data but it is in the monolingual data.

If I query the LM directly,
log p("who is Voldemort ?") = -17.386
log p(" who is fjkdsjfkdfjfkl ? ") = -12.5494

where log p(Voldemort | <s> who is) = -7.97847.

This is what Jane does normally:

7 0 # " wer ist <unknown-word> ? " # " who is Voldemort ? " # " who is Voldemort ? " # janecosts 101.942 phraseFeature0 1.62269 phraseFeature1 4.90987 phraseFeature2 1.73611 phraseFeature3 3.00514 phraseFeature4 -1.99979 wordPenalty 2.60577 passthrough -1 glue 0 LM 9.40753

It completely skips over Voldemort even though it's in the language model. Apparently all phrase-table OOVs are mapped to unknownWord_, triggering the if statement.

If I remove the if statement as recommended, then I get <unk>.

7 0 # " wer ist <unknown-word> ? " # " who is Voldemort ? " # " who is Voldemort ? " # janecosts 102.579 phraseFeature0 1.62269 phraseFeature1 4.90987 phraseFeature2 1.73611 phraseFeature3 3.00514 phraseFeature4 -1.99979 wordPenalty 2.60577 passthrough -1 glue 0 LM 12.5494

But what I really want is LM 17.386, which corresponds to the correct log probability of -17.386 in this case.
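One way to get there, sketched with a toy vocabulary rather than Jane's or KenLM's actual classes: look up the surface string in the language model's own vocabulary and fall back to <unk> only when the LM itself does not know the word. ToyWordIndex, kToyUnk and LmIndexForSurface are hypothetical names, and the sketch assumes the decoder still has the surface form of a phrase-table OOV available at scoring time.

```cpp
#include <map>
#include <string>

typedef unsigned int ToyWordIndex;
const ToyWordIndex kToyUnk = 0;   // hypothetical id reserved for <unk>

// Maps a surface string to its LM id. Whether the phrase table knew the
// word is irrelevant here: only the LM vocabulary decides if it is <unk>.
ToyWordIndex LmIndexForSurface(const std::map<std::string, ToyWordIndex> &lmVocab,
                               const std::string &surface) {
    std::map<std::string, ToyWordIndex>::const_iterator it = lmVocab.find(surface);
    return it == lmVocab.end() ? kToyUnk : it->second;
}
```

Under this scheme "Voldemort" would be scored with its own LM entry, while "fjkdsjfkdfjfkl" would still map to <unk>.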

Stephan Peitz

Jul 8, 2013, 4:58:21 AM

to jane-...@googlegroups.com

Fixed. Thank you again!

Kenneth Heafield

Jul 8, 2013, 10:34:00 AM

to jane-...@googlegroups.com

Does that cover both issues or just the first one? I'm blocking on the OOV handling in the sense that it's preventing you from appearing in the same graph as Moses, cdec, and Joshua.

Stephan Peitz

Jul 9, 2013, 1:52:28 AM

to jane-...@googlegroups.com

Sorry, it covers just the first issue. The second one is more complex due to our separation of decoder and language model.

However, I am working on that! Stay tuned!

Stephan Peitz

Jul 24, 2013, 5:43:16 AM

to jane-...@googlegroups.com

Second issue is also fixed now!

Kenneth Heafield

Jul 24, 2013, 5:46:52 AM

to jane-...@googlegroups.com

Great. How do I download it? http://www-i6.informatik.rwth-aachen.de/jane/ still gives me the same tarball when I enter a fake name.
