Re: Exceptions trying to create model from Google Books N-gram


Adam Pauls

Jul 15, 2014, 9:50:18 PM
to berkeleyl...@googlegroups.com
It looks like that line has a " in it, meaning it has 4 words:
"
.24
23
40

So it's not a 3-gram?
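
A side note on counting fields: the answer changes depending on whether the line is split on every whitespace character or only on the tab. The sketch below assumes the Google n-gram layout of space-separated words, then a tab, then the count. `NgramLine` and `parse` are hypothetical names used for illustration and are not part of berkeleylm.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a berkeleylm class.
class NgramLine {
    final List<String> words;
    final long count;

    NgramLine(List<String> words, long count) {
        this.words = words;
        this.count = count;
    }

    // Assumed layout: "w1 w2 ... wn<TAB>count".
    // Split on the last tab first, so the count is never mistaken for a word.
    static NgramLine parse(String line) {
        int tab = line.lastIndexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("no tab-separated count in: " + line);
        }
        List<String> words = Arrays.asList(line.substring(0, tab).split(" "));
        long count = Long.parseLong(line.substring(tab + 1).trim());
        return new NgramLine(words, count);
    }
}
```

Under that assumption, `NgramLine.parse("\" .24 23\t40")` yields three words (`"`, `.24`, `23`) and a count of 40, so the line is a 3-gram only if the gap before `40` is a tab rather than a space.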


On Tue, Jul 15, 2014 at 5:07 AM, Roman Prokofyev <roman.p...@gmail.com> wrote:
Hello,

Unfortunately, I'm getting errors again, this time with the 3-gram data:

                Reading ngrams of order 3 {
                        Reading ngrams from file ./google_data2/3gms/3gm-0001 {
                                Line 0
Exception in thread "main" java.lang.RuntimeException: Could not parse line 140 '" .24 23       40' from file ./google_data2/3gms/3gm-0001

        at edu.berkeley.nlp.lm.io.GoogleLmReader.parse(GoogleLmReader.java:91)
        at edu.berkeley.nlp.lm.io.GoogleLmReader.parse(GoogleLmReader.java:21)
        at edu.berkeley.nlp.lm.io.LmReaders.buildMapCommon(LmReaders.java:484)
        at edu.berkeley.nlp.lm.io.LmReaders.secondPassGoogle(LmReaders.java:418)
        at edu.berkeley.nlp.lm.io.LmReaders.readLmFromGoogleNgramDir(LmReaders.java:229)
        at edu.berkeley.nlp.lm.io.LmReaders.readLmFromGoogleNgramDir(LmReaders.java:204)
        at edu.berkeley.nlp.lm.io.MakeLmBinaryFromGoogle.main(MakeLmBinaryFromGoogle.java:36)
Caused by: java.lang.RuntimeException: Failed to add line " .24 23
        at edu.berkeley.nlp.lm.io.NgramMapAddingCallback.call(NgramMapAddingCallback.java:51)
        at edu.berkeley.nlp.lm.io.GoogleLmReader.parseLine(GoogleLmReader.java:131)
        at edu.berkeley.nlp.lm.io.GoogleLmReader.parse(GoogleLmReader.java:89)
        ... 6 more 


It seems the reader cannot parse this line, but I don't see anything special about the line itself and have no idea why parsing fails.

--
You received this message because you are subscribed to the Google Groups "berkeleylm-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to berkeleylm-disc...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Roman Prokofyev

Jul 16, 2014, 3:10:38 AM
to berkeleyl...@googlegroups.com
No, 40 is actually the count, separated from the words by a tab,
but I figured it out: there were out-of-vocabulary words in my training data again.
Thanks!
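
For anyone hitting the same exception, a quick pre-check for out-of-vocabulary words might look like the sketch below. It assumes the same "words, tab, count" line layout and that the vocabulary has already been loaded into a `Set<String>`. `OovCheck` and `unknownWords` are hypothetical names, not part of berkeleylm, and how the vocabulary is loaded is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not a berkeleylm class.
class OovCheck {
    // Returns the words of one n-gram line that are missing from the vocabulary.
    // Assumed layout: "w1 w2 ... wn<TAB>count"; a line without a tab is
    // treated as words only.
    static List<String> unknownWords(String line, Set<String> vocab) {
        int tab = line.lastIndexOf('\t');
        String wordPart = tab >= 0 ? line.substring(0, tab) : line;
        List<String> missing = new ArrayList<>();
        for (String word : wordPart.split(" ")) {
            if (!vocab.contains(word)) {
                missing.add(word);
            }
        }
        return missing;
    }
}
```

Running `unknownWords` over each line of an n-gram file before building the model and printing any non-empty result would point at the offending lines directly. Which word was missing in the case above is not stated in the thread.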