Sample code for Google N-gram with stupid backoff?


Joseph Turian

Aug 19, 2012, 6:22:17 AM
to berkeleyl...@googlegroups.com
Could you provide code that does the following?

Load stupid backoff model with English Google N-gram binary + vocab
For each line in stdin:
   Run stupid backoff on the string and output the probability

This would be very useful. I'm trying to figure out how to do this, but having difficulty.

I think a lot of people just want to get probabilities like this, and a ready-made example would lower the friction of using the library a lot.

Thanks!

    Joseph
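
For readers unfamiliar with the technique, the request above can be sketched without the library. The following is a minimal, self-contained illustration of stupid backoff scoring over lines from stdin. It does not use BerkeleyLM or the Google N-gram binary: the class `StupidBackoffSketch`, its methods, and the three-sentence toy corpus are all hypothetical stand-ins, and the backoff factor 0.4 is the value suggested in the original stupid backoff paper (Brants et al., 2007). The scores are relative frequencies with a backoff penalty, not normalized probabilities.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy stupid-backoff scorer; a hypothetical illustration, not BerkeleyLM code. */
final class StupidBackoffSketch {
    static final double ALPHA = 0.4; // backoff factor suggested by Brants et al. (2007)

    private final Map<String, Long> counts = new HashMap<>();
    private final int order;
    private long totalTokens = 0;

    StupidBackoffSketch(int order) {
        this.order = order;
    }

    /** Counts every n-gram of length 1..order in one tokenized sentence. */
    void addSentence(List<String> words) {
        totalTokens += words.size();
        for (int start = 0; start < words.size(); start++) {
            for (int len = 1; len <= order && start + len <= words.size(); len++) {
                counts.merge(String.join(" ", words.subList(start, start + len)), 1L, Long::sum);
            }
        }
    }

    /** log10 score of the last word of ngram given the words before it. */
    double logScore(List<String> ngram) {
        if (ngram.isEmpty()) {
            throw new IllegalArgumentException("empty n-gram");
        }
        // Keep at most `order` words: the last word plus its context.
        int from = Math.max(0, ngram.size() - order);
        List<String> current = ngram.subList(from, ngram.size());
        double penalty = 0.0;
        while (current.size() > 1) {
            long seen = counts.getOrDefault(String.join(" ", current), 0L);
            if (seen > 0) {
                // Seen n-gram: relative frequency count(context + word) / count(context).
                long context = counts.get(String.join(" ", current.subList(0, current.size() - 1)));
                return penalty + Math.log10((double) seen / context);
            }
            // Unseen n-gram: pay log10(ALPHA) and drop the oldest context word.
            penalty += Math.log10(ALPHA);
            current = current.subList(1, current.size());
        }
        // Unigram base case: count(word) / total tokens (negative infinity if unseen).
        long unigram = counts.getOrDefault(current.get(0), 0L);
        return penalty + Math.log10((double) unigram / totalTokens);
    }

    /** Sums the conditional scores of every word in the line (no boundary symbols). */
    double logScoreLine(List<String> words) {
        double total = 0.0;
        for (int i = 1; i <= words.size(); i++) {
            total += logScore(words.subList(0, i));
        }
        return total;
    }

    /** Builds a trigram model from a tiny made-up corpus. */
    static StupidBackoffSketch toyModel() {
        StupidBackoffSketch lm = new StupidBackoffSketch(3);
        lm.addSentence(Arrays.asList("the", "cat", "sat"));
        lm.addSentence(Arrays.asList("the", "cat", "ran"));
        lm.addSentence(Arrays.asList("the", "dog", "sat"));
        return lm;
    }

    public static void main(String[] args) throws IOException {
        StupidBackoffSketch lm = toyModel();
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        for (String line = in.readLine(); line != null; line = in.readLine()) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // nothing to score on a blank line
            }
            System.out.println(lm.logScoreLine(Arrays.asList(trimmed.split("\\s+"))));
        }
    }
}
```

A real setup would replace the in-memory count table with the precomputed Google N-gram counts, which is the part the library handles; the recursion in `logScore` is the piece this sketch is meant to show.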

Adam Pauls

Aug 20, 2012, 2:55:56 PM
to berkeleyl...@googlegroups.com
I've attached a file that loads either a Google binary or a regular Berkeley LM binary (which already has the vocab) and computes the log-prob of a set of files (or STDIN if none is given, in standard UNIX style). I'll release a new version of the code with this file available shortly.

By the way, because I know the code so well, it's hard for me to tell what makes this difficult to figure out. Can you tell me what documentation you read and where you got stuck, so I can find the weak points in the documentation?

Thanks,

Adam
Attachment: ComputePerplexity.java

Joseph Turian

Sep 6, 2012, 5:23:26 AM
to berkeleyl...@googlegroups.com

Adam,

Thanks for this.

I believe I was confused about some of the parameters.

When I first dive into code, I strongly prefer to work from example code. So my main barrier to hacking was the lack of examples.

Joseph Turian

Sep 6, 2012, 5:30:56 AM
to berkeleyl...@googlegroups.com
Another barrier is that I don't work with the Java stack that often (I use C++ and Python more frequently), so there's friction for me simply in building and invoking Java code. But naturally this is to be expected.

Harta Wijaya

Sep 8, 2012, 9:29:58 AM
to berkeleyl...@googlegroups.com
Hi Adam,

Thanks for the example. However, I can't get the probability right.
What I did was:
1. I converted an SRILM ARPA-format model to the Berkeley binarized format.
I followed the instructions from the Joshua decoder:
java -cp $JOSHUA/lib/berkeleylm.jar -server -mxMEM edu.berkeley.nlp.lm.io.MakeLmBinaryFromArpa lm.arpa lm.berkeleylm

2. Next, I tried to run your ComputePerplexity.java to get the probability of the text from the output file.
However, I always get -100.0 for the probability.

Thanks a lot for your help.

Adam Pauls

Sep 8, 2012, 7:22:20 PM
to berkeleyl...@googlegroups.com
Could you send me the file and the command that you are running in step 2 so I can replicate this? Thanks,

Adam

Harta Wijaya

Sep 8, 2012, 9:33:40 PM
to berkeleyl...@googlegroups.com
Hi Adam,

Thanks for your reply. You can download those files and replicate the steps at http://www.comp.nus.edu.sg/~harta01/

Thank you.

Harta Wijaya

Adam Pauls

Sep 9, 2012, 2:38:51 AM
to berkeleyl...@googlegroups.com
Actually, it's probably just because I made some changes to the code in the meantime and ComputePerplexity needs them to work properly. I will release a new version shortly. Please rerun then and let me know if it works.

Harta Wijaya

Sep 9, 2012, 3:27:19 AM
to berkeleyl...@googlegroups.com
Ok. Thanks

Adam Pauls

Sep 14, 2012, 8:30:12 PM
to berkeleyl...@googlegroups.com
A new version has been uploaded to the Google Code site, which should fix this bug.

Harta Wijaya

Sep 17, 2012, 10:01:01 AM
to berkeleyl...@googlegroups.com
Hi Adam,

Sorry for the late reply.
Yes, it works now.

Thanks a lot.

Oren Melamud

Sep 17, 2014, 9:42:40 AM
to berkeleyl...@googlegroups.com
Hi Adam,

I'm using your web 1T binary with stupid back-off. Very helpful. Thanks!
I looked at the sample code that you provided here and it seems to me like it's not padding the sentences with start/end sentence symbols, which should normally be added.
Am I right?

Thanks,
Oren.

Oren Melamud

Sep 17, 2014, 10:27:16 AM
to berkeleyl...@googlegroups.com
Actually, maybe I'm missing something, but it seems like the use of lm.getLogProb(words) for every text line returns the conditional probability of the last word in the line given the previous words.
I guess using scoreSentence() would be the appropriate choice there, right?
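
The distinction raised here can be sketched independently of the library. The example below is a hypothetical illustration, not BerkeleyLM code: `SentenceScoringSketch`, `sentenceLogProb`, and the constant stand-in scorer are made up for the purpose, and `<s>` / `</s>` are assumed as the boundary symbols. It contrasts a single conditional call on the whole line (the log-probability of the last word given its history) with a sentence score that pads the line and sums one conditional term per position. Whether the library's scoreSentence() does exactly this is not confirmed in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Hypothetical illustration of sentence scoring; not BerkeleyLM code. */
final class SentenceScoringSketch {
    static final String START = "<s>";  // assumed start-of-sentence symbol
    static final String END = "</s>";   // assumed end-of-sentence symbol

    /**
     * Pads the sentence with boundary symbols and sums the conditional
     * log-probability of each word (and of the end symbol) given up to
     * order - 1 preceding words.
     */
    static double sentenceLogProb(List<String> words, int order,
                                  ToDoubleFunction<List<String>> condLogProb) {
        List<String> padded = new ArrayList<>();
        padded.add(START);
        padded.addAll(words);
        padded.add(END);
        double total = 0.0;
        // Position 0 is the start symbol, which is given, so scoring starts at 1.
        for (int i = 1; i < padded.size(); i++) {
            int from = Math.max(0, i + 1 - order);
            total += condLogProb.applyAsDouble(padded.subList(from, i + 1));
        }
        return total;
    }

    public static void main(String[] args) {
        // Stand-in scorer: every word gets log10-probability -1.0 in any context.
        ToDoubleFunction<List<String>> scorer = ngram -> -1.0;
        List<String> words = Arrays.asList("the", "cat", "sat");

        // One conditional call on the whole line scores only the last word.
        System.out.println(scorer.applyAsDouble(words));

        // Padded sentence score: one term per word plus one for the end symbol.
        System.out.println(sentenceLogProb(words, 3, scorer));
    }
}
```

With the constant stand-in scorer, the first call yields -1.0 and the second -4.0 (three words plus the end symbol), which is the gap between a last-word conditional and a full sentence score.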

Adam Pauls

Sep 19, 2014, 4:55:50 AM
to berkeleyl...@googlegroups.com
Yes, this is fixed at head in SVN. 
