Google books ngrams

pz

Jul 25, 2012, 4:34:58 PM

to berkeleyl...@googlegroups.com

Hi. Is a prebuilt binary of the Google Books n-grams available somewhere, covering all n-grams since 1800? I started downloading the raw data, but it's a mammoth download.

Regards, pz

Adam Pauls

Jul 25, 2012, 5:08:38 PM

to berkeleyl...@googlegroups.com

Unfortunately, I haven't compiled a binary for the books corpus (just the Web1T corpus). It would take a little bit of work to put it together. Out of curiosity, what are you planning on using the corpora for? Note that a binary with Berkeley LM would only really be useful if you want to query the corpus programmatically from Java.

Adam

pz

Jul 26, 2012, 2:16:27 AM

to berkeleyl...@googlegroups.com

For identifying WordNet synsets. Given, for example, a noun and an adjective, replace the noun with synonyms from all of its synsets, then search for the resulting digrams. The ones with no occurrences (= very low probability) are from the wrong synsets. The same method works for longer n-grams (the longer, the more certainty) and other parts of speech. At the end you'll be left with one main synset with all or almost all occurrences, and almost no occurrences from the others. That's an idea of mine that I'm going to test.
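The idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: the synset names, the synonym lists, and the bigram counts are invented toy data, and `score_synsets` and `best_synset` are hypothetical helper names, not part of WordNet or BerkeleyLM. The original noun is left out of the synonym lists because it belongs to every candidate synset.

```python
def score_synsets(adjective, synsets, bigram_counts):
    """Sum the counts of 'adjective synonym' bigrams for each candidate synset.

    synsets: dict mapping a synset name to a list of synonyms (hypothetical data).
    bigram_counts: dict mapping (word1, word2) tuples to corpus counts.
    """
    totals = {}
    for name, synonyms in synsets.items():
        # Bigrams missing from the corpus simply contribute a count of 0.
        totals[name] = sum(bigram_counts.get((adjective, s), 0) for s in synonyms)
    return totals


def best_synset(adjective, synsets, bigram_counts):
    """Return the synset whose synonyms co-occur most often with the adjective."""
    totals = score_synsets(adjective, synsets, bigram_counts)
    return max(totals, key=totals.get)


# Toy example: which sense of "bank" fits the adjective "muddy"?
toy_synsets = {
    "bank.financial": ["depository"],
    "bank.river": ["riverbank", "shore"],
}
toy_counts = {
    ("muddy", "riverbank"): 40,
    ("muddy", "shore"): 25,
}

print(score_synsets("muddy", toy_synsets, toy_counts))
print(best_synset("muddy", toy_synsets, toy_counts))
```

In practice the counts would come from the aggregated n-gram data rather than a hand-written dictionary, and a smoothing or thresholding step would probably be needed for rare synonyms.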

Arguably Web 1T is much better for this, but I can't afford it at the moment. Besides, because it ships on DVD (is 24GB really too much to download?), customs would take at least a month, so downloading ~900GB of English books n-grams is probably going to be faster.

Adam Pauls

Jul 26, 2012, 3:20:29 PM

to berkeleyl...@googlegroups.com

I didn't realize how big the books corpus was. I won't be able to build a binary from a collection that is 900GB when gzipped (it will probably take something like 200GB in memory, and more than that to build). Sorry about that.

pz

Jul 26, 2012, 6:48:39 PM

to berkeleyl...@googlegroups.com

It's because there's added info per year (occurrences/year) + number of books with the word + pages with the word. Once the counts are summed, the 1.91GB of gzipped unigrams comes to just a 99MB CSV: exactly 7380256 unigrams.

Also 900GB is wrong, my mistake. I saw it somewhere but I think it's the size of unpacked data. The actual size of compressed data is ~325GB.

Adam Pauls

Aug 1, 2012, 12:21:14 AM

to berkeleyl...@googlegroups.com

Sorry, even at 325GB, it would still not fit in memory on any machine I know of, so I'm not sure if it's worth modifying the code to handle that data. Wish I could be of more help!

Joseph Turian

Aug 19, 2012, 6:16:26 AM

to berkeleyl...@googlegroups.com

Actually, it's far less than 325 GB once you sum each ngram over years. This is because each ngram is repeated 50-200 times, one line per year.

I can provide scripts to do so.

I also tend to drop years that are too far back, because they contain weird OCR errors.
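The summing over years described above can be sketched like this. It is a minimal illustration, not the scripts offered in this thread: it assumes each raw line is tab-separated with the n-gram, the year, and the match count as the first three fields (any further per-year fields are ignored), and `aggregate_by_year` and its `min_year` cutoff are hypothetical names.

```python
from collections import Counter


def aggregate_by_year(lines, min_year=None):
    """Sum match counts per n-gram across all years.

    lines: iterable of strings, assumed to look like
           'ngram<TAB>year<TAB>match_count<TAB>...'.
    min_year: if given, rows from earlier years are skipped
              (a crude way to avoid old, OCR-damaged material).
    """
    totals = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed rows
        ngram, year, match_count = fields[0], int(fields[1]), int(fields[2])
        if min_year is not None and year < min_year:
            continue
        totals[ngram] += match_count
    return totals


# Toy rows in the assumed layout (invented numbers).
sample = [
    "muddy bank\t1795\t3\t3\t2",
    "muddy bank\t1950\t10\t9\t5",
    "muddy bank\t1951\t12\t11\t6",
    "river bank\t1950\t7\t7\t4",
]

for ngram, count in sorted(aggregate_by_year(sample, min_year=1800).items()):
    # One 'ngram<TAB>count' line per n-gram.
    print(f"{ngram}\t{count}")
```

For the real corpus the lines would be streamed from the gzipped files, and because the totals are held in memory, the higher-order n-grams would likely need to be processed one file at a time.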

It would actually be pretty cool to release the LM over Google Books Ngrams, because they are freely available.

Adam Pauls

Aug 20, 2012, 2:57:17 PM

to berkeleyl...@googlegroups.com

Interesting. One thing you (and pz) could do is make a script that converts the books corpus into the same format used by Web1T, and build the binaries yourself using the tools in BerkeleyLM. In the meantime, I am starting to download some of the books corpus and looking into building some binaries on my side.

Roman Prokofyev

Jul 11, 2014, 5:29:39 AM

to berkeleyl...@googlegroups.com

Hello, I see this is an old topic, but I'm also interested in building a language model from the Google Books N-gram corpus, so I was wondering how I could do it with BerkeleyLM.

For reference, the total size of the n-grams, aggregated over years and with POS tags excluded, is ~46GB. I have them in the following format:

word1 word2 ... \t COUNT

....

What is the format used by Web1T?

Thanks.

Adam Pauls

Jul 11, 2014, 1:10:25 PM

to berkeleyl...@googlegroups.com

There are pre-built binaries at http://tomato.banatao.berkeley.edu:8080/berkeleylm_binaries/. Do those do what you need?

--
You received this message because you are subscribed to the Google Groups "berkeleylm-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to berkeleylm-disc...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Roman Prokofyev

Jul 14, 2014, 7:14:29 AM

to berkeleyl...@googlegroups.com

Hi, thanks for the reply!

Actually I want to compare different language models so I want to build custom binaries myself.

I tried using the make-binary-from-google.sh script to build a bigram language model and it seemed to work.

The binary I got is ~1.2GB. Does that sound right?

Now I would like to evaluate the perplexity of this model. Is it correct that "ComputeLogProbabilityOfTextStream" computes it?
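For context, turning a total log probability into a perplexity is a small calculation, sketched below. The assumptions are labeled: the log base (10 here) and the token-counting convention (for example, whether end-of-sentence markers are counted) are guesses that should be checked against the tool's actual output, and `perplexity` is a hypothetical helper. Also note that scores from a stupid-backoff model are not normalized probabilities, so the result is better read as a comparative score than as a true perplexity.

```python
import math


def perplexity(total_log_prob, num_tokens, log_base=10.0):
    """Perplexity = base ** (-(total log probability) / (number of scored tokens)).

    total_log_prob: sum of per-token log probabilities over the whole text.
    num_tokens: number of tokens that were scored.
    log_base: base of the logarithm used by the scorer (assumed 10 here).
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return log_base ** (-total_log_prob / num_tokens)


# Three tokens with a total log10 probability of -6.0 (invented numbers),
# i.e. an average of 10**-2 per token.
print(perplexity(-6.0, 3))

# The same function with natural logs: two tokens, each with probability 0.25.
print(perplexity(2 * math.log(0.25), 2, log_base=math.e))
```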

Roman Prokofyev

Jul 14, 2014, 7:41:31 AM

to berkeleyl...@googlegroups.com

Hello again, sorry to bother you, but it seems that not everything is OK. I tried to compute the log probability and got the following error:

java -ea -mx5g -server -cp ./src/ edu.berkeley.nlp.lm.io.ComputeLogProbabilityOfTextStream google.binary brown.txt

Reading LM Binary google.binary {

} [19s]

Scoring file -; current log probability is 0.0 {

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2

at edu.berkeley.nlp.lm.map.CompressedNgramMap.getValueAndOffset(CompressedNgramMap.java:71)

at edu.berkeley.nlp.lm.StupidBackoffLm.getLogProb(StupidBackoffLm.java:59)

at edu.berkeley.nlp.lm.ArrayEncodedNgramLanguageModel$DefaultImplementations.getLogProb(ArrayEncodedNgramLanguageModel.java:70)

at edu.berkeley.nlp.lm.StupidBackoffLm.getLogProb(StupidBackoffLm.java:129)

at edu.berkeley.nlp.lm.io.ComputeLogProbabilityOfTextStream.computeProb(ComputeLogProbabilityOfTextStream.java:84)

at edu.berkeley.nlp.lm.io.ComputeLogProbabilityOfTextStream.main(ComputeLogProbabilityOfTextStream.java:64)

In brown.txt I have 1 sentence per line, as described.

Adam Pauls

Jul 14, 2014, 3:01:32 PM

to berkeleyl...@googlegroups.com

I can't help without seeing what the Google n-gram dir looked like, and what brown.txt looks like.

Roman Prokofyev

Jul 14, 2014, 3:30:29 PM

to berkeleyl...@googlegroups.com

Hello Adam,

thanks for the reply!

My google n-gram dir has the following structure:

berkeleylm/google_data/1gms:

total 105M

-rw-rw-r-- 1 roman roman 105M Jul 11 17:18 vocab_cs.gz

berkeleylm/google_data/2gms:

total 2.7G

-rw-rw-r-- 1 roman roman 2.7G Jul 11 17:16 2gm-0001

Both vocab_cs.gz and 2gm-0001 are tab-separated files, with the n-gram as the first value and the count as the second.
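A directory like the one listed above could be produced from aggregated counts with something along these lines. This is a sketch only: the file names mirror the listing (gzipped vocab_cs.gz, uncompressed 2gm-0001), but the sort orders are assumptions, `write_web1t_style` is a hypothetical helper, and whether BerkeleyLM accepts exactly this output has not been verified.

```python
import gzip
import os


def write_web1t_style(root, unigram_counts, bigram_counts):
    """Write 'ngram<TAB>count' files into root/1gms and root/2gms.

    unigram_counts: dict mapping word -> count.
    bigram_counts: dict mapping 'word1 word2' -> count.
    Returns the paths of the two files written.
    """
    os.makedirs(os.path.join(root, "1gms"), exist_ok=True)
    os.makedirs(os.path.join(root, "2gms"), exist_ok=True)

    # Vocabulary file, gzipped; sorted by descending count (assumed ordering).
    vocab_path = os.path.join(root, "1gms", "vocab_cs.gz")
    with gzip.open(vocab_path, "wt", encoding="utf-8") as f:
        for word, count in sorted(unigram_counts.items(), key=lambda kv: (-kv[1], kv[0])):
            f.write(f"{word}\t{count}\n")

    # Bigram file, plain text as in the listing; sorted by n-gram (assumed ordering).
    bigram_path = os.path.join(root, "2gms", "2gm-0001")
    with open(bigram_path, "w", encoding="utf-8") as f:
        for ngram, count in sorted(bigram_counts.items()):
            f.write(f"{ngram}\t{count}\n")

    return vocab_path, bigram_path
```

A call such as `write_web1t_style("google_data", {"the": 5, "bank": 2}, {"the bank": 2})` would create both subdirectories under `google_data` and write one file in each.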

brown.txt looks like the following

The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .\n

... sentences....

I don't think the problem is in brown.txt, since I also tried running it on a simple sentence from the example file using the "echo" command:

echo "This is a sample sentence ." | java -ea -mx5000m -server -cp ./src edu.berkeley.nlp.lm.io.ComputeLogProbabilityOfTextStream google.binary

Reading LM Binary google.binary {

} [14s]

Scoring file -; current log probability is 0.0 {

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2

at edu.berkeley.nlp.lm.map.CompressedNgramMap.getValueAndOffset(CompressedNgramMap.java:71)

at edu.berkeley.nlp.lm.StupidBackoffLm.getLogProb(StupidBackoffLm.java:59)

at edu.berkeley.nlp.lm.ArrayEncodedNgramLanguageModel$DefaultImplementations.getLogProb(ArrayEncodedNgramLanguageModel.java:70)

at edu.berkeley.nlp.lm.StupidBackoffLm.getLogProb(StupidBackoffLm.java:129)

at edu.berkeley.nlp.lm.io.ComputeLogProbabilityOfTextStream.computeProb(ComputeLogProbabilityOfTextStream.java:84)

at edu.berkeley.nlp.lm.io.ComputeLogProbabilityOfTextStream.main(ComputeLogProbabilityOfTextStream.java:64)

Roman Prokofyev

Jul 14, 2014, 4:47:04 PM

to berkeleyl...@googlegroups.com

OK, I think I got it: the library seems to be hard-coded to build 3-gram models, meaning that I need to create a "3gms" dir with some data.

Then it doesn't throw the exception.
