Is anyone using the Google n-grams?
(http://www.ldc.upenn.edu/Catalog/docs/LDC2009T08/README.utf8.english)
I have been trialling it as an alternative to live web searches to
find the usage frequencies of various words. It seems a viable
alternative in most cases. I have to get around the fact that many of
the "words" I am considering are not in the IPADIC lexicon and hence
have been split up by MeCab, which means they appear in the Google
files as bi-grams or tri-grams. My workaround has been to pass them
through MeCab and use the results to construct a search pattern
and select which Google file to search.
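In outline, the workaround is something like this (just a sketch: it assumes the
MeCab Python bindings, a local IPADIC matching the one used for the corpus, and
a guessed file-set naming, which would all need checking):

import MeCab

# -Owakati output is the sentence as space-separated surface forms, i.e.
# the same shape as the n-gram side of each line in the Google files.
tagger = MeCab.Tagger("-Owakati")

def ngram_query(word):
    """Segment a headword with MeCab/IPADIC and return the n-gram order
    plus the space-joined key to search for."""
    tokens = tagger.parse(word).strip().split()
    return len(tokens), " ".join(tokens)

n, key = ngram_query("何らかの見出し語")   # any headword from the list
# n says which file set to search (2gms, 3gms, ...); the naming here is
# a guess at the layout, not taken from the README.
fileset = "%dgms" % n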
Is anyone using particular tools to investigate the n-gram files?
Their size (e.g. the 2gm files total 1.6 GB) makes them a bit
indigestible. I have been using simple greps, and I'm curious
if anyone is using more sophisticated tools. I don't think I want
to load them into a database if I can avoid it.
Cheers
Jim
--
Jim Breen
Adjunct Senior Research Fellow
Clayton School of Information Technology,
Monash University, VIC 3800, Australia
http://www.csse.monash.edu.au/~jwb/
Tobias Hawker (Sydney) had a nice tool for efficiently querying the
English n-grams:
Using Contexts of One Trillion Words for WSD
<http://mandrake.csse.unimelb.edu.au/pacling2007/files/final/36/36_Paper_meta.pdf>
They basically make a big list of what they want to look up and then
look them all up at the same time. This can be done quite
efficiently.
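The idea, very roughly, in Python (a sketch only, not the tool from the paper;
it assumes the usual one n-gram per line with a tab-separated count):

def batch_counts(query_ngrams, ngram_files):
    """One pass per file: hold the queries in a hash set and collect the
    counts as the lines stream past."""
    wanted = set(query_ngrams)             # keys like "token1 token2"
    counts = {}
    for path in ngram_files:
        with open(path, encoding="utf-8") as f:
            for line in f:
                ngram, _, count = line.rstrip("\n").rpartition("\t")
                if ngram in wanted:
                    counts[ngram] = int(count)
    return counts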
--
Francis Bond <http://www2.nict.go.jp/x/x161/en/member/bond/>
NICT Language Infrastructure Group
>> Is anyone using particular tools to investigate the n-gram files?
> Tobias Hawker (Sydney) had a nice tool for efficiently querying the
> English n-grams:
>
> Using Contexts of One Trillion Words for WSD
> <http://mandrake.csse.unimelb.edu.au/pacling2007/files/final/36/36_Paper_meta.pdf>
>
> They basically make a big list of what they want to look up and then
> look them all up at the same time. This can be done quite
> efficiently.
Thanks for pointing that one out. I was at that PacLing, but didn't go to that
paper.
I had been thinking of one-pass approaches. I see Tobias used hash tables,
which might be a bit messy for me. I'm looking to search for about 240k
words (including inflected forms and orthographic variants), but since I
can sort them and compare with the already-sorted n-gram files, I can probably
do a glorified single-pass merge.
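Something along these lines is what I have in mind (a sketch only; it assumes
the files' sort order agrees with Python's default string comparison, which I
would need to check first):

def merged_counts(sorted_queries, ngram_file):
    """Single pass over one sorted n-gram file against a sorted query list:
    advance whichever side is behind."""
    counts = {}
    with open(ngram_file, encoding="utf-8") as f:
        line = f.readline()
        for q in sorted_queries:
            while line:
                ngram, _, count = line.rstrip("\n").rpartition("\t")
                if ngram < q:
                    line = f.readline()    # file is behind the query: read on
                    continue
                if ngram == q:
                    counts[q] = int(count)
                break                      # at or past the query: next query
    return counts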
A couple of MySQL solutions (possibly with better support for Japanese) are:
http://mysqlftppc.wiki.sourceforge.net/
http://qwik.jp/tritonn/download.html
I don't have direct experience with either, but they might be worth looking
at.
Tim