Is anyone using the Google n-grams?
(http://www.ldc.upenn.edu/Catalog/docs/LDC2009T08/README.utf8.english)
I have been trialling it as an alternative to live web searches to
find the usage frequencies of various words. It seems a viable
alternative in most cases. I have to get around the fact that many of
the "words" I am considering are not in the IPADIC lexicon and hence
have been split up by MeCab, which means they appear in the Google
files as bi-grams or tri-grams. My workaround has been to pass them
through MeCab and use the results to construct a search pattern
and select which Google file to search.
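In outline, the workaround is something like this (just a sketch: it assumes the
MeCab Python bindings, a local IPADIC matching the one used for the corpus, and
a guessed file-set naming, which would all need checking):

import MeCab

# -Owakati output is the sentence as space-separated surface forms, i.e.
# the same shape as the n-gram side of each line in the Google files.
tagger = MeCab.Tagger("-Owakati")

def ngram_query(word):
    """Segment a headword with MeCab/IPADIC and return the n-gram order
    plus the space-joined key to search for."""
    tokens = tagger.parse(word).strip().split()
    return len(tokens), " ".join(tokens)

n, key = ngram_query("何らかの見出し語")   # any headword from the list
# n says which file set to search (2gms, 3gms, ...); the naming here is
# a guess at the layout, not taken from the README.
fileset = "%dgms" % n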
Is anyone using particular tools to investigate the n-gram files?
Their size (e.g. the 2gm files total 1.6 GB) makes them a bit
indigestible. I have been using simple greps, and I'm curious
if anyone is using more sophisticated tools. I don't think I want
to load them into a database if I can avoid it.
Cheers
Jim
--
Jim Breen
Adjunct Senior Research Fellow
Clayton School of Information Technology,
Monash University, VIC 3800, Australia
http://www.csse.monash.edu.au/~jwb/
Tobias Hawker (Sydney) had a nice tool for efficiently querying the
English n-grams:
Using Contexts of One Trillion Words for WSD
<http://mandrake.csse.unimelb.edu.au/pacling2007/files/final/36/36_Paper_meta.pdf>
They basically make a big list of what they want to look up and then
look them all up at the same time. This can be done quite
efficiently.
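The idea, very roughly, in Python (a sketch only, not the tool from the paper;
it assumes the usual one n-gram per line with a tab-separated count):

def batch_counts(query_ngrams, ngram_files):
    """One pass per file: hold the queries in a hash set and collect the
    counts as the lines stream past."""
    wanted = set(query_ngrams)             # keys like "token1 token2"
    counts = {}
    for path in ngram_files:
        with open(path, encoding="utf-8") as f:
            for line in f:
                ngram, _, count = line.rstrip("\n").rpartition("\t")
                if ngram in wanted:
                    counts[ngram] = int(count)
    return counts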
--
Francis Bond <http://www2.nict.go.jp/x/x161/en/member/bond/>
NICT Language Infrastructure Group
>> Is anyone using particular tools to investigate the n-gram files?
> Tobias Hawker (Sydney) had a nice tool for efficiently querying the
> English n-grams:
>
> Using Contexts of One Trillion Words for WSD
> <http://mandrake.csse.unimelb.edu.au/pacling2007/files/final/36/36_Paper_meta.pdf>
>
> They basically make a big list of what they want to look up and then
> look them all up at the same time. This can be done quite
> efficiently.
Thanks for pointing that one out. I was at that PacLing, but didn't go to that
paper.
I had been thinking of one-pass approaches. I see Tobias used hash tables,
which might be a bit messy for me. I'm looking to search for about 240k
words (including inflected forms and orthographic variants), but since I
can sort them and compare with the already-sorted n-gram files, I can probably
do a glorified single-pass merge.
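Something along these lines is what I have in mind (a sketch only; it assumes
the files' sort order agrees with Python's default string comparison, which I
would need to check first):

def merged_counts(sorted_queries, ngram_file):
    """Single pass over one sorted n-gram file against a sorted query list:
    advance whichever side is behind."""
    counts = {}
    with open(ngram_file, encoding="utf-8") as f:
        line = f.readline()
        for q in sorted_queries:
            while line:
                ngram, _, count = line.rstrip("\n").rpartition("\t")
                if ngram < q:
                    line = f.readline()    # file is behind the query: read on
                    continue
                if ngram == q:
                    counts[q] = int(count)
                break                      # at or past the query: next query
    return counts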
A couple of MySQL solutions (possibly with better support for Japanese) are:
http://mysqlftppc.wiki.sourceforge.net/
http://qwik.jp/tritonn/download.html
I don't have direct experience with either, but they might be worth looking
at.
Tim