Find most frequent terms

Valentin Tablan

Feb 22, 2013, 9:29:44 AM
to mg...@googlegroups.com
Hi,

Is there a way to (quickly) find the global count for a given term? I
would rather not have to iterate the whole postings list to accumulate
the counts.

My use case is that I need to find the top X most frequently occurring
terms, so I would need to get the counts for the whole dictionary, hence
my reluctance to iterate the whole index. I don't just need the term
frequency, but the actual sum of its counts.

There seem to be some files in an MG4J index that are invitingly called
???.globcounts. Do they contain what I think they do, and how can one
open them?

thanks,
Valentin

Sebastiano Vigna

Feb 22, 2013, 9:32:41 AM
to mg...@googlegroups.com

On 22 Feb 2013, at 3:29 PM, Valentin Tablan <v.ta...@gmail.com> wrote:

> Is there a way to (quickly) find the global count for a given term? I
> would rather not have to iterate the whole postings list to accumulate
> the counts.

OK. This can mean

0) The frequency (number of documents in which the term appears)
1) The occurrency (the total number of occurrences of the term in the whole collection).
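
To make the distinction concrete, here is a toy sketch in plain Java. It is not MG4J code: the class `TermStats` and the map-based postings representation (document id mapped to the within-document count) are assumptions made purely for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TermStats {
    /** Frequency: the number of documents that contain the term at least once. */
    public static int frequency(Map<Integer, Integer> postings) {
        return postings.size();
    }

    /** Occurrency: the sum of the within-document counts over all documents. */
    public static long occurrency(Map<Integer, Integer> postings) {
        long sum = 0;
        for (int count : postings.values()) sum += count;
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical postings for one term: document id -> count in that document.
        Map<Integer, Integer> postings = new LinkedHashMap<>();
        postings.put(0, 3);
        postings.put(4, 1);
        postings.put(7, 5);
        System.out.println(frequency(postings));  // prints 3
        System.out.println(occurrency(postings)); // prints 9
    }
}
```

Computing 1) this way means walking the whole postings list, which is exactly the cost a precomputed per-term file avoids.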

If you want 0), of course IndexIterator has a .frequency() method. If you want 1), it's in .globcounts (but that's the old name; the files are now called .occurrencies), gamma-coded, one value per term. You can use SemiExternalGammaList to access the data in a relatively speedy manner. I should add a cache to that stuff...
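
Once the per-term occurrencies are available, picking the top X terms does not require sorting the whole dictionary. The following is a minimal sketch in plain Java, assuming the counts have already been read into a `long[]` indexed by term number (for instance by walking a SemiExternalGammaList; that wiring is deliberately not shown here). The class `TopTerms` and the method `topK` are hypothetical names, not part of MG4J.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class TopTerms {
    /** Returns the term numbers of the k largest counts, largest first. */
    public static List<Integer> topK(long[] counts, int k) {
        List<Integer> result = new ArrayList<>();
        if (k <= 0) return result;
        // Min-heap ordered by count: the root is the weakest current candidate.
        PriorityQueue<Integer> heap =
            new PriorityQueue<>((a, b) -> Long.compare(counts[a], counts[b]));
        for (int term = 0; term < counts.length; term++) {
            if (heap.size() < k) {
                heap.add(term);
            } else if (counts[term] > counts[heap.peek()]) {
                heap.poll();    // evict the weakest candidate
                heap.add(term); // and keep the stronger term instead
            }
        }
        // The heap yields candidates in ascending order; reverse to get largest first.
        while (!heap.isEmpty()) result.add(heap.poll());
        Collections.reverse(result);
        return result;
    }

    public static void main(String[] args) {
        long[] counts = {5, 42, 7, 30, 1, 19}; // made-up occurrencies for six terms
        System.out.println(topK(counts, 3));   // prints [1, 3, 5]
    }
}
```

The heap never holds more than k entries, so a single pass over the counts costs O(n log k) time and O(k) extra memory. The returned term numbers would still have to be mapped back to term strings through the index's term list.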

Ciao,

seba
