Mapping terms across partitions


Andrea Cardaci

Mar 10, 2017, 2:40:05 PM
to MG4J

I'm using some of the MG4J tools as the first steps in my processing pipeline, which is more or less like this:

1. start from the actual documents;
2. mg4j.tool.IndexBuilder;
3. mg4j.tool.PartitionDocumentally (or PartitionLexically);
4. a couple of other steps using ad hoc tools that are used to transform the resulting index (should not matter in this context).

The big picture is that I'm building a distributed IR framework for testing purposes, so I want to be able to query the partitions using term ids from the monolithic index. But after the third step what I obtain are completely independent indexes, in which the relationship with the original one is lost.

I think this problem is mentioned in [1], but the solution is not clear to me. So far I have only used the tools bundled with MG4J; I haven't really dug into the library yet.

Thanks in advance,


Mar 14, 2017, 6:25:20 AM
to MG4J

Well... Why do you say that? If you partitioned lexically, you have the strategy you used, so you can get to the right index. If you partitioned documentally, the tool can create approximate dictionaries (Bloom filters) that will tell you which sub-indices contain a certain term. Of course, starting from the term lists you can build a different dictionary (e.g., a GOV3 function from Sux4J).
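
To make the last suggestion concrete, here is a minimal sketch of building an exact dictionary from the term lists. It is only an illustration under assumptions: it takes the term lists as in-memory lists in which a term's position is its id (how you load them from the files written by the tools is up to you), it uses a plain `HashMap` rather than a GOV3 function from Sux4J, and `TermIdMapper` and `mapGlobalToLocal` are hypothetical names, not part of MG4J.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TermIdMapper {

    /**
     * Returns a table in which entry [g][p] is the local id of global term g
     * inside partition p, or -1 if that partition does not contain the term.
     * Assumption: in every list, the position of a term is its term id.
     */
    public static int[][] mapGlobalToLocal(List<String> globalTerms,
                                           List<List<String>> partitionTerms) {
        // One exact dictionary per partition: term -> local term id.
        List<Map<String, Integer>> dictionaries = new ArrayList<>();
        for (List<String> terms : partitionTerms) {
            Map<String, Integer> dictionary = new HashMap<>();
            for (int localId = 0; localId < terms.size(); localId++) {
                dictionary.put(terms.get(localId), localId);
            }
            dictionaries.add(dictionary);
        }

        // Look up every global term in every partition dictionary.
        int[][] mapping = new int[globalTerms.size()][partitionTerms.size()];
        for (int globalId = 0; globalId < globalTerms.size(); globalId++) {
            String term = globalTerms.get(globalId);
            for (int p = 0; p < dictionaries.size(); p++) {
                mapping[globalId][p] = dictionaries.get(p).getOrDefault(term, -1);
            }
        }
        return mapping;
    }

    public static void main(String[] args) {
        // Toy data standing in for the monolithic and per-partition term lists.
        List<String> global = Arrays.asList("apple", "banana", "cherry", "date");
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("apple", "cherry"),
                Arrays.asList("banana", "cherry", "date"));

        int[][] mapping = mapGlobalToLocal(global, partitions);
        for (int globalId = 0; globalId < mapping.length; globalId++) {
            System.out.println(globalId + " (" + global.get(globalId) + ") -> "
                    + Arrays.toString(mapping[globalId]));
        }
    }
}
```

With the table in hand, a query expressed in monolithic term ids can be translated per partition before being dispatched. For large vocabularies the full table and hash maps may use too much memory, which is where a compact static function such as the one mentioned above would come in.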