Re: Digest for mg4j@googlegroups.com - 1 Message in 1 Topic

Dmitri Portnov

Feb 13, 2013, 1:35:53 PM
to mg...@googlegroups.com
You could use Concatenate to produce a final index from all the portions. In the following code sample I merge the batches produced by our system (these are different from the batches produced by mg4j itself).
HTH
...
public boolean mergeIndex(final IndexingContext ic) throws ConfigurationException, ClassNotFoundException,
        SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException,
        NoSuchMethodException, IOException, URISyntaxException {
    boolean result = false;
    final Path previousMergeFolder = ic.getLastValidMergedFolder();
    final LongList batchesToMerge = ic.getBatchesToMerge();
    if (batchesToMerge == null || batchesToMerge.isEmpty()) {
        LOG.info("No batches found to merge");
        return true;
    }
    final Path mg4jIndexDir = ic.getNextMergedFolder();
    FileUtils7.prepareCleanDirectory(mg4jIndexDir);
    LOG.debug("Merge semantic index of batch[es]: {} into {}", batchesToMerge, mg4jIndexDir);
    final Path outputBasename = mg4jIndexDir.resolve("output-");
    // The previously merged index (if any) goes first, followed by the new batches.
    final int numberOfPathsToMerge = batchesToMerge.size() + (previousMergeFolder != null ? 1 : 0);
    int i = 0;
    final Path[] pathsToMerge = new Path[numberOfPathsToMerge];
    if (previousMergeFolder != null) {
        pathsToMerge[i++] = previousMergeFolder;
    }
    for (final long batchId : batchesToMerge) {
        pathsToMerge[i++] = fs.getBatchDirPath(batchId);
    }
    LOG.info("Concatenating old indexes and newly created for batches {}", batchesToMerge);
    final Mg4jFields[] fields = Mg4jFields.values();
    try {
        // Each field has its own index, so the fields are concatenated one at a time.
        for (final Mg4jFields field : fields) {
            final String name = field.getIndexName();
            LOG.info("Merging field {}", name);
            final String[] inputFieldNameBases = new String[numberOfPathsToMerge];
            i = 0;
            for (final Path path : pathsToMerge) {
                inputFieldNameBases[i++] = path.resolve("output-" + name).toString();
            }
            final String outputFieldNameBase = outputBasename + name;
            final int combineBufferSize = Combine.DEFAULT_BUFFER_SIZE;
            // TODO DVP restrict input paths array size (process by portions) to avoid reaching limit of open files in
            // process (this didn't happen so far)
            if (field.getFieldType() == DocumentFactory.FieldType.TEXT) {
                new Concatenate(
                        outputFieldNameBase,
                        inputFieldNameBases,
                        NOT_METADATA_ONLY,
                        combineBufferSize,
                        field.getWriterFlags(),
                        IndexType.QUASI_SUCCINCT,
                        SKIPS,
                        QUANTUM,
                        HEIGHT,
                        SKIP_BUFFER_SIZE,
                        LOG_INTERVAL).run();
            }
            else {
                new Concatenate(
                        outputFieldNameBase,
                        inputFieldNameBases,
                        NOT_METADATA_ONLY,
                        combineBufferSize,
                        field.getWriterFlags(),
                        IndexType.INTERLEAVED,
                        SKIPS,
                        QUANTUM,
                        HEIGHT,
                        SKIP_BUFFER_SIZE,
                        LOG_INTERVAL).run();
            }
            // Build and store the term map from the term list of the merged index.
            final String termsFileName = outputFieldNameBase + DiskBasedIndex.TERMS_EXTENSION;
            BinIO.storeObject(
                    StringMaps.synchronize(TERM_MAP_CLASS.getConstructor(Iterable.class).newInstance(
                            new FileLinesCollection(termsFileName, "UTF-8") //
                            ) //
                    ), //
                    outputFieldNameBase + DiskBasedIndex.TERMMAP_EXTENSION //
            );
            LOG.debug("Created term maps (class: {}) for field {}", TERM_MAP_CLASS.getSimpleName(),
                    field.getIndexName());

        }
        LOG.info("All the index fields are successfully merged...");
        LOG.info("Saving resulting sentence ids");
        Mg4jFSUtils.append(mg4jIndexDir.resolve(Mg4jFSConfig.SENTENCE_IDS_BIN),
                Mg4jFSUtils.resolve(pathsToMerge, Mg4jFSConfig.SENTENCE_IDS_BIN));
        result = true;
        LOG.info("Index is successfully built");
    }
    finally {
        // Do not leave a half-written index behind if anything failed.
        if (!result) {
            FileUtils7.forceDelete(mg4jIndexDir);
        }
    }
    return result;
}
...
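
Regarding the TODO in the sample about restricting the size of the input paths array: one way to do it could be to split the input basenames into consecutive portions, concatenate each portion into an intermediate index, and then concatenate the intermediates in the same order. Below is a rough sketch of the splitting step only, using plain JDK classes. MergePortions, partition and maxInputs are hypothetical names that are not part of mg4j or of our code, and I am assuming the order of the inputs has to be preserved because Concatenate appends the indices one after another.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not part of mg4j): splits merge inputs into ordered portions. */
final class MergePortions {

    private MergePortions() {
    }

    /**
     * Splits the input basenames into consecutive portions of at most maxInputs
     * entries each. The relative order of the inputs is preserved.
     */
    static List<List<String>> partition(final List<String> inputBasenames, final int maxInputs) {
        // With fewer than 2 inputs per portion, repeated merge rounds would never shrink the list.
        if (maxInputs < 2) {
            throw new IllegalArgumentException("maxInputs must be at least 2");
        }
        final List<List<String>> portions = new ArrayList<>();
        for (int from = 0; from < inputBasenames.size(); from += maxInputs) {
            final int to = Math.min(from + maxInputs, inputBasenames.size());
            // Copy the sublist so each portion is independent of the original list.
            portions.add(new ArrayList<>(inputBasenames.subList(from, to)));
        }
        return portions;
    }

    public static void main(final String[] args) {
        // Made-up basenames, for illustration only.
        final List<String> inputs = Arrays.asList(
                "merged/output-text", "batch1/output-text", "batch2/output-text",
                "batch3/output-text", "batch4/output-text");
        for (final List<String> portion : partition(inputs, 2)) {
            // Each portion would be passed to one Concatenate run as its input basenames.
            System.out.println(portion);
        }
    }
}
```

Each portion would then go to its own Concatenate run with an intermediate output basename. If the number of intermediates is still above the limit, the same splitting can be applied again to the intermediate basenames.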


On Mon, Feb 11, 2013 at 8:48 AM, <mg...@googlegroups.com> wrote:

Group: http://groups.google.com/group/mg4j/topics

    Alireza Noori <alirezan...@gmail.com> Feb 10 09:03AM -0800  

    I have multiple collections of data. Each one is updated at different
    intervals. I was wondering whether I can index each one to a different
    index and then when I'm running a query, use them all. If this runs exactly
    like the regular query process, I'm guessing this would help me get far
    better performance.
     
    If it's possible, please tell me how. I want to use the source code (rather
    than using the compiled .jar). So please, tell me how to do it via Java
    code.
     
    Thanks in advance.

     


--
You received this message because you are subscribed to the Google Groups "MG4J" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mg4j+uns...@googlegroups.com.
To post to this group, send email to mg...@googlegroups.com.
Visit this group at http://groups.google.com/group/mg4j?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.
 
 
