Hi all,
I am wondering what typical index sizes would be for the Lucene indices, because I keep running into storage problems. My experimental setup is an ExperimentTrainTestStore batch task with 7 unit FEs in ablation mode, so there are 8 sub-tasks ("8" because of the 7 FEs plus 1 setup with no FE dropped). I am using LuceneNGramUFE as one of the FEs, extracting the top 1000 unigrams, bigrams, and trigrams from the source text. The text itself has a word count of approx. 60k.
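For reference, this is roughly how the n-gram FE is parameterised in my setup (DKPro TC style). The package and constant names below are written down from memory and may not match the exact version I am running, so please read it as a sketch of the configuration rather than the literal code:

import java.util.Arrays;
import java.util.List;

import de.tudarmstadt.ukp.dkpro.lab.task.Dimension;
import de.tudarmstadt.ukp.dkpro.tc.core.Constants;
import de.tudarmstadt.ukp.dkpro.tc.features.ngram.LuceneNGramUFE;

// Sketch only: class/constant names are from memory and may differ
// between DKPro TC versions.
public class NGramConfigSketch implements Constants {

    // Dimension holding the n-gram parameters: top 1000 n-grams,
    // n = 1..3, as described above.
    public static Dimension<List<Object>> nGramParams() {
        return Dimension.create(DIM_PIPELINE_PARAMS, Arrays.<Object> asList(
                LuceneNGramUFE.PARAM_NGRAM_USE_TOP_K, 1000,
                LuceneNGramUFE.PARAM_NGRAM_MIN_N, 1,
                LuceneNGramUFE.PARAM_NGRAM_MAX_N, 3));
    }
}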
Now, the problem is that the Lucene index is about 1.1 GB per task, and there are 8 MetaTasks, so this takes up nearly 9 GB of disk space for the MetaTasks alone. I am running several similar experiments in parallel on the same storage, and the available disk space fills up rather quickly.
So, the question basically is: are my index sizes way off, or is this range to be expected for this kind of text volume? I don’t have much experience with Lucene, only a bit of background with Solr.