Typical index sizes for Lucene-based meta collectors

Martin Wunderlich

Sep 6, 2015, 6:48:43 PM
to dkpro-t...@googlegroups.com
Hi all, 

I am wondering what typical index sizes would be for the Lucene indices, because I keep running into storage problems. My experimental setup is an ExperimentTrainTestStore batch task with 7 unit FEs in ablation mode, so there are 8 sub tasks (8 because of the 7 FEs plus 1 setup with no FEs dropped). I am using LuceneNGramUFE as one of the FEs, with the top 1000 unigrams, bigrams and trigrams extracted from the source text. The text itself has a word count of approx. 60k.

Now, the problem arises from the fact that the Lucene index is about 1.1 GB per task, and there are 8 MetaTasks, so this takes up nearly 9 GB of disk space for the MetaTasks alone. I am running several similar experiments on the same storage in parallel, so the available disk space fills up rather quickly.
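As a quick back-of-the-envelope check of the numbers above (a sketch using only the figures from this post, not DKPro TC itself): an ablation study over n feature extractors produces n + 1 sub tasks, each with its own meta-task index.

```python
# Rough disk-usage estimate for the meta tasks of an ablation study:
# n feature extractors -> n + 1 sub tasks (each FE dropped once,
# plus one setup with no FEs dropped), each holding one Lucene index.
def meta_task_disk_gb(num_feature_extractors: int, index_size_gb: float) -> float:
    num_sub_tasks = num_feature_extractors + 1
    return num_sub_tasks * index_size_gb

# 7 unit FEs with ~1.1 GB of index per task -> ~8.8 GB,
# matching the "nearly 9 GB" observed in practice.
print(meta_task_disk_gb(7, 1.1))
```

Running several such experiments in parallel multiplies this total accordingly, which is why the disk fills up so quickly.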

So, the question basically is: are my index sizes way off, or is this range to be expected for this kind of text volume? I don’t have much experience with Lucene, only a bit of background with Solr.

Cheers, 

Martin
 