Hi all,
I am wondering what typical index sizes would be for the Lucene indices, because I keep running into storage problems. My experimental setup is an ExperimentTrainTestStore batch task with 7 unit FEs in ablation mode, so there are 8 sub-tasks ("8" because of the 7 FEs plus 1 setup with no FE dropped). I am using LuceneNGramUFE as one of the FEs, extracting the top 1000 unigrams, bigrams, and trigrams from the source text. The text itself has a word count of approx. 60k.
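For reference, this is roughly how the n-gram FE is parameterised in my setup (DKPro TC style). The package and constant names below are written down from memory and may not match the exact version I am running, so please read it as a sketch of the configuration rather than the literal code:

import java.util.Arrays;
import java.util.List;

import de.tudarmstadt.ukp.dkpro.lab.task.Dimension;
import de.tudarmstadt.ukp.dkpro.tc.core.Constants;
import de.tudarmstadt.ukp.dkpro.tc.features.ngram.LuceneNGramUFE;

// Sketch only: class/constant names are from memory and may differ
// between DKPro TC versions.
public class NGramConfigSketch implements Constants {

    // Dimension holding the n-gram parameters: top 1000 n-grams,
    // n = 1..3, as described above.
    public static Dimension<List<Object>> nGramParams() {
        return Dimension.create(DIM_PIPELINE_PARAMS, Arrays.<Object> asList(
                LuceneNGramUFE.PARAM_NGRAM_USE_TOP_K, 1000,
                LuceneNGramUFE.PARAM_NGRAM_MIN_N, 1,
                LuceneNGramUFE.PARAM_NGRAM_MAX_N, 3));
    }
}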
Now, the problem is that the Lucene index is about 1.1 GB per task, and there are 8 MetaTasks, so this takes up nearly 9 GB of disk space for the MetaTasks alone. I am running several similar experiments in parallel on the same storage, and the available disk space fills up rather quickly.
So, the question basically is: are my index sizes way off, or is this range to be expected for this kind of text volume? I don’t have much experience with Lucene, only a bit of background with Solr.