Hi all,
I am ingesting data into Druid. The dataset has 8 attributes, including the timestamp. The ingestion succeeds, but the resulting index is larger than the original data. Here are the details:
Data size: 200 GB (gzip-compressed)
Index size: 358 GB
Number of segments: ~5000
targetPartitionSize: 1000000
Ingestion method: Hadoop
Job properties: { "mapred.max.split.size": 128000000, "mapred.reduce.tasks": 100, "mapreduce.reduce.memory.mb": 15240, "mapreduce.reduce.java.opts": "-Xmx15240m", "mapreduce.task.timeout": 18000000, "mapreduce.task.userlog.limit.kb": 0 }
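For reference, these settings sit in the tuningConfig of my Hadoop ingestion spec, roughly like the trimmed sketch below (other tuningConfig fields such as indexSpec are left at their defaults and omitted here, and the partitionsSpec type is shown as hashed, the default):

  "tuningConfig": {
    "type": "hadoop",
    "partitionsSpec": {
      "type": "hashed",
      "targetPartitionSize": 1000000
    },
    "jobProperties": {
      "mapred.max.split.size": 128000000,
      "mapred.reduce.tasks": 100,
      "mapreduce.reduce.memory.mb": 15240,
      "mapreduce.reduce.java.opts": "-Xmx15240m",
      "mapreduce.task.timeout": 18000000,
      "mapreduce.task.userlog.limit.kb": 0
    }
  }

(The jobProperties map is the same one listed above, just reformatted for readability.)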
Actually, I experimented with targetPartitionSize. According to the Druid documentation, a larger targetPartitionSize means fewer reducers, but with a larger value the reducers were failing with OutOfMemoryError. So I decreased targetPartitionSize to 1 million. Now the ingestion completes, but the index ends up with ~5000 segments and takes 358 GB. The index being larger than the input data is what makes me suspicious. Please suggest what I should look at.