Index size larger than data size

Anil Verma

Feb 14, 2017, 6:57:42 AM

to Druid Development

Hi all,
I am ingesting data into Druid. The data has 8 attributes, including the timestamp. The ingestion succeeds, but the index is larger than the original data. Here are the other details:
Data size: 200 GB (gzipped)
Index: 358 GB
Number of segments: ~5000
targetPartitionSize : 1000000
Ingestion method: Hadoop
Job Properties: {"mapred.max.split.size":128000000,"mapred.reduce.tasks":100,"mapreduce.reduce.memory.mb":15240,"mapreduce.reduce.java.opts":"-Xmx15240m","mapreduce.task.timeout":18000000,"mapreduce.task.userlog.limit.kb":0 }

I experimented with targetPartitionSize. According to the Druid documentation, a larger targetPartitionSize results in fewer reducers. However, with a larger value the reducers were throwing OutOfMemory exceptions, so I decreased targetPartitionSize to 1 million. Now the ingestion completes, but the index has ~5000 segments and takes up 358 GB of space. An index larger than the input data looks suspicious to me. Please suggest what could be wrong.
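
For reference, the figures quoted above work out as follows. This is only a back-of-the-envelope sketch: the function name `index_stats` is made up for illustration, 1 GB is assumed to be 1000 MB, and the segment count is the approximate "~5000" from the post. Note that the 200 GB input is gzipped, so the ratio compares the index against compressed input, not against the raw data.

```python
def index_stats(data_gb, index_gb, segments):
    """Return (index-to-input ratio, average segment size in MB).

    Hypothetical helper for illustration; assumes 1 GB = 1000 MB.
    """
    ratio = index_gb / data_gb
    avg_segment_mb = index_gb * 1000 / segments
    return ratio, avg_segment_mb


# Figures from the post: 200 GB gzipped input, 358 GB index, ~5000 segments.
ratio, avg_mb = index_stats(200, 358, 5000)
print(f"index / gzipped input: {ratio:.2f}x")
print(f"average segment size: {avg_mb:.1f} MB")
```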

Gian Merlino

Feb 23, 2017, 2:19:08 AM

to druid-de...@googlegroups.com

Hey Anil,

Druid segments have some features that can compress the data (like rollup, dictionary encoding, and lz4 compression) and some features that can expand it (like bitmap indexes). For some datasets the "expand" side wins out, and yours may be one of them. If I had to guess, one of your attributes is probably high cardinality, and this (a) limits Druid's ability to do rollup, and (b) generates a large bitmap index. Druid will still work fine, but the compression is not as effective.

Gian
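
The effect Gian describes can be illustrated with a toy model. The sketch below is not Druid code and uses no Druid API: it assumes a single dimension, timestamps truncated to the hour, rollup modelled as collapsing identical (hour, value) pairs, and one bitmap per distinct dimension value. The function names and the row and cardinality figures are invented for the illustration.

```python
import random
from collections import Counter


def make_rows(n, cardinality, hours, seed=0):
    """Generate n synthetic (hour, dimension_value) rows."""
    rng = random.Random(seed)
    return [(rng.randrange(hours), rng.randrange(cardinality)) for _ in range(n)]


def rollup_and_bitmaps(rows):
    """Return (rows stored after rollup, number of bitmaps needed).

    Toy assumptions: rollup merges rows with an identical (hour, value)
    pair, and the bitmap index holds one bitmap per distinct value.
    """
    rolled = Counter(rows)  # each distinct (hour, value) pair becomes one stored row
    distinct_values = {value for _, value in rolled}
    return len(rolled), len(distinct_values)


# Same number of input rows, very different dimension cardinality.
low = rollup_and_bitmaps(make_rows(100_000, cardinality=10, hours=24))
high = rollup_and_bitmaps(make_rows(100_000, cardinality=1_000_000, hours=24))

print("low cardinality  (stored rows, bitmaps):", low)
print("high cardinality (stored rows, bitmaps):", high)
```

With the low-cardinality dimension, at most 24 x 10 = 240 rows survive rollup and at most 10 bitmaps are needed. With the high-cardinality dimension, almost every input row is unique, so rollup removes very little and the number of bitmaps is close to the number of rows.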
