Single dimension partitioning

james...@optimizely.com

Feb 25, 2016, 2:29:44 PM

to Druid User

Hi all,

In the docs for the PartitionsSpec, under "single-dimension partitioning" it says "Each segment will contain all rows with values of that dimension in that range."

Is there any way to store data in segments with this rule, but also split across multiple shards if the segment becomes too large? If I lowered the values for targetPartitionSize or maxPartitionSize, would this take precedence and force another shard with the same dimension values?

I am encountering a problem where a large increase in one customer's data (we are partitioning by the customer's id) results in uneven shard sizes, to the point that the Druid indexing jobs are failing due to memory issues.
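To make the skew concrete, here is a minimal Python sketch of range partitioning where a dimension value may never be split across partitions. It is an illustration only, not Druid's actual partitioning algorithm; the function name `range_partitions`, the customer ids, and the row counts are all made up for the example.

```python
from typing import Dict, List, Tuple


def range_partitions(rows_per_value: Dict[str, int],
                     target_rows: int) -> List[Tuple[List[str], int]]:
    """Greedily pack sorted dimension values into partitions of about target_rows.

    A value is never split across partitions, mirroring the single-dimension rule.
    """
    partitions: List[Tuple[List[str], int]] = []
    current_values: List[str] = []
    current_rows = 0
    for value in sorted(rows_per_value):
        count = rows_per_value[value]
        # Close the current partition if adding this value would overshoot the target.
        if current_values and current_rows + count > target_rows:
            partitions.append((current_values, current_rows))
            current_values, current_rows = [], 0
        current_values.append(value)
        current_rows += count
    if current_values:
        partitions.append((current_values, current_rows))
    return partitions


# Hypothetical row counts per customer id; "c3" is the customer whose volume spiked.
counts = {"c1": 400, "c2": 300, "c3": 5000, "c4": 200, "c5": 350}
for values, rows in range_partitions(counts, target_rows=1000):
    print(values, rows)
```

However small the target is, the partition holding "c3" cannot shrink below that customer's own row count, which is the kind of imbalance described above.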

I realize hash-based partitioning is recommended for just this reason, but we decided to partition by customer id because the access patterns usually come in streams by customer id and we wanted to optimize for query performance and minimize data load into memory.

Thanks,

James

Gian Merlino

Feb 25, 2016, 4:05:00 PM

to druid...@googlegroups.com

Hey James,

No, that is not currently an option. With single dimension partitioning, all rows with a particular value for that dimension need to go into the same segment. I think this'd be a reasonable option to add if you want to go that way. Or you could try some workarounds:

- Isolate this customer into its own dataSource.

- Create a new dimension called something like "partition_key", and make that customer id + a random suffix (or, just add the suffix for this customer). This should still keep the rows for each customer generally together, but allow them to be split across segments. This will be a high cardinality column though and will increase the size of your segments somewhat.
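The second workaround could be sketched in Python roughly as follows, as a preprocessing step applied to each row before indexing. This is an assumption-laden illustration rather than anything Druid provides: the `SALT_BUCKETS` table, the bucket count of 8, the `_NNN` suffix format, and the `partition_key` function name are all hypothetical.

```python
import random
from typing import Dict, Optional

# Hypothetical: customers known to be large, mapped to the number of
# suffix buckets their rows should be spread over.
SALT_BUCKETS: Dict[str, int] = {"big_customer": 8}

_DEFAULT_RNG = random.Random()


def partition_key(customer_id: str, rng: Optional[random.Random] = None) -> str:
    """Return customer_id plus a zero-padded suffix.

    Salted customers get a random suffix in [0, buckets); everyone else
    always gets suffix 000, so their rows still share a single key.
    """
    if rng is None:
        rng = _DEFAULT_RNG
    buckets = SALT_BUCKETS.get(customer_id, 1)
    suffix = rng.randrange(buckets)
    return f"{customer_id}_{suffix:03d}"


# Example: attach the key to a row before it is handed to the indexing job.
row = {"customer_id": "big_customer", "event": "click"}
row["partition_key"] = partition_key(row["customer_id"])
print(row["partition_key"])
```

Because the suffix is appended to the customer id, keys for one customer sort next to each other, so range partitioning on `partition_key` can still keep most of a customer's rows in neighbouring partitions while allowing a split between suffixes.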

Gian

--
You received this message because you are subscribed to the Google Groups "Druid User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-user+...@googlegroups.com.
To post to this group, send email to druid...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-user/3b31f866-a272-472a-a11e-e4e8d0b554b4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

james...@optimizely.com

Feb 25, 2016, 9:30:25 PM

to Druid User

Thanks for the reply Gian.

One follow-up question: Our segment granularity is currently set to "DAY". If we were to switch that to "HOURLY", that would automatically split up the segments into 24, correct? Are there any downsides to doing this on the fly without re-ingesting previous data ingested at "DAY" granularity? Would this affect queries on old "DAY" granularity data?

Gian Merlino

Feb 25, 2016, 9:46:09 PM

to druid...@googlegroups.com

That'd work fine. DAY and HOUR can coexist in the same datasource.
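As a rough illustration of why the switch gives 24 time chunks per day, the Python sketch below truncates timestamps to day or hour boundaries and counts the distinct buckets. It only models the time bucketing, not Druid itself; the `time_chunks` helper and the sample events are hypothetical.

```python
from datetime import datetime, timezone
from typing import Iterable, Set


def time_chunks(timestamps: Iterable[datetime], granularity: str) -> Set[datetime]:
    """Return the distinct time chunks the timestamps fall into."""
    chunks: Set[datetime] = set()
    for ts in timestamps:
        if granularity == "DAY":
            chunks.add(ts.replace(hour=0, minute=0, second=0, microsecond=0))
        elif granularity == "HOUR":
            chunks.add(ts.replace(minute=0, second=0, microsecond=0))
        else:
            raise ValueError(f"unsupported granularity: {granularity}")
    return chunks


# One hypothetical event every 10 minutes over a single UTC day.
events = [
    datetime(2016, 2, 25, hour, minute, tzinfo=timezone.utc)
    for hour in range(24)
    for minute in range(0, 60, 10)
]
print(len(time_chunks(events, "DAY")))   # 1
print(len(time_chunks(events, "HOUR")))  # 24
```

Within each hourly chunk the single-dimension rule still applies, so a heavy customer's rows for that hour stay together; the partitions only get smaller because each one covers less time.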

Gian

james...@optimizely.com

Feb 25, 2016, 10:03:30 PM

to Druid User

Cool, probably gonna do that as a short term fix to buy us some time to implement the partition key solution.

Thanks!

James

james...@optimizely.com

May 2, 2016, 9:54:02 PM

to Druid User

Hi Gian,

I was able to get a partition_key dimension into our datasource and re-ran the batch indexing job with single dimension partitioning enabled on this partition_key dimension. I ran a few queries on this datasource with "bySegment" set to true to see all the segments the query was hitting. I noticed it was still hitting all shards of a segment, despite the relevant data being only located on one of the shards. Is this correct behavior? I do include the partition_key dimension in my query as a filter, so I assumed it would be able to limit the shards for each segment it has to hit to pull out relevant info, but that doesn't seem to be the case.

Thanks,

James

Gian Merlino

May 3, 2016, 2:34:49 PM

to druid...@googlegroups.com

Hey James,

We don't do filter pruning for the secondary dimension right now, so yes, the query would still hit all the segments for that time range. But most of them should drop out quickly, because the value you're filtering on won't appear in their bitmap indexes.
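For reference, a query of the kind discussed here might be built like the Python sketch below: a native timeseries query with "bySegment" set in the query context and a filter covering every salted key for one customer. The data source name, the `_NNN` key format, the bucket count, and the `by_segment_query` helper are assumptions for illustration; only the JSON field names follow Druid's native query format.

```python
import json
from typing import Any, Dict, List


def by_segment_query(data_source: str, customer_id: str,
                     buckets: int, interval: str) -> Dict[str, Any]:
    """Build a timeseries query that filters on every salted key of one customer."""
    keys: List[str] = [f"{customer_id}_{i:03d}" for i in range(buckets)]
    return {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": "all",
        "intervals": [interval],
        # OR of selector filters, one per possible partition_key value.
        "filter": {
            "type": "or",
            "fields": [
                {"type": "selector", "dimension": "partition_key", "value": key}
                for key in keys
            ],
        },
        "aggregations": [{"type": "count", "name": "rows"}],
        # bySegment asks for results broken out per segment.
        "context": {"bySegment": True},
    }


query = by_segment_query("events", "big_customer", 8, "2016-05-01/2016-05-02")
print(json.dumps(query, indent=2))
```

In the per-segment results, shards that hold none of the filtered keys would be expected to show up with empty or zero-row results, which matches the "drop out quickly" behavior described above.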

Gian
