Queries on high cardinality dimension


ankur jain

Jan 10, 2018, 7:32:28 PM1/10/18
to Druid User
Hi Everyone,

Please bear with me if I am not asking this right.
========================
We have about 5 dimensions in our data. Two of them, say dimension A and dimension B, have very high cardinality.

dimension A - 10m
dimension B - 100m
Datasource - D

Our query pattern: 
1) Aggregated counts on data filtered by a single value of dimension A and between an interval.
Example:
{
  "queryType": "timeseries",
  "dataSource": "D",
  "intervals": "2018-01-10T08Z/2018-01-11T08Z",
  "granularity": "all",
  "context": {
    "timeout": 6000000
  },
  "filter": {
    "type": "selector",
    "dimension": "A",
    "value": "10271023423"
  },
  "aggregations": [
    {
      "name": "__VALUE__",
      "type": "doubleSum",
      "fieldName": "count"
    }
  ]
}

2) Aggregated counts on data filtered by a single value of dimension A and grouped by dimension B.
Example:
{
  "queryType": "topN",
  "dataSource": "D",
  "intervals": "2018-01-10T08Z/2018-01-11T08Z",
  "granularity": "all",
  "context": {
    "timeout": 6000000
  },
  "filter": {
    "type": "selector",
    "dimension": "A",
    "value": "10271023423"
  },
  "dimension": {
    "type": "default",
    "dimension": "B",
    "outputName": "B"
  },
  "aggregations": [
    {
      "name": "count",
      "type": "doubleSum",
      "fieldName": "count"
    }
  ],
  "metric": "count",
  "threshold": 1000000
}
=======================================================

Do you guys foresee any problem with such a query pattern when we have such high cardinality? 

I have heard that Druid has issues once cardinality approaches 100M, hence the question.

Any input is highly appreciated. 

Thanks





Eric Tschetter

Jan 11, 2018, 12:21:48 AM1/11/18
to druid...@googlegroups.com
I would have no worries with that usage.  I would recommend using semantic partitioning for your segments instead of hash partitioning, though.  If there is some interdependence between the two dimensions, it's even less of a worry.
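For reference, "semantic" (dimension-based) partitioning is set in the `partitionsSpec` of the batch ingestion tuningConfig. A minimal sketch, assuming Hadoop batch ingestion and partitioning on dimension A; exact field names and the target size can vary by Druid version, so check the docs for your release:

```json
{
  "tuningConfig": {
    "type": "hadoop",
    "partitionsSpec": {
      "type": "dimension",
      "partitionDimension": "A",
      "targetPartitionSize": 5000000
    }
  }
}
```

With this, rows sharing the same value of A tend to land in the same segment, so a selector filter on A can skip most segments entirely.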

—Eric



ankur jain

Jan 11, 2018, 1:58:13 PM1/11/18
to Druid User
Hey Eric,

Thanks for replying.
When you say semantic partitioning, do you mean writing a custom partitioner for a segment on the basis of dimensions?

Also, on cardinality: when do you think it will become a problem? What if the cardinality of dimension B reaches 1B? Do you know what the upper limit is?

Another query pattern I am looking at is the calculation of unique values.  Will the above assumption hold in this case as well?
Example: 
{
  "queryType": "timeseries",
  "dataSource": "D",
  "intervals": "2017-12-01T08Z/2018-01-01T08Z",
  "granularity": "all",
  "context": {
    "timeout": 60000
  },
  "aggregations": [
    {
      "name": "_main.countDistinct_B_-d4c",
      "type": "cardinality",
      "fields": [
        "B"
      ]
    }
  ]
}

Thanks

Eric Tschetter

Jan 11, 2018, 9:11:00 PM1/11/18
to druid...@googlegroups.com
If you use semantic partitioning (yes, the dimension shard spec; it already exists, you don't have to write it), then even 1B wouldn't really be a problem given the queries you've shown.

If you were to run a query that expected a 1B-row result set, that could be somewhat slow.  But the queries you've shown tend to generate rather small result sets, so I don't think it will be a problem.

—Eric


ankur jain

Jan 14, 2018, 9:16:23 PM1/14/18
to Druid User
Thanks, Eric. I will try this out and report back.

ankur jain

Jan 16, 2018, 8:48:36 PM1/16/18
to Druid User
Hey Eric,

Is the partitionsSpec configurable for real-time ingestion? All the examples I have seen so far are for Hadoop ingestion.
Even here the code is using HadoopIndexConfig by default.

Am I missing something?

Thanks

Eric Tschetter

Jan 16, 2018, 10:54:52 PM1/16/18
to druid...@googlegroups.com
It’s really batch only.  For real-time workloads you can still ingest using hash partitioning and then re-index the real-time segments using the dimension shard spec.  With the kind of data you are dealing with, doing this will also generally shrink the total size of your segments due to better data locality and the compression benefits that come with it.
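The re-indexing step can read existing segments back as input via the Hadoop "dataSource" inputSpec. A hedged sketch of the ioConfig portion, assuming datasource D and a one-day interval (the surrounding dataSchema and tuningConfig, including the dimension partitionsSpec, are omitted; field names may differ across versions):

```json
{
  "ioConfig": {
    "type": "hadoop",
    "inputSpec": {
      "type": "dataSource",
      "ingestionSpec": {
        "dataSource": "D",
        "intervals": ["2018-01-10/2018-01-11"]
      }
    }
  }
}
```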

Also, when you list the dimensions in your ingestion spec, put the one you are most likely to filter by first.  The data in each segment is sorted based on the order you list your dimensions, so putting the commonly filtered one first gives those queries very good data locality.
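Concretely, for the datasource above that would mean listing A before B in the dimensionsSpec (the remaining dimension names here are hypothetical placeholders):

```json
{
  "dimensionsSpec": {
    "dimensions": ["A", "B", "C", "D2", "E"]
  }
}
```

Since segment rows are sorted by dimension order, all rows for a given value of A end up contiguous, which also tends to compress better.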

—Eric
