Queries on high cardinality dimension


ankur jain

Jan 10, 2018, 7:32:28 PM1/10/18
to Druid User
Hi Everyone,

Please bear with me if I am not asking this right.
========================
We have about 5 dimensions in our data. Two of them, say dimension A and dimension B, have very high cardinality.

dimension A - 10m
dimension B - 100m
Datasource - D

Our query pattern: 
1) Aggregated counts on data filtered by a single value of dimension A and between an interval.
Example:
{
  "queryType": "timeseries",
  "dataSource": "D",
  "intervals": "2018-01-10T08Z/2018-01-11T08Z",
  "granularity": "all",
  "context": {
    "timeout": 6000000
  },
  "filter": {
    "type": "selector",
    "dimension": "A",
    "value": "10271023423"
  },
  "aggregations": [
    {
      "name": "__VALUE__",
      "type": "doubleSum",
      "fieldName": "count"
    }
  ]
}

2) Aggregated counts on data filtered by a single value of dimension A and grouped by dimension B.
Example:
{
  "queryType": "topN",
  "dataSource": "D",
  "intervals": "2018-01-10T08Z/2018-01-11T08Z",
  "granularity": "all",
  "context": {
    "timeout": 6000000
  },
  "filter": {
    "type": "selector",
    "dimension": "A",
    "value": "10271023423"
  },
  "dimension": {
    "type": "default",
    "dimension": "B",
    "outputName": "B"
  },
  "aggregations": [
    {
      "name": "count",
      "type": "doubleSum",
      "fieldName": "count"
    }
  ],
  "metric": "count",
  "threshold": 1000000
}
=======================================================

Do you guys foresee any problem with such a query pattern when we have such high cardinality? 

I have heard that Druid has issues once cardinality approaches 100M, hence the question.

Any input is highly appreciated. 

Thanks





Eric Tschetter

Jan 11, 2018, 12:21:48 AM1/11/18
to druid...@googlegroups.com
I would have no worries with that usage.  I would recommend using semantic partitioning for your segments instead of hash partitioning, though.  If there is some interdependence between the two dimensions, it's even less of a worry.
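For reference, "semantic" (dimension-based) partitioning is set in the `partitionsSpec` of the batch ingestion tuningConfig. A minimal sketch, assuming Hadoop batch ingestion and partitioning on dimension A; exact field names and the target size can vary by Druid version, so check the docs for your release:

```json
{
  "tuningConfig": {
    "type": "hadoop",
    "partitionsSpec": {
      "type": "dimension",
      "partitionDimension": "A",
      "targetPartitionSize": 5000000
    }
  }
}
```

With this, rows sharing the same value of A tend to land in the same segment, so a selector filter on A can skip most segments entirely.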

—Eric



ankur jain

Jan 11, 2018, 1:58:13 PM1/11/18
to Druid User
Hey Eric,

Thanks for replying.
When you say semantic partitioning, do you mean writing a custom partitioner for a segment on the basis of dimensions?

Also, on cardinality: when do you think it will become a problem? What if the cardinality of dimension B reaches 1B? Do you know what the upper limit is?

Another query pattern I am looking at is the calculation of unique values.  Will the above assumption hold in this case as well?
Example: 
{
  "queryType": "timeseries",
  "dataSource": "D",
  "intervals": "2017-12-01T08Z/2018-01-01T08Z",
  "granularity": "all",
  "context": {
    "timeout": 60000
  },
  "aggregations": [
    {
      "name": "_main.countDistinct_B_-d4c",
      "type": "cardinality",
      "fields": [
        "B"
      ]
    }
  ]
}

Thanks

Eric Tschetter

Jan 11, 2018, 9:11:00 PM1/11/18
to druid...@googlegroups.com
If you use semantic partitioning (yes, the dimension shard spec; it already exists, you don't have to write it), then even 1B wouldn't really be a problem given the queries you've shown.

If you were to run a query that expected a 1B-row result set, that could be somewhat slow.  But the queries you've shown tend to generate rather small result sets, so I don't think it will be a problem.

—Eric


ankur jain

Jan 14, 2018, 9:16:23 PM1/14/18
to Druid User
Thanks, Eric. I will try this out and report back.

ankur jain

Jan 16, 2018, 8:48:36 PM1/16/18
to Druid User
Hey Eric,

Is the partitionsSpec configurable for real-time ingestion? All the examples I have seen so far are for Hadoop ingestion.
Even here the code is using HadoopIndexConfig by default.

Am I missing something?

Thanks

Eric Tschetter

Jan 16, 2018, 10:54:52 PM1/16/18
to druid...@googlegroups.com
It’s really batch only.  For real-time workloads you can still ingest using hash partitioning and then re-index the real-time segments using the dimension shard spec.  With the kind of data you are dealing with, doing this will also generally shrink the total size of your segments due to better data locality and the compression benefits that come with it.
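The re-indexing step can read existing segments back as input via the Hadoop "dataSource" inputSpec. A hedged sketch of the ioConfig portion, assuming datasource D and a one-day interval (the surrounding dataSchema and tuningConfig, including the dimension partitionsSpec, are omitted; field names may differ across versions):

```json
{
  "ioConfig": {
    "type": "hadoop",
    "inputSpec": {
      "type": "dataSource",
      "ingestionSpec": {
        "dataSource": "D",
        "intervals": ["2018-01-10/2018-01-11"]
      }
    }
  }
}
```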

Also, when you list the dimensions in your ingestion spec, put the one you are most likely to filter by first.  The data in each segment is sorted based on the order you list your dimensions, so putting the commonly filtered one first gives those queries very good data locality.
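Concretely, for the datasource above that would mean listing A before B in the dimensionsSpec (the remaining dimension names here are hypothetical placeholders):

```json
{
  "dimensionsSpec": {
    "dimensions": ["A", "B", "C", "D2", "E"]
  }
}
```

Since segment rows are sorted by dimension order, all rows for a given value of A end up contiguous, which also tends to compress better.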

—Eric
