GroupBy query performance on multivalue columns


Alvin Hom

Mar 27, 2013, 4:02:12 PM
to druid-de...@googlegroups.com
I have a data source of about 100M rows with just a couple of columns:

- id [string uuid]
- categories [multi-value array of integers]. This column can have up to 200 possible values, but each row typically has 10-20 of them.

I am using a groupBy query with a set of filters.  If I don't specify any dimensions in the query,

{
    "queryType": "groupBy",
    "dataSource": "dataSource",
    "granularity": "all",
    "dimensions": [],
    "aggregations":[
        {"type":"count", "name":"eventcount"}
    ],
    "filter": {
       "type":  "and",
       "fields":    [
            {
             "type": "selector",
             "dimension": "categories",
             "value": "1"
            },
            {
             "type": "selector",
             "dimension": "categories",
             "value": "2"
            },
            {
             "type": "selector",
             "dimension": "categories",
             "value": "3"
            },
            {
             "type": "selector",
             "dimension": "categories",
             "value": "4"
            },
            {
             "type": "selector",
             "dimension": "categories",
             "value": "5"
            }
        ]
    },

    "intervals":["1970-01-01T00:00/1970-01-02T00"]
}

then the query returns in about 1-2 seconds.

However, if I run the same query with the multi-value column as the dimension:

{
    "queryType": "groupBy",
    "dataSource": "dataSource",
    "granularity": "all",
    "dimensions": ["categories"],
    "aggregations":[
        {"type":"count", "name":"eventcount"}
    ],
    "filter": {
       "type":  "and",
       "fields":    [
            {
             "type": "selector",
             "dimension": "categories",
             "value": "1"
            },
            {
             "type": "selector",
             "dimension": "categories",
             "value": "2"
            },
            {
             "type": "selector",
             "dimension": "categories",
             "value": "3"
            },
            {
             "type": "selector",
             "dimension": "categories",
             "value": "4"
            },
            {
             "type": "selector",
             "dimension": "categories",
             "value": "5"
            }
        ]
    },

    "intervals":["1970-01-01T00:00/1970-01-02T00"]
}

The query takes a very long time, 3-4 minutes.  Is this expected behavior?  I would think the groupBy count with filters should be pretty fast, since the data is already in memory.

- Alvin

Eric Tschetter

Mar 27, 2013, 4:29:57 PM
to druid-de...@googlegroups.com
Wow, yeah, that is a long time. A couple of questions:

1) What is the cardinality of the multiple values for the categories dimension?  You mention it might have up to 200 values per row; is it the same set of 200 values across rows, or are the values potentially unique?  Asked another way, how many values are you getting in the result set?

2) Is this running on one compute node or on multiple?  How large are the segments versus how much RAM is available on the box?

3) Not a question, but a clarification about the difference between the two queries.  The first one, which doesn't specify a dimension, never actually looks at the categories column, so all it actually loads into memory to process the request are two things: (1) the bitmap indices and (2) the timestamp column.  So the columns aren't actually in memory yet.  When you run the query a second time, does it also run in 3-4 minutes?

Just an explanation of how multi-values work as well: when you groupBy a column with multiple values, it extracts that row and essentially acts like it is N rows, one for each of the multiple values.  So, if you have an average of 20 elements across 100M rows and you process it without any filter, it is doing the equivalent of 2B rows' worth of processing, which could justify some increase in processing time, but not as much as you are seeing.
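To make that concrete, a single hypothetical row like

{"id": "abc", "categories": ["1", "2", "3"]}

is processed by the groupBy as if it were three rows, contributing one count to each of the "1", "2", and "3" groups.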

If you are only processing frequency counts, then it is possible to optimize this a bit more by just doing bitmap algebra with the various values and counting the matches.  I would expect this to be significantly faster for this type of query, but it is not an optimization that the code currently employs, because the code assumes that you also want to do things like sum up values or compute averages on the same set of rows as you go (in production, we don't do any queries with fewer than 3 aggregators, and the majority of them also have a postAgg or two set).

If your use case is strongly focused on frequency counts, but you also need some of the aggregation capability Druid provides, you can look into creating a new query type that does the bitmap algebra to speed up this particular query type.  If your use case is entirely frequency counts and the rest of what Druid provides is not that interesting, then you might actually be better served by a search-based system (Solr, Elasticsearch, SenseiDB, etc.), as frequency counts are their bread and butter, while Druid targets more arbitrary aggregations.

Lastly, for your first query: a groupBy without any dimensions is actually the same as a "timeseries" query.  There is an implementation of timeseries queries that is generally 50% faster than groupBy for the special case where you aren't specifying a dimension.  You can try it out by just changing the queryType from "groupBy" to "timeseries".
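For example, your first query would become something like this (same filter and intervals as before, with the empty dimensions list dropped, since timeseries doesn't take dimensions):

{
    "queryType": "timeseries",
    "dataSource": "dataSource",
    "granularity": "all",
    "aggregations":[
        {"type":"count", "name":"eventcount"}
    ],
    "filter": { ...the same "and" filter as above... },
    "intervals":["1970-01-01T00:00/1970-01-02T00"]
}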

--Eric

 

Alvin Hom

Mar 28, 2013, 3:46:03 PM
to druid-de...@googlegroups.com
Eric,

Here are the answers to your questions:

1) What is the cardinality of the multiple values for the categories dimension?  You mention it might have up to 200 values per row; is it the same set of 200 values across rows, or are the values potentially unique?  Asked another way, how many values are you getting in the result set?

[AH] The cardinality of the values is 200, but each row only has about 20 values in the multi-value column.  I guess we think of it like a sparse matrix, and we compress the columns into a single multi-value column.  The reason we model it this way is that we want fast filtering and fast groupBys.  If we split each category into its own column, then I could not group them together.
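For example, instead of 200 separate category columns, a single (hypothetical) row looks something like:

{"id": "<some uuid>", "categories": [3, 17, 42], "events": 1.0}

and a single groupBy on "categories" can still group across all of the categories.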

2) Is this running on one compute node or on multiple?  How large are the segments versus how much RAM is available on the box?

I am testing right now with two m2.xlarge nodes, which have about 13 GB of RAM each.  The segment statistics are shown below:

[ {
  "id" : "merged",
  "intervals" : [ "1970-01-01T00:00:00.000Z/1970-01-02T00:00:00.000Z" ],
  "columns" : {
    "__time" : {
      "type" : "LONG",
      "size" : 1066452520,
      "cardinality" : null
    },
    "id" : {
      "type" : "STRING",
      "size" : 4265810080,
      "cardinality" : 1999999
    },
    "categories" : {
      "type" : "STRING",
      "size" : 19087970744,
      "cardinality" : 1035
    },
    "events" : {
      "type" : "FLOAT",
      "size" : 853162016,
      "cardinality" : null
    }
  },
  "size" : 25699976368
} ]

The total segment size is about 25.6 GB.
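For reference, the statistics above came from a segmentMetadata query along these lines (I believe the "merge": true flag is what produces the single "merged" entry):

{
    "queryType": "segmentMetadata",
    "dataSource": "dataSource",
    "intervals": ["1970-01-01T00:00/1970-01-02T00"],
    "merge": true
}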

Thanks for the explanation of how groupBy is implemented.  I do have other queries that do different types of aggregation, but we also have a lot of frequency count queries, so we want to optimize those as well.  I will look into implementing a count using the inverted bitmap index.  In my mind, I think I can apply the following algorithm:

1.  Apply the filters first to reduce the data.  (Get back a bitmap.)
2.  For each value in the dimension:
     a.  Get the value's inverted bitmap index.
     b.  AND it with the filter bitmap.
     c.  Count the set bits in the resulting bitmap.
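In code form, I am imagining something like the sketch below (java.util.BitSet is just a stand-in for whatever compressed bitmap type Druid actually uses, and the class and method names are placeholders):

import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the algorithm above; BitSet stands in
// for the actual compressed bitmap index implementation.
public class FrequencyCountSketch {
    // invertedIndex maps each dimension value to the bitmap of rows
    // containing that value; filterBitmap is the result of step 1.
    public static Map<String, Long> count(BitSet filterBitmap,
                                          Map<String, BitSet> invertedIndex) {
        Map<String, Long> counts = new HashMap<String, Long>();
        for (Map.Entry<String, BitSet> e : invertedIndex.entrySet()) {
            // Steps 2a/2b: AND the value's inverted bitmap with the filter bitmap
            BitSet matches = (BitSet) e.getValue().clone();
            matches.and(filterBitmap);
            // Step 2c: the cardinality of the intersection is the frequency count
            long c = matches.cardinality();
            if (c > 0) {
                counts.put(e.getKey(), c);
            }
        }
        return counts;
    }
}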

Let me know if I am going down the right path.  It would require us to implement a new query type and plug it into Druid.  I would appreciate any pointers for doing this.

Thanks.

- Alvin

Eric Tschetter

Mar 29, 2013, 10:09:21 AM
to druid-de...@googlegroups.com
Cool!  You figured out how to use the segmentMetadata query!  One caveat: that is actually approximating the size of the segment as if it were TSV input...  We should add an "actual size" field to it, but in general, the actual segment should be small enough to fit in memory.

 

Yup, that's exactly it.  FWIW, the count() method is already implemented on ImmutableConciseBitmap.  This will end up significantly faster for the query.  If you go down this path, might I recommend "queryType": "frequencyCount"?  For pointers, you can look at SegmentMetadataQueryQueryToolchest/QueryRunnerFactory and the Timeseries variants.  For now, you can just return null from getCacheStrategy() if you don't want to figure that part out.

If you have questions about the actual implementation of stuff, send them to the list, or you can pop on IRC and try to get answers in more real time :).

--Eric

 