How and on what parameters does Druid decide to roll up?


Pravesh Gupta

Jan 20, 2017, 5:14:56 AM
to Druid User
Hi All,

I want to understand on what factors and parameters Druid decides to roll up two entries. I assume Druid computes something similar to a hash to determine whether two rows are equal and then rolls them up accordingly. So how is that hash calculated?
I am sharing my use case with an example below:

I am doing a batch ingestion into Druid (versions: druid-0.9.0, 0.9.1.1).

Following is my batch ingestion spec:
{
        "type": "index_hadoop",
        "spec": {
                "ioConfig": {
                        "type": "hadoop",
                        "inputSpec": {
                                "type": "static",
                                "paths": "MyEventListFile.json"
                        }
                },
                "dataSchema": {
                        "dataSource": "<<DATASOURCE_NAME>>>",
                        "parser": {
                                "type": "string",
                                "parseSpec": {
                                        "format": "json",
                                        "timestampSpec": {
                                                "column": "timestamp",
                                                "format": "millis"
                                        },
                                        "dimensionsSpec": {
                                                "dimensions": ["A", "B", "C", "D"],
                                                "dimensionExclusions": []
                                        }
                                }
                        },
                        "metricsSpec": [
                                { "type": "longMax", "name": "eventCount", "fieldName": "count" }
                        ],
                        "granularitySpec": {
                                "type": "uniform",
                                "segmentGranularity": "HOUR",
                                "queryGranularity": "none",
                                "intervals": ["2019-07-18/2019-12-01"]
                        }
                },
                "tuningConfig": {
                        "type": "hadoop"
                }
        }
}

{"timestamp":"1572557460000","A":"0","B":"Pravesh300000","C":"30000","D":"praveshgmail.com","NOTADIMENSION":"test1","count":1}
{"timestamp":"1572557460000","A":"0","B":"Pravesh300000","C":"30000","D":"praveshgmail.com","NOTADIMENSION":"test2","count":1}

I have two events with the same timestamp and the same values for the dimensions (A, B, C, D), BUT the column that I cannot declare as a dimension (NOTADIMENSION) is different.
I am getting a count of 1 in this case, but I want a count of 2.
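
One hedged reading of the count of 1: if the two sample rows are rolled up into a single stored row, a longMax aggregator keeps the largest input "count", whereas a longSum aggregator would add the inputs. The sketch below only mimics that arithmetic in plain Python. It is not Druid code, and the name rolled_up_values is made up for illustration.

```python
# Input "count" values of the two sample events that share a timestamp
# and the same values for dimensions A, B, C, D.
rolled_up_values = [1, 1]

# longMax semantics: keep the maximum of the inputs.
long_max = max(rolled_up_values)

# longSum semantics: add the inputs together.
long_sum = sum(rolled_up_values)

print(long_max, long_sum)
```

With these inputs the maximum is 1 and the sum is 2, which matches the observed and the desired count respectively.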

What's the solution here? Is there anything I can specify explicitly to tell Druid how to roll up, i.e. on which columns and timestamp to calculate the hash?

Hoping to hear back soon, as I am blocked on this.

Thanks,
Pravesh Gupta

Slim Bouguerra

Jan 20, 2017, 10:36:47 AM
to druid...@googlegroups.com
It is based on "queryGranularity".
But in your example it is set to none, so most likely you will not get any rollup. Change it to another granularity.
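
A minimal sketch of how a coarser granularity makes more timestamps collide, assuming the timestamp is floored to the start of its bucket. The GRANULARITY_MILLIS table and the truncate helper are illustrative names, not Druid's API, and the second event is hypothetical.

```python
# Bucket sizes in milliseconds; "none" keeps the original millisecond value.
# This table is illustrative, not Druid's own implementation.
GRANULARITY_MILLIS = {"none": 1, "minute": 60_000, "hour": 3_600_000}


def truncate(timestamp_millis: int, granularity: str) -> int:
    """Floor a millisecond timestamp to the start of its bucket (UTC)."""
    size = GRANULARITY_MILLIS[granularity]
    return timestamp_millis - (timestamp_millis % size)


first = 1572557460000    # timestamp taken from the sample events
second = first + 90_000  # a hypothetical event 90 seconds later

for granularity in ("none", "minute", "hour"):
    same_bucket = truncate(first, granularity) == truncate(second, granularity)
    print(granularity, same_bucket)
```

In this model the two timestamps only share a bucket at hour granularity. Two events with an identical millisecond timestamp, as in the sample data, share a bucket at every granularity, including none.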
-- 

B-Slim
_______/\/\/\_______/\/\/\_______/\/\/\_______/\/\/\_______/\/\/\_______

--
You received this message because you are subscribed to the Google Groups "Druid User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-user+...@googlegroups.com.
To post to this group, send email to druid...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-user/952ebde2-5f39-44f1-9944-84a82cd33a1c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Pravesh Gupta

Jan 20, 2017, 1:12:13 PM
to Druid User
Then why is rollup happening in my case?

However, can I tell Druid which set of columns to consider when calculating equality? What would my queryGranularity look like in that case?

Is it possible to tell Druid: if my dimensions and some particular non-dimension column are the same, then roll up, otherwise don't?

One more follow-up question: does the timestamp come into the picture in Druid rollup?

Unfortunately, I couldn't find many details online about how Druid does rollup.

Thanks,
Pravesh Gupta

Slim Bouguerra

Jan 20, 2017, 1:20:12 PM
to druid...@googlegroups.com

On Jan 20, 2017, at 10:12 AM, Pravesh Gupta <gupta.p...@gmail.com> wrote:

> Then why is rollup happening in my case?

Probably you have matching rows: same timestamp, same dimension values.

> However, can I tell Druid which set of columns to consider when calculating equality?

Druid uses all the dimensions to bucket and roll up data. You can think of it as a hash of the time and the dimension values.

> What would my queryGranularity look like in that case?

With a different granularity, let's say hour, Druid will truncate the time column of your data to hour granularity, hence you will have more collisions between rows.

> Is it possible to tell Druid: if my dimensions and some particular non-dimension column are the same, then roll up, otherwise don't?

I am not sure I am getting this question. What is the point of having a partial rollup? I can see use cases where you don't want rollup to happen at all, but not one where it is based on a subset of dimensions.

> One more follow-up question: does the timestamp come into the picture in Druid rollup?

Of course, that's the main grain of the hash.

> Unfortunately, I couldn't find many details online about how Druid does rollup.
>
> Thanks,
> Pravesh Gupta
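
The "hash of time and dimension values" idea can be sketched in plain Python using the sample events from the first post. This is a simplified model, not Druid's implementation: rollup_key and group_rows are hypothetical helpers, and timestamp truncation by queryGranularity is left out.

```python
from collections import defaultdict

DIMENSIONS = ["A", "B", "C", "D"]  # dimensions declared in the ingestion spec

events = [
    {"timestamp": "1572557460000", "A": "0", "B": "Pravesh300000", "C": "30000",
     "D": "praveshgmail.com", "NOTADIMENSION": "test1", "count": 1},
    {"timestamp": "1572557460000", "A": "0", "B": "Pravesh300000", "C": "30000",
     "D": "praveshgmail.com", "NOTADIMENSION": "test2", "count": 1},
]


def rollup_key(event, dimensions):
    """Build the grouping key: timestamp plus the listed dimension values only."""
    return (event["timestamp"],) + tuple(event.get(d) for d in dimensions)


def group_rows(events, dimensions):
    """Collect input rows that share a key; each group models one stored row."""
    groups = defaultdict(list)
    for event in events:
        groups[rollup_key(event, dimensions)].append(event)
    return groups


# Only the declared dimensions take part in the key.
print(len(group_rows(events, DIMENSIONS)))
# The distinguishing column is treated as a dimension as well.
print(len(group_rows(events, DIMENSIONS + ["NOTADIMENSION"])))
```

In this model the two events fall into one group when only A, B, C and D form the key, and into two groups once the distinguishing column is part of the key.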


Pravesh Gupta

Jan 20, 2017, 1:28:06 PM
to Druid User
Thanks for the point-by-point answer.

One of my use cases is this: two users click on the same link at exactly the same time, so the events for these two users have the same dimensions and the same time. The only thing that differentiates them is a unique ID, which I am ingesting into Druid but not as a dimension column (for some reason, let's say). So in this case, how should I ensure there is no rollup?
I want my event count to be 2.

I hope I have made myself clearer.

Slim Bouguerra

Jan 20, 2017, 1:45:46 PM
to druid...@googlegroups.com

> On Jan 20, 2017, at 10:28 AM, Pravesh Gupta <gupta.p...@gmail.com> wrote:
>
> Thanks for the point-by-point answer.
>
> One of my use cases is this: two users click on the same link at exactly the same time, so the events for these two users have the same dimensions and the same time. The only thing that differentiates them is a unique ID, which I am ingesting into Druid but not as a dimension column (for some reason, let's say).

I am not sure why you are stripping the userId, but under those assumptions how can you query the data afterward? If you remove the userId, how is it helpful to have 2 rows with the same input?

> So in this case, how should I ensure there is no rollup?
> I want my event count to be 2.
>
> I hope I have made myself clearer.

Pravesh Gupta

Jan 23, 2017, 9:33:20 AM
to Druid User
Thanks for the answer.
Actually, I am still confused about how Druid decides to roll up.

Basically, which columns does it consider when deciding that two rows have to be rolled up?

I am guessing it considers all the dimension columns and the timestamp, nothing else.

I do have a case where a column is neither a dimension, nor a metric, nor the timestamp column, but I want its value to be considered when Druid decides whether to roll up rows. Is that even possible in the first place? Let's set aside whether it makes sense or has a proper use case.

Please help here.

Slim Bouguerra

Jan 23, 2017, 10:53:19 AM
to druid...@googlegroups.com
Hi

On Jan 23, 2017, at 6:33 AM, Pravesh Gupta <gupta.p...@gmail.com> wrote:

> Thanks for the answer.
> Actually, I am still confused about how Druid decides to roll up.
>
> Basically, which columns does it consider when deciding that two rows have to be rolled up?
>
> I am guessing it considers all the dimension columns and the timestamp, nothing else.

Yes, dimensions and timestamp. Please look at this example, it will give you a better idea: http://druid.io/blog/2013/09/12/the-art-of-approximating-distributions.html

> I do have a case where a column is neither a dimension, nor a metric, nor the timestamp column,

What is the nature of this column? Is it a projection from other columns?

> but I want its value to be considered when Druid decides whether to roll up rows. Is that even possible in the first place? Let's set aside whether it makes sense or has a proper use case.

Well, is this dimension column part of the ingested data? If so, it will be part of the decision.
If you give me more examples of your use case I can answer this question, but to be honest I am still not getting it, sorry :(

> Please help here.



Pravesh Gupta

Sep 18, 2018, 12:42:37 AM
to Druid User
Hi All,


We are working on something that requires an understanding of rollup.
We have a use case where we want to roll up based only on the timestamp and one particular dimension, not all dimensions. Is this possible?

In my datasource I have many dimensions and metrics, but I want rollup to happen based on only one dimension value and, of course, the timestamp.

Does this depend on the Druid version? We are currently running 0.9.1.1 in production, which I guess does not support this, as Slim confirmed earlier in this thread.

Nishant Bangarwa

Sep 18, 2018, 9:30:51 AM
to Druid User
Hi Pravesh, 

Can you elaborate on how you want to interpret/implement rollup for the other dimensions when rolling up rows based on only one dimension value?

Cheers, 
Nishant
Hortonworks


Pravesh Gupta

Sep 20, 2018, 7:22:44 AM
to Druid User
Hi Nishant,
We have the following requirement.

Assume these are the two events received by Druid:

Event 1 is received first and then Event 2. Both events have the same value for rollUpDim ("GUID"), and we want to roll up on this dimension.

Event 1: {timestamp: 123, rollUpDim: "GUID", dim1: "111", dim2: "444", dim3: "666", metricCount: 1}
Event 2: {timestamp: 128, rollUpDim: "GUID", dim1: "222", dim2: "444", dim4: "777", metricCount: 1}

This is how we want them rolled up (strategy: LongMax):

{timestamp: 128, rollUpDim: "GUID", dim1: "222", dim2: "444", dim4: "777", dim3: "666", metricCount: 1}

We have effectively merged these rows.

Hope it is clear.

We are also concerned about whether this approach is possible with windowless ingestion in the Kafka Indexing Service.
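
The merge described above can be sketched in plain Python to pin down its semantics. This is only a model of the requested behavior, for instance as a pre-processing step before ingestion, not something the thread confirms Druid rollup can do. The merge_events helper is hypothetical, and "the later event's value wins" is an assumption inferred from dim1 ending up as "222" in the desired result.

```python
def merge_events(events, key_dim, dims, metric):
    """Merge events sharing key_dim: the latest present dimension value wins,
    while the timestamp and the metric each take their maximum."""
    merged = {}
    for event in sorted(events, key=lambda e: e["timestamp"]):
        row = merged.setdefault(event[key_dim], {key_dim: event[key_dim]})
        row["timestamp"] = max(row.get("timestamp", event["timestamp"]), event["timestamp"])
        for dim in dims:
            if dim in event:
                row[dim] = event[dim]  # later events overwrite earlier values
        row[metric] = max(row.get(metric, event[metric]), event[metric])
    return list(merged.values())


events = [
    {"timestamp": 123, "rollUpDim": "GUID", "dim1": "111", "dim2": "444",
     "dim3": "666", "metricCount": 1},
    {"timestamp": 128, "rollUpDim": "GUID", "dim1": "222", "dim2": "444",
     "dim4": "777", "metricCount": 1},
]

result = merge_events(events, "rollUpDim", ["dim1", "dim2", "dim3", "dim4"], "metricCount")
print(result)
```

For the two events above this yields a single row with timestamp 128, dim1 "222", dim2 "444", dim3 "666", dim4 "777" and metricCount 1, the same values as the desired merged row.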

Erik Dubbelboer

Sep 21, 2018, 12:07:34 AM
to Druid User
In this case, aren't dim1, dim2, dim3 and dim4 metrics that you roll up using LongMax? Or do you want to filter on them later in your queries?

Pravesh Gupta

Sep 21, 2018, 4:46:43 AM
to Druid User
Yes, we want to filter on these in our queries.

We want the rollup decision to be based only on the rollUpDim column.