Evaluating Druid for our use cases (window functions)

946 views
Skip to first unread message

Benjamin Ross

unread,
Aug 14, 2015, 2:37:38 PM8/14/15
to Druid User
Hi all,
I work at a startup that is interested in using Druid for a variety of use cases, and I'm wondering how well it fits our needs.

In particular, it seems as though the timeseries query doesn't have support for filters on window functions that lie outside the time range.  We need to be able to ask whether, within a particular period of resulting data, whether some expression using an aggregate on possibly past periods is true.

As a concrete example, let's say I have the following data of rolled up purchases of products x and y and the amount of money that was spent:

AccountID   PeriodID   Product_x_purchased?   Product_x_spend   Product_y_purchased?   Product_y_spend
1              1             0                       0                  0                    0
1              2             1                       100                0                    0
1              3             1                       300                0                    0
1              4             0                       0                  0                    0
2              1             1                       1000               0                    0
2              2             0                       0                  1                    300
2              3             1                       250                1                    300
2              4             1                       100                1                    300


I want to be able to ask the following:
1. In period 4, the accounts that have spent more than 50 dollars:  Account 1 (count = 1)
2. Over all periods, the accounts and periods for which product x spend was greater than 150:  {Account 1, Period 3}, {Account 2, Period 1}, {Account 2, Period 3} (count = 3)
3. In period 4, the accounts that have spent in total more than $300 on product x over the last three periods AND have purchased product y: {Account 2}, (count = 1) (because 250+100>300 and account 1 didn't purchase product y in period 4)
4. Over all periods, the accounts and periods that have spent in total more than $300 on product x over the last three periods AND have purchased product y: {Account 2, Period 4}, {Account 2, Period 3}, {Account 2, Period 2} (count = 3)

Items 3 and 4 are basically usages of row frame window functions (see for reference http://blog.jooq.org/2013/11/03/probably-the-coolest-sql-feature-window-functions/)

It seems as though items 3 and 4 are not supported.  Let me know if I'm reading that right.

Thanks in advance,
Ben

Fangjin Yang

unread,
Aug 16, 2015, 11:19:07 AM8/16/15
to Druid User
Hi Benjamin, please see inline.


On Friday, August 14, 2015 at 11:37:38 AM UTC-7, Benjamin Ross wrote:
Hi all,
I work at a startup that is interested in using Druid for a variety of use cases, and I'm wondering how well it fits our needs.

In particular, it seems as though the timeseries query doesn't have support for filters on window functions that lie outside the time range.  We need to be able to ask whether, within a particular period of resulting data, whether some expression using an aggregate on possibly past periods is true.

Druid's native query language supports an "interval" for data being queried. For example, I know of several data applications built on top of Druid that have a "compare" feature, where data from arbitrarily time ranges can be compared. Can you post the SQL query you are thinking of issuing?

As a concrete example, let's say I have the following data of rolled up purchases of products x and y and the amount of money that was spent:

AccountID   PeriodID   Product_x_purchased?   Product_x_spend   Product_y_purchased?   Product_y_spend
1              1             0                       0                  0                    0
1              2             1                       100                0                    0
1              3             1                       300                0                    0
1              4             0                       0                  0                    0
2              1             1                       1000               0                    0
2              2             0                       0                  1                    300
2              3             1                       250                1                    300
2              4             1                       100                1                    300


I want to be able to ask the following:
1. In period 4, the accounts that have spent more than 50 dollars:  Account 1 (count = 1)

You can take a look the having filter for groupBy queries here.
 
2. Over all periods, the accounts and periods for which product x spend was greater than 150:  {Account 1, Period 3}, {Account 2, Period 1}, {Account 2, Period 3} (count = 3)

 
3. In period 4, the accounts that have spent in total more than $300 on product x over the last three periods AND have purchased product y: {Account 2}, (count = 1) (because 250+100>300 and account 1 didn't purchase product y in period 4)
4. Over all periods, the accounts and periods that have spent in total more than $300 on product x over the last three periods AND have purchased product y: {Account 2, Period 4}, {Account 2, Period 3}, {Account 2, Period 2} (count = 3)

Items 3 and 4 are basically usages of row frame window functions (see for reference http://blog.jooq.org/2013/11/03/probably-the-coolest-sql-feature-window-functions/)

To answer items 3 and 4, you will have to issue multiple Druid queries and do a little bit of work on the client side. For example, you can use the results of the first query as a filter for the results of the second query. 

It seems as though items 3 and 4 are not supported.  Let me know if I'm reading that right.

Let me know if that helps.
-- FJ
 
Thanks in advance,
Ben

Benjamin Ross

unread,
Aug 16, 2015, 10:13:29 PM8/16/15
to Druid User
Hi Fangjin. Thanks, that helps.

Benjamin Ross

unread,
Aug 16, 2015, 11:38:38 PM8/16/15
to Druid User
Fanglin, it seems like in general this won't work for us unless we modify Druid.  This is because we'll need to stream a potentially large amount of data to the web server and do further processing on it there before passing it up for consumption somewhere else.  For example, if we simply wanted to return the number of matching accounts in case 3, we'd need to stream all those records to the web server via multiple queries and then finally post-process and aggregate the results to obtain a count.

It looks like we'll want to look into modifying Druid for our use case.  Specifically, it *seems* as though we'll need to figure out a way to add support for filters against aggregations, and then to modify aggregations to be able to support windowing constraints.  If you or anyone else has any insight into that, it would be appreciated.

Thanks again!
Ben

  

On Sunday, August 16, 2015 at 10:13:29 PM UTC-4, Benjamin Ross wrote:
Hi Fangjin.  Thanks, that helps.

Nishant Bangarwa

unread,
Aug 17, 2015, 9:11:11 AM8/17/15
to druid...@googlegroups.com
Hi Benjamin, 

Druid already have support for Filtered Aggregator (http://druid.io/docs/latest/querying/aggregations.html) where you can do your basic filtering. 
It seems they might be helpful for your use case ?

--
You received this message because you are subscribed to the Google Groups "Druid User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-user+...@googlegroups.com.
To post to this group, send email to druid...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-user/13dbd507-8434-4426-ac79-cd36fc05b01a%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.



--

Benjamin Ross

unread,
Aug 17, 2015, 10:44:12 AM8/17/15
to druid...@googlegroups.com
Hey Nishant,
Thanks.  It seems as though that's support for filtering the aggregator output by using pre-aggregated columns, not filters against outputted aggregator columns.  In addition it looks like it doesn't change the cardinality of the resulting set.  But all this being said I haven't tried it to be sure.  Thanks.

Ben


On Aug 17, 2015, at 6:11 AM, Nishant Bangarwa <nishant....@metamarkets.com> wrote:

Hi Benjamin, 

Druid already have support for Filtered Aggregator (http://druid.io/docs/latest/querying/aggregations.html) where you can do your basic filtering. 
It seems they might be helpful for your use case ?
On Mon, Aug 17, 2015 at 9:08 AM, Benjamin Ross <benjam...@gmail.com> wrote:
Fanglin, it seems like in general this won't work for us unless we modify Druid.  This is because we'll need to stream a potentially large amount of data to the web server and do further processing on it there before passing it up for consumption somewhere else.  For example, if we simply wanted to return the number of matching accounts in case 3, we'd need to stream all those records to the web server via multiple queries and then finally post-process and aggregate the results to obtain a count.

It looks like we'll want to look into modifying Druid for our use case.  Specifically, it *seems* as though we'll need to figure out a way to add support for filters against aggregations, and then to modify aggregations to be able to support windowing constraints.  If you or anyone else has any insight into that, it would be appreciated.

Thanks again!
Ben

  

On Sunday, August 16, 2015 at 10:13:29 PM UTC-4, Benjamin Ross wrote:
Hi Fangjin.  Thanks, that helps.

--
You received this message because you are subscribed to the Google Groups "Druid User" group.
To unsubscribe Sent from my iPhone

To post to this group, send email to druid...@googlegroups.com.

--
You received this message because you are subscribed to a topic in the Google Groups "Druid User" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/druid-user/W2TUlRNA1ew/unsubscribe.
To unsubscribe from this group and all its topics, send an email to druid-user+...@googlegroups.com.

To post to this group, send email to druid...@googlegroups.com.

Gurdeep Singh

unread,
Jul 31, 2017, 12:16:07 PM7/31/17
to Druid User
Hi Ben,

Did you guys go down the route to making changes to Druid to achieve your use cases on Windowing? We are exploring possibilities to deliver similar use cases using Druid and would like to get your inputs into this.

Gurdeep

Benjamin Ross

unread,
Jul 31, 2017, 7:27:19 PM7/31/17
to druid...@googlegroups.com
Hey Gurdeep,
We ended up going with Redshift for our use case.  It supported all that we needed to do, and also allowed us to create a relational object model - which was a nice added bonus and was lacking in Druid.

Ben

Sent from my iPhone
Reply all
Reply to author
Forward
0 new messages