Producing OLAP cubes with Druid

Adi Leshem

Nov 2, 2015, 8:24:19 AM
to Druid Development

Hi,


I have a bunch of files I want to serve, on demand, to a UI, and use them as "small" OLAP cubes. When a user opens a file in the UI, I want to load it via Druid as quickly as possible and summarize it for them.

The cube has several dimensions (i.e. columns in the file), and I want to expose each dimension as a filtering criterion that the user can combine to their liking.

When the user moves to another file, I want to free that cube from memory and load the next cube.


From the documentation it isn't clear to me whether Druid is suitable for this, performance- and functionality-wise.


My two main questions are:

  1. Is Druid a good technology for this use case in terms of on-demand performance?

  2. How do I perform queries that return multiple dimensions and aggregations? I've been trying for 3 days now and can't seem to figure it out... :(


For example, if my data consists of the following lines:

time      c-ip      method  cs-host  cs-port  cs-fullpath       cs-content-type  cs-user-agent
12:00:00  10.1.1.1  GET     foo.com  80       /path/index.html  text/html        Mozilla 4.0
12:00:00  10.2.2.2  GET     foo.com  80       /path/index.html  text/html        Mozilla 4.0
12:00:02  10.3.3.3  GET     foo.com  80       /path/index.html  text/html        Mozilla 4.0
12:00:02  10.1.1.1  GET     bar.com  80       /path/index.html  text/html        Mozilla 4.0
12:00:02  10.3.3.3  GET     bar.com  80       /path/index.html  text/html        Chrome 80.1
12:00:59  10.1.1.1  GET     bar.com  80       /path/index.html  text/html        IE 10
12:00:59  10.3.3.3  GET     bar.com  80       /path/index.html  text/html        Mozilla 4.0


As such, I'd like each column to be a widget listing its unique values and, for each value, the count (derived from the current filter) of how many times it appeared. For example, based on the data above I'd like Druid to return:

  • For `c-ip` – [{"10.1.1.1": 3}, {"10.2.2.2": 1}, {"10.3.3.3": 3}]

  • For `cs-host` – [{"foo.com": 3}, {"bar.com": 4}]

  • … and the same for all other columns


Upon applying a filter, e.g. `method=="GET" AND cs-host=="foo.com"`, Druid should return the same structure, but only for rows matching the conditions. Even better would be if it returned every value, with a `count` of zero where a value didn't match the filter, e.g. for `cs-host`, after the filter, I'd ideally want [{"foo.com": 3}, {"bar.com": 0}], and for `c-ip`: [{"10.1.1.1": 1}, {"10.2.2.2": 1}, {"10.3.3.3": 1}]



Thank you,

Adi

rohit kochar

Nov 3, 2015, 12:53:25 AM
to Druid Development
Hello,
If I understood correctly, your requirement is to count the number of rows that match a certain criterion.
If that is the case, then Druid supports a "count" aggregator which counts the number of rows (events) that meet the provided filters.
You can use a timeseries query along with the "count" aggregator to achieve this, and if you don't need the data grouped by time, you can set "granularity" to "all".
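
For example, a minimal timeseries query of that shape might look something like this (the datasource name, filter values, and interval here are just illustrative placeholders, not taken from your setup):

```json
{
  "queryType": "timeseries",
  "dataSource": "access_logs",
  "granularity": "all",
  "filter": { "type": "selector", "dimension": "cs-host", "value": "foo.com" },
  "aggregations": [ { "type": "count", "name": "rows" } ],
  "intervals": [ "2015-01-01/2016-01-01" ]
}
```

With "granularity": "all" this returns a single result row containing the total count of events matching the filter over the interval.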


Thanks
Rohit

Adi Leshem

Nov 4, 2015, 9:27:58 AM
to Druid Development

Thanks for the reply, but what I'm actually trying to figure out is whether, by running a single query, I can get the count of each unique value of each dimension.


For example, if I follow your advice and use the count aggregator, I can run a groupBy query like this:


{
  "queryType": "groupBy",
  "dataSource": "wikipedia",
  "granularity": "all",
  "dimensions": ["page"],
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "country", "value": "United States" },
      { "type": "selector", "dimension": "language", "value": "en" }
    ]
  },
  "aggregations": [
    { "type": "count", "name": "rows" }
  ],
  "intervals": [ "2010-01-01/2020-01-01" ]
}



I get the following results:

[ {
  "version" : "v1",
  "timestamp" : "2010-01-01T00:00:00.000Z",
  "event" : {
    "page" : "1935_in_Germany",
    "rows" : 3
  }
}, {
  "version" : "v1",
  "timestamp" : "2010-01-01T00:00:00.000Z",
  "event" : {
    "page" : "1992_Consensus",
    "rows" : 1
  }
}, {
  "version" : "v1",
  "timestamp" : "2010-01-01T00:00:00.000Z",
  "event" : {
    "page" : "2014_Winter_Olympic_and_Paralympic_Games_mascots",
    "rows" : 1
  }
}, … ]


Meaning, I got the count for each "page" value, according to the filters.


Now, what I'm trying to achieve is something like this, but for ALL dimensions.

That is, a single query whose output is the count of each value of each dimension.

Is that possible?

Or is the only option to run the above query for each dimension separately?

Thanks!
Adi

David Lim

Nov 4, 2015, 1:33:02 PM
to Druid Development
Hey Adi,

I believe you would see the best performance by issuing multiple TopN queries over the dimensions you're interested in. I know it's a little cumbersome to have to issue multiple requests to get the data you need, so I thought you might be interested in the following open source tools:

https://github.com/implydata/plywood
https://github.com/implydata/pivot

Plywood helps you issue the appropriate queries to build the visualizations that you're looking for so you can focus on your results without having to worry about the multiple underlying queries required to generate those results. Pivot is a fantastic interface that uses Plywood queries to generate visualizations for exactly the use case you're describing.

Performance wise, I think Druid would be a great fit for the use case that you've described!
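
To sketch one of those TopN queries concretely (the datasource name, threshold, and interval below are placeholders for illustration; the filter mirrors the example from your first message), you'd issue one such query per dimension of interest:

```json
{
  "queryType": "topN",
  "dataSource": "access_logs",
  "dimension": "c-ip",
  "metric": "rows",
  "threshold": 100,
  "granularity": "all",
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "method", "value": "GET" },
      { "type": "selector", "dimension": "cs-host", "value": "foo.com" }
    ]
  },
  "aggregations": [ { "type": "count", "name": "rows" } ],
  "intervals": [ "2015-01-01/2016-01-01" ]
}
```

This returns the top values of `c-ip` (up to the threshold) ranked by row count under the filter; repeating it with a different "dimension" covers your other widgets.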