Hi,
I have a bunch of files that I want to serve, on-demand, to a UI, and use them as "small" OLAP cubes. When a user opens a file in the UI, I want to load it into Druid ASAP and summarize it for the user. Each cube has several dimensions (i.e. columns in the file), and I want to expose each dimension as a filtering criterion that the user can combine to their liking. When the user moves to another file, I want to free that cube from memory and move on to loading the next cube. From the documentation it isn't clear to me whether Druid is suitable for this, performance- and functionality-wise. My main two questions are:
For example, if my data consists of the following lines:

time      c-ip      method  cs-host  cs-port  cs-fullpath       cs-content-type  cs-user-agent
12:00:00  10.1.1.1  GET     foo.com  80       /path/index.html  text/html        Mozilla 4.0
12:00:00  10.2.2.2  GET     foo.com  80       /path/index.html  text/html        Mozilla 4.0
12:00:02  10.3.3.3  GET     foo.com  80       /path/index.html  text/html        Mozilla 4.0
12:00:02  10.1.1.1  GET     bar.com  80       /path/index.html  text/html        Mozilla 4.0
12:00:02  10.3.3.3  GET     bar.com  80       /path/index.html  text/html        Chrome 80.1
12:00:59  10.1.1.1  GET     bar.com  80       /path/index.html  text/html        IE 10
12:00:59  10.3.3.3  GET     bar.com  80       /path/index.html  text/html        Mozilla 4.0

As such, I'd like each column to be a widget of unique values, with a count (derived from the current filter) of how many times each value appeared. For example, based on the data above I'd like Druid to return:
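To make the expected output concrete, here is a small Python sketch (purely illustrative, not Druid) that computes the per-dimension value counts from the sample rows above:

```python
from collections import Counter

# Column names and rows from the sample log data above
columns = ["time", "c-ip", "method", "cs-host", "cs-port",
           "cs-fullpath", "cs-content-type", "cs-user-agent"]
rows = [
    ("12:00:00", "10.1.1.1", "GET", "foo.com", "80", "/path/index.html", "text/html", "Mozilla 4.0"),
    ("12:00:00", "10.2.2.2", "GET", "foo.com", "80", "/path/index.html", "text/html", "Mozilla 4.0"),
    ("12:00:02", "10.3.3.3", "GET", "foo.com", "80", "/path/index.html", "text/html", "Mozilla 4.0"),
    ("12:00:02", "10.1.1.1", "GET", "bar.com", "80", "/path/index.html", "text/html", "Mozilla 4.0"),
    ("12:00:02", "10.3.3.3", "GET", "bar.com", "80", "/path/index.html", "text/html", "Chrome 80.1"),
    ("12:00:59", "10.1.1.1", "GET", "bar.com", "80", "/path/index.html", "text/html", "IE 10"),
    ("12:00:59", "10.3.3.3", "GET", "bar.com", "80", "/path/index.html", "text/html", "Mozilla 4.0"),
]

# For each dimension (column), count how many times each unique value appears
counts = {col: Counter(row[i] for row in rows) for i, col in enumerate(columns)}

print(dict(counts["cs-host"]))  # foo.com appears 3 times, bar.com 4 times
print(dict(counts["c-ip"]))     # 10.1.1.1 and 10.3.3.3 appear 3 times each, 10.2.2.2 once
```

This is exactly the shape of result I'd like Druid to hand back per dimension, with the counts recomputed whenever the filter changes.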
Upon applying a filter, e.g. `method == "GET" AND cs-host == "foo.com"`, Druid should return the same structure, but counting only rows matching the conditions. Even better if I could make it return every value, with `count` zero where it didn't match the filter; e.g. for `cs-host`, after the filter, I'd ideally want: [{"foo.com": 3}, {"bar.com": 0}], and for `c-ip`: [{"10.1.1.1": 1}, {"10.2.2.2": 1}, {"10.3.3.3": 1}]
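One possible way to get the zero-count behavior (a sketch, assuming a hypothetical datasource named `weblogs`) is Druid's filtered aggregator: group by the dimension with no query-level filter, and push the filter into the aggregator, so every value still appears as a group but the count only reflects matching rows:

```json
{
  "queryType": "groupBy",
  "dataSource": "weblogs",
  "granularity": "all",
  "dimensions": ["cs-host"],
  "aggregations": [
    {
      "type": "filtered",
      "filter": {
        "type": "and",
        "fields": [
          { "type": "selector", "dimension": "method", "value": "GET" },
          { "type": "selector", "dimension": "cs-host", "value": "foo.com" }
        ]
      },
      "aggregator": { "type": "count", "name": "count" }
    }
  ],
  "intervals": ["2010-01-01/2020-01-01"]
}
```

With the sample data, this should yield `foo.com` with count 3 and `bar.com` with count 0, rather than dropping `bar.com` entirely.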
Thank you, Adi
Thanks for the reply, but what I'm actually trying to figure out is whether, with a single query, I can get the count of each unique value of each dimension.
For example, if I follow your advice and use the count aggregator, I can run a groupBy query like this:
{
  "queryType": "groupBy",
  "dataSource": "wikipedia",
  "granularity": "all",
  "dimensions": ["page"],
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "country", "value": "United States" },
      { "type": "selector", "dimension": "language", "value": "en" }
    ]
  },
  "aggregations": [
    {
      "type": "count",
      "name": "rows"
    }
  ],
  "intervals": [ "2010-01-01/2020-01-01" ]
}
I get the following results:
[ {
  "version" : "v1",
  "timestamp" : "2010-01-01T00:00:00.000Z",
  "event" : {
    "page" : "1935_in_Germany",
    "rows" : 3
  }
}, {
  "version" : "v1",
  "timestamp" : "2010-01-01T00:00:00.000Z",
  "event" : {
    "page" : "1992_Consensus",
    "rows" : 1
  }
}, {
  "version" : "v1",
  "timestamp" : "2010-01-01T00:00:00.000Z",
  "event" : {
    "page" : "2014_Winter_Olympic_and_Paralympic_Games_mascots",
    "rows" : 1
  }
}, … ]
Meaning, I got the count of each "page" value, subject to the filters.
Now, what I'm trying to achieve is something like this, but for ALL dimensions:
a single query whose output is the count of each value of each dimension.
Is that possible?
Or is the only solution to run the above query for each dimension, separately?
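In case one query per dimension really is the only option, the fan-out is at least easy to script. A minimal sketch (assuming the dimension names from my log example; the Broker URL and datasource name are placeholders) that builds one groupBy body per dimension, all sharing the same filter:

```python
import json

# Dimension names from the web-log example above (assumed)
DIMENSIONS = ["c-ip", "method", "cs-host", "cs-port",
              "cs-fullpath", "cs-content-type", "cs-user-agent"]

def build_queries(datasource, dimensions, flt, intervals):
    """Build one groupBy query body per dimension, all sharing the same filter."""
    queries = []
    for dim in dimensions:
        queries.append({
            "queryType": "groupBy",
            "dataSource": datasource,
            "granularity": "all",
            "dimensions": [dim],
            "filter": flt,
            "aggregations": [{"type": "count", "name": "rows"}],
            "intervals": intervals,
        })
    return queries

flt = {
    "type": "and",
    "fields": [
        {"type": "selector", "dimension": "method", "value": "GET"},
        {"type": "selector", "dimension": "cs-host", "value": "foo.com"},
    ],
}
queries = build_queries("weblogs", DIMENSIONS, flt, ["2010-01-01/2020-01-01"])

# Each body would then be POSTed to the Broker's query endpoint,
# e.g. http://<broker-host>:8082/druid/v2 (host is a placeholder).
print(json.dumps(queries[0], indent=2))
```

That still means one round-trip per dimension, though, which is exactly what I was hoping a single query could avoid.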