Can druid aggregate on strings(dimension)?

1019 views
Skip to first unread message

Tony Jin

unread,
Jan 20, 2016, 8:51:17 AM1/20/16
to Druid User
Hi, I have a problem about the druid query. My datasource structure is as follows:

clientMac                                   username                osType           traffic
00:19:F8:10:00:01               user0                    Window           10
00:19:F8:10:00:01               user1                    Linux                20
00:19:F8:10:00:02               user2                    Android            30
00:19:F8:10:00:03               user3                    ios                    40

So when I create the indexes, the dimensions are [mac, username, osType] and the metric is traffic.
And what I want to do is to group the traffics by clientMac:

{
    "queryType": "groupBy",
    "dataSource": "binnedSessions",
    "granularity": "all",
    "dimensions": ["clientMac"],
    "aggregations": [
      {
        "type": "count",
        "name": "totalCount"
      }, {
        "type": "doubleSum",
        "name": "traffic",
        "fieldName": "traffic"
               }],
    "intervals": [
      "2015-11-01T00:00:00.000/2015-12-20T23:00:00.000"
    ]
}
I can get 
[
  {
    "version": "v1",
    "timestamp": "2015-11-01T00:00:00.000Z",
    "event": {
      "clientMac": "00:19:F8:10:00:01",
      "totalCount": 2,
      "traffic": 30
    }
  },
  {
    "version": "v1",
    "timestamp": "2015-11-01T00:00:00.000Z",
    "event": {
      "clientMac": "00:19:F8:10:00:02",
      "totalCount": 1,
      "traffic": 30
    }
  },
  {
    "version": "v1",
    "timestamp": "2015-11-01T00:00:00.000Z",
    "event": {
      "clientMac": "00:19:F8:10:00:03",
      "totalCount": 1,
      "traffic": 40
    }
  }
]


But, how can I also have the "username" and "osType" dimensions inside the query results:
[
  {
    "version": "v1",
    "timestamp": "2015-11-01T00:00:00.000Z",
    "event": {
      "clientMac": "00:19:F8:10:00:01",
      "totalCount": 2,
      "username": ["user0", "user1"],
      "osType": ["Window", "Linux"],
      "traffic": 30
    }
  },
  {
    "version": "v1",
    "timestamp": "2015-11-01T00:00:00.000Z",
    "event": {
      "clientMac": "00:19:F8:10:00:02",
      "totalCount": 1,     
              "username": ["user2"],

      "osType": ["Android"],
      "traffic": 30
    }
  },
  {
    "version": "v1",
    "timestamp": "2015-11-01T00:00:00.000Z",
    "event": {
      "clientMac": "00:19:F8:10:00:03",
      "totalCount": 1,             
             "username": ["user3"],

      "osType": ["ios"],
      "traffic": 40
    }
  }
]



Vadim Ogievetsky

unread,
Jan 20, 2016, 1:22:58 PM1/20/16
to Druid User
Hi Tony,

What you are trying to do totally makes sense but unfortunately I doubt that you would be able to do it in a single Druid query any time soon. (Please, Druid peeps, prove me wrong!)

The query you want could be expressed as SQL as follows: 

SELECT
  clientMac,
  MAKE_SET(username) AS 'username',
  MAKE_SET(osType) AS 'osType',
  COUNT(*) AS totalCount,
  SUM(traffic) AS traffic
FROM binnedSessions
WHERE '2015-11-01T00:00:00' <= __time AND __time < '2015-12-20T23:00:00.000'

This is not possible to do in Druid as a single query because Druid does not support a MAKE_SET aggregation. It probably wont in the near future since aggregations are heavily assumed to be numeric.

There is a library called Plywood ( https://github.com/implydata/plywood ) that acts as a query planner for Druid, amongst other things is allows you to issue SQL-like queries using a language called PlyQL.

Bad news is that Plywood/PlyQL do not support MAKE_SET ether (but should pretty soon). 
Good news is that you can get the data out that you need and maybe even create queries that are more useful to you.

The following query will work and will give you all the information that you need:

SELECT
  clientMac,
  (SELECT username AS 'val' GROUP BY username) AS 'username',
  (SELECT osType AS 'val' GROUP BY osType) AS 'osType',
  COUNT(*) AS totalCount,
  SUM(traffic) AS traffic
FROM binnedSessions
WHERE '2015-11-01T00:00:00' <= __time AND __time < '2015-12-20T23:00:00.000'

The only difference is that instead of returning Sets like

"username": ["user2"]

it will return Datasets like:

"username": [{ "val": "user2" }]

But you should be able to figure it out.

The added bonus is that you can add sorting and limiting in the nested queries like so:

SELECT
  clientMac,
  (SELECT username AS 'val', SUM(traffic) AS 't' GROUP BY username ORDER BY t DESC LIMIT 100) AS 'username',
  (SELECT osType AS 'val', SUM(traffic) AS 't' GROUP BY osType ORDER BY t DESC LIMIT 100) AS 'osType',
  COUNT(*) AS totalCount,
  SUM(traffic) AS traffic
FROM binnedSessions
WHERE '2015-11-01T00:00:00' <= __time AND __time < '2015-12-20T23:00:00.000'

And you can nest these queries as much as your heart desires.

If you want to try this, grab the PlyQL command line tool here: https://github.com/implydata/plyql and try pasting in the query above.

If you have any questions or suggestions for other queries you want to do, we would love to hear.

Please head over to:
Where we provide Plywood / PlyQL support.

charles.allen

unread,
Jan 20, 2016, 4:14:48 PM1/20/16
to Druid User
Vadim is correct in that a "MAKE_SET from dimension" kind of functionality is not currently available in stock Druid. That would be a cool feature request though. 

You CAN do multiple queries where the first query gets the list of mac ids and the second sequence of queries gets the sub-features with a filter of mac-id = whatever.

It is certainly clunky to do it in such a manner but can be accomplished with stock druid.

charles.allen

unread,
Jan 20, 2016, 4:15:29 PM1/20/16
to Druid User
Would you mind referencing this topic in a feature request submission to https://github.com/druid-io/druid/issues ?

Tony Jin

unread,
Jan 21, 2016, 1:43:02 AM1/21/16
to Druid User
Hi Vadim,

Thank you very much, your reply is quite helpful. Now I already solved my problem by using two successive queries, just as Charles mentioned, first one to groupBy "clientMac", and second one to query other according dimension values filtered by the clientMac list from the first query. I also tried PlyQL. It's really cool. By the way, you missed the GroupBy clientMac in your sql select script, so it printed:
There was an error getting the data: could not resolve $clientMac

 The correct one is: 
plyql -h ip -q "
  SELECT
  clientMac,
  (SELECT username AS 'val', SUM(traffic_r) AS 't' GROUP BY username ORDER BY t DESC LIMIT 100) AS 'username',
  (SELECT osType AS 'val', SUM(traffic_r) AS 't' GROUP BY osType ORDER BY t DESC LIMIT 100) AS 'osType',
  COUNT(*) AS totalCount,
  SUM(traffic_r) AS traffic
FROM binnedSessions
WHERE '2015-11-01T00:00:00' <= __time AND __time < '2015-12-20T23:00:00.000'
GROUP BY clientMac
LIMIT 5
"

Tony Jin

unread,
Jan 21, 2016, 1:48:12 AM1/21/16
to Druid User
Hi Charles,

Thanks a lot. I solved my problem based on your suggestion. Aggregation on dimension strings will be a cool feature. Hope druid will have it in the near future. Thank you for your efforts on powerful druid. 
Message has been deleted

Tony Jin

unread,
Jan 21, 2016, 1:50:24 AM1/21/16
to Druid User
No problem, i will go to create a feature request  issue on github.


On Thursday, January 21, 2016 at 5:15:29 AM UTC+8, charles.allen wrote:

Tony Jin

unread,
Jan 21, 2016, 2:13:35 AM1/21/16
to Druid User


On Thursday, January 21, 2016 at 5:15:29 AM UTC+8, charles.allen wrote:
Reply all
Reply to author
Forward
0 new messages