Multi-tenant deployment where each client has different dimension set


Jakub Liska

Jan 14, 2016, 10:50:01 AM
to Druid User, Petr J
Hey,

imagine you are ingesting the impression log of multiple clients. The Kafka events of all clients share:
1) userId, cookieID
2) possibly a few common dimensions with corresponding metrics

But each client can have its own dimension set, because their businesses differ.

For simplicity, let's imagine we would never have to scale horizontally beyond a single node, we are just interested in real-time interactivity. 

For instance, this says that you can have some dimensions missing in some events, but what about the schemaless case described above?

Should each client have its own Table DataSource, so that the Storm/Spark Streaming job feeding the real-time node "categorizes" events by client
and stores them in different Druid DataSources?

Can this run on a simple Druid setup with a single real-time service, like the one in Druid's Tutorial? Because from what I know, Druid is able to ingest only a single stream by default, right?

Thank you ! Jakub

Gian Merlino

Jan 14, 2016, 6:21:08 PM
to druid...@googlegroups.com, Petr J
Hey Jakub,

You can have a Druid schema that doesn't explicitly specify dimensions, and in that case any field not already specified as a timestamp or metric will be ingested as a dimension. This feature is often helpful in your case.

When deciding whether to use a shared datasource or a datasource per tenant, the considerations are usually:

Pros of datasources per tenant:
- Each datasource can have its own schema, its own backfills, its own partitioning rules, and its own data load rules
- Queries can be faster since there will be fewer segments to examine for a typical tenant's query
- You get the most flexibility

Pros of shared datasources (each point below is a per-datasource cost that sharing avoids):
- Each datasource requires its own JVMs for realtime indexing
- Each datasource requires its own YARN resources for hadoop batch jobs
- Each datasource requires its own segment files on disk
- For these reasons it can be wasteful to have a very large number of small datasources
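
As a rough illustration of the two layouts, here is a minimal Python sketch of how a stream processor might pick the target datasource for each event. The `client` field, the `impressions` base name, and both function names are hypothetical and not taken from this thread; this is not Druid code, only the routing decision that would sit in front of it.

```python
import re

# Hypothetical base name for the datasource(s); not from the thread.
SHARED_DATASOURCE = "impressions"


def datasource_for(event, per_tenant):
    """Pick the target datasource name for one event."""
    if not per_tenant:
        return SHARED_DATASOURCE
    # Assumes every event carries a "client" field identifying the tenant.
    # Replace characters that might be unsafe in a datasource name.
    tenant = re.sub(r"[^A-Za-z0-9_]", "_", str(event["client"]))
    return f"{SHARED_DATASOURCE}_{tenant}"


def route(events, per_tenant):
    """Group events by target datasource, preserving input order."""
    batches = {}
    for event in events:
        batches.setdefault(datasource_for(event, per_tenant), []).append(event)
    return batches
```

With `per_tenant=False` every event lands in one datasource and the `client` field stays in the event, so it can be ingested as an ordinary dimension and used as a query filter.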

Gian

--
You received this message because you are subscribed to the Google Groups "Druid User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-user+...@googlegroups.com.
To post to this group, send email to druid...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-user/0403a20a-3b5e-452f-a9ac-43f211b62007%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jakub Liska

Jan 14, 2016, 6:35:17 PM
to Druid User, pe...@globalwebindex.net
Hi Gian,

in that case shared datasource would be sufficient. 

Just one correction: I already knew that having arbitrary dimensions is possible.
What I wanted to say is that each client can have arbitrary metrics. Is that possible too with a shared datasource?
Because metrics need to have a pre-defined aggregator, right? Or is it possible to use, for instance, the Count aggregator by default for these arbitrary metrics?

Thank you for the insights! Jakub

Gian Merlino

Jan 14, 2016, 6:38:56 PM
to druid...@googlegroups.com
Hey Jakub,

Arbitrary metrics are not possible, but one way people get around that is to have a few predefined metrics like "met1_sum", "met1_min", "met1_max", and so on.
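
A minimal Python sketch of that workaround, under stated assumptions: it generates a metricsSpec with a fixed number of generic metric slots and renames each tenant's own metric fields to those slots before ingestion. The output names follow the "met1_sum" pattern above; the input field names (`met1`, `met2`, ...), the helper names, and the per-tenant mapping are hypothetical.

```python
def metrics_spec(slots):
    """Build a metricsSpec list with sum/min/max aggregators per generic slot."""
    spec = [{"type": "count", "name": "count"}]
    aggregators = (("sum", "doubleSum"), ("min", "doubleMin"), ("max", "doubleMax"))
    for i in range(1, slots + 1):
        for suffix, agg_type in aggregators:
            spec.append({
                "type": agg_type,
                "name": f"met{i}_{suffix}",
                # Assumed input field name for this slot.
                "fieldName": f"met{i}",
            })
    return spec


def to_slots(event, mapping):
    """Rename a tenant's own metric fields to the shared slot names."""
    return {mapping.get(key, key): value for key, value in event.items()}
```

Each tenant would then need its own mapping (for example `{"revenue": "met1"}`) kept outside Druid, so that query results can be translated back to the tenant's metric names.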

Gian

Jakub Liska

Feb 1, 2016, 7:50:50 AM
to Druid User
Hi Gian,

> You can have a Druid schema that doesn't explicitly specify dimensions, and in that case any field not already specified as a timestamp or metric will be ingested as a dimension. This feature is often helpful in your case.


I cannot confirm that. Steps to reproduce:

1) Index this task: https://gist.github.com/l15k4/1054f2a127b5050c33c0 with events:
{"time": "2015-01-01T00:00:01.000", "gwid": "8f14e45f-ceea-367a-9a36-dedd4bea2543", "country": "nzl", "section": 0.08739450661637685, "purchase": "small", "kv_0": 0.07701255849946403, "kv_1": 0.019320348491121905, "kv_2": 0.2097968108921387, "kv_3": 0.09669132192026665}

Where kv_* are the dynamic dimensions

2) A raw select query, https://gist.github.com/l15k4/78e6474785308d2e8f61, then returns events with only 3 dimensions: country, section and purchase.


Is there anything I need to do explicitly to enable this feature?


Gian Merlino

Feb 1, 2016, 12:30:55 PM
to druid...@googlegroups.com
Hey Jakub,

You need to have a dimensionsSpec that does not have any dimensions. It can have some dimensionExclusions. So for example:

  "dimensionsSpec" : {
    "dimensions" : [],
    "dimensionExclusions" : [
      "timestamp",
      "value"
    ]
  }
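
To make the rule concrete, here is a small Python sketch that emulates it for a single event: with an empty dimensions list, every field that is not the timestamp, a metric input, or an exclusion is treated as a dimension. This only mimics the behaviour described in this thread and is not Druid's implementation; the function name is hypothetical, and the metric fields of the task above are not known here, so the example passes none.

```python
def discovered_dimensions(event, timestamp_column, metric_fields, exclusions):
    """Return the fields of one event that would be treated as dimensions."""
    skip = {timestamp_column} | set(metric_fields) | set(exclusions)
    return [name for name in event if name not in skip]


# A shortened version of the event from the reproduction steps.
event = {
    "time": "2015-01-01T00:00:01.000",
    "gwid": "8f14e45f-ceea-367a-9a36-dedd4bea2543",
    "country": "nzl",
    "section": 0.08739450661637685,
    "purchase": "small",
    "kv_0": 0.07701255849946403,
    "kv_1": 0.019320348491121905,
}

# Assumption: no metric input fields and no extra exclusions.
dims = discovered_dimensions(event, "time", metric_fields=[], exclusions=[])
```

Under these assumptions `dims` contains every field except `time`, including the dynamic `kv_*` fields.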

Gian

Jakub Liska

Feb 3, 2016, 5:01:36 PM
to Druid User
Hi Gian,

that seems to be working, although performance is a little unexpected.

I've done some benchmarks and Druid seems to perform quite well.
I tested it in your distribution-docker container, even with 1000 custom dimensions named "kv_1" to "kv_1000", each with its own normally distributed values (standard deviation 0.2),
and even with that Druid is ingesting 4000 events/s (quad-core with 12 GB RAM).

But this performance applies to JSON data files (50 MB each) where the individual JSON event objects are not delimited by EOL ("\n"). If I do delimit them with newlines, performance drops radically (~8 times) and the task always fails.

So the Overlord really likes one huge line of many JSON objects (50 MB of 1600 JSON objects) and cannot handle 50 MB files each with ~1600 lines of JSON events/objects.
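
For anyone who wants to reproduce the comparison, here is a minimal Python sketch that converts between the two file layouts: it reads a string of back-to-back JSON objects and re-serializes them one per line. It uses only the standard library; the function names are hypothetical, and it says nothing about how Druid itself parses either layout.

```python
import json


def split_concatenated(text):
    """Yield each JSON value from a string of back-to-back JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so skip it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def to_newline_delimited(text):
    """Re-serialize every object on its own line."""
    return "\n".join(json.dumps(obj) for obj in split_concatenated(text))
```

Going the other way is a matter of joining the serialized objects with an empty string instead of "\n".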

Would you please take a look at it? I'm now considering working with json files without EOLs...

Thank you for your insights Gian !

Jakub Liska

Feb 3, 2016, 5:22:09 PM
to Druid User
Github issue https://github.com/druid-io/druid/issues/2389 with benchmark included.