Jon,
This is a great question. I'd like to start by restating what I
believe the problem is.
You take multiple customers and put them in a single table. However,
each customer can have their own, disparate set of dimensions. What
you've done is say, "we'll set aside up to 100 dimensions and map them
to meaningful dimension names externally".
The problem you are running into, however, is that while each customer
might have, say, a "gender" dimension, customer 1 might have it mapped
to dimension1, customer 2 to dimension2, etc. Each generic column
therefore ends up holding its own mix of "male", "female",
"transgender", etc., so in the limit, every column looks like a "high
cardinality" column.
When you look at your indexes, they appear to be really large compared
to the input data, and you suspect that is because of long runs of 0's
in the bitmap indexes. Am I understanding correctly?
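To make the cardinality blow-up concrete, here's a tiny sketch (all names and values invented) of how two customers' logical dimensions, mapped onto shared generic slots, inflate each slot's distinct-value count. Since a bitmap index keeps one bitmap per distinct value per column, index size grows with that count:

```python
# Hypothetical data: two customers map their own logical dimensions
# onto shared generic slots "dimension1" and "dimension2".
rows = [
    # customer 1: dimension1 = gender, dimension2 = country
    {"customer": 1, "dimension1": "male",   "dimension2": "US"},
    {"customer": 1, "dimension1": "female", "dimension2": "FR"},
    # customer 2: dimension1 = country, dimension2 = gender
    {"customer": 2, "dimension1": "US",     "dimension2": "male"},
    {"customer": 2, "dimension1": "DE",     "dimension2": "female"},
]

def cardinality(rows, col):
    """Number of distinct values in a column = number of bitmaps needed."""
    return len({r[col] for r in rows})

# Each slot accumulates the union of unrelated value sets:
print(cardinality(rows, "dimension1"))  # 4: male, female, US, DE
print(cardinality(rows, "dimension2"))  # 4: US, FR, male, female
```

With real dimension names instead of generic slots, "gender" would hold only gender values and "country" only country values, keeping each column's cardinality (and its bitmap count) small.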
Assuming that is all correct, I'd like to ask some questions.
Question 1:
a) Do any of your customers currently use all 100 dimensions, or is
that number being set aside more as a safety?
b) Assuming it's set aside for "safety", are you currently actually
materializing all 100 dimensions (in your ingestion spec, are you
telling it to include all 100 dimensions, or are you allowing it to
see what dimensions actually exist and build the dimension set from
that?)
c) If it is the case that you are materializing all dimensions, I
wouldn't be surprised if the extra space is actually going to storing
"null" in the unused dimensions. We currently don't optimize
single-valued columns very well; in principle such a column could be
collapsed down to a constant, and that alone might resolve your issue.
That said, the better fix would be to allow Druid to build the
dimension set as it indexes instead of materializing all of them. You
can do this by using a dimension "blacklist" instead of a "whitelist"
in your indexing spec. If you'd be willing to share your spec, we can
probably help point you to it.
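For reference, the relevant knob lives in the dimensionsSpec (sketched here from memory for a roughly 0.7-era spec; the column names are placeholders, so check the field names against your version's docs). Listing dimensions explicitly is the "whitelist"; an empty dimensions list plus dimensionExclusions is the "blacklist" that lets Druid build the dimension set from the data it actually sees:

```json
"dimensionsSpec" : {
  "dimensions" : [],
  "dimensionExclusions" : ["timestamp_col", "some_metric_col"]
}
```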
Question 2:
How many columns do you actually have? That is, if you were to take
the set of dimensions from your current customers, what would the
superset of column names be?
Druid can handle "schema-less" dimensions, meaning that it can simply
add dimensions as it sees new data. Given that your data schema will
be customer-specific and you are going to partition so that each
customer's data co-exists in the same segments, you should be able to
leverage the schema-less dimension sets to great effect.
That is, you can use the actual dimension names you get from your
customers and allow Druid to automatically build up the set of
dimensions that it sees. When you partition by customer, this will
mean that segments with customer X data will have customer X columns,
but not necessarily customer Y columns and vice versa (essentially
giving you a proxy of "per customer datasource" without as much
overhead). This also means that values get re-used across columns with
the same name rather than being scattered across generic slots.
Druid can handle different segments with different schemas, so the
fact that the segments do not share a schema is nothing to be
concerned about.
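As a sketch of the partitioning side (field names from memory for the Hadoop indexer of that era, and "customer_id" is a stand-in for your actual column, so double-check against your version's docs), single-dimension partitioning on the customer column would look roughly like:

```json
"partitionsSpec" : {
  "type" : "dimension",
  "partitionDimension" : "customer_id",
  "targetPartitionSize" : 5000000
}
```

With this, rows for a given customer land in the same segments, so each segment carries only that customer's columns.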
Fwiw, my recommendation is the latter: leverage Druid's schema-less
columns. One thing to note, however, is that the handling of "null"
values is still not fully consistent. The current set of known bugs is
on the ingestion side, though, and has to do with empty string vs.
null. On the query side, using 0.7, I do not expect you to run into
issues with null handling.
--Eric