Virtual column proposals

Gian Merlino

Dec 7, 2016, 11:30:57 PM
to druid-de...@googlegroups.com
I wanted to gather together some related PR discussions into a dev list discussion, since there's been a lot of chatter on GitHub recently and this represents a potentially major new subsystem in Druid.

So far we have https://github.com/druid-io/druid/pull/2511 merged in master, which adds a simple interface and a few places where it gets used. On deck are https://github.com/druid-io/druid/pull/3758 (expanding the interface and bringing expressions under it) and https://github.com/druid-io/druid/pull/3755 (filtering on expressions, which doesn't use virtual columns currently but would have to if #3758 is merged). All of those PRs have fascinating threads associated with them.

My feeling is that it makes sense to use virtual columns as a "projection" concept throughout Druid, so that aggregators, filters, etc. do not all need to be aware of the different interfaces that exist for transforming things (extractionFn, dimensionSpec, math expressions). This suggests to me that:

- Expressions should be accessible only through virtual columns (mostly to simplify the rest of the code). We can do this now since expressions were never actually usable in a release, only in master.

- extractionFns and dimensionSpecs should at some point be brought into the virtual column world, although we should also continue to support them in the places they exist now for backwards compatibility.

- We need to be able to filter on virtual columns at some point. It would be nice if enough information were plumbed through that bitmap indexes are still usable in cases where they make sense (they won't make sense in all cases).

My concrete proposal for an interface is the interface embodied by #3758. It's very similar to one Navis proposed in a comment to #2511. It will likely need more work to add filtering but it's a start.
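As a rough illustration of the "projection" idea, here is a minimal sketch in which an aggregation only asks a factory for a column by name and never learns whether that column is physical or derived. All types here (`LongSelector`, `SelectorFactory`, `VirtualColumnSketch`, `ProjectionSketch`) are hypothetical, simplified stand-ins invented for this example; they are not the interfaces from #2511 or #3758.

```java
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-ins; NOT the interfaces from #2511 or #3758.
interface LongSelector {
    long get();
}

interface SelectorFactory {
    LongSelector makeLongSelector(String columnName);
}

interface VirtualColumnSketch {
    String outputName();

    LongSelector makeLongSelector(SelectorFactory base);
}

final class ProjectionSketch {
    // Mutable cursor over in-memory rows; selectors read from the current row.
    static final class Cursor implements SelectorFactory {
        private final List<Map<String, Long>> rows;
        private int position = 0;

        Cursor(List<Map<String, Long>> rows) {
            this.rows = rows;
        }

        boolean isDone() {
            return position >= rows.size();
        }

        void advance() {
            position++;
        }

        @Override
        public LongSelector makeLongSelector(String columnName) {
            return () -> rows.get(position).getOrDefault(columnName, 0L);
        }
    }

    // Resolves virtual columns first, then falls back to physical columns.
    static SelectorFactory withVirtualColumns(SelectorFactory base, List<VirtualColumnSketch> virtualColumns) {
        return columnName -> {
            for (VirtualColumnSketch vc : virtualColumns) {
                if (vc.outputName().equals(columnName)) {
                    return vc.makeLongSelector(base);
                }
            }
            return base.makeLongSelector(columnName);
        };
    }

    // Example virtual column: outputName = left + right.
    static VirtualColumnSketch add(String outputName, String left, String right) {
        return new VirtualColumnSketch() {
            @Override
            public String outputName() {
                return outputName;
            }

            @Override
            public LongSelector makeLongSelector(SelectorFactory base) {
                LongSelector lhs = base.makeLongSelector(left);
                LongSelector rhs = base.makeLongSelector(right);
                return () -> lhs.get() + rhs.get();
            }
        };
    }

    // A "sum" aggregation that knows only a column name, not how it is derived.
    static long sum(Cursor cursor, SelectorFactory factory, String fieldName) {
        LongSelector selector = factory.makeLongSelector(fieldName);
        long total = 0;
        while (!cursor.isDone()) {
            total += selector.get();
            cursor.advance();
        }
        return total;
    }
}
```

The point of the design is that `sum` would work unchanged if more kinds of virtual column were added later.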

So far the work here has mostly been done by Navis, although a lot of his work isn't in master yet, pending working out how this stuff should behave. So feedback on the direction of this subsystem is very welcome. I'm excited about moving towards a world where more kinds of query time transformations are possible in Druid.

And thank you Navis for the idea and almost all of the code!

Gian



Julian Hyde

Dec 8, 2016, 12:37:23 AM
to druid-de...@googlegroups.com
What a coincidence; Slim and I were just talking about virtual columns at the espresso machine. I have a few thoughts on where you might take this.

Are there any plans to store virtual columns? Similar to what MySQL calls “generated columns” [1] and I would call “materialized virtual columns”. Regular virtual columns just save typing (like a view in a SQL database) but stored virtual columns affect what is stored on disk and so can potentially give performance benefits. Some cases worth considering: partition on an expression, index an expression, or compute stats on an expression.

As with materialized views, materialized virtual columns are most useful if a user can just write an expression in their query and the planner rewrites in terms of the materialized virtual column. i.e. transparent rewrite.

From the planner’s perspective, you don’t want to say “x is computed using the formula y + 10”, you want to say “there is a constraint that ensures that x is always equal to y + 10, or equivalently, y is always equal to x - 10”. At query time, materialized virtual columns are just regular columns, and the constraint formulation puts all columns on an equal footing.
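To make "transparent rewrite" concrete, here is a deliberately small sketch. It assumes constraints are recorded as "stored column equals expression" and that expressions are matched as whitespace-normalized strings; a real planner would compare expression trees and could also use the constraint in the other direction (y = x - 10). The class `MaterializedColumnRewriter` and its methods are hypothetical and not part of Druid or any planner.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of transparent rewrite; not an existing Druid or planner API.
final class MaterializedColumnRewriter {
    // Maps a normalized expression to the stored column constrained to equal it.
    private final Map<String, String> constraints = new HashMap<>();

    // Records the constraint "column is always equal to expression".
    void addConstraint(String column, String expression) {
        constraints.put(normalize(expression), column);
    }

    // Returns the stored column if one is constrained to equal the expression;
    // otherwise returns the expression unchanged, to be computed at query time.
    String rewrite(String expression) {
        return constraints.getOrDefault(normalize(expression), expression);
    }

    // Crude normalization that only strips whitespace (an assumption of this sketch).
    private static String normalize(String expression) {
        return expression.replaceAll("\\s+", "");
    }
}
```

With the constraint `x = y + 10` registered, a query that writes `y+10` would be answered from the stored column `x`, while `y + 11` would still be evaluated per row.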

I can see how it might make sense to make all expressions into virtual columns throughout Druid’s code base, but be careful you don’t go too far and assume that they all have the same access cost. You don’t want to compute everything you need in a scan, only to throw away 99.99% of the rows and waste the effort of computing values you never use.
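The access-cost concern can be illustrated with a small sketch: evaluate the expensive expression only for rows that survive the filter. The names (`LazyProjectionSketch`, `filterThenProject`) are made up for this example, and the filter here reads the raw value rather than the derived one.

```java
import java.util.function.LongPredicate;
import java.util.function.LongUnaryOperator;

// Hypothetical sketch; shows filter-before-project ordering only.
final class LazyProjectionSketch {
    // Sums expensive(value) over the rows that pass the filter. The expensive
    // expression is never evaluated for rows that the filter rejects.
    static long filterThenProject(long[] values, LongPredicate filter, LongUnaryOperator expensive) {
        long total = 0;
        for (long value : values) {
            if (filter.test(value)) {
                total += expensive.applyAsLong(value);
            }
        }
        return total;
    }
}
```

If the filter keeps one row in ten thousand, the expression is evaluated once per surviving row instead of once per scanned row.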

Julian

--
You received this message because you are subscribed to the Google Groups "Druid Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-developm...@googlegroups.com.
To post to this group, send email to druid-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-development/D8421593-DCDB-4BA1-9A35-25351D3B5519%40imply.io.
For more options, visit https://groups.google.com/d/optout.

Gian Merlino

Dec 8, 2016, 11:45:33 AM
to druid-de...@googlegroups.com
I like that espresso machine talk is the new water cooler talk :)

The main motivation for virtual columns in Druid right now is that they do _more_ than save typing. They enable functionality that is not currently possible. Druid's JSON language, and query internals, can't provide even some simple things like GROUP BY IF(x <> '', x, y). Virtual columns provide a projection layer that Druid was previously lacking (and thanks again Navis for the idea and initial code).
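For readers less familiar with the SQL form, the following sketch shows what GROUP BY IF(x <> '', x, y) with a COUNT(*) computes, written as a projection function followed by grouping. It assumes both dimensions are non-null strings and does not model how Druid treats null or empty values; the `Row` class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of grouping on a projected value.
final class GroupByProjectionSketch {
    // A row with two string dimensions, named x and y as in the example above.
    static final class Row {
        final String x;
        final String y;

        Row(String x, String y) {
            this.x = x;
            this.y = y;
        }
    }

    // The projection IF(x <> '', x, y); assumes x and y are non-null.
    static String project(Row row) {
        return !row.x.isEmpty() ? row.x : row.y;
    }

    // Counts rows per projected value, i.e. GROUP BY IF(x <> '', x, y) with COUNT(*).
    static Map<String, Long> countByProjection(List<Row> rows) {
        Map<String, Long> counts = new TreeMap<>();
        for (Row row : rows) {
            counts.merge(project(row), 1L, Long::sum);
        }
        return counts;
    }
}
```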

Everything else you're saying makes total sense. I didn't personally have plans to store virtual columns but the idea sounds very useful. We could probably do this by incorporating them into the ingestion layer too, and then users can do a re-indexing to materialize a virtual column by specifying one to the indexer.

Gian


Gian Merlino

Dec 8, 2016, 11:49:50 AM
to druid-de...@googlegroups.com
For anyone who hasn't taken the time to read through #2511, #3758, #3755, I think the biggest question at this point is whether expressions should be accessible _directly_ by filters, aggregators, etc, or only through an "expression" virtual column. I'm in favor of the latter (I think it simplifies the implementation and makes it easier to keep the subsystems working well together and free of bugs) but it does mean that the JSON language is a bit more verbose, and that it will take some more work before filtering on expressions is possible. Virtual columns in their current form don't work well with filters. But, I think they could…

Gian

Jonathan Wei

Dec 8, 2016, 9:40:05 PM
to druid-de...@googlegroups.com
I'm in favor of using virtual column as the common "projection" layer, to avoid having too many separate transformation interfaces. It would keep the implementation simpler for developers, and I think having a unified 'language' for expressing value transformations is also more user-friendly.

I think having expression transformations be accessed exclusively through virtual columns is the right way to go, vs. having the filters/aggregators/etc. be aware of them (I see their responsibility as limited to filtering/aggregating on some provided set of values, without any regard for how those values were derived).

I agree with transitioning extractionFn/dimensionSpec use cases to virtual columns as well. Personally, I've regarded those as a rudimentary form of "virtual column". I haven't fully thought through all of the implementation details, but it may be fairly straightforward to do that transition with new virtual column types that handle the existing extractionFn/'decorating' DimensionSpec use cases, and some translation layer that converts the old-style specs into the equivalent virtual column specs for backwards compatibility.

Stored virtual columns also sound pretty cool. One minor thing we could probably do with that is to have the current coordinate-pair "spatial" dimension use case be expressed with virtual columns instead.

- Jon


Roman Leventov

Dec 9, 2016, 2:05:59 AM
to druid-de...@googlegroups.com
I support moving expressionFns away from aggregators, because it's aligned with the single responsibility principle.

I don't like the idea of turning expression filters into "virtual columns + filter on value=true/false" combination.
 - Filter abstraction is not gone, so it won't simplify all related code fundamentally.
 - Such filters would either have to use a generic interface that returns Object as the dimension value, which incurs boxing of boolean values (allocation-free, but definitely more expensive than working with the raw boolean values returned from ValueMatcher.matches()), or we would need to add another boolean specialization (alongside the Long and Double specializations which already exist somewhere). That adds complexity, which is sometimes acceptable, but given that the main argument for moving expression filters to virtual columns is reducing complexity, it weakens that argument.
 - Consider the case where you want to filter on some virtual value using an expression and aggregate on it at the same time, e.g.
virtualX = metricA + metricB
filter: virtualX > metricC
aggregate: sum (virtualX)
If we implicitly make virtualY = virtualX > metricC and want virtualX to be computed only once while processing a row, it could be tricky to implement: e.g. cache the computed value inside the virtualX selector (but if there is no filter like the one in the example, we may not want to do this, for better performance) and reference the virtualX selector from the virtualY selector (but if the "filter virtual column" references only "simple" columns, we may not want to do this, for better performance). With proper expression filters this seems easier to implement, and it may avoid writing the cached values to fields of long-lived objects (like selectors), giving a chance for scalar replacement if the intermediate row value holder is scalarized (that won't happen automatically, but it could be done with the semi-manual query processing specialization which I'm currently working on).
 - Talking about simplicity for users, to me it feels more like simplicity in the sense "when you have a hammer everything looks like a nail". Again, given that filter abstraction will still exist, I'm more sympathetic to "native" expression filters.

Regarding DimensionSpecs, LookupDimensionSpec is definitely a candidate for replacement with virtual columns. ListFiltered and RegexFiltered are quite special. I'm not ready to reason about them without deeper thought: they may also fit into the virtual column abstraction, or they may be better kept as a distinct matter, perhaps made internal to query processing rather than a user-facing abstraction as they are now.


Gian Merlino

Dec 9, 2016, 11:13:51 AM
to druid-de...@googlegroups.com
Roman, thanks for commenting.

I think there's nothing really wrong with adding a boolean specialization to column selectors. I could see adding a double specialization too, since right now only having "float" means sometimes we downcast doubles to floats when ideally we would keep them as doubles. It is more complexity but it's the "good" kind of complexity, in that we're not adding any additional links between subsystems, just fleshing out the existing ones more.

I figured we would want to cache virtual column selector values someday anyway. It'll be a relatively common desire, I think, such as for lookups (you have one virtual column looking up the value, and then potentially a variety of operations on that looked up value). Also, I think the caching question is kind of independent. In your example, in either case (native expression filters vs. not) we'd want to cache virtualX and not cache "virtualX > metricC". We could accomplish that by only caching selector values if they're referenced by other selectors, and this could be done with or without native expression filters.
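One possible shape for "cache only when referenced by more than one consumer" is sketched below, using the virtualX = metricA + metricB example from the earlier message. Everything here is an assumption made for illustration: the `CachedPerRow` wrapper, the `cacheIfShared` helper, the use of a row position to detect row changes, and the reference count being known up front. Druid's selectors are not implemented this way as far as this thread shows.

```java
import java.util.function.IntSupplier;
import java.util.function.LongSupplier;

// Hypothetical sketch of per-row caching for a shared virtual column selector.
final class CachingSelectorSketch {
    // Wraps a selector and recomputes only when the cursor has moved to a new row.
    static final class CachedPerRow implements LongSupplier {
        private final LongSupplier delegate;
        private final IntSupplier currentRow;
        private int cachedRow = -1;
        private long cachedValue;

        CachedPerRow(LongSupplier delegate, IntSupplier currentRow) {
            this.delegate = delegate;
            this.currentRow = currentRow;
        }

        @Override
        public long getAsLong() {
            int row = currentRow.getAsInt();
            if (row != cachedRow) {
                cachedValue = delegate.getAsLong();
                cachedRow = row;
            }
            return cachedValue;
        }
    }

    // Adds the cache only when more than one consumer references the selector.
    static LongSupplier cacheIfShared(LongSupplier selector, int referenceCount, IntSupplier currentRow) {
        return referenceCount > 1 ? new CachedPerRow(selector, currentRow) : selector;
    }

    // Runs "filter: virtualX > metricC; aggregate: sum(virtualX)" and returns
    // {sum, number of times virtualX = metricA + metricB was evaluated}.
    static long[] sumWhereXGreaterThanC(long[] a, long[] b, long[] c) {
        int[] position = {0};
        long[] evaluations = {0};
        LongSupplier rawX = () -> {
            evaluations[0]++;
            return a[position[0]] + b[position[0]];
        };
        // virtualX is referenced twice: once by the filter, once by the aggregator.
        LongSupplier virtualX = cacheIfShared(rawX, 2, () -> position[0]);

        long sum = 0;
        for (position[0] = 0; position[0] < a.length; position[0]++) {
            if (virtualX.getAsLong() > c[position[0]]) { // filter reads virtualX
                sum += virtualX.getAsLong();             // aggregator reads it again
            }
        }
        return new long[] {sum, evaluations[0]};
    }
}
```

The boolean result of "virtualX > metricC" is not cached here because only the filter consumes it, which matches the distinction drawn above.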

I think native expression filters can make sense but I really think that including methods in ColumnSelectorFactory and ValueMatcherFactory that are expression-aware is adding the "bad" kind of complexity: links between subsystems that don't necessarily need to be linked. In this case, the storage adapter and expression subsystems.

Perhaps we could agree on something like this. On the VirtualColumn interface, add:

   ValueMatcher makeValueMatcher(String columnName, ColumnSelectorFactory factory);

And then on ValueMatcherFactory, add:

   ValueMatcher makeValueMatcher(String columnName, VirtualColumn virtualColumn);

And remove "makeMathExpressionSelector" from ColumnSelectorFactory.

This would accomplish:

1) Native expression filters could work by accepting an expression string, creating an ExpressionVirtualColumn, and passing that to the ValueMatcherFactory. You could also write a "virtual column" filter if you wanted that would accept any inline anonymous virtual column, or named virtual column.
2) ValueMatcher is already specialized to boolean so no boxing needed.
3) No need for ColumnSelectorFactory impls to know about expressions (this is the main thing I'm hoping to avoid – a link between storage adapters and expressions).

Most ValueMatcherFactory impls already have a ColumnSelectorFactory available to them so I hope it won't be too hard for them to implement "ValueMatcher makeValueMatcher(String columnName, VirtualColumn virtualColumn)".
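The delegation described above might look roughly like the following. Only the two `makeValueMatcher` signatures are taken from the proposal; the interfaces are reduced stand-ins for the Druid types of the same name, `makeLongSelector` is an invented method, and the `greaterThan` virtual column is a hypothetical substitute for an ExpressionVirtualColumn that would parse an expression string.

```java
import java.util.function.LongSupplier;

// Simplified stand-ins for the Druid types named above. Only the two
// makeValueMatcher signatures come from the proposal; the rest is assumed.
interface ValueMatcher {
    boolean matches();
}

interface ColumnSelectorFactory {
    LongSupplier makeLongSelector(String columnName); // hypothetical method
}

interface VirtualColumn {
    ValueMatcher makeValueMatcher(String columnName, ColumnSelectorFactory factory);
}

interface ValueMatcherFactory {
    ValueMatcher makeValueMatcher(String columnName, VirtualColumn virtualColumn);
}

final class ValueMatcherProposalSketch {
    // Hypothetical virtual column that is true when "left > right".
    static VirtualColumn greaterThan(String left, String right) {
        return (columnName, factory) -> {
            LongSupplier lhs = factory.makeLongSelector(left);
            LongSupplier rhs = factory.makeLongSelector(right);
            // ValueMatcher is already boolean-typed, so no boxing is involved.
            return () -> lhs.getAsLong() > rhs.getAsLong();
        };
    }

    // A ValueMatcherFactory that already holds a ColumnSelectorFactory can
    // implement the new method by delegating to the virtual column.
    static ValueMatcherFactory delegatingTo(ColumnSelectorFactory factory) {
        return (columnName, virtualColumn) -> virtualColumn.makeValueMatcher(columnName, factory);
    }
}
```

Note that the selector factory in this sketch never mentions expressions, which is the decoupling the proposal is after.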

Gian

Roman Leventov

Dec 9, 2016, 4:15:04 PM
to druid-de...@googlegroups.com
On Fri, Dec 9, 2016 at 10:12 AM, Gian Merlino <gi...@imply.io> wrote:
I think native expression filters can make sense but I really think that including methods in ColumnSelectorFactory and ValueMatcherFactory that are expression-aware is adding the "bad" kind of complexity: links between subsystems that don't necessarily need to be linked. In this case, the storage adapter and expression subsystems.

I agree with this. However, your proposal makes VirtualColumn a leaky abstraction, because it starts to know that it "could be a filter".

Another way to isolate ValueMatcherFactories and StorageAdapters from expressions is to make Filter.makeMatcher() accept a ColumnSelectorFactory. Simple Filters could then pull a single DimensionSelector, while expression filters may pull several. And the ValueMatcherFactory interface could be removed.
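A minimal sketch of this alternative follows, assuming single-valued string dimensions and long metrics. The interfaces are simplified stand-ins that reuse the Druid names from the thread; the selector methods and the two example filters are hypothetical.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Simplified stand-ins for the Druid types named above; method shapes are assumed.
interface ValueMatcher {
    boolean matches();
}

interface ColumnSelectorFactory {
    Supplier<String> makeDimensionSelector(String dimension); // hypothetical, single-valued

    LongSupplier makeLongSelector(String columnName);         // hypothetical
}

interface Filter {
    ValueMatcher makeMatcher(ColumnSelectorFactory factory);
}

final class FilterSketch {
    // Simple filter: pulls exactly one dimension selector from the factory.
    static Filter selector(String dimension, String value) {
        return factory -> {
            Supplier<String> dimensionSelector = factory.makeDimensionSelector(dimension);
            return () -> value.equals(dimensionSelector.get());
        };
    }

    // Expression-style filter: pulls several selectors (here, "left > right").
    static Filter greaterThan(String left, String right) {
        return factory -> {
            LongSupplier lhs = factory.makeLongSelector(left);
            LongSupplier rhs = factory.makeLongSelector(right);
            return () -> lhs.getAsLong() > rhs.getAsLong();
        };
    }
}
```

In this shape neither the factory nor VirtualColumn needs to know about filters; each Filter decides which selectors it needs.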

Gian Merlino

Dec 9, 2016, 4:27:53 PM
to druid-de...@googlegroups.com
IMO a "makeValueMatcher" method on VirtualColumn isn't really very leaky, since ValueMatcher is just a boolean-typed selector. "ValueMatcher makeValueMatcher(factory)" is essentially the same thing as "BooleanColumnSelector makeBooleanColumnSelector(factory)" method, but just with a funny name. The implementation would not need to know about filters, it just has to return a boolean.

Filter.makeMatcher accepting ColumnSelectorFactory sounds good to me right now. I haven't thought through it that much, but if it works then I think it'd also be a good solution.

Gian


Himanshu Gupta

Dec 12, 2016, 4:53:25 PM
to Druid Development
I skimmed through #2511 and #3758 today. For me, here are the preferred high-level expectations.

1- Virtual columns should work universally across all query types, aggregators, filters and whatever else deals with virtual columns.
2- Virtual columns should work "transparently", that is, without every extension having to implement support for them. For example, "all" aggregators and filters should be able to use things like expressions without needing to implement that specially in each aggregator or filter. An extension of the same idea is that adding more things like "expressions" should only require implementing another virtual column, with no other changes.
3- We should be able to do the grouping in groupBy using virtual columns.
4- The semantics of virtual columns should be the same everywhere. For example, it currently looks like a "select" query would include the virtual column output in the response returned to the user, while "groupBy" would not, and there they exist only to be referenced by aggregators/filters.
5- Folding extractionFn into virtual columns.
6- During ingestion, users should be able to combine raw input row columns using virtual columns. However, this one is not super high priority because this "can be" achieved by doing simple ETL upfront or writing a custom InputRowParser to handle any row level transformation.

With those in mind, I think https://github.com/druid-io/druid/pull/3758 is a step in the right direction, as it takes care of (1), (2) and (6).

I see (3) mentioned in the discussions, and possibly planned for a future PR, but we should at least think it through now to avoid too much rework later.

(5) is great for supporting cases where users want to apply an extraction function to multiple columns. That said, there are already some open PRs to enable this particular use case (I haven't looked at those yet), and if those fulfill the use case without too much of a hack or maintenance burden, then (5) is no longer too important.

None of the discussion talked about (4), but I believe it is important for keeping the mental model of virtual columns consistent for end users. That means we might need a notion other than virtual columns for reporting virtual column output in the select query response.


-- Himanshu



Gian Merlino

Dec 13, 2016, 7:01:03 PM
to druid-de...@googlegroups.com
Himanshu,

I think (1) and (2) are well addressed in the existing PRs. (6) is partially addressed (but not completely) in that the IncrementalIndex can support virtual columns during ingestion, but there isn't a way to configure it. There could be at some point.

(3), (5) I think are good candidates for future work.

For (4) I think this is due to some weirdness in the API of the Select query. It has "dimensions" and "metrics" in its API to determine what it returns (and #2511 added "virtualColumns") but really, it should just have "columns". Since there's no grouping going on, dimensions/metrics are not a meaningful concept. I would be on board with a change to the Select API that has a single "columns" list determining what gets output, which I think would address your concerns. How does that sound?

Gian


Gian Merlino

Dec 13, 2016, 7:26:51 PM
to druid-de...@googlegroups.com
Roman,

In your suggestion, how would ExpressionFilter.makeMatcher(ColumnSelectorFactory) be implemented? Would it do something like:

  parsedExpression.eval(bindingsFrom(columnSelectorFactory)).asBoolean()

or:

  new ExpressionVirtualColumn(expressionString).makeValueMatcher(columnSelectorFactory)

[of course caching either the ExpressionVirtualColumn or the parsedExpression – just did one line for brevity.]

I'm okay with either one. I'm guessing you had the first one in mind since you had argued against the makeValueMatcher method appearing on ExpressionVirtualColumn.

Gian


Gian Merlino

Dec 13, 2016, 7:32:04 PM
to druid-de...@googlegroups.com
Also, does anyone see issues with getting rid of "ValueMatcherFactory"? If we did that, we could still have some convenience methods on Filters like Filters.makeValueMatcher(ColumnSelectorFactory, String columnName, DruidPredicateFactory) and Filters.makeValueMatcher(ColumnSelectorFactory, String columnName, String value). Most existing filters would use those convenience methods.

Gian


Roman Leventov

Dec 14, 2016, 2:38:53 AM
to druid-de...@googlegroups.com
On Tue, Dec 13, 2016 at 6:31 PM, Gian Merlino <gi...@imply.io> wrote:
Also, does anyone see issues with getting rid of "ValueMatcherFactory"? If we did that, we could still have some convenience methods on Filters like Filters.makeValueMatcher(ColumnSelectorFactory, String columnName, DruidPredicateFactory) and Filters.makeValueMatcher(ColumnSelectorFactory, String columnName, String value). Most existing filters would use those convenience methods.

I've gone through the ValueMatcherFactory implementations again and found no conceptual problems with this idea. And it looks like all ValueMatcherFactories have repetitive code like

if (row is empty)
  return valueToMatchEqualsToNull
for (value : row) {
  if (value.equals(valueToMatch))
    return true
}
return false

This could be factored out so that it is not repeated.
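A shared helper for the loop above could look like this sketch. It assumes a dimension row is available as a list of strings and reads "valueToMatchEqualsToNull" as a plain null check; Druid's actual null and empty-string handling is not modeled. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.Objects;

// Hypothetical helper that factors out the repeated matching loop.
final class RowMatchSketch {
    // Returns true if a (possibly multi-valued) dimension row matches valueToMatch.
    // An empty row is treated as null, so it matches only when valueToMatch is null.
    static boolean matches(List<String> row, String valueToMatch) {
        if (row.isEmpty()) {
            return valueToMatch == null;
        }
        for (String value : row) {
            if (Objects.equals(value, valueToMatch)) {
                return true;
            }
        }
        return false;
    }
}
```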

However, ValueMatcherFactory has the potential to go below the DimensionSelector/IndexedInts interface and avoid one layer of abstraction. This potential is used in IncrementalIndexStorageAdapter, but not in QueryableIndexStorageAdapter. To avoid adding another layer of abstraction and a possible performance regression, DimensionSelector.makeValueMatcher(value) and DimensionSelector.makeValueMatcher(predicateFactory) should be added, basically as the new home for the current logic of ValueMatcherFactories. On the other hand, it is probably a good idea to make this refactoring anyway, because the logic of DimensionSelector.makeValueMatcher() and getRow() is similar, so it is better kept in one place (DimensionSelector) rather than two (DimensionSelector and ValueMatcherFactory).
 

Gian

On Tue, Dec 13, 2016 at 4:26 PM, Gian Merlino <gi...@imply.io> wrote:
Roman,

in your suggestion, how would ExpressionFilter.makeMatcher(ColumnSelectorFactory) be implemented? Would it do something like:

  parsedExpression.eval(bindingsFrom(columnSelectorFactory)).asBoolean()

or:

  new ExpressionVirtualColumn(expressionString).makeValueMatcher(columnSelectorFactory)

[of course caching either the ExpressionVirtualColumn or the parsedExpression – just did one line for brevity.]

I'm okay with either one. I'm guessing you had the first one in mind since you had argued against the makeValueMatcher method appearing on ExpressionVirtualColumn.

Yes, I meant the first version.

Gian Merlino

Dec 14, 2016, 2:21:04 PM
to druid-de...@googlegroups.com
How does this sound for moving forward:

1) Make Filters use ColumnSelectorFactory directly for building row-based matchers.

- Remove ValueMatcherFactory completely.
- Change Filter.makeMatcher(ValueMatcherFactory) to Filter.makeMatcher(ColumnSelectorFactory).
- Add some helper methods to Filters.java: Filters.makeValueMatcher(ColumnSelectorFactory, String columnName, DruidPredicateFactory) and Filters.makeValueMatcher(ColumnSelectorFactory, String columnName, String value). Most existing filters should use these.
- If needed for performance, add DimensionSelector.makeValueMatcher(value) and DimensionSelector.makeValueMatcher(predicateFactory). If added, then the helper methods in Filters.java should use these when appropriate.

2) Remove the dependency of aggregators and storage adapters on expressions.

- Remove makeMathExpressionSelector from ColumnSelectorFactory.
- Remove "expression" from aggregatorFactories, in favor of users using fieldName + an "expression" typed virtual column.
- Add some helper method somewhere that makes an ObjectBindings from a ColumnSelectorFactory. A new ExpressionFilter and ExpressionVirtualColumn should use this.

3) Support filtering on expressions using an "expression" filter.

- Create ExpressionFilter that uses the helper from (2).

4) Support filtering on virtual columns.

- Have BitmapIndexSelector impls return null on getBitmapIndex(columnName) when a virtual column exists for columnName, to force ValueMatcher-based filtering. This, combined with (1) above should be enough.
- This does imply that filtering on virtual columns will always be done matcher-style and will not use indexes. I think this is fine for a start, although one day we might want to figure out a way to use indexes when possible.
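Step (4) can be sketched as follows: an index selector that reports "no index" for any name that resolves to a virtual column, so the caller falls back to a row-by-row matcher. This is a toy model, assuming indexes are a map from column name to per-value bitmaps; the class name, constructor, and `mustUseValueMatcher` helper are hypothetical, and only the idea of returning null from getBitmapIndex(columnName) comes from the plan above.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model of a bitmap index selector that is aware of virtual columns.
final class VirtualAwareBitmapIndexSelector {
    private final Map<String, Map<String, BitSet>> physicalIndexes;
    private final Set<String> virtualColumnNames;

    VirtualAwareBitmapIndexSelector(
            Map<String, Map<String, BitSet>> physicalIndexes,
            Set<String> virtualColumnNames) {
        this.physicalIndexes = physicalIndexes;
        this.virtualColumnNames = virtualColumnNames;
    }

    // Returns the per-value bitmaps for a physical column, or null when the name
    // refers to a virtual column (even one that shadows a physical column) or
    // when no index exists for it.
    Map<String, BitSet> getBitmapIndex(String columnName) {
        if (virtualColumnNames.contains(columnName)) {
            return null;
        }
        return physicalIndexes.get(columnName);
    }

    // True when the caller has to fall back to ValueMatcher-based filtering.
    boolean mustUseValueMatcher(String columnName) {
        return getBitmapIndex(columnName) == null;
    }
}
```

The shadowing check comes first so that a virtual column named after a physical column never picks up the physical column's index by accident.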

Gian


Gian Merlino

Dec 14, 2016, 2:27:26 PM
to druid-de...@googlegroups.com
Notably missing from the previous post is moving extractionFns into virtual columns, which a couple people have suggested doing. The biggest issue there, to me, is how to support filtering on extractionFn'ed dimensions, if done through virtual columns, while still retaining the ability to use indexes. But I think we can deal with that later.

Gian

On Wed, Dec 14, 2016 at 11:20 AM, Gian Merlino <gi...@imply.io> wrote:
How does this sound for moving forward:

1) Make Filters use ColumnSelectorFactory directly for building row-based matchers.

- Remove ValueMatcherFactory completely.
- Change Filter.makeMatcher(ValueMatcherFactory) to Filter.makeMatcher(ColumnSelectorFactory).
- Add some helper methods to Filters.java: Filters.makeValueMatcher(ColumnSelectorFactory, String columnName, DruidPredicateFactory) and Filters.makeValueMatcher(ColumnSelectorFactory, String columnName, String value). Most existing filters should use these.
- If needed for performance, add DimensionSelector.makeValueMatcher(value) and DimensionSelector.makeValueMatcher(predicateFactory). If added, then the helper methods in Filters.java should use these when appropriate.
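
To make step (1) concrete, here is a rough sketch of the two helper overloads. It rests on assumptions: ValueMatcher and ColumnSelectorFactory are simplified stand-ins declared locally (not Druid's real interfaces), makeStringSelector is a hypothetical method, and a plain Predicate stands in for DruidPredicateFactory.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Simplified stand-ins for illustration; not Druid's real interfaces.
interface ValueMatcher {
  boolean matches();
}

interface ColumnSelectorFactory {
  // Hypothetical: returns a supplier that reads the named column from the current row.
  Supplier<String> makeStringSelector(String columnName);
}

final class Filters {
  private Filters() {}

  // Matcher for "column equals value", built directly on the selector factory.
  static ValueMatcher makeValueMatcher(ColumnSelectorFactory factory, String columnName, String value) {
    final Supplier<String> selector = factory.makeStringSelector(columnName);
    return () -> Objects.equals(selector.get(), value);
  }

  // Matcher for an arbitrary predicate (a plain Predicate stands in for DruidPredicateFactory).
  static ValueMatcher makeValueMatcher(
      ColumnSelectorFactory factory, String columnName, Predicate<String> predicate) {
    final Supplier<String> selector = factory.makeStringSelector(columnName);
    return () -> predicate.test(selector.get());
  }
}

final class FiltersDemo {
  public static void main(String[] args) {
    // The "current row" is a mutable map that the selectors read from.
    final Map<String, String> row = new HashMap<>();
    final ColumnSelectorFactory factory = name -> () -> row.get(name);

    final ValueMatcher isUs = Filters.makeValueMatcher(factory, "country", "US");
    final ValueMatcher startsWithU =
        Filters.makeValueMatcher(factory, "country", (String s) -> s != null && s.startsWith("U"));

    row.put("country", "US");
    System.out.println(isUs.matches() + " " + startsWithU.matches());
    row.put("country", "UK");
    System.out.println(isUs.matches() + " " + startsWithU.matches());
  }
}
```

Because the matchers only depend on the selector factory, the same helpers would work unchanged for any column the factory can serve, which is what lets virtual columns be filtered later in step (4).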

2) Remove the dependency of aggregators and storage adapters on expressions.

- Remove makeMathExpressionSelector from ColumnSelectorFactory.
- Remove "expression" from aggregatorFactories, in favor of users using fieldName + an "expression" typed virtual column.
- Add some helper method somewhere that makes an ObjectBindings from a ColumnSelectorFactory. A new ExpressionFilter and ExpressionVirtualColumn should use this.

3) Support filtering on expressions using an "expression" filter.

- Create ExpressionFilter that uses the helper from (2).

4) Support filtering on virtual columns.

- Have BitmapIndexSelector impls return null on getBitmapIndex(columnName) when a virtual column exists for columnName, to force ValueMatcher-based filtering. This, combined with (1) above, should be enough.
- This does imply that filtering on virtual columns will always be done matcher-style and will not use indexes. I think this is fine for a start, although one day we might want to figure out a way to use indexes when possible.

Gian

Roman Leventov

Dec 14, 2016, 2:40:00 PM
to druid-de...@googlegroups.com
Sounds good to me. I think there should be 2-4 separate PRs, not a single one.

Himanshu

Dec 15, 2016, 12:26:12 AM
to druid-de...@googlegroups.com
Gian, I am in general agreement with what you said in the last 2 posts. I guess I can give more comments when there is a PR.

Regarding the select query, I agree that it should just have columns to output stuff.

Regarding extractionFn, it has two use cases:
1) Just transforming the column value and doing groupBy/topN etc. on the mapped value, with no filtering. This one should automatically be achieved by doing the "groupBy should be able to have virtual columns as groupBy keys" work.
2) Filtering on transformed values. We are already enabling filters on virtual columns (that don't use indexes), so one version of filtering on transformed values is getting implemented anyway.

With (1) and (2) done, filters with an extractionFn are just an optimization (because (2) can do this too), and they can be removed once we can manage index usage when filtering on virtual columns. That part can be left for the future.

-- Himanshu

Gian Merlino

Dec 22, 2016, 4:28:57 PM
to druid-de...@googlegroups.com
PR for part 1: https://github.com/druid-io/druid/pull/3797. I didn't add DimensionSelector.makeValueMatcher(value) or DimensionSelector.makeValueMatcher(predicateFactory) because the benchmarks looked fine without them.

Gian

Gian Merlino

Jan 6, 2017, 5:03:33 PM
to druid-de...@googlegroups.com
The PR for (half of) part 2 was https://github.com/druid-io/druid/pull/3815. It's already been merged. I didn't remove "expression" from aggregators at this time because I thought it was best to avoid making user-facing API changes at this point.

I just raised a PR that wasn't on this list but that I think is probably necessary before (4) can be done: https://github.com/druid-io/druid/pull/3823. It adds some more types of virtual columns and some other niceties.

Gian

Gian Merlino

Jan 7, 2017, 8:00:32 PM
to druid-de...@googlegroups.com
Maybe it makes sense to keep "expression" on aggregators? I think getting it out of the column selectors was the bigger issue as far as separation of concerns, and that happened in #3815. Being able to do inline expressions is kind of nice.

Gian

jon...@imply.io

Jan 17, 2017, 8:18:41 PM
to Druid Development
I don't see a big problem with keeping expressions on aggregators, particularly if there are benefits to doing so (e.g. inline expressions).

It might be convenient for users to allow that. For example, suppose a user defines a set of aggregators, each with its own expression, where part of every expression is common to all of them (say, each aggregator takes the square root of some column's values and then applies a further expression to the result before aggregating).

The shared square root transformation could be expressed as a 'base' virtual column, while the transformations specific to each aggregator could live at the aggregator level.

With that option available, some users might see "expressions in the aggregator definition" as a nicer model than defining a separate virtual column for each aggregator and pointing the aggregator at the corresponding VC, since in the latter case the link between the expression and the aggregator is less 'inherent' than when the aggregator-specific expression is defined on the aggregator itself.
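
As a toy illustration of that layering (not Druid code): the shared square root plays the role of the 'base' virtual column, and each aggregator carries its own expression. SumAggregator here is a hypothetical class written for this sketch, with DoubleUnaryOperator standing in for a parsed expression.

```java
import java.util.function.DoubleUnaryOperator;

// Hypothetical sum aggregator that applies its own expression to each input first.
final class SumAggregator {
  private final DoubleUnaryOperator expression;
  private double sum;

  SumAggregator(DoubleUnaryOperator expression) {
    this.expression = expression;
  }

  void aggregate(double input) {
    sum += expression.applyAsDouble(input);
  }

  double get() {
    return sum;
  }
}

final class SharedBaseDemo {
  public static void main(String[] args) {
    final double[] column = {4.0, 9.0, 16.0};

    // Shared transformation, standing in for a 'base' virtual column.
    final DoubleUnaryOperator baseVirtualColumn = Math::sqrt;

    // Aggregator-specific expressions applied on top of the shared value.
    final SumAggregator plusOne = new SumAggregator(v -> v + 1.0);
    final SumAggregator doubled = new SumAggregator(v -> v * 2.0);

    for (double raw : column) {
      // Computed once per row, then reused by both aggregators.
      final double shared = baseVirtualColumn.applyAsDouble(raw);
      plusOne.aggregate(shared);
      doubled.aggregate(shared);
    }

    System.out.println("sum(sqrt(x) + 1) = " + plusOne.get());
    System.out.println("sum(sqrt(x) * 2) = " + doubled.get());
  }
}
```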

-----
From another angle, I lean towards "aggregate on a virtual column that applies the expression transformation" over "have the aggregators apply the expressions".

I see expressions on aggregators as similar to extractionFn within filters. In both cases the expression/extractionFn is being used mainly for transforming the input values, and I don't view the transformation as part of the "core" operation of the filter/aggregation.

For example, the "core" of a BoundFilter is the bound definition, excluding transformations applied to inputs, and the "core" of a sum aggregator is the addition operation.

I see ExpressionFilter as a different case from expression on aggregators and extractionFn on filters; while the expressions can be used to transform input values, expressions can also have boolean values so they fit naturally as the "core" operation of a filter.

Maurizio Sambati

Oct 30, 2017, 10:52:15 AM
to Druid Development
What happened to Virtual Columns?
I've seen that some pull requests have already been merged, but there are still no docs. Will they be supported in the near future?

M

Gian Merlino

Oct 30, 2017, 10:17:21 PM
to druid-de...@googlegroups.com
Hi Maurizio,

They are alive and well in the code base and you can use them today. However, not all of the docs are written yet and not all of the APIs are finalized. Right now they are mostly used under the hood to power SQL expressions in Druid SQL. In Druid 0.11 (release candidate at http://druid.io/downloads.html), the most basic way to use them out of the box is through math expressions. There are some docs at http://druid.io/docs/0.11.0-rc1/misc/math-expr, and you can access expressions through the (not yet documented) "expression" virtual column. It looks like { "type" : "expression", "name" : "foo", "expression" : "x + y + something" }.

The system is going to be finalized and documented in a future release, probably 0.12 I would guess.

Gian
