Function signature for AggregationRel

chaojun zhang

unread,

Sep 20, 2022, 11:33:29 PM9/20/22

to substrait

Does the function signature of AggregationRel needs to be distinguished according by the aggregation phase?

For example, the actual argument types for partial 'avg' can be a primitive integral type like 'avg(i32)' , while the actual argument types for final 'avg' must always be a struct type 'struct(double,bigint) ' , so do I need to create two extension functions used in a plan with different extension function name , one for 'avg:opt_i32' and another for 'avg:opt_struct' , or they just share the same extension function with same name 'avg:opt_i32'?

Jacques Nadeau

unread,

Sep 20, 2022, 11:39:13 PM9/20/22

to Substrait

They share the same name. The aggregate function invocation clarifies the phase.

On Tue, Sep 20, 2022, 8:33 PM chaojun zhang <zcj2...@gmail.com> wrote:

Does the function signature of AggregationRel needs to be distinguished according by the aggregation phase?

For example, the actual argument types for partial 'avg' can be a primitive integral type like 'avg(i32)' , while the actual argument types for final 'avg' must always be a struct type 'struct(double,bigint) ' , so do I need to create two extension functions used in a plan with different extension function name , one for 'avg:opt_i32' and another for 'avg:opt_struct' , or they just share the same extension function with same name 'avg:opt_i32'?

--
You received this message because you are subscribed to the Google Groups "substrait" group.
To unsubscribe from this group and stop receiving emails from it, send an email to substrait+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/substrait/24ff3295-1f64-4da9-8ac5-aee703a71500n%40googlegroups.com.

Jacques Nadeau

unread,

Sep 21, 2022, 12:04:22 AM9/21/22

to Substrait

Oh, actually you're right, this is an oversight in the design because we rely on args to deconflict variant. We need a solution. Can you propose one?

chaojun zhang

unread,

Sep 21, 2022, 12:12:02 AM9/21/22

to substrait

Yes, If only have intermediate types, we cannot resolve aggregate function except the function call carry the raw input type?

Jeroen van Straten

unread,

Sep 21, 2022, 2:52:03 AM9/21/22

to Substrait

The way I imagined this would work is that you would always refer to the function by its "regular" compound name, i.e. `avg:opt_i32`, but in bindings that use the intermediate type as the input replace the first non-constant value argument with the intermediate type and not specify any remaining non-constant value arguments. This is weird if you spell it out like that but not broken, I think... at least not unless the types of the input arguments are needed for deriving the return type, I didn't consider that until now.

What about replacing the value arguments with type arguments and adding an extra argument at the end for the intermediate value?

To view this discussion on the web visit https://groups.google.com/d/msgid/substrait/8267c562-2315-496a-bd72-d8abb3f1877dn%40googlegroups.com.

jun

unread,

Sep 21, 2022, 6:49:37 AM9/21/22

to subs...@googlegroups.com

the key is how you can refer to the regular compond name like 'avg:opt_i32' if only have intermediate types？ my thinking is if the function call can carry about the original raw input types, in this case we can attempt match it to a substrait function varaint?

Jeroen van Straten <jeroen.va...@gmail.com> 于 2022年9月21日周三 14:52写道：

To view this discussion on the web visit https://groups.google.com/d/msgid/substrait/CAEPx700RJ-C432fW2w9_QeTPGo-_G6bmUD%3DBnxFTDu4V4Wt3DQ%40mail.gmail.com.

Jacques Nadeau

unread,

Sep 25, 2022, 10:03:08 PM9/25/22

to Substrait

First, I'm returning to my initial thought: I don't think there is a problem here with referencing. Imagine we initially have a function invocation using this variant [1]. The function invocation will have the function name key: avg_opt_i32 and the phase of INITIAL_TO_RESULT. Let's now assume we decompose this into a two phase aggregate. The first phase function invocation will be avg_opt_i32/INITIAL_TO_INTERMEDIATE and the second phase will be avg_opt_i32/INTERMEDIATE_TO_RESULT. The reference doesn't change.

I do think there is a problem here of carrying forward options and how that occurs. The original intention was that the the intermediate type effectively declares the signature of the input argument for the intermediate function call. However, that doesn't work for options. It's also weird in the context of constant value arguments. Jeroen's proposal is really strange since arguments could oscillate between options, variable value arguments and constant value arguments. Removing some of these and carrying others feels really opaque/confusing. I suggest that we come up with a more formal definition/approach. I'm not sure what's best. We could declare a intermediate argument signature for each variant that references constant inputs and the intermediate output by name. This would be similar to Jeroen's proposal without as much magic. Do people have other proposals?

[1] https://github.com/substrait-io/substrait/blob/6021030a599134f959ebc0f36621b27127316356/extensions/functions_arithmetic.yaml#L853

To view this discussion on the web visit https://groups.google.com/d/msgid/substrait/CAArwL3PhpCRSAK%2BKY%2BA97k4hNquQWknjQKiMJSCMr91f57ZRbg%40mail.gmail.com.

Jeroen van Straten

unread,

Sep 26, 2022, 8:18:24 AM9/26/22

to Substrait

For the record, I agree that my interpretation of replacing arguments is super weird! It's just the best I could make of the current specification that at least does what I would expect for unary aggregate functions. My (somewhat) more sane proposal would be that for INTERMEDIATE_TO_X invocations the actual argument list is modified as follows:

- enumeration, type, and constant (well, "scalar" would be more accurate, see paragraph at end) value arguments are retained;

- non-constant (vector) value arguments are specified as type arguments instead;

- an additional argument is implicitly appended to the argument list, that must match the derived intermediate type for the above arguments.

The conversion from value to type argument is necessary for functions that use the argument type in their derivation expressions. Here's a very hypothetical example to illustrate:

weird_aggregate_fn(decimal<A,B>) -> decimal<A+6,B-6>

intermediate: decimal<A+3,B-3>

If we were to bind this using just its intermediate type, we would have to support expression evaluation reversal, i.e. you'd have to match a number to A+3, essentially rewriting it to a subtraction. You could do that for additions and subtractions I guess, but not for functions in general. The much, MUCH more straightforward approach is to just require the producer to write the original decimal<A,B> as a type argument. To complete the example with all the effective signatures:

INITIAL_TO_RESULT: weird_aggregate_fn:dec(decimal<A,B>) -> decimal<A+6,B-6>

INITIAL_TO_INTERMEDIATE: weird_aggregate_fn:dec(decimal<A,B>) -> decimal<A+3,B-3>

INTERMEDIATE_TO_INTERMEDIATE: weird_aggregate_fn:dec([type]decimal<A,B>, decimal<A+3,B-3>) -> decimal<A+3,B-3>

INTERMEDIATE_TO_RESULT: weird_aggregate_fn:dec([type]decimal<A,B>, decimal<A+3,B-3>) -> decimal<A+6,B-6>

In all cases we can now match A and B trivially and derive the remaining types per the normal rules.

Alternative proposal: just require that the YAML files specify what the intermediate-input argument list is explicitly. The above function would still be problematic to write that way, though; you'd still need the type argument.

> We could declare a intermediate argument signature for each variant that references constant inputs and the intermediate output by name.

I'm not following what you mean here exactly. Can you give an example?

Semi-related: when I was talking to Phillip the other day we realized there is another problem with aggregate function arguments that might be adding to the confusion here, namely what it means for an aggregate argument to be constant or be a literal. I always interpreted constant aggregate function arguments to be scalar, i.e., the function interprets them as an input to the aggregate as a whole. An example could be a regular expression string that's only compiled once for the entire aggregate, or the initialization value for a fold operation (if we would support higher-order functions). Such arguments would logically need to be repeated for INTERMEDIATE_TO_RESULT invocations (at least the regex, I guess a fold isn't decomposable to begin with so it might not be the best example in this context). Only vector arguments should logically be replaced by the intermediate type.

To view this discussion on the web visit https://groups.google.com/d/msgid/substrait/CAJ9XdSovURSUcPJz%2B-_0crpWT-j7DXxPc%3Dmx6v2f%3DNXF5sr2Ow%40mail.gmail.com.

Reply all

Reply to author

Forward