substrait support as expression only protocol

jelly ma

Jun 2, 2022, 9:46:33 PM

to substrait

Hi guys,

We have some new requirements for Substrait when using it in an expression-only scenario, for example evaluating “a + b” or “a > 2”. Per Jacques' advice, I would like to have a discussion here about how to address it.

Based on the existing Substrait definitions, I think a message containing the following fields will work.

message ExpressionDescriptor {
  oneof expr_type {
    Expression expr = 1;
    AggregateFunction measure = 2;
  }
  NamedStruct base_schema = 3;
  repeated substrait.extensions.SimpleExtensionDeclaration extensions = 4;
  // optional output names
  repeated string names = 5;
}

Any comments here?

Thanks,

Yan

Jacques Nadeau

Jun 2, 2022, 10:38:39 PM

to substrait

Can you expound more on the specific use case? Like, what kind of system are you focused on?

jelly ma

Jun 6, 2022, 9:17:20 PM

to substrait

We are working on computation evaluation based on expressions like "a+b", instead of a plan with a scan node and a project node, like "select a+b from test". Similar to Arrow Gandiva, front-end engines can leverage Substrait as a protocol to offload related computation onto an expression-evaluation library.

In this scenario, we take the expression (which carries the computation semantics and the input/output schemas) and the data as inputs, and produce the computed result as the output data. We don't want to translate this simple expression into a heavy Substrait plan, which may include many required messages that the target computation doesn't need. That's why we propose a dedicated expression descriptor that contains only the information necessary for the computation.
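
To make the expression-only flow concrete, here is a minimal sketch of evaluating an expression tree directly against columnar input, with no scan or project node involved. Everything in it (the `Expr` struct, `fieldRef`, `call`, `eval`, and the int64-only columns) is a hypothetical simplification for illustration; it is not the Substrait protos and not the evaluator discussed in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified expression node (assumption: not a Substrait type).
struct Expr {
  enum class Kind { FieldRef, Call };
  Kind kind = Kind::FieldRef;
  std::size_t field = 0;                    // used when kind == FieldRef
  std::string func;                         // used when kind == Call
  std::vector<std::shared_ptr<Expr>> args;  // used when kind == Call
};

using Column = std::vector<std::int64_t>;
using Batch = std::vector<Column>;  // one column per schema field

// Reference an input column by its ordinal position in the schema.
inline std::shared_ptr<Expr> fieldRef(std::size_t ordinal) {
  auto e = std::make_shared<Expr>();
  e->kind = Expr::Kind::FieldRef;
  e->field = ordinal;
  return e;
}

// Build a binary function call node such as "add" or "multiply".
inline std::shared_ptr<Expr> call(std::string func,
                                  std::shared_ptr<Expr> lhs,
                                  std::shared_ptr<Expr> rhs) {
  auto e = std::make_shared<Expr>();
  e->kind = Expr::Kind::Call;
  e->func = std::move(func);
  e->args = {std::move(lhs), std::move(rhs)};
  return e;
}

// Evaluate the tree row by row; only "add" and "multiply" are supported here.
inline Column eval(const Expr& e, const Batch& in) {
  if (e.kind == Expr::Kind::FieldRef) {
    return in.at(e.field);
  }
  Column lhs = eval(*e.args.at(0), in);
  Column rhs = eval(*e.args.at(1), in);
  assert(lhs.size() == rhs.size());
  Column out(lhs.size());
  for (std::size_t i = 0; i < lhs.size(); ++i) {
    if (e.func == "add") {
      out[i] = lhs[i] + rhs[i];
    } else if (e.func == "multiply") {
      out[i] = lhs[i] * rhs[i];
    } else {
      throw std::invalid_argument("unsupported function: " + e.func);
    }
  }
  return out;
}
```

With a two-column batch, `eval(*call("add", call("multiply", fieldRef(0), fieldRef(1)), fieldRef(1)), batch)` computes a * b + b per row. The inputs are just the expression, the column layout, and the data, which is the information the proposed descriptor is meant to carry.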

Jacques Nadeau

Jun 6, 2022, 9:21:41 PM

to Substrait

I understand the scenario. I'd suggest using the full plan representation today, and then we can look at adding a smaller version in the future. For example, Gandiva already has the concept of a project and a filter, same as Substrait. Using those feels more appropriate than defining something smaller. It doesn't feel like a huge burden (either way you have to express the input schema; one approach just uses a named-table ReadRel instead of a new kind of object). This also means we can use the same tools, like validators, transformers, etc. Any burden of construction seems like it could be handled by providing a utility method that gets you what you want.
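
As a rough illustration of the utility-method idea, the sketch below wraps bare expressions in a read-then-project plan so callers never build the plan by hand. The structs (`Field`, `Schema`, `Expr`, `ReadRel`, `ProjectRel`, `Plan`) and the function `wrapInPlan` are hypothetical stand-ins chosen for this example; they are not the generated Substrait protobuf classes, and the default table name "input" is an arbitrary assumption.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the Substrait messages (assumption, not real protos).
struct Field {
  std::string name;
  std::string type;
};
struct Schema {
  std::vector<Field> fields;
};
struct Expr {
  std::string text;  // placeholder for a real expression tree
};
struct ReadRel {
  std::string table_name;
  Schema base_schema;
};
struct ProjectRel {
  ReadRel input;
  std::vector<Expr> expressions;
};
struct Plan {
  ProjectRel root;
  std::vector<std::string> output_names;
};

// Wrap bare expressions in Read -> Project so that plan-level tooling
// (validators, transformers) can still be applied to them.
inline Plan wrapInPlan(Schema schema,
                       std::vector<Expr> exprs,
                       std::vector<std::string> names,
                       std::string table = "input") {
  // Output names are optional, but if given there must be one per expression.
  assert(names.empty() || names.size() == exprs.size());
  Plan plan;
  plan.root.input.table_name = std::move(table);
  plan.root.input.base_schema = std::move(schema);
  plan.root.expressions = std::move(exprs);
  plan.output_names = std::move(names);
  return plan;
}
```

The caller supplies the same three things in either design (schema, expressions, optional names); the helper only decides where they are stored.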

I'm not strongly against this but am also cautious of premature optimization/specialization. Would love to hear more people's thoughts.

--
You received this message because you are subscribed to a topic in the Google Groups "substrait" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/substrait/st9dDXYIYWU/unsubscribe.
To unsubscribe from this group and all its topics, send an email to substrait+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/substrait/191685d0-e88f-4400-ad61-a98d34b3e523n%40googlegroups.com.

jelly ma

Jun 7, 2022, 1:18:22 AM

to substrait

Yes, agreed. It's okay to wrap the expression in a complete plan if the backend can take a plan as its compute target. Otherwise, extra and unnecessary handling may be needed when the framework works more directly at the expression level, like ExpressionAction in ClickHouse. For now, we will use the existing protos and see what's necessary for expression representation in this scenario.

jelly ma

Jun 16, 2022, 2:01:49 AM

to substrait

Hi Jacques,

For the expression-only scenario, we have designed a set of APIs for building Substrait expressions, either from scratch or from a parsed expression representation in another frontend framework. They mainly include:

class SubstraitExprBuilder {
 public:
  static ::substrait::Type* makeType(bool isNullable);
  static ::substrait::NamedStruct* makeNamedStruct(SubstraitExprBuilder* builder,
                                                   std::vector<std::string> names,
                                                   std::vector<::substrait::Type*> types);
  // Reference a field by its ordinal position in the schema.
  static ::substrait::Expression* makeFieldReference(size_t field);
  // Reference a field by name and type; the builder records it in its schema.
  static ::substrait::Expression* makeFieldReference(SubstraitExprBuilder* builder,
                                                     std::string name,
                                                     ::substrait::Type* type);
  static ::substrait::extensions::SimpleExtensionDeclaration_ExtensionFunction* makeFunc(
      SubstraitExprBuilder* builder,
      std::string func_name,
      std::vector<::substrait::Expression*> args,
      ::substrait::Type* output_type);
  static ::substrait::Expression* makeExpr(SubstraitExprBuilder* builder,
                                           std::string func_name,
                                           std::vector<::substrait::Expression*> args,
                                           ::substrait::Type* output_type);
  ::substrait::NamedStruct* schema();
  std::vector<::substrait::extensions::SimpleExtensionDeclaration_ExtensionFunction*>
  funcsInfo() {
    return funcs_info_;
  }

 private:
  size_t func_anchor_;
  std::vector<std::string> names_;
  std::vector<::substrait::Type*> types_;
  ::substrait::NamedStruct* schema_;
  std::vector<::substrait::extensions::SimpleExtensionDeclaration_ExtensionFunction*>
      funcs_info_;
};

In short, this builder provides interfaces to make substrait::Type, substrait::Expression, etc., and maintains a context from which the final NamedStruct and function info can be built. Typical usage looks like this:

// example 1
SubstraitExprBuilder* builder = new SubstraitExprBuilder();
::substrait::NamedStruct* schema = SubstraitExprBuilder::makeNamedStruct(
    builder,
    {"a", "b"},
    {CREATE_SUBSTRAIT_TYPE_FULL(I64, false), CREATE_SUBSTRAIT_TYPE_FULL(I64, false)});
::substrait::Expression* field0 = SubstraitExprBuilder::makeFieldReference(0);
::substrait::Expression* field1 = SubstraitExprBuilder::makeFieldReference(1);
::substrait::Expression* multiply_expr = SubstraitExprBuilder::makeExpr(
    builder, "multiply", {field0, field1}, CREATE_SUBSTRAIT_TYPE_FULL(I64, false));
::substrait::Expression* add_expr = SubstraitExprBuilder::makeExpr(
    builder, "add", {multiply_expr, field1}, CREATE_SUBSTRAIT_TYPE_FULL(I64, false));

// once the necessary Substrait expression info is built, we pass it to our evaluator and run the evaluation

CiderExprEvaluator evaluator({add_expr}, schema, {func_info}, ExprType::ProjectExpr);
auto result = evaluator.eval(in_data);

// example 2
SubstraitExprBuilder* inc_builder = new SubstraitExprBuilder();
::substrait::Expression* field2 = SubstraitExprBuilder::makeFieldReference(
    inc_builder, "a", CREATE_SUBSTRAIT_TYPE_FULL(I64, false));
::substrait::Expression* field3 = SubstraitExprBuilder::makeFieldReference(
    inc_builder, "b", CREATE_SUBSTRAIT_TYPE_FULL(I64, false));
::substrait::NamedStruct* inc_schema = inc_builder->schema();
::substrait::Expression* multiply_expr1 = SubstraitExprBuilder::makeExpr(
    inc_builder, "multiply", {field2, field3}, CREATE_SUBSTRAIT_TYPE_FULL(I64, false));
::substrait::Expression* add_expr1 = SubstraitExprBuilder::makeExpr(
    inc_builder, "add", {multiply_expr1, field3}, CREATE_SUBSTRAIT_TYPE_FULL(I64, false));

// once the necessary Substrait expression info is built, we pass it to our evaluator and run the evaluation

CiderExprEvaluator evaluator({add_expr1}, inc_schema, {func_info}, ExprType::ProjectExpr);
auto result = evaluator.eval(in_data);

Could you help review this? I would appreciate your comments.

Thanks,

Yan

Jacques Nadeau

Jun 16, 2022, 8:40:15 PM

to Substrait

This feels like a reasonable API to abstract the plan building from the expression building. The main challenge I see is that your makeFieldReference function that takes data types/names feels like it may be confusing and error-prone for users, since there is an implicit relationship between a field's ordinal position and the order in which the function is called. You may want to come up with an alternative way to express that.
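
One possible way to remove the dependence on calling order, sketched here under assumptions, is to fix the schema first and resolve names to ordinals by lookup. The class `SchemaIndex` and its method `ordinalOf` are hypothetical names invented for this example; they are not part of the builder proposed in this thread or of any Substrait library. The sketch requires C++17 for `std::optional`.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: the schema is declared once, up front, and field
// references are resolved against it instead of being appended on each call.
class SchemaIndex {
 public:
  explicit SchemaIndex(std::vector<std::string> names) : names_(std::move(names)) {
    for (std::size_t i = 0; i < names_.size(); ++i) {
      // emplace keeps the first ordinal if a name is duplicated.
      ordinals_.emplace(names_[i], i);
    }
  }

  // Returns the ordinal of `name`, or std::nullopt if the schema has no such field.
  std::optional<std::size_t> ordinalOf(const std::string& name) const {
    auto it = ordinals_.find(name);
    if (it == ordinals_.end()) {
      return std::nullopt;
    }
    return it->second;
  }

  std::size_t size() const { return names_.size(); }

 private:
  std::vector<std::string> names_;
  std::unordered_map<std::string, std::size_t> ordinals_;
};
```

With this shape, looking up "b" before "a" still yields ordinals 1 and 0, and an unknown name surfaces as an empty result rather than silently extending the schema.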

jelly ma

Dec 5, 2022, 9:07:47 PM

to substrait

Yes, you are correct. In these APIs, the schema needs to be defined first, and makeFieldReference needs to refer to that definition to determine the ordinal index. We can improve this part. Here I would like to raise an open question about the current Substrait design. It mainly targets plan representation and is not well suited to describing expression trees. But in recent communication with customers, expression-level offloading is the first thing they are interested in when lowering computation from a front-end engine. I also see requests for expression tree building in other languages, like Python, as discussed here. Has the community considered this? We may need to make some minor changes in substrait.proto and provide APIs like the above to facilitate the building process. What do you think?

Thanks,

Yan

Jacques Nadeau

Dec 6, 2022, 3:39:06 PM

to subs...@googlegroups.com

Propose a PR with markdown (along with proto as necessary) and we'll take a look!

jelly ma

Dec 7, 2022, 10:07:08 AM

to substrait

Got it! I will submit a PR in substrait to address the changes needed at the proto level, and if that works, I should propose the expression tree builder API in substrait_cpp, right?

Jacques Nadeau

Dec 7, 2022, 11:52:27 PM

to subs...@googlegroups.com

Yes, let's start with proto/docs and make sure the community is supportive of the modifications.

jelly ma

Dec 12, 2022, 2:18:32 AM

to substrait

Hi Jacques,

https://github.com/substrait-io/substrait/pull/405 has been submitted for further discussion; it covers our user scenario. Please help comment on it.