Substrait Text Serialization Language
The Substrait Plan text serialization language is a direct analog of the binary serialization already defined for Substrait. Many of the concepts are the same, with features added to reduce the duplication that would otherwise prevent the format from being easily consumed by humans.
The text serialization of a Substrait plan consists of the following parts:
- Function and extension declarations
- Schema definitions
- Relations
- The connectivity between the relations, provided as a set of interconnected pipelines
```
include "defaults.splan"

pipelines {
  read // Could be read -> later_step -> final_step.
}

relation read {
  from initial_source using schema input_schema;

  filter (l_shipdate >= 8766 AND l_shipdate < 9131 AND
          l_discount BETWEEN 0.06 - 0.01 AND 0.06 + 0.01 AND
          l_quantity < 24);
  return sum(l_extendedprice * l_discount) AS revenue
}

schema input_schema {
  l_quantity opt_fp64;
  l_extendedprice opt_fp64;
  l_discount opt_fp64;
  l_shipdate opt_fp64;
}

source fileset initial_source {
  uri_path "/mock_lineitem.orc"
  length 3719
}

extension_space boolean.yaml {
  function lte:fp64_fp64 as lte;
  function multiply:opt_fp64_fp64 as multiply;
}
```
- Line 1: Loads in another splan file. This is intended to make it easy to bring in user extensions.
- Line 3: The pipelines section describes the interconnectivity between relations. We only have one relation in this file, so there's not much of a pipeline; a more sophisticated plan would consist of multiple pipelines.
- Line 4: At the end of this line we have a single-line comment.
- Line 7: Defines the only relation we have in this example; read is the identifier for this relation.
- Line 10: The filter selects the input as it is being read.
- Line 16: Defines a simple schema with four nullable doubles.
- Line 23: Defines a fileset source. Fileset is just one of the kinds of sources that can be used here.
- Line 28: Defines an extension_space with the given extension URI.
- Line 29: Defines the less-than-or-equal-to function for doubles and gives it an easy name for reuse. This overrides the definition in defaults.splan.
There is no fixed order to the sections in the file. Since the main purpose of the format is to make it easy to see how a plan is laid out, we start with the pipelines section, which lists the interconnectivity between the relations.
Often a plan is broken up into linear pipelines, so the language optimizes for that case and accepts that the secondary goal, joining the pipelines together, is a bit more cumbersome. Here's a potential design that uses two constructs: a pipeline notation (A -> B -> C) and a join operation ((D + E) -> J):
```
((scan lineitem -> filter) + (scan orders -> project) -> join + (scan customers -> aggregate)) -> join again -> filter again -> write to disk
```
However, this all-on-one-line approach isn't easy to read (or code review). So the approach taken now is to break the plan into those pipelines (similar to how a flowchart has connectors):
```
scan lineitem -> filter -> join
scan orders -> project -> join
join -> join again
scan customers -> aggregate -> join again
join again -> filter again -> write to disk
```
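The broken-up pipeline notation can be modeled as a simple graph from which each relation's inputs are recovered. Here is an illustrative Python sketch (names are hypothetical, with spaces replaced by underscores to form identifiers; this is not actual Substrait tooling):

```python
# Sketch: recover each relation's ordered inputs from the pipeline lines.
from collections import defaultdict

def parse_pipelines(lines):
    """Each line is 'a -> b -> c'; every arrow adds one edge."""
    inputs = defaultdict(list)  # relation -> ordered list of input relations
    for line in lines:
        steps = [s.strip() for s in line.split("->")]
        for src, dst in zip(steps, steps[1:]):
            inputs[dst].append(src)
    return dict(inputs)

pipelines = [
    "scan_lineitem -> filter -> join",
    "scan_orders -> project -> join",
    "join -> join_again",
    "scan_customers -> aggregate -> join_again",
    "join_again -> filter_again -> write_to_disk",
]
inputs = parse_pipelines(pipelines)
print(inputs["join"])        # ['filter', 'project']
print(inputs["join_again"])  # ['join', 'aggregate']
```

Joins fall out naturally: a relation named as the sink of two pipelines simply ends up with two inputs, in the order the pipelines were declared.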
The function signature could be changed to look more like a C++ function, but the existing format (colon and underscore separators) matches the layout hidden in the binary format's naming and may be more familiar as a result. Replacing the underscore with whitespace may be more convenient.
The function anchor is not included in the text format. The conversion to the binary format can pick any ordering for the assignment of the function anchors.
We need to be clear about how optional arguments and types are defined in the function signature (aligning with the binary implementation). It's not clear how much cleaner a type definition should be. Also, do we keep optional as opt_, or could we use a question mark as shorthand, as used elsewhere?
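As a sketch of how a text-to-binary converter might handle these signatures, here is an illustrative Python fragment. The parsing rules and the sequential anchor assignment are assumptions for illustration, not part of any spec:

```python
# Sketch: split a compound signature like "lte:fp64_fp64" into name and
# argument types, and assign function anchors during text-to-binary
# conversion. As noted above, any anchor ordering is acceptable.
def parse_signature(sig):
    name, _, args = sig.partition(":")
    tokens = args.split("_") if args else []
    types, i = [], 0
    while i < len(tokens):
        if tokens[i] == "opt":               # "opt_fp64" marks a nullable type
            types.append("opt_" + tokens[i + 1])
            i += 2
        else:
            types.append(tokens[i])
            i += 1
    return name, types

def assign_anchors(signatures):
    # Any deterministic ordering works; sequential assignment is simplest.
    return {sig: anchor for anchor, sig in enumerate(signatures, start=1)}

name, types = parse_signature("multiply:opt_fp64_fp64")
anchors = assign_anchors(["lte:fp64_fp64", "multiply:opt_fp64_fp64"])
```

Note that the `opt_` prefix forces the two-token lookahead above; a question-mark shorthand (e.g. `fp64?`) would keep each type a single token.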
Schemas may be duplicated throughout the plan, so having a named reference helps simplify the text version. Here again, optional types are treated in a similar manner to the binary representation but could be altered to be more user-friendly.
There's a lot of information stuffed into a relation, including the input types and the internal computation. Some of the schema information could be skipped, since it appears that only source relations need to pull data off of disk; all subsequent relations should know what format the data was made available in during the previous step. However, a relation may not know which step provided which chunk of data, so the previous relation will have to be referenced here (in addition to being referenced in the pipelines section above).
The least polished portion of the current design is that expressions are used instead of the lower-level notions. A simple expression such as l_shipdate >= 8766 AND l_shipdate < 9131 expands quickly into something less readable such as AND(GTE(l_shipdate, 8766), LT(l_shipdate, 9131)), and that's only a fraction of the filter statement in the example above. An alternative would be to break the expression up into smaller pieces using variable declarations and then combine them:
```
v1 = AND(GTE(l_shipdate, 8766), LT(l_shipdate, 9131));
min = SUB(0.06, 0.01);
max = ADD(0.06, 0.01);
v2 = BETWEEN(l_discount, min, max);
v3 = LT(l_quantity, 24);
filter AND(v1, v2, v3);
```
A compromise could be to support simple operations using traditional operators and perform the conversion during the translation to the binary format. Ultimately, an optimizer is going to need to get its hands on the lower-level implementation in order to make optimizations.
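The proposed compromise could look something like the following illustrative Python sketch, which rewrites simple infix comparisons into the call form during translation. The operator set and output names are assumptions for illustration:

```python
# Sketch: rewrite simple infix comparisons into lower-level call form
# during translation to the binary format.
OPS = {">=": "GTE", "<=": "LTE", "<": "LT", ">": "GT", "=": "EQ"}

def to_call_form(expr):
    """Rewrite one binary comparison, e.g. 'a >= b' -> 'GTE(a, b)'."""
    for op in OPS:  # two-character operators listed first so '>=' wins over '>'
        if op in expr:
            lhs, rhs = expr.split(op, 1)
            return f"{OPS[op]}({lhs.strip()}, {rhs.strip()})"
    return expr.strip()

def expand_conjunction(exprs):
    # AND(...) over the rewritten comparisons.
    return "AND(" + ", ".join(to_call_form(e) for e in exprs) + ")"

print(expand_conjunction(["l_shipdate >= 8766", "l_shipdate < 9131"]))
# AND(GTE(l_shipdate, 8766), LT(l_shipdate, 9131))
```

A real implementation would of course parse into an expression tree rather than manipulate strings, but the shape of the conversion is the same.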
Semicolons as statement terminators could be eliminated, at the cost of some other language behaviors, if removing them is desirable.
For now, references to defined constructs are the only form of repetition elimination that the file format provides. If the need arises, we could define macros to eliminate any boilerplate we find inconvenient.
The only kind of comment this design has is the single-line comment; if there is a desire for multi-line comments, that can easily be accommodated.
worry about here. However, you are missing the URIs and I think that
is necessary.
> Need to be clear about how optional and types are defined in the function signature
I think the whole concept of "optional arguments" has gone away. Is
there something we need to clean up in the spec?
However, functions can have an arbitrary string:string dictionary of
"options" and I don't see how to specify those (e.g. specifying
overflow behavior for an add call).
I do not believe there is any such thing as an optional schema type.
## Relations
> However the relation may not know which step provided which chunk of data so the previous relation will have to be referenced here (in addition to being referenced in the pipeline section above).
It's not clear to me why the previous relation needs to be referenced.
If someone wants to know the input schema they could reconstruct the
pipeline to figure it out, correct?
## Filters (although I think you mean "expressions"?)
I'm ok, I think, with multi-line expressions. However, it should be
clearer where an expression begins and ends. In your example it was
not obvious until a more detailed examination that the "return ..."
statement was a part of the filter expression and not a separate
property of the read.
Also, I'd be curious how you envision something like this working in a
project relation which has a collection of expressions?
## Line Separation
Your example is somewhat inconsistent where lines end in semicolons
and where they don't. Whether we use semicolons or not it should be
very obvious and I don't want to have to do any mental gymnastics to
figure out when I need to put a semicolon.
### Arbitrary relation properties
I think my main concern right now is that I don't see how to specify
relation properties. In my mind every relation has "inputs" (which
you've covered), an output schema (which is implicit) and
"properties". For example, you have specified the filter property for
the read rel but I don't see where you specify the read type (named
table, local files, virtual table). In addition, I don't see a
"projection" property.
Furthermore, I think the answer for "properties" will have to be very
generic. While we can add all the various standard relations to the
grammar that wouldn't help us with extension relations (e.g.
currently, in Acero, we have an extension relation for asof join). So
I think "properties" will need to be restricted to some kind of
arbitrary YAML / JSON that is in the relation with the caveat that we
have specialized representations for some things like expressions and
schemas.
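A rough sketch of what such a generic property bag could look like, modeled here with Python dictionaries standing in for the JSON/YAML (every field name below is invented for illustration, not proposed syntax):

```python
# Sketch: each relation carries typed "inputs", specialized fields such as
# expressions, plus an arbitrary JSON-style "properties" bag that works
# the same way for standard and extension relations.
import json

read_rel = {
    "type": "read",
    "inputs": [],                        # a read has no upstream relations
    "properties": {                      # generic, extension-friendly bag
        "read_type": "local_files",
        "uri_path": "/mock_lineitem.orc",
    },
    "filter": "LT(l_quantity, 24)",      # specialized representation
}

# An extension relation (e.g. an asof join) reuses the same shape, with
# its extension-specific knobs living entirely in the properties bag.
asof_join = {
    "type": "extension:asof_join",
    "inputs": ["left_side", "right_side"],
    "properties": {"tolerance": 1000},
}

encoded = json.dumps(read_rel)  # the bag round-trips as plain JSON
```

The point of the sketch is only that the generic bag and the specialized fields can coexist in one relation record without the grammar having to enumerate every property of every relation type.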
### Literals
Your example includes a number of literals (8766, 0.6), but how am I to
know what the types of these literals are? I think, for compactness, it
would be ok to have a "default type" (int32 for 8766 and float64 for
0.6), but it would be good if there were a way to specify the type when
a more specific type is desired (e.g. 8766LL for an int64 literal or
0.6f for a float32 literal). Also, whatever mechanism is devised here
will need to support the concept of user-defined types. In Arrow we
have List<int32> ([1, 1, 1]) and LargeList<int32> ([1, 1, 1]!largelist).
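One possible shape for such a literal rule, sketched in Python. The suffixes and default types here are assumptions based on the examples above, not settled syntax, and user-defined types such as !largelist would need an extensible suffix registry rather than this hard-coded chain:

```python
# Sketch: resolve a literal's type from an optional suffix, falling back
# to a default type when no suffix is present.
def parse_literal(text):
    if text.endswith("LL"):             # explicit 64-bit integer, e.g. 8766LL
        return int(text[:-2]), "i64"
    if text.endswith("f"):              # explicit 32-bit float, e.g. 0.6f
        return float(text[:-1]), "fp32"
    if "." in text:
        return float(text), "fp64"      # default floating-point type
    return int(text), "i32"             # default integer type

print(parse_literal("8766"))    # (8766, 'i32')
print(parse_literal("8766LL"))  # (8766, 'i64')
print(parse_literal("0.6f"))    # (0.6, 'fp32')
```

A registry keyed by suffix (with `!name` entries for user-defined types) would let extensions add literal forms without touching the core grammar.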