Substrait Text Serialization Language


David Sisson

Jan 18, 2023, 3:19:04 AM
to subs...@googlegroups.com
What follows is an early definition of a potential language for serializing Substrait plans to text (as a replacement for the JSON protobufs used today).  If there's time in the next Substrait meeting I would love to get some feedback.

Substrait Text Serialization Language

The Substrait Plan text serialization language is a direct analog to the binary serialization already defined for Substrait.  Many of the concepts are the same, with features added to reduce the duplication that prevents the binary format from being easily consumed by humans.

Syntax

The text serialization of a Substrait plan consists of the following parts:

  • Function and extension declaration

  • Schema definitions

  • Relations

  • The connectivity between the relations, provided as a set of interconnected pipelines

Example

include "defaults.splan"

 

pipelines {

  read  // Could be read -> later_step -> final_step.

}

 

relation read {

  from initial_source using schema input_schema;

 

  filter (l_shipdate >= 8766 AND l_shipdate < 9131 AND 

          l_discount BETWEEN 0.06 - 0.01 AND 0.06 + 0.01 AND

          l_quantity < 24);

  return sum(l_extendedprice * l_discount) AS revenue

}

 

schema input_schema {

  l_quantity opt_fp64;

  l_extendedprice opt_fp64;

  l_discount opt_fp64;

  l_shipdate opt_fp64;

}

 

source fileset initial_source {

  uri_path "/mock_lineitem.orc"

  length 3719

}

 

extension_space boolean.yaml {

  function lte:fp64_fp64 as lte;

  function multiply:opt_fp64_fp64 as multiply;

}

Example Explanation


Line 1:  Loads in another splan file.  Intended to make it easy to bring in user extensions.


Line 3:  The pipeline section describes the interconnectivity between relations.  We only have one relation in this file so there’s not much of a pipeline.  A more sophisticated plan would consist of multiple pipelines.


Line 4:  At the end of this line we have a single line comment.


Line 7:  Defines the only relation we have in this example.  read is the identifier for this relation.


Line 10:  filter selects the input as it is being read.  


Line 16:  Defines a simple schema with four nullable doubles.


Line 23:  Defines a fileset source.  Fileset is just one of the kinds of sources that can be used here.


Line 28:  Defines an extension_space with the given extension URI.


Line 29:  Defines the less than or equal to function for doubles and gives it an easy name for reuse.  Overrides the definition in defaults.splan.


Design discussion

File Order

There is no fixed order to the file.  As the main purpose of the file is to make it easy to see how a plan is laid out we start with the pipeline section which lists the interconnectivity between the relations.

Interconnectivity

Often a plan is broken up into linear pipelines.  If the language optimizes for that, the secondary goal of joining the pipelines can become a bit more cumbersome.  Here’s a potential design that uses two constructs: a pipeline notation A -> B -> C and a join operation (D + E) -> J:


((scan lineitem -> filter) + (scan orders -> project) -> join + (scan customers -> aggregate)) -> join again -> filter again -> write to disk


However, this all-on-one-line approach isn’t easy to read (or code review).  So the approach taken now is to break the plan into those pipelines (similar to how a flowchart uses connectors):


scan lineitem -> filter -> join

scan orders -> project -> join

join -> join again

scan customers -> aggregate -> join again

join again -> filter again -> write to disk
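To make the connector idea concrete, here is a small sketch (Python; the step names are just the ones from the lines above, underscored, and the function name is hypothetical) that folds these connector lines into an adjacency map, with repeated names stitching the pipelines together:

```python
# Sketch: resolve flowchart-style pipeline lines into one adjacency map.
# A name appearing in several lines (e.g. "join") merges those chains.

def parse_pipelines(lines):
    """Each line is a chain like 'a -> b -> c'."""
    edges = {}
    for line in lines:
        steps = [s.strip() for s in line.split("->")]
        for src, dst in zip(steps, steps[1:]):
            edges.setdefault(src, set()).add(dst)
    return edges

pipelines = [
    "scan_lineitem -> filter -> join",
    "scan_orders -> project -> join",
    "join -> join_again",
    "scan_customers -> aggregate -> join_again",
    "join_again -> filter_again -> write_to_disk",
]

graph = parse_pipelines(pipelines)
# "join" has two inputs (filter, project) and one output (join_again).
```

This is only a plausibility check that the connector notation carries enough information to rebuild the full DAG, not a proposed implementation.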

Functions/Extensions

The function signature could be changed to look more like a C++ function, but the existing format (colon and underscore separators) matches the layout hidden in the binary format’s naming and may be more familiar as a result.  Replacing the underscore with whitespace may be more convenient.


The function anchor is not included in the text format.  The conversion to the binary format can pick any ordering for the assignment of the function anchors.


We need to be clear about how optional arguments and types are defined in the function signature (aligning with the binary implementation).  It’s not clear how much cleaner a type definition should be.  Also, do we keep optional as opt_, or could we use a question mark as a shorthand, as used elsewhere?

Schemas

Schemas may be duplicated throughout the plan, so having a named reference would help simplify the text version.  Here again, optional (nullable) types are treated in a similar manner to the binary representation but could be altered to be more user friendly.

Relations

There’s a lot of information stuffed into a relation, including the input types and internal computation.  Some of the schema information could be skipped, as it appears only source relations need to pull data off of disk; all subsequent relations should know what format the data was made available in during the previous step.  However, a relation may not know which step provided which chunk of data, so the previous relation will have to be referenced here (in addition to being referenced in the pipeline section above).

Filters

The least polished portion of the current design is that expressions are used instead of the lower-level notions.  A simple expression such as l_shipdate >= 8766 AND l_shipdate < 9131 expands quickly into something less readable such as AND(GTE(l_shipdate, 8766), LT(l_shipdate, 9131)), and that’s only a fraction of the filter statement in the example above.  An alternative would be to break the expression up into smaller pieces using variable declarations and combine them:


v1 = AND(GTE(l_shipdate, 8766), LT(l_shipdate, 9131));

min = SUB(0.06, 0.01);

max = ADD(0.06, 0.01);

v2 = BETWEEN(l_discount, min, max);

v3 = LT(l_quantity, 24);

filter AND(v1, v2, v3);


A compromise could be to support simple operations using traditional operators and perform the conversion during the translation to the binary format.  Ultimately an optimizer is going to need to get its hands on the lower-level implementation in order to make optimizations.
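As a sketch of the compromise above, the lowering from infix operators to nested function calls can be prototyped in a few lines.  This uses Python's own expression grammar as a stand-in for the splan grammar (the function names GTE, LT, AND follow the example in this section; everything else is illustrative only):

```python
import ast

# Sketch: lower a simple infix expression into the nested call form,
# roughly the translation the compromise above would perform.

OPS = {ast.GtE: "GTE", ast.Gt: "GT", ast.LtE: "LTE", ast.Lt: "LT"}

def lower(node):
    if isinstance(node, ast.BoolOp) and isinstance(node.op, ast.And):
        return "AND(" + ", ".join(lower(v) for v in node.values) + ")"
    if isinstance(node, ast.Compare):  # single comparison only
        op = OPS[type(node.ops[0])]
        return f"{op}({lower(node.left)}, {lower(node.comparators[0])})"
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    raise ValueError(f"unsupported node: {ast.dump(node)}")

expr = ast.parse("l_shipdate >= 8766 and l_shipdate < 9131", mode="eval").body
print(lower(expr))
# AND(GTE(l_shipdate, 8766), LT(l_shipdate, 9131))
```

A real translator would of course work on the splan grammar rather than Python's, but the shape of the rewrite is the same.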

Line Separation

Semicolons as statement terminators could be eliminated, with some other language behaviors, if that is desirable.

Repetition

For now, references to defined constructs are the only form of repetition elimination that the file format holds.  If the need arises we could define macros to eliminate any boilerplate we find inconvenient.

Comments

The only kind of comment this design has is the single-line comment; however, if there is a desire for multi-line comments, that can easily be accommodated.


Weston Pace

Jan 25, 2023, 5:26:31 PM
to subs...@googlegroups.com
I like many of the things you have done here.

## Function declarations

I am happy to do away with the anchor. That's only in the binary
format for compactness and I don't think that is something we should
worry about here. However, you are missing the URIs and I think that
is necessary.

> Need to be clear about how optional and types are defined in the function signature
I think the whole concept of "optional arguments" has gone away. Is
there something we need to cleanup in the spec?

However, functions can have an arbitrary string:string dictionary of
"options" and I don't see how to specify those (e.g. specifying
overflow behavior for an add call).

## Schemas

> Here again optional types are treated in a similar manner to the binary representation but could be altered to be more user friendly.
I do not believe there is any such thing as an optional schema type.

## Relations

> However the relation may not know which step provided which chunk of data so the previous relation will have to be referenced here (in addition to being referenced in the pipeline section above).

It's not clear to me why the previous relation needs to be referenced?
If someone wants to know the input schema they could reconstruct the
pipeline to figure it out correct?

## Filters (although I think you mean "expressions"?)

I'm ok, I think, with multi-line expressions. However, it should be
clearer where an expression begins and ends. In your example it was
not obvious until a more detailed examination that the "return ..."
statement was a part of the filter expression and not a separate
property of the read.

Also, I'd be curious how you envision something like this working in a
project relation which has a collection of expressions?

## Line Separation

Your example is somewhat inconsistent where lines end in semicolons
and where they don't. Whether we use semicolons or not it should be
very obvious and I don't want to have to do any mental gymnastics to
figure out when I need to put a semicolon.

## Comments

Single line comments seems sufficient for now. That feels like
something that can be added later if needed.

## Other

### Arbitrary relation properties

I think my main concern right now is that I don't see how to specify
relation properties. In my mind every relation has "inputs" (which
you've covered), an output schema (which is implicit) and
"properties". For example, you have specified the filter property for
the read rel but I don't see where you specify the read type (named
table, local files, virtual table). In addition, I don't see a
"projection" property.

Furthermore, I think the answer for "properties" will have to be very
generic. While we can add all the various standard relations to the
grammar that wouldn't help us with extension relations (e.g.
currently, in Acero, we have an extension relation for asof join). So
I think "properties" will need to be restricted to some kind of
arbitrary YAML / JSON that is in the relation with the caveat that we
have specialized representations for some things like expressions and
schemas.

### Literals

Your example includes a number of literals (8766, 0.6) but how am I to
know what the types of these literals are? I think, for compactness,
it would be ok to have a "default type" (int32 for 8766 and float64
for 0.6) but it would be good if there were a way to specify the type
when a more specific type is desired (e.g. 8766LL for an int64 literal
or 0.6f for a float32 literal). Also, whatever mechanism is arrived at
here will need to support the concept of user-defined types. In
Arrow we have List<int32> ([1, 1, 1]) and LargeList<int32> ([1, 1,
1]!largelist)?
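One possible shape for the suffix scheme suggested above is sketched below (Python; the `LL`/`f` suffixes follow the C-style convention mentioned in the message, and the type names are illustrative, not part of any Substrait spec):

```python
import re

# Sketch: infer a literal's type from an optional C-style suffix,
# falling back to the "default type" idea above (i32 / fp64).

LITERAL_RE = re.compile(r"^(?P<num>-?\d+(?P<frac>\.\d+)?)(?P<suffix>LL|f)?$")

def literal_type(text):
    m = LITERAL_RE.match(text)
    if not m:
        raise ValueError(f"not a numeric literal: {text!r}")
    if m.group("suffix") == "LL":
        return "i64"
    if m.group("suffix") == "f":
        return "fp32"
    return "fp64" if m.group("frac") else "i32"

assert literal_type("8766") == "i32"    # default integer type
assert literal_type("8766LL") == "i64"
assert literal_type("0.6") == "fp64"    # default floating-point type
assert literal_type("0.6f") == "fp32"
```

User-defined types (the `!largelist` style mentioned above) would need an additional suffix channel that this sketch does not attempt.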

David Sisson

Jan 30, 2023, 1:45:08 PM
to subs...@googlegroups.com

worry about here.  However, you are missing the URIs and I think that
is necessary.

I was trying to be concise in the example so I referenced "boolean.yaml" instead of a full URI.  The intention of the design was for the URI to be listed once as part of the extension_space section.  The test plan I've been using doesn't have any extension URIs provided so I've made specifying the URI optional in the language for now.
 
> Need to be clear about how optional and types are defined in the function signature
I think the whole concept of "optional arguments" has gone away.  Is
there something we need to cleanup in the spec?

However, functions can have an arbitrary string:string dictionary of
"options" and I don't see how to specify those (e.g. specifying
overflow behavior for an add call).

During the implementation of the test binary-to-text plan converter I encountered this as well.  I ended up adding additional capabilities to functions in the way one might add options to a URL.  The options would be tacked on to the function signature with #<option>, with option preferences separated by semicolons.  The phase would be specified by adding @<phase>.  So a function could end up looking like this:

measure sum(multiply(amount, cost))->opt_fp64#potatoes;crunchy=yes;sliced=no@AGGREGATION_PHASE_INITIAL_TO_INTERMEDIATE;
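Splitting that suffix back apart is straightforward.  A sketch (Python; the function name and the returned dictionary shape are hypothetical, but the `#`/`;`/`@` separators are the ones described above):

```python
# Sketch: split a function signature carrying URL-style options (#k=v;...)
# and a phase marker (@...), per the scheme described above.

def parse_signature(text):
    text, _, phase = text.partition("@")
    text, _, opts = text.partition("#")
    options = {}
    for item in filter(None, opts.split(";")):
        key, _, value = item.partition("=")
        options[key] = value if value else True  # bare flag -> True
    return {"signature": text, "options": options, "phase": phase or None}

sig = parse_signature(
    "opt_fp64#potatoes;crunchy=yes;sliced=no"
    "@AGGREGATION_PHASE_INITIAL_TO_INTERMEDIATE"
)
# sig["options"] == {"potatoes": True, "crunchy": "yes", "sliced": "no"}
```

One caveat for the real grammar: `#` and `@` would need to be forbidden (or escaped) inside option values.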


I do not believe there is any such thing as an optional schema type.

Sorry for the confusion.  I am referring to types that are nullable.  In the example above opt_fp64 refers to a nullable float64.  I am not enamored with the look of types with options jammed up against them (such as the opt_ prefix).

 
## Relations

> However the relation may not know which step provided which chunk of data so the previous relation will have to be referenced here (in addition to being referenced in the pipeline section above).

It's not clear to me why the previous relation needs to be referenced?
 If someone wants to know the input schema they could reconstruct the
pipeline to figure it out correct?

I may be overthinking the problem.  I was concerned what would happen when a join takes place.  I suppose the expression itself will choose which specific field it requires from each of its inputs so there shouldn't be an issue.

 

## Filters (although I think you mean "expressions"?)

I'm ok, I think, with multi-line expressions.  However, it should be
clearer where an expression begins and ends.  In your example it was
not obvious until a more detailed examination that the "return ..."
statement was a part of the filter expression and not a separate
property of the read.

Agreed, the return section wasn't well thought out.  I believe the current implementation (just one long expression) is sufficient. 


 
Also, I'd be curious how you envision something like this working in a
project relation which has a collection of expressions?

What I ended up implementing this as was a set of expressions within the relation's definition:

relation project {
  expression amount;
  expression cost;
}


 
## Line Separation

Your example is somewhat inconsistent where lines end in semicolons
and where they don't.  Whether we use semicolons or not it should be
very obvious and I don't want to have to do any mental gymnastics to
figure out when I need to put a semicolon.

Agreed.  The current design leans towards semicolons everywhere.  That said, I would love to have a design where we didn't need them, so I may try to eliminate them after the first implementation of the parser and see if I can do that without complicating the parsing.



### Arbitrary relation properties

I think my main concern right now is that I don't see how to specify
relation properties.  In my mind every relation has "inputs" (which
you've covered), an output schema (which is implicit) and
"properties".  For example, you have specified the filter property for
the read rel but I don't see where you specify the read type (named
table, local files, virtual table).  In addition, I don't see a
"projection" property.

There's a project example above.  I am missing one connection between sources and the read relation in the latest generated example:

relation read {
  base_schema schema;
  source local_data;  // Was missing.
  filter and(and(and(and(and(and(and(is_not_null(l_shipdate), is_not_null(l_discount)), is_not_null(l_quantity)), gte(l_shipdate, 8766.000000)), lt(l_shipdate, 9131.000000)), gte(l_discount, 0.050000)), lte(l_discount, 0.070000)), lt(l_quantity, 24.000000));
}

source local_files local_data {
  items = [
    {uri_file: "/mock_lineitem.orc" length: 3719 orc: {}}
  ]
}
 
The type would be specified in the source section and the read relation would just refer to the source.  The intention is to keep the source details away from how the relation works.


Furthermore, I think the answer for "properties" will have to be very
generic.  While we can add all the various standard relations to the
grammar that wouldn't help us with extension relations (e.g.
currently, in Acero, we have an extension relation for asof join).  So
I think "properties" will need to be restricted to some kind of
arbitrary YAML / JSON that is in the relation with the caveat that we
have specialized representations for some things like expressions and
schemas.

Is a key/value approach enough for properties?  The parsing of custom values will need to happen externally to the plan.
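If a key/value approach is enough, the shape might look like this sketch (Python; the class, the extension URI, and the property keys are all hypothetical illustrations of the idea, not a proposal):

```python
# Sketch: a generic key/value properties bag on a relation, with values
# kept as opaque strings so extension relations (e.g. an asof join) can
# carry payloads that are parsed externally to the plan.

class Relation:
    def __init__(self, name, kind):
        self.name = name
        self.kind = kind
        self.properties = {}  # str -> str, opaque to the plan parser

    def set_property(self, key, value):
        self.properties[key] = value

asof = Relation("asof_join_1", "extension")
asof.set_property("extension_uri", "urn:example:acero_asof_join")
asof.set_property("tolerance", "1000")  # consumer parses this, not the plan
```

The open question is whether flat string values suffice or whether nested structures (the YAML/JSON suggestion above) are needed for richer extension relations.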
 
 
### Literals

Your example includes a number of literals (8766, 0.6) but how am I to
know what the types of these literals are?  I think, for compactness,
it would be ok to have a "default type" (int32 for 8766 and float64
for 0.6) but it would be good if there were a way to specify the type
when a more specific type is desired (e.g. 8766LL for an int64 literal
or 0.6f for a float32 literal).  Also, whatever mechanism is arrived at
here will need to support the concept of user-defined types.  In
Arrow we have List<int32> ([1, 1, 1]) and LargeList<int32> ([1, 1,
1]!largelist)?

At worst we can provide the capability for an externally handled literal (something like a backquoted string `mycustomvalue` perhaps).  Adding specific types to literals is easy to add to the language.  Given that every function has its types fully specified, do we need to label the literals?

extension_space my_lovely_uri.yaml {
  function is_not_null:fp64 as is_not_null;
  function and:bool_bool as and;
  function gte:fp64_fp64 as gte;
  function lt:fp64_fp64 as lt;
  function lte:fp64_fp64 as lte;
  function sum:opt_fp64 as sum;
  function multiply:opt_fp64_fp64 as multiply;
  function multiply:opt_fp32_fp32 as multiply32;
}
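The argument in the message above is that the signature alone can type the literals.  A sketch of that inference (Python; the table and helper names are hypothetical, and the signature strings are the ones from the extension_space above):

```python
# Sketch: since each signature like "lte:fp64_fp64" fully specifies its
# parameter types, a literal argument can take its type from its position
# in the call rather than carrying a suffix of its own.

SIGNATURES = {
    "gte": "fp64_fp64",
    "lt": "fp64_fp64",
    "lte": "fp64_fp64",
}

def argument_types(func):
    """Parameter types encoded in the underscore-separated signature."""
    return SIGNATURES[func].split("_")

def type_of_literal(func, position):
    """Type of the literal at the given zero-based argument position."""
    return argument_types(func)[position]

# In gte(l_shipdate, 8766) the literal 8766 is typed by the signature:
assert type_of_literal("gte", 1) == "fp64"
```

This works as long as every literal appears as a direct argument of a fully typed function; literals in other positions (or with ambiguous overloads like multiply vs. multiply32) would still need explicit labels.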
