Understanding Substrait Plans


Advitya Gemawat

May 19, 2022, 10:03:24 AM5/19/22
to Substrait
I have a few questions related to understanding the Substrait plan for TPCH-6 (which I wasn't able to directly determine from the plan or the proto spec):

1. [substrait_1.png] Why is the ‘cast’ needed in the filter? Are we casting ‘24’ from int to decimal, and if so, is that even necessary?

2. [substrait_2.png] Could you explain how exactly the decimals are encoded in the plan? I presume I’m supposed to decode them to “0.03 {+/–} 0.01” (for two of the filters), but the decimal dictionaries also look exactly the same.

3. [substrait_3.png] As a sanity check, could you explain what the data types (separated by underscores after `<operator_name>:`) are supposed to refer to (especially `opt_decimal_decimal`)? I’m not sure I have access to the YAML files defining these functions.

4. [substrait_5.png] Could you explain what “measures” is about (in the aggregation operator, I presume, as per the proto spec)? The project expressions make sense to me [substrait_4.png]: we have to MULTIPLY (function reference 9) L_EXTENDEDPRICE (field / names list index 5) and L_DISCOUNT (field / index 6). However, taking the SUM (function reference 10) of L_ORDERKEY (field / index 0) doesn’t make intuitive sense, considering that we should instead take the SUM (aggregation) of the multiplied result, as per TPCH6.

Looking forward to your response, thank you very much!

Best,
Advitya Gemawat

substrait_3.png
substrait_5.png
substrait_2.png
substrait_4.png
substrait_1.png

Jacques Nadeau

May 19, 2022, 1:43:08 PM5/19/22
to Advitya Gemawat, Substrait
For #1: Another option would be to actually declare a decimal literal. Substrait doesn't have implicit conversions, so if the literal is an integer but the function needs a decimal, the data must be explicitly cast in the Substrait plan.
For #2: It seems like there is a bug there, as noted in [1].
For #3: The definitions are here [2], but [1] also notes an issue with decimals and Isthmus.
I'll let someone else take #4.


--
You received this message because you are subscribed to the Google Groups "Substrait" group.
To unsubscribe from this group and stop receiving emails from it, send an email to substrait+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/substrait/48c153e4-024f-47b0-87eb-9b1449b68fd6n%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

weston.pace

May 19, 2022, 2:35:29 PM5/19/22
to Substrait
Can you attach the entire plan as a JSON file? Also, how did you generate the plan? Did you get it by plugging SQL into Isthmus, or was it a hard-coded plan stored somewhere?

Jeroen van Straten

May 19, 2022, 2:51:11 PM5/19/22
to Advitya Gemawat, Substrait
For #2: They should be 16-byte integers represented as a binary blob, which you divide by 10^SCALE to get the actual number. In protobuf JSON, those bytes are base64-encoded. As Jacques already noted, though, Isthmus is currently pretty wrong about how it encodes decimal literals. There are some (hopefully) correct examples here: https://github.com/jvanstraten/substrait-validator/blob/initial/tests/tests/expressions/literals/decimal.yaml. Those test files are not *quite* protobuf JSON represented as YAML, but they're close enough to read the decimal literals and expected values.
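A minimal sketch of that decoding in Python, assuming the 16-byte blob is a little-endian two's-complement integer (the base64 string below is a hypothetical literal for 0.03 at scale 2, not taken from an actual Isthmus plan):

```python
import base64

# Hypothetical base64 blob as it might appear in protobuf JSON:
# the unscaled integer 3 stored as 16 little-endian two's-complement bytes.
blob = base64.b64decode("AwAAAAAAAAAAAAAAAAAAAA==")
assert len(blob) == 16

unscaled = int.from_bytes(blob, byteorder="little", signed=True)
scale = 2  # taken from the decimal type's SCALE parameter
value = unscaled / 10**scale
print(value)  # 0.03
```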

For #4: Assuming this relates to the output for TPC-H06 from the current version of Isthmus: the project relation removes all columns except the expression by means of the emit message in common. The schema becomes just a single DECIMAL?<19, 0> column at that point, i.e. the result of the multiplication, so in the parent relation a field reference to field 0 resolves to the result of that expression rather than to the order key.

I've attached the HTML output of the validator + tpch.py script from https://github.com/substrait-io/substrait-java/issues/18 for TPC-H06 here (@Weston, it also includes the JSON).

Jeroen

06.html

Advitya Gemawat

Jun 22, 2022, 11:20:54 AM6/22/22
to substrait
Thank you for your response! We have integrated support for a few TPCH queries with Substrait, and have a few follow-up questions:

1. It looks like the plans are unoptimized. For TPCH14, where there's a join condition specified in the WHERE clause, the Substrait plan seems to represent it the same way as the input query, without indicating that the other WHERE conditions should be executed before the JOIN to accelerate query execution. The `join` key in the JSON also seems to serve as input to the filter, and executing a join before filters is usually a suboptimal choice. Was Calcite used to generate these plans, and are there any improvements that can be made to the execution order represented in the plans?

2. The Substrait plans look like logical plans, not physical plans. That is, the plan says there needs to be, say, an inner join (TPCH14 again), but it doesn't point to a particular join implementation that should be used or may be optimal for that query. This leads to us arbitrarily picking a join implementation, which may not be the most efficient choice and may not be ideal as we work on expanding support. Depending on the query, particular implementations of operators like join may perform better. Are there any plans to also represent physical execution plans in Substrait?

3. Sorry, I haven't been following some of the other Substrait conversations; are there any updates on the decimal glitch in Isthmus? If there are GitHub issues, Google Group / Slack conversations, etc. around these, it would be great if you could route me to them.


Looking forward to your responses, thank you very much!

Advitya Gemawat

Jun 22, 2022, 11:25:34 AM6/22/22
to substrait
4. Are there any differences between the Substrait plans generated from Isthmus/Calcite versus, say, DuckDB? Have you compared how the plans look (or do you have access to those plans to share with us)?

Thank you again!

Jinfeng Ni

Jun 22, 2022, 12:25:52 PM6/22/22
to substrait
Hi Advitya,

Glad to hear you are moving forward with integrating tpc-h Substrait plans.

1. You are right that the generated Substrait plan is not optimized. In the current implementation, the Calcite-based SQL -> Substrait converter goes through the steps of parsing, SQL validation, and conversion to a relational algebra operator (RelNode), but it does not go through Calcite's optimization phases. That means typical optimizations like predicate pushdown, projection pruning, and join reordering are not there yet. Part of the reason is that we first want to explore the bidirectional converters (SQL -> Substrait, Substrait -> SQL) in Isthmus to verify that converting into and out of a Substrait plan is doable and correct. Later, we could add the optimization steps into the process.

2. Yes, the Substrait plan generated is logical. The current Substrait specification mainly contains logical operators: it specifies the semantics of an operator without specifying how the operator is to be implemented.

3. I'm not sure which decimal-related bugs you are referring to. In the substrait-java repo, there was a recent fix for decimal literal encoding/decoding.

4. I have not used DuckDB and do not know what the Substrait plans generated by DuckDB look like.

Best Regards,

Jinfeng

Advitya Gemawat

Jul 5, 2022, 2:11:00 PM7/5/22
to substrait
Thank you for your response! Please keep us updated on when the Calcite optimizations are added to substrait plans so that we can incorporate them.

For now, we're looking at the DuckDB API to get substrait proto bytes for a query which may incorporate DuckDB's optimizations.

To generate the Substrait protobuf from the proto bytes, I downloaded protoc, cloned the main substrait-io/substrait repo, and tried to generate a Python module from substrait/plan.proto (to parse the proto bytes), but got the error attached in plan_substrait_protoc_output.png.

I was wondering whether there's a different way I'm supposed to generate a Python file from the proto spec, which I could then use to parse Substrait proto bytes.

Looking forward to your response, thank you very much!

duckdb_substrait.py
plan_substrait_protoc_output.png
duckdb_substrait_print_output.png

Jeroen van Straten

Jul 5, 2022, 2:54:44 PM7/5/22
to Substrait
You were close: the correct command would be

path/to/substrait/proto$ protoc -I . --python_out . substrait/*.proto substrait/extensions/*.proto

noting that the include path should point to the proto dir (not proto/substrait), and that you should probably generate all the files (protoc won't generate the dependencies for you). If you have buf (https://buf.build/), you can also just use "buf generate" from the root of the repository, which generates Python files in gen/proto/python.

If you just want to convert binary to JSON (or JSON-like formats) to understand what's happening, you could also try the validator:

pip install substrait-validator
substrait-validator input.proto -m convert -O output.json

or in Python:

import substrait_validator as sv
json_data = sv.plan_to_json(binary_data)

The validator also exposes the protobuf-generated files in sv.substrait. However, I have no idea what version of Substrait DuckDB is using, and there have been some breaking changes recently, so your mileage may vary with that approach. I think the PyPI release of the validator is based on Substrait 0.3.0.


Advitya Gemawat

Jul 11, 2022, 12:11:05 PM7/11/22
to substrait
Thank you very much for your response! I was able to run the command and get the Python files for each proto file.

I also got a chance to try substrait-validator on the DuckDB-Substrait proto bytes and got a duckdb-substrait JSON file (attached for your reference). The syntax looks a bit different from the plans shared by Jinfeng (which we have been using), and this attached plan also seems to have spurious components such as an empty outer project operator. You had a very valid point about the Substrait version used by the DuckDB API (which looks like 0.12) versus the validator (0.3.0). I've also been in touch with the DuckDB folks and have made a GitHub issue so that we can hopefully get engagement and compatibility on their API's side as well.

One question I wanted to ask is about the process of actually reading the Substrait protobuf stream. I'm curious how to go about parsing Substrait plans, and whether there are any existing scripts to do so (rather than setting all aspects of the plan manually, as in the attached image tutorial). Please let me know any suggestions from your side. Many thanks for the continued support!
tpch6_duckdb_substrait.json
proto_manual_parse.png

Jeroen van Straten

Jul 11, 2022, 5:44:26 PM7/11/22
to Substrait
> and this attached plan also seem to have spurious components such as an empty outer project operator

The validator doesn't change the structure of the plan. The biggest artifact you might encounter is that default values are omitted in the JSON output, because that's just how protobuf's JSON serializer works.

Note that the outer projection is not actually empty. It looks like this:

"project": {
  "input": ...,
  "expressions": [
    { "selection": { "directReference": { "structField": {} } } }
  ]
}

The index in the structField is omitted by the JSON serializer because it's set to zero, so this is a project relation that appends a single column that just duplicates column zero. They probably *mean* to output a projection that *replaces* the dataset with just column zero (i.e. to remove all other columns, which in this case would just be the grouping set identification column that the aggregate relation implicitly appends), but that would be a misinterpretation of the spec: the expressions in a Substrait project relation can only append columns, while emit clauses in RelCommon must be used to remove or re-index/swizzle columns. This is a problem on their end.
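Jeroen's distinction between append-only expressions and the emit clause can be sketched as a toy model in plain Python (the `project` function and the column values here are hypothetical illustrations, not a Substrait API):

```python
# Toy model of a Substrait project relation, treating rows as plain lists:
# expressions can only APPEND columns; the emit clause in RelCommon then
# selects / reorders the output columns by index.
def project(row, expressions, emit=None):
    out = row + [expr(row) for expr in expressions]  # append-only
    if emit is not None:
        out = [out[i] for i in emit]  # remove / re-index columns
    return out

# e.g. [extendedprice, discount] -> keep only the appended product column
row = [2.0, 0.5]
result = project(row, [lambda r: r[0] * r[1]], emit=[2])
print(result)  # [1.0]
```

Without the `emit=[2]`, the input columns would survive alongside the appended expression, which is exactly the behavior the spec mandates for expressions alone.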

> One question I wanted to ask is on the process to actually read the substrait protobuf stream. I'm curious on how to actually go about this for parsing substrait plans, and if there are any existing scripts to do the same (rather than setting all aspects of the plan manually such as in the attached image tutorial)?

Unless the Java team already has an API for this (I'm not up to speed on that), I'm not aware of any API that simplifies this beyond accessing the plans directly with the protobuf library (or third-party alternatives) at this time. There have been some talks about building such APIs in the future, though, as well as a human-readable text format (i.e. better than JSON) that maps one-to-one onto Substrait.
