RE: Is a MySQL Substrait extension feasible?

53 views
Skip to first unread message

Jacques Nadeau

unread,
Sep 8, 2025, 8:57:20 PMSep 8
to Substrait, Kaiwang Chen
(received this privately but thought it would be better to respond here. Kaiwang agreed it was okay to share)

1. I feel like substrait-cpp isn't getting that much love right now. Part of this is because people are pretty comfortable working with the protobuf structures directly. The addon library are really about trying to provide better idiomatic interfaces and utility functions for working with Substrait but you are always welcome to work directly with protobuf messages. Not sure if others have been interested in trying to move substrait-cpp. Hopefully they would chime in here.

2. Ideally pojo-like things aren't necessary. They can be useful to turn proto into something more idiomatic which is what the libraries like substrait-go, substrait-java, etc do. (For example, inheritance and interfaces can be useful but don't exist in protobuf). That being said, it isn't necessary. If you were thinking about doing so, I'd suggest extending/updating in a general way in substrait-cpp as opposed to something custom. That way you can hopefully share the burden with others in time. I'd probably start by picking up substrait-cpp and working with it/enhancing along the way as needed. We're trying to avoid lots of major/complex breaking changes so hopefully little changes in the spec won't be overly burdensome to incorporate into the project.

3. isthmus, in substrait-java is a connector between substrait and Calcite. This means that any substrait plan can be converted to Calcite and then leverage calcite's capabilities. One of Calcite's capabilities is to do exactly that (plan => sql) so you should be able to leverage isthmus+calcite to do that. I suspect datafusion integration may also have that capability but I'm not entirely sure (it might be only sql => substrait, not substrait => sql). 

4. There used to be a couple but I'm not sure what is currently active. Maybe others can chime in.

5. Yes. Basic stats are baked in and stats is designed as an extensibility endpoint for exactly this purpose via advanced extensions. Advanced extensions are covered lightly here at a high level: https://substrait.io/extensions/#advanced-extensions. Actual extension point in protocol is here: https://github.com/substrait-io/substrait/blob/main/proto/substrait/algebra.proto#L60

6. WRT Velox and DuckDB, they are both awesome but definitely serve different purposes. DuckDB is single process first and is a complete database. Velox is much more about being plugged into other systems (as is done with spark/gluten, prestissimo, etc). There are both really great groups of people who built high quality, high performance db execution engines. I would expect DuckDB to get you to something working faster but be less modular than working with Velox. I suggest you join both communities to ask their feedback on your work. In these situations (where you have two or more good options--data fusion is another really strong system and once you get to substrait+Arrow, language boundaries are more manageable). In terms of activity with substrait, I would say that Datafusion is the most active, followed by DuckDB, followed by Gluten/Velox. (Note that Gluten are currently on a hard fork at the moment--something we'd like to ultimately get rectified.) 

Hope that helps!

On Fri, Sep 5, 2025 at 5:58 AM Kaiwang Chen <kaiwan...@gmail.com> wrote:

Hi Jacques,


Great! I know Datafusion, on which Greptime, a database startup, builds its MPP system, they told me Substrait was used to ship plan fragments. However, it is Rust and components are the same on both sides. I am working on MySQL and DuckDB. Both are C++. I noticed substrait-cpp is not so active, compared to substrait-java and substrait-rs.


I just did some quick investigation and saw that Gluten translates Spark stage physical plan to Substrait, which is executed by Velox and ClickHouse. Gluten even provides a way to fallback to vanilla Spark. I guess it leads to the fact that substrait-java is active? I can't help but imagine how Spark-Gluten-Velox handles incompatibility between type systems. It is not about relation operators, it is about scalar functions and operators and non-trivial job according to Thomas Neumann and Viktor's talk, Towards Sanity in Query Languages (https://www.youtube.com/watch?v=TBAf5l1RmcA). Please also share your insights?


My vision is to extend MySQL in many cases. For example, a MySQL thread could ship a fragment to a group of workers to enable parallel execution, to an embedded DuckDB instance to enable columnar execution, or to another data processing component or computable storage system.


I see there is a DuckDB Substrait extension. I guess a MySQL Substrait extension makes all that possible. Probably what I am trying to build is something like Gluten for MySQL. However, MySQL processing makes things more challenging, because it does not have a clear stage before execution, instead it might interleave optimization and execution. I guess such interleaving implies fragments and the same data view between MySQL row store and a column store.


My other questions:

1. Is substrait-cpp a good start? As far as I know, DuckDB Substrait extension does not rely on substrait-cpp.

2. Is it best practice to introduce a middle layer of plan structures (kind of POJO in Java) between MySQL plan structures and Substrait plans? I found DuckDB plan structures are well designed and easy to translate, however, MySQL plan structures are not that easy, they are changing in these years.

3. Is there any library to convert a Substrait plan to some SQL text?

4. Is there any visual tool to examine a Substrait plan, just like pev2 (https://github.com/dalibo/pev2)?

5. Could a Substrait plan be extended with execution statistics, so as to resemble EXPLAIN ANALYZE?

6. Do you have any comments about Velox and DuckDB? I found DuckDB is a hot trend but not so easy to be extended to a MPP system, its pipeline execution framework assumes an in-memory shared storage with coupled computing logic.


Looking forward to your response!


Thanks,

Kaiwang

Julian Hyde

unread,
Sep 8, 2025, 11:01:26 PMSep 8
to subs...@googlegroups.com
Regarding 3. I think there’s an opportunity to reduce duplicate work and create a single SQL dialect library that works for all languages. We would do it by moving Calcite's plan-to-SQL code into Substrait (or a new project).

There is considerable pain, currently. Many, many tools and companies maintain their own SQL generation layer—significant redundant work. Substrait delegates this to Calcite, but since Calcite is Java-only, people who need this capability from languages like C++ or Rust go elsewhere.

My proposed solution is to move Calcite’s dialect layer into a separate project (perhaps a sub-project of Substrait). It would consist of a comprehensive test suite with thousands of plan-to-SQL cases, Java code, and eventually code in other languages (Rust, C++, Python). 

There are several benefits: Reduce duplicated maintenance work; attract more maintainers through broader usage; consistent Substrait capabilities in all languages; make Substrait the place to go for federated and hybrid planning, not just local code-generation.

Having built Calcite's dialect layer, and maintained it for several years, I know the real value is in the test suite. The test suite would be the heart of this project — and would grow as the project gains momentum. We can improve it by generating and verifying cases against actual engines while keeping offline validation for PRs. In 2025, most of the dialect code can be generated by AI — but the corpus of tests is essential to safely grow that code.

If people want to work on this, let's talk.

Julian




--
You received this message because you are subscribed to the Google Groups "substrait" group.
To unsubscribe from this group and stop receiving emails from it, send an email to substrait+...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/substrait/CAJ9XdSo%2BPavRAKB%3DBkHU6MWakiUzCH0dxs0qnwa2pETTRkXbCw%40mail.gmail.com.

Kaiwang Chen

unread,
Sep 11, 2025, 6:47:51 AMSep 11
to Jacques Nadeau, Substrait

Thanks for the response and loop in the mail list.

Jacques Nadeau <jac...@sundeck.io> 于2025年9月9日周二 08:57写道:
(received this privately but thought it would be better to respond here. Kaiwang agreed it was okay to share)

1. I feel like substrait-cpp isn't getting that much love right now. Part of this is because people are pretty comfortable working with the protobuf structures directly. The addon library are really about trying to provide better idiomatic interfaces and utility functions for working with Substrait but you are always welcome to work directly with protobuf messages. Not sure if others have been interested in trying to move substrait-cpp. Hopefully they would chime in here.
 
Great!


2. Ideally pojo-like things aren't necessary. They can be useful to turn proto into something more idiomatic which is what the libraries like substrait-go, substrait-java, etc do. (For example, inheritance and interfaces can be useful but don't exist in protobuf). That being said, it isn't necessary. If you were thinking about doing so, I'd suggest extending/updating in a general way in substrait-cpp as opposed to something custom. That way you can hopefully share the burden with others in time. I'd probably start by picking up substrait-cpp and working with it/enhancing along the way as needed. We're trying to avoid lots of major/complex breaking changes so hopefully little changes in the spec won't be overly burdensome to incorporate into the project.

I am not sure if Gluten shim layer is to provide neural objects to isolate spark core changes. However, I guess there is no optimizer working directly on these protobuf structures.


3. isthmus, in substrait-java is a connector between substrait and Calcite. This means that any substrait plan can be converted to Calcite and then leverage calcite's capabilities. One of Calcite's capabilities is to do exactly that (plan => sql) so you should be able to leverage isthmus+calcite to do that. I suspect datafusion integration may also have that capability but I'm not entirely sure (it might be only sql => substrait, not substrait => sql).  

Looks like isthmus already pretty prints as json. However, the Substrait json example in isthmus-cli/README.md of the following command crashed DuckDB "CALL from_substrait_json(json_plan)". :(
 > $ ./isthmus-cli/build/graal/isthmus \
  -c "CREATE TABLE Persons ( firstName VARCHAR, lastName VARCHAR, zip INT )" \
     "SELECT lastName, firstName FROM Persons WHERE zip = 90210"

4. There used to be a couple but I'm not sure what is currently active. Maybe others can chime in.



5. Yes. Basic stats are baked in and stats is designed as an extensibility endpoint for exactly this purpose via advanced extensions. Advanced extensions are covered lightly here at a high level: https://substrait.io/extensions/#advanced-extensions. Actual extension point in protocol is here: https://github.com/substrait-io/substrait/blob/main/proto/substrait/algebra.proto#L60

Is a Substrait plan can be self-contained to allow for further optimization? The same question was also raised in reply to Julian's response.

6. WRT Velox and DuckDB, they are both awesome but definitely serve different purposes. DuckDB is single process first and is a complete database. Velox is much more about being plugged into other systems (as is done with spark/gluten, prestissimo, etc). There are both really great groups of people who built high quality, high performance db execution engines. I would expect DuckDB to get you to something working faster but be less modular than working with Velox. I suggest you join both communities to ask their feedback on your work. In these situations (where you have two or more good options--data fusion is another really strong system and once you get to substrait+Arrow, language boundaries are more manageable). In terms of activity with substrait, I would say that Datafusion is the most active, followed by DuckDB, followed by Gluten/Velox. (Note that Gluten are currently on a hard fork at the moment--something we'd like to ultimately get rectified.) 

Hope that helps!

Really helps! Thanks a lot!

Thanks,
Kaiwang 

Boshen Cui

unread,
Sep 13, 2025, 11:39:30 PMSep 13
to subs...@googlegroups.com
re: 4
one of my colleagues built out a similar visualization tool
https://github.com/gabotechs/substrait-vizhttps://www.npmjs.com/package/@substrait-viz/react
i think on top of https://reactflow.dev/
which we've use internally to debug

if you want to examine the plan structure via a text structure instead of visuals 

Kaiwang Chen

unread,
Sep 13, 2025, 11:39:30 PMSep 13
to subs...@googlegroups.com


Julian Hyde <jhyde....@gmail.com> 于2025年9月9日周二 11:01写道:
Regarding 3. I think there’s an opportunity to reduce duplicate work and create a single SQL dialect library that works for all languages. We would do it by moving Calcite's plan-to-SQL code into Substrait (or a new project).

There is considerable pain, currently. Many, many tools and companies maintain their own SQL generation layer—significant redundant work. Substrait delegates this to Calcite, but since Calcite is Java-only, people who need this capability from languages like C++ or Rust go elsewhere.

My proposed solution is to move Calcite’s dialect layer into a separate project (perhaps a sub-project of Substrait). It would consist of a comprehensive test suite with thousands of plan-to-SQL cases, Java code, and eventually code in other languages (Rust, C++, Python).  

Probably a subproject of both a library  for Substrait-to-SQL and a standalone CLI tool to pretty print given Substrait plan and SQL text?

Conversion between SQL and Substrait plan in both directions is also a point to consider. Note that substrait-io/duckdb-substrait-extension already provides such functions. However, they require table definitions.

- get_substrait(sql) to get substrait message as a binary string
- get_substrait_json(sql) to get substrait message as a JSON text
- from_substrait(blob)
- from_substrait_json(json)
 
I am not sure if it is possible to pretty print a Substrait message as structured text rather than JSON text, or if it is just by design to use JSON as a user-friendly representation.
However, a system might just put down raw binary messages, leaving people to examine them later. A separate tool to examine the binary message should be of great help here.


There are several benefits: Reduce duplicated maintenance work; attract more maintainers through broader usage; consistent Substrait capabilities in all languages; make Substrait the place to go for federated and hybrid planning, not just local code-generation.

I don't understand what hybrid planning is here. I would imagine an optimizer that optimizes a given substrait fragment? Is it OK to submit a Subtrait plan with embedded supportive definitions and statistics to Calcite and get an optimal Substrain plan for execution, just as provided by ORCA? See AMPERe https://dl.acm.org/doi/pdf/10.1145/2304510.2304513


Having built Calcite's dialect layer, and maintained it for several years, I know the real value is in the test suite. The test suite would be the heart of this project — and would grow as the project gains momentum. We can improve it by generating and verifying cases against actual engines while keeping offline validation for PRs. In 2025, most of the dialect code can be generated by AI — but the corpus of tests is essential to safely grow that code. 

Such test cases are really invaluable.

However, I have really painful experiences with MySQL's test cases. When I tried to convert an execution plan of some feature to a SQL dialect which is executed in a secondary engine, because the development process was ongoing and far from finished, there could be thousands of broken cases to examine. Although a silent fallback mechanism could help to alleviate the distraction, it was really hard to tell which cases are related to which features. I am wondering if there is a way to identify tests so that each iteration of development can be more focused when a big test suite is reused.
 

If people want to work on this, let's talk.

Julian



Thanks,
Kaiwang
 

Julian Hyde

unread,
Sep 16, 2025, 2:30:52 PMSep 16
to subs...@googlegroups.com
> I don't understand what hybrid planning is here.

A hybrid plan has a combination of local execution and distributed SQL execution. If a plan is a tree, then some of the leaves of the trees will be SQL queries that are executed by firing a SQL query at a remote engine. The local part might even be trivial - just forward the results from the remote engine to the client.

To do hybrid planning, you need to know the capabilities of the remote engine (e.g. does it support FULL outer join?) and its syntax (e.g. is the SUBSTRING function called SUBSTR?). A cost model is useful, but generally ‘push down as much as you can’ works well.

Kaiwang Chen

unread,
Sep 22, 2025, 11:04:27 PMSep 22
to subs...@googlegroups.com

Julian Hyde <jhyde....@gmail.com> 于2025年9月17日周三 02:30写道:
> I don't understand what hybrid planning is here.

A hybrid plan has a combination of local execution and distributed SQL execution. If a plan is a tree, then some of the leaves of the trees will be SQL queries that are executed by firing a SQL query at a remote engine. The local part might even be trivial - just forward the results from the remote engine to the client.

So it's intended for the case that some Substrait plan fragment is meant to be distributed to an executor which only understands its SQL form, then the substrait-to-sql feature is needed to do the conversion.
 

To do hybrid planning, you need to know the capabilities of the remote engine (e.g. does it support FULL outer join?) and its syntax (e.g. is the SUBSTRING function called SUBSTR?). A cost model is useful, but generally ‘push down as much as you can’ works well.

Yes. Then the remote executor is required to register to the optimizer/coordinator with its capabilities?

Looks there is no standard way in substrait to declare that an executor is able to execute only a few operators and functions. Substrait is just a wire format, it does not include capability negotiation?

/Kaiwang
 

Jacques Nadeau

unread,
Sep 23, 2025, 2:06:51 PMSep 23
to subs...@googlegroups.com
Take a look at David's dialect work, merged in https://github.com/substrait-io/substrait/pull/816.

It's our most complete capabilities expression system thus far.

Julian Hyde

unread,
Sep 23, 2025, 2:27:11 PMSep 23
to subs...@googlegroups.com
If by "capability negotiation” you mean Substrait has a back-and-forth with the target system, no, that is not required. The definition of a target dialect (e.g. MySQL version 8) knows what constructs MySQL supports and what is the syntax for those constructs. The only “back and forth” occurs when a developer is writing the MySQL 8 dialect, and testing against a MySQL 8 system.

Reply all
Reply to author
Forward
0 new messages