RE: Is a MySQL Substrait extension feasible?

6 views
Skip to first unread message

Jacques Nadeau

unread,
Sep 8, 2025, 8:57:20 PM (2 days ago) Sep 8
to Substrait, Kaiwang Chen
(received this privately but thought it would be better to respond here. Kaiwang agreed it was okay to share)

1. I feel like substrait-cpp isn't getting that much love right now. Part of this is because people are pretty comfortable working with the protobuf structures directly. The addon library are really about trying to provide better idiomatic interfaces and utility functions for working with Substrait but you are always welcome to work directly with protobuf messages. Not sure if others have been interested in trying to move substrait-cpp. Hopefully they would chime in here.

2. Ideally pojo-like things aren't necessary. They can be useful to turn proto into something more idiomatic which is what the libraries like substrait-go, substrait-java, etc do. (For example, inheritance and interfaces can be useful but don't exist in protobuf). That being said, it isn't necessary. If you were thinking about doing so, I'd suggest extending/updating in a general way in substrait-cpp as opposed to something custom. That way you can hopefully share the burden with others in time. I'd probably start by picking up substrait-cpp and working with it/enhancing along the way as needed. We're trying to avoid lots of major/complex breaking changes so hopefully little changes in the spec won't be overly burdensome to incorporate into the project.

3. isthmus, in substrait-java is a connector between substrait and Calcite. This means that any substrait plan can be converted to Calcite and then leverage calcite's capabilities. One of Calcite's capabilities is to do exactly that (plan => sql) so you should be able to leverage isthmus+calcite to do that. I suspect datafusion integration may also have that capability but I'm not entirely sure (it might be only sql => substrait, not substrait => sql). 

4. There used to be a couple but I'm not sure what is currently active. Maybe others can chime in.

5. Yes. Basic stats are baked in and stats is designed as an extensibility endpoint for exactly this purpose via advanced extensions. Advanced extensions are covered lightly here at a high level: https://substrait.io/extensions/#advanced-extensions. Actual extension point in protocol is here: https://github.com/substrait-io/substrait/blob/main/proto/substrait/algebra.proto#L60

6. WRT Velox and DuckDB, they are both awesome but definitely serve different purposes. DuckDB is single process first and is a complete database. Velox is much more about being plugged into other systems (as is done with spark/gluten, prestissimo, etc). There are both really great groups of people who built high quality, high performance db execution engines. I would expect DuckDB to get you to something working faster but be less modular than working with Velox. I suggest you join both communities to ask their feedback on your work. In these situations (where you have two or more good options--data fusion is another really strong system and once you get to substrait+Arrow, language boundaries are more manageable). In terms of activity with substrait, I would say that Datafusion is the most active, followed by DuckDB, followed by Gluten/Velox. (Note that Gluten are currently on a hard fork at the moment--something we'd like to ultimately get rectified.) 

Hope that helps!

On Fri, Sep 5, 2025 at 5:58 AM Kaiwang Chen <kaiwan...@gmail.com> wrote:

Hi Jacques,


Great! I know Datafusion, on which Greptime, a database startup, builds its MPP system, they told me Substrait was used to ship plan fragments. However, it is Rust and components are the same on both sides. I am working on MySQL and DuckDB. Both are C++. I noticed substrait-cpp is not so active, compared to substrait-java and substrait-rs.


I just did some quick investigation and saw that Gluten translates Spark stage physical plan to Substrait, which is executed by Velox and ClickHouse. Gluten even provides a way to fallback to vanilla Spark. I guess it leads to the fact that substrait-java is active? I can't help but imagine how Spark-Gluten-Velox handles incompatibility between type systems. It is not about relation operators, it is about scalar functions and operators and non-trivial job according to Thomas Neumann and Viktor's talk, Towards Sanity in Query Languages (https://www.youtube.com/watch?v=TBAf5l1RmcA). Please also share your insights?


My vision is to extend MySQL in many cases. For example, a MySQL thread could ship a fragment to a group of workers to enable parallel execution, to an embedded DuckDB instance to enable columnar execution, or to another data processing component or computable storage system.


I see there is a DuckDB Substrait extension. I guess a MySQL Substrait extension makes all that possible. Probably what I am trying to build is something like Gluten for MySQL. However, MySQL processing makes things more challenging, because it does not have a clear stage before execution, instead it might interleave optimization and execution. I guess such interleaving implies fragments and the same data view between MySQL row store and a column store.


My other questions:

1. Is substrait-cpp a good start? As far as I know, DuckDB Substrait extension does not rely on substrait-cpp.

2. Is it best practice to introduce a middle layer of plan structures (kind of POJO in Java) between MySQL plan structures and Substrait plans? I found DuckDB plan structures are well designed and easy to translate, however, MySQL plan structures are not that easy, they are changing in these years.

3. Is there any library to convert a Substrait plan to some SQL text?

4. Is there any visual tool to examine a Substrait plan, just like pev2 (https://github.com/dalibo/pev2)?

5. Could a Substrait plan be extended with execution statistics, so as to resemble EXPLAIN ANALYZE?

6. Do you have any comments about Velox and DuckDB? I found DuckDB is a hot trend but not so easy to be extended to a MPP system, its pipeline execution framework assumes an in-memory shared storage with coupled computing logic.


Looking forward to your response!


Thanks,

Kaiwang

Julian Hyde

unread,
Sep 8, 2025, 11:01:26 PM (2 days ago) Sep 8
to subs...@googlegroups.com
Regarding 3. I think there’s an opportunity to reduce duplicate work and create a single SQL dialect library that works for all languages. We would do it by moving Calcite's plan-to-SQL code into Substrait (or a new project).

There is considerable pain, currently. Many, many tools and companies maintain their own SQL generation layer—significant redundant work. Substrait delegates this to Calcite, but since Calcite is Java-only, people who need this capability from languages like C++ or Rust go elsewhere.

My proposed solution is to move Calcite’s dialect layer into a separate project (perhaps a sub-project of Substrait). It would consist of a comprehensive test suite with thousands of plan-to-SQL cases, Java code, and eventually code in other languages (Rust, C++, Python). 

There are several benefits: Reduce duplicated maintenance work; attract more maintainers through broader usage; consistent Substrait capabilities in all languages; make Substrait the place to go for federated and hybrid planning, not just local code-generation.

Having built Calcite's dialect layer, and maintained it for several years, I know the real value is in the test suite. The test suite would be the heart of this project — and would grow as the project gains momentum. We can improve it by generating and verifying cases against actual engines while keeping offline validation for PRs. In 2025, most of the dialect code can be generated by AI — but the corpus of tests is essential to safely grow that code.

If people want to work on this, let's talk.

Julian




--
You received this message because you are subscribed to the Google Groups "substrait" group.
To unsubscribe from this group and stop receiving emails from it, send an email to substrait+...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/substrait/CAJ9XdSo%2BPavRAKB%3DBkHU6MWakiUzCH0dxs0qnwa2pETTRkXbCw%40mail.gmail.com.

Reply all
Reply to author
Forward
0 new messages