Hi Jacques,
Great! I know DataFusion, on which Greptime, a database startup, builds its MPP system; they told me Substrait is used to ship plan fragments. However, that is all Rust, and the components are the same on both sides. I am working on MySQL and DuckDB, both of which are C++. I noticed substrait-cpp is not as active as substrait-java and substrait-rs.
I just did a quick investigation and saw that Gluten translates a Spark stage's physical plan to Substrait, which is then executed by Velox or ClickHouse. Gluten even provides a way to fall back to vanilla Spark. I guess that is why substrait-java is so active? I can't help but wonder how Spark-Gluten-Velox handles incompatibilities between type systems. The problem is not the relational operators; it is the scalar functions and operators, and it is a non-trivial job according to Thomas Neumann and Viktor's talk, Towards Sanity in Query Languages (https://www.youtube.com/watch?v=TBAf5l1RmcA). Could you also share your insights on this?
My vision is to extend MySQL for many use cases. For example, a MySQL thread could ship a fragment to a group of workers to enable parallel execution, to an embedded DuckDB instance to enable columnar execution, or to another data processing component or computational storage system.
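A minimal sketch of that dispatch idea, written in Python for brevity rather than the C++ a real MySQL extension would need. Everything here is hypothetical: PlanFragment, FragmentDispatcher, the routing hints and the stub backends are illustrative names, not part of MySQL, DuckDB or Substrait. The payload stands in for a serialized plan fragment, and the fallback path mirrors the "fall back to the host engine" behaviour mentioned for Gluten.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class PlanFragment:
    """A serialized plan fragment plus a routing hint (both hypothetical)."""
    payload: bytes   # stands in for e.g. serialized Substrait bytes
    hint: str        # e.g. "parallel", "columnar", "storage"


# A backend is any callable that takes a fragment and returns result rows.
Backend = Callable[[PlanFragment], List[tuple]]


class FragmentDispatcher:
    """Routes fragments to registered backends, else runs them locally."""

    def __init__(self, local: Backend) -> None:
        self._local = local
        self._backends: Dict[str, Backend] = {}

    def register(self, hint: str, backend: Backend) -> None:
        self._backends[hint] = backend

    def execute(self, fragment: PlanFragment) -> List[tuple]:
        # Unknown hints go straight to the host engine.
        backend = self._backends.get(fragment.hint, self._local)
        try:
            return backend(fragment)
        except NotImplementedError:
            # The backend cannot run this fragment (for example an
            # unsupported scalar function), so fall back to the host engine.
            return self._local(fragment)


# Stub backends, only here so the sketch runs end to end.
def local_engine(fragment: PlanFragment) -> List[tuple]:
    return [("local", len(fragment.payload))]


def columnar_stub(fragment: PlanFragment) -> List[tuple]:
    return [("columnar", len(fragment.payload))]


def picky_stub(fragment: PlanFragment) -> List[tuple]:
    raise NotImplementedError("unsupported function in fragment")
```

The point of the sketch is only that the host engine stays the default and the fallback, so a backend can decline a fragment without failing the query.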
I see there is a DuckDB Substrait extension. I guess a MySQL Substrait extension would make all of that possible. What I am trying to build is probably something like Gluten for MySQL. However, MySQL's query processing makes things more challenging, because it has no clear stage boundary before execution; instead, it may interleave optimization and execution. I guess such interleaving implies fragments and the same data view across the MySQL row store and a column store.
My other questions:
1. Is substrait-cpp a good starting point? As far as I know, the DuckDB Substrait extension does not rely on substrait-cpp.
2. Is it best practice to introduce a middle layer of plan structures (something like POJOs in Java) between MySQL plan structures and Substrait plans? I found that DuckDB plan structures are well designed and easy to translate. MySQL plan structures, however, are not that easy, and they have been changing in recent years.
3. Is there any library to convert a Substrait plan to SQL text?
4. Is there any visual tool to examine a Substrait plan, just like pev2 (https://github.com/dalibo/pev2)?
5. Could a Substrait plan be extended with execution statistics, so as to resemble EXPLAIN ANALYZE?
6. Do you have any comments about Velox and DuckDB? I found that DuckDB is a hot trend but not so easy to extend into an MPP system: its pipeline execution framework assumes in-memory shared storage with tightly coupled computing logic.
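To make question 2 concrete, here is a rough sketch of the middle-layer idea, again in Python for brevity. All names are hypothetical: the `raw` dict stands in for MySQL's internal plan structures, and the output dict is only loosely inspired by Substrait relation names; it is not a real Substrait message and is not validated against the spec. The intent is that only from_engine_plan has to change when the engine's structures change, while to_wire depends only on the stable intermediate nodes.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass(frozen=True)
class Scan:
    table: str
    columns: List[str]


@dataclass(frozen=True)
class Filter:
    child: "Node"
    predicate: str  # kept as opaque text in this sketch


# The intermediate layer: small, stable, engine-independent nodes.
Node = Union[Scan, Filter]


def from_engine_plan(raw: Dict[str, Any]) -> Node:
    """Engine-specific side: the only code that knows the host's layout.

    `raw` is a made-up dict standing in for the engine's plan structures.
    """
    node: Node = Scan(raw["table"], list(raw["fields"]))
    for cond in raw.get("conditions", []):
        node = Filter(node, cond)  # stack one Filter per condition
    return node


def to_wire(node: Node) -> Dict[str, Any]:
    """Wire-format side: depends only on the intermediate nodes."""
    if isinstance(node, Scan):
        return {"read": {"table": node.table, "columns": node.columns}}
    if isinstance(node, Filter):
        return {
            "filter": {
                "input": to_wire(node.child),
                "condition": node.predicate,
            }
        }
    raise TypeError(f"unsupported node: {node!r}")
```

For example, to_wire(from_engine_plan({"table": "t1", "fields": ["a", "b"], "conditions": ["a > 1"]})) yields a nested dict with a "filter" wrapping a "read".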
Looking forward to your response!
Thanks,
Kaiwang