State of the Strait

Weston Pace


Mar 23, 2023, 3:33:33 PM

to Substrait

I've compiled a short summary of Substrait activity over the past three months (Jan 1 - Present, 2023). I plan on doing this roughly every quarter so that we can see levels of engagement, trends, etc. These numbers probably won't be extremely useful until we have collected them more regularly.

# Headline Statistics

In this interval there have been 83 commits and 1,037 comments (in both issues and PRs).

63 of these commits and 448 of these comments were from non-committers.

During the previous three-month period there were 65 commits.
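
For anyone who wants the proportions rather than the raw counts, here is a small sketch that derives them. It uses only the figures quoted above; the variable and function names are just for illustration.

```python
# Figures quoted in this section (Jan 1 - Mar 23, 2023).
commits_total = 83
commits_non_committers = 63
comments_total = 1037
comments_non_committers = 448
commits_previous_period = 65


def share(part: int, whole: int) -> float:
    """Return part / whole expressed as a percentage."""
    return 100.0 * part / whole


# Share of activity that came from non-committers.
commit_share = share(commits_non_committers, commits_total)
comment_share = share(comments_non_committers, comments_total)

# Relative change in commit count versus the previous three-month period.
commit_growth = share(commits_total - commits_previous_period, commits_previous_period)

print(f"Non-committer share of commits:    {commit_share:.1f}%")
print(f"Non-committer share of comments:   {comment_share:.1f}%")
print(f"Commit growth vs. previous period: {commit_growth:.1f}%")
```

Roughly three quarters of the commits and a bit under half of the comments came from non-committers, and the commit count is up by a little over a quarter on the previous period.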

# Distribution of Activity

The active repos are substrait, substrait-java, substrait-go, substrait-cpp, and substrait-rs.

Most of the activity appears to be around building up initial support libraries for the various languages.

Although I don't have statistics for this (we don't have organizational information for non-committers) I do believe that most of the activity is coming from Voltron Data or Intel. It would be nice to diversify this.

# Rising Stars

The following were the most active non-committers during this period. These raw numbers need to be taken in context with less tangible factors (and not all commits or comments are equal), but they are useful information when considering new committers:

EpsilonPrime Commits: 15 Comments: 144
chaojun-zhang Commits: 11 Comments: 44
zeroshade Commits: 10 Comments: 35
mbrobbel Commits: 8 Comments: 30
rok Commits: 2 Comments: 32
ianmcook Commits: 2 Comments: 28
vbarua Commits: 6 Comments: 23
JkSelf Commits: 0 Comments: 28
wjones127 Commits: 3 Comments: 9

# Feedback Welcome

Is this format useful? Is there other information we can display? Is there any reason we shouldn't be reporting these statistics? Please feel free to reply with any feedback.

Jacques Nadeau


Mar 27, 2023, 7:00:19 PM

to subs...@googlegroups.com

Thanks for compiling this, Weston. This seems like a great place to begin having these reports. (And I appreciate the name "State of the Strait".)

I think there are a few key things that would be good for the evolution of the project.

- Diversity: I agree with your comments here. I also think it is important to think about the broader Substrait community as well. If people are integrating Substrait from other projects, that is key to our success, even if they aren't initially contributing patches.

- Content pivot: We've talked several times about shifting the website more towards "using Substrait" content including linking to projects, etc.

- SMC: We need to look at how we can energize a broader set of committers to become SMC members. Having the SMC limited to Voltron and Sundeck isn't where I think we hoped we would be by now.

Any thoughts on ways to broaden the community?

What do you think are the key things we need to improve? (Your initial email was mostly neutral but I'd love to hear your thoughts on what else we need to improve beyond diversity.)

--
You received this message because you are subscribed to the Google Groups "substrait" group.
To unsubscribe from this group and stop receiving emails from it, send an email to substrait+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/substrait/CAE4AYb3mKi8SJnudhvFMQSMMzpib67H1q9o8vdknKSxotTTuyw%40mail.gmail.com.

Weston Pace


Mar 29, 2023, 1:14:51 PM

to subs...@googlegroups.com

I've thought about this a bit. I think there are two big challenges for growth:

# It's hard to get started

This is especially true if you don't have a relational algebra background. Plans are unreadable. There is no visualization. There are not a lot of demos and examples out there that people can refer to. I think there is some good work being done in this space (text format, fiddle, improving documentation, etc.) and there is still more that could be done. I am optimistic here.

# Substrait is not solving any real problems (yet)

Substrait is not yet doing anything that SQL can't do. There are certainly problems that Substrait can uniquely solve but those tools don't exist yet. A few areas of interest to me are:

* Substrait function options can be used to help catalogue SQL dialects. Substrait plans can be used to help translate between dialects and to indicate when that translation isn't possible. However, we need to actually catalogue these dialects for this information to be useful. This is something I am just starting to work on.

* Substrait can be used as the output of an optimizer / transformer and not just the input. However, without an existing optimizer or transformer, this isn't very useful. I don't have the ability (time, energy, resources) to single-handedly write an optimizer (I've since learned this is a huge task), but I think some work is being done on initial transformation tools, and I think that will be very valuable.


Andy Grove


Mar 29, 2023, 1:27:14 PM

to subs...@googlegroups.com

DataFusion (Python & Rust) has the ability to read Substrait plans (with limitations) and generate Graphviz visualizations. If this is of interest to anyone, I can share some sample code.

DataFusion, in theory, could be used today to read a Substrait plan, optimize it, and write out a new plan. Again, I can provide an example of this if it is of interest.

Thanks,

Andy.


Andrew Lamb


Mar 29, 2023, 5:18:38 PM

to subs...@googlegroups.com

> # Substrait is not solving any real problems (yet)

In my opinion, the most exciting use case for Substrait is as a description for "pushdown" computation to enable heterogeneous data processing. Imagine if your analytic system could push filtering (or aggregation, etc.) down into whatever hardware happened to be available that was storing the host data. Maybe into AWS Athena, maybe directly onto an embedded CPU in flash storage, maybe into a Ceph object system, etc.

Today there is a 1:N problem (for each such data source I need to write a connector for my analytics engine).

If I could use Substrait to make it 1:1 that would be game changing:

1. Database makes substrait plan

2. The data sources each have a substrait consumer for the different data systems
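
The two steps above can be sketched roughly as follows. This is only an illustration of the pattern, not a real implementation: `Plan` is a stand-in for a serialized Substrait plan, and `SubstraitConsumer`, `ObjectStoreSource`, `FlashStorageSource`, `make_plan`, and `push_down` are all hypothetical names that do not come from any Substrait library.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Plan:
    """Stand-in for a serialized Substrait plan (hypothetical, not the real protobuf)."""

    table: str
    filter_expr: str


class SubstraitConsumer(Protocol):
    """The single interface every data source implements (assumed for this sketch)."""

    def execute(self, plan: Plan) -> str: ...


class ObjectStoreSource:
    """Hypothetical consumer running next to an object store."""

    def execute(self, plan: Plan) -> str:
        return f"object-store: scan {plan.table} where {plan.filter_expr}"


class FlashStorageSource:
    """Hypothetical consumer running on an embedded CPU in flash storage."""

    def execute(self, plan: Plan) -> str:
        return f"flash: scan {plan.table} where {plan.filter_expr}"


def make_plan(table: str, filter_expr: str) -> Plan:
    """Step 1: the database produces one plan, independent of the data source."""
    return Plan(table=table, filter_expr=filter_expr)


def push_down(plan: Plan, sources: list[SubstraitConsumer]) -> list[str]:
    """Step 2: each data source consumes the same plan with its own consumer."""
    return [source.execute(plan) for source in sources]


plan = make_plan("events", "ts > '2023-01-01'")
results = push_down(plan, [ObjectStoreSource(), FlashStorageSource()])
for line in results:
    print(line)
```

The point of the sketch is that the engine only ever writes `make_plan`; adding a new data source means adding one consumer, not one connector per engine.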

> Substrait can be used as the output of an optimizer / transformer and not just the input.

As mentioned above, this is the most compelling use case for Substrait that I see.

Hope that helps,

Andrew


Weston Pace


Mar 30, 2023, 12:17:08 AM

to Substrait

> DataFusion, in theory, could be used today to read a Substrait plan, optimize it, and write out a new plan. Again, I can provide an example of this if it is of interest.

I'd be very interested. I'll admit, I am probably too ignorant of DataFusion and its capabilities. I'm looking forward to Andrew's presentations on DataFusion (I'll miss tomorrow's due to a conflict but hope to check out the recording).

> In my opinion, the most exciting use case for Substrait is as a description for "pushdown" computation to enable heterogeneous data processing. Imagine if your analytic system could push filtering (or aggregation, etc.) down into whatever hardware happened to be available that was storing the host data. Maybe into AWS Athena, maybe directly onto an embedded CPU in flash storage, maybe into a Ceph object system, etc.

This is very exciting. Thank you for bringing it up. There is some interesting work going on here (https://users.soe.ucsc.edu/~carlosm/dev/news/20220909/) though I'm not sure yet exactly what we can do to help that work move forward.


Carlo Aldo Curino


Mar 30, 2023, 3:27:39 AM

to subs...@googlegroups.com

+1 on this. The pattern you describe also applies to other parts of the stack. E.g., we are looking to leverage QO components across engines etc.

Thanks,

Carlo

