Version 1.0.0

Phillip Cloud

May 20, 2022, 6:18:49 PM
to substrait
Hi everyone,

I'd like to discuss when we should release 1.0.0.

There's been some discussion around whether we should wait until some later date to release version 1.0.0 or, alternatively, start treating breaking changes as major version bump-worthy. We have a couple of PRs ([1], [2]) that introduce breaking changes, and our automated release process would produce a 1.0.0 if we merge those as is.

In my mind it's best to adopt a strategy where 1.0.0 isn't special in any way. Breaking changes will be communicated at the top of the automatically generated release notes, similar to how we do it in the ibis project [3]. This takes almost all of the toil out of cutting a release regardless of its status as major, minor or patch.

To avoid a bunch of major versions in succession we would then try to batch as many breaking changes together as possible, but it's certainly possible that there are times when we have N.0.0 in one week and (N + 1).0.0 the next.

Curious to hear what people think.

-Phillip

Jacques Nadeau

May 20, 2022, 8:40:44 PM
to Substrait
Thanks for bringing this up Phillip. I'm generally of the same opinion: let's make release management as simple and low-overhead as possible. Let's just use consistent versioning and have engineers spend time engineering, not worrying about release processes and how certain numbers make people feel different.

While I agree that 1.0 can mean something special to people, I think it is easy to lie to ourselves that we can predict the future. As a point of reference, in Arrow, I believe the gap between the first discussion of "we're ready for 1.0" and the release was 18 months. This gap was theoretically to avoid having breaking changes shortly after 1.0. This is an example of a super effortful and diligent approach to this issue. The result: we released 2.0 six months after 1.0 and 3.0 three months after 2.0. In other words, even with best effort, the future was still impossible to predict. Let's stop trying.



--
You received this message because you are subscribed to the Google Groups "substrait" group.
To unsubscribe from this group and stop receiving emails from it, send an email to substrait+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/substrait/fc1c50e6-89af-4760-8618-ece7b890f9cdn%40googlegroups.com.

Weston Pace

May 20, 2022, 9:16:02 PM
to Substrait
I think the comparison with Arrow is a little unfair. I agree the
Arrow libraries may have had breaking changes in every release.
However, the columnar format has not had a single breaking change
since the 1.0.0 release. I would say the fact that a file written by
pyarrow 0.17.0 is still readable two years later is pretty important
for its success. Parquet, also, has had very few breaking changes
since 1.0.0. This is going to be especially important for Substrait.
Otherwise we are going to end up with a very real challenge of
aligning producers and consumers.

I don't really care about version numbers as much as I care about the
idea of backwards (and ideally forwards) compatibility. At some point
the project has to switch from its current, experimental state to a
state where we preserve backwards compatibility by making it harder to
make breaking changes. Let's consider a concrete example:
https://github.com/substrait-io/substrait/pull/169

This PR changes the "format" field from an enum to a oneof with a
variety of message types. There is a suggestion to reserve the
"format" field and use a new "file_format" field but that alone
wouldn't give backwards compatibility. To truly have backwards
compatibility we'd have to keep both fields in and have a default
"legacy" option in the newer field which, if set, would tell consumers
to fall back to the old behavior of checking the enum. This way a
message from an older producer would be handled just as correctly as a
message from a newer producer.
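
To make the fallback described above concrete, here is a minimal Python sketch of the consumer side. It is not Substrait's actual protobuf definition: FileEntry, LegacyFormat, LegacyMarker, ParquetOptions and resolve_format are all hypothetical names, and plain dataclasses stand in for generated protobuf classes.

```python
from dataclasses import dataclass, field
from enum import Enum


class LegacyFormat(Enum):
    """Stand-in for the old enum-valued "format" field (values are made up)."""
    UNSPECIFIED = 0
    PARQUET = 1


@dataclass
class LegacyMarker:
    """Hypothetical default "legacy" option of the new field."""


@dataclass
class ParquetOptions:
    """Hypothetical stand-in for one of the new message types."""


@dataclass
class FileEntry:
    # Old field, kept so that messages from older producers still carry meaning.
    format: LegacyFormat = LegacyFormat.UNSPECIFIED
    # New field. It defaults to the legacy marker, which is how a message
    # from an older producer (which never sets it) appears to a newer consumer.
    file_format: object = field(default_factory=LegacyMarker)


def resolve_format(entry: FileEntry) -> str:
    """Return the file format name, falling back to the old enum if needed."""
    if isinstance(entry.file_format, LegacyMarker):
        # Older producer: only the enum was populated.
        return entry.format.name.lower()
    if isinstance(entry.file_format, ParquetOptions):
        # Newer producer: the new field is authoritative.
        return "parquet"
    raise ValueError(f"unknown file_format: {entry.file_format!r}")


old_message = FileEntry(format=LegacyFormat.PARQUET)
new_message = FileEntry(file_format=ParquetOptions())
print(resolve_format(old_message), resolve_format(new_message))
```

Every consumer would have to carry a branch like the legacy one for as long as old messages are in circulation, which is the extra work discussed next.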

However, this is a lot of extra work on the consumers and I don't
think we are all that close to the point where this kind of work is
justified. I do think we will eventually get to the point where this
sort of thing is important to do.

The "1.0.0" version number is a significant tool to communicate this
shift in backwards compatibility philosophy. But if we think it is
more of a burden for some reason then I don't really care if we
stabilize at 5.0.0 or 20.0.0 or 100.0.0 as long as we agree it will be
important to stabilize at some point.

Jacques Nadeau

May 21, 2022, 5:11:39 PM
to Substrait
Weston, it's a great point on the format versus the libraries. However, I think the rigidity around the 1.0 moniker was somewhat out of touch with the actual usage of Arrow. Many people were using Arrow in production long before 1.0 and we did our best to avoid breaking changes at that time. So, to me: little changed the year before and the year after 1.0.

I really don't have much opinion about specific numbers. My goals are: 
  • We should always try to minimize changes that are breaking (at the spec or library level). 
  • The more people that are using Substrait (alternatively the older it gets), the more aversion we should feel in making breaking changes.
  • At some point, the need for breaking changes should slow down naturally.
  • The meaning of our version numbers should be clear.
  • Releases should be 100% automated and based on a predictable schedule.
  • Release frequency should slow down over time (e.g. 1w now, 1m next year, 3m year after).
Right now, as I understand it, the automated CI infrastructure is treating all breaking changes as a major version bump, all features as a minor version bump and all fixes as requiring a patch version bump. To me the main point of discussion is: do we want to change that CI infrastructure to instead only bump minors for breaking changes until we want to formally allow a major bump (1.0)? I believe this is compatible with semver specifications. If there is strong interest in switching this, we'd also need a volunteer to do the actual work. Assuming that we have someone willing to do the work and we don't have to sacrifice the simplicity we now have, I'm fine with this change. I'm also fine with (and slightly prefer) not doing it.
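
As a rough illustration of the two policies, the sketch below computes the next version number under each. It is not the project's release tooling: next_version and its pre_one_zero flag are hypothetical names chosen for this example.

```python
def next_version(current: str, change: str, pre_one_zero: bool = False) -> str:
    """Compute the next version for a change of kind "breaking", "feature" or "fix".

    With pre_one_zero=True, a breaking change on a 0.y.z version only bumps
    the minor component, which is the alternative policy under discussion.
    """
    major, minor, patch = (int(part) for part in current.split("."))
    if change == "breaking" and pre_one_zero and major == 0:
        change = "feature"  # demote: breaking changes ride on a minor bump
    if change == "breaking":
        return f"{major + 1}.0.0"
    if change == "feature":
        return f"{major}.{minor + 1}.0"
    if change == "fix":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change kind: {change!r}")


# Current behavior: a breaking change on 0.3.1 produces 1.0.0.
print(next_version("0.3.1", "breaking"))
# Alternative: the same change produces 0.4.0 until 1.0 is formally allowed.
print(next_version("0.3.1", "breaking", pre_one_zero=True))
```

The trade-off is visible in the second call: before 1.0 a breaking change and a feature yield the same kind of bump, so the version number alone no longer distinguishes them.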

What do others think? So:

1. Should we modify the CI scripts to do minor version bumps for breaking changes until a few years from now? (Arrow took 4+ to get to 1.0, I believe.)
2. Additionally, should we already slow down the scripts from every 1w to every 4w so we can avoid worrying about batching breaking changes (something that feels especially pressing with 1w releases)?
3. Do people think we should do something totally different?

Jacques


Jeroen van Straten

May 23, 2022, 2:59:40 AM
to Substrait
> 1. Should we modify the CI scripts to do minor version bumps for breaking changes until a few years from now? (Arrow took 4+ to get to 1.0, I believe.)

+1 on this, and I'm willing to have a look at the scripts to make the change (I haven't looked at them at all yet, but this shouldn't be too difficult, I'd think?). Also, this is indeed compatible with semver, as semver explicitly leaves behavior before 1.0 unspecified [1]. I guess they had a hard time getting consensus on pre-1.0, too :) Note that the first two FAQ entries are also relevant to this discussion.

> 2. Additionally, should we already slow down the scripts from every 1w to every 4w so we can avoid worrying about batching breaking changes (something that feels especially prescient with 1w releases)

My initial response to a 1-week release cadence was that it was a bit much, but pre-1.0 I don't have a strong opinion on this anymore, especially if breaking vs non-breaking can be distinguished by which version element is incremented.

Jeroen


Weston Pace

May 23, 2022, 4:22:38 AM
to Substrait
1. Weak yes

2. No. As a library user I don't really care what the release cadence
is. I'll just pick up whatever the latest release is when I need new
functionality. I'm not sure what "batching up breaking changes" means
but it sounds like extra work. Since early adopters aren't forced to
grab each and every release I think it's fine if we have a breaking
change every week. I don't see how batching will really change
anything.

3. Not really, I like the idea of using commit messages. I appreciate
buf detecting breaking changes. I am a big fan of not worrying so
much about breaking changes at this point in development.

Carlo Aldo Curino

May 24, 2022, 11:52:18 AM
to subs...@googlegroups.com
I am in love with the idea of auto-releasing, but I have some concerns. The one caveat is that eventually the # of users should hopefully exceed the # of developers. Fast-moving major releases impose a bunch of extra overhead/uncertainty on the users while making life easier for the core devs. The Hadoop-style process, with explicit (painful) releases that are then maintained for security/bugs for a while, errs in the very opposite direction: more stable/slow/curated releases that might help users but are imposing/tedious for devs. In the early days of a project I think facilitating devs is the right thing, but at some point I think the trade-off breaks the other way. Thoughts?

Thanks,
Carlo 

Phillip Cloud

May 24, 2022, 12:33:48 PM
to subs...@googlegroups.com
Jeroen, the scripts would need to be modified to grep for "fix: " or "feat: " instead of "BREAKING CHANGE: " at the first line to allow patch and minor version bumps for breaking changes, respectively. If you want to gate it on minor releases you can reduce that to just checking for "feat: ". We can also add other commit message types if feat doesn't fit. That's configured in the semantic release config file (.releaserc.json at the repo root for us). Here's an example [1].
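
For illustration only, here is a small Python sketch of the kind of commit-message check being described. It is not how the semantic release tooling is implemented: release_type and honor_breaking are hypothetical names, and the real rules live in the .releaserc.json configuration rather than in a script like this.

```python
from typing import Iterable, Optional

# Rank the release types so the strongest one found across commits wins.
RANK = {"patch": 1, "minor": 2, "major": 3}


def release_type(messages: Iterable[str], honor_breaking: bool = True) -> Optional[str]:
    """Return "major", "minor", "patch" or None for a batch of commit messages.

    With honor_breaking=False the "BREAKING CHANGE: " marker is ignored, so a
    breaking change is released according to its "feat: " or "fix: " prefix.
    """
    best = None
    for message in messages:
        first_line = next(iter(message.splitlines()), "")
        kind = None
        if honor_breaking and "BREAKING CHANGE: " in message:
            kind = "major"
        elif first_line.startswith("feat: "):
            kind = "minor"
        elif first_line.startswith("fix: "):
            kind = "patch"
        if kind is not None and (best is None or RANK[kind] > RANK[best]):
            best = kind
    return best


commits = [
    "feat: change format to a oneof\n\nBREAKING CHANGE: format is no longer an enum",
    "fix: correct a typo in the docs",
]
print(release_type(commits))                        # marker honored
print(release_type(commits, honor_breaking=False))  # marker ignored
```

With the marker honored, the batch above is a major release; with it ignored, the same commits produce a minor release because the breaking commit is prefixed with "feat: ".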

Weston, I agree with your 3 and am also of the opinion that a weekly cadence is fine for now. Batching isn't anything complicated; it just means that the release cadence is slower so that there'd be at least $SOME_TIME_PERIOD_LONGER_THAN_A_WEEK to pick up breaking changes.

Carlo, I think you're on point there. That said, I'm not entirely sure whether we should make a decision right now about how to release Substrait in the far future. I think there may come a time when we need to move away from automatic releases to a "click-button" model, albeit using the same scripts, commit conventions, etc., but we don't have to make that decision right now IMO.


Weston Pace

May 25, 2022, 5:19:10 AM
to Substrait
> I think there may come a time when we need to move away from automatic releases to a "click-button" model, albeit using the same scripts and commit conventions etc, but we don't have to make that decision right now IMO.

I'm not sure this point is related to automatic releases as much as it
is related to how much work we do to avoid breaking changes. If we
are consistent with our naming and work hard to avoid accidentally
releasing a breaking change without labeling it then I think we could
still have automatic weekly releases far into the future. The only
change is that there would be fewer and fewer "breaking change" PRs as
we got further along.

Carlo Aldo Curino

May 25, 2022, 4:13:21 PM
to substrait
+1 on Phil's response to my comments. I think it is important to add clear "how to contribute" documentation to the project (we were discussing this with Subru/Jesus/Ashvin), so that folks know how to use those "tags" at the beginning of commits etc.