Feedback on 0.20 release process

Francesco Guardiani

Jan 13, 2021, 11:14:11 AM
to knati...@googlegroups.com
Hi everybody,
Kenjiro and I managed the 0.20 release, and I wanted to provide some feedback here about the release process. I want to use this mail as a starting point: it includes some "gotchas" we learnt from this release and ideas on how we can improve the next ones, so that we can then split the various items into issues in the respective repos for further discussion. Kenjiro and I have already opened a PR updating and expanding the release guide [0].

I'll cover in particular my experience releasing eventing, and I encourage Kenjiro to add any additional thoughts below.

Bad timing
One of the biggest pains of this release, from the very beginning, was the bad timing. Yeah, I know: I'm among those who asked to move the release to after the holidays and to push the pkg cut two days further, and I have to admit it was a bad idea. The release process requires merging some things 14 days before the release [1], and we ended up spending the days right before the pkg cut merging all of that and pinging people to get these trivial PRs in. Some people were also still enjoying their holidays during the week of the pkg cut, which slowed down merging the PRs required for the release.

I actually believe we should have made the release before the holidays; not on December 22nd, but a week earlier, say December 15th. That would have guaranteed the availability of people to help the leads in case something broke (and it actually did), as well as their availability in the two weeks before the release to help with all the additional work needed to prepare it. Lesson learnt for the next release that falls across the holidays :)

We need more powah!
The release process today assumes that maintainers are always available to merge the PRs required to perform the release (deps alignment, actions updates, eventual last-minute fixes). Now, I know the bad timing amplified this issue, but in my opinion the assumption is wrong anyway: we are an open source project, and it's completely reasonable for a maintainer to disappear for a day, a week, or an entire year. We very often ended up pinging people and waiting for them to merge no-brainer PRs, which honestly makes no sense, both for us and for the maintainers. I also had to ping Ville very often to perform additional tasks I didn't have the rights to do, in contexts where I had to fix broken cuts/broken repos. In particular, I found out that, both for the ordinary release process and for "extraordinary" fixes, I needed these rights:
  • lgtm and approver rights in every knative & knative-sandbox repo, to merge all the no-brainer PRs where maintainers were not available
  • HackMD rights to create the document with the release notes. In this case we just ended up overwriting the previous one
  • Rights to trigger Prow jobs, like re-triggering auto-release jobs, nightly releases, dot releases, etc. I needed that to fix eventing-awssqs and eventing-kafka
  • Rights to delete a release from GitHub (we had to do that to fix eventing-awssqs, which was cut but broken). It seems I had rights to delete tags in some repos, but I didn't have rights to delete the GitHub Release itself, and removing the tag was not enough
  • Rights to delete a release branch (in some cases) and rights to force-push to it (in some cases). We needed that to fix the broken eventing-kafka cut, which we solved by making the release-0.20 branch "dirty": its first commit doesn't point to the release tag
Some of these are trivial, but for others I guess we need to put proper automation in place. To give an idea of what these "extraordinary" fixes looked like, there's a sketch of the cleanup commands right below.
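A minimal sketch of that cleanup, using the GitHub CLI; the repo and ref names match the cases above, but the exact commands are my reconstruction, not what we literally ran:

    # Delete a broken GitHub Release (deleting the tag alone is not enough):
    gh release delete v0.20.0 --repo knative-sandbox/eventing-awssqs -y

    # Delete the remote tag and, when needed, the broken release branch:
    git push origin --delete v0.20.0
    git push origin --delete release-0.20

    # Re-push a fixed branch (requires force-push rights on release branches):
    git push --force origin HEAD:release-0.20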

Things that went wrong
A hack PR merged at the wrong moment broke the automation in the eventing repo
A PR touching library.sh was merged in hack [2] after the pkg cut and before the eventing cut, and it was buggy. Its changes were then propagated to eventing [3], breaking the automation that takes care of the deps update [4]. To fix that, I had to revert it on the release day in order to re-trigger the automation properly [5]. The real problem is that nobody should have merged the deps update between the pkg cut day and the actual release day. I'm not blaming whoever merged those PRs; I guess none of us, including me, were aware of the issue. This leads me to think we have a process problem here: luckily this is a time of the year with few contributions across all our repos, but when this community is at full speed this issue could cause serious problems. Some solutions I have in mind:
  • We need some system to code-freeze between the pkg cut and the other repos' cuts, with a unique "gatekeeper" (e.g. the release leads) who checks and accepts only PRs that don't hurt the release process
  • We cut everything on the same day, from hack up to eventing-autoscaler-keda. This avoids the "time gap" between one cut and the next, so we can be pretty sure nobody merges PRs with dep updates in the middle
  • The release leads execute the deps alignment after cutting the branch. This lets us "revert" whatever happened on master, in case some dep update misaligned the deps for the release (a sketch follows this list)
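For the last option, a minimal sketch of what the post-cut alignment could look like, assuming the repo follows the usual Knative hack scripts (the exact flags are illustrative):

    # On the freshly cut release branch, pin all knative.dev deps to the release:
    git checkout release-0.20
    ./hack/update-deps.sh --upgrade --release 0.20
    git diff go.mod   # verify nothing still points at master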
Automation has some bugs
There is a bug in the automation that flagged some repos as ready to be cut [6] when they were not [7], because they were pointing at the wrong eventing release. A similar bug hit eventing-awssqs, making it depend on hack master because it was an indirect dependency. I'm going to file an issue for both of these bugs on test-infra, and I'll add a check to the guide: before cutting, verify that the go.mod actually makes sense.
To fix them, I asked Ville to remove the GitHub Release, I removed the branch and the tag, and I manually fixed these repos (forcing eventing 0.20 in the go.mod by hand); after that PR got merged, I asked Ville to re-trigger the Prow auto-release job. I'll write this process down in the release guide; it might help future release leads with similar issues. A sketch of the go.mod fix is right below.
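A minimal sketch of what "forcing eventing 0.20 in the go.mod manually" could look like; the module path and tag are the real ones, but the exact steps are my reconstruction:

    # Pin the eventing dependency to the released tag instead of a master commit:
    go mod edit -require=knative.dev/eventing@v0.20.0
    go mod tidy            # recompute indirect deps (e.g. drop the hack@master one)
    ./hack/update-deps.sh  # re-vendor, assuming the standard Knative script layout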

Nightly release started failing after the cut
Eventing kafka had a problem in its configs: the release artifact included the creation of the knative-eventing namespace. The nightly release script tries to apply the nightly release artifact to the same cluster previously used for tests, but during the test teardown kubectl delete -f kafka-source-artifact.yaml was executed, removing the knative-eventing namespace and causing the apply of the nightly release artifact to fail. It would have been nice to catch this issue earlier, by running the nightly script some days before, based on master.
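To make the failure mode concrete, here is a sketch of the sequence (file names are illustrative, assuming the artifact contains a Namespace object):

    # The artifact creates the knative-eventing namespace among its manifests:
    kubectl apply -f kafka-source-artifact.yaml      # install for the tests

    # Test teardown deletes everything in the artifact, namespace included:
    kubectl delete -f kafka-source-artifact.yaml

    # The nightly script then applies the release artifact to the same cluster;
    # it fails because the namespace is gone or still terminating:
    kubectl apply -f nightly-release-artifact.yaml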

The state of automation
I have no experience of what releasing Knative was like before the new release process, but as it stands now, I think we're in good shape, given the complexity we're facing. Still, I think we can improve, and I propose a list here:
  • A manually triggered action to update the releasability configuration in the .github repo. There's no need to do that by hand. I already opened an issue for this in the .github repo
  • Triggering knobots auto-updates per repo. There were times when we did the deps update process manually, because in knobots you can only trigger the auto-update job for every Knative repo at once, and that takes around an hour and a half. When I need to re-trigger the deps update, maybe to fix some problem, I would prefer to let the automation create the update-deps PR for me, without waiting an hour and a half; hence my suggestion. I'm going to open an issue for that in the knobots repo.
  • More and better automation for the release notes. Either we completely automate them (and fix them afterwards), so that when Prow creates the release it runs the release-notes script and attaches the output to the GitHub release, or we find a better way to collaborate on them per repo (some repos, like the sources, don't really need them, while some "bigger" downstream repos, like kafka, do). I'll open an issue for that; there's a sketch of the automated flavour right below.
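For the fully automated flavour, a minimal sketch of what the generation step could look like, using the Kubernetes release-notes generator as an example (the tool choice and flags are my assumption, not what our scripts do today):

    # Needs a GITHUB_TOKEN with read access; the SHAs mark the previous and current cuts:
    export GITHUB_TOKEN=<token>
    go run k8s.io/release/cmd/release-notes@latest \
      --org knative --repo eventing \
      --start-sha "$(git rev-list -n1 v0.19.0)" \
      --end-sha "$(git rev-list -n1 v0.20.0)" \
      --output release-notes.md
    # Prow could then attach release-notes.md to the GitHub release it creates.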
What I generally struggled with is that we have two CI systems, one doing the checks and one doing the actual release, with tons of bash scripts in the middle. We even have separate repos holding GitHub Actions, Prow jobs, knobots jobs, etc. Sometimes it was hard for me to grasp which tool was doing what, and how, which limited my ability to solve issues. My take is that it would be nice if, at some point, we managed to transition to a single tool that goes from the click to the actual release without asking for human intervention in the middle (branch cutting, stamping deps-pinning PRs, etc.). This would let the release lead better understand what's going on, and at the same time a simple system where each step is well defined helps with troubleshooting and "manual triggers" when something goes wrong.


I hope this is useful and helps improve this process for future releases. I'm available for further clarification and willing to help solve the above problems.

FG


--

Francesco Guardiani

Software Engineer

Red Hat Srl

fgua...@redhat.com   

David Protasowski

Jan 13, 2021, 2:22:08 PM
to Francesco Guardiani, knati...@googlegroups.com
this is great

> What I generally struggled with is that we have two CI systems, one doing the checks and one doing the actual release, with tons of bash scripts in the middle. ...

I've been pushing for a knative/release repo (https://github.com/knative/community/issues/360) as a central spot for consolidating and automating our releases. Then we could have workflows that create PRs like this for you automatically.

- dave

Evan Anderson

Jan 13, 2021, 2:43:04 PM
to David Protasowski, Francesco Guardiani, knati...@googlegroups.com
This would also be a good place to put all the release documentation, both about how to run the process and about what the outputs of the process are.

For example, I believe all our release yamls are copied to RELEASE_GCS_BUCKET, but I don't think it's documented anywhere that this is a better place to bulk-pull yaml from in order to avoid GitHub rate limits...
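As a hypothetical example, assuming the bucket is gs://knative-releases (the exact path layout here is illustrative):

    # Bulk-copy a release's yaml straight from GCS instead of the GitHub release page:
    gsutil cp gs://knative-releases/eventing/previous/v0.20.0/eventing.yaml .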

Francesco Guardiani

Jan 13, 2021, 4:35:28 PM
to Evan Anderson, David Protasowski, knati...@googlegroups.com
> I've been pushing for a knative/release repo (https://github.com/knative/community/issues/360) as a central spot for consolidating and automating our releases. Then we could have workflows that create PRs like this for you automatically.

I've opened an issue for exactly that: https://github.com/knative-sandbox/.github/issues/83, thanks!

The problem is not really about repos; it's that we have one piece of automation whose job is to make another piece of automation happy, glued together by bash scripts whose inner workings are known only to a bunch of us. If I could choose, I would rather focus on having a single CI, with a set of tools people can understand, contribute to, and diagnose, than keep increasing the amount of automation that exists to make one tool or another happy.
TBH I don't think we need yet another repo; I'd rather centralize all the stuff we have in a single spot in one of our existing repos.

Francesco Guardiani

Feb 12, 2021, 2:55:46 AM
to Evan Anderson, David Protasowski, knati...@googlegroups.com
Hey, I'm still waiting for a review and a merge here: https://github.com/knative/pkg/pull/1982

Do you have any additional feedback on the PR?