Thoughts:
I want to remind everyone of the regression label that I added to GitHub somewhat recently. My hope is that this label will help us zoom out and look at patterns in our regressions for the sake of identifying strategies to help with stability.
Pavish gave a list of reasons why our E2E tests didn’t work out, but there’s another important point I’d like to add: our API requests were regularly taking longer than 5 seconds to resolve, and the Python version of Playwright had a 5-second timeout which we were unable to easily configure. I’d wager that this combination of factors was the underlying culprit responsible for the vast majority of our test flakiness. Now that we’ve done some work to improve backend performance, we would be better positioned to pick up this same E2E testing pattern again.
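If we do pick this pattern up again, it's probably worth re-checking whether those timeouts can be raised now. Here's a minimal sketch of what I'd expect that to look like with the pytest-playwright `page` fixture — the URL and selector are made up, and this is an assumption about the current Playwright Python API rather than something I've verified against our setup:

```python
# Minimal sketch, not our actual harness. Assumes pytest-playwright's `page`
# fixture, and that the 5 s limit we hit was the default assertion timeout.
from playwright.sync_api import Page, expect

# Raise the default assertion timeout (Playwright's default is 5000 ms).
expect.set_options(timeout=30_000)


def test_home_page_loads(page: Page) -> None:
    # Raise the per-page defaults for actions and navigation as well.
    page.set_default_timeout(30_000)
    page.set_default_navigation_timeout(30_000)

    page.goto("http://localhost:8000/")  # hypothetical dev URL
    expect(page.get_by_role("heading", name="Schemas")).to_be_visible()
```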
In general I’m a little wary of picking up E2E tests again though. Even with improvements in architecture and approach, I’m still worried that E2E tests might not bring sufficient benefit to outweigh their cost. I recognize that one of the selling points of E2E tests is that we can write a test once and run it many times thereafter. We incur cost now and reap the benefits in the future. E2E tests are an investment. But I don’t think that necessarily means they’re a prudent investment right now. My gut feeling is that, given our early stage and lack of users, we would see better return on investment by building features instead of tests. Once we get to a point where we’re seeing complaints from users about regressions, then I’d be more inclined to invest.
But I think Pavish has some good ideas, and I’d be curious to explore them in more detail. If we could find a way to reduce the cost of building and maintaining E2E tests, then I’d find it more compelling.
I think it might be nice to incorporate more QA testing into our release process. Beyond simple regressions, QA reveals other bugs too, especially when performed by different people. That’s useful! I like Brent’s idea of hiring this sort of work out. Compared to E2E testing, I think QA testing brings more short-term benefits and fewer long-term benefits. Given our situation, I think that’s okay though.
Another idea that I’ve floated previously is to have a longer grace period (e.g. 3 or 4 weeks) between cutting the release branch and publishing the release. During this time, we would not merge any feature work into the release branch, but we would be free to merge bug-fixes that we deem to be small and/or important. The work merged into the release branch would get merged into develop thereafter, giving all of us time to work on top of the release branch for a while and organically discover regressions ourselves. To some extent, we’d discover regressions through using the product for development. But ideally we’d also be dogfooding the product more heavily in order to give ourselves an even greater chance of identifying regressions.
Questions:
Pavish: would you still want to write E2E tests in Python? If not, how would we perform the setup/teardown tasks that we were previously using pytest to perform?
Pavish: would you still want the E2E tests to use the real API and a real DB? Or would you want to use some sort of mocking/stubbing system?
FE integration tests:
- They would be granular, less likely to be flaky, and would mainly be helpful for us to move fast and not break things.
- They would mock all interactions with the API and backend.
- They would be FE-specific, written in TS, and run only on the FE stack. We would use vitest for them.
- They would test our pages and interactions/integration between our FE components.
- The FE team would be responsible for maintaining these tests.
E2E tests:
- They would be primarily useful in finding regressions.
- They would be in Python and test the entire stack (see the sketch after this list).
- There would be a limited number of tests, and we could even move them to a different repo if needed.
- They would run daily (or even only during the release process). They would not run on all PRs.
- The entire team (FE and BE) would be responsible for maintaining these tests.
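To make that E2E layer concrete, here's a rough sketch of what one of these limited full-stack tests might look like, assuming pytest + pytest-playwright pointed at a real running backend and DB. The base URL, selectors, and fixture are hypothetical placeholders, not a proposal for specific tests:

```python
# Rough sketch only. Assumes pytest + pytest-playwright running against a real
# backend and database. The base URL, selectors, and fixture are hypothetical.
import os

import pytest
from playwright.sync_api import Page, expect

BASE_URL = os.environ.get("E2E_BASE_URL", "http://localhost:8000")


@pytest.fixture
def app(page: Page):
    # Setup: start each test from the home page of a real running instance.
    # Teardown (after the yield) is where we'd clean up any test data we created.
    page.goto(BASE_URL)
    yield page


def test_create_schema_end_to_end(app: Page) -> None:
    # Exercises the FE, the real API, and the real DB in a single pass.
    app.get_by_role("button", name="New Schema").click()
    app.get_by_label("Name").fill("e2e_smoke_test")
    app.get_by_role("button", name="Save").click()
    expect(app.get_by_text("e2e_smoke_test")).to_be_visible(timeout=30_000)
```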