Thoughts:
I want to remind everyone of the regression label that I added to GitHub somewhat recently. My hope is that this label will help us zoom out and look at patterns in our regressions for the sake of identifying strategies to help with stability.
Pavish gave a list of reasons why our E2E tests didn’t work out, but there’s another important point I’d like to add: our API requests were regularly taking longer than 5 seconds to resolve, and the Python version of Playwright had a 5-second timeout which we were unable to easily configure. I’d wager that this combination of factors was the underlying culprit responsible for the vast majority of our test flakiness. Now that we’ve done some work to improve backend performance, we would be better positioned to pick up this same E2E testing pattern again.
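If we do pick this pattern up again, it's probably worth re-checking whether those timeouts can be raised now. Here's a minimal sketch of what I'd expect that to look like with the pytest-playwright `page` fixture — the URL and selector are made up, and this is an assumption about the current Playwright Python API rather than something I've verified against our setup:

```python
# Minimal sketch, not our actual harness. Assumes pytest-playwright's `page`
# fixture, and that the 5 s limit we hit was the default assertion timeout.
from playwright.sync_api import Page, expect

# Raise the default assertion timeout (Playwright's default is 5000 ms).
expect.set_options(timeout=30_000)


def test_home_page_loads(page: Page) -> None:
    # Raise the per-page defaults for actions and navigation as well.
    page.set_default_timeout(30_000)
    page.set_default_navigation_timeout(30_000)

    page.goto("http://localhost:8000/")  # hypothetical dev URL
    expect(page.get_by_role("heading", name="Schemas")).to_be_visible()
```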
In general I’m a little wary of picking up E2E tests again though. Even with improvements in architecture and approach, I’m still worried that E2E tests might not bring sufficient benefit to outweigh their cost. I recognize that one of the selling points of E2E tests is that we can write a test once and run it many times thereafter. We incur cost now and reap the benefits in the future. E2E tests are an investment. But I don’t think that necessarily means they’re a prudent investment right now. My gut feeling is that, given our early stage and lack of users, we would see better return on investment by building features instead of tests. Once we get to a point where we’re seeing complaints from users about regressions, then I’d be more inclined to invest.
But I think Pavish has some good ideas, and I’d be curious to explore them in more detail. If we could find a way to reduce the cost of building and maintaining E2E tests, then I’d find it more compelling.
I think it might be nice to incorporate more QA testing into our release process. Beyond simple regressions, QA reveals other bugs too, especially when performed by different people. That’s useful! I like Brent’s idea of hiring this sort of work out. Compared to E2E testing, I think QA testing brings more short-term benefits and fewer long-term benefits. Given our situation, I think that’s okay though.
Another idea that I’ve floated previously is to have a longer grace period (e.g. 3 or 4 weeks) between cutting the release branch and publishing the release. During this time, we would not merge any feature work into the release branch, but we would be free to merge bug-fixes that we deem to be small and/or important. The work merged into the release branch would get merged into develop thereafter, giving all of us time to work on top of the release branch for a while and organically discover regressions ourselves. To some extent, we’d discover regressions through using the product for development. But ideally we’d also be dogfooding the product more heavily in order to give ourselves an even greater chance of identifying regressions.
Questions:
Pavish: would you still want to write E2E tests in Python? If not, how would we perform the setup/teardown tasks that we were previously using pytest to perform?
Pavish: would you still want the E2E tests to use the real API and a real DB? Or would you want to use some sort of mocking/stubbing system?
FE integration tests:
- They would be granular, less likely to be flaky, and would mainly be helpful for us to move fast and not break things.
- They would mock all interactions with the API and backend.
- They would be FE-specific, written in TS, and run only on the FE stack. We would use vitest for them.
- They would test our pages and interactions/integration between our FE components.
- The FE team would be responsible for maintaining these tests.
E2E tests:
- They would be primarily useful in finding regressions.
- They would be in Python and test the entire stack (see the sketch after this list).
- There would be a limited number of tests, and we could even move them to a different repo if needed.
- They would run daily (or even only during the release process). They would not run on all PRs.
- The entire team (FE and BE) would be responsible for maintaining these tests.
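To make that E2E layer concrete, here's a rough sketch of what one of these limited full-stack tests might look like, assuming pytest + pytest-playwright pointed at a real running backend and DB. The base URL, selectors, and fixture are hypothetical placeholders, not a proposal for specific tests:

```python
# Rough sketch only. Assumes pytest + pytest-playwright running against a real
# backend and database. The base URL, selectors, and fixture are hypothetical.
import os

import pytest
from playwright.sync_api import Page, expect

BASE_URL = os.environ.get("E2E_BASE_URL", "http://localhost:8000")


@pytest.fixture
def app(page: Page):
    # Setup: start each test from the home page of a real running instance.
    # Teardown (after the yield) is where we'd clean up any test data we created.
    page.goto(BASE_URL)
    yield page


def test_create_schema_end_to_end(app: Page) -> None:
    # Exercises the FE, the real API, and the real DB in a single pass.
    app.get_by_role("button", name="New Schema").click()
    app.get_by_label("Name").fill("e2e_smoke_test")
    app.get_by_role("button", name="Save").click()
    expect(app.get_by_text("e2e_smoke_test")).to_be_visible(timeout=30_000)
```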