Our backend development velocity has been bogging down. I think most
backend devs are quite aware of this, and others are probably feeling
the effects of the problem. We've also been plagued by difficult bugs
in the backend code pretty regularly, and solving those bugs has
become quite complicated. As an example, at one point during Cycle 3,
we had all 3 backend devs working on the same bug regarding handling
SQLAlchemy MetaData instances, and it still took multiple (even many)
days to solve. Another example is a bug Sean found.
We also have numerous performance issues slowing down the API
response. These issues will get worse with larger data, or more
complicated database setups.
1. The problem
I believe the common culprit behind the bugs mentioned above, the
general difficulty of adding new backend features, as well as the slow
API response times, is state. Specifically, the main problem is our
reliance on mutable state, managed by the web service, which
supposedly represents the state of the user's database (which is also,
of course, mutable). Examples are persisted Django models for columns,
tables, constraints, etc., and SQLAlchemy class instances representing
the same objects. With our current architecture, getting an accurate
list of tables and their metadata requires syncing up multiple model
instances, SQLAlchemy class instances, and making sure these are in
sync with the actual database. There are even more pieces of mutable
state which must align if we want to actually *do* anything with the
list of tables (select from one, join two, etc.). I think we've
developed a machine that is too complex for its feature set, and very
little of the complexity is actually caused by the features.
Complicated pieces of machinery which are mostly wrangling identifiers
and other state, but don't add features, include (but aren't limited to):
- Most of the type infrastructure is necessary only to manage syncing
up the comical number of different ways a single DB type needs to be
represented in order to succeed at actually using that type in
different contexts. We still crash into bugs semi-regularly related to
types and how they're addressed.
- The reflection infrastructure is only needed so that we can sync
various parts of the web service up with the actual user database
state. It's complicated and bug-prone.
- Various utility functions such as _prefetch_metadata_side_effector,
_replace_column_names_with_ids, and _use_correct_column_identifier.
These examples don't even get into the numerous places where we have
to ad-hoc re-reflect the DB state into some metadata object to get
some SQLAlchemy class instances to play nicely with each other.
2. Example solution architecture
Caveat: I do NOT want to propose a system-wide refactor at this
juncture. I think, however, that having a defined architecture that we
want to move towards would be a massive benefit, and help us make
implementation decisions for new features.
First, we should use DB identifiers (OIDs, or (OID, ATTNUM) pairs) to
identify database objects at the API. This would remove the main
motivation for many of our state-wrangling models.
Then, I think we should reduce to 3 models:
- Query
- User
- DBObjectMetadata
The Query model would change little from its current form, and the
User model would be pretty standard.
The DBObjectMetadata model would have 3 fields: oid, attnum, and
metadata. These are inspired by the fields of the pg_description
system catalog. The oid would be an integer, as would the attnum. The
metadata field
would be JSON. The purpose of this model is to persist any
Mathesar-specific metadata about various DB objects (e.g., display
options). Lookups of instances of this model should go from a
DB-identifier (e.g., an (oid, attnum) pair) to the metadata. For most
DB objects, the oid would suffice for the lookup. For columns, we'd
need both the oid of the containing table, as well as the column's attnum.
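To make the lookup semantics concrete, here is a pure-Python sketch of the (oid, attnum) keying the DBObjectMetadata model would provide. The class and method names are illustrative stand-ins, not the actual Django model:

```python
from dataclasses import dataclass, field

@dataclass
class MetadataStore:
    """Toy stand-in for DBObjectMetadata lookups, keyed on (oid, attnum)."""
    _rows: dict = field(default_factory=dict)

    def set(self, oid, attnum=None, **metadata):
        # attnum=None covers table-level objects (tables, constraints);
        # column metadata needs both the table's oid and the attnum.
        self._rows.setdefault((oid, attnum), {}).update(metadata)

    def get(self, oid, attnum=None):
        return self._rows.get((oid, attnum), {})

store = MetadataStore()
store.set(16385, attnum=2, display_options={"currency": "USD"})  # a column
store.set(16385, comment="Payments table")                       # the table itself
```

Note that the web service never stores the object itself here, only Mathesar-specific metadata hung off a DB-provided identifier.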
We should treat any database object resource as wholly owned by and
stored in the relevant user database. That is, if a user sends a POST
to the Tables endpoint, the only thing persisted should be a table in
the user database. The associated DBObject metadata should be created
using the oid of the created table, and should not be dependent on
anything else. Same for columns, constraints, etc. This certainly goes
for records as well (but we're already doing this). If a user GETs a
list of tables in a schema, we should directly query the user's
database, and return the results (in an appropriate format).
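As a sketch of what "directly query the user's database" could mean for a table listing, the handler would build a query against PostgreSQL's pg_class and pg_namespace catalogs and hand it to the driver. The function shape here is an assumption; the execution step (e.g., via psycopg 3) is shown only in a comment:

```python
def list_tables_query(schema_name):
    # Return (sql, params) for listing ordinary tables (relkind = 'r')
    # in the given schema, straight from the system catalogs.
    sql = """
        SELECT c.oid, c.relname
        FROM pg_catalog.pg_class AS c
        JOIN pg_catalog.pg_namespace AS n ON n.oid = c.relnamespace
        WHERE c.relkind = 'r' AND n.nspname = %s
        ORDER BY c.relname
    """
    return sql, (schema_name,)

# With psycopg 3 this would be roughly:
#   with psycopg.connect(dsn) as conn:
#       rows = conn.execute(*list_tables_query("public")).fetchall()
```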
We should completely excise SQLAlchemy from our code base over time.
SQLAlchemy is a fantastic piece of software, but it solves many
problems we don't have, and not the ones we do. We're essentially
using it as a query builder, but doing so comes with massive baggage
in the form of previously-mentioned mutable state which needs to be
extremely delicately handled. We should replace SQLAlchemy with a
query builder in combination with direct usage of psycopg 3. Note that
this will also help us moving forward with performance, since
SQLAlchemy is still on psycopg2 which lacks async capabilities and
prepared queries. We could consider rolling our own query builder. I'm
not sure about this, but see
for an example of how that
could look, and how simple it can actually be. I wasn't able to find a
currently-maintained query builder that seems to support all the
features we need natively.
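To give a feel for how simple a home-grown query builder could be, here is an assumption-laden toy: identifier quoting plus parameterized SELECTs, with values left as placeholders so the driver (e.g., psycopg 3) handles escaping. This is a sketch of the idea, not a proposed final API:

```python
def quote_ident(name):
    # Double-quote an SQL identifier, escaping embedded double quotes.
    return '"' + name.replace('"', '""') + '"'

def select(columns, table, where=None):
    # where: list of (column, value) equality filters; each value
    # becomes a %s placeholder in the returned params tuple.
    sql = "SELECT {} FROM {}".format(
        ", ".join(quote_ident(c) for c in columns),
        quote_ident(table),
    )
    params = []
    if where:
        clauses = []
        for col, val in where:
            clauses.append("{} = %s".format(quote_ident(col)))
            params.append(val)
        sql += " WHERE " + " AND ".join(clauses)
    return sql, tuple(params)
```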
Rather than replacing SQLAlchemy with our own set of database object
classes to represent the state of the user's database, we should focus
on operation classes such as "RecordsRequest", "CreateTableRequest",
"AlterColumnRequest" and the like for abstraction over the query
builder if needed. This concept was proposed by Dom; all credit goes
to him. This operation-centric concept is similar to how Alembic
works. Side note: Alembic was developed by the maintainer of
SQLAlchemy, a few years after SQLAlchemy, and seems to be considerably
less stateful in its design than his earlier software. Hmmmmmmmmm.
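The operation-centric idea can be sketched as small, immutable request objects that each describe one DB operation and render themselves to SQL. All names and fields below are hypothetical; in practice the table would be identified by OID and resolved to a name at execution time, and type names would be validated rather than interpolated blindly:

```python
from dataclasses import dataclass

def quote_ident(name):
    return '"' + name.replace('"', '""') + '"'

@dataclass(frozen=True)
class CreateTableRequest:
    schema: str
    name: str

    def to_sql(self):
        return "CREATE TABLE {}.{} ()".format(
            quote_ident(self.schema), quote_ident(self.name)
        )

@dataclass(frozen=True)
class AlterColumnTypeRequest:
    table: str
    column: str
    new_type: str  # assumed validated upstream; not quoted here

    def to_sql(self):
        return "ALTER TABLE {} ALTER COLUMN {} TYPE {}".format(
            quote_ident(self.table), quote_ident(self.column), self.new_type
        )
```

Because each request is a frozen value object, there is no mutable state to keep in sync: the operation is constructed, rendered, executed, and discarded.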
3. Iterative steps
The proposed architecture above feels like it's on the other side of
the planet from where we currently stand, and I absolutely do not want
to halt feature development to fix this. However, I don't think that
should be necessary. Iterative steps would be something like:
1. Change all API endpoints to use DB identifiers rather than Django
IDs. This might affect the front end, but I suspect it actually won't,
since we're using integer Django identifiers, and columns are already
namespaced under tables.
2. Decide how we want to construct queries in the future (the proposed
query builder and psycopg3 solution above is my favorite), and
implement any needed pieces (e.g., a query builder, or wiring to hook
a query builder to psycopg3).
3. Create the DBObjectMetadata model.
4. Choose one DBObject model, and remove its statefulness one piece at
a time. This means removing any usage of SQLAlchemy bit by bit,
replacing any usage of SQLAlchemy with our query builder, and
replacing any webservice-specific storage with usage of the
DBObjectMetadata model.
5. Remove the chosen model entirely.
6. Repeat steps (4) and (5) until only the models proposed above remain.
7. We should at this point be able to remove the rest of SQLAlchemy
entirely, as well as all reflection and other state management machinery.
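To illustrate why step (1) likely won't disturb the front end: both Django IDs and OIDs/attnums are integers, and columns are already namespaced under tables, so the URL shapes can stay identical. The route patterns below are hypothetical, shown as plain regexes for clarity:

```python
import re

# Same URL structure whether the integer is a Django PK or a DB identifier:
TABLE_ROUTE = re.compile(r"^/tables/(?P<oid>\d+)/$")
COLUMN_ROUTE = re.compile(r"^/tables/(?P<oid>\d+)/columns/(?P<attnum>\d+)/$")

match = COLUMN_ROUTE.match("/tables/16385/columns/2/")
```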
4. Avoiding worsening the problem
We should not create a single new model to represent a database
object. We should avoid creating new functions that pass around
SQLAlchemy objects, but instead prefer passing around identifiers
(e.g., OIDs). When fixing a bug, or modifying a piece of the current
service, we should strongly prefer stateless solutions.
5. Final thoughts
I'd like to reemphasize that I don't think it's appropriate to jump
into a massive refactor before we have a demo together, and have
earned a little more breathing room. We can definitely fix little bits
we see as we go, and avoid making more of the same mistakes.
Also, while I've written the above in direct and generally imperative
terms, I don't claim that this is the only solution. I invite
criticism of the proposed solution, alternate proposals, or
counterclaims that the problem I outlined isn't so bad. I especially
invite the other backend devs (and Kriti) to do their best to kick
holes in my argument.