Removing SQLAlchemy

Brent Moran

Mar 8, 2023, 10:47:19 AM
to Mathesar Developers
We had a meeting to discuss removing SQLAlchemy.

The notes are here:
https://wiki.mathesar.org/en/meeting-notes/2023-03/2023-03-08-removing-sqlachemy-meeting

We (the backend core team) are all on board with removing
SQLAlchemy from our codebase.

We went through some concerns, but none seem like complete deal-killers.

The central sticking point seems to be what to do about identifiers
for Database objects:
- Should we stick with Django IDs in the API, then OIDs in the back end?
- Should we get rid of Django IDs in the API, and expose the OIDs in
the API instead?
- Should we use Django IDs in the API, and use only names (instead of
OIDs) in the back end?
- Should we expose names in the API and have the front end use those
instead of stable identifiers?

We've decided to continue the discussion in a meeting on Friday.
Intended attendees of that meeting (you should have received an
invitation) should try to brainstorm pros/cons of different setups
between now and then. I'd like to restrict the discussion in that
meeting to figuring out how we want to identify database objects in
our API, and the implications of different choices.

Kriti Godey

Mar 8, 2023, 11:17:35 AM
to Brent Moran, Mathesar Developers
Thanks for the update, Brent!

Dominykas Mostauskis

Mar 9, 2023, 3:56:26 AM
to Kriti Godey, Brent Moran, Mathesar Developers
I had forgotten to update my calendar. I'm off Friday. Sorry for the inconvenience.

By the way, could we submit our brainstorms and thoughts via email to this thread prior to the meeting? Reading what others think would be helpful in preparing, at least to me.

Brent Moran

Mar 9, 2023, 9:39:35 AM
to Dominykas Mostauskis, Kriti Godey, Mathesar Developers
Okay, I rescheduled.

Regarding brainstorming, after spending some time thinking and
tinkering, I'm even more convinced that we should keep OIDs.

All metadata lookups need to operate on OIDs in the DB anyway (e.g.,
'describe this column's type, constraints, etc.', 'list the tables in
this schema', 'list the constraints on this table', etc.). The only
reason we needed reflection for any of these types of operations is
that SQLAlchemy needs the table name for them. The actual query on
the DB uses OIDs, even as generated/run by SQLAlchemy.

DML can be trivially (or nearly trivially) mapped through OIDs by
creating views in our own schema (say, mathesar_util_views) that are
named algorithmically after the OID of the table they're mirroring.
Given a table with OID 12345 and three columns with attnums 1, 2, and 4,
we'd create a view like
```
CREATE VIEW mathesar_util_views.t12345 (c1, c2, c4) AS
SELECT id, first_column, another_column
FROM users_schema.mytable;
```
You can use such a view for SELECT, INSERT, UPDATE, and DELETE on the
underlying table.
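To make the naming scheme concrete, here is a small sketch of how such a view definition could be generated from an OID and a list of (attnum, name) pairs. The function name and schema are illustrative, following the example above; real code would also need to quote identifiers properly.

```python
def mirror_view_ddl(oid, schema, table, columns):
    """Build the CREATE VIEW statement for a mirror view named after the
    table's OID, with view columns named after attnums.

    `columns` is a list of (attnum, column_name) pairs.
    NOTE: identifiers are interpolated unquoted here for clarity only.
    """
    view_cols = ", ".join(f"c{attnum}" for attnum, _name in columns)
    select_cols = ", ".join(name for _attnum, name in columns)
    return (
        f"CREATE VIEW mathesar_util_views.t{oid} ({view_cols}) AS\n"
        f"SELECT {select_cols}\n"
        f"FROM {schema}.{table};"
    )


ddl = mirror_view_ddl(
    12345, "users_schema", "mytable",
    [(1, "id"), (2, "first_column"), (4, "another_column")],
)
print(ddl)
```

Renaming a column in the underlying table then only requires recreating the view body; the view's own column names (c1, c2, c4) stay stable.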

The only remaining piece is DDL. For this, we'd either need to reflect
names from OIDs (and attnums), or we'd need to provide a DB-level
function to do the translation and run a given query with the names
plugged in instead of numeric IDs.
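For illustration, Postgres can already do much of that translation itself: the `regclass` cast renders an OID as a (schema-qualified where necessary) table name, and `format()` handles identifier quoting. A hedged sketch, reusing the example OID 12345 (the column name here stands in for one looked up from `pg_attribute` by (attrelid, attnum)):

```
-- An OID renders as a table name via the regclass cast:
SELECT 12345::regclass;  -- e.g. users_schema.mytable

-- A DB-level function could assemble DDL from numeric ids;
-- format()'s %I takes care of identifier quoting:
SELECT format(
  'ALTER TABLE %s RENAME COLUMN %I TO %I',
  12345::regclass, 'first_column', 'new_name'
);
```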

While the DDL is a bit of a pain, we'd have some pain the other way
for the descriptive queries in the first paragraph (though admittedly
less). However, DDL is run much more rarely than 'list columns' and
the like, so I think that trade is sensible.

Combined with the stability and proper ID nature of OIDs (as opposed
to names), I think we should stick with them as our main identifiers
for database objects.

Sean Colsen

Mar 9, 2023, 11:57:33 AM
to Brent Moran, Dominykas Mostauskis, Kriti Godey, Mathesar Developers

Brent,

  • That view is a clever idea. How would the view definition get updated when, say, a new column is added to the underlying table (especially if done outside of Mathesar)?

  • You said:

    All metadata lookups need to operate on OIDs anyway

  • I can see how it’s possible to run a query like “describe this column’s type” by directly using OIDs instead of names. But when you say “need”, are you suggesting that it would be problematic to use names instead of OIDs? I’d be curious to see an example of a specific query which you think would work better when performed with OIDs vs names.

Dominykas Mostauskis

Mar 9, 2023, 1:17:20 PM
to Brent Moran, Kriti Godey, Mathesar Developers
That looks promising, Brent.

It occurred to me that now is the best time to work out what specifically we need for 1) SaaS, 2) concurrent users, 3) real time collaboration, 4) backups, and any other significant challenges we predict facing in the near future. I think we need a clear image of how our refactors will play into those challenges (and what the challenges are). Maybe a dedicated email thread is warranted. Is there agreement that we need to clarify what the architectural requirements are for those features?

Observations about identifiers:

- We're using three kinds of ids (I'll refer to the list item numbers later):
    1. Django (rename-and-movement-stable) IDs,
    2. Postgres-specific (rename-stable) IDs: oids and attnums,
    3. SQL (backup-stable) IDs: names;
- Note the three ID stabilities:
    - Rename stability (name might change but id won't),
    - Movement stability (moving a column (for example) to a different table will preserve the id),
    - Backup stability (id stays the same between different Mathesar or Postgres instances);
- We need 3 (name ids), because SQLAlchemy and SQL don't understand 2 (oids and attnums);
    - We won't need 3 anymore when (if)
        - We replace SQLAlchemy
        - and implement implicit 2->3 conversions on Postgres;
- 2 over 3 has the slight advantage that our code is already set up for 2;
     - If we do all 2->3 conversions on Postgres (not in Python),
          - we're basically just trading rename stability for backup stability;
          - suppose backup stability is non-consequential,
               - is rename stability worth the additional complexity?
- Why do we need 1 (Django ids)?
    - Rename and movement stability?
         - E.g. that way an exploration is not invalidated if one of the Postgres objects it references is moved or renamed;
    - What are the alternatives?

Kriti Godey

Mar 9, 2023, 4:28:55 PM
to Dominykas Mostauskis, Brent Moran, Mathesar Developers
Is there agreement that we need to clarify what the architectural requirements are for those features?

No, this seems like premature optimization to me. We can't predict all the features we need to build, and if we make detailed plans for all sorts of features now, it will make us less flexible later.

I do agree that we should have a clear understanding of the impact of any refactors, but this shouldn't be related to a whole list of specific features since those are not set in stone.

Brent Moran

Mar 10, 2023, 1:30:39 AM
to Mathesar Developers
Sean,

You raise a good point about 'need'. That's too strong in some cases
(maybe all if you accept some inefficiency; more on that later).

To kick a hole in my own argument:
- For the type of a column (specifically), you can use the function
`pg_typeof` to get the type by qualified name (e.g.,
`schema.table.colname`)

However, in other cases:
- For a query to 'list the constraints on this table', I'm not sure
how you'd do this without looking up the OID of the table in the
`pg_class` catalog table, and joining with the `pg_constraint` table.
The pg_constraint table only has OIDs for referencing relations.
- This is a similar situation for the query to 'list the columns of
this table'. The underlying catalog table `pg_attribute` only
references relations (and indeed types) by OID, and not by names.
- For the query to 'get the default value (or value generator) of this
column', the underlying catalog table `pg_attrdef` only references the
column by OID, attnum pair.
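To make the contrast concrete, here is roughly what 'list the constraints on this table' looks like in each style (schema and table names are just the running example):

```
-- By name: two extra joins just to find the table's OID first.
SELECT con.conname, con.contype
FROM pg_constraint AS con
JOIN pg_class AS rel ON rel.oid = con.conrelid
JOIN pg_namespace AS ns ON ns.oid = rel.relnamespace
WHERE ns.nspname = 'users_schema'
  AND rel.relname = 'mytable';

-- By OID: a single indexed lookup on the catalog table.
SELECT conname, contype
FROM pg_constraint
WHERE conrelid = 12345;
```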

For every catalog table, except for ones that specifically have the
name of a given database object as one of their attributes (but not
generally as a key), the main id for that table is an OID, or some
tuple of OIDs and attnums. So, to look up the default of a column, you
need its OID, attnum pair from somewhere. We could, of course, make
wrapper functions that perform the joining necessary to look these
things up by some qualified name, but:
- That's less efficient, since the names are not generally indexed
(though there are a few counterexamples where a name might be _part_ of
an index to provide a needed uniqueness constraint)
- We perform 'describe this object' queries a lot. Generally, every
time we ask for the records of a table, we also need some description
of it.
- Each of these wrapper functions is going to add complexity, and is
hard to generalize: The lookups for the OID of various different
objects by name are in different tables, based on object type.
- This implies a number of functions (or at least code paths) of m *
n, where m is the number of places to look up an OID, and n is the
number of places to use the OID.
- This reduces to only n if we start with the OID as the identifier.

Now, the complexity argument does swing the other way in the case of
DDL operations, but
- we have fewer of those than description look ups (so far; I'd expect
this to continue), and
- Those are generally simpler than the descriptive look ups (except
insofar as they have descriptive lookups as prerequisites)

On a side note, any DDL operation generally starts with a descriptive
look up prerequisite anyway. I.e., if you want to add a constraint to
a table, part of that work flow generally involves determining the
current constraints on a table (perhaps on a previous request or
call). Thus, pulling complexity out of the DDL-specific part of things
and shoving it into the lookup part of things doesn't gain much for
practical DDL operation work flows. On the other hand, there are many
times where we perform descriptive lookups without needing to perform
any DDL operations.

For an interesting exercise, try starting `psql` with the -E flag to
echo the queries run when you perform commands, e.g., when you use
'\d' to list the tables in a schema or '\d mytable' to describe a
table. Usually, the first query run by those commands gets an OID
(e.g., for 'mytable' in the latter example), and then uses that OID for
all further querying.

So much for 'need'.

Regarding keeping things up-to-date: I think we should not only
update, but also create the views using event triggers:
https://www.postgresql.org/docs/13/event-trigger-definition.html.
These can have rules about which schemas they apply to, and which
users' actions are covered. We'd basically write a few of these up
front, and then they'd fire whenever a relevant action is performed.
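A minimal sketch of what one of those event triggers could look like; the function and trigger names are hypothetical, and the body only logs what fired where real code would drop and recreate the affected mirror views:

```
CREATE FUNCTION mathesar_refresh_mirrors() RETURNS event_trigger
LANGUAGE plpgsql AS $$
DECLARE
  cmd record;
BEGIN
  FOR cmd IN SELECT * FROM pg_event_trigger_ddl_commands() LOOP
    RAISE NOTICE 'DDL % on %', cmd.command_tag, cmd.object_identity;
    -- here: recreate the mirror view for cmd.objid
  END LOOP;
END;
$$;

CREATE EVENT TRIGGER mathesar_mirror_sync
  ON ddl_command_end
  EXECUTE FUNCTION mathesar_refresh_mirrors();
```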

As an enticement for Sean specifically, these translation views would
give us an internal-to-the-user's-database catalog layer where we
could put:
- table descriptions, separate from the user's SQL-level comments
- Other table metadata
In particular, this would start to give us a way to keep more and more
of the metadata (which Sean has occasionally (correctly) pointed out
may be perceived as proper _data_ by a user) in the user's database,
and in a state that makes it easier to back up and keep together with
the user's data.

Regarding Dom's points:
I think we can get away from (1), since we're not pursuing movement
stability. (Nice taxonomy, by the way.) That is, we've put some effort
into avoiding simulating that a different table is the same table. See
Mukesh's work on splitting tables for an example of this. The previous
version (that I wrote) created two new tables, and simulated one (the
remainder) being the same table as the original. After Mukesh's work,
we actually just modify the original table in place, and only create
one new one with the extracted columns. Over time, we might even have
a 'mathesar_catalog' namespace alongside the 'pg_catalog' namespace with a
bunch of metadata. Because we'd have the power of PostgreSQL triggers
(and event triggers) on our side, we could keep such a catalog in sync
with the existing user tables much more easily and efficiently than
with Django models.

Regarding backup stability (3): It depends on the backup method. If
one copies the PostgreSQL data directory, OIDs are stable. I think
this should more properly be 'export stability', as in exporting a
database via pg_dump for importing into another DB (or back into the
same DB in the backup case). I think this is valuable, but not as
valuable as rename stability (2), since we do a lot more renaming than
importing and exporting via SQL. I'd also point out that if we go with
the idea of keeping more and more metadata in the mirroring view
associated with a mathesar table, we'll at least be in a better
position than we are currently for doing exports along with metadata.

Finally, while I think identifier stability is a big benefit, the
reason I think it's a benefit is because it reduces the complexity of
the overall machine. For me, (2) is the clear winner on that front.

P.S. I'm worried that this discussion has completely subsumed the
'getting rid of SQLAlchemy' discussion...

Sean Colsen

Mar 10, 2023, 6:02:08 PM
to Brent Moran, Mathesar Developers

Thanks for writing all this up, Brent

I wanted to respond to a few points:

  • If one copies the PostgreSQL data directory, OIDs are stable.

  • Sure. But to safely make a filesystem-level backup you’d need to take the DB offline, right? I wouldn’t want us to recommend that sort of backup process to any of our users. Or maybe I’m missing something?

  • we do a lot more renaming than importing and exporting via SQL

  • I think this is yet to be determined. We do a lot of renaming because we’re currently hammering on the product and all its features. But I expect that many of our users will go years without a single DDL operation. But almost everyone needs regular backups of some kind (even if they don’t know it). For my CiviCRM clients, I customarily set up hourly backups. Granted, what matters most for the purpose of this discussion is a full backup-restore cycle, which happens much less frequently than a mere backup. But I’ve had numerous occasions in my experience when I’ve needed to restore a backup – most often targeted at a separate staging DB for manual inspection, but also once to an actual live DB, sacrificing some live data made shortly after the backup. For comparison, CiviCRM upgrades frequently include schema migrations, but those migrations rarely include rename operations. “Stand-alone” Mathesar use-cases would certainly have more frequent renaming though.

  • Everyone’s use-case will be different, but the above two points are just to say: I equate pg_dump with “backup stability”, and I place its importance on par with “rename stability”.

I’m ready to accept the OID approach, especially since it’s currently a lighter lift than using names.

But I do think it’s worth briefly entertaining the thought experiment of “what would we do if we wanted to use names?” Here’s what I might do:

  • When the service layer receives a request to rename a schema, table, or column, it propagates that rename operation into all metadata as needed.

  • Other than that, we don’t do anything special.

  • This approach has plenty of drawbacks vs OIDs (and I’ve expounded on some possible mitigation strategies below)

    • If renaming is done outside Mathesar, metadata will be left dangling.

      We could expose an API and/or command line utility to propagate renames into Mathesar’s metadata. This might be sufficient for most cases.

    • If metadata is stored in a separate database, the renaming cannot be atomic.

      We could consider moving metadata into a schema within the user database.

    • Externally bookmarked URLs containing schema names or table names would break after renaming.

      We could also maintain a table of redirects for schemas and tables. This would never be perfect because situations like two tables swapping names would not work, but it would still cover most cases. I’m not sure how big of a problem this would actually be though, even if we did nothing.

    • When user A performs a rename operation, some client-side operations might break for user B without a full page refresh.

      We already have this problem with other DDL operations, but using names would make the problem more common. If we eventually implement a realtime API, then this would be much less of a concern.

    • As Brent points out, some introspection queries might be a tiny bit slower when run using names due to joining on name columns which are not indexed.

      I’m doubtful the performance concerns here would be significant, but we could do some testing to see.

Overall:

  • To me, the names approach seems simpler than the OIDs approach.

  • My biggest concern with OIDs is that I would really like to someday be able to use Mathesar with SQLite. This is a larger conversation, but for the stand-alone use case, I think SQLite has some major advantages over Postgres. Closing the door so tightly on that future would feel sad to me.

  • But I totally see the value in OIDs too! Personally I see advantages and disadvantages in both directions and I’m a bit torn.

  • The fact that Brent has such a strong opinion in favor of OIDs is very compelling, and so I’m on board with OIDs.

  • I just wanted to lay out the above thoughts about names to help us consider all the angles with such a far-reaching decision.

Dominykas Mostauskis

Mar 13, 2023, 6:37:49 AM
to Sean Colsen, Brent Moran, Mathesar Developers
- `mathesar_catalog` (vs `pg_catalog`) sounds good;
- It's a problem that we keep our metadata in Django's database; what Sean said about losing atomicity when metadata and data are in different places resonates; ideally each user Postgres install would have its own metadata database/schema (or whatever is necessary for atomicity); not sure what a viable way to fix this could be;
- Both OIDs and names sound ok to me; in my mind OIDs add complexity, but that's outweighed by us already doing that; if we get rid of either OIDs or names (see my last email for what I mean by getting rid of names), that will be a big win in my book either way;
- Django IDs seem like a source of a lot of work and headache, imo; Brent's comment about movement stability not being important for us (it's the main advantage offered by Django IDs) cheered me up; I'd like to get rid of them at some point; we could replace them with triples like `(database_name: str, oid: str, attnum: Optional[str])`; if we had an identifier that's understandable by the whole stack that would remove a lot of redundant data juggling, and would be a huge win in my book;
- Losing portability across RDBMS vendors by going with OIDs would make me sad as well; that seems like a big concern; I'd like to brush it off by saying that we're already committed to only-Postgres, but I'm not sure that's true, though my insights here are more limited than I'd like them to be; I'd like to discuss this more.
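A minimal sketch of what such a stack-wide identifier triple might look like as a shared type; the class name is illustrative, and the `str` types from the bullet above are swapped for `int` here since OIDs and attnums are numeric in Postgres:

```python
from typing import NamedTuple, Optional


class DbObjectRef(NamedTuple):
    """Hypothetical stack-wide identifier for a database object.

    attnum is None when referring to a relation as a whole,
    and set when referring to one of its columns.
    """
    database_name: str
    oid: int
    attnum: Optional[int] = None


table_ref = DbObjectRef("my_db", 12345)
column_ref = DbObjectRef("my_db", 12345, attnum=4)
print(table_ref, column_ref)
```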

Dominykas Mostauskis

Mar 13, 2023, 6:59:10 AM
to Sean Colsen, Brent Moran, Mathesar Developers
Kriti said:
 
Is there agreement that we need to clarify what the architectural requirements are for those features?

No, this seems like premature optimization to me. We can't predict all the features we need to build, and if we make detailed plans for all sorts of features now, it will make us less flexible later. 

I do agree that we should have a clear understanding of the impact of any refactors, but this shouldn't be related to a whole list of specific features since those are not set in stone.

Kriti Godey

Mar 13, 2023, 12:18:09 PM
to Dominykas Mostauskis, Sean Colsen, Brent Moran, Mathesar Developers
I'm nervous about not having a synchronised vision for what the groundwork is for.

This makes sense, I'm nervous about this too.

I think whoever is working on the spec for the refactor should address this and figure out what discussions need to happen. For the "remove SQLAlchemy" discussion, that seems to be Brent.