Should we install things on the user database


Brent Moran

Aug 11, 2023, 6:12:59 AM
to Mathesar Developers
We've been discussing this ad-hoc in a bunch of places, and it seems
the decision to go ahead with adding schemata, functions, etc. to user
databases keeps being unmade, probably due to lack of clear
communication of the decision, misunderstanding of the decision,
and/or lack of buy-in regarding the decision.

My goals for this thread are:
- to make a decision about whether we should require the addition of
schemata to a target database when installing Mathesar, and
- to get enough buy-in on that decision that we can avoid re-un-making
it for a while, or until we have significant reasons to reassess it
(e.g., a huge client who wants to set it up for 1000 users, and with
whom we have a contractual agreement involving bunches of money, with
a deal-breaking opinion on the matter).

The fallback goal, in the event that we can't come to a decision in
this thread, is for team members to fully express their positions and
any points of concern so that an eventual meeting can be maximally
fruitful.

Terminology:

I'll try to refer to the database containing the tables that users
will access through the Mathesar UI as the "target database" in this
discussion. The reason for that is to encompass and acknowledge the
fact that different people with different access levels may be trying
to install Mathesar, so calling it "the user's database" would be
presumptuous.

Prerequisite assumptions:

- The issue of whether we can require the addition of schemata to the
target database during installation is separate from the question of
whether Mathesar's daily driver user can modify the target database.
We should try to avoid discussion along those lines.
- This discussion is dependent on our previous decision to assume
we're only supporting PostgreSQL for the foreseeable future. I'd like
to take that for granted during this discussion. If team members are
not willing to grant that, we should immediately stop this discussion
and rehash that one instead.
- The decision we want to reach here is whether we should assume we
can add schemata to the user's database at all, not how many we can
add.
- The question of whether we should allow selectively installing parts
of Mathesar, thereby avoiding installing some Mathesar schemata on the
database, is separate, as per the previous point.
- We are only discussing additive modifications during installation,
and any modifications would be scoped under schemata that Mathesar
would create and own. We are not discussing modifying preexisting data
while installing, and have no plans to do so as far as I'm aware.

Why wouldn't we assume we can add schemata?

The main problems with requiring the addition of schemata to the
target database when installing Mathesar are that:
- some potential Mathesar installers may not have the permission to
add schemata on the target database, and
- some potential Mathesar installers may be reluctant to allow the
installation process to modify their database in any way, even
non-destructively.

Why did we start adding more schemata to the target database?

- SQLAlchemy requires a complete picture of the database in order to
write any SQL. This leads to massive problems with efficiency and with
maintaining state in Python. It's bogging down the runtime speed of
the app, increasing the code complexity, and bogging down development
speed. To remove SQLAlchemy, we need to rewrite massive chunks of the
codebase without it. The options are:
  - Rewrite functions that currently use SQLAlchemy to write SQL as
    strings directly, then submit those strings as queries (commands)
    to the database using psycopg.
    - This is difficult to maintain, since most of the codebase would
      then be strings; it's also difficult to test, verify, and debug.
  - Rewrite those functions using some SQL-generating framework that's
    more rudimentary than SQLAlchemy, and doesn't need to reflect the
    whole DB.
    - We still have not found one that has the features we need to
      handle JSON, Postgres-specific window functions, recursive CTEs,
      etc.
    - It's not possible to do this completely without reflecting the
      DB, since many functions have branching based on the current DB
      state.
  - Rewrite those functions as SQL functions on the database, and use
    thin wrappers in Python to call them.

We chose the last option, since it's more maintainable, more testable,
and generally more pleasant to work with. It's also much faster to get
this working than to try to do it with the framework we haven't found,
or with strings, due to the easier testing, etc.
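
To make the shape of that last option concrete, here's a minimal
sketch (hypothetical function and schema names, not the real `msar`
signatures): the operation lives in a PL/pgSQL function on the target
database, and the Python side shrinks to a thin wrapper.

    -- Hypothetical sketch only; the real Mathesar functions live in the
    -- msar schema with different signatures.
    CREATE OR REPLACE FUNCTION msar_example.rename_column(
      tab_id oid, col_attnum smallint, new_name text
    ) RETURNS void AS $$
    BEGIN
      EXECUTE format(
        'ALTER TABLE %s RENAME COLUMN %I TO %I',
        tab_id::regclass,
        (SELECT attname FROM pg_attribute
          WHERE attrelid = tab_id AND attnum = col_attnum),
        new_name
      );
    END;
    $$ LANGUAGE plpgsql;

The Python wrapper is then a one-liner that submits
SELECT msar_example.rename_column(%s, %s, %s) through psycopg, with no
reflection needed in the service layer.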

Current Features depending on DB layer installation

We have some features that require installation on the database under
any circumstances, even if we change course on this. So far, these
are:
- Custom types
- Custom casting functions
  - These are "safer" casting functions that let a less technical user
    more confidently change column types without risking data loss or
    corruption.
  - These enable type inference, and type inference without them would
    need to be completely conceptually redesigned.
- Custom aggregations
  - These are used by the summarization feature of data explorer.

The first two features listed are useless in a situation where no
modification of the DB is allowed, so they may not matter for such an
installation. The last feature listed is useful even in cases where
Mathesar is installed so that the daily driver user is read-only.
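
To give a sense of what the "safer" casting functions buy, here's a
rough sketch of the idea (illustrative only; not the actual Mathesar
implementation): a cast that either succeeds cleanly or raises, so a
type change that relies on it aborts instead of silently mangling
data.

    -- Illustrative sketch; names and accepted spellings are made up.
    CREATE OR REPLACE FUNCTION mathesar_types_example.cast_to_boolean(val text)
    RETURNS boolean AS $$
    DECLARE
      cleaned text := lower(trim(val));
    BEGIN
      IF val IS NULL THEN
        RETURN NULL;
      ELSIF cleaned IN ('t', 'true', 'yes', 'y', '1') THEN
        RETURN true;
      ELSIF cleaned IN ('f', 'false', 'no', 'n', '0') THEN
        RETURN false;
      END IF;
      RAISE EXCEPTION 'cannot cast "%" to boolean without losing information', val;
    END;
    $$ LANGUAGE plpgsql;

A column type change then looks like ALTER TABLE t ALTER COLUMN c TYPE
boolean USING mathesar_types_example.cast_to_boolean(c); if any row
can't be converted, the whole statement rolls back.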

General advantages of depending on DB layer functions

- Operations wrapped in a DB function are performed atomically.
  - E.g., if you call our 'extract_columns' SQL function, either all
    parts (creating a table, setting types/constraints, moving data,
    dropping old columns) happen, or nothing happens.
  - This isn't just an advantage for DDL; it has advantages for DML as
    well (insert, update, etc.).
- Type checking and safety. SQL is strongly typed, and we're using
that to help automatically validate user-submitted parameters.
- Speed of execution.
  - Postgres stores functions in such a way that they execute more
    quickly than unstored queries submitted to the database from
    Python (no matter how we try to avoid reflection).
  - This is even more valuable for DQL and DML operations than DDL,
    since they tend to be more repetitively called, and involve more
    data.
- Allows DBAs to have the functions that drive Mathesar available in
the database itself.
  - While we're not currently selling this as a feature, we could.
  - This also lets a DBA whose users are using the Mathesar UI have a
    direct analog of whatever actions they're running available on the
    DB in a scriptable format. This allows automating repetitive UI
    tasks.
- When/if we try to implement undo/redo, I think we will likely have
the best luck if we can use triggers, etc. on the user DB to maintain
audit tables alongside the user's data.
  - This may be a faulty assumption, but it's the most "natural"
    implementation.
  - A user who isn't allowing any modifications to their DB _at all_
    (including to data) won't get value from this anyway. Thus, this
    would be relevant to DML-only, but not DQL-only users.
- A harder-to-measure ongoing benefit is increased development
velocity, i.e., more features in general. It's simpler to write SQL in
... SQL, and easier to test/maintain.
- A feature that we aren't planning very soon, but that would be a
benefit at some point, is moving metadata closer to the data. I.e., in
the same database, in a `__ma_catalog` schema (or the like). I mention
this only in passing.

What if we _did_ try to take a stance of not installing anything on the DB?

The features and advantages above would, of course, not be available
in any installation where we were not allowed to add schemata. We'd
need to add some branching in our Python code to determine whether
Mathesar had its DB tools available or not, and when they're not:
- Turn off any DDL operations (makes sense regardless for this type of
installation), and modify the UI to get rid of import options and so
on.
- Turn off type inference (no problem; we're not importing)
- Modify type casting function in data explorer to avoid using our
custom casting functions (branching there too)
- Turn off column dynamic default detection; this relies on some DB
functions, and turning it off may cause a number of bugs, since the
front end uses that to determine when you don't need to give a value
for a column.
  - Moving dynamic default detection to the DB came from a user
    request wherein the `pglast` dependency was preventing
    installation.

Other work we'd need to do to make this possible is to rewrite all DDL
SQL function wrappers to gather necessary data from the database, then
call lower-level functions, rather than the higher-level functions
that do that work on the DB itself. I.e., we'd be bringing reflection
back into the Python layer.

I suppose another option would be to maintain a DB-layer version of
each DDL (data definition language) function as well as a
Python-layer version of the same, and keep them in sync somehow. I do
not like the idea of maintaining this solution. The reason we'd need
two versions of each is that DML, DQL, and DDL operations all need to
do things like 'list the columns of table X'. If the function to list
table columns is only in Python, it's inaccessible from the DB layer.
If the function to list them is only on the DB, it's not available
when the schemata are not installed, and nothing will work.

What's a best-effort long-term setup that avoids installing on the target DB?

We'd still want to eventually achieve a setup free of SQLAlchemy, and
which minimizes reflecting database state in order to construct
queries to then submit to the database.

I think we would want to research and choose (or develop) a fairly
advanced SQL writing infrastructure that didn't rely on an accurate
picture of the database state to work. We _really_ don't want to be
maintaining the bulk of our back end code as strings in Python for any
amount of time. We've looked for such a library, and haven't found one
with the features we need.

However, as a stop-gap, we could move our current DB layer functions
up into python by converting their bodies into python f-strings that
we'd fill at the python layer with reflected info, then submit to the
database. I'd personally prefer this to solving the problems of
duplicating subroutines (e.g., list columns of table X) in both the DB
and Python layers; I think that duplication is a time-suck with little
to no value at this point. It's also untenable to maintain a codebase
of SQL-in-strings long-term (I've had to try to do it before).
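
As a rough sketch of the shape that stop-gap would take (hypothetical
example, using psycopg's SQL composition for the identifier parts
rather than bare f-strings, per the injection concern below):

    # Sketch only: the body of what is currently a DB-layer function, kept
    # in Python and filled in with reflected names before being sent to
    # the database.
    from psycopg import sql  # psycopg 3 assumed


    def rename_column(conn, schema_name, table_name, old_name, new_name):
        # The names come from a prior reflection query; sql.Identifier
        # handles quoting so reflected names can't inject SQL.
        query = sql.SQL("ALTER TABLE {}.{} RENAME COLUMN {} TO {}").format(
            sql.Identifier(schema_name),
            sql.Identifier(table_name),
            sql.Identifier(old_name),
            sql.Identifier(new_name),
        )
        conn.execute(query)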

This framework and situation will be slower to develop in, slowing
down feature production, since it's more difficult to write, test, and
modify code that's kept as strings in Python than it is to write,
test, and modify actual code. However, considering "doesn't install
things on the target DB" as a feature, that might balance out. Note,
however, we're talking about slowing down development of features for
all users in order to satisfy some hypothetical users' concerns.

Side note about transactions: I _think_ we can maintain atomicity with
this setup, but I'm not sure. A major advantage of the SQL functions
we have at the moment is that they automatically handle this. Either a
function succeeds, or it doesn't (and this is recursive). There's no
possibility of, e.g., changing the names of some columns but not the
others when this is done by our SQL functions. We'd want to replicate
this with any setup, which would involve managing transactions
manually in our codebase.
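
A sketch of what that manual transaction handling might look like with
psycopg 3 (assumed; names illustrative), using the multi-column rename
case above:

    # Sketch: wrap the whole multi-statement operation in one transaction
    # so it stays all-or-nothing, as the current SQL functions already
    # guarantee.
    from psycopg import Connection, sql


    def rename_columns(conn: Connection, table: str, renames: dict[str, str]) -> None:
        # Commits on success, rolls back if any statement raises.
        with conn.transaction():
            for old_name, new_name in renames.items():
                conn.execute(
                    sql.SQL("ALTER TABLE {} RENAME COLUMN {} TO {}").format(
                        sql.Identifier(table),
                        sql.Identifier(old_name),
                        sql.Identifier(new_name),
                    )
                )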

We'll have to be more careful about SQL injection using this setup,
since we lose the type checking in function arguments and returns.

Summary of disadvantages of this setup (according to Brent)
- Missing features that aren't user-facing DDL (e.g., casting in data
explorer, summarizing in data explorer)
- Difficult to maintain
- Difficult to develop in the first place
- Difficult to test
- More vulnerable to SQL injection; requires us to be more careful.
- Atomicity is not guaranteed by tools; we need to handle it ourselves.
- Can't use or develop features that need this in the future without
considering any users who don't allow adding schemata during
installation.

TL;DR Conclusion

I think that supporting Mathesar installations that don't touch the
target database at all will force us to sacrifice features, runtime
efficiency, development velocity, and code quality. Those sacrifices
will be felt by all actual users, and the benefit of being able to
install without adding any schemata to the target database will be
relevant only for some subset of hypothetical users. I don't think
that trade makes sense at this point. We haven't heard from one single
credible "I'm desperate to use your product, and I'm going to give you
money (or lots of users), but it can't add schemata to my DB"
potential user. And even if we did hear from them, I have enough
experience with users, feature requests, and user interviews, to
understand that developing features requested by not-yet-users is
usually a mistake, and making major architectural choices based on
that is even worse. Even developing features requested (as features
rather than problems to solve) from actual current users is generally
not the best approach. I think we should make architectural decisions
like this based on what we think will result in the best overall
feature set for Mathesar, and I think tight integration with a
PostgreSQL database, implying being able to add schemata to it, gives
us more options than it cuts off.

Dominykas Mostauskis

Aug 14, 2023, 6:44:00 AM
to Brent Moran, Mathesar Developers
I'm currently in the let's-move-even-more-stuff-into-Postgres camp, but am open to new viewpoints and am interested in counter-arguments.

An argument against installing Mathesar's internal schemata is that some users might be reluctant to allow Mathesar to make changes to their database. I think we could address that by having an optional, access-restricted installation method where the installing user creates our schemata for us (only create, not populate), creates a user for us that can only write to those schemata, and gives our installer access to that user and thus to only those schemata, so that we can perform the installation in a way that we can modify only the things the user explicitly gave us access to. The user could be as granular as they like when choosing which other schemas we can access and how. Provided this is feasible, it should give a user an increased sense of control and safety.
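
For concreteness, the pre-creation step could look something like this (a sketch; the role and schema names are illustrative, and the exact set of grants is something we'd need to work out):

    -- Run by the DBA, not by the Mathesar installer:
    CREATE SCHEMA msar;
    CREATE SCHEMA mathesar_types;
    CREATE ROLE mathesar_installer LOGIN PASSWORD 'change-me';
    GRANT USAGE, CREATE ON SCHEMA msar, mathesar_types TO mathesar_installer;

    -- Optionally, grant access to whichever existing schemas Mathesar
    -- should be able to see:
    GRANT USAGE ON SCHEMA public TO mathesar_installer;
    GRANT SELECT ON ALL TABLES IN SCHEMA public TO mathesar_installer;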

PS, thanks for the informative write-up and I especially appreciate the effort you put into structuring the discussion.

Pavish Kumar Ramani Gopal

Aug 16, 2023, 2:24:06 AM
to Brent Moran, Dominykas Mostauskis, Mathesar Developers

Thanks for the detailed writeup Brent.

Before responding to the questions posed, I'd like to take a moment to describe my concerns, since based on your email, I think there might have been considerable miscommunication in our previous discussions over Matrix and other emails.

Pavish's concerns:

Scenario

  • Misty is a technical person and a highly experienced DBA/sysadmin.
  • Misty manages a live production application: an ecommerce platform.
  • Misty has no need for tools like Mathesar. She's comfortable with executing SQL directly on the DB.
  • Ash is a product team member who works on the same application.
  • Ash is non-technical and his job requires frequently modifying product information shown on the ecommerce website.
  • Misty has created a PostgreSQL user account for Ash.
    • The account only has DML permissions on selected tables.
  • Ash does not understand DBs very well and does not know SQL.
  • Ash requests Misty's help every time he needs something changed. This is frustrating for both of them.
  • Ash comes across Mathesar and wants Misty to install Mathesar for him to use.
  • Misty attempts to install Mathesar with the PostgreSQL user account she previously created for Ash.
  • Misty discovers:
    • Mathesar requires a user account with permission to make changes to the DB during installation.
    • Mathesar automatically creates the schemas on their DB.
  • Misty does not trust Mathesar.
  • Misty's reluctant to provide an account with a higher permission level.
  • We, the Mathesar team, provide our reasons, but she doesn't like that we make changes automatically.
  • Misty looks for other options to avoid this. There are none.
  • Misty either says no to Ash, or looks for other tools.

Is Misty paranoid?

  • Misty is not someone who doesn't like any changes to her DB. She runs all kinds of queries on prod DBs. She simply does not trust Mathesar making changes to her DB.
  • Categorizing Misty as a user 'who does not want anything installed on her DB' makes her seem paranoid. That's not the case.
  • Here are Misty's actual concerns:
    • She is not comfortable with providing elevated permissions to external tools.
    • She does not trust tools that may potentially cause damage to her DB.
    • She does not want tools to automatically make changes to her DB upon installation.

Should we ignore these users?

  • Misty might be a hypothetical user at this point, but so are all our user personas. We do not have enough users to understand their requirements well.
  • There has been at least one user who has reached out to us to request help with installation, who had similar constraints.

Response to Brent's mail:

DX vs UX

I agree with the technical challenges and the improvements to DX with moving stuff to the DB. This is not disputed.

From the user's perspective (the user being the persona I mentioned above):

  • Mathesar requires elevated permissions to install. This is not ideal for someone who is uncomfortable with providing such permissions to a tool they do not trust.
  • Mathesar automatically makes changes to the target database. This is going to make users reluctant to proceed with Mathesar.
  • Users who like a strong control over their DB might consider installing Mathesar a risky activity.

The UX here is sub-optimal, and might lead to users preferring not to use Mathesar on their production DBs.

The Symptom vs The Underlying Problem

It occurs to me that we've been targeting our discussions around the symptom and not the problem I've been trying to address.
The symptom being: 'Users who do not want anything installed on their DB'.
The problem being: 'Users being uncomfortable with providing elevated permissions to their DB, and not trusting tools that automatically make changes to their DB upon installation'.

A Reasonable Compromise

I think Dom's suggestion is great. It addresses all the concerns of my user persona.

  • Provide an additional option to let users create the schemas themselves and ensure that the user account provided to Mathesar has write access to those schemas.
  • This makes sure that:
    • Mathesar can be installed with a PostgreSQL user with limited privileges.
    • Mathesar does not make top level changes automatically.
    • Users have a high degree of control over their DB.
    • Misty will be happy. Ash will be happy. I will be happy.

Brent Moran

Aug 16, 2023, 3:53:25 AM
to Mathesar Developers
I really think the Reasonable Compromise is a great idea (thanks Dom)
and is a low-effort (and therefore low-cost) way to expand the
potential set of users who would be willing to install. I agree we
should offer that, if we decide that we do want to assume we'll be
able to add things to the target database. I'd still like to let
others weigh in, and get 'affirmative buy-in' at the meeting from
anyone concerned. The user who still wouldn't be able to install
Mathesar in your scenario is Ash. I.e., he'll need to get Misty's
cooperation. We need to make sure everyone's okay with that compromise
in order to proceed.

A caveat: since Dom pitched the idea, I've done a bit of research on
the implications of installing with a user who doesn't have raised
privileges, but only owns a couple of schemata. The main
limit I can think of is that we won't be able to use CREATE EVENT
TRIGGER while installing. This isn't used yet, but would be quite
beneficial for implementing Undo/Redo for DDL: we'd use event triggers
for maintaining any audit tables for these events. We could try to get
around this limitation by maintaining the audit tables within the
logic of our current DDL functions (this should be safe; they're
transactional). This would mean that Undo/Redo would only apply to
actions performed through Mathesar, which might get hairy if one user
is modifying a table in Mathesar and another in psql (E.g., there
might be some weird state not captured in the audit tables). I suggest
we cross that bridge when we come to it, if we ever do.
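
For reference, the kind of thing we'd be giving up looks roughly like
the sketch below (the audit table and function are hypothetical; the
point is that CREATE EVENT TRIGGER itself requires superuser, which is
exactly what the restricted installation wouldn't have):

    -- Sketch only: log every DDL command into an audit table.
    CREATE TABLE msar.ddl_audit (
      occurred_at timestamptz,
      command_tag text,
      object_identity text
    );

    CREATE FUNCTION msar.log_ddl() RETURNS event_trigger AS $$
    BEGIN
      INSERT INTO msar.ddl_audit (occurred_at, command_tag, object_identity)
      SELECT now(), cmd.command_tag, cmd.object_identity
      FROM pg_event_trigger_ddl_commands() AS cmd;
    END;
    $$ LANGUAGE plpgsql;

    CREATE EVENT TRIGGER mathesar_ddl_audit
      ON ddl_command_end
      EXECUTE FUNCTION msar.log_ddl();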

Mukesh Murali

Aug 16, 2023, 6:19:25 AM
to Brent Moran, Mathesar Developers

I am in the camp of moving things into Postgres and don’t see any point in moving away from it at this point. I think our resources are better spent on making our integration with Postgres our stronghold.

Whether we should require the addition of schemata to a target database when installing Mathesar

We don’t need to create a Mathesar schema. We only need a schema to store our functions. I really like Dom’s approach of having the user create the schema beforehand and letting us add functions to it.

My approach would be to

  • By default, create a schema to install our functions into.
  • If the user wishes to provide their own, they should be able to customize which schema we install the functions into.
  • If they do neither of the above, raise a warning and ignore that database.

Sean Colsen

Aug 16, 2023, 3:25:33 PM
to Mukesh Murali, Brent Moran, Mathesar Developers

TL;DR: I’d lean towards not installing things on the target DB. But I’m okay with forging ahead, given that everyone else so far seems very united and I don’t want to hold up the show.

Even beginning to capture my plethora of thoughts and opinions on this topic seems to have required a rather compendious email — but I don’t want it to come off as too vociferous. I respect that others are quite united on Dom’s approach, especially the back-end team, so I’m fine with continuing that direction. That said, it’s not the direction I’d be inclined to head personally, so I’ll try to explain a bit about why, mostly as food-for-thought. I want to be clear: I’m not pushing hard on this.

~ ~ ~

As a principle my inclination would be for Mathesar to offer the greatest possible functionality, given the permissions it has.

Cautious administrators

If I only want to install Mathesar for DML, then I would be confused as to why Mathesar would require its own schema. Pavish aptly identified this problem as “users being uncomfortable with providing elevated permissions to their DB” and he distinguished this problem from other “symptoms” that we’ve been discussing. I’ll refer to this problem as the “cautious administrator” problem. Assuming we continue with Dom’s plan, I think we need to identify clear strategies to address it.

The separate step of creating the schema adds complexity and potential confusion to an installation process that we’ve already identified as a major weak point of ours. I see this as a small downside to Dom’s plan, but important to acknowledge nonetheless.

In pursuit of mitigating the “cautious administrator” problem, I think our installation docs should explain why Mathesar requires its own schema. How do we explain this requirement to Mathesar administrators?

Benefits to users

What does our schema do for Mathesar administrators and their users? My reading of Brent’s initial email is that our schema offers the following benefits:

  • It enables Mathesar’s Money, Email, and URI types
  • It improves casting by:
    • lowering risk when changing column types (though I’m curious to understand this risk better)
    • allowing type inference during import (though I’m curious if this is an intrinsic limitation or an incidental one)
  • It enables a few (three, I think?) rather esoteric summarization functions

If an administrator wants to install Mathesar for DML only, then the list of benefits above does almost nothing to justify our schema. As such, I don’t think we should attempt to justify our schema to administrators by framing it as a benefit to them or their users.

Benefits to developers

From what I can tell, the real reason Mathesar requires an extra schema is that it makes life easier for us — the Mathesar developers. With the exception of a few summarization functions, there seems to be nothing fundamental to Mathesar’s functionality that should necessarily require this extra schema and the permissions to create/update it. Rather, it’s just that we’ve chosen this architecture because it was the easiest way to build this product. While I don’t think that’s ideal, I do think it’s okay. I imagine it will behoove us to justify our schema by candidly explaining that it’s an architectural choice which has allowed us to quickly build a stable product.

Principles and paradigms

Above, I said that “offering the greatest possible functionality with the permissions we have” should be a principle that we strive for. But why?

One obvious reason is that it allows people to minimize their Mathesar permissions, which is good for security and such. But there’s another reason too: it steers us towards more modular product design, which ultimately leads to more powerful features.

Example: custom types

The product design for our UI Types seems to have been firmly rooted within the paradigm of installing stuff on the target DB. You want a type that Postgres doesn’t have? Ok, we’ll install one!

But let’s consider what a “type” means from a user’s perspective. Here’s what I can come up with:

  • Special text-based display formatting (e.g. currency symbol)
  • Special HTML display formatting (e.g. hyperlink)
  • Special validation when entering data
  • A special icon in the column header
  • A special name which allows users to conveniently create a new column with a common combination of the above behaviors

Now let’s look at the defining characteristics for the custom types that we currently have:

        Text-based   HTML         Validation   Icon     Name
        formatting   formatting
Money   ✅ ⁽¹⁾                                 ✅ ⁽⁴⁾   ✅ ⁽⁵⁾
Email                             ✅ ⁽³⁾       ✅ ⁽⁴⁾   ✅ ⁽⁵⁾
URI                  ✅ ⁽²⁾       ✅ ⁽³⁾       ✅ ⁽⁴⁾   ✅ ⁽⁵⁾

I don’t think it would be too difficult to give users control over the constituent behaviors above. Here’s a hypothetical five-step plan, which addresses each ✅ above. (The superscript numbers map to the list items below):

  1. Extend our display options for Number types to include a “Units” string and a “Units Position” (“Start” or “End”). This would allow us to handle the special text-based formatting necessary for Mathesar’s current Money type.

    Added benefit: users would also be able to apply currency formatting to types within pre-existing databases.

  2. Extend our display options for text types to automatically turn URLs into hyperlinks. This would allow us to handle the special HTML behavior for URI types.

    Added benefits: Users could get hyperlinks in plain text if they want. Users could turn off the HTML hyperlinks in the URI type while retaining the validation.

    Going further: We could process email addresses too, giving some additional (and configurable!) functionality to the Email type.

  3. Allow users to set their own CHECK constraints. We could have an assortment of “off-the-shelf” constraints available for them to pick from, for example: URI, email, hexadecimal color, UK postcode, etc. We could also allow the user to specify a regex. (See the sketch just after this list.)

  4. The type icon admittedly gets a bit tricky. I see the following approaches:

    • Abandon special icons. (I think this could be a viable short-term strategy).
    • Allow the user to pick an icon. Store their choice in metadata. They’d need to pick from a set of known icons.
    • Derive the icon based on structure and metadata. For example, if we see a CHECK constraint that matches our “Email” constraint, then we’d apply the Email icon.
  5. The name is easier than the icon because I don’t think it would be particularly important for us to mirror that name back to the user after creating the column. For example, if I create a “Money” column and then subsequently see that column’s data type as “Number”, I think that’s okay. We could consider ways of adapting the column-creation UX to make it clear to users that creating a Money column is actually a set of multiple smaller building blocks pre-assembled.

    Names are useful insofar as they allow users to quickly set up a new column with a common combination of behaviors. We could codify that behavior within the service layer so that when a user tells Mathesar to create a new “URI” column, Mathesar sets a CHECK constraint to validate it and sets the metadata to display it as a hyperlink. Further down the road, we could even allow users to “define their own types” by choosing their own combination of settings and storing that combination in metadata associated with a database or schema. Maybe we’d call it a “Column Template” or something. Or we could have a mechanism to copy-paste type config and metadata from column to column. Sometimes I do this in a spreadsheet when I’ve set up some custom validation or conditional formatting.
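
As a sketch of step 3 (the exact constraints Mathesar might ship are an open question, and the table and column names here are made up):

    -- An "off-the-shelf" email validation constraint, without any custom type:
    ALTER TABLE patrons
      ADD CONSTRAINT patrons_email_format
      CHECK ("Email" ~ '^[^@[:space:]]+@[^@[:space:]]+\.[^@[:space:]]+$');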

My point here is that had we designed our type system under a paradigm of “don’t install things on the target DB”, I think we may have ultimately arrived at more powerful features — especially for the use cases where Mathesar is connecting to pre-existing DBs. Even if we forge ahead with Dom’s plan, I think we can still keep these concerns of mine in mind as we design new features, aiming for modularity and graceful degradation.

It’s worth noting one downside I see to the approach above compared to our current approach: an array of emails currently will validate each email, whereas (if I understand correctly) we’d need an entirely different CHECK constraint to produce a valid array of emails without a custom type. This would potentially be a limitation, but a fairly minor one in my opinion. An approach like the above seems overall better suited to working with pre-existing databases.

Other database software

Perhaps the most important reason that I’d like to avoid installing things on the target DB is to lay an architectural foundation which would leave a viable path open for eventually supporting non-Postgres database systems.

At the start of this thread, Brent seemed to preempt this retort by saying:

This discussion is dependent on our previous decision to assume we’re only supporting PostgreSQL for the foreseeable future. I’d like to take that for granted during this discussion. If team members are not willing to grant that, we should immediately stop this discussion and rehash that one instead.

Brent, what previous discussion are you referring to? Almost a year ago, you opened a discussion about Extending reliance on PostgreSQL. Is that what you’re referring to? Or is there a different discussion? I’ll acknowledge that my concerns in this email would have been much better placed within that thread last September, and I apologize that I didn’t raise them at that point. At that time, dependence on Postgres seemed like a foregone conclusion to me, so I silently acquiesced. But since then I’ve had private conversations with Kriti in which she’s expressed interest in eventually supporting non-Postgres systems, so I’ve become more confused on this topic and more eager to speak up. As a team it seems like we’re clear that in the short term we’re only supporting Postgres. But it’s not clear to me that we’re unified on a long term vision of only ever supporting Postgres.

I predict that architectural choices like “more logic in the DB layer” will forestall us ever directly supporting other RDBMSs because it will just be too much work. I’m excited about the potential of FDWs to expand compatibility, but for RDBMSs like SQLite and DuckDB one of the coolest things is that you don’t need to run any server software. Hypothetical desktop Mathesar would benefit from being able to access such user databases directly because, as Brent said: “the server that we’d have the most trouble changing in the back end is the PostgreSQL server”.

I raised a similar concern when we discussed OIDs vs names several months ago. In that thread Dom said:

Losing portability across RDBMS vendors by going with OIDs would make me sad as well; that seems like a big concern; I’d like to brush it off by saying that we’re already committed to only-Postgres, but I’m not sure that’s true, though my insights here are more limited than I’d like them to be; I’d like to discuss this more.

From a user’s perspective, compatibility with other database systems is a make-or-break feature in so many cases, and we are very unlikely to hear complaints from such users because they will write us off immediately. Hopefully FDW will alleviate the majority of that friction, but I’m not 100% confident in it.

Alternate architecture

Since it seems like we’re very unlikely to reconsider the “logic in db-layer” architecture, I don’t want to get into the weeds too much on how we would do it, but I did have some thoughts in response to the three options that Brent laid out. Brent’s Option 2 is what I’d personally lean towards:

Rewrite those functions using some SQL-generating framework that’s more rudimentary than SQLAlchemy, and doesn’t need to reflect the whole DB. We still have not found one that has the features we need to handle JSON, postgres-specific window functions, recursive CTEs, etc. It’s not possible to do this completely without reflecting the DB, since many functions have branching based on the current DB state.

  • For the reflection concern, I’d be curious to hear some specific examples of circumstances where we’d need to reflect DB state for branching logic. I can see how we need to translate OIDs into names for a lot of operations. Is that mostly what we’re talking about? Are there other important examples? Even still, I would imagine that we’d be able to run that reflection within a transaction so as to preserve a stateless service layer. It would require an extra round trip from the service layer to the DB layer, but that doesn’t seem like a huge problem to me as I would expect latency to typically be low between those two layers. Are we concerned about the performance of that round trip?

  • For the library concern, I hear you on this. I’m curious if you’ve looked in the direction of libraries that help with templating SQL instead of building queries structurally. In a way, this approach is somewhat like Brent’s option 1 and option 2. For example:

    Granted, some of those libraries are not python-specific, but with this being a low-level part of our application, I would imagine we might have some opportunity to easily call into a lower-level language.

    Libraries like these seem to have different approaches to mitigating SQL injection, likely with different degrees of safety. I agree that this is a concern, and one argument for moving logic into the DB layer. But it doesn’t seem like an “unsolvable” problem to me. Perhaps I’m being too optimistic though.

Summary

Just to be clear: I don’t want to drag this out. The back end team should ultimately have more weight when deciding whether or not to move more logic into the DB layer, and the fact that we have broad consensus there makes me wary of even voicing these opinions of mine. I actually began composing this response after reading Brent’s notes in the meeting agenda which felt much more open-ended and had no replies yet. I’m happy to go with the flow!

Dominykas Mostauskis

Aug 17, 2023, 11:58:58 AM
to Sean Colsen, Mukesh Murali, Brent Moran, Mathesar Developers
I think Sean's points are really good. I agree that if we redesign our systems to be modular and flexible we'd end up with more powerful features. That's a valuable point to make, and it made me want to zoom out more and reevaluate more things.

To expand on that even further, I think if we redesign any of our systems we'll be able to make them much better and make the product much better (largely due to hindsight). That is, insofar that we can even judge what a good product is without having many users.

I guess it comes down to our appetite for change and what we can do within the bounds of that appetite. Maybe we can quantify that in cycles. E.g. how many cycles would we be willing to dedicate to being Postgres independent? Consider how long undoing SQLAlchemy has been going on for, and we're only through the DML part of it (the most isolated part, possibly the easy part). And, replacing SQLAlchemy does not involve any frontend, UX or design changes, and so far almost no service-layer changes either. What would getting rid of Postgres-dependence be like?

I'd say, becoming not Postgres-based is more work than we're willing to do. I would be thrilled to be proven wrong though.

Dominykas Mostauskis

Aug 17, 2023, 12:43:24 PM
to Sean Colsen, Mukesh Murali, Brent Moran, Mathesar Developers
I've a few points that are tangential to my previous email:

- the dichotomy between DX and UX is sometimes false: e.g. if a developer struggles to satisfy user expectations, that's both a DX and a UX problem; I would struggle to judge whether installing our schemata into user databases is a DX or a UX optimization;
- as Sean points out, the direct user benefit of us installing ourselves into the user database (like custom types), is not that significant and could be implemented to not require that; however, there is at least one thing that we'll want to support and does require us putting things in the user db: backups; if we want backups, and we're not willing to get rid of all the metadata we currently store in the Django/service-layer, we'll have to move that to the user database to have reliable backups; undo/redo is also closely related to this;
- basing Mathesar on Postgres, while it has its downsides, seems to also have its upsides, which we've not yet fully experienced, but likely will; my rationale here is that if you want to be able to support a set of databases, you'll end up being able to rely on the intersection of features they have, each database degrading that intersection; if you're basing on a single vendor, you have the widest feature-set.

Kriti Godey

Aug 17, 2023, 3:45:41 PM
to Dominykas Mostauskis, Sean Colsen, Mukesh Murali, Brent Moran, Mathesar Developers

Some scattered thoughts:

  • I’m in agreement with the principle that Mathesar should offer the greatest possible functionality, given the permissions it has.
  • Custom types offer more functionality than Sean described, primarily custom filtering and grouping.
    • Custom filtering and grouping is the main thing that benefits from type information being stored at the DB layer, since it means the filtering/grouping can be done at the SQL layer.
    • I.e., we store URL schemes, email domains, etc. as separate parts, so we can easily use them in filtering and grouping.
    • I am not sure how we would replicate this.
    • We could abandon this functionality, of course, but I’m hesitant to do that because I think functionality like this is something we do uniquely and it’s because of our approach of installing things on the DB. Other tools can’t replicate this easily.
  • Building on my previous point, I think there’s an argument to be made that installing things on the DB allows us to create a more unique and useful product for some users, even if it means we are not viable for other users.
    • I think we should try to validate this hypothesis from conversations with users.
  • I agree that we don’t have a long term vision of only ever supporting Postgres.
    • But I don’t think it makes sense for us to reduce dependence on Postgres to get there, because being close to the DB helps make us unique.
    • I think if we chose to support non-Postgres databases in the future, it makes more sense for us to replicate how we work with Postgres with e.g. SQLite or MySQL rather than trying to create abstractions that work with multiple database backends.
    • I think trying to create abstractions that work with multiple database backends will end up less powerful over the long term (see SQLAlchemy or NocoDB), and also remove our unique value proposition.
    • I’d rather end up with e.g. “Mathesar for Postgres” and “Mathesar for SQLite” that play into the strengths of each database backend rather than a single product that only focuses on the shared abstracted out features of Postgres and SQLite.

I wanted to get these thoughts out there although they’re not very well organized. Feel free to ask clarifying questions if anything I said didn’t make sense.

Sean Colsen

Aug 21, 2023, 2:00:28 PM
to Kriti Godey, Dominykas Mostauskis, Mukesh Murali, Brent Moran, Mathesar Developers

Thanks for these thoughts Kriti. To me they are very well-organized!

I’d rather end up with e.g. “Mathesar for Postgres” and “Mathesar for SQLite” that play into the strengths of each database backend rather than a single product that only focuses on the shared abstracted out features of Postgres and SQLite.

I find this to be the most compelling part of your message.

Given your response, I see the decision to install our own schema on the target DB as finalized, and I respect that. I appreciate that this thread has laid a recurrent debate to rest with an archive of our decision process.

I have some follow-up thoughts which I hope will be useful in honing our strategy going forward.

Installing things on the DB allows us to create a more unique and useful product for some users, even if it means we are not viable for other users.

In the abstract sense, I can see how this could be true, but I don’t have a clear picture of these “unique features”. I think it will be important for us to precisely identify those features we’re able to uniquely offer due to our strategy of tight integration with the DB. The better we can identify those unique features, the better we can:

  • Use our marketing site to communicate those unique features to potential users in order to distinguish Mathesar from our competitors.
  • Use our documentation to communicate those unique features to administrators installing Mathesar in order to ameliorate concerns they may have about Mathesar’s dependence on seemingly excessive DB privileges.
  • Use our feedback processes to communicate those unique features to current users as a way to “validate our hypothesis”, as you say.

We have features like our “Money” type which incidentally have come to depend on our tight DB integration. But for the purpose of the above communication, I’m most interested in identifying features that would otherwise not be feasible without our tight DB integration. The Money type is, to me, not a compelling example of such a feature because (as I described in my previous email) I can see a straightforward path towards implementing that feature with looser DB integration.

Here’s my understanding thus far of the “unique features” we’re able to offer by virtue of our tight DB integration:

  • Our special summarization functions: percentage_true, peak_time, and peak_month

    But these do not make for a good showcase because they are esoteric functions that I imagine most users would never need. Plus, in theory, I think we could eventually offer comparable functionality through a more robust formula system.

  • Maybe something about casting functions during import and column type alteration?

    But I don’t understand exactly what we’re gaining due to our tight DB integration.

I haven’t included any type-specific logic in the list above, because I still don’t see any which would otherwise not be feasible without our tight DB integration. My previous email demonstrated a path towards replicating some of our type-specific behavior through expanded use of metadata instead of logic installed on the user’s DB.

Kriti pointed out some additional type-specific functionality of which I was not previously aware — “functions”. Thanks! With our Library sample data, I now see how I can use this feature to learn that johnson.com is the most common email domain among all patrons. Neat! I’ve actually had a real use-case for this sort of thing before. Circa 2010 I was redesigning an HTML email newsletter template for a client and I used a query like this to get a rough sense of which email service providers we needed to prioritize supporting with different email formatting features. As well, I would sometimes perform bounce rate analysis on a per-domain basis.

But I don’t see a good argument for this particular feature necessarily being dependent on tight DB integration. The expression split_part("Email", '@', 2) extracts the domain of an email address. Similar expressions could utilize regexp_replace to extract various parts of a URI. A well-rounded design which exposes this feature to the user would require some more UX thought, but to me it seems perfectly viable from a technical perspective. Beyond being viable, I contend that, in light of our new focus on “users with pre-existing Postgres DBs”, this is exactly the sort of feature that should not depend on our custom types. Most pre-existing DBs are bound to already have columns that store email addresses. We can — and should — work with those types to the greatest extent possible. I’m curious if I’m missing something here though. Is there an important reason why these special functions require our custom types? I can imagine that the performance of filtering and grouping might be superior when applied to a type which is properly configured to index composite parts like “domain”. But if that’s actually a significant concern, I’d be curious to benchmark it.
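
For instance, a per-domain breakdown needs nothing beyond built-in functions (table and column names assumed from the Library sample data):

    -- Count patrons per email domain, no custom Email type required:
    SELECT split_part("Email", '@', 2) AS domain, count(*) AS num_patrons
    FROM "Patrons"
    GROUP BY domain
    ORDER BY num_patrons DESC;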

Kriti and Dom, you both seem to be arguing that a strategy of tight DB integration will allow us to build more of these unique features in the future. I can see, theoretically, how this could be true, so I hope you’re right.

For a long time we’ve been straddling a difficult line between two broad product directions: “pre-existing-DBs” vs “Mathesar-only-DBs”, and more recently I’ve been advocating publicly and privately that focusing on one of these directions will bring clarity to important architectural decisions. Now I’m delighted that we’ve finally narrowed our focus! Had we chosen “Mathesar-only-DBs”, I’d be much more inclined to move logic into the DB! But tight DB integration seems somewhat at odds with our new focus on “pre-existing-DBs”.

As we move forward with an architecture of tight DB integration, I’m sure we’ll face many more opportunities to move logic into the DB layer. Maintainability and DX concerns may sometimes prove to be reason-enough to do so. But I’d like to emphasize my plea for caution: just because we can build some of our functionality on top of our tight DB integration, doesn’t necessarily mean we should. I think it’ll be important to evaluate our options from the perspective of a user installing Mathesar to connect to a pre-existing database. If there is some meaningful benefit that our tight DB integration can offer to that user, then I’d like us to find a way to portray it!

Brent Moran

Aug 22, 2023, 2:19:57 AM
to Mathesar Developers
Thank you everyone for a lively and thoughtful discussion. This whole thread has been quite thought-provoking for me. 

One particular spot of thought provocation has been around the topic of supporting multiple DBs (someday). I find myself less pessimistic than most of the team on this front, it seems. While I would prefer to concentrate on PostgreSQL for the foreseeable future, imagine a situation where
  • We've wrapped all (or at least most) Mathesar functionality that queries or modifies the user's database into functions that live on that database, and
  • The webservice now deals exclusively in model wrangling, and calling those DB functions to get or manipulate data and objects on the user's database.
In this situation, I could make a  strong case that it would actually be easier for an outside party, or for us, to implement "Mathesar for MariaDB" or the like, since it would amount to writing functions on the target DBMS that satisfy the "Mathesar Interface" consisting of the DB functions, and doing some minor wiring so the webservice understands how to connect to that database system. Alternatively, if someone wanted to write a webservice in a different language, or with a different client-facing API, it would amount to writing a webservice that could use (some subset of) the "Mathesar Interface". I.e., taking this all the way to the conclusion would result in a complete separation that would increase modularity and flexibility. To contrast with the current situation, we're using PostgreSQL-specific features throughout our codebase at the moment:
  • OIDs (we're all aware of this), but also
  • Postgres-specific catalog tables,
  • Postgres-specific types for collecting and manipulating certain metadata,
  • Postgres-specific functions for implementing a number of filters, sorts, etc.,
  • Postgres-specific system information functions.
The system information functions and catalog tables are how we query things like "What are the names of the columns of table X?" right now. The move of that logic to a DB function that is called 'get_table_cols' or the like that just returns a JSON blob with relevant info makes it easier (or at least less full-codebase) to support a different DBMS rather than harder. It's possible, but extremely non-trivial to rewrite this to be less DBMS-specific, and would involve recreating or replacing huge swaths of SQLAlchemy functionality around reflection. If you poke your nose into the SQLAlchemy codebase and look at how they're accomplishing such queries, you'll see that they have a whole bunch of DBMS-specific branching and logic. On the OID front, recall that one benefit and goal of rewriting this functionality in the DB is that we avoid most problems associated with names as identifiers regarding state, and also avoid any reflection and state wrangling associated with OIDs as identifiers. That is, for a PL/pgSQL function on the DB, the (dis)advantages for one type of identifier or another are less pronounced. Thus, we've already written most of the DB layer functions with overloaded signatures that let you call them with 'string' or 'numeric' identifiers, as you prefer. This means that should we want to move away from OIDs and towards names as identifiers in our (python) codebase, the job is easier now than before.
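
To illustrate the overloading (a hypothetical function, not the actual msar signatures), the same operation can be exposed once keyed by OID and once keyed by names, and the Python layer can use whichever identifier style it prefers:

    -- Hypothetical sketch of an overloaded pair:
    CREATE FUNCTION msar_example.drop_table(tab_id oid) RETURNS void AS $$
    BEGIN
      EXECUTE format('DROP TABLE %s', tab_id::regclass);
    END;
    $$ LANGUAGE plpgsql;

    CREATE FUNCTION msar_example.drop_table(sch_name text, tab_name text)
    RETURNS void AS $$
    BEGIN
      EXECUTE format('DROP TABLE %I.%I', sch_name, tab_name);
    END;
    $$ LANGUAGE plpgsql;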

Regarding reflection, Sean wrote:
  • For the reflection concern, I’d be curious to hear some specific examples of circumstances where we’d need to reflect DB state for branching logic. I can see how we need to translate OIDs into names for a lot of operations. Is that mostly what we’re talking about? Are there other important examples? Even still, I would imagine that we’d be able to run that reflection within a transaction so as to preserve a stateless service layer. It would require an extra round trip from the service layer to the DB layer, but that doesn’t seem like a huge problem to me as I would expect latency to typically be low between those two layers. Are we concerned about the performance of that round trip?

Anywhere that we need info about DB state in order to construct the desired query is the issue here. 

An example would be 'duplicate column X'. Whether you're identifying the column in question by name or OID/attnum, you still need to know its type, its constraints, any default, whether it is referred to by a foreign key, etc. I.e., a bunch of non-identifying info. The options are either to cache this in the service layer (we have discussed the problems with this ad nauseam), or you need to reflect it from the database. You also need to know things about the rest of the columns of the table (to avoid name-clashes, to avoid making a mistake when copying multi-column constraints, etc.).

A non-DDL example would be inserting or reading from any column where the Postgres type to Python type mapping is lossy. An example is Date/Time types. For these types, we need to convert to a string with some specified format before python ever gets hold of it, at least with our current architecture, in order to avoid loss in the conversion in Python. This loss isn't terrible for simple reading (though it may be irritating for some users), but it's pretty relevant for updating and writing. The loss is not reversible in the python layer, and thus you can end up modifying a datum with lower precision than the original, then writing that lower-resolution modified version back to the database. The current way we do the necessary conversion involves a bunch of SQLAlchemy 'magic', but the crux of the matter is that we convert the PostgreSQL date/time type to text of a known format on the database, then convert from the DB text type to the python str type when reading the value in python. The reverse happens for writing. We send a string of a specified format to the database, which is converted to text on the DB, and convert that text to the target PostgreSQL type. This doesn't currently involve installing anything on the DB, but it does require knowing the type of each column in a SELECT * FROM mytable query (i.e., so we know what operations to apply), and thus requires reflection from the database. For purposes of intellectual honesty, I note here that we could theoretically get around this somehow, but the type conversion machinery is built into psycopg, which is the standard way to connect Python to a PostgreSQL database. Replacing or modifying that machinery would be quite difficult.
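
A stripped-down version of that round trip (the format string and the table/column names are illustrative):

    -- Reading: render the timestamp as text of a known format on the DB side.
    SELECT to_char("Checkout Time", 'YYYY-MM-DD"T"HH24:MI:SS.US') AS checkout_time
    FROM "Checkouts";

    -- Writing: send a string back and convert it to the target type on the DB side.
    UPDATE "Checkouts"
    SET "Checkout Time" = to_timestamp('2023-08-22T10:15:30.123456',
                                       'YYYY-MM-DD"T"HH24:MI:SS.US')
    WHERE id = 1;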

Regarding latency, I'm personally quite skeptical about assuming that the round-trip reflection step(s) won't be relevant. While it would make most sense architecturally to have the service on the same machine with the service DB, we're pitching 'connect to a preexisting DB' as a feature when it comes to at least the user (target) DB. For example, it could be an AWS RDS cluster. Or worse, one RDS cluster and one GCP CloudSQL cluster. We've already dealt with an issue where someone was trying to get wired up to an RDS DB. They're running Docker locally on their laptop, and connecting to AWS, and getting timeouts from the webservice. It's also not as simple as 'a reflection here or there'. There are currently many parts of our codebase where we need DB state. The obvious option is to group the reflections, i.e., notice when some parts of our logic are going to need DB state, reflect somewhere in the call, and then pass that context around, dropping it after a given call. To some extent, this is our current strategy. For a number of reasons, this makes for pretty complicated code involving managing lots of out-of-scope state for each function along the call stack. The alternate simpler strategy is to simply reflect info from the DB each time we need it. This will increase the number of reflections even beyond our current situation, thereby reducing performance even further.

A few things about unique features from tight DB integration: 
  • I think at our current "taking-flight" stage, DX is a feature, since it lets us respond more quickly to user-visible feature requests and needs. I agree we should communicate this.
  • Performance is also a feature for a database project, and having functions close to the data by design (or architecture) reduces the ways in which different installation setups could affect performance.
  • Having Mathesar functionality available in the database is itself a feature. In fact it's the feature I personally use at the moment more than anything else.
    • Useful for DBAs to script different pieces of Mathesar in an environment they understand and trust.
    • Useful for folks who want to use parts of Mathesar by themselves in general. I was previously (and still am) strongly opposed to letting webservice (i.e., Django) pieces worm their way down into the lower-level python functions since I want to be able to ship and use those functions without needing Django. This is an extension of that concept to being able to ship a set of useful DB functions that don't depend on anything else.
    • Useful for me personally since it's easier to remember (for example) how to call those functions (also they're documented extensively) than it is to remember the right syntax for adding a unique constraint to a preexisting table when I'm setting up test data.
  • Type-specific logic. I think Sean's table regarding this is quite useful, and we should refer back to it (and maybe extend it) moving forward. I agree that you could replicate much of our custom type functionality using the proposed framework for a UI user. There are caveats to this:
    • We'd need to lose the concept of a 'custom type' in Mathesar, and reorganize how we get users thinking about such things. I completely agree that helping users add check constraints to columns is more flexible, but also potentially more conceptually difficult. I.e., it's no longer an "Email Type" column, it's a text (or varchar, or char) column that includes a "does this look like an email" validator. Thus, the simplification of 'this is an Email-type column' is a feature for some potential users.
    • Some parts won't be available, or easily visible, from the database layer for users looking at a table through another client. I think "this is a Money column" is a useful thing to know, and it will be invisible to a user connecting through a different client without any flag or name on the database.
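
As a concrete version of the unique-constraint point above, here's the contrast I have in mind. The raw DDL is real; the msar function name and signature are hypothetical, just to illustrate the shape of the interface:

    -- The raw DDL I have to look up every time:
    ALTER TABLE mytable ADD CONSTRAINT mytable_col_key UNIQUE (col);

    -- A Mathesar DB function call (hypothetical name and signature):
    SELECT msar.add_unique_constraint('mytable', 'col');
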
I completely support Sean's assertion that we need to communicate to users (or at least installers) the value they are getting (and maybe some value we credibly foresee them getting in the future) by allowing us to add schemata to their DB (or to use schemata they add). I especially like the idea of spending effort on determining what (if any) specific advantages this architecture gives us when connecting to a preexisting database. Off the top of my head:
  • A cautious DBA could install in layers: install the SQL functions and see if they work (especially with the relevant roles and privileges) before proceeding, and simply drop the schemata if something is amiss.
  • A cautious DBA can actually run the SQL tests (we're still moving more over) against the target installation DB and verify that the DB functions work on the running prod DB before proceeding, without changing the DB state beyond adding the Mathesar schemata. A competent DBA will be able to verify that safety with a 5-minute look at how pgTAP works, and some DBAs will already be familiar with it (see the sketch after this list). This is an advantage, since our Python tests need to set up a test DB to run safely, and we define that DB according to our assumptions; those assumptions may not align with the target DB. We should link the relevant pgTAP docs on this, and improve the experience of running the tests in that context if we want to emphasize this.
  • Lowered sensitivity to scale w.r.t. performance. The more we are able to improve on performance, the less a potential user will need to worry about whether Mathesar will work for them based simply on the size of their data.
  • Less likely to break when encountering unexpected types or other DB features, since fewer layers and languages need to understand the type or feature. Not sure how to communicate this positively to users.
  • More likely to be able to still have basic functionality when encountering unexpected types or other DB features.
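
Regarding the pgTAP point above: the kind of smoke test a cautious DBA could run looks roughly like this (the function checked here is just a representative example). Wrapped in a transaction and rolled back, it leaves the DB untouched:

    BEGIN;
    SELECT plan(2);

    -- Is the Mathesar schema there, and is a representative function installed?
    SELECT has_schema('msar');
    SELECT has_function('msar'::name, 'build_type_text'::name);

    SELECT * FROM finish();
    ROLLBACK;
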
I'd like to dig into the last bullet point a bit, taking unknown types as an example. Currently, if a type is unknown to us, it's quite difficult to proceed, since the logic for how to input to and output from that type is not defined in the Python layer, but it's needed in the Python layer in order to know how to use the type. This is the reason that, for each of our custom DB types, we have an associated Python class that carries a bunch of info about identification and input/output for the associated DB type. For an unknown DB type, no such Python class is possible (or at least it's not possible to define it until runtime). We certainly can't create a column of that type using info at the Python layer, since we don't know the right name to identify the type, and we don't know how the arguments for that type work. Contrast this with the msar.build_type_text function. This function uses a bunch of DB state and the internal Postgres catalog tables to build a 'type text' (e.g., 'numeric(5,3)') from whatever combination of information you give it, as long as that information uniquely defines a type and its arguments on the DB (for example, any valid name for the type works, and so does its OID). Moreover, this function works when copying columns, at the database level, without Python needing to know anything about the columns' types. This means that you can currently duplicate a column of unknown type using the DB function for that. You can also do it using the Python-layer wrapper. Not sure about further up the stack.
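
To illustrate why this comes naturally at the DB layer: Postgres can render the canonical type text for any column straight from its own catalogs, whether or not we know the type (table and column names here are illustrative):

    -- Returns e.g. 'numeric(5,3)', or whatever the column's type really is.
    SELECT pg_catalog.format_type(atttypid, atttypmod)
      FROM pg_catalog.pg_attribute
     WHERE attrelid = 'mytable'::regclass
       AND attname = 'price';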

I'll admit that you could conceivably (with enough reflection steps) implement the same logic in the Python layer, but we likely wouldn't have bothered, or at least it would have been a major endeavor. I.e., the developer-time cost of implementing this feature in the Python layer would have been high enough that the feature would simply be missing for users (at least for quite some time) due to not having been implemented. Working in the DB layer, I didn't even intend to implement it for unknown types; it's a byproduct of the fact that the way you get info about any type is the same at the DB layer.


Okay, I'm cutting myself off here, since this email has become pretty long and rambly. Apologies if I've missed responding to anything important; there's a lot to cover.

Brent

Dominykas Mostauskis

unread,
Aug 22, 2023, 6:43:13 AM8/22/23
to Brent Moran, Mathesar Developers
Tangential, but the insights about Mathesar's custom Postgres types lead me to think that we should get rid of them. Checks and deconstructors are more portable. That would be conceptually more complicated, but I think it's worth it. Besides, current UI type workflows with our frontend would/could stay the same. To be clear, our custom Postgres types are an implementation detail of UI types.
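
For instance, the current 'Email' behavior could in principle be expressed as plain text plus a check on an ordinary column, roughly like this (the regex is only illustrative, not something we'd necessarily ship):

    ALTER TABLE contacts
      ADD CONSTRAINT contacts_email_looks_like_email
      CHECK (email ~ '^[^@[:space:]]+@[^@[:space:]]+\.[^@[:space:]]+$');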

Sean Colsen

unread,
Aug 22, 2023, 8:23:26 AM8/22/23
to Dominykas Mostauskis, Brent Moran, Mathesar Developers
Thanks for writing all this up, Brent. It's really helpful!

This whole section about the "Mathesar Interface" at the DB level is very interesting. Understanding more about your longer term visions for it will be useful for us all, I think.

Mukesh Murali

unread,
Aug 22, 2023, 8:31:15 PM8/22/23
to Sean Colsen, Dominykas Mostauskis, Brent Moran, Mathesar Developers

I just want to quickly point out using temporary functions as a potential brainstorming idea, which could help address concerns about permissions and installing on the user database.

We are installing two things on the user database:

  1. Functions required by Mathesar
  2. Custom types

While custom types are optional, we need the Mathesar functions on the user database to run Mathesar. So, for users concerned about permissions, we can install all the functions in the pg_temp schema at runtime. Compared to inlining the SQL functions, this is much easier, although there could be a performance cost; to reduce some of that overhead, we could scan the function being executed and build a list of its dependent functions to install. These functions are temporary, so the user does not have to create any schema, which reduces friction.
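
A rough sketch of what that could look like (the function name and body here are hypothetical, just to show the mechanics). Functions created in pg_temp live in the session's temporary schema, disappear when the connection closes, and must be called with the pg_temp prefix:

    CREATE FUNCTION pg_temp.add_mathesar_column(tab regclass, col_name text, col_type text)
    RETURNS void AS $$
    BEGIN
      -- No persistent schema is created; this function vanishes with the session.
      EXECUTE format('ALTER TABLE %s ADD COLUMN %I %s', tab, col_name, col_type);
    END;
    $$ LANGUAGE plpgsql;

    SELECT pg_temp.add_mathesar_column('mytable', 'notes', 'text');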

Brent Moran

unread,
Aug 23, 2023, 12:43:53 AM8/23/23
to Mathesar Developers
I didn't know about that pg_temp hack. That's a pretty interesting idea! There would definitely be overhead with our current architecture, but that will reduce as we consolidate our connections (if we're able to). I think it would be worth experimenting with at some point.

Regarding a long-term vision w.r.t. Mathesar interface(s):

Given that we're trying to transparently represent the real underlying database state, I think of Mathesar as just one user interface to a database, alongside pgAdmin, DBeaver, psql, whatever. And within Mathesar, I see 4 different 'user interfaces' for different types of clients:
  • the UI (this is obvious)
  • the API
    • useful for alternate front ends
    • useful for a 'power user' who may find it convenient to treat Mathesar as a RESTful(ish) way to just get data out of their database in bulk JSON form
  • the db library
    • Previously, this was the interface I used most for DDL, and it's still the interface I use most for DML/DQL.
    • Scriptable
    • Useful for anyone who wants to interact with Mathesar via either a python interpreter, or a broader python script or program
    • Possible use case: Airflow jobs that modify or manipulate tables on some schedule.
  • set of DB functions (sort of an extension, but we're not packaging it that way for compatibility)
    • Currently, this is the interface I use for DDL stuff
    • Useful for DBAs to script any provided functionality in an easier setting than pure SQL, but without needing python
    • Useful for using Mathesar functionality from a programming language outside the DB, but without needing to generate complex SQL in that language
I think it would be super cool if most or all functionality offered by the UI were offered by all of these interfaces in some fashion. In particular, the API certainly provides a way to do any DB interaction allowed by the UI. For each API request, then, it would be neat to have a clear function or (very small) set of functions in the db library that handle the actual DB interaction backing that request. Then, for each db library function, it would be nice for it to be a pretty thin wrapper around a DB function that provides the actual database manipulations and querying. This would let different users with different needs use the same underlying database in different ways, but with a common language to talk about what they're doing, based on Mathesar idioms. An example of how this might be useful:

Business UI user says, "Hey, dev-friend, I keep having to go through and capitalize names in this column; can you automate that?". Then, dev-friend could glance at the API calls that happen, easily find the db library functions being called, and write a quick script to do the repetitive task. The UI user doesn't really need to know about the python layer, but having a common framework and a similar (enough) interface at the API and db library levels enables a quicker and easier solution from the dev-friend.
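
In a case like that, the dev-friend's 'quick script' might bottom out in little more than this (table and column names illustrative):

    UPDATE people
       SET full_name = initcap(full_name)
     WHERE full_name IS NOT NULL;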

Business UI user says, "Hey dev-friend, I want to know what tables I can join to each of my tables in data explorer without clicking into all 50 of them. Is that possible?" Dev-friend looks at the API calls, finds the backing functions (get_joinable_tables), and whips up a quick script to produce the desired info.

These scenarios are actually already pretty much possible, if the dev-friend wants to implement them in Python. Adding an interface in the form of a set of functions at the DB layer that mostly matches the API would let the same sorts of tasks be done directly on the database by a DBA who doesn't want to deal with Python, as well as enable the other uses listed above.

Moreover, this would be an architecture that would allow us to iteratively develop functionality in a way that lets us test the functionality at (say) the SQL layer and make sure everything is copacetic before adding it to other layers. As I've occasionally discussed with Mukesh or other backend devs, we could even package and promote that functionality, and get user feedback on it without waiting for it to be available in the UI. A prerequisite to that is for enough functionality to be available on a given layer for it to be a viable interface on its own so that some users will actually want to use Mathesar that way.

Another aside: The dream would be to set up these functions to be useful enough and general enough on the DB layer that we basically wouldn't need to write SQL in the Python layer. At that point, the db library could just provide a Python 'API client', where the API is the set of DB functions.

Hope that makes some sense. To be clear, I'm not saying that we should be immediately going this direction, or demanding we ever go that direction. But, this is an architecture that I often have in the back of my mind somewhere when deciding how to implement some piece of functionality. The plus side (as I've mentioned before) is that I've found that it's usually easier to change direction if you have clear lines between parts of a complicated machine, and thinking in this way leads to those lines emerging and being maintained kind of naturally. I.e., it's easier to merge two parts than to extricate them from each other.

Brent

Kriti Godey

unread,
Aug 23, 2023, 6:11:58 PM8/23/23
to Brent Moran, Mathesar Developers

+1 to the architecture Brent described. I have also been thinking of Mathesar as having several interfaces, not just the UI, and I think tighter DB integration helps us offer a unified set of features across all these interfaces. I think offering unique features not possible without DB integration is one advantage of DB integration, but even if most features would be possible to have in the UI in other ways, it’s not as easy to have a consistent set of features across the UI / API / DB library without tight DB integration.

We should be looking for users who would benefit from multiple interfaces, since we see that as one of our benefits. If we are unable to find users who see this approach as a benefit, then we can rethink this assumption. This might help with the niche research.

Also to be clear, when I said this:

I’d rather end up with e.g. “Mathesar for Postgres” and “Mathesar for SQLite” that play into the strengths of each database backend rather than a single product that only focuses on the shared abstracted out features of Postgres and SQLite.

I did not intend this as an argument for or against tight integration with the DB, I just don’t think multi-DB support is an argument for no integration with the DB since there are (arguably) better ways to achieve that.

Sean said:

But I’d like to emphasize my plea for caution: just because we can build some of our functionality on top of our tight DB integration, doesn’t necessarily mean we should.

I agree with this.

This is why I think it’s very important to make Mathesar work with different levels of permissions (DDL, DML only, and read only). I think this will give us a natural constraint not to make too many things dependent on specific DB configurations. Also it will address a common user concern, so that's good too.
