Thanks for the detailed writeup, Brent.
Before responding to the questions posed, I'd like to take a moment to describe my concerns, since, based on your email, I think there may have been considerable miscommunication in our previous discussions over Matrix and other emails.
Pavish's concerns:
Scenario
I agree with the technical challenges you've described, and with the DX improvements that come from moving things into the DB. This is not disputed.
From the user's perspective (the user being the persona I mentioned above):
The UX here is sub-optimal, and it might lead to users preferring not to run Mathesar against their production DBs.
It occurs to me that we've been targeting our discussions around the symptom and not the problem I've been trying to address.
The symptom being: 'Users who do not want anything installed on their DB'.
The problem being: 'Users being uncomfortable with providing elevated permissions to their DB, and not trusting tools that automatically make changes to their DB upon installation'.
I think Dom's suggestion is great. It addresses all the concerns of my user persona.
I am in the camp of moving things into Postgres, and I see no reason to move away from that now. I think our resources are better spent on making our Postgres integration our stronghold.
Whether we should require the addition of schemata to a target database when installing Mathesar
We don’t need to create a Mathesar schema. We only need a schema to store our functions. I really like Dom’s approach of having the user create the schema beforehand and letting them add functions to it.
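As a sketch of that flow (the schema and role names here are my own placeholders, not settled decisions), the administrator might run something like:

```sql
-- Sketch of Dom's approach as I understand it: the administrator creates the
-- schema up front, then grants Mathesar's role rights only within it.
CREATE SCHEMA mathesar_internal;
GRANT USAGE, CREATE ON SCHEMA mathesar_internal TO mathesar;
```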
My approach would be to
TL;DR: I’d lean towards not installing things on the target DB. But I’m okay with forging ahead, given that everyone else so far seems very united and I don’t want to hold up the show.
Even beginning to capture my plethora of thoughts and opinions on this topic seems to have required a rather compendious email — but I don’t want it to come off as too vociferous. I respect that others are quite united on Dom’s approach, especially the back-end team, so I’m fine with continuing that direction. That said, it’s not the direction I’d be inclined to head personally, so I’ll try to explain a bit about why, mostly as food-for-thought. I want to be clear: I’m not pushing hard on this.
~ ~ ~
As a principle, my inclination would be for Mathesar to offer the greatest possible functionality given the permissions it has.
If I only want to install Mathesar for DML, then I would be confused as to why Mathesar would require its own schema. Pavish aptly identified this problem as “users being uncomfortable with providing elevated permissions to their DB” and he distinguished this problem from other “symptoms” that we’ve been discussing. I’ll refer to this problem as the “cautious administrator” problem. Assuming we continue with Dom’s plan, I think we need to identify clear strategies to address it.
The separate step of creating the schema adds complexity and potential confusion to an installation process that we’ve already identified as a major weak point of ours. I see this as a small downside to Dom’s plan, but important to acknowledge nonetheless.
In pursuit of mitigating the “cautious administrator” problem, I think our installation docs should explain why Mathesar requires its own schema. How do we explain this requirement to Mathesar administrators?
What does our schema do for Mathesar administrators and their users? My reading of Brent’s initial email is that our schema offers the following benefits:
If an administrator wants to install Mathesar for DML only, then the list of benefits above does almost nothing to justify our schema. As such, I don’t think we should attempt to justify our schema to administrators by framing it as a benefit to them or their users.
From what I can tell, the real reason Mathesar requires an extra schema is that it makes life easier for us — the Mathesar developers. With the exception of a few summarization functions, there seems to be nothing fundamental to Mathesar’s functionality that should necessarily require this extra schema and the permissions to create/update it. Rather, it’s just that we’ve chosen this architecture because it was the easiest way to build this product. While I don’t think that’s ideal, I do think it’s okay. I imagine it will behoove us to justify our schema by candidly explaining that it’s an architectural choice which has allowed us to quickly build a stable product.
Above, I said that “offering the greatest possible functionality with the permissions we have” should be a principle that we strive for. But why?
One obvious reason is that it allows people to minimize their Mathesar permissions, which is good for security and such. But there’s another reason too: it steers us towards more modular product design, which ultimately leads to more powerful features.
The product design for our UI Types seems to have been firmly rooted within the paradigm of installing stuff on the target DB. You want a type that Postgres doesn’t have? Ok, we’ll install one!
But let’s consider what a “type” means from a user’s perspective. Here’s what I can come up with:
Now let’s look at the defining characteristics for the custom types that we currently have:
| | Text-based formatting | HTML formatting | Validation | Icon | Name |
|---|---|---|---|---|---|
| Money | ✅ ⁽¹⁾ | | | ✅ ⁽⁴⁾ | ✅ ⁽⁵⁾ |
| Email | | | ✅ ⁽³⁾ | ✅ ⁽⁴⁾ | ✅ ⁽⁵⁾ |
| URI | | ✅ ⁽²⁾ | ✅ ⁽³⁾ | ✅ ⁽⁴⁾ | ✅ ⁽⁵⁾ |
I don’t think it would be too difficult to give users control over the constituent behaviors above. Here’s a hypothetical five-step plan, which addresses each ✅ above. (The superscript numbers map to the list items below):
1. Extend our display options for Number types to include a “Units” string and a “Units Position” (“Start” or “End”). This would allow us to handle the special text-based formatting necessary for Mathesar’s current Money type.

   Added benefit: users would also be able to apply currency formatting to types within pre-existing databases.

2. Extend our display options for text types to automatically turn URLs into hyperlinks. This would allow us to handle the special HTML behavior for URI types.

   Added benefits: users could get hyperlinks in plain text if they want, and they could turn off the HTML hyperlinks in the URI type while retaining the validation.

   Going further: we could process email addresses too, giving some additional (and configurable!) functionality to the Email type.

3. Allow users to set their own CHECK constraints. We could have an assortment of “off-the-shelf” constraints available for them to pick from, for example: URI, email, hexadecimal color, UK postcode, etc. We could also allow the user to specify a regex. (See the sketch just after this list.)

4. The type icon admittedly gets a bit tricky. I see the following approaches:

5. The name is easier than the icon because I don’t think it would be particularly important for us to mirror that name back to the user after creating the column. For example, if I create a “Money” column and then subsequently see that column’s data type as “Number”, I think that’s okay. We could consider ways of adapting the column-creation UX to make it clear to users that creating a Money column is actually assembling a pre-made set of smaller building blocks.

   Names are useful insofar as they allow users to quickly set up a new column with a common combination of behaviors. We could codify that behavior within the service layer so that when a user tells Mathesar to create a new “URI” column, Mathesar sets a CHECK constraint to validate it and sets the metadata to display it as a hyperlink. Further down the road, we could even allow users to “define their own types” by choosing their own combination of settings and storing that combination in metadata associated with a database or schema. Maybe we’d call it a “Column Template” or something. Or we could have a mechanism to copy-paste type config and metadata from column to column. Sometimes I do this in a spreadsheet when I’ve set up some custom validation or conditional formatting.
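To make item 3 concrete, here’s a minimal sketch of what an “off-the-shelf” email constraint might look like under the hood (the table, column, and regex are illustrative, not a proposed implementation):

```sql
-- Minimal sketch of an "off-the-shelf" email CHECK constraint.
-- Table, column, and regex here are illustrative only.
ALTER TABLE "Patrons"
  ADD CONSTRAINT patrons_email_format
  CHECK ("Email" ~ '^[^@]+@[^@]+\.[^@]+$');
```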
My point here is that if we had designed our type system under a paradigm of “don’t install things on the target DB”, I think we may have ultimately arrived at more powerful features — especially for the use cases where Mathesar is connecting to pre-existing DBs. Even if we forge ahead with Dom’s plan, I think we can still keep these concerns of mine in mind as we design new features, aiming for modularity and graceful degradation.
It’s worth noting one downside I see to the approach above compared to our current approach: an array of emails currently has each element validated, whereas (if I understand correctly) we’d need an entirely different CHECK constraint to validate an array of emails without a custom type. This is a limitation, but a fairly minor one in my opinion. An approach like the above seems overall better suited to working with pre-existing databases.
Perhaps the most important reason that I’d like to avoid installing things on the target DB is to lay an architectural foundation which would leave a viable path open for eventually supporting non-Postgres database systems.
At the start of this thread, Brent seemed to preempt this retort by saying:
> This discussion is dependent on our previous decision to assume we’re only supporting PostgreSQL for the foreseeable future. I’d like to take that for granted during this discussion. If team members are not willing to grant that, we should immediately stop this discussion and rehash that one instead.
Brent, what previous discussion are you referring to? Almost a year ago, you opened a discussion about “Extending reliance on PostgreSQL”. Is that what you’re referring to? Or is there a different discussion? I’ll acknowledge that my concerns in this email would have been much better placed within that thread last September, and I apologize that I didn’t raise them at that point. At that time, dependence on Postgres seemed like a foregone conclusion to me, so I silently acquiesced. But since then I’ve had private conversations with Kriti in which she’s expressed interest in eventually supporting non-Postgres systems, so I’ve become more confused on this topic and more eager to speak up. As a team it seems like we’re clear that in the short term we’re only supporting Postgres. But it’s not clear to me that we’re unified on a long term vision of only ever supporting Postgres.
I predict that architectural choices like “more logic in the DB layer” will forestall us ever directly supporting other RDBMSs, because it will just be too much work. I’m excited about the potential of FDWs to expand compatibility, but for systems like SQLite and DuckDB, one of the coolest things is that you don’t need to run any server software. A hypothetical desktop Mathesar would benefit from being able to access such user databases directly because, as Brent said: “the server that we’d have the most trouble changing in the back end is the PostgreSQL server”.
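For context, here’s a rough sketch of what the FDW route could look like, using the third-party sqlite_fdw extension (the extension’s option names and availability are assumptions on my part; paths and table names are invented):

```sql
-- Rough sketch: exposing a SQLite file through Postgres via sqlite_fdw.
CREATE EXTENSION sqlite_fdw;
CREATE SERVER sqlite_srv FOREIGN DATA WRAPPER sqlite_fdw
  OPTIONS (database '/path/to/user_data.sqlite');
CREATE FOREIGN TABLE books (id int, title text)
  SERVER sqlite_srv OPTIONS (table 'books');
-- Postgres (and therefore Mathesar) can now query the SQLite data directly.
SELECT title FROM books LIMIT 5;
```

Note that this still requires a running Postgres server in front of the SQLite file, which is exactly what a hypothetical desktop Mathesar would want to avoid.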
I raised a similar concern when we discussed OIDs vs names several months ago. In that thread Dom said:
> Losing portability across RDBMS vendors by going with OIDs would make me sad as well; that seems like a big concern; I’d like to brush it off by saying that we’re already committed to only-Postgres, but I’m not sure that’s true, though my insights here are more limited than I’d like them to be; I’d like to discuss this more.
From a user’s perspective, compatibility with other database systems is a make-or-break feature in so many cases, and we are very unlikely to hear complaints from such users because they will write us off immediately. Hopefully FDW will alleviate the majority of that friction, but I’m not 100% confident in it.
Since it seems like we’re very unlikely to reconsider the “logic in db-layer” architecture, I don’t want to get into the weeds too much on how we would do it, but I did have some thoughts in response to the three options that Brent laid out. Brent’s Option 2 is what I’d personally lean towards:
> Rewrite those functions using some SQL-generating framework that’s more rudimentary than SQLAlchemy, and doesn’t need to reflect the whole DB. We still have not found one that has the features we need to handle JSON, postgres-specific window functions, recursive CTEs, etc. It’s not possible to do this completely without reflecting the DB, since many functions have branching based on the current DB state.
For the reflection concern, I’d be curious to hear some specific examples of circumstances where we’d need to reflect DB state for branching logic. I can see how we need to translate OIDs into names for a lot of operations. Is that mostly what we’re talking about? Are there other important examples? Even still, I would imagine that we’d be able to run that reflection within a transaction so as to preserve a stateless service layer. It would require an extra round trip from the service layer to the DB layer, but that doesn’t seem like a huge problem to me as I would expect latency to typically be low between those two layers. Are we concerned about the performance of that round trip?
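To make that round trip concrete, here’s a rough sketch of both steps running inside one transaction (the OID and object names are invented):

```sql
BEGIN;
-- Round trip 1: reflect the current names for the OIDs involved
-- (16384 is an example OID).
SELECT c.relname, n.nspname
FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.oid = 16384;
-- Round trip 2: the service layer branches on that result and issues the
-- actual operation, e.g.:
ALTER TABLE public."Patrons" RENAME COLUMN "Email" TO "Email Address";
COMMIT;
```

Because both statements share a transaction, the reflection travels with the operation rather than being cached, so the service layer itself stays stateless.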
For the library concern, I hear you on this. I’m curious if you’ve looked in the direction of libraries that help with templating SQL instead of building queries structurally. In a way, this approach sits somewhere between Brent’s option 1 and option 2. For example:

Granted, some of those libraries are not Python-specific, but with this being a low-level part of our application, I would imagine we might have some opportunity to easily call into a lower-level language.
Libraries like these seem to have different approaches to mitigating SQL injection, likely with different degrees of safety. I agree that this is a concern, and one argument for moving logic into the DB layer. But it doesn’t seem like an “unsolvable” problem to me. Perhaps I’m being too optimistic though.
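For what it’s worth, here’s a minimal sketch of the kind of injection mitigation I mean, using psycopg2’s sql module as one example of the genre (table and column names invented):

```python
# Minimal sketch: compose identifiers safely, bind values separately.
from psycopg2 import sql

query = sql.SQL("SELECT {col} FROM {tbl} WHERE {col} ~ %s").format(
    col=sql.Identifier("Email"),    # rendered as a quoted identifier
    tbl=sql.Identifier("Patrons"),  # rendered as a quoted identifier
)
# The regex arrives at execute-time as a bound parameter, never via string
# concatenation:
# cur.execute(query, (r'@example\.com$',))
```

Identifiers and literals get escaped through different code paths, which is the core of how these libraries avoid injection without a full query-builder like SQLAlchemy.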
Just to be clear: I don’t want to drag this out. The back end team should ultimately have more weight when deciding whether or not to move more logic into the DB layer, and the fact that we have broad consensus there makes me wary of even voicing these opinions of mine. I actually began composing this response after reading Brent’s notes in the meeting agenda which felt much more open-ended and had no replies yet. I’m happy to go with the flow!
Some scattered thoughts:
I wanted to get these thoughts out there although they’re not very well organized. Feel free to ask clarifying questions if anything I said didn’t make sense.
Thanks for these thoughts, Kriti. To me they are very well-organized!
> I’d rather end up with e.g. “Mathesar for Postgres” and “Mathesar for SQLite” that play into the strengths of each database backend rather than a single product that only focuses on the shared abstracted out features of Postgres and SQLite.
I find this to be the most compelling part of your message.
Given your response, I see the decision to install our own schema on the target DB as finalized, and I respect that. I appreciate that this thread has laid a recurrent debate to rest with an archive of our decision process.
I have some follow-up thoughts which I hope will be useful in honing our strategy going forward.
> Installing things on the DB allows us to create a more unique and useful product for some users, even if it means we are not viable for other users.
In the abstract sense, I can see how this could be true, but I don’t have a clear picture of these “unique features”. I think it will be important for us to precisely identify those features we’re able to uniquely offer due to our strategy of tight integration with the DB. The better we can identify those unique features, the better we can:
We have features like our “Money” type which incidentally have come to depend on our tight DB integration. But for the purpose of the above communication, I’m most interested in identifying features that would otherwise not be feasible without our tight DB integration. The Money type is, to me, not a compelling example of such a feature because (as I described in my previous email) I can see a straightforward path towards implementing that feature with looser DB integration.
Here’s my understanding thus far of the “unique features” we’re able to offer by virtue of our tight DB integration:
- Our special summarization functions: `percentage_true`, `peak_time`, and `peak_month`.

  But these do not make for a good showcase because they are esoteric functions that I imagine most users would never need. Plus, in theory, I think we could eventually offer comparable functionality through a more robust formula system (see the sketch just below this list).

- Maybe something about casting functions during import and column type alteration?

  But I don’t understand exactly what we’re gaining there due to our tight DB integration.
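For instance, here’s a rough sketch (table and column names invented) of how a `percentage_true`-style summarization could be expressed with a plain SQL expression instead of an installed function:

```sql
-- Sketch: percentage_true as a plain aggregate expression.
SELECT count(*) FILTER (WHERE "Checked Out") * 100.0
         / nullif(count(*), 0) AS percentage_true
FROM "Checkouts";
```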
I haven’t included any type-specific logic in the list above, because I still don’t see any which would otherwise not be feasible without our tight DB integration. My previous email demonstrated a path towards replicating some of our type-specific behavior through expanded use of metadata instead of logic installed on the user’s DB.
Kriti pointed out some additional type-specific functionality of which I was not previously aware — “functions”. Thanks! With our Library sample data, I now see how I can use this feature to learn that `johnson.com` is the most common email domain among all patrons. Neat! I’ve actually had a real use-case for this sort of thing before. Circa 2010 I was redesigning an HTML email newsletter template for a client, and I used a query like this to get a rough sense of which email service providers we needed to prioritize supporting with different email formatting features. I would also sometimes perform bounce-rate analysis on a per-domain basis.

But I don’t see a good argument for this particular feature necessarily being dependent on tight DB integration. The expression `split_part("Email", '@', 2)` extracts the domain of an email address. Similar expressions could utilize `regexp_replace` to extract various parts of a URI. A well-rounded design which exposes this feature to the user would require some more UX thought, but to me it seems perfectly viable from a technical perspective. Beyond being viable, I contend that, in light of our new focus on “users with pre-existing Postgres DBs”, this is exactly the sort of feature that should not depend on our custom types. Most pre-existing DBs are bound to already have columns that store email addresses. We can — and should — work with those types to the greatest extent possible. I’m curious if I’m missing something here, though. Is there an important reason why these special functions require our custom types? I can imagine that the performance of filtering and grouping might be superior when applied to a type which is properly configured to index composite parts like “domain”. But if that’s actually a significant concern, I’d be curious to benchmark it.
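For illustration, the kind of query I mean would be something like this (assuming a “Patrons” table in the Library sample data):

```sql
-- Most common email domains among patrons, via a plain expression.
SELECT split_part("Email", '@', 2) AS domain,
       count(*) AS patrons
FROM "Patrons"
GROUP BY domain
ORDER BY patrons DESC
LIMIT 10;
```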
Kriti and Dom, you both seem to be arguing that a strategy of tight DB integration will allow us to build more of these unique features in the future. I can see, theoretically, how this could be true, so I hope you’re right.
For a long time we’ve been straddling a difficult line between two broad product directions: “pre-existing-DBs” vs “Mathesar-only-DBs”, and more recently I’ve been advocating publicly and privately that focusing on one of these directions will bring clarity to important architectural decisions. Now I’m delighted that we’ve finally narrowed our focus! Had we chosen “Mathesar-only-DBs”, I’d be much more inclined to move logic into the DB! But tight DB integration seems somewhat at odds with our new focus on “pre-existing-DBs”.
As we move forward with an architecture of tight DB integration, I’m sure we’ll face many more opportunities to move logic into the DB layer. Maintainability and DX concerns may sometimes prove to be reason enough to do so. But I’d like to emphasize my plea for caution: just because we can build some of our functionality on top of our tight DB integration doesn’t necessarily mean we should. I think it’ll be important to evaluate our options from the perspective of a user installing Mathesar to connect to a pre-existing database. If there is some meaningful benefit that our tight DB integration can offer to that user, then I’d like us to find a way to showcase it!
I just want to quickly point out temporary functions as a potential brainstorming idea, which could help with the concerns around permissions and installing things on the user database.
We are installing two things on the user database: custom types and Mathesar functions.

While custom types are optional, we need Mathesar functions on the user database to run Mathesar. So for users concerned about permissions, we could install all the functions into the `pg_temp` schema at runtime. Compared to inlining the SQL functions, this is much easier, although there could be a performance cost. To reduce some of that overhead, we could scan the function being executed and build a list of its dependent functions, installing only those. Since these functions are temporary, the user does not have to create any schema, which reduces friction.
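As a rough sketch (the function name here is hypothetical, not an actual Mathesar function):

```sql
-- Install a helper into the session-local pg_temp schema at connection time;
-- nothing persists after the session ends.
CREATE FUNCTION pg_temp.msar_table_name(tab_oid oid) RETURNS text AS $$
  SELECT relname FROM pg_class WHERE oid = tab_oid;
$$ LANGUAGE sql STABLE;

-- Temporary functions must be called schema-qualified:
SELECT pg_temp.msar_table_name(16384);  -- 16384 is an example OID
```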
+1 to the architecture Brent described. I have also been thinking of Mathesar as having several interfaces, not just the UI, and I think tighter DB integration helps us offer a unified set of features across all these interfaces. I think offering unique features not possible without DB integration is one advantage of DB integration, but even if most features would be possible to have in the UI in other ways, it’s not as easy to have a consistent set of features across the UI / API / DB library without tight DB integration.
We should be looking for users who would benefit from multiple interfaces, since we see that as one of our strengths. If we are unable to find users who see this approach as a benefit, then we can rethink this assumption. This might help with the niche research.
Also to be clear, when I said this:
> I’d rather end up with e.g. “Mathesar for Postgres” and “Mathesar for SQLite” that play into the strengths of each database backend rather than a single product that only focuses on the shared abstracted out features of Postgres and SQLite.
I did not intend this as an argument for or against tight integration with the DB. I just don’t think multi-DB support is an argument for zero integration with the DB, since there are (arguably) better ways to achieve multi-DB support.
Sean said:
> But I’d like to emphasize my plea for caution: just because we can build some of our functionality on top of our tight DB integration, doesn’t necessarily mean we should.
I agree with this.
This is why I think it’s very important to make Mathesar work with different levels of permissions (DDL, DML-only, and read-only). I think this will give us a natural constraint against making too many things dependent on specific DB configurations. It will also address a common user concern, so that’s good too.
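For concreteness, the tiers could map to grants roughly like this (role and schema names are placeholders):

```sql
-- Read-only tier:
GRANT SELECT ON ALL TABLES IN SCHEMA public TO mathesar_read_only;

-- DML-only tier:
GRANT SELECT, INSERT, UPDATE, DELETE
  ON ALL TABLES IN SCHEMA public TO mathesar_dml;

-- The DDL tier additionally needs to create and alter objects:
GRANT USAGE, CREATE ON SCHEMA public TO mathesar_ddl;
```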