Project: Enable Preprocessing of Grouped columns

27 views
Skip to first unread message

Aritra Majumder

unread,
Jul 18, 2023, 11:54:56 AM7/18/23
to Mathesar Developers
Dear Mathesar developers,
I hope you're all doing great. I've written a draft for the project "Enable Preprocessing of Grouped Columns". It's a little lengthy, but I tried my best to not make it boring.

Motivation behind this project: This project aims to enable a few preprocessing options, which would be a decorator on top of the summarization functions, i.e., we will be able to preprocess a grouped column before applying the summarization functions. (Please see the use cases.)

Use cases: There might be a lot of use cases. I will describe two to make it brief.
  1. A person tracks all his transactions using Mathesar. Now he wants to find out which month the transactions occurred most frequently. He doesn't have access to the month directly, but to the timestamps. If we allow him to preprocess timestamps to extract months and then apply mode aggregation, he will be able to do that.
  2. A traveler uses Mathesar to track all the countries she visited over different years. Now she wants to "distinct list aggregate" all the countries grouped by years but finds out that there are very similar instances of names of countries appearing in the list (e.g., "India" and "india"). In this specific case, if she had been able to preprocess the country column to convert every letter to lower case, she would have gotten a better result.
Proposal: First of all, here are some points:
  • The preprocessing will not be persistent, as it only exists in the context of summarization. The expected SQL for the second use case might be: ARRAY_AGG(DISTINCT LOWER("Column"))
  • Not all the columns will have the same preprocessing options. (e.g., Truncate to month will only be available for the Date and Date-time columns.)
  • Even the list of available summarization functions will change based on preprocessing functions applied. Currently, available summarizations depend on the abstract type of the column, and preprocessing the same will change the abstract type, which explains the former.
  • In general, multiple preprocessing steps can be applied to a single column, but we can keep it simple for now by enabling only one level of preprocessing.
Considering these, I intend to make following changes:
  1. Add an extra dropdown to let the users select what type of preprocessing they want to apply.
  2. Enable preproc in aggregation_expressions in the API structure of queries/run. (Preproc will be optional so that it doesn't break the current behavior.)
  3. If preproc is enabled, Apply the preproc function first to the column, and then apply the aggregation function to the updated column.
Now I would like to ask you all: What are some preprocessing functions we might want to add?

Some available preproc functions are:
  1. Truncate to day ( Considers year and month as well).
  2. Truncate to month (Considers year as well).
  3. Truncate to year.
  4. Extract URI schema.
  5. Extract URI authority.
  6. Extract the email domain.
There are a few I want to suggest:
  1. Truncate to day of week.
  2. Truncate to (only) month.
  3. Convert to lower case.
  4. Convert to upper case.
  5. Add or subtract a number.
I kindly request Brent, Sean, and Pavish to review the proposal and assess if everything I mentioned is feasible to implement (i.e., breaks no/minimal other functionalities).


Regards,
Aritra


Kriti Godey

unread,
Jul 18, 2023, 5:06:56 PM7/18/23
to Aritra Majumder, Mathesar Developers
Hi Aritra,

Thanks for this write-up! I have a few questions:

(1) Is this a part of your GSoC work, or are you proposing an additional project?
(2) I'm a little confused about where the extra dropdown you're proposing would go. Will it be part of the data explorer or the table page?
(3) Why do we need an extra dropdown at all? Why can't we integrate these preproc functions into the current UX (since we already have preproc functions)? I'd like to avoid adding additional options to the UI unless absolutely necessary because having too many options can create usability issues.

As far as the project goes, our general workflow for approving new projects is to write up a project page (see this page for a list of projects and the template) and then the entire core team discusses and approves (either via email or in a meeting). Once you've worked out more details, we can follow that process.

Let me know if you have any questions or thoughts.

Kriti

Brent Moran

unread,
Jul 19, 2023, 2:21:00 AM7/19/23
to Mathesar Developers, Aritra Majumder
For some context:
During a meeting for the summarization project, Sean pointed out (and I tend to agree) that we may not get the highest value for the summarization feature by just continuing to add more aggregations, since we're already experiencing diminishing returns on that. It may be more valuable to add some preprocessing options for aggregated columns instead.

@Kriti Godey
Regarding your question (3), we currently don't have a UI for preprocessing on the aggregated column, but only the grouping column. E.g.., you might want to aggregate a URL column to a distinct of domains (URL Authority). Right now, you'd have to frame that as two separate transformations, rather than a single summarization. If you want to _group_ by URL Authority, that's possible:

Screenshot 2023-07-19 at 14-16-36 Data Explorer public Mathesar.png

The idea is to add a UI element to do the same sort of manipulation (when appropriate) to the summarized columns as well, then add some preprocessing function options. Here's the same columns, but with the URL column as a summarized rather than grouping column:
image.png

@Aritra Majumder Please make sure you're also clear on this; your write-up makes it seem like you're a bit fuzzy on which things we can/can't preprocess currently.


Kriti Godey

unread,
Jul 19, 2023, 11:30:48 AM7/19/23
to Brent Moran, Mathesar Developers, Aritra Majumder
I'm really confused about all the terminology here, especially because the terms used in the API and UI are so different. It would help if there was a basic wireframe that showed what changes this project is proposing to the product (doesn't have to be fancy).

Aritra Majumder

unread,
Jul 19, 2023, 1:03:45 PM7/19/23
to Kriti Godey, Brent Moran, Mathesar Developers

Hi, everyone. 
Sorry for being opaque with all my descriptions. I want to clarify the details through this.

First of all, @Kriti Godey
(1) Yes, this should be a part of my GSoC project (as the project is closely related to the previous work)
(2) The extra dropdown will only be added to the data-explorer page. (I've made a wireframe. Apologies! As it isn't even close to Mathesar's data-explorer page)
(3) Currently, We have preproc functions available only for grouping column(s). We want to make it available for summarized columns as well.

I have already explained why we might need some preprocessing options for summarized columns. Again, we do have a few preprocessing options for grouping columns (Brent attached some images of the same). 

We want to add preprocessing options for the summarized (grouped) column as well, and we would like to add an extra dropdown (that will have a few preprocessing options) in the data-explorer page beside the dropdown of summarization options.

In the following image (which somewhat resembles the data-explorer page), I have shown where the new dropdown might be placed.


image.png

Thank you. If you have any questions, please let me know.

@Kriti Godey, I will write up a project page ASAP.

Regards,
Aritra

Mailtrack Sender notified by
Mailtrack
19/07/23, 22:30:58

Kriti Godey

unread,
Jul 20, 2023, 4:30:32 PM7/20/23
to Aritra Majumder, Brent Moran, Mathesar Developers
Thank you Aritra, this is very helpful in helping me understand the work you're proposing.

Adding preprocessing options for summarized columns does seem like a good feature idea, but some design / product work on how best to convey these options to the user seems warranted.  I'm worried about adding more UI elements to the Data Explorer, mainly because of usability – giving more options to the user puts more cognitive load on the user and can make the product as a whole harder to understand. There have already been concerns raised about the current usability of the Data Explorer (see this email) and I don't want to make the problem worse.

I'm not sure if we have the capacity for design / product work related to this right now, so please check with Brent / Sean about that.

Let me know if you have any questions.
Reply all
Reply to author
Forward
0 new messages