The future of the records mode

Antonin Delpeuch

Dec 18, 2019, 2:04:13 AM

to openr...@googlegroups.com

Hi all,

This is a discussion I have been thinking about initiating for a while now, and Antoine's recent feedback on the blank down operation motivated me to write it up, so there you go: let's talk about the records mode!

First, it is true that this mode and the associated operations (blank down, fill down) are not very well documented at the moment. But even if we had good documentation, this sort of feature should be understandable from the tool itself - and at the moment even relatively seasoned users can struggle with it. This is a discussion I have had informally with a number of users at training workshops: a lot of people are not too sure what the records mode does. Some have an intuitive understanding but are not confident enough to explain it to others. Last year even Jacky filed an issue (https://github.com/OpenRefine/OpenRefine/issues/1546) where he was confused by the records mode being turned on automatically after project creation (and that after years of contributing to the project).

Second, this is a half-baked feature in many ways. It can only faithfully represent hierarchical data up to a single depth level. Column groups were introduced to deal with deeper structures, but they are an even more cryptic feature that does not play well with a lot of operations. Column groups are only created by certain importers and are not editable by users (https://github.com/OpenRefine/OpenRefine/issues/93). Moving columns around breaks column groups, as they rely on the column order (just like the records mode itself).

Third, this mode is massively inefficient, since it requires grouping rows on the fly to compute operations and facets, display the project in the UI… This adds a big overhead which is easy to notice when working on large projects with hierarchical data: compare the speed of the UI when working in rows mode and in records mode, the difference is really significant. The scalability improvements I plan to make in 2020 will be hard to apply to the records mode for the same reason.
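To make the "grouping rows on the fly" point concrete, here is a minimal Python sketch of how rows could be folded into records. It is not OpenRefine's actual implementation: it assumes the simple rule that a row with a non-blank cell in the first (key) column starts a new record and that rows with a blank key cell belong to the record above. The names `is_blank` and `group_into_records` are hypothetical.

```python
from typing import List, Optional

Row = List[Optional[str]]


def is_blank(cell: Optional[str]) -> bool:
    """A cell counts as blank when it is None or contains only whitespace."""
    return cell is None or cell.strip() == ""


def group_into_records(rows: List[Row], key_index: int = 0) -> List[List[Row]]:
    """Group consecutive rows into records (assumed rule, see lead-in).

    A row with a non-blank key cell starts a new record; a row with a blank
    key cell is attached to the record above it. A blank first row still
    starts a record so that no row is dropped.
    """
    records: List[List[Row]] = []
    for row in rows:
        if not is_blank(row[key_index]) or not records:
            records.append([row])
        else:
            records[-1].append(row)
    return records


# Hypothetical sample data: one person per record, one email per row.
rows = [
    ["Alice", "alice@example.org"],
    [None, "alice@work.example"],
    ["Bob", "bob@example.org"],
    ["", "bob@home.example"],
    [None, "bob@work.example"],
]

records = group_into_records(rows)
record_sizes = [len(record) for record in records]
print(record_sizes)  # [2, 3]
```

In this model the record boundaries are not stored anywhere: they depend only on row order and on which key cells are blank, so the full pass over the rows has to be repeated whenever either of those changes.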

Fourth, the records mode makes working with facets quite subtle. Each record can generate multiple values in a facet, so one has to think carefully about what a particular facet configuration selects. For instance, in records mode with a single facet, clicking on "invert" will generally not invert the set of selected records.
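The "invert" remark can be illustrated with a small Python model. This is a simplified sketch, not OpenRefine code: it assumes that a record is selected as soon as any of its rows matches the facet, and that "invert" flips the row-level test rather than the set of selected records. The record names and values are made up.

```python
from typing import Dict, List, Set

# Each record is modelled as the list of values its rows hold in the
# faceted column (hypothetical data).
records: Dict[str, List[str]] = {
    "record 1": ["red"],
    "record 2": ["red", "blue"],
    "record 3": ["blue"],
}


def select(records: Dict[str, List[str]], chosen: Set[str],
           invert: bool = False) -> Set[str]:
    """Return the names of the records with at least one matching row.

    Assumption: a row matches when its value is in `chosen`; with
    invert=True a row matches when its value is NOT in `chosen`.
    """
    selected = set()
    for name, values in records.items():
        if any((value in chosen) != invert for value in values):
            selected.add(name)
    return selected


normal = select(records, {"red"})                 # records 1 and 2
inverted = select(records, {"red"}, invert=True)  # records 2 and 3
complement = set(records) - normal                # record 3 only

print(sorted(inverted))    # ['record 2', 'record 3']
print(sorted(complement))  # ['record 3']
```

Under these assumptions "record 2" is selected both before and after inverting, because it contains a matching and a non-matching row, which is why the inverted selection is not the complement of the original one.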

Fifth, the records mode has played the role of a "rudimentary grouping feature" that users have been relying on to work around the lack of proper support for grouping. That's bad: we should have proper support for grouping and aggregating! Similarly, it can also be used to break out of OpenRefine's data model: add a column to the left of all others with a non-blank value only in the first cell, switch to the records mode, and you can now run an expression which can access the entire table via the "record" variable. This is a hack: if users want to have access to the entire table from a scripting language, why not give them that directly? We could very well have an operation where the user would be able to manipulate the entire table in Python via pandas, or in R, for instance (https://github.com/OpenRefine/OpenRefine/issues/1226).
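As a rough idea of what explicit grouping and aggregating could mean, here is a minimal sketch in plain Python (standard library only, so neither pandas nor any OpenRefine API is involved). The function name `group_and_aggregate` and the sample data are hypothetical. The point of the sketch is that groups are formed by the value of a key column, not by row position or blank cells.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical rows: (category, item, quantity).
rows: List[Tuple[str, str, int]] = [
    ("fruit", "apple", 3),
    ("vegetable", "carrot", 5),
    ("fruit", "pear", 2),
]


def group_and_aggregate(rows: List[Tuple[str, str, int]]) -> Dict[str, Dict[str, int]]:
    """Group rows by category and compute a count and a total per group."""
    quantities = defaultdict(list)
    for category, _item, quantity in rows:
        quantities[category].append(quantity)
    return {
        category: {"count": len(values), "total": sum(values)}
        for category, values in quantities.items()
    }


summary = group_and_aggregate(rows)
print(summary["fruit"])  # {'count': 2, 'total': 5}
```

Because the grouping key is explicit, reordering the rows does not change the result, whereas a positional model depends on the rows of a group staying adjacent.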

So I'd be keen to rethink grouping support in OpenRefine, potentially phasing out the records mode eventually (of course, in the very long term - this is a big breaking change and we need to make sure users are on board with any alternative before doing that). Even without any grouping support, I think we could remove a lot of the use cases for the records mode by adding proper support for storing hierarchical objects like arrays or even JSON objects in a single cell (https://github.com/OpenRefine/OpenRefine/issues/2128). The hierarchical importers (XML, JSON) could also be changed to import data in a single column and/or use XPath / JSON Pointer (https://github.com/OpenRefine/OpenRefine/issues/2201) to derive other columns from the hierarchical data. There are still workflows that I am not completely sure how to handle (for instance, how do you reconcile a list of values nested in a JSON record?) but I think it should be possible to come up with a solution that would be more principled and better integrated than the current records mode.
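The idea of keeping a JSON object in a single cell and deriving other columns from it with JSON Pointer could look roughly like the Python sketch below. It is only an illustration under stated assumptions: cells are modelled as JSON strings, `resolve_pointer` is a small hand-written resolver following RFC 6901 (it does not reject array indices with leading zeros), and `derive_column` is a hypothetical helper, not an existing OpenRefine operation.

```python
import json
from typing import Any, List


def resolve_pointer(document: Any, pointer: str) -> Any:
    """Resolve an RFC 6901 JSON Pointer against a parsed JSON document.

    Raises KeyError, IndexError or ValueError when the pointer is malformed
    or does not lead to a value.
    """
    if pointer == "":
        return document  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError(f"invalid JSON Pointer: {pointer!r}")
    current = document
    for raw_token in pointer[1:].split("/"):
        # Unescape in the order required by RFC 6901: "~1" first, then "~0".
        token = raw_token.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            if not (token.isascii() and token.isdigit()):
                raise ValueError(f"invalid array index: {token!r}")
            current = current[int(token)]
        elif isinstance(current, dict):
            current = current[token]
        else:
            raise KeyError(token)
    return current


def derive_column(cells: List[str], pointer: str) -> List[Any]:
    """Build a new column by applying the pointer to each JSON cell.

    Cells that cannot be parsed or where the pointer does not resolve
    yield None (a blank cell).
    """
    values = []
    for cell in cells:
        try:
            values.append(resolve_pointer(json.loads(cell), pointer))
        except (KeyError, IndexError, ValueError):
            values.append(None)
    return values


# Hypothetical column whose cells each hold one JSON object.
cells = [
    '{"name": "Alice", "emails": ["alice@example.org", "alice@work.example"]}',
    '{"name": "Bob", "emails": []}',
]

names = derive_column(cells, "/name")
first_emails = derive_column(cells, "/emails/0")
print(names)         # ['Alice', 'Bob']
print(first_emails)  # ['alice@example.org', None]
```

In this model the nested data stays intact in its own column, and the derived columns are ordinary flat columns that do not rely on row order or blank cells.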

I am keen to hear your thoughts on this!

Antonin

Thad Guidry

Dec 18, 2019, 10:32:51 AM

to openr...@googlegroups.com

Been onboard for changing this for 10 years. :-)

Agree on proper direct support of hierarchical objects. Oracle and Postgres can sing JSON natively... we should too, as I've stated time and again ;-)

Agree "Records mode" has been a very light hack introduced to cover only 2 use cases at the time it was invented (Sorry David!). Our users have greater needs with real "grouping and aggregating".

I am all for phasing out the current "Records mode" and beginning to introduce better grouping functions for creating and manipulating Records.

Thad

https://www.linkedin.com/in/thadguidry/

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine/1534475405.293242.1576652650821%40email.ionos.fr.

Timothy R. Mendenhall

Dec 18, 2019, 11:03:23 AM

to openr...@googlegroups.com

Hi there,

I agree that records mode is problematic. However, my colleagues and I do rely on it for several tasks around reconciling multi-valued cell data (and parsing out various aspects, e.g. labels and URIs, from the reconciliation values) and keeping the source and new data together in a logical group. I also use the rows/records distinction as a hack/workaround to create an adjacent row and copy data from the source row into the new row.

--Ryan Mendenhall

Columbia University

--

Timothy Ryan Mendenhall (he/him/his)

Metadata Librarian

Columbia University Libraries

Original and Special Materials Cataloging

102 Butler Library

535 West 114th Street

New York, NY 10027

trm...@columbia.edu

(212) 851-2452

Antoine Beaubien

Dec 18, 2019, 6:40:03 PM

to OpenRefine

I love this discussion and think this is an important point. Row/record modes aren't very intuitive in the interface.

My first thought reading this was about grouping the values in more than 1 column, which is the current record model.

But, I fear that changing that, although interesting and probably inevitable in the future, may not be the next step. Why? Because we could already deal with multi-grouping by using secondary projects with a mix of « list (group) » in a cell.

Being able to pull data in and out of a project, either in WD or SQL, seems very powerful, and is something that would not take much effort compared with the usefulness it gives. Also, being able to describe the relations between rows could enable many interesting interface improvements.

Also, the ability to target and manipulate statements seems a must to me. There are some actions (delete/modify) that I can't do from OR and must use QS to perform, when they could be done from OR…

I know I'm new to OpenRefine, but I still used it to push around 8000 items of movie data, and I see how some things could be much easier. That being said, I think the topic Antonin raises here is important.

magdmartin

Jan 2, 2020, 7:55:46 AM

to OpenRefine

We count a lot of non-programmers in our user base. Being able to work in records mode enables them to do operations that are not possible with regular spreadsheet software. In the last few years, our users have been working more often with JSON and hierarchical data (vs. CSV). I agree with the shortcomings of OpenRefine listed previously and that we should better support hierarchical data.

Antonin, to complete your analysis, we can also include several requests for a more robust JSON exporter to

* Support multi-level nested JSON (https://stackoverflow.com/questions/31328001/openrefine-working-with-templating-to-export-json-as-records);

* Recreate original JSON(s) with changed values (https://github.com/OpenRefine/OpenRefine/issues/1897)
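To give an idea of what exporting records as nested JSON involves, here is a minimal Python sketch. It is not the templating exporter and makes several assumptions: rows are dictionaries, a non-blank key cell starts a new record, and each record is exported as one object holding a list of its child values. The function `rows_to_nested` and the sample data are hypothetical, and only one level of nesting is handled.

```python
import json
from typing import Any, Dict, List, Optional

# Hypothetical rows in a records-style layout: a blank title means the row
# belongs to the record above.
rows: List[Dict[str, Optional[str]]] = [
    {"title": "Project A", "member": "Alice"},
    {"title": None, "member": "Bob"},
    {"title": "Project B", "member": "Carol"},
]


def rows_to_nested(rows: List[Dict[str, Optional[str]]], key: str,
                   child: str, list_name: str) -> List[Dict[str, Any]]:
    """Turn record-style rows into a list of objects with a nested list."""
    nested: List[Dict[str, Any]] = []
    for row in rows:
        # A non-blank key cell (or the very first row) starts a new object.
        if row[key] is not None or not nested:
            nested.append({key: row[key], list_name: []})
        # Every non-blank child cell is appended to the current object.
        if row[child] is not None:
            nested[-1][list_name].append(row[child])
    return nested


nested = rows_to_nested(rows, key="title", child="member", list_name="members")
exported = json.dumps(nested, indent=2)
print(exported)
```

Deeper nesting, or writing changed values back into the original JSON structure, would need more information than the flat rows carry in this model, which is part of why these requests are harder than they look.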

Martin
