This is a discussion I have been thinking about initiating for a while now, and Antoine's recent feedback on the blank down operation motivated me to write it up, so there you go: let's talk about the records mode!
First, it is true that this mode and the associated operations (blank down, fill down) are not very well documented at the moment. But even with good documentation, this sort of feature should be understandable from the tool itself, and at the moment even relatively seasoned users can struggle with it. I have discussed this informally with a number of users at training workshops: a lot of people are not too sure what the records mode does. Some have an intuitive understanding but are not confident enough to explain it to others. Last year even Jacky filed an issue (
https://github.com/OpenRefine/OpenRefine/issues/1546) where he was confused by the records mode being turned on automatically after project creation (after years of contributing to the project).
Second, this is a half-baked feature in many ways. It can only represent hierarchical data faithfully up to a single level of depth. Column groups were introduced to deal with deeper structures, but they are an even more cryptic feature that does not play well with a lot of operations. Column groups are only created by certain importers and are not editable by users (
https://github.com/OpenRefine/OpenRefine/issues/93). Moving columns around breaks column groups as they rely on the column order (just like the records mode itself).
Third, this mode is massively inefficient, since it requires grouping rows on the fly to run operations, compute facets, display the project in the UI… This adds a big overhead which is easily noticed when working on large projects with hierarchical data: compare the speed of the UI in rows mode and in records mode, and the difference is really significant. The scalability improvements I plan to make in 2020 will be hard to apply to the records mode for the same reason.
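To make the grouping step concrete, here is a minimal Python sketch of how rows could be grouped into records. It is a simplified model and not OpenRefine's actual implementation: it assumes a record starts at every row whose first ("key") cell is non-blank and ignores column groups. The names `is_blank` and `group_into_records` are made up for illustration.

```python
from typing import List, Optional

Row = List[Optional[str]]


def is_blank(cell: Optional[str]) -> bool:
    """A cell counts as blank if it is missing or contains only whitespace."""
    return cell is None or cell.strip() == ""


def group_into_records(rows: List[Row], key_index: int = 0) -> List[List[Row]]:
    """Group consecutive rows into records (hypothetical helper).

    Assumption: a non-blank cell in the key column starts a new record,
    and rows with a blank key cell belong to the record above them.
    """
    records: List[List[Row]] = []
    for row in rows:
        if not records or not is_blank(row[key_index]):
            records.append([row])      # start a new record
        else:
            records[-1].append(row)    # continue the current record
    return records


rows = [
    ["Alice", "apple"],
    [None, "pear"],
    ["Bob", "apple"],
    ["Carol", "kiwi"],
    ["", "plum"],
]
records = group_into_records(rows)
print([len(record) for record in records])  # [2, 1, 2]
```

Because the grouping depends only on row order and on which key cells are blank, it has to be recomputed whenever rows are reordered, filtered or edited, which is where the overhead comes from.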
Fourth, the records mode makes working with facets quite subtle. Each record can generate multiple values in a facet, so one has to think carefully about what a particular facet configuration selects. For instance, in records mode with a single facet, clicking on "invert" will generally not invert the set of selected records.
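A small Python sketch can illustrate the "invert" problem. It assumes a record is selected as soon as at least one of its rows matches the facet, which is my reading of the behaviour and not a quote from the code. The data and the `select` helper are made up.

```python
# Each record is modelled as the list of values it contributes to the facet.
records = {
    "Alice": ["apple", "pear"],
    "Bob": ["apple"],
    "Carol": ["kiwi"],
}


def select(records, predicate):
    """Return the names of records with at least one value matching the predicate."""
    return {
        name
        for name, values in records.items()
        if any(predicate(value) for value in values)
    }


chosen = {"apple"}

# Facet selection "apple": records with at least one "apple" row.
selected = select(records, lambda value: value in chosen)

# Inverted facet: records with at least one row that is NOT "apple".
inverted = select(records, lambda value: value not in chosen)

# What a user probably expects from "invert": all the other records.
complement = set(records) - selected

print(sorted(selected))    # ['Alice', 'Bob']
print(sorted(inverted))    # ['Alice', 'Carol']
print(sorted(complement))  # ['Carol']
```

Under this assumption Alice is selected both before and after inverting, because her record contains a matching row and a non-matching row.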
Fifth, the records mode has played the role of a "rudimentary grouping feature" that users have been relying on to work around the lack of proper support for grouping. That's bad: we should have proper support for grouping and aggregating! Similarly, it can also be used to break out of OpenRefine's data model: add a column to the left of all others with a non-blank value only in the first cell, switch to the records mode, and you can now run an expression which can access the entire table via the "record" variable. This is a hack: if users want to have access to the entire table from a scripting language, why not give them that directly? We could very well have an operation where the user would be able to manipulate the entire table in Python via pandas, or in R, for instance (
https://github.com/OpenRefine/OpenRefine/issues/1226).
So I'd be keen to rethink grouping support in OpenRefine, potentially phasing out the records mode eventually (of course, only in the very long term: this is a big breaking change and we need to make sure users are on board with any alternative before doing that). Even without any grouping support, I think we could remove a lot of the use cases for the records mode by adding proper support to store hierarchical objects like arrays or even JSON objects in a single cell (
https://github.com/OpenRefine/OpenRefine/issues/2128). The hierarchical importers (XML, JSON) could also be changed to import data in a single column and/or use XPath / JSON Pointer (
https://github.com/OpenRefine/OpenRefine/issues/2201) to derive other columns from the hierarchical data. There are still workflows that I am not completely sure how to handle (for instance, how do you reconcile a list of values nested in a JSON record?) but I think it should be possible to come up with a solution that would be more principled and better integrated than the current records mode.
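As a rough illustration of the single-column idea, the Python sketch below stores one JSON object per cell and derives other columns with JSON Pointer (RFC 6901). None of this is an existing OpenRefine API: `resolve_pointer` is a hypothetical helper, it is a simplified resolver, and returning None for a missing path is my own choice (the RFC treats it as an error).

```python
import json


def resolve_pointer(document, pointer):
    """Resolve a JSON Pointer against parsed JSON (hypothetical, simplified helper).

    Returns None when the path does not exist in the document.
    """
    if pointer == "":
        return document
    if not pointer.startswith("/"):
        raise ValueError("a JSON Pointer must be empty or start with '/'")
    current = document
    for token in pointer[1:].split("/"):
        # Undo the RFC 6901 escapes: "~1" stands for "/", "~0" for "~".
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            if not (token.isascii() and token.isdigit()):
                return None
            index = int(token)
            if index >= len(current):
                return None
            current = current[index]
        elif isinstance(current, dict):
            if token not in current:
                return None
            current = current[token]
        else:
            return None
    return current


# One JSON object per cell, all in a single column.
cells = [
    '{"name": "Alice", "fruits": ["apple", "pear"]}',
    '{"name": "Bob", "fruits": ["apple"]}',
]

# Derive three columns from the JSON column.
derived = []
for cell in cells:
    parsed = json.loads(cell)
    derived.append({
        "name": resolve_pointer(parsed, "/name"),
        "first_fruit": resolve_pointer(parsed, "/fruits/0"),
        "second_fruit": resolve_pointer(parsed, "/fruits/1"),
    })

print(derived[1])  # {'name': 'Bob', 'first_fruit': 'apple', 'second_fruit': None}
```

Each row stays self-contained here, so no grouping by row order is needed, but it also shows the open question: the nested list of fruits is not directly available as cells to reconcile.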
I am keen to hear your thoughts on this!
Antonin