GREL to join multi-valued cells?

84 views
Skip to first unread message

Timothy R. Mendenhall

unread,
Aug 13, 2019, 2:44:09 PM8/13/19
to openr...@googlegroups.com
Hi there,

Is there a way to use the All column --> Transform to join multi-valued cells in all columns of an OpenRefine project?  In running tests on a single column, I can use GREL on multi-valued cells with an expression like row.record.cells['ColumnName'].value.join(';') but this doesn't actually join the values, it just creates duplicate sub-arrays.  Any help would be appreciated!

Ryan Mendenhall
--
Timothy Ryan Mendenhall (he/him/his)
Metadata Librarian
Columbia University Libraries
Original and Special Materials Cataloging
102 Butler Library
535 West 114th Street
New York, NY 10027



Ettore Rizza

unread,
Aug 13, 2019, 2:53:03 PM8/13/19
to openr...@googlegroups.com
Hi Timothy,

Could you provide a very basic example of your data, just to be sure we are talking about the same thing? 

Thanks!
--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine/CAH%3DR%2BfafEa1yNA1jtVG%3D0F2%3DtS6W-TgoPmytMsngd9cWZHavtQ%40mail.gmail.com.

Timothy R. Mendenhall

unread,
Aug 13, 2019, 3:06:30 PM8/13/19
to openr...@googlegroups.com
Hi there,

With the URI column as a key, this is a sample of some data that I was testing.

-Timothy Mendenhall

EmailListQuestion.tsv

Ettore RIZZA

unread,
Aug 13, 2019, 5:02:07 PM8/13/19
to openrefine
Re Timothy, sorry for the delay.

Let's simplify the problem a little bit, it could be of interest to everyone here.

You have a dataset that looks like this:

screenshot-127.0.0.1_3333-2019.08.13-21_50_21.png
And you'd like to get that...

screenshot-127.0.0.1_3333-2019.08.13-21_51_05.png
...but without having to click on "join Multi-valued cells" in each column, since your real dataset has 18 and could contain many more.

This is a perfectly legitimate request, probably very common, but surprisingly difficult to perform in OpenRefine (at least it seems difficult to me, but maybe I'm missing an obvious solution).

The main reason for this difficulty is that a column name must always be explicitly mentioned when using row.record.cells["The_Column_Where_I_Want_To_Do_Something"], which prevents it from being used with All --> Transform.

I do have a workaround, but it involves duplicating the entire dataset into a single column, exporting that single column to CSV using "Custom Tabular exporter", and then recreating a project from that CSV. This is complicated and certainly not a viable solution.

I'll think about it a little bit and wait to see if anyone has a better idea before explaining in detail the procedure I'm talking about.

For developers: an obvious solution would be to add "join/split Multivalued cells" in the "All" tab. The other would be to create a new "thisColumn" variable inside the cells variable that would allow an operation to be performed only in the column where you are, for example row.record.cells.thisColumn.value.join(';'). But I don't know if the data model allows it.

Best,

Ettore Rizza


Owen Stephens

unread,
Aug 13, 2019, 5:25:51 PM8/13/19
to OpenRefine
I agree with Ettore that it would be good to have a way of doing 'join multi-valued cells' across a project (although I have some concerns that there is some possibility of combining cells unintentionally - I guess that is always a risk with using join multi-valued cells operation).

I think potential solutions might be better discussed via a github issue for the feature - or possibly two features based on Ettore's suggestion. However, for the purposes of this discussion I'll stick to what I think is the best/easiest solution at the moment - which is basically to create an Operation History JSON with the relevant steps in it, and use Undo/Redo->Apply to apply this.

The JSON you need for this project would be:

[
  {
    "op": "core/multivalued-cell-join",
    "columnName": "DeprecatedHeading",
    "keyColumnName": "URI",
    "separator": ";",
    "description": "Join multi-valued cells in column DeprecatedHeading"
  },
  {
    "op": "core/multivalued-cell-join",
    "columnName": "RDFS_Type",
    "keyColumnName": "URI",
    "separator": ";",
    "description": "Join multi-valued cells in column RDFS_Type"
  },
  {
    "op": "core/multivalued-cell-join",
    "columnName": "inScheme",
    "keyColumnName": "URI",
    "separator": ";",
    "description": "Join multi-valued cells in column inScheme"
  },
  {
    "op": "core/multivalued-cell-join",
    "columnName": "AltLabel",
    "keyColumnName": "URI",
    "separator": ";",
    "description": "Join multi-valued cells in column AltLabel"
  },
  {
    "op": "core/multivalued-cell-join",
    "columnName": "ChangeID",
    "keyColumnName": "URI",
    "separator": ";",
    "description": "Join multi-valued cells in column ChangeID"
  },
  {
    "op": "core/multivalued-cell-join",
    "columnName": "BroaderTerm",
    "keyColumnName": "URI",
    "separator": ";",
    "description": "Join multi-valued cells in column BroaderTerm"
  },
  {
    "op": "core/multivalued-cell-join",
    "columnName": "Label",
    "keyColumnName": "URI",
    "separator": ";",
    "description": "Join multi-valued cells in column Label"
  },
  {
    "op": "core/multivalued-cell-join",
    "columnName": "Note",
    "keyColumnName": "URI",
    "separator": ";",
    "description": "Join multi-valued cells in column Note"
  },
  {
    "op": "core/multivalued-cell-join",
    "columnName": "Change",
    "keyColumnName": "URI",
    "separator": ";",
    "description": "Join multi-valued cells in column Change"
  },
  {
    "op": "core/multivalued-cell-join",
    "columnName": "ChangeDate",
    "keyColumnName": "URI",
    "separator": ";",
    "description": "Join multi-valued cells in column ChangeDate"
  },
  {
    "op": "core/multivalued-cell-join",
    "columnName": "Creator",
    "keyColumnName": "URI",
    "separator": ";",
    "description": "Join multi-valued cells in column Creator"
  },
  {
    "op": "core/multivalued-cell-join",
    "columnName": "ChangeSubject",
    "keyColumnName": "URI",
    "separator": ";",
    "description": "Join multi-valued cells in column ChangeSubject"
  },
  {
    "op": "core/multivalued-cell-join",
    "columnName": "Related",
    "keyColumnName": "URI",
    "separator": ";",
    "description": "Join multi-valued cells in column Related"
  },
  {
    "op": "core/multivalued-cell-join",
    "columnName": "Narrower",
    "keyColumnName": "URI",
    "separator": ";",
    "description": "Join multi-valued cells in column Narrower"
  },
  {
    "op": "core/multivalued-cell-join",
    "columnName": "Broader",
    "keyColumnName": "URI",
    "separator": ";",
    "description": "Join multi-valued cells in column Broader"
  },
  {
    "op": "core/multivalued-cell-join",
    "columnName": "Example",
    "keyColumnName": "URI",
    "separator": ";",
    "description": "Join multi-valued cells in column Example"
  }
]

There are various ways you could generate the JSON, but I did it by writing the following into a GREL transform:

'['+forEach(row.columnNames.slice(1),cn,'
  {
    "op": "core/multivalued-cell-join",
    "columnName": "'+cn+'",
    "keyColumnName": "URI",
    "separator": ";",
    "description": "Join multi-valued cells in column '+cn+'"
  }
]').join(",")

Then rather than actually doing the transform I just copied the output from one row in the preview window. This is a little hacky, but it was quick way of getting the output I needed. There are (I know) more elegant ways, but it gives the output required and would work for any project with any number of columns, as long as you adjusted the keyColumnName value to match the relevant key column for the records in your project.

Owen



Hi Timothy,

To unsubscribe from this group and stop receiving emails from it, send an email to openr...@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openr...@googlegroups.com.


--
Timothy Ryan Mendenhall (he/him/his)
Metadata Librarian
Columbia University Libraries
Original and Special Materials Cataloging
102 Butler Library
535 West 114th Street
New York, NY 10027



--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openr...@googlegroups.com.

Bradley Ferguson

unread,
Aug 13, 2019, 5:32:16 PM8/13/19
to OpenRefine
I would do the operation on the first column, extract it from the history, then open a text editor to copy/paste/modify the operation for the rest of my columns.

With the mouse, the list of column names, and Ctrl-C/Ctrl-V it wouldn't be too tedious.

This list could then be applied to the dataset.

I agree that collapsing all records into a single row should be a built-in operation.

Brad

Bradley Ferguson

unread,
Aug 13, 2019, 5:36:55 PM8/13/19
to OpenRefine
I see that Owen beat me to the punch. The abuse of the GREL preview is a good trick, too. (thumbs up)

Brad

Owen Stephens

unread,
Aug 13, 2019, 5:46:30 PM8/13/19
to OpenRefine
:)

If there is a way to use GREL when another solution would do it better or more elegantly - I've probably written that GREL!
But joking aside,
forEach(row.columnNames,cn,<do something with cn>)

is probably one of my favourite OpenRefine tricks - for anyone not familiar I recommend having a look at https://github.com/OpenRefine/OpenRefine/wiki/Recipes#7-columns on the wiki

Owen

Thad Guidry

unread,
Aug 13, 2019, 5:55:41 PM8/13/19
to openr...@googlegroups.com
This is an old enhancement ask...

The "collapsing records" is called "Flatten" in other ETL programs and Database operations.
We have "Flatten Records" as an old enhancement request SOMEWHERE... and when found, probably needs some design love.
It would be analogous to our current Transpose.



--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.

Ettore Rizza

unread,
Aug 13, 2019, 8:38:39 PM8/13/19
to OpenRefine
Yes, good job! I also think that's the best solution. But it requires (if only to think about it) a thorough knowledge of how OpenRefine works.

Note (for the sake of completeness): when the elements of a record are perfectly aligned upwards, like the record 1 in this screenshot...

screenshot-127.0.0.1_3333-2019.08.14-02_19_17.png



It is possible to use a All --> Transform with this small Python/Jython script to get the expected results. It's also possible with Grel, but more confusing.

your_separator = ";" #change this separator if you want

for column in row.columnNames:
     if value == row.record.cells[column].value[0]:
         return your_separator.join([str(i) for i in row.record.cells[column].value])



screenshot-127.0.0.1_3333-2019.08.14-02_20_56.png


 A simple "Facet by blank" would then allow to clean the empty rows. But, as you can see, if the record elements are not correctly aligned (like in record 2), the result will no longer be usable. 
Reply all
Reply to author
Forward
0 new messages