How to merge rows of records?

1,670 views
Skip to first unread message

Jens Willmer

unread,
Jun 6, 2017, 5:41:34 AM6/6/17
to OpenRefine
I have two data sources with different values. I managed to find an identifier in both so that I could create records out of the data. Now I have records with 2 rows that I like to merge into one row. The following picture shows how I imagine the merge should work. 

Is this easily possible? What are my options?

 


Owen Stephens

unread,
Jun 6, 2017, 7:14:03 AM6/6/17
to OpenRefine
I'd approach this by using 'Merge multi-valued cells'. This will insert the text from cells in the lower line into the cells from the upper line, using a separator you specify.
You will then have a single row per record, but some cells will contain multiple values. You can decide how to prune this but you might use something like (I'm assuming the use of the pipe | separator when you joined the cells):

value.split("|").uniques()[0]

Jens Willmer

unread,
Jun 6, 2017, 7:32:38 AM6/6/17
to OpenRefine
I have around 60 columns. Do I have to do this for each column or can I apply it to all?

Owen Stephens

unread,
Jun 6, 2017, 9:08:57 AM6/6/17
to OpenRefine
As far as I know at the moment you'd have to apply it to each column in turn. The 'Transform All' option that has been posted about on this forum recently (but is not yet part of an OpenRefine release) might change this in the future.

One option to save you time is to carryout the operation on a single column, then 'Extract' the action via the Undo/Redo tab. You could then duplicate the relevant part of the JSON, but replacing the column name details with subsequent columns in turn - then you end up with a set of JSON that you can implement through the Apply action that does the same step to all the columns. This is obviously a work around, and while it might prove quicker than doing each column one at a time, it is still clunky.

There might be some time saving steps depending on exactly your requirements. In your example you seem to suggest you are only interested in taking data from subsequent rows when the relevant cell in the first cell is empty? If this is the case you could write a custom text fact that looks for blank cells in the same row as the record id. e.g.

if(isNonBlank(cells["ID"].value),forEach(row.columnNames,cn,if(isBlank(row.cells[cn].value),cn,false)),false)

If you use this as a custom text facet in Row mode, I think this will give you a facet which is a list of all the columns where there is at least one row with a blank in the first record line, which may at least narrow down the columns you need to carryout the process on. Once you've done all these, you can simply delete all the rows that don't have an ID in the ID column.

Owen

Ettore Rizza

unread,
Jun 6, 2017, 9:22:49 AM6/6/17
to OpenRefine
What about using this Grel formula to create an new column, and then "split into several columns" it by || ? 

forEach(
forEach
(row.columnNames, e, row.record.cells[e].value), c, c.join(" ")
).join("||")

If I am not wrong, this will give a "flat horizontal" version of your records (as in the screenshot), which you can then export using Custom tabular exporter.



Owen Stephens

unread,
Jun 6, 2017, 9:25:19 AM6/6/17
to OpenRefine
Nice :)

Jens Willmer

unread,
Jun 6, 2017, 9:39:21 AM6/6/17
to OpenRefine
Thx for the idea and the facet. I think this is the way to go at the moment.

aurielle perlmann

unread,
Jul 30, 2018, 7:42:19 PM7/30/18
to OpenRefine
thanks! this saved me!
Reply all
Reply to author
Forward
0 new messages