Joining of record file rows into one row

FRAZER NYAMBE

unread,

Jul 27, 2022, 11:59:37 AM7/27/22

to OpenRefine

I am having a file that contains a lot of records in it and when I export these records as .XSL format, one record is split into multiple rows in an Excel file. I would like to join the rows for each particular record into one row Using OpenRefine, such that each row should contain one record's details.

Owen Stephens

unread,

Jul 27, 2022, 12:38:29 PM7/27/22

to OpenRefine

This is definitely something that you can do in OpenRefine, but the detail may depend on the starting structure of the data. Can you share some examples of what the data originally looks like (either real data or some example data structured in the same way - whichever is easier).

If you can provide this information I can then provide advice on how to achieve the outcome you need

Owen

FRAZER NYAMBE

unread,

Aug 3, 2022, 8:43:09 AM8/3/22

to openr...@googlegroups.com

This is the XML file with multiple records I am working with.

--
You received this message because you are subscribed to a topic in the Google Groups "OpenRefine" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/openrefine/Ss-JoVM_VpM/unsubscribe.
To unsubscribe from this group and all its topics, send an email to openrefine+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine/61418e67-62c9-4cb5-9d9c-86e3ae845a41n%40googlegroups.com.

CONFIDENTIAL NOTICE: This message (and its attachments) may contain confidential information. Any unauthorized use or disclosure is prohibited.
If you have received this communication in error, please notify us immediately by responding to this email and then delete it from your system.

1.xml

Owen Stephens

unread,

Aug 3, 2022, 9:42:58 AM8/3/22

to OpenRefine

Thanks

The answer may depend on how you want to handle multi-value fields in the CSV. As an example the first record has:

<record>
<header>
<identifier>oai:dspace.unza.zm:123456789/4153</identifier>
<datestamp>2019-08-19T12:30:47Z</datestamp>
<setSpec>com_123456789_289</setSpec>
<setSpec>col_123456789_290</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:doc="http://www.lyncode.com/xoai" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Morphological characterisation of low and high oil sunflower(Hellanthus Annuus. L.)Varieties for use in marker assisted selection</dc:title>
<dc:creator>Chinyundo, Anthony</dc:creator>
<dc:subject>Helianthus Annuus. L.</dc:subject>
<dc:subject>Sun flower oil</dc:subject>
<dc:subject>Cooking oil</dc:subject>
<dc:description>Morphological characterization etc. etc.</dc:description>
<dc:date>2015-11-11T13:39:13Z</dc:date>
<dc:date>2015-11-11T13:39:13Z</dc:date>
<dc:date>2015-11-11</dc:date>
<dc:type>Other</dc:type>
<dc:identifier>http://dspace.unza.zm/handle/123456789/4153</dc:identifier>
<dc:language>en</dc:language>
<dc:format>application/pdf</dc:format>
</oai_dc:dc>
</metadata>
</record>

From this data, in OpenRefine you'll end up with a record that has two values in the 'setSpec' column, three values in the 'dc:subject' column and three values in the 'dc:date' column (the column names will be longer than this if you've just used the names generated by the XML import, but hopefully this is clear)

So I get an example like this

So the question is what the CSV should look like for this data? Because XML is a complex hierarchical format that can have nested elements, it doesn't neatly map to CSV - this requires decision making on the part of the person creating /specifying the CSV.

In this case you can join each of these records into a single row by using "Edit Cells -> Join multi-valued cells" menu option in the various columns - you will have to decide what character you want to use as a separator. If I do this I get:

You can see that this record is now on a single row but with the cells for dc:date and dc:subject (and offscreen also setSpec) having multiple values. If this is what you need, that's fine, but you might want to make other decisions at this point - e.g. having multiple columns for the subject headings (can be done using Edit column -> Split into several columns). I also think the data is a bit odd - some of the multiple values seem like they are lacking context - for example what are each of those dc:date values actually meant to show? Why are there three of them?

If you look across all the records you'll find that different fields are repeated in different records - some have multiple creators, some multiple descriptions, some multiple identifiers etc. etc. You'd have to use the 'join multi-valued cells' option on all the relevant columns to ensure all records are on a single line

Hope this is of some help

Best wishes

Owen

FRAZER NYAMBE

unread,

Aug 5, 2022, 3:34:33 PM8/5/22

to openr...@googlegroups.com

Thank you for the information, it's exactly what I was looking for...

And then how would one get to identify count records that contain for example ("dc:date", "dc:description") after the "join multivalued" ?

To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine/03423230-3d17-47c1-a018-401373fffbden%40googlegroups.com.

Owen Stephens

unread,

Aug 8, 2022, 5:20:44 AM8/8/22

to OpenRefine

If you have a list of joined values in a cell like:

2015-03-04T07:36:13Z|2015-03-04T07:36:13Z|2015-04-14T10:40:01Z

You can get a count by using a GREL like:

value.split("|").length()

This splits the list into an array and takes the length.

If you want to count unique values only you can use:

value.split("|").uniques().length()

I usually use these expression in a "Custom Text Facet" but you could also add another column based on this expression if you want the count in it's own column

Owen

Reply all

Reply to author

Forward