Please recommend detailed tutorial on cross function

1,254 views
Skip to first unread message

Jonathon Paarlberg

unread,
Nov 11, 2015, 12:02:22 PM11/11/15
to OpenRefine
Hi all. Can anyone recommend a very detailed tutorial on OpenRefine's cross function? I have never used the function before, but I need it to count the number of matching values between some similarly named columns on separate OpenRefine projects (i.e., separate data tables).

I might also want to use it to list the values from match-associated fields.

Thad Guidry

unread,
Nov 11, 2015, 10:02:19 PM11/11/15
to openrefine
Our docs show 2 example projects where a column is being used to lookup the value in another column from a different project :
On Wed, Nov 11, 2015 at 11:02 AM, Jonathon Paarlberg <lion...@gmail.com> wrote:
Hi all. Can anyone recommend a very detailed tutorial on OpenRefine's cross function? I have never used the function before, but I need it to count the number of matching values between some similarly named columns on separate OpenRefine projects (i.e., separate data tables).

I might also want to use it to list the values from match-associated fields.

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Owen Stephens

unread,
Nov 12, 2015, 4:29:19 AM11/12/15
to OpenRefine
As well as the documentation mentioned by Thad, I'd also recommend a blog post by Tony Hirst that I've found useful - it is slightly broader than just the 'cross' functionality, but it does include the cross function

An alternative (which I've started to use more and more) is the VIB-BITS OpenRefine extension https://www.bits.vib.be/index.php/software-overview/openrefine
This extension has lots of additional functionality, including a new function which will appear in the column drop down to "Add column(s) from other projects..." this wraps the cross function in an easy to use interface where you pick the OR project you want to use, pick which column you are matching on, and choose one or more columns you want to add to your project. There are also nice options for dealing with multiple rows in the second project matching a single row in your current project.

I'd highly recommend the VIB-BITs extension for this and although I still handcraft 'cross' functions sometimes I think this is as much habit as anything else.

Owen

Jonathon Paarlberg

unread,
Nov 12, 2015, 8:18:11 AM11/12/15
to OpenRefine
Thanks, Thad.

Jonathon Paarlberg

unread,
Nov 12, 2015, 8:39:23 AM11/12/15
to OpenRefine
Thanks, Owen. The VIB-BITS extension does look very attractive, but I doubt I can start using it right away as it may be considered a new software by my company; if so, it would have to go through some checks before I'm allowed to use it. Also, I'm trying to create resuable JSON scripts for cleaning up tables that will be shared as part of a procedure that's passed off to someone else. It would seem a bit awkward to ask the next person who uses the procedure to first install the extension. (I will ask about installing the extension in the future though. Now's not the right time.)

I do appreciate the link to the tutorial about merging datasets with OpenRefine. I will definitely have a look as that will be part of the task I have in front of me. I have to take seven tables of raw data and bring them together or digest them into related tables according to values in 36 columns; no one table includes all 36.

Jonathon Paarlberg

unread,
Nov 12, 2015, 8:48:01 AM11/12/15
to OpenRefine
Hmm. The cross function adds a column including the value looked up on a match. What if there's more than one match? Would I have to choose which value from an associated field to include in the output? Or, can I include them all in an array? What I'd actually like to do is count the number of matches and output the count next to the value in the table that has more unique values.

For example, suppose I have 1000 unique Protocol ID numbers in one table and 800 unique Protocol ID numbers in the other table. Of course, what I want to know is how many ID's in one table can be found in the other.

Jonathon Paarlberg

unread,
Nov 12, 2015, 8:50:06 AM11/12/15
to OpenRefine
"Of course, what I want to know is how many ID's in one table can be found in the other."

And, how many hits for each ID.

Owen Stephens

unread,
Nov 12, 2015, 8:56:08 AM11/12/15
to OpenRefine
That makes sense

Just to say that VIB-BITS extension does create the reusable JSON in exactly the same way doing the same work using GREL directly - so if you were able to use it in the future then you wouldn't loose this ability to reuse the JSON, and the JSON would work on any OpenRefine install - it wouldn't need the VIB-BITS extension installed to work.

Owen

Owen Stephens

unread,
Nov 12, 2015, 8:58:12 AM11/12/15
to OpenRefine
The cross function returns an array of rows from the other project - one row for each match it finds in the file. If all you want to know is 'is there one or more hit' you can simply use the 'length' function to get a count of the number of matching rows in the other project.

I often do this using a cross function in a Custom Text Facet rather than by adding a new column - although you can do the same thing either way

Owen

Owen Stephens

unread,
Nov 12, 2015, 9:04:31 AM11/12/15
to OpenRefine
To extract one or more values from the result of a cross you have to add some supporting stuff around it in the GREL - essentially to go through the returned array and find the content of the column you want:

forEach(cell.cross("Project name","Match col"),r,r.cells["Col name"].value).join("|")

For example - this extracts the value from a col in the secondary project. If there is just one matching row, you get a single value. If there are multiple matches you get pipe separated list.

Alternatively as mentioned above you can use something like:

cell.cross("Project name","Match col").length()

to just give the number of matches - 0 = no matches, 1 = one match, etc.

Owen

Jonathon Paarlberg

unread,
Nov 12, 2015, 12:18:55 PM11/12/15
to OpenRefine
Oh, now that's interesting.

Jonathon Paarlberg

unread,
Nov 12, 2015, 12:22:09 PM11/12/15
to OpenRefine
Cool! I think
cell.cross("Project name","Match col").length()

in a custom facet is exactly what I need this time.

Many thanks to all.

Jonathon Paarlberg

unread,
Nov 18, 2015, 9:07:21 AM11/18/15
to OpenRefine
Using the
forEach(cell.cross("Project name","Match col"),r,r.cells["Col name"].value).join("|")

method, I get "Cannot retrieve field from null" whenever there's a null value present in the looked up column. It's easy to simply filter on that "Cannot retrieve" phrase and then transform them all to null values, but it's getting pretty tedious considering that there must be a way to prevent generating them to begin with. (Also, I've eventually got to generate reusable JSON scripts and I'd like to make them fairly straightforward.) Any suggestions?

Thanks for your help. I really appreciate it.


On Thursday, November 12, 2015 at 9:04:31 AM UTC-5, Owen Stephens wrote:

Thad Guidry

unread,
Nov 18, 2015, 10:08:42 AM11/18/15
to openrefine
Maybe instead of forEach, you can use forNonBlank() ... or one one our other controls...along with forEach(), if it gets you there.

Read through the documentation to understand how the different controls operate and can even be used within one another...or wrapped, if it helps.


Message has been deleted

Jonathon Paarlberg

unread,
Nov 18, 2015, 12:39:13 PM11/18/15
to OpenRefine
Beautiful.

Experimenting for a few minutes with forNonBlank, embedded in my forEach statement led to this GREL, which works great when using "Add column based on this column":

forEach(cell.cross("Foreign Table","Foreign Match Field"),r,forNonBlank((r.cells["Imported Field"].value),v,v,"")).join("|")

I have used forNonBlank before, but not embedded like this.

That solution not only prevented me from having to convert values generated by GREL-internal errors to null, but it also prevented the generation of very many empty rows on the subsequent multi-value cell split.

Thanks!
Reply all
Reply to author
Forward
0 new messages