GREL for creating abbreviations from full names

52 views
Skip to first unread message

Steve

unread,
Nov 29, 2017, 9:06:01 PM11/29/17
to OpenRefine
I have a column of author names that are in one of the following formats:

Lastname, First
Lastname, First M
Lastname, First MM
Lastname, FM

I would like to change all of them to:

Lastname, F
or
Lastname, FM
or
Lastname, FMM

I have tried a number of GREL scripts, but am new to this and can't quite get what I need. Does anyone have any ideas? Thank you!

Ettore Rizza

unread,
Nov 30, 2017, 7:12:54 AM11/30/17
to OpenRefine
Hi Steve, 

based on your examples, it looks like it's enough to delete all the lowercase letters and blank spaces in the first names part (after the comma). You can do that like this:



value.split(", ")[0] + ", " + value.split(", ")[1].replace(/[a-z ]/, '')


Hope this helps,

Ettore

Owen Stephens

unread,
Nov 30, 2017, 7:39:05 AM11/30/17
to OpenRefine
Hi Steve

Bascially names are a pain in the ... neck. The problem is that even if the names follow a consistent set of patterns like you've outlined here, you can still get small but significant (in terms of what you want to do) variations within the names. So 

Bertillon, Marie-Claude

is a different pattern to both

Bertillon, Phillipe
AND
Bertillion, Chloé

Ettore's solution works well for names like Bertillon, Phillipe, but wouldn't work with Bertillon, Marie-Claude or Bertillion, Chloé

So - for example - if any of the names in your set contain accented or non-latin characters you may wish to try the following variation on Ettore's solution:

value.split(", ")[0] + ", " + value.split(", ")[1].replace(/[\p{Ll} ]/, '')

The:
\p{Ll}
in the replace statement picks up accented and non-latin lowercase letters, which [a-z] doesn't. This 'replace' is using a "regular expression" - if you haven't come across regular expressions before I'd recommend http://regex.bastardsbook.com as a starting point - but basically they are a way of matching patterns in text strings (not unique to OpenRefine, but used widely in programming and other software)

However, this approach across the board would lead to Bertillion, Marie-Claude being replaced with Bertillion, M-C which may not be what you want.

A good workflow in OpenRefine is to filter your project to narrow down to a single set of problems which can be tackled together. Usually you do this using a facet or a filter to get to get all the rows that follow a particular pattern/exhibit a particular problem, then use a transformation to fix them. This is often in balance with the complexity of the transformation you need to write - that is to say, the more varied problems you try to fix with a single transformation, the more complex that transformation has to be.

In this case you could try applying a text filter to make sure you only transform names which will be successfully transformed by your transform statement. For example, if you create Text Filter with the case-sensitive and regular expression boxes checked, and use the following expression:

^.+,( ?\p{Lu}\p{Ll}*)+$
This will find all cells that match the pattern -> Start with any number of characters followed by a comma, then a repeated pattern of: optional space, uppercase char, optionally any number of lowercase chars. Anything that matches this pattern can be transformed using the GREL suggested by Ettore.

If you then 'Invert' this filter (you need to be using OpenRefine 2.8 for this - it's a new feature introduced in 2.8) you can see all the rows that might not work well with Ettore's suggestion, and then try to work out other transformations to fix these variants.

Owen

Ettore Rizza

unread,
Nov 30, 2017, 7:55:47 AM11/30/17
to OpenRefine
@Owen : Excellent explanation that deserves to be added to the official documentation!

But as I said above, my formula only applies to the given examples. ;) Hence the importance of providing a sample of the dataset as representative as possible. I wonder if it should not be mentioned in bold somewhere on the Google Group.

Steve

unread,
Nov 30, 2017, 11:14:29 AM11/30/17
to OpenRefine
Wow, thanks very much to both of you! This program continues to amaze me.

I implemented Owen's modified code, but realized I have another issue. Many of the first names have the last letter capitalized because of an operation I did in a much earlier workflow to capitalize the second letter of many of the abbreviated names that were lowercase. (Lastname, Fm --> Lastname, FM). Unfortunately, this throws a wrench in Owen's code:


I'm thinking what I need to do is to filter the data to isolate records that look something like this: (Lastname, FirsT M), make the T lowercase, and then run Owen's code.

I've included a sample data set this time! Thanks again to you for your willingness to help and explain the steps that you're doing. I'm learning a lot from each post.
Names_Sample.csv
Auto Generated Inline Image 1

Owen Stephens

unread,
Dec 2, 2017, 4:36:11 AM12/2/17
to OpenRefine
Hi Steve,

One of the best things about OpenRefine is the ability to undo steps at any point through the project. My first instinct in this case would be to rewind the project to the point where that capitalisation happened, fix that step, then reapply all the subsequent steps. To make sure you don't lose anything you can take a copy of the project first by clicking Export->Export Project.

Then I would:
  1. Open the Undo/Redo panel
  2. Click Extract
  3. First check only the steps that followed the one that introduced the error - copy the text from the righthand panel into a text editor
  4. Close the extract window
  5. Now in the Undo/Redo panel, rewind the project to just before the erroneous step
  6. Work out what went wrong, and re-write that transform so it doesn't introduce that capitalisation
  7. Now in the Undo/Redo panel click Apply
  8. Paste in the text you previously copied (in step 3)
Now you should have your project with all the same transforms applied but without the incorrect capitalisation. 

I can see that you've also got some issues with missing capitals in hyphenated surnames - you could take a similar approach to fixing this problem.

This is what I'd recommend. However, if you just want to work from where you are, I think the following would work:
value.split(" ")[0]+" "+forEach(value.split(" "),v,v.match(/(\p{Lu}+).*/).join("")).get(1,value.split(" ").length()).join("")

This is a bit complicated - but basically it splits the string on spaces, extracts the first element (surname + comma), then adds back the leading uppercase letters in each of the remaining parts of the string

Owen

Steve

unread,
Dec 2, 2017, 5:03:20 PM12/2/17
to OpenRefine
Perfect, thanks so much Owen.

All the best,
-Steve

Stephen Chignell

unread,
Dec 2, 2017, 7:51:43 PM12/2/17
to openr...@googlegroups.com, Owen Stephens
Amazing, thanks so much Owen. I feel like I've learned so much in the past couple of days that I can apply to my next OR project.

-Steve
--
You received this message because you are subscribed to a topic in the Google Groups "OpenRefine" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/openrefine/wHAmS0vkKrg/unsubscribe.
To unsubscribe from this group and all its topics, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--
Stephen Chignell
Department of Ecosystem Science and Sustainability
Colorado State University
Fort Collins, CO 80523
steve.c...@colostate.edu
Google Scholar Profile
Reply all
Reply to author
Forward
0 new messages