Split family names based on first word with capital

442 views
Skip to first unread message

Ger Dijkstra

unread,
Dec 19, 2016, 2:52:54 PM12/19/16
to OpenRefine
Hello,

I am trying to split up a column with family names into two separate columns (intraposition, family name). The intrapositions are always in lower case, the "real" family name always starts with an uppercase. To illustrate what I mean:



I assume the supposed action would be: 

Edit column --> Add column based on this column...

But I have no clue what "expression" to use in order to split family name. Any thoughts?

Many thanks in advance!

Ger




John Little

unread,
Dec 19, 2016, 4:32:30 PM12/19/16
to OpenRefine
This is where regular expression and the match() function will come in handy.


Here is a step-by-step approach...

#1 filter your rows to only those value which begin with a capital letter  
Family name > Text filter
* select regular expression
* select case sensitive
* expression = ^[A-Z].*

#2 move those names to a newly added column "newFamilyName" 
Family name > Edit column > Add column based on this column...  > OK
New Column Name = Family Name 2
close the text filter

#3 separate and move the FamilyName when name has an Intraposition
Family Name 2  > Edit cells > Transform...
expression = if(isNotNull(value), value, cells["Family name"].value.match(/^([a-z]\w+ )*([A-Z]\w+)/)[1])

#4 add an Intraposition column derived from Family Name 
Family name > Edit column > Add column based on this column...
expression = value.match(/(([a-z]\w+\s)+)([A-Z]\w+)/)[0]

I hope this helps.

--John



--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Owen Stephens

unread,
Dec 20, 2016, 5:58:24 AM12/20/16
to OpenRefine
A similar but alternative approach to the one suggested by John. The reliability of the 'match' step is the big challenge - which will depend on whether all the names follow the pattern you've set out here. But the following works with the name variants in your example:

Use "Add Column based on this column" with the expression:

forEach(value.match(/([A-Z][a-z]*)([a-z ]*)([A-Z][A-Za-z ]*)/),v,v.trim()).join("|")

call the new column "Name split" (or similar). This is slightly more complicated than just the 'match' expression to make sure the names don't end up with spaces at the start or end.

In the new column you should now have the names split into Given name, Intraposition and Family name separated by the pipe character |

You can then use 'Split into several columns' and specify the pipe character as the one to split on. This should give you three columns with the Given name, Intraposition and Family name.

Owen

Ger Dijkstra

unread,
Dec 20, 2016, 8:03:29 AM12/20/16
to OpenRefine
Hello John and Owen,

Many thanks for your fast and kind reply. I will try the suggested solutions and will let you know the outcome!

Kind regards,
Ger

Ger Dijkstra

unread,
Dec 21, 2016, 12:46:03 PM12/21/16
to OpenRefine
Hello John, 

I tried to reproduce your steps, but I get stuck at step #3. Please see the screenshots underneath each of the steps as described by you.



Here is a step-by-step approach...

#1 filter your rows to only those value which begin with a capital letter  
Family name > Text filter
* select regular expression
* select case sensitive
* expression = ^[A-Z].*





#2 move those names to a newly added column "newFamilyName" 
Family name > Edit column > Add column based on this column...  > OK
New Column Name = Family Name 2
close the text filter



 
#3 separate and move the FamilyName when name has an Intraposition
Family Name 2  > Edit cells > Transform...
expression = if(isNotNull(value), value, cells["Family name"].value.match(/^([a-z]\w+ )*([A-Z]\w+)/)[1])



 
#4 add an Intraposition column derived from Family Name 
Family name > Edit column > Add column based on this column...
expression = value.match(/(([a-z]\w+\s)+)([A-Z]\w+)/)[0]




Could you advise me what I should do differently?


Many thanks and kind regards from Berlin,

Ger

 

Ger Dijkstra

unread,
Dec 21, 2016, 1:10:32 PM12/21/16
to OpenRefine
Hello Owen,

I followed your suggestion as well, but that leads to the following result:



Process wise I would think as follows:

1. move all words as from the first name written with uppercase into a new column (which then contains the "real" family names)
2. all words left in the original column are then either an intraposition OR are words without any capitals / uppercase in it

Somehow the last step in John's approach got pretty close to that, I think. But I am not a programmer nor IT specialist, maybe my way of thinking is not correct.

Kind regards!
Ger

John Little

unread,
Dec 21, 2016, 3:03:03 PM12/21/16
to openr...@googlegroups.com
Hi Ger.

It's a little slow here today so I took your question as an opportunity to practice my video production skills.  The video is low-fi, but I hope it shows you what to do...  I show two different methods.  The second method only collapses some of the initial steps and begins at timestamp 1:38


Best

--John

--

Thad Guidry

unread,
Dec 21, 2016, 3:22:47 PM12/21/16
to openr...@googlegroups.com
There might be a much simpler way to do this.  Using Facets to inspect things for patterns and then...
 
What if you just treat the Family Names as multi-value cells ?  :)

Then split on different patterns that you see in a Text Facet on s substring of Family Name... like "van den" or "van der", etc.
using the Edit column -> Split into several columns -> by separator using separator "van den" and uncheck Remove this column.

Ger Dijkstra

unread,
Dec 22, 2016, 6:14:37 AM12/22/16
to OpenRefine
Hello John,

This is awesome! Thank you very much for your efforts, very much appreciated!

Happy holidays and kind regards,
Ger

Ger Dijkstra

unread,
Dec 22, 2016, 6:41:42 AM12/22/16
to OpenRefine
Hello Thad,

Thanks for your suggestion! I'll be working with larger datasets (+100,000 entries), therefore John's method seems to fit better to such datasets. Since I am still rather new to OpenRefine I will study what Facets could do, so thanks for pointing that out.

Happy holidays and kind regards,
Ger

Thad Guidry

unread,
Dec 22, 2016, 11:11:48 AM12/22/16
to OpenRefine
Ger,

There are many ways using Facets to expose patterns or to even have a nice filtering mechanism for patterns.
Family Names have 1 or many Words.
So how do you know when you have many Words ?
Count the number of whitespaces you have in a cell...or...

Add Customized Facet
  value.smartSplit(" ").length()

Another is just looking at Words themselves...
Add Customized Facet
  Word Facet







Thad Guidry

unread,
Dec 22, 2016, 11:14:55 AM12/22/16
to OpenRefine
oops forgot to mention the 2 cool features of Facets....
you can sort by count ... in addition to name.
you can click on an elements in a Facet to include them or not and then the grid updates immediately.
you can have many many Facets working at the same time, slicing and dicing through ways of viewing your columns of data.


Owen Stephens

unread,
Jan 4, 2017, 4:41:59 AM1/4/17
to OpenRefine
Hi Ger,

After the Christmas break I'm now back at my computer. It looks like you are happy with the approach John illustrated, but I wanted to know why my approach hadn't worked. The reason my solution doesn't work is I'd not read your question carefully enough. My solution assumed that the Given name and Family name were is a single cell at the start of the process whereas on re-reading you were clear that you already have the Given name in a separate column.

To get my solution working it needs tweaking to just work with the data in your 'before' Family Name column. So to get the Intraposition you can use:

forEach(value.match(/([a-z ]*)([A-Z][A-Za-z ]*)/),v,v.trim())[0]

and to get the Family name

forEach(value.match(/([a-z ]*)([A-Z][A-Za-z ]*)/),v,v.trim())[1]

Happy new year!

Owen

Ger Dijkstra

unread,
Jan 4, 2017, 12:04:31 PM1/4/17
to OpenRefine
Hello Owen et al.,

First of all: happy new year to you all!

I have just tested your tweaked code and I can happily confirm that it works easy and well! John's procedure has one advantage, because it does not impact names without an uppercase:


This is where I used Thad's advise to work with the facet option in the column "Achternaam 2" of the above shown example. Thus I could find the empty cells and manually adjust family names where there was a typo (in the sense of no words with uppercase).

For all of you: MANY thanks for your kind and professional help!

Kind regards,
Ger
Reply all
Reply to author
Forward
0 new messages