regex capitalization help


olivia solis

Sep 21, 2016, 9:58:09 PM
to OpenRefine
Hello all,

I'm struggling to come up with an expression to change the case of a letter that follows a specific pattern in a regular expression. For instance, I want to change "Mcfarland, Jenny" and "Mckay, John", etc. to "McFarland, Jenny" and "McKay, John".

I've been trying replace statements like 
   value.replace(/Mc[a-z]/, "Mc...")
I get stuck at the "...".

What am I missing?

Thanks!
Olivia

Tom Morris

Sep 22, 2016, 12:16:30 AM
to openr...@googlegroups.com
Hi Olivia. Too late at night to give a super well thought out answer, but in the hopes of unblocking you, at least for the late evening until our brilliant folks in other timezones come online, how about reframing the problem?

What about splitting on your regex and then capitalizing the first word of the second piece? Processing regex capture groups could probably achieve the same effect, but I don't think .replace() would be the way down that path.

Anyone have other ideas?

Tom
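
For comparison, the capture-group idea is straightforward outside GREL. The following is a minimal Python sketch, assuming the goal is to uppercase one lowercase letter after a word-initial "Mc"; the function name `fix_mc` is hypothetical and is not part of OpenRefine or GREL. It relies on `re.sub` accepting a function as the replacement, which is what lets the captured letter change case.

```python
import re


def fix_mc(text):
    """Uppercase the lowercase letter directly after a word-initial 'Mc'."""
    # Group 1 keeps the "Mc" prefix; group 2 is the letter to uppercase.
    return re.sub(
        r"\b(Mc)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )


print(fix_mc("Mcfarland, Jenny"))            # McFarland, Jenny
print(fix_mc("Andrew Mcdonald collection"))  # Andrew McDonald collection
```

Names that are already correct, such as "McKay", do not match the pattern and pass through unchanged.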

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Owen Stephens

Sep 22, 2016, 5:38:48 AM
to OpenRefine
I think Tom is right that you probably need a different approach - there are multiple ways of going about this but my first attempt was:

if(startsWith(value,"Mc"),value.substring(0,2)+value[2].toUppercase()+value.substring(3,value.length()),value)

which I think will do the job. Basically this tests that the string starts with 'Mc', then takes the first two letters (Mc), adds on the third letter converted to uppercase, then adds the remainder of the name back on the end. Unlike the approach you were taking, it doesn't check for a lowercase letter first, but I think toUppercase will only affect lowercase letters, so this shouldn't make a difference. However, you could add a check if you needed to.

Owen
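
The same steps (test the prefix, keep "Mc", uppercase the third character, append the rest) can be mirrored in Python as a rough sketch. The function name `capitalize_after_mc` is hypothetical, and the length check is an added assumption so that a bare "Mc" value is returned unchanged; it is not part of the GREL expression above.

```python
def capitalize_after_mc(value):
    """Uppercase the third character of a value that starts with 'Mc'."""
    # Added guard (an assumption): skip values with no third character.
    if value.startswith("Mc") and len(value) > 2:
        return value[:2] + value[2].upper() + value[3:]
    return value


print(capitalize_after_mc("Mcfarland, Jenny"))  # McFarland, Jenny
print(capitalize_after_mc("Smith, Al"))         # Smith, Al
```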

olivia solis

Sep 22, 2016, 10:36:14 AM
to OpenRefine
Thank you both, Tom and Owen! I will try both approaches, attacking case by case. My examples were not the best, so apologies for that. Sometimes "Mc" appears at the beginning of the value (the majority of cases), and sometimes not (e.g. "Andrew Mcdonald collection"). For the cases where it does, Owen's solution will work great.

There are also cases where I need to make the first letter after a dash uppercase (e.g. "The Asian-american Society", "Vice-president's Office"). The letter is not necessarily at a consistent location in the string, so it's a little tricky.

You both have pointed me in the right direction though. I very much appreciate the feedback.

-Olivia

Owen Stephens

Sep 22, 2016, 11:10:49 AM
to OpenRefine
So you can modify the approach I outlined to deal with 'Mc' names in the middle by using a 'split' to break down the sentence into words and then iterating through the words with a forEach:

forEach(value.split(" "),v,if(v.startsWith("Mc"),v.substring(0,2)+v[2].toUppercase()+v.substring(3,v.length()),v)).join(" ")

This won't deal with your other cases, but it would work for all the Mc names.

For the other things, I'd suggest getting a list of the cases you need to correct - often with OpenRefine it works well if you tackle one case at a time rather than trying to get a single expression to deal with all the different scenarios.

Owen
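
The split, transform, and join pattern looks like this as a Python sketch. The function name `capitalize_mc_words` is hypothetical, and the length check is an added assumption to leave a bare "Mc" word untouched.

```python
def capitalize_mc_words(value):
    """Apply the 'Mc' fix to each space-separated word, then rejoin."""
    fixed = []
    for word in value.split(" "):
        # Only touch words that start with "Mc" and have a letter to raise.
        if word.startswith("Mc") and len(word) > 2:
            word = word[:2] + word[2].upper() + word[3:]
        fixed.append(word)
    return " ".join(fixed)


print(capitalize_mc_words("Andrew Mcdonald collection"))
# Andrew McDonald collection
```

Like the GREL version, this only looks at the start of each word, so a word with leading punctuation such as "(Mcdonald" would be skipped.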

Joe Wicentowski

Sep 22, 2016, 11:19:37 AM
to OpenRefine
Brilliant, thanks for sharing this technique, Owen!

> So you can modify the approach I outlined to deal with 'Mc' names in the middle by using a 'split' to break down the sentence into words and then iterating through the words with a forEach:
>
> forEach(value.split(" "),v,if(v.startsWith("Mc"),v.substring(0,2)+v[2].toUppercase()+v.substring(3,v.length()),v)).join(" ")

And I wholeheartedly agree with you here:

> I'd suggest with the other things you need to get a list of the cases you need to correct - often with OpenRefine it works well if you tackle one case at a time rather than trying to get a single expression to deal with all the different scenarios.

Joe

olivia solis

Sep 22, 2016, 6:48:16 PM
to OpenRefine
Apologies for the late response, but thank you so much for this! I am relatively new to GREL and programming languages in general. Being able to apply new functions in concrete situations is very illuminating. Your advice helps me a great deal. :)

I will go use case by use case to think about solutions.

Best,
Olivia

Owen Stephens

Sep 23, 2016, 4:36:14 AM
to OpenRefine
Just a final (maybe) couple of thoughts on this.

First thought: for the case where a hyphen should be followed by an uppercase letter, the following GREL would resolve the issue for words containing a single hyphen:

forEach(value.split(" "),v,if(v.split("-").length()==2,v.split("-")[0]+"-"+v.split("-")[1].toTitlecase(),v)).join(" ")

If you have cases where a single word has multiple hyphens in it you'd need to modify the expression to deal with this.
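
One way to cover any number of hyphens, sketched in Python rather than GREL, is to uppercase every lowercase letter that directly follows a hyphen. The function name `capitalize_after_hyphen` is hypothetical, and the pattern assumes plain ASCII letters. Unlike the expression above, this changes only that single letter and leaves the rest of each word as it was.

```python
import re


def capitalize_after_hyphen(text):
    """Uppercase any lowercase ASCII letter that directly follows a hyphen."""
    return re.sub(r"-([a-z])", lambda m: "-" + m.group(1).upper(), text)


print(capitalize_after_hyphen("The Asian-american Society"))
# The Asian-American Society
print(capitalize_after_hyphen("Vice-president's Office"))
# Vice-President's Office
```

Note that this raises the letter after every hyphen, so a value like "mother-in-law" becomes "mother-In-Law"; whether that is wanted depends on the data.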

Second thought: once you have your cases sorted out, then using the 'Filter' function (especially with the regular expression box ticked) is really helpful to make sure you are applying transforms only to relevant data - this reduces the chances of some issue with the transform having unexpected side effects.

Owen
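
As an illustration of narrowing down to relevant rows first, here is a small Python sketch of patterns that flag only the values needing one of the two fixes discussed in this thread. The names `MC_PATTERN`, `HYPHEN_PATTERN`, and `needs_fix` are hypothetical. The patterns use only basic syntax (a word boundary and a character class), but they have been checked here in Python only, so treat their use in a filter box as an assumption to verify.

```python
import re

# "Mc" at the start of a word, followed by a lowercase letter.
MC_PATTERN = re.compile(r"\bMc[a-z]")
# A hyphen followed directly by a lowercase letter.
HYPHEN_PATTERN = re.compile(r"-[a-z]")


def needs_fix(value):
    """Return True if the value matches either pattern."""
    return bool(MC_PATTERN.search(value) or HYPHEN_PATTERN.search(value))


values = ["Mckay, John", "McKay, John", "The Asian-american Society"]
print([v for v in values if needs_fix(v)])
# ['Mckay, John', 'The Asian-american Society']
```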