Regex for matching multiple upper case

Dave Brown

Jul 30, 2022, 7:58:16 AM

to OpenRefine

I'm struggling with regex (which I don't really know).

I'm trying to extract the words in upper case from a document that contains entries like:

"ADA VILLAS, Stroud Green (1855) Under Crouch Hill in the 1866 directory; in the 1871 Census. In the 1874 directory under Crouch Hill and Birkbeck Road, but by 1877 under Birkbeck Road. By 1882 nos.157-171 ELTHORNE ROAD."

What I want to do is to extract the start string

"ADA VILLAS"

and then the second and any additional strings in upper case, like

"ELTHORNE ROAD"

Each upper case phrase can have two to four words and, sadly, has no fixed terminator: the comma after the first phrase in this example is not always there, and a phrase can instead be followed by a space, a bracket, a comma or a full stop.

By trial and error I worked out that

\b[A-Z]{2,}\b

matches any uppercase word of two or more characters, so I can split the text into words and then use the regex to extract the uppercase ones into an array. For this example that gives the array [ADA, VILLAS, ELTHORNE, ROAD].
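For illustration, the word-by-word behaviour of that pattern can be reproduced in a short Python sketch. It assumes Python's standard `re` module rather than OpenRefine, and the variable names are hypothetical.

```python
import re

# Sample entry quoted earlier in this post.
entry = (
    "ADA VILLAS, Stroud Green (1855) Under Crouch Hill in the 1866 directory; "
    "in the 1871 Census. In the 1874 directory under Crouch Hill and Birkbeck "
    "Road, but by 1877 under Birkbeck Road. By 1882 nos.157-171 ELTHORNE ROAD."
)

# \b[A-Z]{2,}\b matches a run of two or more capitals bounded by word breaks,
# so each uppercase word comes back as a separate item, not as a phrase.
upper_words = re.findall(r"\b[A-Z]{2,}\b", entry)
print(upper_words)
```

Mixed-case words such as "Stroud" are skipped because the word boundary cannot fall between "S" and "t".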

But I really want to extract the word combinations if this is possible (so ADA VILLAS, ELTHORNE ROAD).

Many thanks if anyone has any ideas.

Regards

David Brown

Owen Stephens

Jul 30, 2022, 5:25:47 PM

to OpenRefine

I haven't tested extensively but I think you could try:

value.find(/\b\p{Lu}+(?:\W+\p{Lu}+)\b/)

This should give an array of all the uppercase phrases (using \p{Lu} is slightly more inclusive than [A-Z] as it includes things like accented uppercase letters).
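A rough way to check this outside OpenRefine is the Python sketch below. It rests on some assumptions: Python's standard `re` module does not support \p{Lu}, so [A-Z] stands in for it (which loses the accented letters); `re.findall` stands in for GREL's find; the entry is a shortened version of the sample; and "OLD PARK ROAD" is an invented three-word name. Note that the group in the expression above is not repeated, so it matches exactly two words. The {1,3} variant for two to four words is a suggested extension, not part of the original expression.

```python
import re

# Shortened version of the sample entry from the question.
entry = (
    "ADA VILLAS, Stroud Green (1855) Under Crouch Hill in the 1866 directory. "
    "By 1882 nos.157-171 ELTHORNE ROAD."
)

# Same shape as the GREL expression, with [A-Z] standing in for \p{Lu}.
# The group is not repeated, so this matches exactly two uppercase words.
pair_pattern = r"\b[A-Z]+(?:\W+[A-Z]+)\b"

# Assumed extension: repeat the group one to three times (two to four words).
phrase_pattern = r"\b[A-Z]+(?:\W+[A-Z]+){1,3}\b"

print(re.findall(pair_pattern, entry))

# Invented three-word name to show the difference between the two patterns.
three_words = "OLD PARK ROAD, Hornsey"
print(re.findall(pair_pattern, three_words))
print(re.findall(phrase_pattern, three_words))
```

Because \W+ accepts any run of non-word characters, two uppercase phrases separated only by punctuation (for example a full stop and a space) would be joined into one match, which may need checking against the real data.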

Owen

Dave Brown

Jul 31, 2022, 3:25:58 AM

to OpenRefine

Thanks Owen, that worked well. It only fails if the first word has an apostrophe in it (like ABBOT'S CLOSE, which returns ABBOT'S but not CLOSE). I can facet these cases out and deal with them differently.

You have saved me a lot of time.

Many thanks

David Brown

Owen Stephens

Aug 1, 2022, 5:41:22 PM

to OpenRefine

No problem!

If you want to try catching names with apostrophes as well, you could adjust the regular expression to something like:

value.find(/\b[\p{Lu}']+(?:\W+[\p{Lu}']+)\b/)

This adds the apostrophe as a valid character in the words you are finding, by using [\p{Lu}'] instead of just \p{Lu} as the repeated element, i.e. repeats of any uppercase letter or an apostrophe.

You might find further edge cases, which you could handle by adding more characters inside the square brackets. For example, if some of the names contain hyphens you could use [\p{Lu}'-].
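The effect of widening the character class can be sketched in Python. As assumptions: [A-Z] stands in for \p{Lu} because Python's standard `re` module does not support it, `re.findall` stands in for GREL's find, and "SMITH-JONES ROAD" is an invented example rather than a name from the data.

```python
import re

# Two-word pattern without any extra characters in the class.
plain = r"\b[A-Z]+(?:\W+[A-Z]+)\b"

# Apostrophe allowed inside a word.
with_apostrophe = r"\b[A-Z']+(?:\W+[A-Z']+)\b"

# Apostrophe and hyphen allowed; the hyphen goes last in the class so it is
# read as a literal character rather than as a range.
with_hyphen = r"\b[A-Z'-]+(?:\W+[A-Z'-]+)\b"

# The apostrophe is treated as a separator by the plain pattern, so the
# match stops after ABBOT'S; the wider class keeps the whole name.
print(re.findall(plain, "ABBOT'S CLOSE"))
print(re.findall(with_apostrophe, "ABBOT'S CLOSE"))

# Invented hyphenated name: only the class that includes the hyphen
# returns both words.
print(re.findall(with_apostrophe, "SMITH-JONES ROAD"))
print(re.findall(with_hyphen, "SMITH-JONES ROAD"))
```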