How to extract only the second last word of a string

dhkl...@gmail.com

unread,

Jan 10, 2017, 10:04:17 AM1/10/17

to OpenRefine

Hi Guys,

It's certainly a basic question, but fiddling around with it for a while now didn't help, so maybe you could give some advice: I have to clean a rather messy set of data containing a column with address information in various different formats. Based on this column I want to create new columns just stating country and city (without street, zip code, etc).

Some examples to illustrate my problem. These are in the original address-column:

P.O. Box 123 Accra, Ghana

Blabla Kolgiet X, 1, 1234, 12345, Copenhagen, Denmark

"Residencia General", 123 street Number 123, Apartment 1234, Bogotá, Colombia

123ABC, 1234 Carmen Philippines

And the result would be a new city column and a new country column for statistical purposes.

Getting the country is rather easy, using rpartition. However for some reason I cannot figure out how to get the city. I assume it should be easy, using an expression that says: Take the second last word in each string (since there isn't always a seperator before or after the city name). This would get me most of the cities (city names with more than one word wouldn't be captured but that would be fine, I guess. If there is a better solution, I'm glad to hear it :)).

But how do I achieve this? Split (or smartSplit) doesn't seem to work with a "second last"-bit (or I didn't figure it out) and partition or rpartition doesn't seem to work in every instance because the seperators are changing (most of the time they would spit out something like "P.O. Box 123 Accra" which gets rid of the country, but not the rest).

So if there is a way I would be glad for any help I can get.

Thank you!

Owen Stephens

unread,

Jan 10, 2017, 10:16:41 AM1/10/17

to OpenRefine

You could try using Split with a regular expression - a regular expression is a way of matching characters in a string with some very powerful features. There are a few options to the approach but a quick test suggests the following is good for the data you give as examples:

value.split(/\W+/)

You should find the output is an array of words from the address so you could extract the second to last one using

value.split(/\W+/)[-2]

To explain this a bit:

in the 'split' function the / / instead of the usual " " indicate you are using a regular expression rather than matching specific character(s)

the \W+ is a regular expression that matches any sequence of one or more "non-word" characters - so that would match a comma, or a space, or a comma followed by a space, etc.

The output of the 'split' is an array - so you can select the last but one item in the array using [-2]

Hope this helps

Owen

Tom Morris

unread,

Jan 10, 2017, 10:59:19 PM1/10/17

to openr...@googlegroups.com

Owen has the right approach. A couple of small REFINEments (so to speak):

- "last but one" is the British way of saying "second to last"

- if you're not guaranteed to have whitespace separating your pieces (e.g. there's a comma, but not space), you may need to expand the set of characters that you split on

- I always recommend using facets to browse the data in the way in which you're going to transform it to see what surprises are buried in the data. In this case, set up a custom text facet using your proposed expression (e.g. value.split(/\W+/)[-2]), sort by count, and browse the low frequency items to see what type of cruft is in there.

Tom

--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply all

Reply to author

Forward