Hi Guys,
It's certainly a basic question, but fiddling around with it for a while now didn't help, so maybe you could give some advice: I have to clean a rather messy set of data containing a column with address information in various different formats. Based on this column I want to create new columns just stating country and city (without street, zip code, etc).
Some examples to illustrate my problem. These are in the original address-column:
P.O. Box 123 Accra, Ghana
Blabla Kolgiet X, 1, 1234, 12345, Copenhagen, Denmark
"Residencia General", 123 street Number 123, Apartment 1234, Bogotá, Colombia
123ABC, 1234 Carmen Philippines
And the result would be a new city column and a new country column for statistical purposes.
Getting the country is rather easy, using rpartition. However for some reason I cannot figure out how to get the city. I assume it should be easy, using an expression that says: Take the second last word in each string (since there isn't always a seperator before or after the city name). This would get me most of the cities (city names with more than one word wouldn't be captured but that would be fine, I guess. If there is a better solution, I'm glad to hear it :)).
But how do I achieve this? Split (or smartSplit) doesn't seem to work with a "second last"-bit (or I didn't figure it out) and partition or rpartition doesn't seem to work in every instance because the seperators are changing (most of the time they would spit out something like "P.O. Box 123 Accra" which gets rid of the country, but not the rest).
So if there is a way I would be glad for any help I can get.
Thank you!