Help please - fuzzy matching

719 views
Skip to first unread message

Steve Kong

unread,
Mar 28, 2017, 9:58:22 AM3/28/17
to OpenRefine
Hi, I'm very new to OpenRefine.
I have some data where I want to find and extract a list of names from cells in rows where any of the names match. For example the data is like:
Row 1:'Jon Smith 46 years old is but a new brown car'
Row 2:'Jonny smith 46 yrs likes brown cars'
Row 3:'Jon works at a garage repairing cars'
Row 4:'My mate Smithy loves those brown BMWS'

What expression would I need to search each row and then extract any of the following fuzzy matches :
Jon smith
Jonny smith
Jon
Smithy
And then return a new column with just those name, i.e. Actually display the names in a column next to column where name trxt was taken from?

The idea is that I extract data that might well be the same person in different phrases.

Thank you in anticipation, I'm very new to this.

John Little

unread,
Mar 28, 2017, 10:56:03 AM3/28/17
to openr...@googlegroups.com
Hi Steve:

Below is one possible solution, assuming your data are in a single column and the column name is "Text"

  1. Text > Edit column > Add column based on this column 
    • column name = jon   
    • expression = value.match(/.*(joh?n+y?).*/i)[0]
  2. Text > Edit column > Add column based on this column  
    • column name = smith
    • expression = value.match(/.*(smith[y]?).*/i)[0]
This technique uses regex.  (You can find a lot of online help for regex if you want to use this method.)  
  • The regex is between the two '/'  :  .*(joh?n+y?).*
  • The initial and trailing '.*' inside the slashes matches anything.  
  • The inside parenthesis is a matching group.  
  • The '?' is a multiplier matching 0 or 1 instances
  • the '+' is a multiplier matching 1 or more instances.  
  • the 'i' at the end of the trailing slash tells the regex to match as case-insensitve
  • The trailing [0] is the chronological order of matches.  
    • There is only one match and OpenRefine starts counting at zero -- so the first match is identified by 0.

One weakness here is the match only works for the name variants you identify in advance.  If you had thousands of rows with thousands of names this wouldn't be a convenient solution.  But this list is full of some very creative and generous solutionists so let's see what they say.

--John






--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Steve Kong

unread,
Mar 28, 2017, 11:37:48 AM3/28/17
to OpenRefine
Jon, this is awesome thank you so much.

If it's not too much is it possible to have an OR so that it's possible to add Jon OR Jonny OR Smith and then return any of those into the same column rather than separate ones?

I could then type in a big list and the results would I get would be from every row but in the same column.

Very best and thank you again..the power of forums!
Message has been deleted

Ettore Rizza

unread,
Mar 28, 2017, 12:19:32 PM3/28/17
to OpenRefine
It's probably simpler with python/jython :

import re


pattern 
= re.compile(r"\b(joh?n|joh?nny?|smithy?)\b", re.I)


return " ".join(pattern.findall(value))

Result :


The syntax seems a bit dauntig, but the important part is just the regular expression between parenthesis : (joh?n|joh?nny?|smithy?) 

You can add other alternatives by separating them with a vertical bar.

Steve Kong

unread,
Mar 29, 2017, 2:36:20 AM3/29/17
to OpenRefine
Ettore, thank you so much, so helpful.

Just out of interest, is there a way to do the same here but rather than write each name with an OR (|) can it look up a list in a spreadsheet?

Many thanks again.

Ettore Rizza

unread,
Mar 29, 2017, 3:58:30 AM3/29/17
to OpenRefine
Hi Steve, 

if you have a list of values in a spreadsheet, you can turn it into a regular expression with a simple text editor. Here is an example with Sublime Text (not free but its evaluation version is unlimited in time):

Copy your spreadsheet column, paste it into the text editor, and then replace each new line \n with a vertical bar |

Screencast : 



You will then be able to construct a code like this one, which will match in this case each country name in an open refine column : 

import re

pattern
= re.compile(r"\b(Afghanistan|Åland Islands|Albania|Algeria|American Samoa|Andorra|Angola|Anguilla|Antarctica|Antigua and Barbuda|Argentina|Armenia|Aruba|Australia|Austria|Azerbaijan|Bahamas|Bahrain|Bangladesh|Barbados|Belarus|Belgium|Belize|Benin|Bermuda|Bhutan|Bolivia (Plurinational State of)|Bonaire, Sint Eustatius and Saba|Bosnia and Herzegovina|Botswana|Bouvet Island|Brazil|British Indian Ocean Territory|Brunei Darussalam|Bulgaria|Burkina Faso|Burundi|Cambodia|Cameroon|Canada|Cabo Verde|Cayman Islands|Central African Republic|Chad|Chile|China|Christmas Island|Cocos (Keeling) Islands|Colombia|Comoros|Congo|Congo (Democratic Republic of the)|Cook Islands|Costa Rica|Côte d'Ivoire|Croatia|Cuba|Curaçao|Cyprus|Czech Republic|Denmark|Djibouti|Dominica|Dominican Republic|Ecuador|Egypt|El Salvador|Equatorial Guinea|Eritrea|Estonia|Ethiopia|Falkland Islands (Malvinas)|Faroe Islands|Fiji|Finland|France|French Guiana|French Polynesia|French Southern Territories|Gabon|Gambia|Georgia|Germany|Ghana|Gibraltar|Greece|Greenland|Grenada|Guadeloupe|Guam|Guatemala|Guernsey|Guinea|Guinea-Bissau|Guyana|Haiti|Heard Island and McDonald Islands|Holy See|Honduras|Hong Kong|Hungary|Iceland|India|Indonesia|Iran (Islamic Republic of)|Iraq|Ireland|Isle of Man|Israel|Italy|Jamaica|Japan|Jersey|Jordan|Kazakhstan|Kenya|Kiribati|Korea (Democratic People's Republic of)|Korea (Republic of)|Kuwait|Kyrgyzstan|Lao People's Democratic Republic|Latvia|Lebanon|Lesotho|Liberia|Libya|Liechtenstein|Lithuania|Luxembourg|Macao|Macedonia (the former Yugoslav Republic of)|Madagascar|Malawi|Malaysia|Maldives|Mali|Malta|Marshall Islands|Martinique|Mauritania|Mauritius|Mayotte|Mexico|Micronesia (Federated States of)|Moldova (Republic of)|Monaco|Mongolia|Montenegro|Montserrat|Morocco|Mozambique|Myanmar|Namibia|Nauru|Nepal|Netherlands|New Caledonia|New Zealand|Nicaragua|Niger|Nigeria|Niue|Norfolk Island|Northern Mariana Islands|Norway|Oman|Pakistan|Palau|Palestine, State of|Panama|Papua New Guinea|Paraguay|Peru|Philippines|Pitcairn|Poland|Portugal|Puerto Rico|Qatar|Réunion|Romania|Russian Federation|Rwanda|Saint Barthélemy|Saint Helena, Ascension and Tristan da Cunha|Saint Kitts and Nevis|Saint Lucia|Saint Martin (French part)|Saint Pierre and Miquelon|Saint Vincent and the Grenadines|Samoa|San Marino|Sao Tome and Principe|Saudi Arabia|Senegal|Serbia|Seychelles|Sierra Leone|Singapore|Sint Maarten (Dutch part)|Slovakia|Slovenia|Solomon Islands|Somalia|South Africa|South Georgia and the South Sandwich Islands|South Sudan|Spain|Sri Lanka|Sudan|Suriname|Svalbard and Jan Mayen|Swaziland|Sweden|Switzerland|Syrian Arab Republic|Taiwan, Province of China|Tajikistan|Tanzania, United Republic of|Thailand|Timor-Leste|Togo|Tokelau|Tonga|Trinidad and Tobago|Tunisia|Turkey|Turkmenistan|Turks and Caicos Islands|Tuvalu|Uganda|Ukraine|United Arab Emirates|United Kingdom of Great Britain and Northern Ireland|United States of America|United States Minor Outlying Islands|Uruguay|Uzbekistan|Vanuatu|Venezuela (Bolivarian Republic of)|Viet Nam|Virgin Islands (British)|Virgin Islands (U.S.)|Wallis and Futuna|Western Sahara|Yemen|Zambia|Zimbabwe)\b", re.I)

return " ".join(pattern.findall(value))


By the way, this regular expression is case insensitive, so no matter if the country name is written afghanistan or AFGHANISTAN.
Reply all
Reply to author
Forward
0 new messages