Comparing similar string in columns

1,848 views
Skip to first unread message

Franco Giovannini

unread,
Jul 13, 2017, 6:37:46 AM7/13/17
to OpenRefine
Hi, I have a table where two columns can contain similar or different values, and I would like to identify rows where the two values are similar, ideally ranking similarity. 

example

FIRST EXAMPLE      first example            --> match
SECOND EXAMPLE     A Second Example         --> very similar
THIRD EXAMPLE      A third bad example      --> quite similar
FOURTH EXAMPLE     WOOOOOOOOOO              --> No match

I've tried applying phonetic transformations (metaphone, metaphone3) but in my case it really works when there is a match. 
Can you suggest an approach to that problem with Openrefine? Is there a unction able to calculate - for example - a distance (like Levensthein)?

Many thanks
Franco


John Little

unread,
Jul 13, 2017, 2:03:54 PM7/13/17
to OpenRefine
Franco:

I don't pretend to know much about clustering but [and so ?] your question is a bit unclear to me.  Hopefully my suggestions below are new information for you to consider....

* You say you've tried metaphone, metaphon3 ... [but can one calculate] a distance "like" Levenshtein"?  One possible interpretation of your statement suggests that you may not be aware there is a Levenshtein distance operation inside OpenRefine. In case you have not seen it, you can access that function from within the Cluster operation by changing the Method from "key collision" to "nearest neighbor".  At that point you will see a pre-set default Distance Function set to "levenshtein".

  
* In the archive of this list some have discussed the clustering operation at greater length.  You may want to look further into that discussion history.   

* One outcome of the above discussion was an in-depth explanation by Owen Stephens found in the wiki: https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth 

Hope that helps.  

 



--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Franco Giovannini

unread,
Jul 14, 2017, 6:20:55 AM7/14/17
to OpenRefine, John....@duke.edu
Thank you John. In fact yes, I'm aware of the availability of the Levensthein method in Openrefine. The fact is that, while for metaphone, cologne-phonetic Openrefine has a function, e.g. like phonetic(value,'metaphone'), that I can use to associate a value to a string, what I would need is a function retruring a distance between two values, something like "levensthein(stringA, stringB), because I need to compare values in two different columns. Hope this clarifies my problem, thank again for your answer!
Ciao,
Franco

Ettore Rizza

unread,
Jul 14, 2017, 9:14:43 AM7/14/17
to OpenRefine
Hello Franco, 

you can do this using Jython. Here is an example script (Don't forget to replace "Column 1" and "Column 2" with your column names )

import difflib

column1
= cells['Column 1']['value'].strip().lower()
column2
= cells['Column 2']['value'].strip().lower()

score
= difflib.SequenceMatcher(None, column1, column2).ratio()

if score == 1:
   
return "match"
elif score > 0.9:
   
return "very similar"
elif score > 0.8:
   
return "quite similar"
else:
   
return "no match"

Screenshot:


Hope this helps.

Franco Giovannini

unread,
Jul 14, 2017, 9:21:13 AM7/14/17
to OpenRefine
That's *exactly* what I was looking for! Thanks a lot Ettore!!
Ciao,
Franco

Ettore Rizza

unread,
Jul 14, 2017, 9:33:01 AM7/14/17
to OpenRefine
Forgot to mention, but the values I've provided are obviously indicative. You can test other combinations :

elif score > 0.9:
   
return "very similar"
elif score > 0.8:
   
return "quite similar"
elif score > 0.7
   
return "a bit similar"

Or something like this :

elif  0.8 <= score <= 0.9 :
   
return "very similar"
elif 0.6 <= score <= 0.7 :
   
return "quite similar"

Thad Guidry

unread,
Jul 14, 2017, 11:09:54 PM7/14/17
to OpenRefine
Just for completeness sake and to clarify that OpenRefine has something for other usecases besides Franco's direct need...

When your asking for a similarity score, then you have to apply a Nearest Neighbor method (also known as kNN) as we describe on our wiki page here:
and algorithms like Levenshtein, PPM, etc. support kNN.

If your trying to just do A-B comparisons against each row, then we have 2 ways to easily and quickly perform that:

fingerprint()
ngramFingerprint()

For example:

ngramFingerprint(cells.column2.value) == ngramFingerprint(value)

-Thad

Giovannini Franco

unread,
Jul 15, 2017, 3:07:24 AM7/15/17
to thadg...@gmail.com, openr...@googlegroups.com
Thank you. Great software and great community :)

Inviato dal mio smartphone LG

------ Messaggio originale------
Da: Thad Guidry
Data: sab, 15 lug 2017 05:10
A: OpenRefine;
Oggetto:Re: [OpenRefine] Re: Comparing similar string in columns
+ThadGuidry<https://www.google.com/+ThadGuidry>

--
You received this message because you are subscribed to a topic in the Google Groups "OpenRefine" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/openrefine/qCnQTOfdHAA/unsubscribe.
To unsubscribe from this group and all its topics, send an email to openrefine+...@googlegroups.com<mailto:openrefine+...@googlegroups.com>.
Reply all
Reply to author
Forward
0 new messages