Similarity Join

Rosario Di Carlo

Mar 14, 2018, 12:42:56 PM
to OpenRefine
Hi guys,
I need to join two datasets when the similarity (e.g. Levenshtein) between two cells exceeds a threshold.

Example: movie1_dataset and movie2_dataset


movie1_title    movie2_title
------------    --------------
Avatar          Star wars Jedi
Star wars       Indiana Jones
                Tomb raider
                Avatar 2010


Output:


movie1_title    movie2_title
------------    --------------
Avatar          Avatar 2010
Star wars       Star Wars Jedi


How can I get this in OpenRefine?

Ettore Rizza

Mar 14, 2018, 2:39:07 PM
to OpenRefine
Hi Rosario,

There are several methods. One of the most efficient is to use the reconcile-csv application. It uses the Dice coefficient rather than Levenshtein distance, but it produces a similarity percentage that can be used as a threshold.
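To give an idea of what a Dice-based similarity join does, here is a minimal Python sketch. It is not the reconcile-csv code: the character-bigram tokenization, the function names (`bigrams`, `dice`, `similarity_join`) and the 0.6 threshold are all assumptions chosen for illustration, and the scores reconcile-csv returns may differ.

```python
def bigrams(text):
    """Return the set of lowercase character bigrams of a string."""
    s = text.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a, b):
    """Dice coefficient on character bigrams: 2*|X & Y| / (|X| + |Y|)."""
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        # Strings shorter than two characters have no bigrams;
        # fall back to exact (case-insensitive) comparison.
        return 1.0 if a.lower() == b.lower() else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def similarity_join(left, right, threshold):
    """Pair each left value with its best right match, if it reaches the threshold."""
    pairs = []
    for value in left:
        best = max(right, key=lambda candidate: dice(value, candidate), default=None)
        if best is not None and dice(value, best) >= threshold:
            pairs.append((value, best))
    return pairs


# The two columns from the question above.
movie1 = ["Avatar", "Star wars"]
movie2 = ["Star wars Jedi", "Indiana Jones", "Tomb raider", "Avatar 2010"]

# 0.6 is a hypothetical threshold; tune it on your own data.
matches = similarity_join(movie1, movie2, threshold=0.6)
for left_title, right_title in matches:
    print(left_title, "|", right_title)
```

With this tokenization, "Avatar" and "Avatar 2010" score about 0.67 and "Star wars" and "Star wars Jedi" about 0.74, so both pairs pass a 0.6 threshold while the unrelated titles score far lower.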

If you understand a bit of French, here is a short video tutorial, although the explanations on the website seem quite clear to me. Feel free to report anything that seems confusing. Searching for "reconcile csv" in this Google group should also turn up earlier discussions about it.

Hope this helps,

Ettore

Rosario Di Carlo

Mar 14, 2018, 5:54:00 PM
to OpenRefine
This is exactly what I was looking for. I tested it and it works very well.
Thank you very much Ettore.