question regarding Levenshtein clustering algorithm (I get strange results)

Sergio Letuche

Dec 8, 2020, 5:37:06 AM
to openr...@googlegroups.com
greetings,

I am using the nearest neighbor method with Levenshtein distance, a radius of 3.0 and block chars of 6.

What I do not understand is why, for example, Unipress and Seepresse are in the same group, since they seem to differ by four edits:

replace Uni with See, and add an e at the end of the string.

What am I missing?

Unipress
Seepresse
UNIPRESS.
Edipresse
Légipresse
Mediapresse
Mediapresse.
Socpresse
UNIPRESS
UNIPRESS,
UniPress

jonathan...@gmail.com

Dec 8, 2020, 5:44:15 AM
to openr...@googlegroups.com

It’s a great use case. I wonder why you’re using nearest neighbour. I use key collision (which I tend to use more than NN) and get what I would call sensible results with most of the sub-methods

 

Jonathan Stoneman


Petros Liveris

Dec 8, 2020, 5:52:24 AM
to OpenRefine, jonathan...@gmail.com
Thank you for your kind reply.

Could you please elaborate on how you would solve this problem?
I need to use key collision, but allow each value to differ by up to three edits.
Jonathan, Jonatan and Janathann should all end up in the same cluster.

How could I do this with key collision and its sub-methods?

Tom Morris

Dec 8, 2020, 1:05:20 PM
to openr...@googlegroups.com
Sounds like a bug. Please file a bug report in the issue tracker.

Thanks,
Tom


Owen Stephens

Dec 9, 2020, 7:45:08 AM
to OpenRefine
I don't think this is a bug.

The way the clusters are created is not "group all the values that are within a Levenshtein distance X of each other" but rather "group the values that are within a Levenshtein distance X of some common value". That is to say, the values in the cluster will all be within distance X (in your case 3) of a common value in that group. In this case that value is "edipresse".

Unipress -> Edipresse is three edits (replace "U" with "E" and "n" with "d", add a trailing "e")
Seepresse -> Edipresse is three edits (replace "S", "e", "e" with "E", "d", "i" respectively)

So these group in this cluster not because they are within three edits of each other, but because they are each within three edits of another common value in the cluster.
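
For anyone who wants to check the arithmetic, here's a quick sketch using the plain textbook dynamic-programming Levenshtein (not OpenRefine's own code, just the standard algorithm):

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

print(levenshtein("Unipress", "Seepresse"))   # 4 -- more than the radius of 3
print(levenshtein("Unipress", "Edipresse"))   # 3 -- within the radius of the common value
print(levenshtein("Seepresse", "Edipresse"))  # 3 -- within the radius of the common value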

You should also see clusters without Edipresse in them, containing just Unipress in its various forms, etc.

Owen

Tom Morris

Dec 9, 2020, 1:38:35 PM
to openr...@googlegroups.com
Good point Owen! I totally missed that. So the edit distance between two candidates could be as much as double your specified limit (by the triangle inequality: d(a,b) ≤ d(a,c) + d(c,b) ≤ 2 × radius, where c is the common value).

I wonder if there's a way that we can make this clearer to users.

Tom

Thad Guidry

Dec 9, 2020, 3:16:58 PM
to openr...@googlegroups.com
Tom,

It could, in fact, be much more than that, depending on the lengths of the strings being compared; the maximal distance between two strings is bounded only by the length of the longer one.

Other tools make the config more explicit (KNIME, for example), where a user has control over weighting the Deletes, Inserts, and Exchanges. I often set Deletes a bit higher with messy data, because typos are more likely to be missing characters than to have extra ones (folks often abbreviate, short-circuit their typing, etc.)

Stefano's implementation did, however, try to assume a few sensible things to account for more Western languages, as he detailed on our wiki page. Anyway... here's what KNIME does and provides, if it helps:

Levenshtein Distance

The Levenshtein distance is one of the most famous string metrics for measuring the difference between two strings. It is the minimum number of single-character operations (i.e. deletions, insertions or substitutions) required to transform one of the strings into the other. The maximal distance between two strings is bounded by the length of the longer string. Implementation note: for performance reasons, and to ensure symmetry when weighted operations are used, the longer string is always treated as the first parameter.

Examples
levenshtein("", "knime") = 5
levenshtein("knime", "kime") = 1
levenshtein("knime", "kinme") = 2
levenshtein("city constance", "constance city") = 10

Distance Options

Configuration
  • Deletion Weight: deletions are weighted according to the given value.
  • Insertion Weight: insertions are weighted according to the given value.
  • Exchange Weight: exchanges are weighted according to the given value.
  • Normalize distance: the resulting distance is in the range [0,1].
  • Uppercase input: transform all characters to uppercase before computing the distance. For performance reasons it is preferable to uppercase the input in a precomputation step instead of checking this option.
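
For the curious, the weighted variant KNIME describes is a small change to the standard dynamic program. This is just a sketch against the option descriptions above, not KNIME's actual code, and the normalisation formula is a guess (KNIME doesn't spell it out):

def weighted_levenshtein(a, b, w_del=1.0, w_ins=1.0, w_sub=1.0, normalize=False):
    # per KNIME's implementation note: treat the longer string as the first
    # parameter so that unequal weights still give a symmetric distance
    if len(b) > len(a):
        a, b = b, a
    prev = [j * w_ins for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        curr = [i * w_del]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + w_del,                            # deletion
                            curr[j - 1] + w_ins,                        # insertion
                            prev[j - 1] + (0 if ca == cb else w_sub)))  # exchange
        prev = curr
    d = prev[-1]
    if normalize:
        # "Normalize distance": one plausible scaling into [0, 1]
        return d / (len(a) * max(w_del, w_ins, w_sub)) if a else 0.0
    return d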


jonathan...@gmail.com

Dec 9, 2020, 3:25:21 PM
to openr...@googlegroups.com

Thanks, Thad, for this. I have never thought deeply enough about how Refine works. I only joined the group a couple of weeks ago, and have learned more about OR in these two weeks than in many a training session!

 

There’s no way, is there, of making a master list of correct names to measure distances against?

 

Jonathan Stoneman

 


Thad Guidry

Dec 9, 2020, 5:42:56 PM
to openr...@googlegroups.com
You are welcome Jonathan.

No, we don't provide Levenshtein as a function directly in GREL. A few voices have asked for it, but I don't think we'll ever add it directly; better to have a GREL-like flavor with a richer, yet easier syntax for our users. We might deliver that if we get enough requests and +1's on an issue, if someone makes one.

Outside of GREL?
Yes, you can, with other expression languages like Jython/Python.

Install fuzzywuzzy from PyPI into your system environment on your PC/laptop using our example guide here: https://github.com/OpenRefine/OpenRefine/wiki/Jython
You will need to extend Jython with the PyPI module called "fuzzywuzzy", detailed here: https://pypi.org/project/fuzzywuzzy/
fuzzywuzzy can also optionally use python-Levenshtein, a very fast C implementation of the algorithm, for extra speed.

Add a new column based on...
Then, in the Expression editor in OpenRefine, with Jython/Python as your expression language, it could look something like this (quick pseudocode you can change as needed):

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# the reference value to compare against, taken from another column in this row
lookup = cells["MyColumn1"]["value"]

# similarity score (0-100) between the reference value and this cell's value
fuzz.ratio(lookup, value)

Hmm, I should probably write up a page explaining all that with a step-by-step, and add it to that Jython example guide on our wiki.
Maybe I'll do that this weekend.  Maybe not. :-)



Thad Guidry

Dec 9, 2020, 5:44:51 PM
to openr...@googlegroups.com
whoops...should have a return.

return fuzz.ratio(lookup, value)

Anyways, you get the idea.
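
Put together, the whole expression would read like this ("MyColumn1" still standing in for whatever column holds your reference value):

from fuzzywuzzy import fuzz

# value of the reference column in this row
lookup = cells["MyColumn1"]["value"]

# similarity score (0-100) between the reference value and this cell
return fuzz.ratio(lookup, value)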


Tom Morris

Dec 9, 2020, 6:09:10 PM
to openr...@googlegroups.com
On Wed, Dec 9, 2020 at 3:25 PM <jonathan...@gmail.com> wrote:


There’s no way, is there, of making a master list of correct names to measure distances against?


One way to do this would be to reconcile against your master list using something like reconcile-csv (although I have no idea what kind of state it's in). Edit distances are calculated as part of the reconciliation process.

If reconcile-csv is unmaintained and doesn't work with current versions of OpenRefine, perhaps someone can suggest an alternative.

Tom

Owen Stephens

Dec 10, 2020, 5:44:27 AM
to OpenRefine
On Wednesday, December 9, 2020 at 6:38:35 PM UTC tfmo...@gmail.com wrote:

I wonder if there's a way that we can make this clearer to users.


I don't know how easy it would be to do, but I think highlighting the term(s) that lead to the cluster would be really useful and might help make the way this works clearer; for example, in this case, highlighting "Edipresse" in the cluster. As this example shows, it isn't always easy to spot! (It took me a few minutes to work out what the common term was, and I knew what I was looking for.)

Owen

Owen Stephens

Dec 10, 2020, 6:24:07 AM
to OpenRefine
For this specific task, if these are company names (they look like publishers to me?), then the OpenCorporates reconciliation service might be the way to go: https://api.opencorporates.com/documentation/Open-Refine-Reconciliation-API

For the more general case, I agree that reconcile-csv is a possible solution here. Although reconcile-csv does still work (at least to the extent I've tested it on my own setup), it hasn't been updated in 6 years, and I'd say it probably, sadly, falls into the "unmaintained" category. The other nice thing about using the "reconcile" approach is that once you have reconciliation candidates, you can access the Levenshtein distance between the original cell value and the best candidate, which can also be helpful in getting the right value.

For those who work with RDF, the RDF Extension supports reconciliation against a local RDF file - although I'm not sure if, or to what extent, it supports fuzzy matching.

For those willing/able to invest more in setting up a reconciliation service, the "Conciliator" code might provide a way: https://github.com/codeforkjeff/conciliator - but that would require a much higher degree of expertise than reconcile-csv.
There are other frameworks / examples of reconciliation services listed at https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation-Service-API#examples which might be helpful for anyone wanting to go down this route.

A couple of other approaches I can think of but haven't really investigated for practicality:
* Set up a search engine (e.g. Elasticsearch or Solr) which supports search requests over HTTP. Index the authorised/correct values, then use "Add column by fetching URLs" to search each value against the index, setting fuzzy search params as necessary (see the sketch after this list).
* Use an entity matching service which supports HTTP requests and user-defined entities (e.g. Dandelion https://dandelion.eu - free and paid options available)
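
To make the first of those concrete, a Jython expression for "Add column by fetching URLs" could be as simple as the sketch below. Everything here is an assumption for illustration: a local Elasticsearch on port 9200, an index called "names" with a "name" field, and "~2" as the query-string fuzziness operator (multi-word values would need more careful query building):

import urllib

# fuzzy query-string search against a hypothetical local index of correct names
return "http://localhost:9200/names/_search?q=name:" + urllib.quote(value) + "~2"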

Best wishes

Owen