It’s a great use case. I wonder why you’re using nearest neighbour. I tend to use key collision more than NN, and I get what I would call sensible results with most of its sub-methods.
Jonathan Stoneman
--
You received this message because you are subscribed to the Google Groups "OpenRefine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openrefine+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/openrefine/CANnh_k_ww6PLg7LJ3bLKOmVs%2BD8h-DyfvEWuJcyNGix%3D_%3D9uMA%40mail.gmail.com.
Thanks, Thad, for this. I have never thought deeply enough about how Refine works. I only joined the group a couple of weeks ago, and I have learned more about OR in these two weeks than in many a training session!
There’s no way, is there, of making a master list of correct names to measure distances against?
Jonathan Stoneman
From: openr...@googlegroups.com <openr...@googlegroups.com> On Behalf Of Thad Guidry
Sent: 09 December 2020 20:17
To: openr...@googlegroups.com
Subject: Re: [OpenRefine] question regarding levenstein clustering algorithm (i get strange results)
Tom,
It could, in fact, be much more, depending on the maximal distance of any longer string being compared for symmetry.
Other tools, such as KNIME, make the configuration more explicit: the user controls the weights for deletions, insertions, and exchanges. I often set the deletion weight a bit higher with messy data, because typos are more likely to miss characters than to add them (folks often abbreviate, short-circuit their typing, etc.).
Stefano's implementation, however, tried to make a few sensible assumptions to account for more Western languages, as he detailed on our wiki page. Anyway, here's what KNIME does and provides, if it helps:
The Levenshtein distance is one of the most famous string metrics for measuring the difference between two strings. It is the minimum number of single-character operations (i.e. deletions, insertions or substitutions) needed to transform one of the strings into the other. The maximal distance between two strings is bounded by the length of the longer string. Implementation note: For performance reasons, and to ensure symmetry when operations are weighted, the longer string is always treated as the first parameter.
Examples
levenshtein("", "knime") = 5
levenshtein("knime", "kime") = 1
levenshtein("knime", "kinme") = 2
levenshtein("city constance", "constance city") = 10
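The examples above can be reproduced with a short sketch. This is a plain, unweighted dynamic-programming Levenshtein in Python, written for illustration only; it is not KNIME's or OpenRefine's actual code, and the function name `levenshtein` simply mirrors the notation used in the examples.

```python
def levenshtein(a: str, b: str) -> int:
    """Unweighted edit distance: every insertion, deletion or substitution costs 1."""
    # previous[j] is the distance between the prefix of `a` handled so far and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free if the characters match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("", "knime"))                         # 5
    print(levenshtein("knime", "kime"))                     # 1
    print(levenshtein("knime", "kinme"))                    # 2
    print(levenshtein("city constance", "constance city"))  # 10
```

Note that swapping two adjacent characters ("knime" vs "kinme") costs 2 here, because plain Levenshtein has no transposition operation, and that reordering two words costs as much as deleting one and retyping it.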
Configuration
· Deletion Weight: deletions are weighted according to the given value.
· Insertion Weight: insertions are weighted according to the given value.
· Exchange Weight: exchanges are weighted according to the given value.
· Normalize distance: the resulting distance is in the range [0,1].
· Uppercase input: transform all characters to uppercase before computing the distance. For performance reasons it is preferable to uppercase the input in a precomputation step instead of checking this option.
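To make the configuration options above concrete, here is a rough Python sketch of a weighted variant. It is an assumption-laden illustration, not KNIME's implementation: the function name `weighted_levenshtein` and its parameter names are hypothetical, and the normalization (dividing by the largest weight times the length of the longer string) is only one plausible way to land in [0,1], since the description above does not say how KNIME does it.

```python
def weighted_levenshtein(a, b, delete_w=1.0, insert_w=1.0, exchange_w=1.0,
                         normalize=False, uppercase=False):
    """Edit distance with separate, non-negative weights per operation (illustrative sketch)."""
    if uppercase:
        a, b = a.upper(), b.upper()
    # Treat the longer string as the first parameter so the result stays
    # symmetric even when delete_w and insert_w differ.
    if len(b) > len(a):
        a, b = b, a
    previous = [j * insert_w for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        current = [i * delete_w]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + delete_w,                               # delete ca
                current[j - 1] + insert_w,                            # insert cb
                previous[j - 1] + (0.0 if ca == cb else exchange_w),  # exchange or match
            ))
        previous = current
    distance = previous[-1]
    if normalize:
        # Assumed normalization: worst case is the largest weight applied
        # once per character of the longer string.
        worst = max(delete_w, insert_w, exchange_w) * len(a)
        return distance / worst if worst else 0.0
    return distance


if __name__ == "__main__":
    print(weighted_levenshtein("knime", "kime", delete_w=2.0))  # 2.0
    print(weighted_levenshtein("kime", "knime", delete_w=2.0))  # 2.0 (same pair, swapped)
    print(weighted_levenshtein("", "knime", normalize=True))    # 1.0
```

Raising `delete_w`, as described earlier in the thread for messy data, pushes apart pairs that differ by a dropped character while leaving pure substitutions unchanged.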
There’s no way, is there, of making a master list of correct names to measure distances against?
I wonder if there's a way that we can make this clearer to users.