Jaro-Winkler Score


Rynn Bronaugh

Jul 25, 2024, 12:13:03 AM
to rivisosa

I was not able to understand the difference between the two. It seems Levenshtein gives the number of edits between two strings, while Jaro-Winkler provides a normalized score between 0.0 and 1.0.

Levenshtein counts the number of edits (insertions, deletions, or substitutions) needed to convert one string to the other. Damerau-Levenshtein is a modified version that also considers transpositions as single edits. Although the output is the integer number of edits, this can be normalized to give a similarity value. A common choice is 1 - (edit distance / length of the longer string).
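To make the normalization concrete, here is a minimal sketch. It assumes the common convention of dividing the edit count by the length of the longer string; the function names `levenshtein` and `normalized_similarity` are my own and not from any particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, or substitutions to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - edits / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest
```

With this convention, "kitten" and "sitting" are 3 edits apart, so the similarity is 1 - 3/7, roughly 0.57.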


The Jaro algorithm measures the characters two strings have in common, counting a character as a match only if its positions in the two strings are no more than about half the length of the longer string apart, and it also accounts for transpositions. Winkler modified this algorithm to support the idea that differences near the start of the string are more significant than differences near the end of the string. Jaro and Jaro-Winkler are suited for comparing smaller strings like words and names.
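As a rough illustration of the description above, here is a sketch of both measures. It assumes the usual textbook conventions, which are not spelled out in this post: a match window of half the longer length minus one, a prefix scale of 0.1, and a prefix bonus capped at 4 characters. The names `jaro`, `jaro_winkler`, `prefix_scale`, and `max_prefix` are my own.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1], using the conventional match window."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: half the matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro score plus a bonus for a shared prefix of up to max_prefix characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)
```

For the classic pair "MARTHA" and "MARHTA", this gives a Jaro score of about 0.944 and a Jaro-Winkler score of about 0.961, since the three-letter prefix "MAR" is shared.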

Deciding which to use is not just a matter of performance. It's important to pick a method that is suited to the nature of the strings you are comparing. In general though, both of the algorithms you mentioned can be expensive, because each string must be compared to every other string, and with millions of strings in your data set, that is a tremendous number of comparisons. That is much more expensive than something like computing a phonetic encoding for each string, and then simply grouping strings sharing identical encodings.
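To show why grouping by a phonetic encoding is so much cheaper than comparing every pair, here is a small sketch using American Soundex as the encoding. Soundex is only one possible choice, and the helper names `soundex` and `group_by_soundex` are my own. Each string is encoded once, so the work grows linearly with the number of strings rather than with the number of pairs.

```python
from collections import defaultdict

# Soundex digit for each consonant; vowels, Y, H and W have no digit.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _letter in _letters:
        _CODES[_letter] = _digit


def soundex(name: str) -> str:
    """American Soundex: first letter plus three digits, e.g. 'Robert' -> 'R163'."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    previous = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same digit
        digit = _CODES.get(c, "")
        if digit and digit != previous:
            encoded += digit
        previous = digit  # a vowel resets this, so a repeated digit is kept
    return (encoded + "000")[:4]


def group_by_soundex(names):
    """Bucket names by encoding in a single pass over the data."""
    groups = defaultdict(list)
    for name in names:
        groups[soundex(name)].append(name)
    return dict(groups)
```

A more expensive measure such as Jaro-Winkler could then be run only within each bucket, if finer ranking is needed.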

In timing comparisons, the slowest algorithm took 2 to 3 times as long as the fastest. Of course, these times depend on the lengths of the strings and on the implementations, and there are ways to optimize these algorithms that may not have been used.

Sanctions List Search will first look for potential matches based on the first letter of input search terms and by checking for matches that are at least 50% similar based on edit distance (edit distance is the minimum number of operations required to transform the input string of characters into the string that it is being compared to on the list). Sanctions List Search then uses two matching logic algorithms and two matching logic techniques to calculate the score. The two algorithms are Jaro-Winkler, a string difference algorithm, and Soundex, a phonetic algorithm. The first technique involves using the Jaro-Winkler algorithm to compare the entire name string entered against full name strings of potential match entries on OFAC's sanctions lists. The second technique involves splitting the name string entered into multiple name parts (for example, John Doe would be split into two name parts). Each name part is then compared to name parts on all of OFAC's sanctions lists using the Jaro-Winkler and Soundex algorithms. The search calculates a score for each name part entered, and a composite score for all name parts entered. Sanctions List Search uses both techniques each time the search is run and returns the higher of the two scores in the Score column.

The Jaro-Winkler distance algorithm is a measure of the similarity between two strings. It is a variant of the Jaro similarity algorithm, which compares the two strings character by character and takes into account the number of matching characters and the number of transpositions among them. The Jaro-Winkler algorithm adds a prefix bonus to the Jaro similarity score, which gives additional weight to matching characters at the beginning of the strings being compared. This lets the algorithm score strings that share a common prefix higher than strings that differ at the start. Despite the name "distance", the resulting value is a similarity: it ranges from 0, indicating no similarity between the two strings, to 1, indicating that the two strings are identical.
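For reference, the description above corresponds to the standard formulas below. The symbols are my own notation, not taken from the text: m is the number of matching characters, t is half the number of matching characters that appear in a different order, ℓ is the length of the common prefix (conventionally capped at 4), and p is the prefix scaling factor (conventionally 0.1).

```latex
\[
\mathrm{sim}_j =
\begin{cases}
0 & \text{if } m = 0,\\[4pt]
\dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right) & \text{otherwise,}
\end{cases}
\qquad
\mathrm{sim}_w = \mathrm{sim}_j + \ell\, p\,\bigl(1 - \mathrm{sim}_j\bigr)
\]
```

Because the bonus term is proportional to 1 - sim_j, it can never push the score above 1 as long as ℓ p ≤ 1.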

At Tilores we use the Jaro-Winkler distance algorithm as one of the potential data record matching algorithms for entity resolution. These can be combined with other matching algorithms to allow fine-tuned data matching and deduplication.

The first major improvement to performance came from the batch_jaro_winkler implementation by Dominik Bousquet. The idea behind this implementation is to create a lookup table for one of the datasets that has tuples of names and pointers to the records with that name. This lookup table is keyed on the letters which are present in the names.
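A rough sketch of such a lookup table might look like the following. This is only an illustration of the idea as described here, not the actual data layout used by batch_jaro_winkler; the function names `build_letter_index` and `candidates` and the `(record_id, name)` input shape are hypothetical.

```python
from collections import defaultdict


def build_letter_index(records):
    """Build a table keyed on letters.

    records: iterable of (record_id, name) pairs (hypothetical input shape).
    Each letter maps to a list of (name, record_ids) tuples for names containing it.
    """
    ids_by_name = defaultdict(list)
    for record_id, name in records:
        ids_by_name[name].append(record_id)  # identical names share one entry
    index = defaultdict(list)
    for name, ids in ids_by_name.items():
        entry = (name, tuple(ids))
        for letter in set(name):
            index[letter].append(entry)
    return index


def candidates(index, query):
    """Return names sharing at least one letter with the query.

    Names with no letter in common have no matching characters, so their
    Jaro score is 0 and they can be skipped without being compared.
    """
    found = {}
    for letter in set(query):
        for name, ids in index.get(letter, ()):
            found[name] = ids
    return found
```

Each distinct name is stored once with pointers to all of its records, so a score computed for a name can be reused for every record that carries it.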

It would be extremely verbose for me to explain all of the small bitwise operations that are in use within the algorithm, and doing so would probably require an entire blog post of its own. If you are curious about the full implementation of the algorithm, you can check out the code in our (relatively well-documented) repo on GitHub.

In the end I was able to achieve a 10-15x speedup compared to the batch_jaro_winkler library, and a 40-50x speedup compared to common string comparison libraries. However, there is one caveat: in some cases the algorithm overcounts transpositions, resulting in scores that are occasionally lower than those from other implementations. This difference is relatively minor: the average error is 0.002 or less when testing against names from the 1880 U.S. census. Our next steps with the algorithm will be to extend it to compute Jaro-Winkler scores on multiple names for the same record (first and last), which will be required before we integrate it into our larger linking pipeline.

In addition to returning results that are exact matches (when the match threshold slider bar is set to 100%), Sanctions List Search can also provide a broader set of results using fuzzy logic. This logic uses character and string matching as well as phonetic matching. Only the name field of Sanctions List Search invokes fuzzy logic when the tool is run. The other fields on the tool use character matching logic. For more information on what a true sanctions list match is, see FAQ 5. For more information on the slider bar, see FAQ 247.

The score field indicates the similarity between the name entered and resulting matches on one of OFAC's sanctions lists. It is calculated using two matching logic algorithms: one based upon phonetics, and a second based upon the similarity of the characters in the two strings. The slider bar defaults to a score of 100, which indicates an exact match. Lower scores indicate potential matches.

The minimum name score field limits the number of names returned by the search. A value of 100 will return only names that exactly match the characters entered into the name field. A value of 50 will return all names that are deemed to be 50% similar based upon the matching logic of the search tool. By lowering the match threshold the system will return a broader result set.

OFAC cannot recommend a specific match threshold because each search has its own unique set of facts surrounding it. Users of Sanctions List Search must make their own match threshold determinations based upon their own internal risk assessments and established compliance practices.

Sanctions List Search is a free tool provided by OFAC to assist the public in complying with sanctions programs. It is intended to be used by individual users that are looking for potential matches on OFAC's sanctions lists. It should not be utilized by automated systems that are configured to continually run searches through the tool. For a copy of files that can be easily interpreted by automated systems and software programs, please see the list of XML, CSV, PIP, DEL, and FF files on the Specially Designated Nationals (SDN) and Consolidated Sanctions List pages.

Sanctions List Search will look for and return potential matches from the Specially Designated Nationals (SDN) and Consolidated Sanctions Lists. The user can look under the List column to see which list(s) a potential match is on. Please see the Consolidated List page for more detailed information on what is included in the Consolidated List.

OFAC's Sanctions List Search is updated frequently and always contains the latest versions of OFAC's sanctions lists. Like OFAC's other list-related publications, Sanctions List Search does not contain historical information. Names that have been removed from OFAC's Specially Designated Nationals (SDN) or Consolidated Lists are not included in Sanctions List Search. Likewise, targets that have been updated only appear with their most up to date entry information. For historical information about a target on one of OFAC's sanctions lists, please see our archive page.

Two strings with no matching characters at the start would keep the same score, but because our strings have letters in common at the beginning, this version of the metric has boosted our score from 0.88 up to 0.92.
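The size of that boost is consistent with the standard prefix bonus. The strings being compared are not given here, so the worked arithmetic below assumes a shared prefix of ℓ = 3 characters and the conventional scaling factor p = 0.1; both values are assumptions, and the input 0.88 is itself a rounded figure.

```latex
\[
\mathrm{sim}_w = \mathrm{sim}_j + \ell\, p\,(1 - \mathrm{sim}_j)
= 0.88 + 3 \times 0.1 \times (1 - 0.88)
= 0.88 + 0.036
= 0.916 \approx 0.92
\]
```

With no shared prefix, ℓ = 0 and the bonus term vanishes, which is why such strings keep their original Jaro score.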

