I ran some performance benchmarks this weekend on the Levenshtein algorithm from TeamCohen that we currently use in the OpenRefine 3.4.1 release.
I think we could do better using the Apache algorithm we already ship (which should be better able to take direct advantage of some CPU SSE4.2 instructions).
I even tried to nestle org.apache.commons.text.similarity.LevenshteinDistance into the DistanceFactory myself, but ran into problems, and I'm not entirely sure whether a call to super was needed. (My Java Fu sucks and will never get better.)
So hopefully someone else can quickly nestle the Apache library into place or advise me how it could possibly be done?
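As a starting point for discussion, here is a rough sketch of the adapter shape I have in mind. Everything named here is an assumption: StringDistance is a hypothetical stand-in for whatever interface DistanceFactory actually expects, and TwoRowLevenshtein is a plain-Java placeholder so the sketch runs without extra jars. It is not the Apache implementation and not OpenRefine's real API.

```java
import java.util.function.ToIntBiFunction;

/** Hypothetical stand-in for the distance interface DistanceFactory expects. */
interface StringDistance {
    double compute(String a, String b);
}

/** Wraps any integer edit-distance function behind the stand-in interface. */
final class EditDistanceAdapter implements StringDistance {
    private final ToIntBiFunction<String, String> impl;

    EditDistanceAdapter(ToIntBiFunction<String, String> impl) {
        this.impl = impl;
    }

    @Override
    public double compute(String a, String b) {
        // No superclass is involved here, so no super call is needed in this shape.
        return impl.applyAsInt(a, b);
    }
}

/** Placeholder edit distance (two-row dynamic programming), used only to keep the sketch runnable. */
final class TwoRowLevenshtein {
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

With commons-text on the classpath, the placeholder could presumably be swapped for `new EditDistanceAdapter((a, b) -> LevenshteinDistance.getDefaultInstance().apply(a, b))`. How the result then gets registered with DistanceFactory is the part I could not work out.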
(Vectorizable meaning able to utilize CPU intrinsic functions to increase performance, within limits.)
To see whether intrinsic methods are being used, and where in the compiled code, you can add these JVM options: -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining
(-XX:+PrintInlining is a diagnostic option, so -XX:+UnlockDiagnosticVMOptions has to come before it.) Then watch the console output while performing Clustering operations, alongside CPU profiling in VisualVM.
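A full Clustering run produces a lot of console noise, so a small standalone driver may make the output easier to read. The sketch below is only an assumption about how one might time a distance function in isolation. DistanceTimer and its parameters are made-up names, and a proper harness such as JMH would give more trustworthy numbers.

```java
import java.util.function.ToIntBiFunction;

/** Hypothetical timing driver: runs a distance function over all ordered pairs of the inputs. */
final class DistanceTimer {

    /** Returns the average nanoseconds per call, measured after one warm-up pass. */
    static double nanosPerCall(ToIntBiFunction<String, String> fn, String[] values, int rounds) {
        long sink = runAllPairs(fn, values, rounds); // warm-up so the JIT has a chance to compile fn

        long start = System.nanoTime();
        sink += runAllPairs(fn, values, rounds);
        long elapsed = System.nanoTime() - start;

        // Use the accumulated result so the JIT cannot discard the calls as dead code.
        if (sink == Long.MIN_VALUE) {
            System.out.println(sink);
        }
        long calls = (long) rounds * values.length * values.length;
        return (double) elapsed / calls;
    }

    private static long runAllPairs(ToIntBiFunction<String, String> fn, String[] values, int rounds) {
        long sum = 0;
        for (int r = 0; r < rounds; r++) {
            for (String a : values) {
                for (String b : values) {
                    sum += fn.applyAsInt(a, b);
                }
            }
        }
        return sum;
    }
}
```

Calling it from a main method with each candidate distance function, and launching that with the options above, should show which methods get compiled and inlined during the timed pass.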
I would be more than happy to help anyone who can code what we need (much better than my old eyes can), and to help benchmark it, which is not for the faint of heart.