Hi all,
I need your help to understand the fuzzy search functionality of the LuceneSail.
According to the Lucene query parser syntax documentation (
https://lucene.apache.org/core/2_9_4/queryparsersyntax.html#Fuzzy%20Searches), Lucene supports fuzzy searches based on the Levenshtein distance when the "~" symbol is appended to a word in a full-text query. The documentation also says that an additional parameter can be used to specify the required similarity value (between 0 and 1).
My issue is that, when I specify a value for this parameter, I need to retrieve the actual similarity value computed for each result, and I cannot find any page explaining how it is computed (especially how the Levenshtein distance is converted to a similarity value).
I tried to calculate it myself with this formula:
similarity=(longerLength(x,y) - levenshteinDistance(x,y)) / longerLength(x,y)
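To make the calculation concrete, here is a minimal Python sketch of what I am computing. The function names are my own and this is not Lucene's code; it only assumes the standard Levenshtein distance with unit costs for insertion, deletion and substitution.

```python
def levenshtein_distance(x: str, y: str) -> int:
    """Unit-cost edit distance, computed one row at a time."""
    previous = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        current = [i]
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            current.append(min(
                previous[j] + 1,         # delete a character from x
                current[j - 1] + 1,      # insert a character into x
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def similarity(x: str, y: str) -> float:
    """My formula: (longerLength - distance) / longerLength."""
    longer = max(len(x), len(y))
    if longer == 0:
        return 1.0  # assumption: two empty strings count as identical
    return (longer - levenshtein_distance(x, y)) / longer
```

With this sketch, a three-letter word that differs from the query word in one letter gets (3 - 1) / 3, i.e. roughly 0.67.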
My formula does not match my tests. For example, I compute 0.6 for a particular result, yet it still appears when I filter with a similarity value of 0.7. All of my queries are ANDed, i.e. I require every word of the input to appear in the results, and I apply the same similarity value to every word.
It seems that all the results I obtain get the same similarity value: if I specify a value above a certain threshold they are all filtered out, and every test with a value below that threshold returns the same number of results.
With this in mind, I also tried averaging the similarities of the matched word pairs, e.g.:
query = "aaa ccc" (effectively it is "+aaa~0.5 +ccc~0.5")
document = "aba ccd eee"
I calculated (similarity("aaa", "aba") + similarity("ccc", "ccd")) / 2
The result does not match the behaviour of the Lucene query engine either.
Thank you very much for your attention. A link to a page explaining how the Lucene query engine handles fuzzy searches would be enough.
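For completeness, this is roughly how I computed the average for the example, as a self-contained Python sketch. The pairing rule (each query word is matched with its closest document word) and all function names are my own assumptions, not something taken from Lucene.

```python
def levenshtein_distance(x, y):
    # Unit-cost edit distance, computed one row at a time.
    row = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        new_row = [i]
        for j, cy in enumerate(y, start=1):
            new_row.append(min(row[j] + 1,
                               new_row[j - 1] + 1,
                               row[j - 1] + (cx != cy)))
        row = new_row
    return row[-1]


def similarity(x, y):
    # (longerLength - distance) / longerLength, as in my formula.
    longer = max(len(x), len(y))
    return 1.0 if longer == 0 else (longer - levenshtein_distance(x, y)) / longer


def average_best_similarity(query_terms, document_terms):
    # Assumption: each query term is paired with its closest document term.
    best = [max(similarity(q, d) for d in document_terms) for q in query_terms]
    return sum(best) / len(best)


score = average_best_similarity(["aaa", "ccc"], ["aba", "ccd", "eee"])
print(round(score, 2))  # 0.67
```

Both pairs differ by one letter out of three, so each contributes about 0.67 and so does the average.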
Regards,
Enrique.