Lucene fuzzy searches percentage

Enrique Alonso

Oct 7, 2016, 7:19:30 AM

to RDF4J Users

Hi all,

I need your help to understand the fuzzy search functionality of the LuceneSail.
According to the Lucene query parser syntax documentation (https://lucene.apache.org/core/2_9_4/queryparsersyntax.html#Fuzzy%20Searches), Lucene supports fuzzy searches based on the Levenshtein distance: you append the "~" symbol to any word of a full-text query. The documentation also says that an additional parameter can be used to specify the required similarity value (between 0 and 1).

My issue is that, if I specify a value for this parameter, I need to retrieve the actual similarity value obtained for each result, and I can't find any site explaining how it is computed (especially how the Levenshtein distance is converted to a percentage).
I've tried to calculate it myself with this formula:

similarity=(longerLength(x,y) - levenshteinDistance(x,y)) / longerLength(x,y)

But it doesn't match the tests I ran (e.g. I compute 0.6 for a particular result, but it still appears if I use a similarity filter of 0.7). All of my queries are ANDed, i.e. I require all the words of the input to appear in the results, and I apply the same similarity value to all of the words.
It seems that all the results are getting the same similarity value: if I specify a value higher than a certain threshold they are all filtered out, and every test with a similarity value lower than that threshold returns the same number of results.
With this in mind I've also tried to calculate the average over the matching pairs of words, e.g.:

query = "aaa ccc" (effectively it is "+aaa~0.5 +ccc~0.5")
document= "aba ccd eee"

I calculated (similarity("aaa", "aba") + similarity("ccc", "ccd")) / 2
but the result doesn't match the behaviour of the Lucene query engine either.
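For reference, the two calculations described above can be sketched as follows. This only illustrates the formula from this post (a normalised Levenshtein distance and its average over word pairs); it is not a claim about how Lucene scores or filters fuzzy matches. The function names are made up for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(x: str, y: str) -> float:
    """(longerLength - distance) / longerLength, as in the formula above."""
    longer = max(len(x), len(y))
    if longer == 0:
        return 1.0  # two empty strings are treated as identical
    return (longer - levenshtein(x, y)) / longer


def average_similarity(pairs) -> float:
    """Mean similarity over (query word, document word) pairs."""
    return sum(similarity(q, d) for q, d in pairs) / len(pairs)


pairs = [("aaa", "aba"), ("ccc", "ccd")]
print(round(average_similarity(pairs), 3))  # 0.667
```

Both pairs differ by one substitution in three characters, so each similarity is 2/3 and so is the average.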

Thank you very much for your attention. A link to a website explaining how the Lucene query engine handles fuzzy searches would be enough.
Regards,
Enrique.

Jeen Broekstra

Oct 9, 2016, 5:42:56 PM

to rdf4j...@googlegroups.com

On 7 Oct. 2016, at 22:19, Enrique Alonso <ealo...@gmail.com> wrote:

> Hi all,
>
> I need your help to understand the fuzzy search functionality of the LuceneSail.
> According to Lucene query parser syntax documentation (https://lucene.apache.org/core/2_9_4/queryparsersyntax.html#Fuzzy%20Searches), Lucene supports fuzzy searches based on the Levenshtein distance adding the "~" symbol to any word of the input of a fulltext query. They also say that an additional parameter can be used to specify the required similarity value (between 0 and 1).

That wiki is out of date (it documents Lucene version 2.9.4, which is very old). In newer versions of Lucene, you simply specify the actual required edit distance, as an integer between 0 and 2. See for example the Lucene 5 documentation:

https://lucene.apache.org/core/5_0_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Fuzzy_Searches

I quote:

"An additional (optional) parameter can specify the maximum number of edits allowed. The value is between 0 and 2, For example:

roam~1

The default that is used if the parameter is not given is 2 edit distances.

Previously, a floating point value was allowed here. This syntax is considered deprecated and will be removed in Lucene 5.0"

HTH,

Jeen
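As a small illustration of the integer syntax described in this reply, the sketch below builds an ANDed fuzzy query string such as "+aaa~1 +ccc~1" and rejects anything that is not an integer edit distance from 0 to 2. The helper name `fuzzy_query` is hypothetical, and the sketch assumes the words contain no characters that need escaping for the query parser.

```python
MAX_EDITS = 2  # upper bound quoted from the Lucene 5 documentation above


def fuzzy_query(words, max_edits=MAX_EDITS):
    """Build an ANDed fuzzy query string, e.g. '+aaa~1 +ccc~1'.

    Hypothetical helper: it does not escape special query-parser characters.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_edits, bool) or not isinstance(max_edits, int):
        raise ValueError("edit distance must be an integer")
    if not 0 <= max_edits <= MAX_EDITS:
        raise ValueError(f"edit distance must be between 0 and {MAX_EDITS}")
    return " ".join(f"+{word}~{max_edits}" for word in words)


print(fuzzy_query(["aaa", "ccc"], 1))  # +aaa~1 +ccc~1
```

Validating the value before building the string keeps legacy fractional values such as 0.5 from reaching the parser at all.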

Enrique Alonso

Oct 10, 2016, 8:21:55 AM

to RDF4J Users

Thank you very much for the information.
I still find some odd behaviour given that explanation of how the similarity value works in fuzzy searches.
In a test I ran, I obtained 936 results with the number of edits set to 1, and 966 with an integer of 2 or higher (although anything above 2 is still treated as 2, right?). Those results are what I expect. However, I've also tried values lower than 1 and the results are a bit confusing. I know it doesn't make sense to use those values, but they are the ones I had been trying before, because I thought they were the only allowed values, and I obtained different numbers of results in different ranges between 0 and 1:
If I use a value between 0 and 0.715 I obtain 966 results.
If I use a value between 0.715 and 0.85 I obtain 936 results.
If I use a value between 0.85 and 1 I obtain 0 results.

Also, if I use a decimal value higher than 1 I get a ParseException saying that fractional edit distances are not allowed.
It is not a very important issue, because I could force users to set only integer values (0, 1 or 2), but I would like to fully understand Lucene's behaviour because I'd like to add to each result its similarity value (not the score, but a similarity percentage based on the edit distance).

I'm using the Lucene 5.1.0 libraries, although I'm still on Sesame 2.8.11 (I still have to migrate my tool to RDF4J). I don't know whether that could cause the floating-point value to still be processed in fuzzy searches.

Thanks in advance and best regards,
Enrique.
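One possible explanation for the ranges reported above, offered as an assumption rather than a confirmed answer: Lucene 4.x/5.x appears to still accept the legacy float and convert it to an integer edit count per term, roughly as `min(int((1 - similarity) * termLength), 2)`. The sketch below is a Python transcription of that rule as recalled from the deprecated `FuzzyQuery.floatToEdits` helper; it should be checked against the source of the Lucene version actually in use.

```python
MAX_EDITS = 2  # maximum edit distance supported by the fuzzy query


def float_to_edits(minimum_similarity: float, term_length: int) -> int:
    """Assumed conversion of a legacy similarity float to an edit count."""
    if minimum_similarity >= 1.0:
        # Values >= 1 are read as an edit count, capped at MAX_EDITS.
        return int(min(minimum_similarity, MAX_EDITS))
    if minimum_similarity == 0.0:
        return 0  # assumption: 0 means exact match, not "unlimited edits"
    # Truncation (not rounding) produces the step-like thresholds.
    return min(int((1.0 - minimum_similarity) * term_length), MAX_EDITS)


# For a 7-character term the steps fall at 1 - 2/7 (about 0.714)
# and 1 - 1/7 (about 0.857).
for sim in (0.7, 0.8, 0.9):
    print(sim, float_to_edits(sim, 7))
```

Under this assumption a 7-character term gets 2 edits below about 0.714, 1 edit up to about 0.857, and 0 edits (exact match) above that, which is close to the 0.715 and 0.85 boundaries observed. It would also mean every hit within the allowed edit count passes the filter, so the threshold depends on the length of the query term and not on the individual result.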
