Hi, I'm indexing Brazilian Portuguese text that contains accents. To remove them I used the asciifolding filter. My "test" index settings are as follows:
Hi Clint, I'll take advantage of this thread to ask a quite similar
question:
I've been running some searches (ES 0.18.5) on text with accents as well
(in Spanish) but, instead of using the 'asciifolding' filter at
indexing time, I'd like to get similar results using fuzzy queries
(which would also return results for similar words).
I want to search docs by one or more words on a free-text
field called 'title', and I'd like to get the same results for both
"bateria" and "batería", for instance. The query is:
AFAIK a word with one accented letter is at distance '1' from the same
word with no accent, is this correct? If so, the query should consider
all docs that contain, in this case, either "bateria" or "batería", right?
Right now I'm getting a different number of results for each and I'm not sure
what the reason could be.
Thanks in advance
Frederic
On 16 Jan, 16:05, Clinton Gormley <cl...@traveljury.com> wrote:
> I've been performing some searches (ES 0.18.5) using accents as well
> (in spanish) but, instead of using the 'asciifolding' filter at a
> indexing time, I'd want to get similar results using fuzzy queries
> (apart from getting also results for similar words).
> I want to search docs using one or more words, based on a free text
> field, called 'title', and I'd like to get the same results for both,
> for instance, "bateria" and "batería" words. The query is:
> AFAIK a word with one accented letter is at distance '1' from the same
> word with no accent, is this correct? If so, the query should consider
> all docs that contains, in this case "batería" and "batería", right?
If you change the "fuzziness" factor to 0.5, it will probably work. I don't understand exactly what that number represents, so can't give you more than a trial-and-error approach :)
That said, using a fuzzy query for this type of search is a lot heavier than analyzing your text properly at index time.
>If you change the "fuzziness" factor to 0.5, it will probably work.
Not really, actually: a factor of 0.7 should already be enough for matching
words at a distance of 1.
>I don't understand exactly what that number represents, so can't give you
>more than a trial-and-error approach :)
Just for the sake of providing info about this topic (this is what I
know so far; most likely Kimchy or some other Lucene expert will know
the right answer):
The 'fuzziness' factor refers to the 'minimumSimilarity' parameter of
a Lucene FuzzyQuery (http://lucene.apache.org/java/3_2_0/api/all/org/apache/lucene/search/Query.html): for a minimumSimilarity of 0.7, a
term of the same length as the query term is considered similar to the
query term if the edit distance between both terms is less than
length(term)*(1-0.7)
Thus, the Levenshtein distance (LD) between "bateria" and "batería" is 1 (just one char change)
and length('batería')*(1-0.7) = 7*0.3 = 2.1 > 1, so the two terms should match.
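To make the arithmetic above concrete, here is a small Python sketch of the rule as I described it. It is not Lucene's actual code: the function names (levenshtein, is_similar) are made up for illustration, and it only covers the case where both terms have the same length.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    # Normalize to NFC so that an accented letter counts as a single character.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def is_similar(query_term: str, indexed_term: str, min_similarity: float = 0.7) -> bool:
    """Rule from this post (an assumption, for same-length terms):
    match if edit distance < length(term) * (1 - min_similarity)."""
    length = len(unicodedata.normalize("NFC", query_term))
    max_distance = length * (1 - min_similarity)
    return levenshtein(query_term, indexed_term) < max_distance


print(levenshtein("bateria", "batería"))      # 1
print(is_similar("bateria", "batería", 0.7))  # True, since 1 < 2.1
print(is_similar("bateria", "batería", 0.9))  # False, since 1 is not < 0.7
```

With this rule, lowering the factor to 0.5 only widens the allowed distance (7*0.5 = 3.5), so it cannot be what decides whether "bateria" matches "batería".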
>That said, using a fuzzy query for this type of search is a lot heavier
>than analyzing your text properly at index time.
Totally agree, it's just that in my case I need to work with a system
that is already in production with 50M docs indexed, so I cannot recreate the
index to change the 'title' field analyzer.
The only idea I have so far is to add another field to the type with
an 'asciifolding' analyzer, populate that field for all docs and
switch the searches over to the new field.
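
As a rough illustration of why a folded field would give the same results for both spellings, here is a Python sketch. It only approximates the accent-stripping part of 'asciifolding' using Unicode decomposition (the real filter handles more characters than this), and the names fold_accents, analyze and search, as well as the two sample titles, are hypothetical.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Strip combining marks after NFD decomposition, e.g. 'í' becomes 'i'.
    This is an approximation, not the asciifolding filter itself."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text: str) -> list:
    """Toy analyzer: lowercase, split on whitespace, fold accents."""
    return [fold_accents(token) for token in text.lower().split()]


# Hypothetical sample docs standing in for the 'title' field.
titles = {1: "Batería de coche", 2: "bateria recargable", 3: "cargador"}

# The extra field: tokens are folded once, when the field is populated.
folded_titles = {doc_id: analyze(title) for doc_id, title in titles.items()}


def search(term: str) -> list:
    """Fold the query term the same way, then do an exact token match."""
    needle = fold_accents(term.lower())
    return sorted(doc_id for doc_id, tokens in folded_titles.items()
                  if needle in tokens)


print(search("bateria"))  # [1, 2]
print(search("batería"))  # [1, 2]
```

The point of the sketch is that the folding cost is paid once per document when the new field is populated, and each search is then a plain term lookup instead of a fuzzy expansion.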
Thanks for your great support,
Frederic
On 17 Jan, 08:45, Clinton Gormley <cl...@traveljury.com> wrote:
> > I've been performing some searches (ES 0.18.5) using accents as well
> > (in spanish) but, instead of using the 'asciifolding' filter at a
> > indexing time, I'd want to get similar results using fuzzy queries
> > (apart from getting also results for similar words).
> > I want to search docs using one or more words, based on a free text
> > field, called 'title', and I'd like to get the same results for both,
> > for instance, "bateria" and "batería" words. The query is:
> > AFAIK a word with one accented letter is at distance '1' from the same
> > word with no accent, is this correct? If so, the query should consider
> > all docs that contains, in this case "batería" and "batería", right?
> If you change the "fuzziness" factor to 0.5, it will probably work. I
> don't understand exactly what that number represents, so can't give you
> more than a trial-and-error approach :)
> That said, using a fuzzy query for this type of search is a lot heavier
> than analyzing your text properly at index time.
> > Totally agree, it's just that in my case I need to work in an already
> > productive system with 50M docs indexed, so I cannot recreate the
> > index for changing the 'title' field analyzer.
> > The only idea I have so far, is to add another field to the type with
> > an 'asciifolding' analyzer, populate that field for all docs and
> > switch the field in which the searches are performing to the new one.