Problem searching queries with accents
From: Felipe Hummel <felipehum...@gmail.com>
Date: Mon, 16 Jan 2012 08:24:06 -0800 (PST)
Local: Mon, Jan 16 2012 11:24 am
Subject: Problem searching queries with accents

Hi, I'm indexing Brazilian Portuguese text that contains accents. To remove
them I used the asciifolding filter. My "test" index settings are as follows:

{
    "test": {
        "settings": {
            "index.analysis.analyzer.default.filter.0": "standard",
            "index.analysis.analyzer.default.tokenizer": "standard",
            "index.analysis.analyzer.default.filter.1": "lowercase",
            "index.analysis.analyzer.default.filter.2": "stop",
            "index.analysis.analyzer.default.filter.3": "asciifolding",
            "index.number_of_shards": "1",
            "index.number_of_replicas": "0"
        }
    }
}

I indexed the word "não". When I search "nao" (no accent) the document is
retrieved. If I search for "não" no document is retrieved.

Is something wrong with my configuration?

I'm using curl to query elasticsearch.

Thanks

Felipe Hummel


 
From: Clinton Gormley <cl...@traveljury.com>
Date: Mon, 16 Jan 2012 19:18:59 +0100
Local: Mon, Jan 16 2012 1:18 pm
Subject: Re: Problem searching queries with accents

> I indexed the word "não". When I search "nao" (no accent) the document
> is retrieved. If I search for "não" no document is retrieved.

How are you searching?  I bet you're using a 'term' query, which isn't
analyzed.  Change that to a 'text' query, and it should work.
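
To illustrate the difference between an analyzed and a non-analyzed query, here is a rough Python sketch. It is not how elasticsearch or Lucene is implemented: `analyze`, `index_doc`, `term_query` and `text_query` are made-up names, and the accent stripping only approximates what the lowercase and asciifolding filters do.

```python
import unicodedata

def analyze(text):
    """Lowercase and strip accents: a rough stand-in for lowercase + asciifolding."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return folded.split()

# Toy inverted index: token -> set of document ids.
index = {}

def index_doc(doc_id, text):
    """Analyze the text and store each resulting token in the index."""
    for token in analyze(text):
        index.setdefault(token, set()).add(doc_id)

def term_query(term):
    """Look the term up exactly as given, with no analysis."""
    return index.get(term, set())

def text_query(text):
    """Analyze the query text first, then look up each token."""
    hits = set()
    for token in analyze(text):
        hits |= index.get(token, set())
    return hits

index_doc(1, "n\u00e3o")  # the word "não" is stored as the folded token "nao"
```

In this sketch `term_query("não")` finds nothing because only the folded token "nao" is in the index, while `text_query("não")` folds the query first and finds document 1.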

clint


 
From: Felipe Hummel <felipehum...@gmail.com>
Date: Mon, 16 Jan 2012 10:59:37 -0800 (PST)
Local: Mon, Jan 16 2012 1:59 pm
Subject: Re: Problem searching queries with accents

That is right!

Actually I was also testing with the form:

http://localhost:9200/test/test1/_search?q=não

I suppose it just gets converted to a TermQuery, because the following
query:

http://localhost:9200/test/teste1/_search?q=não+something

yields the right results.

Thanks

Felipe Hummel


 
From: Clinton Gormley <cl...@traveljury.com>
Date: Mon, 16 Jan 2012 20:05:55 +0100
Local: Mon, Jan 16 2012 2:05 pm
Subject: Re: Problem searching queries with accents

> Actually I was also testing with the form:
> http://localhost:9200/test/test1/_search?q=não

Actually, that gets converted to a query_string query against the _all
field, which should have worked.

I wonder if it was a problem with your encoding.

Does this work?

curl -XGET 'http://127.0.0.1:9200/test/test1/_search?pretty=1&q=n%C3%A3o'
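
For reference, a small Python sketch of where `n%C3%A3o` comes from: it is the percent-encoded UTF-8 form of "não". The Latin-1 variant is shown only as an example of what a differently configured client might send; the thread does not establish which encoding was actually used in the failing request.

```python
from urllib.parse import quote, unquote

word = "n\u00e3o"  # "não"

# Percent-encode the UTF-8 bytes of the word (the form used in the curl command).
utf8_form = quote(word, encoding="utf-8")

# Percent-encode the Latin-1 bytes instead (assumption: one possible mis-encoding).
latin1_form = quote(word, encoding="latin-1")

# Decoding the UTF-8 form gives back the original word.
round_trip = unquote(utf8_form, encoding="utf-8")

print(utf8_form, latin1_form, round_trip)
```

The two encoded forms differ, so a server that expects UTF-8 would not see "não" if the client sent the Latin-1 bytes.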

clint


 
From: Felipe Hummel <felipehum...@gmail.com>
Date: Mon, 16 Jan 2012 11:09:02 -0800 (PST)
Local: Mon, Jan 16 2012 2:09 pm
Subject: Re: Problem searching queries with accents

You're right, it must be some encoding problem. The URL-encoded version
works as expected.

Felipe Hummel


 
From: Frederic <focampo...@gmail.com>
Date: Mon, 16 Jan 2012 12:25:35 -0800 (PST)
Local: Mon, Jan 16 2012 3:25 pm
Subject: Re: Problem searching queries with accents

Hi Clint, I'll take advantage of this thread to ask a quite similar
question:

I've been performing some searches (ES 0.18.5) with accented words as
well (in Spanish), but instead of using the 'asciifolding' filter at
indexing time, I'd like to get similar results using fuzzy queries
(which would also return results for similar words).

I want to search docs by one or more words in a free-text field called
'title', and I'd like to get the same results for both "bateria" and
"batería", for instance. The query is:

      "query" : {
        "text" : {
          "title" : {
            "query" : "batería",
            "type" : "boolean",
            "operator" : "AND",
            "fuzziness" : "0.7",
            "max_expansions" : 3
          }
        }
      }

AFAIK a word with one accented letter is at distance 1 from the same
word with no accent. Is this correct? If so, the query should match
all docs that contain either "bateria" or "batería", right?

Right now I'm getting a different number of results for each, and I'm
not sure what the reason could be.

Thanks in advance

Frederic



 
From: Clinton Gormley <cl...@traveljury.com>
Date: Tue, 17 Jan 2012 12:45:04 +0100
Local: Tues, Jan 17 2012 6:45 am
Subject: Re: Problem searching queries with accents

If you change the "fuzziness" factor to 0.5, it will probably work.  I
don't understand exactly what that number represents, so can't give you
more than a trial-and-error approach :)

That said, using a fuzzy query for this type of search is a lot heavier
than analyzing your text properly at index time.

clint


 
From: Frederic <focampo...@gmail.com>
Date: Tue, 17 Jan 2012 08:05:27 -0800 (PST)
Local: Tues, Jan 17 2012 11:05 am
Subject: Re: Problem searching queries with accents

Thanks for your answer, Clint. Some comments:

>If you change the "fuzziness" factor to 0.5, it will probably work.

Not really, actually: a factor of 0.7 should already be enough to match
words at a distance of 1.

>I don't understand exactly what that number represents, so can't give you
>more than a trial-and-error approach :)

Just for the sake of providing info about this topic (this is what I
know so far, most likely Kimchy or some other Lucene expert will know
the right answer):

The 'fuzziness' factor refers to the 'minimumSimilarity' parameter of
a Lucene FuzzyQuery
(http://lucene.apache.org/java/3_2_0/api/all/org/apache/lucene/search/Query.html):
for a minimumSimilarity of 0.7, a term of the same length as the query
term is considered similar to the query term if the edit distance
between both terms is less than length(term)*(1-0.7).

The distance value is based on an implementation of the 'Levenshtein
Distance' algorithm (http://www.merriampark.com/ld.htm).

Thus, the LD between "bateria" and "batería" is 1 (just one character
change), and length('batería')*0.3 = 2.1 > 1, so the two should match.
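
The arithmetic above can be checked with a short Python sketch. It follows the rule as described in this message for terms of the same length, not Lucene's actual FuzzyQuery code; `levenshtein` and `is_similar` are names made up for the illustration.

```python
import unicodedata

def levenshtein(a, b):
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def is_similar(query, term, minimum_similarity):
    """Apply the rule from the message: distance < length(term) * (1 - minimumSimilarity)."""
    # Normalize so that an accented letter counts as a single character.
    query = unicodedata.normalize("NFC", query)
    term = unicodedata.normalize("NFC", term)
    threshold = len(term) * (1 - minimum_similarity)
    return levenshtein(query, term) < threshold

distance = levenshtein("bateria", "bater\u00eda")
matches_at_07 = is_similar("bater\u00eda", "bateria", 0.7)
```

Under this rule the distance is 1 and the threshold at 0.7 is about 2.1, so the two words count as similar; a much stricter value such as 0.9 would give a threshold of about 0.7 and reject them.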

>That said, using a fuzzy query for this type of search is a lot heavier
>than analyzing your text properly at index time.

Totally agree. It's just that in my case I'm working with a system
already in production with 50M docs indexed, so I cannot recreate the
index to change the 'title' field analyzer.
The only idea I have so far is to add another field to the type with
an 'asciifolding' analyzer, populate that field for all docs, and
switch the searches over to the new field.

Thanks for your great support,

Frederic



 
From: Clinton Gormley <cl...@traveljury.com>
Date: Tue, 17 Jan 2012 17:26:06 +0100
Local: Tues, Jan 17 2012 11:26 am
Subject: Re: Problem searching queries with accents

> Totally agree. It's just that in my case I'm working with a system
> already in production with 50M docs indexed, so I cannot recreate the
> index to change the 'title' field analyzer.
> The only idea I have so far is to add another field to the type with
> an 'asciifolding' analyzer, populate that field for all docs, and
> switch the searches over to the new field.

You may want to take a look at multi-fields:

http://www.elasticsearch.org/guide/reference/mapping/multi-field-type...
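
As a rough idea of what such a mapping might look like, here is a Python sketch that builds the JSON body. The type name `doc`, the sub-field name `folded` and the analyzer name `folding` are all hypothetical; the analyzer would have to be defined in the index settings with the asciifolding filter, and the exact mapping syntax should be checked against the multi-field documentation for your version.

```python
import json

# Hypothetical mapping: 'title' keeps its current analysis, while the
# 'folded' sub-field indexes the same text with an accent-folding analyzer.
mapping = {
    "doc": {  # hypothetical type name
        "properties": {
            "title": {
                "type": "multi_field",
                "fields": {
                    # Default sub-field, searched as "title".
                    "title": {"type": "string", "index": "analyzed"},
                    # Extra sub-field, searched as "title.folded";
                    # "folding" is an assumed custom analyzer name.
                    "folded": {"type": "string", "analyzer": "folding"},
                },
            }
        }
    }
}

body = json.dumps(mapping, indent=2)
print(body)
```

Searches for accent-insensitive matches would then target the sub-field rather than 'title' itself.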

clint


 
From: Frederic <focampo...@gmail.com>
Date: Tue, 17 Jan 2012 10:37:51 -0800 (PST)
Local: Tues, Jan 17 2012 1:37 pm
Subject: Re: Problem searching queries with accents

That's exactly what I need. Thanks a lot.

Fred


 