Exact phrase search via si.query()

Chung Wu

Apr 18, 2012, 1:40:33 AM
to python-...@googlegroups.com
Hi all,

Solr/Sunburnt newbie here :-)  I was reading the documentation on how si.query() performs an exact phrase search here: http://opensource.timetric.com/sunburnt/queryingsolr.html#searching-your-solr-instance  

But looking at this:

>>> si.query(name='united states of america').options()
{'q': u'name:united\\ states\\ of\\ america'}

It looks like the Lucene query isn't a phrase search, but actually will match any document that has any of "united", "states", "of", "america".

Is that right?  Am I doing this wrong?

Thanks!
Chung

Toby White

Apr 18, 2012, 1:30:47 PM
to python-...@googlegroups.com
Hi there,

Sunburnt is behaving correctly; that Lucene query is a phrase search -
the spaces are escaped. The following two queries are exactly
equivalent in Lucene:

name:"united states of america"
name:united\ states\ of\ america

If you check, they should give the same result: both search for any
document containing the string of characters 'united states of
america' in the name field. This is different from:

name:united name:states name:of name:america

which would give the result you describe, or even worse:

name:united states of america

which would search for 'united' in the name field, and the words
'states', 'of', 'america' separately in the default search field.
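
As a rough illustration of the escaped form (a sketch only, not sunburnt's actual code; `escape_spaces` and `field_query` are hypothetical names), the string can be built like this:

```python
def escape_spaces(value):
    """Backslash-escape each space so the whole value stays attached to its field."""
    return value.replace(" ", "\\ ")


def field_query(field, value):
    """Build a `field:value` query string with the spaces escaped."""
    return "%s:%s" % (field, escape_spaces(value))


# Prints the escaped query with single backslashes; the doubled backslashes
# shown by options() earlier in the thread are just Python's repr of the string.
print(field_query("name", "united states of america"))
```

Without the escaping, only the first word would be tied to the name field.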

Toby

Chung Wu

Apr 18, 2012, 3:04:16 PM
to python-...@googlegroups.com
Hi Toby,

Thanks!  I guess it must be because I set autoGeneratePhraseQueries="false".

If I query [test:united\ america], then:

<str name="rawquerystring">test:united\ america</str>
<str name="querystring">test:united\ america</str>
<str name="parsedquery">test:united test:america</str>
<str name="parsedquery_toString">test:united test:america</str>

Which matches "united states of america", even though it shouldn't.

But if I use quotes instead and query [test:"united america"], then:

<str name="rawquerystring">test:"united america"</str>
<str name="querystring">test:"united america"</str>
<str name="parsedquery">PhraseQuery(test:"united america")</str>
<str name="parsedquery_toString">test:"united america"</str>

Which is an exact phrase query as expected.

If I set autoGeneratePhraseQueries="true", then using quotes or escaped spaces both look like the latter example.

Is that expected behavior?
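
In case anyone wants to repeat the comparison, here is a small sketch that pulls the parsedquery value out of debug output like the fragment pasted above. The wrapping `<lst name="debug">` element and the `parsed_query` helper name are assumptions added for illustration; only the `<str>` lines come from my actual output.

```python
import xml.etree.ElementTree as ET

# The <str> lines are copied from the escaped-space query above;
# the surrounding <lst> wrapper is an assumption so the fragment parses.
DEBUG_XML = r"""<lst name="debug">
<str name="rawquerystring">test:united\ america</str>
<str name="querystring">test:united\ america</str>
<str name="parsedquery">test:united test:america</str>
<str name="parsedquery_toString">test:united test:america</str>
</lst>"""


def parsed_query(xml_text):
    """Return the text of the <str name="parsedquery"> element, or None."""
    root = ET.fromstring(xml_text)
    for element in root.iter("str"):
        if element.get("name") == "parsedquery":
            return element.text
    return None


print(parsed_query(DEBUG_XML))
```

A parsedquery that does not start with PhraseQuery(...) is the sign that the escaped spaces were not treated as a phrase.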

Thanks!

Chung

Toby White

Apr 18, 2012, 5:04:18 PM
to python-...@googlegroups.com
Ah - I'd never encountered the autoGeneratePhraseQueries setting
before - I didn't know about this aspect of Solr's behaviour.

The Lucene query language is frustratingly under-specified - my
experimentation had led me to believe that quotes or escaped strings
were entirely equivalent.

I found it considerably easier and more consistent to generate
appropriate strings by escaping them, rather than trying to write
logic for when double-quotes should be used, which is why I did it
this way. Maybe sunburnt should check for the status of the
autoGeneratePhraseQueries setting and bail out if it's not set
appropriately.

How did you run into this issue?

Toby

Chung Wu

Apr 18, 2012, 6:03:16 PM
to python-...@googlegroups.com
autoGeneratePhraseQueries itself is underspecified -- I had a lot of trouble just figuring out what it meant! 

I believe that autoGeneratePhraseQueries="false" is actually the default in schema.xml, but only past a certain version of Lucene.  So it would be nice if query() actually generated a quoted phrase search instead of bailing :-)
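
To make the suggestion concrete, here is a minimal sketch of the quoted form. It is not a patch against sunburnt; `quote_phrase` and `phrase_query` are hypothetical names, and it assumes that backslashes and double quotes are the only characters needing escapes inside a quoted phrase.

```python
def quote_phrase(value):
    """Wrap a value in double quotes, escaping backslashes and embedded quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"%s"' % escaped


def phrase_query(field, value):
    """Build a `field:"value"` query string, i.e. an explicit phrase query."""
    return "%s:%s" % (field, quote_phrase(value))


# Produces the quoted form that parsed as a PhraseQuery in the debug output
# earlier in the thread.
print(phrase_query("test", "united america"))
```

In my debug output the quoted form parsed as a PhraseQuery even with autoGeneratePhraseQueries="false", so it would not rely on that setting.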

Thanks!
Chung