Some problems while trying to use a combination of queries in a complex search

Viktor Kojouharov

Oct 5, 2014, 6:19:30 PM

to bl...@googlegroups.com

Hello,

I've run into a problem while trying to combine Conjunction, Disjunction and Phrase queries. Besides the text fields, my data has an integer feed id, which identifies the parent feed of each data item. On an unrelated note, I found no way to force bleve to index that int64 field, so I made a custom type and turned that field into a string. Anyway, my goal is to achieve roughly the following query:

("query term" && ("FeedId:1" || "FeedId:5" ...))

For the inner field queries, I build a disjunction query that contains a bunch of phrase queries. I combine that disjunction query with the actual search term into a conjunction query, and pass that to the request. The code can be seen here: https://github.com/urandom/readeef/blob/851615ec0d8b03990fcae8a5c324ab562390c1ca/search_index.go#L211

My problem is that the results don't match my expectations. As can be seen a couple of lines above that code, I skip this complex query when I don't want to restrict the results to any particular feed id(s). That produces a correct result. If I pass all my feed ids, that should produce the same result, since the disjunction will cover every possible value of the FeedId field. That is not the case, however: it produces far fewer results. Passing only one feed id (a disjunction with a single phrase query in it) doesn't produce the correct result either, since a bunch of results are missing, yet are visible when everything is searched.

Marty Schoch

Oct 6, 2014, 1:08:46 PM

to bl...@googlegroups.com

Thanks for reporting this. I have further comments in-line below:

On Sun, Oct 5, 2014 at 3:19 PM, Viktor Kojouharov <vkojo...@gmail.com> wrote:

Hello,

I've run into a problem while trying to use a combination of Conjunction, Disjunction and Phrase queries. My data, besides the text fields, has an integer feed id, which specifies the parent feed for any particular data item. On an unrelated topic, I found no way to force bleve to index that int64 field, so I made a custom type and turned that field into a string.

Yes, at this time our support for numeric fields only covers float64. With a little bit of work it could support any numeric value of 64 bits or less. I've opened an issue to track this here:

https://github.com/blevesearch/bleve/issues/106

Anyway, my goal is to achieve roughly the following query:

("query term" && ("FeedId:1" || "FeedId:5" ...))

A query of this form should work, but I'm not sure why you've chosen a PhraseQuery for the second part. Are your actual values for FeedId strings like "FeedId:1" or are they numbers formatted as strings, like "1"?

In either case, I still don't think it will work as you then used NewNumericFieldMapping() which was only intended to work with numbers. This is possibly a bug, but I'll need to think more about how that could be used.

Based on what you've described so far it seems like you only want to do exact matches on FeedIds, so leaving it as a string and indexing it with the keyword analyzer will probably give the best performance. Indexing it as a numeric value would give you the ability to do range matches, but numeric values also take up a lot more space in the index, so it isn't worth it if you don't need ranges.

For the inner field queries, I build a disjunction query that contains a bunch of phrase queries. I combine that disjunction query with the actual search term into a conjunction query, and pass that to the request. The code can be seen here: https://github.com/urandom/readeef/blob/851615ec0d8b03990fcae8a5c324ab562390c1ca/search_index.go#L211

Generally when debugging these more complex queries it's helpful to first verify that all the individual queries work as expected. From just reading this code it looks like the phrase queries on feed ids aren't likely to work right now.

The thing I would try next is to change the mapping to index the FeedId field as text, with the keyword analyzer. Then change the queries inside the disjunction to be TermQueries for the specific FeedId values you want to match.
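To make the intended query shape concrete, here is a minimal sketch in plain Go of a conjunction that wraps a text match and a disjunction of exact term matches. It is a toy in-memory model, not bleve's API: the `Article` type, the `Query` function type, and helpers such as `FeedTerm` and `TitleWord` are hypothetical names invented for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Article is a hypothetical document; FeedID is kept as a string so it can be
// matched exactly, the way a keyword-analyzed field would be.
type Article struct {
	ID     string
	Title  string
	FeedID string
}

// Query reports whether an article matches.
type Query func(Article) bool

// FeedTerm matches the FeedID field exactly, with no analysis of the input.
func FeedTerm(id string) Query {
	return func(a Article) bool { return a.FeedID == id }
}

// TitleWord is a crude stand-in for the user's text query: it matches when
// the lower-cased title contains the lower-cased word as a whole token.
func TitleWord(word string) Query {
	w := strings.ToLower(word)
	return func(a Article) bool {
		for _, tok := range strings.Fields(strings.ToLower(a.Title)) {
			if tok == w {
				return true
			}
		}
		return false
	}
}

// Disjunction matches when at least one sub-query matches.
func Disjunction(qs ...Query) Query {
	return func(a Article) bool {
		for _, q := range qs {
			if q(a) {
				return true
			}
		}
		return false
	}
}

// Conjunction matches only when every sub-query matches.
func Conjunction(qs ...Query) Query {
	return func(a Article) bool {
		for _, q := range qs {
			if !q(a) {
				return false
			}
		}
		return true
	}
}

// Search returns the IDs of matching articles in input order.
func Search(docs []Article, q Query) []string {
	var ids []string
	for _, d := range docs {
		if q(d) {
			ids = append(ids, d.ID)
		}
	}
	return ids
}

// sampleArticles returns made-up data for the example.
func sampleArticles() []Article {
	return []Article{
		{"a1", "Go search library", "1"},
		{"a2", "Search tips", "5"},
		{"a3", "Full text search", "9"},
		{"a4", "Release notes", "1"},
	}
}

func main() {
	// ("search" && (FeedId:1 || FeedId:5))
	q := Conjunction(TitleWord("search"), Disjunction(FeedTerm("1"), FeedTerm("5")))
	fmt.Println(Search(sampleArticles(), q)) // prints [a1 a2]
}
```

In this model, listing every feed id in the disjunction returns the same documents as the text query alone, which is the sanity check described earlier in the thread.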

My problem is that the results don't match the expectations. As can be seen a couple of lines above that code, I ignore this complex query when I don't want to categorize the result in any particular feed id(s). That produces a correct result. If I pass all my feed ids, that should produce the same result, since the disjunction will have all possible variations of the FeedId field. That is not the case, however, as it produces a lot fewer results.

Is it a lot fewer results? Or no results at all? If it's no results at all, that probably means the Phrase queries aren't matching anything. If it's some results, but fewer than expected, then we'll have to dig deeper.

Neither does only passing one feed id (a disjunction with only one phrase query in it) produce the correct result, since a bunch of results are missing, yet are visible when everything is searched.

Please try some of the suggestions above, or clarify if any of my assumptions seem incorrect. Another thing I can do in parallel is try to create a test case for a similar type of query on the beer-search dataset.

marty

Viktor Kojouharov

Oct 7, 2014, 3:30:43 AM

to bl...@googlegroups.com

Thanks for the reply. I'll break things down below:

On Monday, October 6, 2014 8:08:46 PM UTC+3, Marty Schoch wrote:

Anyway, my goal is to achieve roughly the following query:

("query term" && ("FeedId:1" || "FeedId:5" ...))

A query of this form should work, but I'm not sure why you've chosen a PhraseQuery for the second part. Are your actual values for FeedId strings like "FeedId:1" or are they numbers formatted as strings, like "1"?

Because that's honestly the only thing I found in the documentation with the words "field" and "query" in it :). I have no idea how the query string parser handles a query of the sort `FeedId:"1"`.

So yeah, I'd appreciate some enlightenment on that part. Below you mention TermQueries. Am I supposed to construct a Term query with a term `FeedId:"1"`? Btw, the number seems to need quoting, or an error occurs.

In either case, I still don't think it will work as you then used NewNumericFieldMapping() which was only intended to work with numbers. This is possibly a bug, but I'll need to think more about how that could be used.

That was just for experimentation. To be honest, I have no idea what these field mappings really do. I'll remove it.

Based on what you've described so far it seems like you only want to do exact matches on FeedIds, so leaving it as a string, and indexing it with the keyword analyzer will probably give the best performance. Indexing it as a numeric value would give you the ability to do range matches, but if you don't need those, they also take up a lot more space in the index.

I'll remove the field mapping and leave a default index mapping.

My problem is that the results don't match the expectations. As can be seen a couple of lines above that code, I ignore this complex query when I don't want to categorize the result in any particular feed id(s). That produces a correct result. If I pass all my feed ids, that should produce the same result, since the disjunction will have all possible variations of the FeedId field. That is not the case, however, as it produces a lot fewer results.

Is it a lot fewer results? Or no results at all? If it's no results at all, that probably means the Phrase queries aren't matching anything. If it's some results, but fewer than expected, then we'll have to dig deeper.

In my test, where searching without a feed id returned 4 entries, restricting the search to one or both feeds (the test has 2) returned either fewer entries (but still some), or entries that don't contain the actual search term.

Viktor Kojouharov

Oct 7, 2014, 3:53:38 AM

to bl...@googlegroups.com

Hi again,

I seem to have overlooked a crucial part of the Query API, specifically the SetField method. After removing the numeric mapping, changing to term queries, and setting the field on those, I get what I assume are the correct results now.

Merely searching for a number (say, '33' appears in one of the indexed text fields) still produces the following error though: 'Parse Error - syntax error'

And if I may, I have a few more query questions.

First, I have a few boolean fields in my data struct. Besides turning them into strings before indexing, is there any other way to get them indexed so that I may use them for some special queries?

And a bit related to that, how can I effectively prevent bleve from using the "FeedId", and potentially the boolean fields when searching for the supplied term, yet still use them for restricting the final result when I need to (like in my original problem with the Feed Ids)? Should I move to using multiple TermQuery objects, restricting them to my string fields, and do they handle multiple keyword and quoted text searches like the query string queries?

And finally, are there plans for providing a way to influence the result sorting, e.g.: specify a time.Time field to be used as the primary sorting 'column', and changing the direction?

Marty Schoch

Oct 9, 2014, 2:56:29 PM

to bl...@googlegroups.com

Sorry for the long delay in replying, I've been traveling for the Couchbase Connect conference, but will be getting back to a more normal schedule next week.

On Tue, Oct 7, 2014 at 12:53 AM, Viktor Kojouharov <vkojo...@gmail.com> wrote:

Hi again,

I seem to have overlooked a crucial part of the Query API, specifically the SetField method. After removing the numeric mapping, changing to term query, and setting the field on those, I get what I assume are the correct results now.

Yes, generally the non-compound queries can restrict themselves to a single field. If it isn't specified it will use a default field, typically _all. The default field can be changed in the mapping.

Merely searching for a number (say, '33' appears in one of the indexed text fields) still produces the following error though: 'Parse Error - syntax error'

My guess is that this is a bug introduced when I changed the grammar to better handle range queries. I have opened a bug for this here:

https://github.com/blevesearch/bleve/issues/108

And if I may, I have a few more query questions.

First, I have a few boolean fields in my data struct. Besides turning them into strings before indexing, is there any other way to get them indexed so that I may use them for some special queries?

Right now there is no special indexing of boolean fields, but it's something we should consider for the future. I've opened an issue to track this here:

https://github.com/blevesearch/bleve/issues/109

And a bit related to that, how can I effectively prevent bleve from using the "FeedId", and potentially the boolean fields when searching for the supplied term, yet still use them for restricting the final result when I need to (like in my original problem with the Feed Ids)? Should I move to using multiple TermQuery objects, restricting them to my string fields, and do they handle multiple keyword and quoted text searches like the query string queries?

Based on what you've described so far, I'm assuming the outer query is something a user would input (either directly or indirectly). So it probably makes sense to leave that as a Query String query, since it gives some flexibility. The inner queries, where you are trying to do exact matching on one or more specific terms, make sense as Term queries.

Generally the Term query is for doing an exact match. Not something you do a lot directly when searching, but it is the primitive on which everything else is built. Term Queries do not handle multiple key words or quoted strings. To search for multiple keywords you would use a Disjunction query containing multiple Term queries. Quoted text searches mean different things in different contexts, but if the intention is to search for phrases (occurrences of terms in a sequence) then you should use a Phrase Query.

Also, there are Match and MatchPhrase queries. These work like Term and Phrase queries, but they perform text analysis on the input, as opposed to looking for exactly what was given.

The Query String query is a higher-level construct that tries to parse out what a user means and build the correct lower-level queries.
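The difference between an exact term lookup and an analyzed match can be illustrated with a toy inverted index in plain Go. This is only a sketch and not bleve's implementation: the `analyze` function, the `Index` type and its methods are hypothetical, and treating the analyzed terms of a match as alternatives is an assumption of this example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// analyze is a toy analyzer: lower-case the text and split it on anything
// that is not a letter or a digit.
func analyze(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// Index maps each analyzed term to the set of document IDs containing it.
type Index map[string]map[string]bool

// Add analyzes text and records every resulting term for the document.
func (ix Index) Add(id, text string) {
	for _, term := range analyze(text) {
		if ix[term] == nil {
			ix[term] = make(map[string]bool)
		}
		ix[term][id] = true
	}
}

// TermSearch looks the term up exactly as given, with no analysis.
func (ix Index) TermSearch(term string) []string {
	return sortedIDs(ix[term])
}

// MatchSearch analyzes the input first, then returns the documents that
// contain any of the resulting terms.
func (ix Index) MatchSearch(input string) []string {
	found := make(map[string]bool)
	for _, term := range analyze(input) {
		for id := range ix[term] {
			found[id] = true
		}
	}
	return sortedIDs(found)
}

// sortedIDs turns a set of IDs into a sorted slice for stable output.
func sortedIDs(set map[string]bool) []string {
	ids := make([]string, 0, len(set))
	for id := range set {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

// sampleIndex builds a small index from made-up documents.
func sampleIndex() Index {
	ix := Index{}
	ix.Add("d1", "Bleve Search")
	ix.Add("d2", "search tips")
	return ix
}

func main() {
	ix := sampleIndex()
	fmt.Println(ix.TermSearch("Bleve"))       // prints []: the indexed term is "bleve"
	fmt.Println(ix.TermSearch("bleve"))       // prints [d1]
	fmt.Println(ix.MatchSearch("Bleve Tips")) // prints [d1 d2]
}
```

The exact lookup misses "Bleve" because the index only holds the analyzed form, which is why user-typed input is usually better served by an analyzed query.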

And finally, are there plans for providing a way to influence the result sorting, e.g.: specify a time.Time field to be used as the primary sorting 'column', and changing the direction?

There are a couple of ways to possibly get there. One way would be to introduce a means of sorting results by something other than the score. Right now we have a Collector interface, and the only implementation is one which orders results by scores and keeps the top N. Another way to get this behavior would be to have some custom scoring that lets you influence the score by some other context/data. In this case you still might sort the results by score.

I sense we still haven't quite gotten the right mix of queries to accomplish what you're trying to do. But I think we're close. Don't hesitate to keep asking questions if it's still not working for you.

marty

Viktor Kojouharov

Oct 13, 2014, 4:14:39 AM

to bl...@googlegroups.com

Thanks for the reply

On Thursday, October 9, 2014 9:56:29 PM UTC+3, Marty Schoch wrote:

[snip]

And a bit related to that, how can I effectively prevent bleve from using the "FeedId", and potentially the boolean fields when searching for the supplied term, yet still use them for restricting the final result when I need to (like in my original problem with the Feed Ids)? Should I move to using multiple TermQuery objects, restricting them to my string fields, and do they handle multiple keyword and quoted text searches like the query string queries?

Based on what you've described so far, I'm assuming the outer query was something a user would input (either directly or indirectly). So, that probably makes sense to leave as a Query String query since it gives some flexibility. The inner queries where you are trying to do exact matching on one or more specific terms makes sense as a Term query.

Yes, precisely. Users might even find it useful to restrict the search on only the title or description, therefore the query string query is actually quite useful. I'm just wondering whether any utility fields (like the FeedId one), would be likely to cause a false positive (if the user searches for a number which just happens to be that feed id). If so, a way to make the query string query ignore specific fields would come in handy.

And finally, are there plans for providing a way to influence the result sorting, e.g.: specify a time.Time field to be used as the primary sorting 'column', and changing the direction?

There are a couple of ways to possibly get there. One way would be to introduce a means of sorting results by something other than the score. Right now we have a Collector interface, and the only implementation is one which orders results by scores and keeps the top N. Another way to get this behavior would be to have some custom scoring that lets you influence the score by some other context/data. In this case you still might sort the results by score.

Yes, ideally for my use case, I would probably use the score as the main sort, unless a user selects another method, such as the date. Then the score would preferably be the secondary sort, used to break ties on the first.
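The ordering described here (date as the primary key, score as the tie-breaker, with a selectable direction) could be sketched in plain Go roughly as follows. This is an illustration applied to an already collected result list, not a bleve feature: the `Hit` type and `SortByDate` function are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Hit is a hypothetical search result carrying a score and a publish date.
type Hit struct {
	ID        string
	Score     float64
	Published time.Time
}

// SortByDate orders hits by Published (newest first when newestFirst is true)
// and breaks ties on the date by putting the higher score first.
func SortByDate(hits []Hit, newestFirst bool) {
	sort.SliceStable(hits, func(i, j int) bool {
		a, b := hits[i], hits[j]
		if !a.Published.Equal(b.Published) {
			if newestFirst {
				return a.Published.After(b.Published)
			}
			return a.Published.Before(b.Published)
		}
		return a.Score > b.Score
	})
}

// sampleHits returns made-up results; "a" and "b" share a date.
func sampleHits() []Hit {
	day1 := time.Date(2014, time.October, 5, 0, 0, 0, 0, time.UTC)
	day2 := time.Date(2014, time.October, 6, 0, 0, 0, 0, time.UTC)
	return []Hit{
		{"a", 0.2, day1},
		{"b", 0.9, day1},
		{"c", 0.5, day2},
	}
}

// ids extracts the hit IDs in their current order.
func ids(hits []Hit) []string {
	out := make([]string, len(hits))
	for i, h := range hits {
		out[i] = h.ID
	}
	return out
}

func main() {
	hits := sampleHits()
	SortByDate(hits, true)
	fmt.Println(ids(hits)) // prints [c b a]
	SortByDate(hits, false)
	fmt.Println(ids(hits)) // prints [b a c]
}
```

Sorting after the fact only reorders the hits that were already collected, so it is not a substitute for sorting inside the search engine when results are limited to the top N by score.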

I sense we still haven't quite gotten the right mix of queries to accomplish what you're trying to do. But I think we're close. Don't hesitate to keep asking questions if it's still not working for you.

I have one more question for now. Some of my data is per user (such as whether an article has been read or not), whereas the search index only contains the common stuff (like the title and content). Is it possible to somehow provide a filtering function which gets executed before the final limited result set is returned? In my case, that function would further limit the result set if, for instance, the user wants to search for only unread articles. I could currently do that after the search result has been supplied, but that will result in an unexpected result count (such as only providing 14 hits for the first page, since the other 36 of the 50 hits have been filtered out, yet still having more pages).
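The effect being asked for, filtering before the page limit is applied so that page sizes and totals stay consistent, can be sketched in plain Go as below. It is a toy illustration over a ranked list of IDs, not a bleve hook: `Page`, `sampleUnread` and the sample IDs are hypothetical.

```go
package main

import "fmt"

// Page filters the ranked hits first and only then cuts out the requested
// page (0-based). It also returns how many hits survived the filter, so the
// page count matches what the user can actually see.
func Page(ranked []string, keep func(id string) bool, page, size int) ([]string, int) {
	var kept []string
	for _, id := range ranked {
		if keep(id) {
			kept = append(kept, id)
		}
	}
	if page < 0 || size <= 0 {
		return nil, len(kept)
	}
	start := page * size
	if start > len(kept) {
		start = len(kept)
	}
	end := start + size
	if end > len(kept) {
		end = len(kept)
	}
	return kept[start:end], len(kept)
}

// sampleUnread is made-up per-user state that lives outside the index.
func sampleUnread() map[string]bool {
	return map[string]bool{"a2": true, "a4": true, "a5": true, "a6": true}
}

func main() {
	ranked := []string{"a1", "a2", "a3", "a4", "a5", "a6"}
	unread := sampleUnread()
	isUnread := func(id string) bool { return unread[id] }

	first, total := Page(ranked, isUnread, 0, 2)
	fmt.Println(first, total) // prints [a2 a4] 4

	second, _ := Page(ranked, isUnread, 1, 2)
	fmt.Println(second) // prints [a5 a6]
}
```

Had the filter been applied after taking the first two ranked hits (a1 and a2), the first page would have held a single entry, which is the short-page problem described above.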

Marty Schoch

Oct 13, 2014, 11:14:30 AM

to bl...@googlegroups.com

On Mon, Oct 13, 2014 at 4:14 AM, Viktor Kojouharov <vkojo...@gmail.com> wrote:

Thanks for the reply

On Thursday, October 9, 2014 9:56:29 PM UTC+3, Marty Schoch wrote:
[snip]

And a bit related to that, how can I effectively prevent bleve from using the "FeedId", and potentially the boolean fields when searching for the supplied term, yet still use them for restricting the final result when I need to (like in my original problem with the Feed Ids)? Should I move to using multiple TermQuery objects, restricting them to my string fields, and do they handle multiple keyword and quoted text searches like the query string queries?

Based on what you've described so far, I'm assuming the outer query was something a user would input (either directly or indirectly). So, that probably makes sense to leave as a Query String query since it gives some flexibility. The inner queries where you are trying to do exact matching on one or more specific terms makes sense as a Term query.

Yes, precisely. Users might even find it useful to restrict the search on only the title or description, therefore the query string query is actually quite useful. I'm just wondering whether any utility fields (like the FeedId one), would be likely to cause a false positive (if the user searches for a number which just happens to be that feed id). If so, a way to make the query string query ignore specific fields would come in handy.

If the FeedId is set to be included in the _all field, and the user searches for a term that accidentally matches a FeedId, then yes, you could have some false positives. Given that you said you'd normally be explicitly adding additional query clauses to match the FeedId, and in those cases you'll explicitly set the field to FeedId, I'd recommend not including the FeedId in the _all field. You can do this by setting IncludeInAll to false on the field mapping object.
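Why excluding a field from the composite field removes the false positive can be shown with a small plain-Go model. This is a sketch of the idea only, not bleve's mapping code: the `FieldMapping` struct here is a simplified hypothetical stand-in that models nothing but the IncludeInAll flag, and `BuildAll` is an invented helper.

```go
package main

import (
	"fmt"
	"strings"
)

// FieldMapping is a simplified, hypothetical stand-in for a field mapping;
// only the IncludeInAll flag mentioned in the thread is modeled.
type FieldMapping struct {
	IncludeInAll bool
}

// BuildAll collects the lower-cased tokens of every field whose mapping has
// IncludeInAll set, imitating a composite "_all" field.
func BuildAll(doc map[string]string, mapping map[string]FieldMapping) map[string]bool {
	all := make(map[string]bool)
	for name, value := range doc {
		if !mapping[name].IncludeInAll {
			continue
		}
		for _, tok := range strings.Fields(strings.ToLower(value)) {
			all[tok] = true
		}
	}
	return all
}

func main() {
	doc := map[string]string{"Title": "Release notes", "FeedId": "33"}

	withFeed := map[string]FieldMapping{
		"Title":  {IncludeInAll: true},
		"FeedId": {IncludeInAll: true},
	}
	withoutFeed := map[string]FieldMapping{
		"Title":  {IncludeInAll: true},
		"FeedId": {IncludeInAll: false},
	}

	// A user searching for "33" can only hit this document through its feed id.
	fmt.Println(BuildAll(doc, withFeed)["33"])    // prints true (false positive)
	fmt.Println(BuildAll(doc, withoutFeed)["33"]) // prints false
}
```

The FeedId value stays available for explicit, field-scoped clauses; it simply stops contributing to searches that go through the default composite field.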

And finally, are there plans for providing a way to influence the result sorting, e.g.: specify a time.Time field to be used as the primary sorting 'column', and changing the direction?

There are a couple of ways to possibly get there. One way would be to introduce a means of sorting results by something other than the score. Right now we have a Collector interface, and the only implementation is one which orders results by scores and keeps the top N. Another way to get this behavior would be to have some custom scoring that lets you influence the score by some other context/data. In this case you still might sort the results by score.

Yes, ideally for my use case, I would probably use the score as the main sort, unless a user selects another method, such as the date. Then the score would preferably be the secondary sorting method for equality on the first.

I've opened an issue to track the feature for sorting by document fields instead of the score:

https://github.com/blevesearch/bleve/issues/110

I sense we still haven't quite gotten the right mix of queries to accomplish what you're trying to do. But I think we're close. Don't hesitate to keep asking questions if it's still not working for you.

I have one more question for now. Some of my data is per user (such as whether an article has been read or not), whereas the search index only contains the common stuff (like the title and content). Is it possible to somehow provide a filtering function which gets executed before the final limited result set is returned? In my case, that function would further limit the result set if, for instance, the user wants to search for only unread articles. I could currently do that after the search result has been supplied, but that will result in an unexpected result count (such as only providing 14 hits for the first page, since the other 36 of the 50 hits have been filtered out, yet still having more pages).

It's an interesting idea. We certainly could implement it as you describe, but it seems to me like it could be very expensive (assuming you need the document id along with all its stored fields in this filter function).

If you're just filtering by document ID, have you tried adding a clause to exclude the document IDs that are already read? Currently this would have to be done by using the Boolean Query's MUST NOT clause to match the document IDs that were already read. It occurs to me that we lack a query mechanism for working directly with IDs, so you'd have to index the ID as well, which is kind of awkward.

Opening an issue for this too.

https://github.com/blevesearch/bleve/issues/111

This could work for somewhat long lists of read articles (say, several thousand), but since the list of read articles typically grows without bound, this may not scale.

Another option, depending on how you track who read what, would be to maintain, inside each article document, a list of the users who have read it. Bleve does have some support for working with arrays, so you could then search on that field to only include documents which do not have an exact term match (user name) in the read_by_users field.

Obviously this too has some scalability issues as the number of users reading an article gets long.

Assuming both the list of articles read by a user and the list of users who have read an article become long over time, we'll need to index this information as a separate (but related and queryable) document. This will require more thought, but will probably look something like using a parent_id field in Elasticsearch.

marty

Viktor Kojouharov

Oct 14, 2014, 3:50:11 AM

to bl...@googlegroups.com

On Monday, October 13, 2014 6:14:30 PM UTC+3, Marty Schoch wrote:

On Mon, Oct 13, 2014 at 4:14 AM, Viktor Kojouharov <vkojo...@gmail.com> wrote:
Thanks for the reply

On Thursday, October 9, 2014 9:56:29 PM UTC+3, Marty Schoch wrote:
[snip]

And a bit related to that, how can I effectively prevent bleve from using the "FeedId", and potentially the boolean fields when searching for the supplied term, yet still use them for restricting the final result when I need to (like in my original problem with the Feed Ids)? Should I move to using multiple TermQuery objects, restricting them to my string fields, and do they handle multiple keyword and quoted text searches like the query string queries?

Based on what you've described so far, I'm assuming the outer query was something a user would input (either directly or indirectly). So, that probably makes sense to leave as a Query String query since it gives some flexibility. The inner queries where you are trying to do exact matching on one or more specific terms makes sense as a Term query.

Yes, precisely. Users might even find it useful to restrict the search on only the title or description, therefore the query string query is actually quite useful. I'm just wondering whether any utility fields (like the FeedId one), would be likely to cause a false positive (if the user searches for a number which just happens to be that feed id). If so, a way to make the query string query ignore specific fields would come in handy.

If the FeedId is set to be included in the _all field, and the user searches for a term that accidentally matches a FeedId, then yes you could have some false positives. Given that you said you'd normally be explicitly adding additional query clauses to match the FeedId, and in those cases you'll explicitly set the field to FeedId, I'd recommend not including the FeedId in the _all field. You can do this by setting IncludeInAll to false on the field mapping object.

That's interesting. I'll need to look at the docs again. Thanks

I have one more question for now. Some of my data is per user (such as whether an article has been read or not), whereas the search index only contains the common stuff (like the title and content). Is it possible to somehow provide a filtering function which gets executed before the final limited result set is returned? In my case, that function would further limit the result set if, for instance, the user wants to search for only unread articles. I could currently do that after the search result has been supplied, but that will result in an unexpected result count (such as only providing 14 hits for the first page, since the other 36 of the 50 hits have been filtered out, yet still having more pages).

It's an interesting idea. We certainly could implement it as you describe, but it seems to me like it could be very expensive (assuming you need the document id along with all its stored fields in this filter function).

In my case, the document ids will suffice, since they are equivalent to the article ids. And I can prepare a list of all article ids that are unread (much smaller than the read list) beforehand.

If you're just filtering by document ID, have you tried adding a clause to exclude the document IDs that are already read? Currently this would have to be done by using the Boolean Query's MUST NOT clause to match the document IDs that were already read. It occurs to me that we lack a query mechanism for working directly with IDs, so you'd have to index the ID as well, which is kind of awkward.

Opening an issue for this too.

https://github.com/blevesearch/bleve/issues/111

This could work for somewhat long (say several thousand) lists of articles read, but since the list of articles read typically grows without bound this may not work.

Yes, once this issue is implemented, I'll try matching against unread ids. It should be fast, since such a list is generally below a thousand :)
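The two ID-based approaches discussed in this thread, excluding the read IDs (must-not style) and matching only the unread IDs, select the same hits when every article is in exactly one of the two lists. A minimal plain-Go sketch under that assumption follows; `ExcludeRead`, `OnlyUnread` and the sample data are hypothetical and do not use bleve's query types.

```go
package main

import "fmt"

// ExcludeRead keeps the hits that are NOT in the read set, the way a
// must-not clause drops documents that match it.
func ExcludeRead(hits []string, read map[string]bool) []string {
	var out []string
	for _, id := range hits {
		if !read[id] {
			out = append(out, id)
		}
	}
	return out
}

// OnlyUnread keeps the hits that ARE in the unread set, the way a required
// clause over the (usually much shorter) unread list would.
func OnlyUnread(hits []string, unread map[string]bool) []string {
	var out []string
	for _, id := range hits {
		if unread[id] {
			out = append(out, id)
		}
	}
	return out
}

// sampleState returns made-up per-user read and unread sets.
func sampleState() (read, unread map[string]bool) {
	read = map[string]bool{"a1": true, "a3": true, "a5": true}
	unread = map[string]bool{"a2": true, "a4": true}
	return read, unread
}

func main() {
	hits := []string{"a1", "a2", "a3", "a4", "a5"}
	read, unread := sampleState()

	fmt.Println(ExcludeRead(hits, read))  // prints [a2 a4]
	fmt.Println(OnlyUnread(hits, unread)) // prints [a2 a4]
}
```

The results agree, but the size of the ID list passed along differs: the read list keeps growing, while the unread list is expected to stay small.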
