Hello,
I've run into a problem while trying to use a combination of Conjunction, Disjunction and Phrase queries. My data, besides the text fields, has an integer feed ID, which specifies the parent feed for any particular data item. On an unrelated topic, I found no way to force bleve to index that int64 field, so I made a custom type and turned that field into a string.
Anyway, my goal is to achieve roughly the following query: ("query term" && ("FeedId:1" || "FeedId:5" ...))
For the inner field queries, I build a disjunction query that contains a bunch of phrase queries. I combine that disjunction query with the actual search term into a conjunction query, and pass that to the request. The code can be seen here: https://github.com/urandom/readeef/blob/851615ec0d8b03990fcae8a5c324ab562390c1ca/search_index.go#L211
My problem is that the results don't match the expectations. As can be seen a couple of lines above that code, I skip this complex query when I don't want to restrict the results to any particular feed ID(s). That produces a correct result. If I pass all my feed IDs, that should produce the same result, since the disjunction will have all possible values of the FeedId field. That is not the case, however, as it produces far fewer results.
Neither does only passing one feed id (a disjunction with only one phrase query in it) produce the correct result, since a bunch of results are missing, yet are visible when everything is searched.
Thanks for reporting this. I have further comments in-line below:
On Sun, Oct 5, 2014 at 3:19 PM, Viktor Kojouharov <vkojo...@gmail.com> wrote:
> [snip] On an unrelated topic, I found no way to force bleve to index that int64 field, so I made a custom type and turned that field into a string.
Yes, at this time our support for numeric fields only covers float64. With a little bit of work it could support any numerical value of 64 bits or less. I've opened an issue to track this here:
> Anyway, my goal is to achieve roughly the following query: ("query term" && ("FeedId:1" || "FeedId:5" ...))
A query of this form should work, but I'm not sure why you've chosen a PhraseQuery for the second part. Are your actual values for FeedId strings like "FeedId:1", or are they numbers formatted as strings, like "1"?
In either case, I still don't think it will work as you then used NewNumericFieldMapping() which was only intended to work with numbers. This is possibly a bug, but I'll need to think more about how that could be used.
Based on what you've described so far, it seems like you only want to do exact matches on FeedIds, so leaving it as a string and indexing it with the keyword analyzer will probably give the best performance. Indexing it as a numeric value would give you the ability to do range matches, but if you don't need those, numeric values also take up a lot more space in the index.
> For the inner field queries, I build a disjunction query that contains a bunch of phrase queries. [snip] The code can be seen here: https://github.com/urandom/readeef/blob/851615ec0d8b03990fcae8a5c324ab562390c1ca/search_index.go#L211
Generally, when debugging these more complex queries, it's helpful to first verify that all the individual queries work as expected. From just reading this code, it looks like the phrase queries on feed IDs aren't likely to work right now.
The thing I would try next is to change the mapping to index the FeedId field as text, with the keyword analyzer. Then change the queries inside the disjunction to be TermQueries for the specific FeedId values you want to match.
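The query shape discussed here, a text match combined with a disjunction of exact FeedId matches, can be sketched without bleve at all. The following is a toy in-memory model, not bleve's API: the `Doc` type, `Search`, and the helper names are all hypothetical, and FeedId is held as a string to mimic what a keyword analyzer would index (one unmodified term per value).

```go
package main

import (
	"fmt"
	"strings"
)

// Doc is a hypothetical article. FeedID is a string so that it can be
// compared as one exact, unanalyzed term.
type Doc struct {
	ID     string
	Title  string
	FeedID string
}

// matchesText stands in for the user-facing part of the query: it reports
// whether the lowercased title contains the lowercased term as a whole word.
func matchesText(d Doc, term string) bool {
	for _, w := range strings.Fields(strings.ToLower(d.Title)) {
		if w == strings.ToLower(term) {
			return true
		}
	}
	return false
}

// matchesAnyFeed is the disjunction: true if FeedID equals any wanted value.
func matchesAnyFeed(d Doc, feedIDs []string) bool {
	for _, id := range feedIDs {
		if d.FeedID == id {
			return true
		}
	}
	return false
}

// Search is the conjunction: text match AND (feed1 OR feed2 OR ...).
func Search(docs []Doc, term string, feedIDs []string) []string {
	var hits []string
	for _, d := range docs {
		if matchesText(d, term) && matchesAnyFeed(d, feedIDs) {
			hits = append(hits, d.ID)
		}
	}
	return hits
}

// sampleDocs returns three made-up articles in feeds 1, 5 and 15.
func sampleDocs() []Doc {
	return []Doc{
		{"a", "Go release notes", "1"},
		{"b", "Go search libraries", "5"},
		{"c", "Go tooling", "15"},
	}
}

func main() {
	fmt.Println(Search(sampleDocs(), "go", []string{"1", "5"}))       // feeds 1 and 5 only
	fmt.Println(Search(sampleDocs(), "go", []string{"1", "5", "15"})) // every feed
}
```

Passing every feed ID returns the same hits as an unrestricted text search, which is the sanity check described in the original report. Note that feed "15" is not matched by "1" or "5", because the comparison is on the whole term.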
> My problem is that the results don't match the expectations. [snip] That is not the case, however, as it produces far fewer results.
Is it a lot fewer results? Or no results at all? If it's no results at all, that probably means the Phrase queries aren't matching anything. If it's some results, but fewer than expected, then we'll have to dig deeper.
Hi again,
I seem to have overlooked a crucial part of the Query API, specifically the SetField method. After removing the numeric mapping, changing to term queries, and setting the field on those, I get what I assume are the correct results now.
Merely searching for a number (say, '33' appears in one of the indexed text fields) still produces the following error though: 'Parse Error - syntax error'
And if I may, I have a few more query questions.
First, I have a few boolean fields in my data struct. Besides turning them into strings before indexing, is there any other way to get them indexed so that I may use them for some special queries?
And a bit related to that, how can I effectively prevent bleve from using the "FeedId", and potentially the boolean fields when searching for the supplied term, yet still use them for restricting the final result when I need to (like in my original problem with the Feed Ids)? Should I move to using multiple TermQuery objects, restricting them to my string fields, and do they handle multiple keyword and quoted text searches like the query string queries?
And finally, are there plans for providing a way to influence the result sorting, e.g.: specify a time.Time field to be used as the primary sorting 'column', and changing the direction?
> And a bit related to that, how can I effectively prevent bleve from using the "FeedId", and potentially the boolean fields when searching for the supplied term [snip] and do they handle multiple keyword and quoted text searches like the query string queries?
Based on what you've described so far, I'm assuming the outer query was something a user would input (either directly or indirectly). So, that probably makes sense to leave as a Query String query since it gives some flexibility. The inner queries, where you are trying to do exact matching on one or more specific terms, make sense as Term queries.
Generally the Term query is for doing an exact match. It's not something you do a lot directly when searching, but it is the primitive on which everything else is built. Term queries do not handle multiple keywords or quoted strings. To search for multiple keywords you would use a Disjunction query containing multiple Term queries. Quoted text searches mean different things in different contexts, but if the intention is to search for phrases (occurrences of terms in a sequence) then you should use a Phrase query.
Also, there are Match and MatchPhrase queries. These work like Term and Phrase queries, but they perform text analysis on the input, as opposed to looking for exactly what was given.
The Query String query is a higher-level construct that tries to parse out what a user means and build the correct lower-level queries.
> And finally, are there plans for providing a way to influence the result sorting, e.g.: specify a time.Time field to be used as the primary sorting 'column', and changing the direction?
There are a couple of ways to possibly get there. One way would be to introduce a means of sorting results by something other than the score. Right now we have a Collector interface, and the only implementation is one which orders results by score and keeps the top N. Another way to get this behavior would be to have some custom scoring that lets you influence the score by some other context/data. In this case you still might sort the results by score.
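To make the difference between the query types described in this reply concrete, here is a small stand-alone sketch. It is a toy model rather than bleve code: `analyze` is an assumed stand-in for text analysis (lowercase, split on whitespace), and `termMatch`, `matchQuery` and `phraseMatch` are hypothetical helpers that only imitate the behaviour described for Term, Match and Phrase queries.

```go
package main

import (
	"fmt"
	"strings"
)

// analyze is a stand-in for text analysis: lowercase, then split on whitespace.
func analyze(text string) []string {
	return strings.Fields(strings.ToLower(text))
}

// termMatch looks for the term exactly as given, with no analysis of the input.
func termMatch(indexed []string, term string) bool {
	for _, t := range indexed {
		if t == term {
			return true
		}
	}
	return false
}

// matchQuery analyzes the input first, then succeeds if any resulting term
// is present in the indexed terms.
func matchQuery(indexed []string, input string) bool {
	for _, t := range analyze(input) {
		if termMatch(indexed, t) {
			return true
		}
	}
	return false
}

// phraseMatch requires the given terms to occur consecutively and in order.
func phraseMatch(indexed, phrase []string) bool {
	if len(phrase) == 0 {
		return false
	}
	for i := 0; i+len(phrase) <= len(indexed); i++ {
		ok := true
		for j, p := range phrase {
			if indexed[i+j] != p {
				ok = false
				break
			}
		}
		if ok {
			return true
		}
	}
	return false
}

func main() {
	indexed := analyze("Bleve Query Basics") // the terms held in the index

	fmt.Println(termMatch(indexed, "Query"))                       // false: the index holds "query"
	fmt.Println(matchQuery(indexed, "Query"))                      // true: the input is analyzed first
	fmt.Println(phraseMatch(indexed, []string{"query", "basics"})) // true: adjacent and in order
	fmt.Println(phraseMatch(indexed, []string{"basics", "query"})) // false: wrong order
}
```

The first two lines show why an exact Term lookup suits machine-generated values such as feed IDs, while user-typed text is better served by a query that analyzes its input.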
I sense we still haven't quite gotten the right mix of queries to accomplish what you're trying to do. But I think we're close. Don't hesitate to keep asking questions if it's still not working for you.
marty
Thanks for the reply
On Thursday, October 9, 2014 9:56:29 PM UTC+3, Marty Schoch wrote:
[snip]
> Based on what you've described so far, I'm assuming the outer query was something a user would input (either directly or indirectly). So, that probably makes sense to leave as a Query String query since it gives some flexibility. [snip]
Yes, precisely. Users might even find it useful to restrict the search to only the title or description, so the query string query is actually quite useful. I'm just wondering whether any utility fields (like the FeedId one) would be likely to cause a false positive (if the user searches for a number which just happens to be that feed ID). If so, a way to make the query string query ignore specific fields would come in handy.
> There are a couple of ways to possibly get there. One way would be to introduce a means of sorting results by something other than the score. [snip]
Yes, ideally for my use case, I would probably use the score as the main sort, unless a user selects another method, such as the date. Then the score would preferably be the secondary sorting method, used to break ties when the primary values are equal.
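The ordering asked for here (a user-selected date as the primary key, with the score breaking ties) can be sketched over a plain slice of hits. This is only an illustration using the standard library: the `Hit` type, `SortByDateThenScore` and `day` are hypothetical names, and nothing here reflects how bleve's Collector works.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Hit is a hypothetical search hit carrying a relevance score and a date field.
type Hit struct {
	ID    string
	Score float64
	Date  time.Time
}

// SortByDateThenScore orders hits by date (newest first when descending is
// true) and falls back to the higher score when two dates are equal.
func SortByDateThenScore(hits []Hit, descending bool) {
	sort.SliceStable(hits, func(i, j int) bool {
		a, b := hits[i], hits[j]
		if !a.Date.Equal(b.Date) {
			if descending {
				return a.Date.After(b.Date)
			}
			return a.Date.Before(b.Date)
		}
		return a.Score > b.Score
	})
}

// day builds a date in October 2014 for the sample data.
func day(d int) time.Time {
	return time.Date(2014, time.October, d, 0, 0, 0, 0, time.UTC)
}

func main() {
	hits := []Hit{
		{"a", 0.9, day(5)},
		{"b", 0.4, day(9)},
		{"c", 0.7, day(9)},
	}
	SortByDateThenScore(hits, true)
	for _, h := range hits {
		fmt.Println(h.ID, h.Date.Format("2006-01-02"), h.Score)
	}
}
```

With the sample data, the two hits dated October 9 come first and are ordered between themselves by score, followed by the older hit.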
> I sense we still haven't quite gotten the right mix of queries to accomplish what you're trying to do. [snip] Don't hesitate to keep asking questions if it's still not working for you.
I have one more question for now. Some of my data is per user (such as whether an article has been read or not), whereas the search index only contains the common stuff (like the title and content). Is it possible to somehow provide a filtering function which gets executed before the final limited result set is returned? In my case, that function would further limit the result set if, for instance, the user wants to search for only unread articles. I could currently do that after the search result has been supplied, but that will result in an unexpected result count (such as only providing 14 hits for the first page, since the other 36 of the 50 hits have been filtered out, yet still having more pages).
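The short-page problem described above is easy to reproduce in isolation. The sketch below compares filtering after the page has been cut with filtering while the page is being collected. It is a stand-alone illustration, not a bleve feature: `pageFilterAfter`, `pageFilterBefore`, `sampleHits` and `unreadOnly` are hypothetical names, and the ranked list is assumed to be already ordered by score.

```go
package main

import "fmt"

// pageFilterAfter takes the first `size` hits and only then drops the
// rejected ones, so a page can come back short even though more matches exist.
func pageFilterAfter(ranked []string, size int, keep func(string) bool) []string {
	if size > len(ranked) {
		size = len(ranked)
	}
	var page []string
	for _, id := range ranked[:size] {
		if keep(id) {
			page = append(page, id)
		}
	}
	return page
}

// pageFilterBefore applies the filter while collecting and stops as soon as
// the page is full, so the page is short only when the matches run out.
func pageFilterBefore(ranked []string, size int, keep func(string) bool) []string {
	var page []string
	for _, id := range ranked {
		if len(page) == size {
			break
		}
		if keep(id) {
			page = append(page, id)
		}
	}
	return page
}

// sampleHits returns n made-up document IDs, assumed to be ordered by score.
func sampleHits(n int) []string {
	ids := make([]string, n)
	for i := range ids {
		ids[i] = fmt.Sprintf("doc%d", i)
	}
	return ids
}

// unreadOnly builds a filter from per-user state kept outside the index.
func unreadOnly(read map[string]bool) func(string) bool {
	return func(id string) bool { return !read[id] }
}

func main() {
	ranked := sampleHits(10)
	keep := unreadOnly(map[string]bool{"doc0": true, "doc2": true})

	fmt.Println(pageFilterAfter(ranked, 4, keep))  // short page: only doc1 and doc3 survive
	fmt.Println(pageFilterBefore(ranked, 4, keep)) // full page of four unread hits
}
```

The second variant keeps pages full, at the cost of calling the filter on every candidate hit until the page fills up.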
On Mon, Oct 13, 2014 at 4:14 AM, Viktor Kojouharov <vkojo...@gmail.com> wrote:
> [snip] I'm just wondering whether any utility fields (like the FeedId one) would be likely to cause a false positive (if the user searches for a number which just happens to be that feed ID). If so, a way to make the query string query ignore specific fields would come in handy.
If the FeedId is set to be included in the _all field, and the user searches for a term that accidentally matches a FeedId, then yes, you could have some false positives. Given that you said you'd normally be explicitly adding additional query clauses to match the FeedId, and in those cases you'll explicitly set the field to FeedId, I'd recommend not including the FeedId in the _all field. You can do this by setting IncludeInAll to false on the field mapping object.
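The effect of leaving a field out of the composite _all field can be shown with a toy model. The code below does not use bleve: the `Field` type and its `IncludeInAll` flag merely borrow the name from the reply above, and `searchAll`, `searchField` and `article` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strings"
)

// Field is a hypothetical document field. IncludeInAll imitates the idea of
// the mapping flag mentioned above; it is not bleve's actual type.
type Field struct {
	Name         string
	Value        string
	IncludeInAll bool
}

// allTerms gathers the terms of every field that opts into the composite field.
func allTerms(fields []Field) []string {
	var terms []string
	for _, f := range fields {
		if f.IncludeInAll {
			terms = append(terms, strings.Fields(strings.ToLower(f.Value))...)
		}
	}
	return terms
}

// searchAll reports whether an unqualified term appears in the composite field.
func searchAll(fields []Field, term string) bool {
	for _, t := range allTerms(fields) {
		if t == strings.ToLower(term) {
			return true
		}
	}
	return false
}

// searchField does an exact match against one named field, whether or not
// that field is part of the composite field.
func searchField(fields []Field, name, value string) bool {
	for _, f := range fields {
		if f.Name == name && f.Value == value {
			return true
		}
	}
	return false
}

// article builds a sample document in feed 5; feedInAll controls whether the
// FeedId value is copied into the composite field.
func article(feedInAll bool) []Field {
	return []Field{
		{"Title", "Go tooling tips", true},
		{"FeedId", "5", feedInAll},
	}
}

func main() {
	fmt.Println(searchAll(article(true), "5"))              // true: FeedId leaks into plain searches
	fmt.Println(searchAll(article(false), "5"))             // false: FeedId is kept out of the composite field
	fmt.Println(searchField(article(false), "FeedId", "5")) // true: an explicit field query still works
}
```

In this model the FeedId remains usable for restricting results through a field-specific query, while a user typing "5" no longer hits it by accident.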
> Yes, ideally for my use case, I would probably use the score as the main sort, unless a user selects another method, such as the date. [snip]
I've opened an issue to track the feature for sorting by document fields instead of the score:
> I have one more question for now. [snip] Is it possible to somehow provide a filtering function which gets executed before the final limited result set is returned? [snip]
It's an interesting idea. We certainly could implement it as you describe, but it seems to me like it could be very expensive (assuming you need the document ID along with all its stored fields in this filter function).
If you're just filtering by document ID, have you tried adding a clause to exclude the document IDs that are already read? Currently this would have to be done by using the Boolean Query's MUST NOT clause to match the document IDs that were already read. It occurs to me that we lack a query mechanism for working directly with IDs, so you'd have to index the ID as well, which is kind of awkward.
Opening an issue for this too.
This could work for somewhat long lists of read articles (say, several thousand), but since that list typically grows without bound, this may not work.
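The MUST NOT approach suggested above can be modelled in a few lines. This is a toy sketch over plain document IDs, not bleve's Boolean Query: `BooleanQuery`, `excludeRead` and `matchAll` are hypothetical names, and the candidate list is assumed to come from some earlier search step.

```go
package main

import "fmt"

// BooleanQuery is a toy model of a boolean query over document IDs:
// a hit must satisfy Must and must not appear in MustNot.
type BooleanQuery struct {
	Must    func(id string) bool
	MustNot map[string]bool
}

// Run returns the candidate IDs that pass the query, preserving input order.
func (q BooleanQuery) Run(candidates []string) []string {
	var hits []string
	for _, id := range candidates {
		if q.Must(id) && !q.MustNot[id] {
			hits = append(hits, id)
		}
	}
	return hits
}

// excludeRead builds the MUST NOT set from a user's list of read article IDs.
// The set grows with that list, which is the scaling concern raised above.
func excludeRead(readIDs []string) map[string]bool {
	set := make(map[string]bool, len(readIDs))
	for _, id := range readIDs {
		set[id] = true
	}
	return set
}

// matchAll is a placeholder for the real search clause.
func matchAll(string) bool { return true }

func main() {
	q := BooleanQuery{
		Must:    matchAll,
		MustNot: excludeRead([]string{"doc2", "doc3"}),
	}
	fmt.Println(q.Run([]string{"doc1", "doc2", "doc3", "doc4"})) // read articles are dropped
}
```

Because the exclusion happens inside the query, hit counts and paging stay consistent; the trade-off is that the query carries one excluded ID per read article.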