Searching WITH and without stopwords

Peter Bengtsson

Sep 26, 2016, 9:40:48 AM
to bleve
I'm indexing things like this:

englishTextFieldMapping := bleve.NewTextFieldMapping()
englishTextFieldMapping.Analyzer = en.AnalyzerName

songMapping := bleve.NewDocumentMapping()
songMapping.AddFieldMappingsAt("Text", englishTextFieldMapping)

indexMapping := bleve.NewIndexMapping()
indexMapping.AddDocumentMapping("song", songMapping)
indexMapping.DefaultAnalyzer = "en"

That makes it possible to search the "songs" by the "Text" field with the default English analyzer. So it does English stemming and English stop word removal. 
However, what I want to achieve is searching WITH stop words, but still WITH stemming. In fact, searching without stop words is going to be my "plan B".

The stop words *might* matter. If someone types "what I'm doing is testing searching" I really want to query the database for "what I'm do is test search" (doing=>do and testing=>test and searching=>search when stemmed)
But I want to keep the "what I'm" and the "is test" and "test search" etc. 

Also, what I want to accomplish is to try this first and if it yields too few results, I'll try again but this time with stop words removed. 
The latter I can achieve myself by simply rewriting the query, manually removing the stop words so that the query becomes "test" (perhaps not a great example since there's so little left).
This way I can find results more accurately. 
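
A minimal sketch of that manual rewrite, assuming a small hypothetical stop list (the real English list is longer) and plain whitespace splitting. It only drops stop words and does no stemming:

```go
package main

import (
	"fmt"
	"strings"
)

// stopWords is a tiny, hypothetical stop list for illustration only;
// it is not the list used by bleve's "en" analyzer.
var stopWords = map[string]bool{
	"what": true, "i'm": true, "is": true, "the": true, "a": true,
}

// removeStopWords drops every whitespace-separated word whose
// lowercased form appears in stopWords and rejoins the rest.
func removeStopWords(query string) string {
	var kept []string
	for _, word := range strings.Fields(query) {
		if !stopWords[strings.ToLower(word)] {
			kept = append(kept, word)
		}
	}
	return strings.Join(kept, " ")
}

func main() {
	fmt.Println(removeStopWords("what I'm doing is testing searching"))
	// prints "doing testing searching"
}
```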

I poked around in http://analysis.blevesearch.com/analysis trying to build my custom mapping but I quickly got lost. 

In PostgreSQL, where I'm migrating this app from, I'm doing something like this:

1)
SELECT * FROM songs WHERE to_tsvector('english_nostop', text) @@ plainto_tsquery('english_nostop', 'what I''m doing is testing searching') AND text ILIKE '%what I''m doing is testing searching%'...
2)
SELECT * FROM songs WHERE to_tsvector('english_nostop', text) @@ plainto_tsquery('english_nostop', 'what I''m doing is testing searching')...
3)
SELECT * FROM songs WHERE to_tsvector('english', text) @@ plainto_tsquery('english', 'what I''m doing is testing searching')...

This way I find super exact matches first, then just by stemming, then the same query with stemming and stop word removal. 
(The sorting is custom)
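
The three-step fallback above could be sketched in Go roughly as follows. This is only an illustration under assumptions: `searchFunc`, `tier` and `firstSufficient` are hypothetical names, each tier stands in for one of the queries, and the first tier that returns enough hits wins (merging and custom sorting are left out):

```go
package main

import "fmt"

// searchFunc is a hypothetical stand-in for one search strategy;
// it returns the IDs of the matching documents.
type searchFunc func(query string) []string

// tier pairs a label with a strategy.
type tier struct {
	name   string
	search searchFunc
}

// firstSufficient runs the tiers in order and returns the name and hits
// of the first tier yielding at least minHits results. If no tier
// reaches minHits, the result of the last tier is returned.
func firstSufficient(tiers []tier, query string, minHits int) (string, []string) {
	var name string
	var hits []string
	for _, t := range tiers {
		name, hits = t.name, t.search(query)
		if len(hits) >= minHits {
			break
		}
	}
	return name, hits
}

func main() {
	// Fake strategies with canned results, strictest first.
	tiers := []tier{
		{"exact", func(string) []string { return nil }},
		{"stemmed", func(string) []string { return []string{"s1"} }},
		{"stemmed, no stop words", func(string) []string { return []string{"s1", "s2", "s3"} }},
	}
	name, hits := firstSufficient(tiers, "what I'm doing is testing searching", 2)
	fmt.Println(name, len(hits))
	// prints "stemmed, no stop words 3"
}
```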


Peter

Marty Schoch

Sep 27, 2016, 9:07:43 AM
to bl...@googlegroups.com
Hello,

First, here is an example of a custom analyzer that is just like "en", but it leaves stop words in.

err = indexMapping.AddCustomAnalyzer("enWithStopWords",
    map[string]interface{}{
        "type":      custom_analyzer.Name,
        "tokenizer": unicode.Name,
        "token_filters": []string{
            en.PossessiveName,
            lower_case_filter.Name,
            porter.Name,
        },
    })
if err != nil {
    return nil, err
}

Then to use it, you would change:

englishTextFieldMapping.Analyzer = en.AnalyzerName

to

englishTextFieldMapping.Analyzer = "enWithStopWords"

By doing this, the stop words will be put into the index.  This way at query time you can query both with or without the stop words.  However, most of the queries default to searching for multiple terms with an OR clause anyway, so if a search for "a b c d e" doesn't yield enough results, searching for "a c e" won't produce any more results.  Now, if you are instead joining these in a Conjunction (AND) query, then it will work as you expect.
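
To make the OR versus AND point concrete, here is a toy sketch in plain Go. It does not use bleve; the corpus and the `count` helper are made up for illustration, and documents are treated as simple bags of terms:

```go
package main

import (
	"fmt"
	"strings"
)

// docs is a tiny hypothetical corpus.
var docs = map[string]string{
	"d1": "a b c d e",
	"d2": "a c e",
	"d3": "a x",
	"d4": "y z",
}

// count returns how many documents match the query terms. With all=true
// every term must be present (AND); otherwise one term is enough (OR).
// Query terms are assumed to be distinct.
func count(query string, all bool) int {
	terms := strings.Fields(query)
	n := 0
	for _, text := range docs {
		present := map[string]bool{}
		for _, w := range strings.Fields(text) {
			present[w] = true
		}
		found := 0
		for _, t := range terms {
			if present[t] {
				found++
			}
		}
		if (all && found == len(terms)) || (!all && found > 0) {
			n++
		}
	}
	return n
}

func main() {
	// OR: dropping terms cannot add matches.
	fmt.Println(count("a b c d e", false), count("a c e", false)) // prints "3 3"
	// AND: dropping terms relaxes the query and can add matches.
	fmt.Println(count("a b c d e", true), count("a c e", true)) // prints "1 2"
}
```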

Taking a step back, what I think you actually want is what is known as a Common Terms Query.  We have an open issue to add this in the future:


There is a link inside that issue to an Elasticsearch page describing the behavior in detail.  In short, it is a query which attempts to ignore the highest-frequency terms (which would be candidates for stop words), EXCEPT in the cases where they matter.
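
As a rough illustration of the first step of such a query (this is not bleve code, since the feature does not exist there yet), the terms can be split into rare and common groups by document frequency. The function name, the frequencies and the 0.5 cutoff below are all hypothetical:

```go
package main

import "fmt"

// splitByFrequency partitions terms into rare and common groups.
// A term counts as common when the fraction of documents containing it
// exceeds cutoff. docFreq and totalDocs are hypothetical inputs that a
// real index would supply.
func splitByFrequency(terms []string, docFreq map[string]int, totalDocs int, cutoff float64) (rare, common []string) {
	for _, t := range terms {
		if float64(docFreq[t])/float64(totalDocs) > cutoff {
			common = append(common, t)
		} else {
			rare = append(rare, t)
		}
	}
	return rare, common
}

func main() {
	docFreq := map[string]int{"what": 900, "is": 950, "test": 40, "search": 25}
	rare, common := splitByFrequency([]string{"what", "is", "test", "search"}, docFreq, 1000, 0.5)
	fmt.Println(rare, common)
	// prints "[test search] [what is]"
}
```

The rare terms would then decide which documents match, while the common ones would only come into play where they matter, for example when the query contains nothing else.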

marty



Peter Bengtsson

Sep 28, 2016, 10:08:32 AM
to bleve
That's interesting. So if I do a Match Phrase query for "a b c", it might return documents that only contain "a" and "b" but don't necessarily contain "c".
Does that mean I have to break up the phrase "a b c" into "a", "b", "c" and join them with a ConjunctionQuery?
 
Read about the "common terms query" in ES and got so excited I put a :thumbsup: on the github issue :)

Marty Schoch

Sep 28, 2016, 10:25:33 AM
to bl...@googlegroups.com
Oh sorry I didn't realize you were doing a Match Phrase query.  Any type of Phrase Query is really just a Conjunction (AND) of multiple Term queries, with additional constraints on the relative positions of the terms.

So no, a Match Phrase query on "a b c" will not return documents containing only "a" and "b".  My answer assumed Match, not Match Phrase.

Going back to your original scenario, you should do the following:

1.  Define custom analyzer which will index the stop words.
2.  Configure the field to use this analyzer.
3.  Run match phrase query with phrase "a b c d e".
4.  If there are too few results, re-run the match phrase query with "a b c d e", BUT also set the analyzer on the query to "en".  This accomplishes two things: first, it removes the stop words for you, and second, it adjusts the positions to account for the missing stop words.  So we'd be looking for just "a" AND "c" AND "e", each separated by 1 token.
5.  It's possible that step 4 is still too restrictive because it's still enforcing positional relationships, so you might try just a Match query with those same terms.  In this case setting the analyzer to "en" is less important, since stop words that don't match won't prevent the document from matching.

To do steps 4 and 5, set the Analyzer property on the MatchPhrase and Match queries.  It defaults to the empty string, in which case the analyzer is looked up from the mapping, but by setting it manually you can force a different analyzer on the query input.
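
To illustrate the position adjustment in step 4, here is a toy model in plain Go. It is not bleve's implementation: `analyze`, `phraseMatch` and the stop list (which pretends "b" and "d" are stop words) are hypothetical. The document side keeps stop words, the query side drops them but keeps the original positions:

```go
package main

import (
	"fmt"
	"strings"
)

// token is a term together with its 1-based position in the text.
type token struct {
	term string
	pos  int
}

// stop is a hypothetical stop list: "b" and "d" play the stop words.
var stop = map[string]bool{"b": true, "d": true}

// analyze splits on whitespace. With dropStop, stop words are removed
// but the remaining tokens keep their original positions, leaving gaps.
func analyze(text string, dropStop bool) []token {
	var out []token
	for i, w := range strings.Fields(text) {
		if dropStop && stop[w] {
			continue
		}
		out = append(out, token{w, i + 1})
	}
	return out
}

// phraseMatch reports whether doc contains all query tokens with the
// same relative positions (a conjunction plus positional constraints).
func phraseMatch(doc, query []token) bool {
	if len(query) == 0 {
		return false
	}
	at := map[int]string{}
	for _, t := range doc {
		at[t.pos] = t.term
	}
	for _, start := range doc {
		if start.term != query[0].term {
			continue
		}
		offset := start.pos - query[0].pos
		ok := true
		for _, q := range query[1:] {
			if at[q.pos+offset] != q.term {
				ok = false
				break
			}
		}
		if ok {
			return true
		}
	}
	return false
}

func main() {
	doc := analyze("a x c y e", false) // indexed with stop words kept
	fmt.Println(phraseMatch(doc, analyze("a b c d e", false))) // prints "false"
	fmt.Println(phraseMatch(doc, analyze("a b c d e", true)))  // prints "true"
	// Positions still matter: "a c e" has no gaps between the terms.
	fmt.Println(phraseMatch(analyze("a c e", false), analyze("a b c d e", true))) // prints "false"
}
```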

marty