New to Bleve, question about searching


Douglas Fils

Oct 24, 2015, 11:21:46 PM
to bleve
Hello,
  I am new to Bleve. I am interested since I am diving into Go and I have a large collection of JSON documents. It looks like this might be a useful tool to search and rank these documents. However, I have some questions:

  • Is there a way to do substring searches in Bleve? (Or other types of include, exclude, etc.)
  • I think I understand the comparison to Lucene and Solr, where Bleve is more like Lucene.  Is there a "solr" in the Bleve sphere yet?  I guess Bleve explorer is something like this just as a demo.
Thanks
Doug


Marty Schoch

Oct 25, 2015, 4:32:35 PM
to bl...@googlegroups.com
On Sat, Oct 24, 2015 at 11:18 PM, Douglas Fils <drf...@gmail.com> wrote:
Hello,
  I am new to Bleve. I am interested since I am diving into Go and I have a large collection of JSON documents. It looks like this might be a useful tool to search and rank these documents. However, I have some questions:

Welcome!
 
  • Is there a way to do substring searches in Bleve? (Or other types of include, exclude, etc.)
Can you elaborate a bit more on the kind of search you have in mind?  The term 'substring search' could be taken to mean several different things.
 
  • I think I understand the comparison to Lucene and Solr, where Bleve is more like Lucene.  Is there a "solr" in the Bleve sphere yet?  I guess Bleve explorer is something like this just as a demo.
That is correct, Bleve (like Lucene) is just a library.  Solr/Elasticsearch are servers that offer distributed search services.

Bleve-explorer is just a toy to demonstrate how you can expose search as a service in a simple way.  It doesn't have any of the things you'd expect from a production service, and it doesn't do any sharding/distributed search.

A project to look at is cbft: https://github.com/couchbase/cbft

This is the project where my employer is developing distributed search integrated with Couchbase Server.  You may be thinking, but I don't want something that integrates with Couchbase Server...  That's OK, we've actually built several abstractions in place so that is just one of the ways it can be used.  The part where data comes in from Couchbase is pluggable, that part could just as easily be fitted with a piece that takes data in through a REST API.  The index sharding and query scatter/gather parts can all be reused.  So, this is not an out-of-the-box solution by any means, but if I were working on this, this is certainly where I would begin.

marty

Douglas Fils

Oct 25, 2015, 7:58:47 PM
to bleve
Marty,
  Thanks!  appreciate the welcome.  I have come to really appreciate the open and friendly nature of Go communities.

  Most of my JSON documents are JSON-LD.  They are schema.org/Dataset and Datacatalog for example. We are also making JSON in line with the CSVW approach (http://www.w3.org/2013/csvw/wiki/Main_Page)  using the JSON approaches.   The ability to do ranked searches across keywords and even range searches across numerical facets like geologic age will be great.  Much of what I have seen in the YouTube videos and in the documents looks right along what we would want to do.    

  When I said "substring" I was thinking about some specific examples I can see our scientists making,

like searching "sediment" and wanting "sedimentary" to match (I think you call that stemming)

or using paleo and wanting paleontological

Are there approaches that address typos or things like that in searches? (Like the "did you mean" suggestions a person gets with Google.)

Also, we have a large WordNet-like structure of terms and similarities, so if there was a means to leverage that in Bleve searches it would be great. These would be cases of things like synonyms for concepts, or different words that mean close to (though perhaps not exactly) the same thing.

I will go look at CBFT now..   thanks for the reference!

Doug

Marty Schoch

Oct 26, 2015, 11:45:16 AM
to bl...@googlegroups.com
On Sun, Oct 25, 2015 at 7:58 PM, Douglas Fils <drf...@gmail.com> wrote:
Marty,
  Thanks!  appreciate the welcome.  I have come to really appreciate the open and friendly nature of Go communities.

  Most of my JSON documents are JSON-LD.  They are schema.org/Dataset and Datacatalog for example. We are also making JSON in line with the CSVW approach (http://www.w3.org/2013/csvw/wiki/Main_Page)  using the JSON approaches.   The ability to do ranked searches across keywords and even range searches across numerical facets like geologic age will be great.  Much of what I have seen in the YouTube videos and in the documents looks right along what we would want to do.    

  When I said "substring" I was thinking about some specific examples I can see our scientists making,

like searching "sediment" and wanting "sedimentary" to match (I think you call that stemming)

or using paleo and wanting paleontological

Ah, so yes, Bleve supports a text analysis pipeline. Stemming is one of many transformations you can apply at index and search time to handle cases like this.
 
Are there approaches that address typos or things like that in searches? (Like the "did you mean" suggestions a person gets with Google.)

We haven't done much in this area yet.  Some of the same techniques/data-structures used for auto-complete/suggestions can be used for this, but they're not implemented either (lots of interest though).

One thing you could do today: if a search returns 0 results, run another search, this time using a fuzzy query for the same terms. This is more expensive, but might yield some helpful results to further guide the search.
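
The zero-results fallback described above can be sketched roughly as follows. This is only an illustration of the control flow using the Go standard library: `searchFunc` and `searchWithFallback` are hypothetical names, and the two stub functions stand in for whatever exact and fuzzy searches you would actually run against the index.

```go
package main

import "fmt"

// searchFunc is a hypothetical stand-in for a function that runs one kind of
// query against an index and returns the IDs of the matching documents.
type searchFunc func(text string) ([]string, error)

// searchWithFallback runs the exact search first and only pays for the more
// expensive fuzzy search when the exact search finds nothing.
// The boolean result reports whether the fuzzy fallback was used.
func searchWithFallback(text string, exact, fuzzy searchFunc) ([]string, bool, error) {
	hits, err := exact(text)
	if err != nil {
		return nil, false, err
	}
	if len(hits) > 0 {
		return hits, false, nil
	}
	hits, err = fuzzy(text)
	if err != nil {
		return nil, true, err
	}
	return hits, true, nil
}

func main() {
	// Stub searches over a single made-up document, for illustration only.
	exact := func(text string) ([]string, error) {
		if text == "sediment" {
			return []string{"doc1"}, nil
		}
		return nil, nil
	}
	fuzzy := func(text string) ([]string, error) {
		return []string{"doc1"}, nil
	}

	hits, usedFuzzy, err := searchWithFallback("sedimant", exact, fuzzy)
	if err != nil {
		fmt.Println("search failed:", err)
		return
	}
	fmt.Println(hits, usedFuzzy)
}
```

Returning the flag lets the caller label fuzzy hits as suggestions rather than as direct matches.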

 
Also, we have a large wordnet like structure of terms and similarities..   so if there was a means to leverage that in Bleve searches it would be great.  These would be cases of things like synonyms for concepts or different words that mean close (though perhaps not exactly) the same thing.

We have an open issue to implement a synonym token filter.  This would allow you to either expand or contract synonyms depending on what you're trying to do.


marty

Douglas Fils

Mar 3, 2016, 2:13:30 PM
to bleve
Marty,
  So 4 months later and things have moved along with the effort I am involved with.   I'm ready now to revisit this.  I just used 

mapping := bleve.NewIndexMapping()
index, err := bleve.New("csvw.bleve", mapping)
and then a 
err = index.Index(item.URL, item)

Basically right off the example on blevesearch.com. I indexed about 18K documents as a representation of the full document set.

I can do a simple search but I have a question with fuzzy search

I have
query := bleve.NewMatchQuery("JanusXrfSample")
search := bleve.NewSearchRequest(query)
searchResults, err := index.Search(search)
if err != nil {
	fmt.Printf("this is error %v \n", err)
}
fmt.Printf("Results %v\n\n", searchResults)

// fuzzy search
fuzzyq := bleve.NewFuzzyQuery("JanusXrfSample")
fsearch := bleve.NewSearchRequest(fuzzyq)
fsearchResults, err := index.Search(fsearch)
if err != nil {
	fmt.Printf("this is error %v \n", err)
}
fmt.Printf("Fuzzy Results %v\n\n", fsearchResults)


The Match query works fine locating the exact matches (48 of them).  However, I get 0 results from the fuzzy search.  Why would that be?  If the fuzzy search is looking for things closely spelled to the search term wouldn't an exact spelling also work?

I'm interested in people being able to give me something like a set of keywords, such as "Janus XRF", and then being able to get to this term. Also things like "Thermal Con" being able to match up to "Thermal Conductivity".

Questions...
1) Am I creating the index correctly or are there options I should be doing there to build a better index?

2) Is there some guide or example about the various queries listed at http://www.blevesearch.com/docs/Query/ and examples of how they work in connecting terms to indexes?

Thanks much for any time and guidance, I'm looking forward to diving into Bleve more now.


Take care
Doug

Marty Schoch

Mar 4, 2016, 7:28:32 AM
to bl...@googlegroups.com
See responses inline...

On Thu, Mar 3, 2016 at 2:13 PM, Douglas Fils <drf...@gmail.com> wrote:

The Match query works fine locating the exact matches (48 of them).  However, I get 0 results from the fuzzy search.  Why would that be?  If the fuzzy search is looking for things closely spelled to the search term wouldn't an exact spelling also work?

With what you've shown here I would expect the fuzzy search to also match. Is it possible for you to create a smaller test case that reproduces the problem? Maybe indexing just a single document that exhibits the problem. If not, is it possible to provide me with the index that shows this behavior with the code above?
 

I'm interested in people being able to give me something like a set of keywords, such as "Janus XRF", and then being able to get to this term. Also things like "Thermal Con" being able to match up to "Thermal Conductivity".

It's important to understand that underneath the hood (after analysis and fuzzy term generation) it's still an exact term match. This means that 'JanusXrfSample' is never going to match 'Janus XRF', because the first one is seen as 1 term and the second one is seen as 2 terms. Further, both of those 2 terms are quite far away from the first term as measured by edit distance (used for fuzzy matching). Similarly, 'Thermal' would be an exact match in your example, but getting 'Con' to match 'Conductivity' just using fuzzy matching is also problematic.
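
To put rough numbers on "quite far away", here is a minimal Go sketch of Levenshtein edit distance using only the standard library. It assumes the analyzer lowercases terms and splits on whitespace, which is an assumption made for illustration; the `levenshtein` and `min3` helpers are written for this example and are not Bleve's implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

// levenshtein returns the number of single-character insertions, deletions,
// and substitutions needed to turn a into b.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // turning an empty prefix of a into b[:j] takes j insertions
	}
	for i := 1; i <= len(ra); i++ {
		curr := make([]int, len(rb)+1)
		curr[0] = i // turning a[:i] into an empty string takes i deletions
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev = curr
	}
	return prev[len(rb)]
}

func main() {
	indexed := "janusxrfsample" // assumed analyzed form of "JanusXrfSample"
	for _, term := range strings.Fields(strings.ToLower("Janus XRF")) {
		fmt.Printf("%s vs %s: %d edits\n", term, indexed, levenshtein(term, indexed))
	}
	fmt.Printf("con vs conductivity: %d edits\n", levenshtein("con", "conductivity"))
}
```

With these inputs the distances are 9, 11, and 9, far beyond the one or two edits a fuzzy query is typically asked to tolerate.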

I don't know of any good approach for the case where you get 'JanusXrfSample' without any whitespace.
The 'Con' for 'Conductivity' case could be handled a couple of ways.

1.  If people in your domain frequently abbreviate "Conductivity" as "Con" you could index it as a synonym.  (not currently supported in bleve)
2.  You could index ngrams of terms.  This allows for partial matching of terms, but it gets very expensive to index enough ngrams of 'conductivity' to match 'con'.
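
To give a feel for the cost of the ngram option, here is a small standard-library Go sketch that generates character ngrams for one term. The `ngrams` helper and the 3 to 5 length range are assumptions chosen for illustration, not settings taken from Bleve.

```go
package main

import "fmt"

// ngrams returns every substring of term whose length in runes is between
// minLen and maxLen (minLen should be at least 1), shortest lengths first.
func ngrams(term string, minLen, maxLen int) []string {
	runes := []rune(term)
	var grams []string
	for n := minLen; n <= maxLen && n <= len(runes); n++ {
		for start := 0; start+n <= len(runes); start++ {
			grams = append(grams, string(runes[start:start+n]))
		}
	}
	return grams
}

func main() {
	grams := ngrams("conductivity", 3, 5)
	fmt.Println(len(grams), "terms indexed instead of 1")
	fmt.Println(grams)
}
```

Even this narrow range turns one 12-letter term into 27 indexed terms, and that multiplier applies to every term in every document.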
 

Questions...
1) Am I creating the index correctly or are there options I should be doing there to build a better index?

There is no simple answer to this question.  Full-text indexes are not really right or wrong, they just produce better or worse results for the queries you care about.
 

2) Is there some guide or example about the various queries listed at http://www.blevesearch.com/docs/Query/ and examples of how they work in connecting terms to indexes?

I'm not aware of anything, but most of the queries we support are straightforward.
TermQuery - exact match on term
MatchQuery - input text is first analyzed into terms, then term query is run on the generated terms
Phrase - term query on the elements of the phrase, must also occur at exact position offsets matching the phrase
Phrase Match - input text is first analyzed, then phrase query is run on the generated terms
Fuzzy - fuzzy variations are generated from the input term(s) based on the provided edit distance, then term query is run on each
Regex/Wildcard - dictionary is scanned to find terms matching pattern, term query is run on each
Conjunction/Disjunction/Boolean - apply boolean logic to the queries above
Query String - parser builds boolean query from the input string per parsing rules
Date/Numeric Range Query - number/date is converted into encoded 64-bit float terms (at multiple precisions), then term queries are run for the smallest set of ranges

As you can see, each query we support ends up being a set of term queries combined in certain ways.  If I missed one, or you have more questions about any of them let me know.
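
The difference between a term query and a match query in the list above can be illustrated with a toy inverted index. This is a standard-library Go sketch, not Bleve code: the `analyze` function (lowercase, split on whitespace) is a simplified stand-in for a real analyzer, and the `index` type and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// analyze is a toy analyzer: it lowercases the text and splits on whitespace.
func analyze(text string) []string {
	return strings.Fields(strings.ToLower(text))
}

// index maps each analyzed term to the set of document IDs that contain it.
type index map[string]map[string]bool

// add analyzes text and records docID under every resulting term.
func (idx index) add(docID, text string) {
	for _, term := range analyze(text) {
		if idx[term] == nil {
			idx[term] = map[string]bool{}
		}
		idx[term][docID] = true
	}
}

// termQuery is an exact lookup of one term; the input is not analyzed.
func (idx index) termQuery(term string) []string {
	ids := make([]string, 0, len(idx[term]))
	for id := range idx[term] {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	return ids
}

// matchQuery analyzes the input first, then runs a term query for each
// resulting term and returns the union of the hits.
func (idx index) matchQuery(text string) []string {
	seen := map[string]bool{}
	var ids []string
	for _, term := range analyze(text) {
		for _, id := range idx.termQuery(term) {
			if !seen[id] {
				seen[id] = true
				ids = append(ids, id)
			}
		}
	}
	sort.Strings(ids)
	return ids
}

func main() {
	idx := index{}
	idx.add("doc1", "JanusXrfSample thermal conductivity")
	idx.add("doc2", "Thermal gradient")

	fmt.Println(idx.termQuery("Thermal"))  // [] because the indexed term is "thermal"
	fmt.Println(idx.matchQuery("Thermal")) // [doc1 doc2]
}
```

In this sketch, any query that skips analysis has to be given text in the same form as the indexed terms, which is worth keeping in mind when a query unexpectedly returns nothing.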

marty

Douglas Fils

Mar 4, 2016, 7:46:15 AM
to bleve
Marty,
  Thanks. I think I can get you the code and the index. I will look for a place to put them that is convenient for you to get them from. The index is big, about 1GB, so I can't just toss it up easily somewhere. I will see how well it bzips.

  I sat down with the tablet and reviewed the Bleve docs in more detail last night, prior to your response. That, combined with what you wrote, is starting to lift the fog. I see now what you are telling me, and I think I have simple solutions to this. In parallel, we are developing vocabularies in SKOS and also simple WordNet-style approaches in this effort. I can place the common abbreviations in those (like tcon, thermal conductivity, thermal con, etc.) and then put each of them in the keyword section of the schema.org (JSON-LD) document (and CSVW) I am indexing into Bleve. Then the index will contain all the names and alternative names I know of, and the fuzzy query likely only needs to resolve the simple typo or close-spelling issue, which is what it is designed for. There is always the outlier, but I can deal with those and perhaps work them into the index as I discover them.
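
The keyword expansion described above could look roughly like this before a document is indexed. It is a standard-library Go sketch under stated assumptions: `expandKeywords` and the `altLabels` map are hypothetical, and the map stands in for whatever lookup the SKOS vocabulary ends up providing.

```go
package main

import "fmt"

// expandKeywords returns each keyword followed by any alternate labels listed
// for it in altLabels, skipping duplicates and preserving order.
func expandKeywords(keywords []string, altLabels map[string][]string) []string {
	seen := map[string]bool{}
	var out []string
	add := func(k string) {
		if !seen[k] {
			seen[k] = true
			out = append(out, k)
		}
	}
	for _, k := range keywords {
		add(k)
		for _, alt := range altLabels[k] {
			add(alt)
		}
	}
	return out
}

func main() {
	// Hypothetical vocabulary entry: preferred label mapped to its abbreviations.
	altLabels := map[string][]string{
		"thermal conductivity": {"tcon", "thermal con"},
	}
	fmt.Println(expandKeywords([]string{"thermal conductivity"}, altLabels))
}
```

The expanded list would then go into the document's keyword field, so the alternative names become ordinary indexed terms.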

 I think I am starting to understand this all better now and looking forward to what I can do.  I'll see about getting some of this better info into the index and doing some tests.

I really appreciate your time as I try and wrap my head around all this.  I'm sure I will have more 101 questions. 

Take care & thanks
Doug