Bleve benchmarks (against Lucene)


bouncyball

Apr 10, 2016, 9:38:46 PM
to bleve
Hi there,
I'm considering using Bleve for a new application and was wondering if anyone has done any benchmarking with Lucene.
Thanks!

Marty Schoch

Apr 10, 2016, 9:46:17 PM
to bl...@googlegroups.com
We have the bleve-bench project, which is a bit of a mess right now.


The intention is to create tests that are roughly comparable to the Lucene nightly benchmarks: http://home.apache.org/~mikemccand/lucenebench/

At the moment we only run an indexing throughput test.  We do run it nightly and plot the results:


We haven't really publicized the output yet because it's still highly variable and the machine we test on is a bit questionable.

So with all that out of the way, I would characterize performance against lucene as:

Indexing is somewhat slower, but recent tests with the "moss" KV layer are very promising.
Index size is larger, and even more so if you choose to "store" fields rather than just index them.  We lag way behind in this area.
If you index fields and don't store them, we're starting to close the gap.

Searching is universally slower than Lucene.  This is an architectural limitation tied to the way we index data.  We'll try to do better in bleve 2.x, but it may be hard to make substantial improvements here.

marty 


bouncyball

Apr 11, 2016, 2:46:02 AM
to bleve
Thanks for the quick reply!
By moss, I assume you're referring to this? https://github.com/couchbase/moss

Can you expand a little bit on the architectural limitations for faster searching?
Thanks.

Marty Schoch

Apr 11, 2016, 8:48:49 AM
to bl...@googlegroups.com
On Mon, Apr 11, 2016 at 2:46 AM, bouncyball <tim....@gmail.com> wrote:
Thanks for the quick reply!
By moss, I assume you're referring to this? https://github.com/couchbase/moss

Yes, moss lets us search data before it's been written to disk.  It fits into our existing KV store API, between bleve and any existing KV store.  Initial testing shows that all KV stores except RocksDB benefit from having it in place.  It needs a lot more testing before we recommend it generally, but we're shipping a product using it at Couchbase.
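
To make the layering idea concrete, here is a minimal sketch in Go. It is only an illustration of an in-memory layer sitting in front of a lower KV store so that recent writes are readable before they are persisted. The `kvStore` interface, `mapStore`, and `bufferedStore` are hypothetical names invented for this example; this is not the moss API or bleve's store interface, and a real implementation would also need ordered iteration and concurrency control.

```go
package main

import "fmt"

// kvStore is a hypothetical minimal interface used only for this sketch.
// It is not bleve's or moss's actual API.
type kvStore interface {
	Get(key string) (string, bool)
	Set(key, val string)
}

// mapStore stands in for a persistent ("on disk") KV store.
type mapStore map[string]string

func (m mapStore) Get(key string) (string, bool) { v, ok := m[key]; return v, ok }
func (m mapStore) Set(key, val string)           { m[key] = val }

// bufferedStore keeps recent writes in memory in front of a lower store.
type bufferedStore struct {
	pending map[string]string // writes not yet pushed to the lower store
	lower   kvStore
}

func newBufferedStore(lower kvStore) *bufferedStore {
	return &bufferedStore{pending: make(map[string]string), lower: lower}
}

// Set only touches memory; nothing is written to the lower store yet.
func (b *bufferedStore) Set(key, val string) { b.pending[key] = val }

// Get checks the in-memory layer first, then falls back to the lower store.
func (b *bufferedStore) Get(key string) (string, bool) {
	if v, ok := b.pending[key]; ok {
		return v, true
	}
	return b.lower.Get(key)
}

// Flush pushes all pending writes down to the lower store.
func (b *bufferedStore) Flush() {
	for k, v := range b.pending {
		b.lower.Set(k, v)
	}
	b.pending = make(map[string]string)
}

func main() {
	disk := mapStore{}
	store := newBufferedStore(disk)

	store.Set("t/search/doc1", "1")
	_, inLayer := store.Get("t/search/doc1")
	_, onDisk := disk.Get("t/search/doc1")
	fmt.Println(inLayer, onDisk) // readable through the layer, not yet persisted

	store.Flush()
	_, onDisk = disk.Get("t/search/doc1")
	fmt.Println(onDisk)
}
```

Because `bufferedStore` satisfies the same `kvStore` interface as the store beneath it, the caller does not need to know whether the layer is present, which is the property described above.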
 
Can you expand a little bit on the architectural limitations for faster searching?

Sure, so all search is built on top of the term search.  A term search requires that we return all documents that use a particular term.

The indexing scheme we chose stores each of these term/document entries in a separate row in the KV store.  We chose this format because putting all that information into a single row would mean that when indexing documents we would have to keep reading/rewriting the same rows over and over.  We preferred to make indexing throughput higher.  Putting each of these in separate rows was the simplest way to accomplish that.
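
As a rough illustration of the trade-off, here is a toy sketch in Go of a one-row-per-(term, document) layout over a sorted key space. The key format `t/<term>/<docID>`, the `rowStore` type, and its methods are assumptions made up for this example; they are not bleve's actual row encoding or store API. Indexing only ever inserts new rows, while a term search has to walk every consecutive row that shares the term's prefix.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// rowStore is a toy stand-in for an ordered KV store. Keys are kept sorted so
// that a prefix scan reads consecutive rows. Values are omitted for brevity.
type rowStore struct {
	keys []string
}

// termRowKey builds a hypothetical key for one (term, document) pair.
// Terms containing "/" are not handled in this sketch.
func termRowKey(term, docID string) string {
	return "t/" + term + "/" + docID
}

// put inserts a key in sorted position. No existing row is read back or
// rewritten, which is what keeps indexing cheap in this layout.
func (s *rowStore) put(key string) {
	i := sort.SearchStrings(s.keys, key)
	if i < len(s.keys) && s.keys[i] == key {
		return // row already present
	}
	s.keys = append(s.keys, "")
	copy(s.keys[i+1:], s.keys[i:])
	s.keys[i] = key
}

// indexDoc writes one row per distinct term in the document.
func (s *rowStore) indexDoc(docID, text string) {
	for _, term := range strings.Fields(strings.ToLower(text)) {
		s.put(termRowKey(term, docID))
	}
}

// termSearch scans the consecutive rows that share the term's prefix and
// returns the matching document IDs. Cost grows with the number of rows.
func (s *rowStore) termSearch(term string) []string {
	prefix := "t/" + term + "/"
	var docs []string
	for i := sort.SearchStrings(s.keys, prefix); i < len(s.keys) && strings.HasPrefix(s.keys[i], prefix); i++ {
		docs = append(docs, strings.TrimPrefix(s.keys[i], prefix))
	}
	return docs
}

func main() {
	s := &rowStore{}
	s.indexDoc("doc1", "fast search in go")
	s.indexDoc("doc2", "search is slow")
	fmt.Println(s.termSearch("search")) // prints [doc1 doc2]
}
```

The alternative, a single row per term holding the whole document list, would make `termSearch` a single read but would force `indexDoc` to read, modify, and rewrite that row for every document containing the term.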

We also were banking on 2 other things.

1.  Many KV stores have optimizations for repeated content in consecutive rows.
2.  We thought reading consecutive rows would be "good enough"
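
For point 1, a small Go sketch of what prefix compression of consecutive keys buys in principle. The key format is the same hypothetical `t/<term>/<docID>` layout, assumed only for illustration, and the accounting ignores the small per-key header a real store needs to record the shared length.

```go
package main

import "fmt"

// sharedPrefixLen returns how many leading bytes a and b have in common.
func sharedPrefixLen(a, b string) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

// keySizes returns the total raw key bytes and the bytes that remain when
// each key drops the prefix it shares with the key immediately before it.
func keySizes(sortedKeys []string) (raw, compressed int) {
	prev := ""
	for _, k := range sortedKeys {
		raw += len(k)
		compressed += len(k) - sharedPrefixLen(prev, k)
		prev = k
	}
	return raw, compressed
}

func main() {
	// Hypothetical consecutive rows for one term across three documents.
	keys := []string{"t/search/doc1", "t/search/doc2", "t/search/doc3"}
	raw, compressed := keySizes(keys)
	fmt.Println(raw, compressed) // prints 39 15
}
```

How much of this saving a given KV store actually delivers depends on its on-disk format, so the numbers here are only an upper bound on the idea, not a measurement.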

It turns out that in practice, we waste significant space repeating so much information (which is why our indexes are larger than Lucene's).  And for large datasets with high-frequency terms, a search might involve scanning a hundred thousand rows, and that isn't fast enough in any KV store.

We have some ideas for different indexing formats that could allow consolidating this information in the background.  This would reduce wasted space and speed up searches.  But, most likely all this will have to wait.  A lot of people have found the current bleve good enough, in spite of these limitations.  So we're going to focus on shipping a 1.0 that largely works like it does today.

marty


 