Suitability for non-changing corpus of large text files?

Joshua

May 25, 2021, 5:42:10 PM
to bleve
Hi there,

I'd like to create a simple tool for indexing directories which contain large text files (or HTML or PDF, which I'd convert first), typically thousands of words each (books/articles).
I'd likely only be doing simple queries for words or phrases; tokenisation would be nice.

I was wondering whether Bleve would perform well here. It sounds like a lot of its work goes into updating documents and the like, and since mine is a fixed set of texts I thought the value proposition might be different.

I suppose there are two questions:
1) Do you see anything wrong with using Bleve to index (by word) hundreds of documents of several thousand words each?
2) Do you think I may be better off writing something myself if I'm not using a large chunk of Bleve's functionality, or is tokenisation/search/etc. a hard problem?

Thanks!

Marty Schoch

May 25, 2021, 5:53:46 PM
to bl...@googlegroups.com
Welcome, see my responses below:

On Tue, May 25, 2021 at 5:42 PM Joshua <joshua.o...@gmail.com> wrote:
Hi there,

I'd like to create a simple tool for indexing directories which contain large text files (or HTML or PDF, which I'd convert first), typically thousands of words each (books/articles).
I'd likely only be doing simple queries for words or phrases; tokenisation would be nice.

I was wondering whether Bleve would perform well here. It sounds like a lot of its work goes into updating documents and the like, and since mine is a fixed set of texts I thought the value proposition might be different.

Much of the complexity in Bleve does come from supporting an index that you can search while you're still indexing, but you can avoid much of that complexity by using Bleve in a slightly different way.  First, I would suggest using the IndexBuilder functionality. It allows you to build a better-optimized index up front, with the trade-off that you cannot search while indexing.  You can read more about it here: https://github.com/blevesearch/bleve/pull/1282

Second, because you won't be indexing documents after the initial build, I would recommend opening the index read-only.  In addition to ensuring you don't accidentally modify the index, it will prevent Bleve from spinning up the maintenance goroutines, so the runtime is significantly less complex.
 
I suppose there are two questions:
1) Do you see anything wrong with using Bleve to index (by word) hundreds of documents of several thousand words each?

I don't see anything wrong with this.  For testing purposes we frequently index a set of 5M Wikipedia articles, so the size shouldn't be a problem (even if the documents are larger).
 
2) Do you think I may be better off writing something myself if I'm not using a large chunk of Bleve's functionality, or is tokenisation/search/etc. a hard problem?

I think there is value in using Bleve, as it should handle all the requirements you listed here, and it will let you focus on the problem you're trying to solve.

That said, I would never discourage someone from learning more about the underlying technologies.  In my opinion they are not that complicated, and you gain an appreciation for the different engineering trade-offs made in the various solutions.

Let us know if you have more questions.

marty