Scorch index size over time


Andrew Webber

Jun 9, 2018, 2:09:02 PM
to bleve
Given 1000 objects, indexing with or without a batch results in large directory size differences:
- Batch 10.3 MB
- Without Batch 1.6 GB

My concern is that in a long-running process, indexing objects as they arrive dynamically, it is unlikely that large numbers of objects could be indexed in a single batch (excluding re-building the index).

Example below; forgive me if I have made a petty mistake or misunderstanding:

package main

import (
	"fmt"
	"log"
	"os"

	"github.com/blevesearch/bleve"
	"github.com/blevesearch/bleve/analysis/analyzer/keyword"
	"github.com/blevesearch/bleve/index/scorch"
)

type documentIndex struct {
	ID      string `json:"id"`
	Type    string `json:"type"`
	Content string `json:"content"`
}

type Indexer interface {
	Index(id string, data interface{}) error
}

func main() {
	os.RemoveAll("example.scorch")
	writeInBatch := false

	// field mapping: keyword analyzer, nothing stored
	keywordFieldMapping := bleve.NewTextFieldMapping()
	keywordFieldMapping.Store = false
	keywordFieldMapping.IncludeInAll = false
	keywordFieldMapping.IncludeTermVectors = false
	keywordFieldMapping.Analyzer = keyword.Name

	docMapping := bleve.NewDocumentMapping()

	// name
	docMapping.AddFieldMappingsAt("name", keywordFieldMapping)
	// content
	docMapping.AddFieldMappingsAt("content", keywordFieldMapping)

	mapping := bleve.NewIndexMapping()
	mapping.AddDocumentMapping("content", docMapping)

	// mapping.DefaultAnalyzer = "en"
	mapping.TypeField = "type"

	writeTestData := func(indexer Indexer) {
		for i := 0; i < 1000; i++ {
			keyID := fmt.Sprintf("%v", i)
			data := documentIndex{
				ID:      keyID,
				Type:    "content",
				Content: fmt.Sprintf(content, i),
			}

			if err := indexer.Index(keyID, data); err != nil {
				log.Fatal(err)
			}
		}
	}

	// open a new scorch index
	index, err := bleve.NewUsing("example.scorch", mapping, scorch.Name, scorch.Name, nil)
	if err != nil {
		fmt.Println(err)
		return
	}

	if writeInBatch {
		batch := index.NewBatch()
		writeTestData(batch)
		// results in 10.3 MB of data
		err = index.Batch(batch)
		if err != nil {
			log.Fatal(err)
		}
	} else {
		// results in 3.5 GB of data
		writeTestData(index)
	}

	// search for some text
	query := bleve.NewMatchQuery("travelling")
	search := bleve.NewSearchRequest(query)
	search.Highlight = bleve.NewHighlight()
	searchResults, err := index.Search(search)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(searchResults)

	log.Printf("%+v", index.Stats())
}

const content = `Sentiments two occasional affronting solicitude travelling and one contrasted. Fortune day out married parties. Happiness remainder joy but earnestly for off. Took sold add play may none him few. If as increasing contrasted entreaties be. Now summer who day looked our behind moment coming. Pain son rose more park way that. An stairs as be lovers uneasy. Another journey chamber way yet females man. Way extensive and dejection get delivered deficient sincerity gentleman age. Too end instrument possession contrasted motionless. Calling offence six joy feeling. Coming merits and was talent enough far. Sir joy northward sportsmen education. Discovery incommode earnestly no he commanded if. Put still any about manor heard. food%d.`

Marty Schoch

Jun 9, 2018, 2:25:26 PM
to bl...@googlegroups.com
Hi,

Thanks for reporting this.  At Couchbase most of our usage is with batching, so it is possible we've missed something here.  That being said, what I think is happening is that your test code here is exiting as soon as the indexing is complete.  The way bleve/scorch works is that we first try to make the data you index searchable as soon as possible.  Right now we do that at the expense of initial index size.  The data coming in without any batching will take up considerably more space when we initially index it.

That said, there is a background process which merges these tiny segments (with 1 doc each) into larger segments (following a typical logarithmic staircase pattern).  I think what is happening here is that when the program ends, we have approximately 1000 tiny segments still on disk.  If you were to wait longer (simple sleep would work to try and verify this), the merger would continue to do work, combining these smaller segments into larger ones.  These newer larger segments should have less overhead, and the overall size of the index will come down.

There will still be some differences though.  Doing all 1000 in a single batch will result in exactly 1 segment, whereas with 1000 tiny segments, the merge policy will most likely stop before merging everything into one segment.

As a next step, I'd suggest adding a sleep for maybe 10 seconds, to see if you observe a smaller index for the non-batch case.

marty

--
You received this message because you are subscribed to the Google Groups "bleve" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bleve+unsubscribe@googlegroups.com.
To post to this group, send email to bl...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bleve/c0da407c-2cc7-4117-a08f-82fca942bd6b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Andrew Webber

Jun 9, 2018, 2:43:17 PM
to bl...@googlegroups.com
Marty, 

thanks for your lightning-fast and detailed response.

I can indeed confirm that after adding a 30-second sleep, the index directory shrank to 11 MB. Of further interest, the number of files shrank from ~1000 to 2.

many kind regards,

Andrew

