Using Krati in IndexTank

ChrisLamprecht

Apr 23, 2012, 4:39:04 PM
to kr...@googlegroups.com
Hi Jingwei & Krati users,

I'm working on integrating Krati into the IndexTank search engine project for document storage. Currently all documents are stored compressed in RAM. It would be much more efficient to use a datastore such as BDB or Krati to reduce the RAM requirements. I've read the discussions on this message group, and I'm not sure of the best way to configure Krati for these requirements:

* A hosted search service
* Each search index (and Krati store) runs in its own JVM
* Number of documents (key/value pairs) can range from very small (less than 25,000) to 50 million or more
* Key size is always under 1024 bytes, and almost always much smaller (under 64 bytes). We can assume keys will fit in main memory for now.
* Value size can range from around 100 bytes up to 100KB, but is most often in the range 300 - 10000 bytes.
* Read and write frequency can range from almost none to tens per second

The goal is to minimize RAM requirements, while keeping read/write performance at a reasonable level.

So I think my questions are:
- which DataStore to use (IndexedDataStore, DynamicDataStore?)
- which Segment type to use (MappedSegment, ChannelSegment, WriteBufferSegment?)
- what configuration parameters to use (initLevel, segment size, etc) for the DataStore

Any input is appreciated!

-Chris

Jingwei

Apr 23, 2012, 6:58:01 PM
to Krati
Hi Chris,

Interesting. I just had a weekend hack for storing and retrieving json
objects using Krati. Take a look at https://github.com/jingwei/jsonstore

You may find it helpful for setting up a Krati-based web service or
understanding how to configure IndexedDataStore.

Besides this, please see my replies inline.

On Apr 23, 1:39 pm, ChrisLamprecht <clampre...@gmail.com> wrote:
> Hi Jingwei & Krati users,
>
> I'm working on integrating Krati into the IndexTank search engine project
> for document storage.  Currently all documents are stored compressed in
> RAM.  It would be much more efficient to use a datastore such as BDB or
> Krati to reduce the RAM requirements.  I've read the discussions on this
> message group, and I'm not sure on the best way to configure Krati for
> these requirements:
>
> * A hosted search service
> * Each search index (and Krati store) runs in its own JVM
> * Number of documents (key/value pairs) can range from very small (less
> than 25,000) to 50 million or more
> * Key size is always under 1024 bytes, and almost always much smaller
> (under 64 bytes). We can assume keys will fit in main memory for now.
> * Value size can range from around 100 bytes up to 100KB, but is most often
> in the range 300 - 10000 bytes.
> * Read and write frequency can range from almost none to tens per second
>
> The goal is to minimize RAM requirements, while keeping read/write
> performance at a reasonable level.
>
> So I think my questions are:
> - which DataStore to use (IndexedDataStore, DynamicDataStore?)

IndexedDataStore is the better choice. It is much more efficient at
handling updates, and it holds all the keys in main memory.

> - which Segment type to use (MappedSegment, ChannelSegment,
> WriteBufferSegment?)

WriteBufferSegment is a good choice. As its name suggests, it buffers
writes in memory and has good write throughput.

> - what configuration parameters to use (initLevel, segment size, etc) for
> the DataStore

I prefer to use segments of size 64MB or 128MB.

The initLevel is a bit complicated. You can instead use StoreConfig with a
specified initialCapacity, which is much clearer. The initialCapacity
cannot be changed once the underlying store is created, so please choose
this parameter based on the estimated size of your data set.

ChrisLamprecht

Apr 24, 2012, 2:47:14 AM
to kr...@googlegroups.com
Thanks Jingwei.  I did some testing today and had good results with reducing RAM requirements by 30-40% on the two search corpuses I tested with.

I had a question about how to further reduce the initial overhead of Krati.  As soon as Krati is initialized, but before adding any documents, it appears to have an overhead of around 150-180MB.  Here is how I'm initializing Krati:

StoreConfig config = new StoreConfig(cacheDirectory, 2000000);
config.setBatchSize(10000);
config.setNumSyncBatches(5);
config.setSegmentFactory(new WriteBufferSegmentFactory(64));
config.setSegmentFileSizeMB(64);
myStore = new IndexedDataStore(config);


I did a 2nd test where I changed the initialCapacity to 500000, and lowered the batchSize to 2000, and the memory overhead dropped some.  I'm curious what other parameters can lower the overhead (does lowering segment file size lower overhead?).  

thanks,
-Chris

Jingwei

Apr 24, 2012, 6:56:27 PM
to Krati
Hi Chris

The initialCapacity requires initialCapacity * 8 bytes of memory.

The WriteBufferSegmentFactory allocates WriteBufferSegment(s) of size
64MB. Since it uses a memory buffer for append operations, there may be
up to 2-3 WriteBufferSegment(s) at any time, with 64MB of memory for
each. So you are looking at a memory footprint of roughly 128MB - 192MB.

There are several ways to reduce the memory usage.

1. Use another segment factory such as ChannelSegmentFactory (write
throughput decreases as a result)
2. Make segmentFileSizeMB smaller (say 32)

Thanks.

Jingwei
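
The arithmetic in the reply above can be worked through in a few lines of Java. This is only a rough sketch: the class and method names are hypothetical and not part of Krati, the 8 bytes per slot and the 2-3 live write buffers are taken from the reply, and it assumes 1MB = 1024 * 1024 bytes while ignoring JVM overhead and batch buffers.

```java
// Rough estimate of Krati's initial memory overhead when using
// WriteBufferSegmentFactory. Hypothetical helper, not part of Krati.
final class KratiFootprintEstimate {
    // Assumption: segment sizes are expressed in units of 1024 * 1024 bytes.
    static final long BYTES_PER_MB = 1024L * 1024L;

    // Index memory, using the figure from the thread: initialCapacity * 8 bytes.
    static long indexBytes(long initialCapacity) {
        return initialCapacity * 8L;
    }

    // Write-buffer memory: liveBuffers in-memory segments of segmentSizeMB each.
    static long writeBufferBytes(int segmentSizeMB, int liveBuffers) {
        return (long) segmentSizeMB * liveBuffers * BYTES_PER_MB;
    }

    // Total estimate in whole MB (rounded down).
    static long totalMB(long initialCapacity, int segmentSizeMB, int liveBuffers) {
        long bytes = indexBytes(initialCapacity)
                + writeBufferBytes(segmentSizeMB, liveBuffers);
        return bytes / BYTES_PER_MB;
    }

    public static void main(String[] args) {
        long capacity = 2_000_000L; // initialCapacity from the StoreConfig in the thread
        for (int segmentMB : new int[] {64, 32}) {
            System.out.printf("segment=%dMB: about %d-%d MB%n",
                    segmentMB,
                    totalMB(capacity, segmentMB, 2),
                    totalMB(capacity, segmentMB, 3));
        }
    }
}
```

With an initialCapacity of 2,000,000 the index part is 16,000,000 bytes (about 15MB), so the write buffers dominate the estimate, and halving the segment size roughly halves the overhead.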


Jingwei

Apr 24, 2012, 7:50:53 PM
to Krati
Hi Chris

Since there is not much write traffic, ChannelSegmentFactory will suit
your needs better, and it uses less memory.

Thanks.

Jingwei

Jingwei

Apr 26, 2012, 1:01:39 AM
to Krati
Hi Chris,

We released Krati 0.4.5 today. You may want to try this version. BTW,
how is your integration going?

Best,

Jingwei

ChrisLamprecht

Apr 26, 2012, 1:31:51 AM
to kr...@googlegroups.com
Hi Jingwei,

Thanks, I'll update to 0.4.5 (I was using 0.4.3). The testing is going very well. I'm testing with 32MB segments and ChannelSegments. On most search indexes so far, it's reducing memory usage by 35-50%. Obviously the larger the document size, the more Krati helps. I'm now cleaning up my code that integrates it into IndexTank, and I'll be testing it more, and ultimately submitting a pull request to the indextank-engine project to use Krati.

One question. Since I'm essentially using Krati as a disk-based store but with a cache, my main tuning parameter will be how much RAM to allocate to Krati for "caching" (avoiding disk reads). A search index with lots of search traffic might get a larger cache. (At least this is how I think it used to work when IndexTank used BDB.) So which Krati parameter(s) would I use to adjust the amount of RAM it should get - would it be the segment size?

thanks,
-chris

Jingwei

Apr 26, 2012, 1:58:01 AM
to Krati
Hi Chris,

Krati does not support memory tuning. It relies on NIO and OS page
cache.

There are four types of SegmentFactory.

MemorySegmentFactory : all segments are in memory.

MappedSegmentFactory : all segments are in mmap.

ChannelSegmentFactory : all segments are in NIO file channel.

WriteBufferSegmentFactory: all segments except the current appending
segment are in NIO file channel. The current appending segment is in
memory.

Each SegmentFactory has distinct performance characteristics. We use them
differently in production depending on the application requirements.

For your question, the WriteBufferSegmentFactory with 32MB segment
file size seems to be a good option.

Thanks.

Jingwei

Jingwei

Apr 26, 2012, 2:00:08 AM
to Krati
Since your write throughput is not high, I think ChannelSegmentFactory
with 32MB/64MB is good too.

Jingwei
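
The factory descriptions and recommendations in this thread can be summarized in a small Java sketch. The enum and the `recommend` helper are hypothetical and not part of Krati; only the factory names and the descriptions of where segments live come from the messages above, and the boolean flag stands in for a judgment about write load that the thread leaves to the application.

```java
// Summary of the four Krati segment factory types as described in the thread.
// Hypothetical helper for illustration only, not part of Krati.
final class SegmentFactoryChoice {
    enum Kind {
        MEMORY("MemorySegmentFactory", "all segments in memory"),
        MAPPED("MappedSegmentFactory", "all segments in mmap"),
        CHANNEL("ChannelSegmentFactory", "all segments in NIO file channel"),
        WRITE_BUFFER("WriteBufferSegmentFactory",
                "current appending segment in memory, the rest in NIO file channel");

        final String factoryName; // Krati class name mentioned in the thread
        final String placement;   // where the segment data lives

        Kind(String factoryName, String placement) {
            this.factoryName = factoryName;
            this.placement = placement;
        }
    }

    // Rule of thumb from the thread: WriteBufferSegmentFactory for good write
    // throughput, ChannelSegmentFactory when writes are light and memory is tight.
    static Kind recommend(boolean highWriteThroughput) {
        return highWriteThroughput ? Kind.WRITE_BUFFER : Kind.CHANNEL;
    }

    public static void main(String[] args) {
        for (Kind kind : Kind.values()) {
            System.out.println(kind.factoryName + ": " + kind.placement);
        }
        System.out.println("Light write traffic: " + recommend(false).factoryName);
    }
}
```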