trying to replicate performance claims


mhgrove

Feb 4, 2011, 2:06:06 PM
to Krati
Hi. I'm working on an application and I am considering using krati,
but I'm having some problems reproducing the performance numbers
stated on the web site. I suspect I've just misconfigured it, so I'm
hoping someone on here can point me in the right direction.

Currently the application stores the data in a Multimap (from google
collections), but as we're trying larger input sizes, we want to put
the hash table on disk rather than holding it in memory. The keys for
the map are longs, and I wrote a simple serialization class to
serialize and deserialize the lists of objects to arrays of bytes and
back. I was able to swap in the krati backed hash table, and my code
works, but the performance was very different from what is claimed on
the site.

I distilled my code down to measure *just* the cost of initially
populating the hash table. The method is simple: it looks up the
current value in the Krati store and, if one exists, appends the new
data to the existing value and puts it back in the store. Otherwise
it inserts a new key-value pair.
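
For readers following along, the lookup-then-append step described above could be sketched roughly as follows. This is not the poster's code and not Krati's API: the class name `AppendOrInsert` is hypothetical, and a plain `HashMap` stands in for the on-disk store, so only the control flow is illustrated.

```java
import java.util.HashMap;
import java.util.Map;

public class AppendOrInsert {
    // Stand-in for the on-disk store (assumption): long key -> one serialized value.
    // A real Krati store would take the place of this map.
    private final Map<Long, byte[]> store = new HashMap<>();

    /** Appends newData to the value under key, or inserts it if the key is absent. */
    public void appendOrInsert(long key, byte[] newData) {
        byte[] existing = store.get(key);      // one read per incoming element
        if (existing == null) {
            store.put(key, newData.clone());   // first value seen for this key
            return;
        }
        // Key already present: build the concatenated value and write it back.
        byte[] merged = new byte[existing.length + newData.length];
        System.arraycopy(existing, 0, merged, 0, existing.length);
        System.arraycopy(newData, 0, merged, existing.length, newData.length);
        store.put(key, merged);                // rewrites the whole, now larger, value
    }

    public byte[] get(long key) {
        return store.get(key);
    }
}
```

Note that in this pattern every append rewrites the entire value for the key, so the bytes written per put grow with the number of values already stored under that key.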

The write performance for creating a small hash table (10k elements)
was approximately 0.57 writes per ms (or 1.75 ms per write), and the
read to check whether the key is already present ran at roughly 9 per
ms (or about 0.1 ms per read). Serializing the values to bytes for
storage took an average of 0.04 ms per value, which was less than 2%
of the total time to build the hash table.
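
As a rough illustration of how per-operation averages like these can be obtained and converted, here is a minimal sketch. The class and method names (`OpTimer`, `msPerOp`, `averageMs`) are hypothetical, and this is not the harness used for the numbers above.

```java
public class OpTimer {
    /** Converts a throughput in operations per millisecond to milliseconds per operation. */
    public static double msPerOp(double opsPerMs) {
        return 1.0 / opsPerMs;
    }

    /** Runs op count times and returns the average wall-clock milliseconds per call. */
    public static double averageMs(Runnable op, int count) {
        long start = System.nanoTime();
        for (int i = 0; i < count; i++) {
            op.run();
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000_000.0 / count;
    }

    public static void main(String[] args) {
        // Figures quoted in the post: 0.57 writes per ms and 9 reads per ms.
        System.out.println("ms per write: " + msPerOp(0.57));
        System.out.println("ms per read:  " + msPerOp(9.0));
    }
}
```

A simple loop like this includes JIT warm-up and garbage-collection pauses in the average, so small runs (such as 10k elements) can look slower than steady-state throughput.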

The 10k elements are randomly generated and tend to have about 1k
unique keys, so the test usually had about 10 values per key, and
each value serialized to at most 170 bytes.

This is how I am creating the Krati store:

ds = new DynamicDataStore(Files.createTempDir(), new WriteBufferSegmentFactory(128));

I settled on the WriteBuffer segments because they tended to give me
the best performance. The MemorySegmentFactory took approximately 10x
more time, and the ChannelSegment was about 1.5x slower than the
MemorySegment.

Is there a better way to construct a store? I cannot specify the
capacity because I do not know the size ahead of time. I anticipate
eventually needing to put somewhere in the ballpark of 50-100M
elements into the store. Once the store is created I will not modify
the data; I will only perform reads from it.

Am I doing something wrong here, or is krati not well suited for this
type of use?

Thanks.

Michael

Jingwei

Feb 8, 2011, 9:17:33 PM
to Krati
Hello Michael,

In the case you just described, DynamicDataStore won't really help
unless you explicitly specify the initial capacity (roughly 2 times
#keys). For the same reason, we developed another data store called
IndexedDataStore. We strongly suggest you try this one.

The basic idea of IndexedDataStore is to keep keys in memory and put
data in the I/O cache or on disk. This reduces the data movement among
segments caused by hash collisions. The constructor interface looks
like the following:

IndexedDataStore(File homeDir,
int batchSize,
int numSyncBatches,
int indexInitLevel,
int indexSegmentFileSizeMB,
SegmentFactory indexSegmentFactory,
int storeInitLevel,
int storeSegmentFileSizeMB,
SegmentFactory storeSegmentFactory)

homeDir - the store home directory.
batchSize - update batch size (e.g. 10000 persists the redo log every
10000 updates)
numSyncBatches - the number of update batches needed to sync redo logs
with indexes.

indexInitLevel - linear hashing level for indexes. (e.g.
indexInitLevel 11 gives an initial capacity of 2^11 * 64K, which is
2^27, roughly 128 million keys)
indexSegmentFileSizeMB - index segment file size in MB (e.g. 32)
indexSegmentFactory - index segment factory (e.g. MemorySegmentFactory
is the best option)

storeInitLevel - linear hashing level for the real data store. (e.g.
storeInitLevel 4 gives you a capacity of 2^4 * 64K, which is roughly
1 million data items)
storeSegmentFileSizeMB - store segment file size in MB (e.g. 256)
storeSegmentFactory - store segment factory (e.g.
WriteBufferSegmentFactory)

Given the case you have described, I would suggest constructing the
following IndexedDataStore.

new IndexedDataStore(
Files.createTempDir(),
10000,
5,
11, 64, new MemorySegmentFactory(),
5, 256, new WriteBufferSegmentFactory()).

The indexInitLevel is critical to write performance. The number 11
specifies enough hash space at the index level.
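
The "2^level * 64K" arithmetic used in the parameter descriptions above can be checked with a small sketch. The class and method names (`InitLevelCapacity`, `capacityForLevel`, `smallestLevelFor`) are hypothetical helpers, not part of Krati; only the formula is taken from this thread.

```java
public class InitLevelCapacity {
    // 64K slots per unit, as in the "2^level * 64K" examples in this thread.
    public static final long UNIT = 64L * 1024;

    /** Initial capacity implied by a linear hashing init level: 2^level * 64K. */
    public static long capacityForLevel(int level) {
        return (1L << level) * UNIT;
    }

    /** Smallest init level whose initial capacity is at least expectedKeys. */
    public static int smallestLevelFor(long expectedKeys) {
        int level = 0;
        while (capacityForLevel(level) < expectedKeys) {
            level++;
        }
        return level;
    }

    public static void main(String[] args) {
        System.out.println(capacityForLevel(11));            // 134217728 (2^27)
        System.out.println(capacityForLevel(4));             // 1048576 (2^20)
        System.out.println(smallestLevelFor(100_000_000L));  // 11
        System.out.println(smallestLevelFor(50_000_000L));   // 10
    }
}
```

Under this formula, level 11 is the smallest level that covers the upper end of the 50-100M range mentioned earlier in the thread, which is consistent with the suggested indexInitLevel.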

Look forward to hearing new performance numbers from you.

Thanks.

-jingwei

Mike Grove

Feb 9, 2011, 8:21:48 AM
to kr...@googlegroups.com
This is the configuration you suggested:
ds = new IndexedDataStore(Files.createTempDir(), 10000, 5, 11, 64, new MemorySegmentFactory(), 5, 256, new WriteBufferSegmentFactory(256));

Reads were about 50% faster with this setup than with the dynamic store, averaging 0.0563 ms per get, which is about 18 reads per ms. Writes, however, were nearly twice as slow, averaging 3.3 ms per write. Both numbers are still slower than what your site says the store can produce.

I made a couple tweaks to the setup, trying this configuration:

ds = new IndexedDataStore(Files.createTempDir(), 100000, 50, 13, 64, new MemorySegmentFactory(), 10, 256, new WriteBufferSegmentFactory(256));

Reads slowed down, averaging 0.0641 ms per read, but writes sped up to about 2.3 ms per write. In both cases, reads are a good bit faster than with the DynamicDataStore, but writes are quite a bit slower.

I'm running the test with 8g of memory for the JVM to create a data store of only 10k elements, on the latest Java 1.6 on my OS X box.

I do not need the contents of the store after I'm done using it for processing. It's not saved or reused between invocations, so I don't need any redo logs or anything else that would aid recovery in the event of a failure. Is there any way to disable that feature?

Also, what config can I use to disable logging to the console? I'm getting a fair number of INFO messages that I'm sure are not helping throughput.

Thanks.

Mike