Binary Tree Tuning

Loren Van Spronsen

Mar 18, 2010, 3:43:06 AM
to Tokyo Cabinet Users
Hey Everyone,

So, I'm relatively new to Tokyo Cabinet, but I've really
been diving in head-first. I've been tearing through the C code like mad
and I've already written my own .NET client for the entire Tokyo
Tyrant API.

Things are going really well, and I'm ready to start doing some
serious benchmarking to see just how TC/TT is going to handle our
data. A lot of people seem to be having trouble getting TC/TT to
write fast enough, and I'm hoping someone can help me avoid
degradation of write performance.

Now, TC/TT really appealed to me because it seemed to suit our
small-key/small-value model very well. On average, our keys will be
about 16-20 bytes (they will all be the same size; we still haven't fully
decided on a key model) and our values will almost all be 4 or 8 bytes
(float/double values).

We're hoping to use the TC/TT B-tree storage engine so that we can
read ranges of values relatively fast. Read performance isn't a huge
deal, but I would expect TC to handle sequential reads of
key ranges very well if they are stored in B-tree pages.

I've read through this page:
http://korrespondence.blogspot.com/2009/09/tokyo-tyrant-tuning-parameters.html
which explains a few things about the TC B-tree tuning options, but
I'm hoping someone can give me a little more info on what the
options actually do. As I understand it:

The TC B-tree engine allocates a hash table of size #bnum and indexes
into that hash table to find the appropriate non-leaf node.

The TC B-tree engine allocates #bnum non-leaf pages, each of which
contains #nmemb members, which are indexed to find the appropriate
leaf node.

The TC B-tree engine allocates #bnum * #nmemb leaf pages, each of which
contains #lmemb members for storing actual key/value pairs.

A maximum of #ncnum non-leaf nodes will be held in memory at any given
time.

A maximum of #lcnum leaf nodes will be held in memory at any given
time.

#apow indicates the alignment for key/value pairs (or just values?),
which in my case should be set to 6, since my key/value pairs will
never exceed 64 bytes (2^6)?

No idea about #fpow. Can anyone offer me some help there?

No idea about #xmsiz. The link above makes it sound like it might be
the total amount of cache memory available to TC. Can anyone clear
that up?

I heard someone on SO mention that disabling ext3 journaling sped up
their database significantly. Can anyone attest to that?
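
The #apow question above comes down to simple rounding. A minimal Python sketch, assuming (as the post does) that #apow is the base-2 logarithm of the alignment, so #apow=6 means 64-byte boundaries. Whether the B-tree engine applies this alignment per key/value pair or per stored page is exactly the open question, so the sketch only shows the rounding itself; the function name is hypothetical.

```python
def aligned_size(record_bytes: int, apow: int) -> int:
    """Round record_bytes up to the next multiple of 2**apow."""
    alignment = 1 << apow
    return (record_bytes + alignment - 1) // alignment * alignment


# A 28-byte pair (20-byte key + 8-byte value) under a few alignment settings.
for apow in (0, 4, 6, 8):
    size = aligned_size(28, apow)
    print(f"apow={apow}: alignment {1 << apow} bytes, 28-byte record occupies {size} bytes")
```

If the alignment really were applied to every small pair, a large #apow would waste most of each slot, which is one reason to confirm what the unit of alignment is before settling on a value.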

Now, we're planning to throw some relatively powerful servers at some
very large datasets. Could anyone provide a relatively well-tuned
configuration for a server with perhaps 1.2 TB of storage (RAID 10
of 15K RPM SAS drives) and 24GB of RAM? How many records could you
feasibly store on such a server and still obtain response times of
< 0.5 sec while constantly undergoing a large insert volume?

Hopefully I'm not asking too much, but this is the first real source
of info on TC/TT I've found.


Vinicius Tinti

Mar 18, 2010, 8:29:01 AM
to tokyocabi...@googlegroups.com
This is a recurring topic on this group. I have been looking for this answer
for a long time... My advice is to tune by testing against your own application. In my
case, good numbers came from the recommended settings on TC's
specification page; at the bottom of TT's page there is another set,
which applies to TC too. If you need very low latency, do not
use any compression; it took about 90% of the time in my case.

The TC presentation and the format specification show how the
data is organized. They might be a good way to understand
how to tune.

Sorry, I am frustrated because I can't help more than that.
Good luck!



--
You received this message because you are subscribed to the Google Groups "Tokyo Cabinet Users" group.
To post to this group, send email to tokyocabi...@googlegroups.com.
To unsubscribe from this group, send email to tokyocabinet-us...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/tokyocabinet-users?hl=en.




--
Vinicius Tinti -> vinici...@gmail.com, ti...@dcc.ufmg.br, ti...@comp.eng.br

"Here Goes Nothing"
"Beware of bugs in the above code; I have only proved it correct, not tried it."
"Only those who attempt the absurd can achieve the impossible."
"Never underestimate the power of very stupid people in large groups."
"Simplicity is the ultimate sophistication"
"Beware, I know Ruby-FU"

Hamish Allan

Mar 18, 2010, 8:35:31 AM
to tokyocabi...@googlegroups.com
On Thu, Mar 18, 2010 at 7:43 AM, Loren Van Spronsen
<loren.va...@gmail.com> wrote:

>         No idea about #fpow, can anyone offer me some help there?
>
>         No idea about #xmsiz, the link above makes it sound like it
> might be the total
>         amount of cache memory available to TC, can anyone clear that
> up?

There's an article here:

http://blog.grayproductions.net/articles/tokyo_cabinets_keyvalue_database_types

which has a little about tuning parameters at the bottom.

Hope this helps,
Hamish

Loren Van Spronsen

Mar 18, 2010, 8:37:36 PM
to Tokyo Cabinet Users
Heyya,

Thanks for the posts, everybody. I've set up a quick test box
to play around with some of the TC tuning parameters.
Things have worked out great so far: using a few Tokyo Tyrant
implementations, I'm achieving 60K inserts per second against a
standard run-of-the-mill workstation. I'm hoping that will scale up
to 200-300K per second as we move up to a server-class machine.

I've hit 800 million records in my database as I'm writing this, with
the database consuming about 23 GB of disk space. Keeping in mind that
the test system contains only a single 7200 RPM SATA hard drive, 2GB of
RAM and a dual-core CPU, I'm quite impressed that it is able to
respond to read requests within 20-25ms on average WHILE undergoing
60K writes per second. (Actually, the write speed dropped about 20-25%
while I was doing 10,000 random reads, but even that is respectable.)

For anyone interested, my tuning parameters are set as follows:

#opts=l
It took me a while to figure out that I had to rebuild TC with
--enable-off64 to bypass the default 2GB limit (even with opts=l set) on
i386 systems, which by default don't support 64-bit file pointers.
Once this option was set, the database took off like a champ and
hasn't looked back. I haven't set any compression options
because I'm using binary keys and values, so I'm not expecting
compression to achieve much anyway. However, it is
worth mentioning that the math for my database works out pretty much
exactly to what TC predicts: I'm getting 800 million records in
23GB of space = 28.75 bytes per record. I'm using 16-byte keys and
8-byte values consistently; add in the 5-byte overhead predicted by TC
and you get 29 bytes, which is almost exactly what I'm getting. This
is a perfectly acceptable number for me, so I'm not too worried about
compression.
#bnum=100,000,000 (commas added for readability)
When I restarted TT after changing bnum to 100 million, I made a point
of watching it allocate the initial database file and RAM, which
stopped at about 800MB. This essentially means that the initial hash
table stores NOTHING but a 64-bit file pointer per bucket
(100 million * 8 bytes = 800MB), which is pretty impressive. Even assuming this entire
hash table is loaded into RAM, this is only 800MB of RAM being used
(right now the entire server is using 918MB of RAM, so I can probably
even boost this up slightly if I decide to restart the test). When I
get my production servers (which are going to have 24GB of RAM and be
100% dedicated boxes), I will probably boost this number to one and a
half billion or so, and expect it to use about 12GB of RAM for the
hash table, leaving me 10GB or so of RAM to play around with for the node
caches.
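
A small sketch of the bucket-array estimate used above. It assumes, based only on the observation in this post, that each bucket costs one 8-byte file pointer and nothing else; the function name is made up for illustration.

```python
POINTER_BYTES = 8  # one 64-bit file pointer per bucket, as observed above


def bucket_array_bytes(bnum: int) -> int:
    """Size of the initial hash table for a given #bnum."""
    return bnum * POINTER_BYTES


for bnum in (100_000_000, 1_500_000_000):
    print(f"bnum={bnum:,}: {bucket_array_bytes(bnum) / 10**9:.1f} GB")
```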

#lmemb=512 and #nmemb=512
One of the primary operations we will be performing against this
database will be reading ranges of values, sometimes very large
ranges. For this reason, I'm more sure about my choice to make #lmemb
larger. When TC loads an entire leaf node (which contains the actual
records?) it should be able to read records sequentially. Since we are
hoping to read a maximum of 10,000 to 20,000 records at a
time, I'll probably increase #lmemb to 1024 or 2048 on a production
system. With faster server hard drives, this shouldn't increase read
delay by much, but should provide much more efficient access to large
sequential blocks. Also, since TC uses msync and memory-mapped files,
I would assume that increasing the size of leaf nodes while doing
sequential inserts of records with similar keys could potentially
greatly improve write performance. I'm going to have to do some
serious thinking about #nmemb; I'm not at all sure whether increasing it
any more will have any significant effect on speed. I think I will
balance it in the end such that #lmemb * #nmemb is about 100,000,
since that is about the absolute maximum number of records we will
read sequentially, but I'm not even too sure about that.

#xmsiz=50,000,000
I can't say whether this was used significantly by TC, and I'm
still not sure quite what it is used for. Hopefully this will become
clearer as I keep playing around with it.

I didn't change either of the cache values, so they were still at
their defaults of 1024 leaf nodes and 512 non-leaf nodes. It's
interesting to note that:
1024 leaf nodes * 512 records per node * 29 bytes per record = 15MB
512 non-leaf nodes * 512 records per node * 8 bytes per file pointer = 2MB

If I'm not mistaken, I was only using 17MB of memory for cache.
Considering that I was writing a grand total of about 2MB/second,
this seems somewhat inadequate. In future tests, I will probably bump
these numbers up significantly, perhaps up to 20,000 leaf nodes (about
300MB of RAM) and 25,000 non-leaf nodes (about 100MB of RAM), just
to see what sort of difference that makes.
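
The cache estimates above follow one formula, sketched here in Python. It carries over the post's own simplifications: every cached node is assumed full, a leaf entry costs 29 bytes, and a non-leaf entry costs only an 8-byte pointer (keys held in non-leaf nodes and any per-node overhead are ignored).

```python
def cache_bytes(nodes: int, members_per_node: int, bytes_per_member: int) -> int:
    """Rough memory held by a node cache when every cached node is full."""
    return nodes * members_per_node * bytes_per_member


configs = {
    "default leaf cache (1024 nodes)": cache_bytes(1024, 512, 29),
    "default non-leaf cache (512 nodes)": cache_bytes(512, 512, 8),
    "planned leaf cache (20,000 nodes)": cache_bytes(20_000, 512, 29),
    "planned non-leaf cache (25,000 nodes)": cache_bytes(25_000, 512, 8),
}
for label, size in configs.items():
    print(f"{label}: {size / 10**6:.1f} MB")
```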

Out of curiosity, does anyone have any information on what sort of
caching strategies TC uses for cached pages?

Thanks for your help, and hopefully some of my information was useful
in return.
Loren Van Spronsen



Vinicius Tinti

Mar 18, 2010, 8:56:44 PM
to tokyocabi...@googlegroups.com
Hi,

Now I'm the one who needs to say thanks. Thank you very much for sharing this information!!!
Also, I have an idea: how about creating a topic called "My tune of TC/TT"? There we could
post information like environment, tuning options, performance and so on.

What about it?

Thanks.

PS: I really need to say thanks!


Tom Chen

Mar 18, 2010, 8:51:32 PM
to tokyocabi...@googlegroups.com
Hi Loren,

Thanks for the info that you provided. Your benchmarks match what I'm getting against a Xeon box with RAID 0, but I'm inserting 1KB values with 10-byte keys.

Which client bindings are you using? I'm using the memcache protocol, and I can get it running up to 30k ops a second.

Here are the params I use to launch the server.

ttserver -port 21201 -thnum 8 -uas /var/lib/ttserver/cats.tch#opts=ld#mode=wc#bnum=1000000000

I will be curious to see what you end up using for xmsiz settings, and what others provide on how ttserver uses the memory cache. 

Tom








--
Tom Chen
Software Architect
GOGII, Inc
t...@gogii.net
650-468-6318

Loren Van Spronsen

Mar 18, 2010, 9:09:29 PM
to Tokyo Cabinet Users
Hey Vinicius,

Sounds like a good idea, but what I really think we need is specific
examples of what each tuning parameter will affect. I'm going to be
setting up a small farm of high-powered TC/TT machines over the next
couple of months, so I expect to be poring through a lot of the C
code for the BDB implementation. When I feel a little more comfortable
with some more of the options (most of what I've gone through now is
just simple benchmarking and common sense), I'll whip up a topic for
what I've noticed and how I feel tuning different parameters will
affect different environments.

Until then, your idea sounds good. I'm still at work, but I might get
that done tonight, who knows. My database just hit a billion rows,
which is a bit of a milestone :)

I don't suppose anyone has any information on file systems and TC/TT?
Right now I'm just using ext4 (I've played around with disabling and
enabling journaling, which as of yet hasn't seemed to make a
tremendous difference), but has anyone had particular success with any
common Linux file system?

Loren Van Spronsen


Vinicius Tinti

Mar 18, 2010, 9:37:01 PM
to tokyocabi...@googlegroups.com
Hi,

We can create a small set of tests, share them, run them and report to the community. At
my work, we have a small cluster of high-powered machines.

I have only used TC/TT with ext3 and ext4, and I have never worried about the file system.
But it is very important and could lead to better or worse performance.

About file systems, I know that:

ext3 - too slow for small files; it doesn't have any optimizations for them. It also
keeps the files themselves more consistent than the file system itself.

ext4 - (I don't know very much) I have read that the developers have fixed the small-file
problem and increased performance (take a look at Phoronix, http://www.phoronix.com/,
for benchmarks). Google is migrating (or has migrated) to ext4:
http://arstechnica.com/open-source/news/2010/01/google-upgrading-to-ext4-hires-former-linux-foundation-cto.ars
Before that they used ext2 (interesting, no?).

reiserfs - The opposite of ext3: optimized for small files and not for large ones. It is easier
to end up with corrupted files than with a corrupted file system.

ext2 - fast, simple and without safety guarantees. Data safety must be handled in the application.
A simple solution that I have read about somewhere is to keep a slave on a safe
file system like the ones above.

spadfs - A hash-based file system that I have studied. I like the idea of it. According
to the author, it is as safe as a journaling file system and as fast as a non-journaling
one. You can get the code here: http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/

hadoop - I guess in this case it is better to use Cassandra instead of Tokyo.

I have only used the file systems above and a little ZFS (a great one too, but only for
Solaris and FreeBSD). A ZFS-like file system for Linux is under development; it is called Btrfs.

Cheers.
Vinicius.


Loren Van Spronsen

Mar 19, 2010, 12:54:04 AM
to Tokyo Cabinet Users
Hey Vinicius,

I'll definitely be posting more test results as things progress,
although some of them will probably be VM-based.

Thanks for the info on file systems; some of it is new to me. Things
like reiserfs, spadfs and hadoop don't really seem practical for TC/TT,
since it uses a single large file.

I think the safest bets are definitely ext2/ext3/ext4 and ZFS/Btrfs. I
might post some comparison results when I get the opportunity to run
more tests against some VMs... Possibly tonight yet...

Thanks for all of your responses guys,
Loren Van Spronsen


Loren Van Spronsen

Mar 19, 2010, 12:56:44 AM
to Tokyo Cabinet Users
Hey Tom,

I would guess that the largest difference between the performance I'm
getting on a single drive and what you're getting on RAID 0 has to do
with the size of the values.

That being said, I'm using a custom set of .NET bindings I created in
about 3 hours yesterday (they implement the vast majority of the binary
protocol). I would give the binary protocol a shot; it is VERY minimal
and absolutely optimized for speed. However, I reckon that hard drive
and IO speed are your bottleneck, just like they are mine. If you are
like me and are using binary keys and very compact (not necessarily
small, but dense) values, I might suggest trying to disable
compression. I noticed a relatively significant increase in IO
throughput when I disabled compression features. I'm only using 24
bytes total for a single record, so it wasn't a big deal for me.

I explained my parameters for starting the server above. I believe 8
worker threads is the default, so that hasn't changed... Would you
mind explaining what the -uas switch is for? My complete set of
parameters would be something along the lines of (commas added for
readability):

ttserver -port 1978 -thnum 8 /var/ttserver/casket.tcb#opts=l#bnum=100,000,000#lmemb=512#nmemb=512
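
Since the commas in the command above are only there for readability, it can help to generate the real string. A minimal Python sketch that joins a database path and tuning parameters in the same '#name=value' form; the helper name ttserver_db_name is hypothetical and the values are simply the ones quoted in this post.

```python
def ttserver_db_name(path: str, **tuning: object) -> str:
    """Join a database path and tuning parameters with '#', e.g. casket.tcb#bnum=1000."""
    parts = [path]
    for name, value in tuning.items():
        # str() on an int yields plain digits, so no thousands separators leak in.
        parts.append(f"{name}={value}")
    return "#".join(parts)


db_name = ttserver_db_name(
    "/var/ttserver/casket.tcb", opts="l", bnum=100_000_000, lmemb=512, nmemb=512
)
print(" ".join(["ttserver", "-port", "1978", "-thnum", "8", db_name]))
```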

I'll add more information about the xmsiz and cache values that I'm
using when I learn a little more about how they impact performance.
Just for comparison, would you mind mentioning which Linux distro,
file system, and speed of hard drive you are using? I'm just trying
to get a decent grasp of what I can expect from TC/TT in any given
environment.

Thanks,
Loren Van Spronsen


Loren Van Spronsen

Mar 19, 2010, 1:10:35 AM
to Tokyo Cabinet Users
Hey Tom,

A couple of other things came to mind too... Which API call are you
using for inserts (put vs. putnr vs. ...)? I don't really like the misc
calls bundled with TT, so I've been writing server additions in C for
doing things like this. I've built my own custom mput call, and I'm
inserting keys/values in groups of about 3000.

Also, one thing that might be incredibly important is that my key
inserts are sequential. I'm inserting 3000 SEQUENTIAL keys into my
database at a time, which could artificially inflate my write
benchmarks...

I think I had one other thing, but I can't recall it now. Maybe later.
Loren Van Spronsen


Tom Chen

Mar 19, 2010, 1:11:12 AM
to tokyocabi...@googlegroups.com
Hi Loren,

I believe the -uas switch is for the replication log.

My CPU info is below:

Intel(R) Xeon(R) CPU           E5504  @ 2.00GHz

For HDs
I'm using 2 SATA rpm drives in a RAID 0.

I suspect I can push higher throughput by switching to the binary protocol bindings for Java, but I wanted to stick with the memcache protocol to give me a drop-in replacement for some of the items we have running in production.

tom



Tom Chen

Mar 19, 2010, 1:14:43 AM
to tokyocabi...@googlegroups.com
Interesting note.

My sequential inserts max out at 2k/s; however, I can spin up 7 more threads that can output 2-3k sets per second across a random range of 1 to 20 million entries. My profiling shows my total ops at about 30k/s.

I'm using the Java xmemcached client; I've gotten the same performance with the spymemcached client.

Tom



Loren Van Spronsen

Mar 19, 2010, 2:51:07 AM
to Tokyo Cabinet Users
Hey Tom,

So I've set up a virtual machine at home and I'm in the process of
running between 15 and 20 automated test runs of 20 million inserts
with various tuning parameters.

One of the first things I tested was inserting values using the
standard put command, which happens to be very similar to the memcache
set command. Both of these commands send the key/value pair to the
server and then WAIT FOR A RESPONSE. My home machine (2 * 640GB 7200
RPM SATA II hard drives in RAID 0 with an Intel Core i7 and 12GB of RAM, 4GB
dedicated to the virtual machine) is quite similar to your box, and I
noticed that when I ran using this insertion method, I maxed out at
pretty much exactly 1k inserts per second. When I replaced this
operation with putnr (a binary command that puts without waiting for a
response), I achieved about 200 times that (yeah, I was hitting 200k
inserts per second on my virtual machine).

You may want to look into benchmarking the binary putnr command... I
looked through the memcache spec and it doesn't appear to have a
similar command. If you still want to wait for a response, consider
using the binary putlist command, which puts multiple records at a
time before waiting for a response. The key here is avoiding network
latency, which is a significant issue if you are inserting 1000
records a second...
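
To make the put/putnr difference concrete, here is a minimal Python sketch of how the two request frames could be built. The magic byte 0xC8, the command bytes 0x10 (put) and 0x18 (putnr), and the big-endian 4-byte length fields are recalled from the Tokyo Tyrant binary protocol description and should be treated as assumptions to verify against the protocol documentation; the function name is hypothetical.

```python
import struct

MAGIC = 0xC8       # assumed first byte of every binary-protocol request
CMD_PUT = 0x10     # assumed: server replies with a one-byte status
CMD_PUTNR = 0x18   # assumed: "no response", the server sends nothing back


def put_frame(key: bytes, value: bytes, wait_for_reply: bool = True) -> bytes:
    """Build a put (or putnr) request: magic, command, key size, value size, key, value."""
    command = CMD_PUT if wait_for_reply else CMD_PUTNR
    header = struct.pack(">BBII", MAGIC, command, len(key), len(value))
    return header + key + value


# A 16-byte binary key and an 8-byte double value, as used in this thread.
key = (42).to_bytes(16, "big")
value = struct.pack(">d", 3.14)
frame = put_frame(key, value, wait_for_reply=False)
print(len(frame), frame[:2].hex())
```

With putnr the client can write frames back to back on the socket without reading anything in between, which is where the latency saving comes from; the trade-off is that a failed insert goes unnoticed.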

It also makes sense that you could increase your throughput by
upping the number of threads. Each thread would independently be
waiting for the server to respond to its last insert, and would
proceed once it had that response. This allows you to wait for N
responses at the same time, and proceed when any of them is received.
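
A back-of-envelope model of the effect described above, as a Python sketch. The 1 ms round trip is an assumption inferred from the roughly 1k inserts per second seen with blocking puts; the model is only an upper bound from latency and says nothing about server or disk limits.

```python
def inserts_per_second(round_trip_s: float, records_per_request: int = 1,
                       connections: int = 1) -> float:
    """Upper bound when each connection sends one request, then waits for its reply."""
    return connections * records_per_request / round_trip_s


ROUND_TRIP = 0.001  # assumed 1 ms request/response latency

print(inserts_per_second(ROUND_TRIP))                            # one blocking put at a time
print(inserts_per_second(ROUND_TRIP, connections=8))             # eight client threads
print(inserts_per_second(ROUND_TRIP, records_per_request=3000))  # batched, putlist-style
```

Both more threads and larger batches raise the ceiling the same way: more records are in flight per round trip.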

Hope that helps you somewhat,
- Loren Van Spronsen


Vinicius Tinti

Mar 19, 2010, 5:13:31 AM
to tokyocabi...@googlegroups.com
Hi,

Another thing that I use is a simple multiplexer. Instead of using a single database,
let's suppose that I use five. Then, before any operation on any of the databases, I need to
hash the key with a hash function to discover which database it should be in
(e.g. hashing the key 'loren' returns 5, so it should be in database 5). I don't really
know if it is the best way to speed up databases, but it keeps any one database from becoming
huge. Huge databases are slower than small ones (it makes sense, but again I
can't prove it). TT seems to have built-in functionality for this, but I have no idea how to use it:

in the ttserver section: "-mul num : specify the division number of the multiple database mechanism."

I don't really know what it means. Another very important thing that may
help: your hash function needs to be a very good one, fast, with good range and
low cost. Consistent hashing may help with adding and removing databases. I also want
to benchmark (one day I will, I hope) kumofs, which Flinn has reported to
this group: http://github.com/etolabo/kumofs
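
The multiplexer idea above can be sketched in a few lines of Python. This is an illustration only, not the poster's code: CRC32 stands in for "a good, cheap hash function", database indexes start at 0, and the function name is hypothetical.

```python
import zlib


def pick_database(key: bytes, database_count: int) -> int:
    """Map a key to a database index in the range [0, database_count)."""
    return zlib.crc32(key) % database_count


# The same key always lands in the same database.
for key in (b"loren", b"tom", b"vinicius"):
    print(key.decode(), "-> database", pick_database(key, 5))
```

Plain modulo hashing like this remaps most keys whenever database_count changes, which is the problem consistent hashing is meant to reduce.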

Cheers,
Vinicius



Flinn Mueller

Mar 19, 2010, 9:57:47 AM
to tokyocabi...@googlegroups.com
On Mar 19, 2010, at 5:13 AM, Vinicius Tinti wrote:

> Hi,
>
> Another thing that I use is a simple multiplexer. Instead of using a single database,
> let's suppose that I use five. Then, before any operation on any of the databases, I need to
> hash the key with a hash function to discover which database it should be in
> (e.g. hashing the key 'loren' returns 5, so it should be in database 5). I don't really
> know if it is the best way to speed up databases, but it keeps any one database from becoming
> huge. Huge databases are slower than small ones (it makes sense, but again I
> can't prove it). TT seems to have built-in functionality for this, but I have no idea how to use it:
>
> in the ttserver section: "-mul num : specify the division number of the multiple database mechanism."

-mul splits the file into n files. As I understand it, this splits your DB into multiple DBs and provides simple hashing to decide which DB your key goes into. This is transparent to a Tokyo Tyrant client since it's built right into the TCADB interface. That's entirely internal, though, and not available as a user function.

> I don't really know what it means. Another very important thing that may
> help: your hash function needs to be a very good one, fast, with good range and
> low cost. Consistent hashing may help with adding and removing databases. I also want
> to benchmark (one day I will, I hope) kumofs, which Flinn has reported to
> this group: http://github.com/etolabo/kumofs

Tokyo Cabinet also has more complex hash ring functionality in TCCHIDX. If you were building your own Tokyo Tyrant client I believe you could use TCCHIDX, but you still must deal with managing a hash ring, which is always a pain point on the client. Another pain point with the native C Tokyo Tyrant interface I've run into is how TCRDB handles (or doesn't handle) a dead connection. When you've got a ring of clients, you need to know when one of them is dead so you can manage your ring.

Loren Van Spronsen

Mar 19, 2010, 4:28:13 PM
to Tokyo Cabinet Users
Hey Guys,

Thanks Vinicius and Flinn for the extra info on mul. I was originally
thinking of implementing a hash ring to split data across multiple
files, but it appears TT handles that for me. One more point for TT...

A little more information after doing some quick tests on the mul
option: mul multiplies all your settings by whichever value you supply,
so if you're trying to fit a RAM budget, make sure you divide your
settings by your mul number.

For example, I ran "ttserver -mul 10 /var/ttserver/casket.tcb#opts=l#bnum=10,000,000#nmemb=512#lmemb=512",
which caused TC/TT to create a directory at /var/ttserver/casket.tcb and fill it
with 10 files. Each of these 10 files was about 81MB in size. Now, if
we do some quick math, we get 10,000,000 * 8 bytes = 80,000,000 bytes, which
is roughly the size of each file. So we see that TC/TT hasn't divided any of
our options by 10, but has opened 10 databases that all have these
options. This can be verified by looking at ttserver's memory usage,
which is approximately 900MB.
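
The same estimate with -mul taken into account, as a short Python sketch. It assumes only what the test above showed: every one of the mul databases gets the full #bnum, at 8 bytes per bucket. The function name is hypothetical.

```python
def mul_bucket_bytes(bnum: int, mul: int, pointer_bytes: int = 8) -> int:
    """Total bucket-array size when each of the mul databases gets the full #bnum."""
    return bnum * pointer_bytes * mul


print(mul_bucket_bytes(10_000_000, 10))    # the test above: ten files of about 80 MB each
print(mul_bucket_bytes(100_000_000, 15))   # 15 databases with bnum=100,000,000 each
```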

That being said, I like the idea of opening the databases in multiple
files, which should make things a little nicer on the file system and
the memory mapped files. The idea now is to split the data across 15
databases, and set bnum=100,000,000 for each of them (once I get my
production server).

Now, I'm running quick tests against a VM running on my laptop, so
these numbers are probably going to be significantly lower than on any
decent production machine, but hopefully the ratio is relatively
constant. While running an insert test against this VM with
bnum=10,000,000#lmemb=512#nmemb=512 and mul=10, I was able to insert
7,000,000 records in about one minute. I timed it in groups of 700,000
records, and each group took more or less the same time (I didn't see any
noticeable increase between the 0-700,000 batch and the
6,300,000-7,000,000 batch).

Running the same insert test against a VM with
#bnum=100,000,000#lmemb=512#nmemb=512 and mul=1 (similar
configuration, just in a single file) took
about 53 seconds, which is slightly faster. The insert batches were
consistently a little bit faster, which suggests
that single files might be better for smaller databases (I'm only
using 7,000,000 records)...

Running the same two tests against the same VM, but inserting
70,000,000 rows, gives 582 seconds for 69,120,000 inserts with
mul=10 (with fairly constant insert times for batches of 700,000
records) and 588 seconds for 69,120,000 inserts with mul=1 and
bnum=100,000,000. Overall the times were relatively similar between
the two runs; however, I noticed that the batch times were much more
consistent with the mul=10 run. The mul=10 run was pretty constant
at 5.5-ish seconds per batch, with the occasional spike up to 7.5-8.0
seconds (during which I'm assuming TC is doing an msync). The mul=1 run, on
the other hand, showed much more variance between batches, with
times from 4.8 to 8.5 seconds pretty much scattered throughout. I'm not quite
sure how this plays out, but I think the consistency aspect sells me
on using a multiple database for huge databases, as I'm still
relatively sure that it will increase the efficiency of the msync
operations...
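
For comparison, the four runs above can be turned into inserts per second with a short Python sketch. The record counts and times are the ones quoted in this post; the first run is given only as "about one minute", so 60 seconds is an approximation.

```python
def inserts_per_second(records: int, seconds: float) -> float:
    """Average insert rate for one benchmark run."""
    return records / seconds


runs = [
    ("7.00M records, mul=10 (about 60 s)", 7_000_000, 60),
    ("7.00M records, mul=1", 7_000_000, 53),
    ("69.12M records, mul=10", 69_120_000, 582),
    ("69.12M records, mul=1", 69_120_000, 588),
]
for label, records, seconds in runs:
    print(f"{label}: {inserts_per_second(records, seconds):,.0f} inserts/s")
```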

Hope that helps some people
Loren Van Spronsen


Loren Van Spronsen

Mar 23, 2010, 3:05:58 PM
to Tokyo Cabinet Users
Hey Guys,

Just one more follow-up on my test box (the one with the single
7200 RPM hard drive, 2GB of RAM and the dual-core CPU). Its
database is now up to about 1.4 billion records, each with a 16-byte
key and an 8-byte value. I'm still able to do inserts at 40-50 thousand
per second, and I just got a get-range system up and running. I'm able
to retrieve about 36,000 sequential keys/values in about half a
second, which is pretty spectacular considering the machine. That's
more than 1MB of raw data coming back from the database in half a
second. I'm hoping that speeds up even further when I move up to
server-class hardware.

Thanks for all your help,
Loren Van Spronsen


Makoto

Mar 28, 2010, 6:22:14 PM
to Tokyo Cabinet Users
Hi,

Since there was a topic about the multiple abstract database (-mul), I did
a quick translation of Mikio's blog article related to the topic.

http://tokyocabinetwiki.pbworks.com/29_using_multiple_abstract_database

There were certain parts which I did not understand while I was
translating, so some of them may not make much sense.
Let me know if any parts are hard to understand. I will try to
revise them.

Thanks.

Makoto



Flinn Mueller

Mar 28, 2010, 7:23:54 PM
to tokyocabi...@googlegroups.com
This is great. The translation makes sense given how I've understood the tcadbmul code.

The parts on @ and % are a little unclear. Reading the tcadbmulmisc code, it appears that if you don't prefix getlist and putlist, the command will execute on all ADB files; prefixing will use the consistent-hashed key to execute the command only on the ADB file relevant to the argument.

Makoto Inoue

Mar 30, 2010, 8:23:05 AM
to tokyocabi...@googlegroups.com
Hi, Flinn.

Thank you for your input. I updated the wiki with command examples.  

Thanks.

Makoto

Mike Dierken

Apr 11, 2010, 8:41:56 PM
to tokyocabi...@googlegroups.com
How (and how often) are you calling 'sync' to have ttserver flush
in-memory changes to disk?

Loren Van Spronsen

Apr 12, 2010, 11:38:59 AM
to Tokyo Cabinet Users
Hey Mike,

I haven't been calling sync as of yet (not manually at least). I suppose
I should throw that at the end of all my test cases to get actual
results... I might re-run some tests with that in place. Work has
taken a bit of a detour from the Tokyo Tyrant side of things, so
it might be a month or so before I get into actually implementing my
production database...

Loren Van Spronsen
