So, I'm relatively new to the Tokyo Cabinet thing, but I've really
been diving in head-first. I've been tearing through C-code like mad
and I've already written my own .NET client for the entire Tokyo
Tyrant API.
Things are going really well, and I'm ready to start doing some
serious benchmarking to see just how TC/TT is going to handle our
data. A lot of people seem to be having troubles getting TC/TT to
write fast enough, and I'm hoping someone can help me get around
degradation of write performance.
Now, TC/TT really appealed to me because it seemed to suit the small-key/
small-value model that we have very well. On average, our keys will be
about 16-20 bytes (they will all be the same size; we still haven't fully
decided on the key model) and our values will almost all be 4 or 8 bytes
(float/double values).
We're hoping to use the TC/TT B-tree storage engine so that we can
read ranges of values relatively fast. Read performance isn't a huge
deal, but I would expect TC to be able to handle sequential reads of
key ranges very well if the data is stored in B-tree pages.
I've read through this page:
http://korrespondence.blogspot.com/2009/09/tokyo-tyrant-tuning-parameters.html
Which explains a few things about the TC B-tree tuning options, but
I'm hoping someone can give me a little more info on what the
options actually do. As I understand it:
The TC B-tree engine allocates a hash table of size #bnum and indexes
    into that hash table to find the appropriate non-leaf node
The TC B-tree engine allocates #bnum non-leaf pages, each of which
    contains #nmemb members, which are indexed to find the appropriate
    leaf node
The TC B-tree engine allocates #bnum * #nmemb leaf pages, each of which
    contains #lmemb members for storing actual key/value pairs
A maximum of #ncnum non-leaf nodes will be stored in memory at any
    given time
A maximum of #lcnum leaf nodes will be stored in memory at any given
    time
#apow indicates the alignment for key/value pairs (or just values???),
    which in my case should be set to 6, since my key/value pairs will
    never exceed 64 bytes?
No idea about #fpow, can anyone offer me some help there?
No idea about #xmsiz, the link above makes it sound like it might be
    the total amount of cache memory available to TC, can anyone clear
    that up?
I heard someone on SO mention that disabling ext3 journalling helped
    the speed of their database significantly, can anyone attest to that?
Now, we're planning to throw some relatively powerful servers at some
very large datasets. Could anyone provide a relatively well-tuned
configuration for a server with perhaps 1.2 TB of storage (RAID 10
of 15K RPM SAS drives) and 24GB of RAM? How many records could you
feasibly store on such a server and still obtain response times of
< 0.5 sec while always undergoing large insert volume?
Hopefully I'm not asking too much, but this is the first real source
of info on TC/TT I've found.
--
You received this message because you are subscribed to the Google Groups "Tokyo Cabinet Users" group.
To post to this group, send email to tokyocabi...@googlegroups.com.
To unsubscribe from this group, send email to tokyocabinet-us...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/tokyocabinet-users?hl=en.
> No idea about #fpow, can anyone offer me some help there?
>
> No idea about #xmsiz, the link above makes it sound like it
> might be the total
> amount of cache memory available to TC, can anyone clear that
> up?
There's an article here:
http://blog.grayproductions.net/articles/tokyo_cabinets_keyvalue_database_types
which has a little about tuning parameters at the bottom.
Hope this helps,
Hamish
Thanks for the posts everybody. I've set up a quick test box
in order to play around with some of the tuning parameters of TC.
Things have worked out great so far. Using a few Tokyo Tyrant
implementations, I'm achieving 60K inserts per second against a
standard run-of-the-mill workstation. I'm hoping that will scale up
to 200-300k per second as we move up to a server-class machine.
I've hit 800 million records in my database as I'm writing this, with
the database consuming about 23 GB of disk space. Keeping in mind that
the test system contains only a single 7200RPM sata hard-drive, 2GB of
RAM and a dual core CPU, I'm quite impressed that it is able to
respond to read requests on average within 20-25ms WHILE undergoing
60K writes per second. (actually, the write speed dropped about 20-25%
while I was doing 10000 random reads, but even that is respectable).
For anyone interested, my tuning parameters are set as follows:
#opts=l
It took me a while to figure out that I had to rebuild TC with
--enable-off64 to bypass the 2GB default limit (even with opts=l set) on
i386 systems, which by default don't support 64-bit file pointers.
Once this option was set, the database took off like a champ, and
hasn't looked back at all. I haven't set any compression features
because I'm using binary keys and values, so I'm not expecting to
achieve much in the way of compression regardless. However, it is
worth mentioning that the math for my database works out pretty much
exactly to what is predicted by TC: I'm getting 800 million records in
23GB of space = 28.75 bytes per record. I'm using 16-byte keys and
8-byte values consistently; add in the 5-byte overhead predicted by TC
and you get 29 bytes, which is almost exactly what I'm getting. This
is a perfectly acceptable number for me, so I'm not too worried about
compression.
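The per-record arithmetic above can be checked with a few lines of Python. This is only a sketch: the 5-byte overhead is the figure quoted in this post, "23GB" is read as decimal gigabytes, and the function names are made up for illustration.

```python
# Sizes quoted in the post; the 5-byte overhead is the poster's figure.
KEY_BYTES = 16
VALUE_BYTES = 8
OVERHEAD_BYTES = 5


def predicted_record_size(key_bytes, value_bytes, overhead_bytes):
    """Expected on-disk bytes per record: key + value + per-record overhead."""
    return key_bytes + value_bytes + overhead_bytes


def observed_record_size(file_bytes, record_count):
    """Average on-disk bytes per record, measured from the file size."""
    return file_bytes / record_count


predicted = predicted_record_size(KEY_BYTES, VALUE_BYTES, OVERHEAD_BYTES)
# Assumption: 23GB means 23 * 10**9 bytes, holding 800 million records.
observed = observed_record_size(23 * 10**9, 800 * 10**6)
print(predicted, observed)
```

The two numbers differ by a quarter of a byte per record, which is within the rounding of "about 23 GB".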
#bnum=100,000,000 (commas added for readability)
When I restarted TT after changing bnum to 100 million, I made a point
of watching it allocate the initial database file and RAM size, which
stopped at about 800MB. This essentially means that the initial hash
table stores NOTHING but a 64-bit file pointer per bucket (100 million
* 8 bytes = 800MB), which is pretty impressive. Even assuming this entire
hash table is loaded into RAM, this is only 800MB of RAM being used
(right now the entire server is using 918MB of RAM, so I can probably
even boost this up slightly if I decide to restart the test). When I
get my production servers (which are going to have 24GB of RAM and be
100% dedicated boxes), I will probably boost this number to one and a
half billion or so, and expect it to use about 12GB of RAM for the
hash table, leaving me 10GB or so of RAM to play around with for the
node caches.
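A rough sketch of the bucket-array estimate used above. It assumes, as this post does, that each bucket costs exactly one 8-byte file pointer when opts=l is set; the helper name is hypothetical and real memory use will include other structures.

```python
# Assumption from the post: one 64-bit (8-byte) file pointer per bucket.
BYTES_PER_BUCKET = 8


def bucket_array_bytes(bnum, bytes_per_bucket=BYTES_PER_BUCKET):
    """Estimated size of the bucket array for a given bnum."""
    return bnum * bytes_per_bucket


test_box = bucket_array_bytes(100_000_000)      # current test setting
production = bucket_array_bytes(1_500_000_000)  # planned production setting
print(test_box / 10**6, "MB")    # 800.0 MB
print(production / 10**9, "GB")  # 12.0 GB
```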
#lmemb=512 and #nmemb=512
One of the primary operations we will be performing against this
database will be reading ranges of values, sometimes in very large
ranges. For this reason, I'm more sure about my choice to make #lmemb
larger. When TC loads the entire leaf node (which contains the actual
records?) it should be able to read records sequentially. Since we are
hoping to read a maximum of between 10,000 and 20,000 records at a
time, I'll probably increase #lmemb to 1024 or 2048 on a production
system. With faster server hard drives, this shouldn't increase read
delay by much, but should provide much more efficient access to large
sequential blocks. Also, since TC uses msync and memory-mapped files,
I would assume that increasing the size of leaf nodes while doing
sequential inserts of records with similar keys could potentially
greatly improve write performance. I'm going to have to do some
serious thinking about #nmemb; I'm not at all sure whether increasing it
any more will have any significant effect on speed. I think I will
balance it in the end such that #lmemb * #nmemb is about 100,000,
since that is about the absolute maximum number of records we will
read sequentially, but I'm not even too sure about that.
#xmsiz=50,000,000
I can't say anything about whether this was used significantly by TC,
still not sure quite what it is used for, hopefully this will become
clearer as I keep playing around with it.
I didn't change either of the cache values, so they were still at
their defaults of 1024 leaf nodes and 512 non-leaf nodes. It's
interesting to note that:
1024 leaf nodes * 512 records per node * 29 bytes per record = 15MB
512 non-leaf nodes * 512 records per node * 8 bytes per file pointer = 2MB
If I'm not mistaken, I was only using 17MB of memory for cache.
Considering that I was writing a grand total of about 2MB / second,
this seems somewhat inadequate. In future tests, I will probably bump
these numbers up significantly, perhaps up to 20,000 leaf nodes (about
300MB of RAM) and 25,000 non-leaf nodes (about 100MB of RAM), just
to see what sort of difference that makes.
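The cache estimates above follow one formula, sketched here in Python. The byte counts per member (29 for a leaf record, 8 for a non-leaf pointer) are the assumptions made in this post, not measured values, and the function name is hypothetical.

```python
def cache_bytes(nodes, members_per_node, bytes_per_member):
    """Estimated memory for a node cache that is completely full."""
    return nodes * members_per_node * bytes_per_member


# Defaults mentioned in the post, with lmemb = nmemb = 512.
leaf_default = cache_bytes(1024, 512, 29)
nonleaf_default = cache_bytes(512, 512, 8)

# Values proposed for future tests.
leaf_planned = cache_bytes(20_000, 512, 29)
nonleaf_planned = cache_bytes(25_000, 512, 8)

for name, size in [("leaf default", leaf_default),
                   ("non-leaf default", nonleaf_default),
                   ("leaf planned", leaf_planned),
                   ("non-leaf planned", nonleaf_planned)]:
    print(f"{name}: {size / 10**6:.1f} MB")
```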
Out of curiosity, does anyone have any information on what sort of
caching strategies TC uses for cached pages?
Thanks for your help, and hopefully some of my information was useful
in response
Loren Van Spronsen
Sounds like a good idea, but what I really think we need is specific
examples of what each tuning parameter will affect. I'm going to be
setting up a small farm of high-powered TC/TT machines over the next
couple of months, so I expect to be poring over a lot of the C
code for the BDB implementation. When I feel a little more comfortable
with some more of the options (most of what I've gone through now is
just simple benchmarking and common sense), I'll whip up a topic for
what I've noticed and how I feel tuning different parameters will
affect different environments.
Until then, your idea sounds good. I'm still at work, but I might get
that done tonight, who knows. My database just hit a billion rows,
which is a bit of a milestone :)
I don't suppose anyone has any information on file systems and TC/TT?
Right now I'm just using ext4 (I've played around with disabling and
enabling journaling, which as of yet hasn't seemed to make a
tremendous difference), but has anyone had particular success with any
common Linux file system?
Loren Van Spronsen
I'll definitely be posting more test results as things progress,
although some of them will probably be VM based.
Thanks for the info on file systems, some of it is new to me. Things
like ReiserFS, SpadFS and Hadoop don't really seem practical for TC/TT,
since it uses a single large file.
I think the safest bets are definitely ext2/ext3/ext4 and ZFS/Btrfs. I
might post some comparison results when I get the opportunity to run
more tests against some VMs... Possibly tonight yet...
Thanks for all of your responses guys,
Loren Van Spronsen
On Mar 18, 6:37 pm, Vinicius Tinti <viniciusti...@gmail.com> wrote:
> Hi,
>
> We can create a little set of tests, share them, run them and report
> back to the community. In my work, we have a small cluster of
> high-powered machines.
>
> I have only used TC/TT with ext3 and ext4, but I have never worried about it.
> It is very important and could lead to better or poorer performance.
>
> About file systems I know that:
>
> ext3 - is too slow for small files; it doesn't have any optimizations for
> them. It also keeps better consistency of the files than the file system
> itself.
>
> ext4 - (I don't know very much) I have read that the developers have fixed
> the problem of small files and increased the performance (take a look at
> Phoronix, http://www.phoronix.com/, for benchmarks). Google is migrating
> (or has migrated) to ext4:
> http://arstechnica.com/open-source/news/2010/01/google-upgrading-to-e...
> ...
I would guess that the largest difference between the performance I'm
getting on a single drive and what you're getting on a RAID 0 has to do
with the size of the values.
That being said, I'm using a custom set of .NET bindings I created in
about 3 hours yesterday (implemented the vast majority of the binary
protocol). I would give the binary protocol a shot; it is VERY minimal
and absolutely optimized for speed. However, I reckon that hard drive
and IO speed are your bottleneck, just like they are mine. If you are
like me and are using binary keys and very compact (not necessarily
small, but dense) values, I might suggest trying to disable
compression. I noticed a relatively significant increase in IO
throughput when I disabled compression features. I'm only using 24
bytes total for a single record, so it wasn't a big deal for me.
I explained my parameters for starting the server above. I believe 8
worker threads is the default, so that hasn't changed... Would you
mind explaining what the -uas switch is for? My complete set of
parameters would be something along the lines of (commas added for
readability):
ttserver -port 1978 -thnum 8 /var/ttserver/casket.tcb#opts=l#bnum=100,000,000#lmemb=512#nmemb=512
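Since the commas in the command above are only there for readability, here is a small sketch that assembles the same command line without them. The helper names are made up for illustration; the flags and tuning parameters are just the ones quoted in this post.

```python
def tuning_suffix(params):
    """Render {'bnum': 100_000_000, ...} as '#bnum=100000000#...'."""
    return "".join(f"#{name}={value}" for name, value in params.items())


def ttserver_command(db_path, params, port=1978, threads=8):
    """Build the argument list for the ttserver invocation quoted above."""
    return ["ttserver", "-port", str(port), "-thnum", str(threads),
            db_path + tuning_suffix(params)]


cmd = ttserver_command(
    "/var/ttserver/casket.tcb",
    {"opts": "l", "bnum": 100_000_000, "lmemb": 512, "nmemb": 512},
)
print(" ".join(cmd))
```

Keeping the parameters in a dictionary makes it easier to vary one value per test run. If the printed line is pasted into a shell, the database argument needs quoting so that `#` is not treated as a comment.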
I'll add more information about the xmsiz and cache values that I'm
using when I learn a little more about how they impact performance.
Just for comparison, would you mind mentioning which Linux distro,
file system, and speed of hard drive you are using? I'm just trying to
get a decent grasp of what I can expect from TC/TT in any given
environment.
Thanks,
Loren Van Spronsen
A couple of other things came to mind too... Which API call are you
using for inserts (put vs putnr vs ...)? I don't really like the misc
calls bundled with TT, so I've been writing server additions in C for
doing things like this. I've built my own custom mput call, and I'm
inserting keys/values in groups of about 3000.
Also, one thing that might be incredibly important is that my key
inserts are sequential. I'm inserting 3000 SEQUENTIAL keys into my
database at a time, which could artificially inflate my write
benchmarks...
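To make the batching idea concrete, here is a minimal sketch of grouping sequential key/value pairs into batches of 3000. The key layout (a 16-byte big-endian counter) and the 8-byte double values are assumptions for illustration, and the step that sends each batch to the server is left out because the custom mput call is not shown in this thread.

```python
import struct
from itertools import islice


def sequential_pairs(count):
    """Yield hypothetical records: 16-byte big-endian keys, 8-byte doubles."""
    for i in range(count):
        yield i.to_bytes(16, "big"), struct.pack(">d", float(i))


def batches(pairs, size=3000):
    """Group an iterable of (key, value) pairs into lists of at most `size`."""
    iterator = iter(pairs)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


groups = list(batches(sequential_pairs(7000)))
print([len(group) for group in groups])
```

Fixed-width big-endian keys compare byte by byte in the same order as the integers they encode, so keys generated this way arrive already sorted under a plain bytewise comparison.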
I think I had one other thing, but I can't recall it now, maybe later
Loren Van Spronsen
So I've set up a virtual machine at home and I'm in the process of
running between 15 and 20 automated test runs of 20 million inserts
for various tuning parameters.
One of the first things I tested was the insertion of values using the
standard put command, which happens to be very similar to the memcache
set command. Both of these commands send the key/value pair to the
server and then WAIT FOR A RESPONSE. My home machine (2 * 640GB 7200
RPM SATA II hard drives in RAID 0 w/ Intel Core i7 and 12GB of RAM, 4GB
dedicated to the virtual machine) is quite similar to your box, and I
noticed that when I ran using this insertion method, I maxed out at
pretty much exactly 1k inserts per second. When I replaced this
operation with putnr (a binary command that puts without waiting for a
response), I achieved about 200 times that (yeah, I was hitting 200k
inserts per second on my virtual machine).
You may want to look into benchmarking the binary putnr command... I
looked through the memcache spec and it doesn't appear to have a
similar command. If you still want to wait for a response, consider
using the binary putlist command, which puts multiple records at a
time before waiting for a response. The key here is avoiding network
latency, which is a significant issue if you are inserting 1000
records a second...
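For anyone writing their own client, here is a sketch of how a putnr request could be framed. It assumes the layout I recall from the Tokyo Tyrant binary protocol documentation: the two bytes 0xC8 0x18, then the key size and value size as 32-bit big-endian integers, then the key and value, with no response from the server. Please verify this against the protocol description for your version before relying on it.

```python
import struct

# Assumed command ID for putnr: magic byte 0xC8 followed by 0x18.
PUTNR_MAGIC = bytes([0xC8, 0x18])


def putnr_frame(key: bytes, value: bytes) -> bytes:
    """Encode one putnr request: magic, key size, value size, key, value."""
    return PUTNR_MAGIC + struct.pack(">II", len(key), len(value)) + key + value


def putnr_payload(pairs) -> bytes:
    """Concatenate many requests so they can go out in a single send."""
    return b"".join(putnr_frame(key, value) for key, value in pairs)


frame = putnr_frame(b"k" * 16, struct.pack(">d", 1.5))
print(len(frame), frame[:10].hex())
```

Because nothing comes back, many frames can be written to the socket back to back, which is where the gain over a put that waits for each reply comes from. The cost is that the client never learns whether an individual insert failed.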
It also makes sense that you could increase your throughput by
upping the number of threads. Each thread would independently be
waiting for the server to respond to its last insert, and would
proceed once it had that response. This allows you to wait for N
responses at the same time, and proceed when any of them is received.
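A back-of-the-envelope model of the point above, as a sketch only. It assumes every request costs one full round trip, that threads do not contend with each other, and that the server and disk keep up; the 1 ms round trip is an assumption picked to match the roughly 1k inserts per second reported for a single synchronous client.

```python
def inserts_per_second(round_trip_s, threads=1, records_per_request=1):
    """Upper bound on throughput when every request waits for its reply."""
    return threads * records_per_request / round_trip_s


ROUND_TRIP_S = 0.001  # assumed round-trip time of 1 ms

single = inserts_per_second(ROUND_TRIP_S)
threaded = inserts_per_second(ROUND_TRIP_S, threads=8)
batched = inserts_per_second(ROUND_TRIP_S, records_per_request=100)
print(round(single), round(threaded), round(batched))
```

In this model, adding threads and batching records per request multiply throughput in the same way, until something other than latency becomes the bottleneck. putnr falls outside the model because it never waits for a reply.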
Hope that helps you somewhat,
- Loren Van Spronsen
On Mar 18, 10:14 pm, Tom Chen <t...@gogii.net> wrote:
> Interesting note.
>
> My sequential inserts max out at 2k, however I can spin up 7 more threads
> that can output 2-3k/s sets in a random range from 1 to 20 million entries.
> My profiling shows my total ops at about 30k/s.
>
> I'm using java xmemcache protocol, I've gotten the same performance with the
> spy memcache protocol.
>
> Tom
Hi,
Another thing that I use is a simple multiplexer. Instead of using a single database,
let's suppose that I use five. Then before any operation on any of the databases I need to
hash the key to discover which database it should be in, using a hash table and a hash
function (e.g. hash key 'loren'; it returns 5, so it should be in database 5). I don't really
know if it is the best way to boost databases, but it avoids one database becoming
huge. Huge databases are slower than small ones (it makes sense, but again I
can't prove it). TT seems to have built-in functionality for this, but I have no idea how to use it:
in the ttserver section: "-mul num: specify the division number of the multiple database mechanism."
I don't really know what it means. Another very important thing that may
help: your hash function needs to be a very good one, fast, with a good range and low
cost. Consistent hashing may help with adding and removing databases. I also want
to benchmark (one day I will, I hope) kumofs, which Flinn has reported to
this group: http://github.com/etolabo/kumofs .
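A minimal sketch of the multiplexer idea described above: hash the key and take the result modulo the number of databases. CRC32 is used here only because it ships with Python; it is an assumption for illustration, not the hash that TT's -mul option uses internally, and the database numbers are 0-based.

```python
import zlib


def database_index(key: bytes, database_count: int) -> int:
    """Map a key to a 0-based database number via CRC32 modulo the count."""
    if database_count < 1:
        raise ValueError("database_count must be at least 1")
    return zlib.crc32(key) % database_count


# Spread some hypothetical keys over five databases and count the hits.
counts = [0] * 5
for i in range(10_000):
    counts[database_index(f"key-{i}".encode(), 5)] += 1
print(counts)
```

Plain modulo hashing is simple and fast, but changing the number of databases remaps most keys. That is the problem consistent hashing is meant to reduce.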
Thanks Vinicius and Flinn for the extra info on mul. I was originally
thinking of implementing a hash ring to split data across multiple
files, but it appears TT handles that for me, one more point for TT...
A little more information after doing some quick tests on the mul
option. Mul effectively multiplies all your settings by whichever value
you supply, so if you're trying to stay within a RAM budget, make sure
you divide your settings by your mul number.
For example, I ran "ttserver -mul 10 /var/ttserver/casket.tcb#opts=l#bnum=10,000,000#nmemb=512#lmemb=512", which caused
TC/TT to create a directory at /var/ttserver/casket.tcb and fill it
with 10 files. Each of these 10 files was about 81MB in size. Now, if
we do some quick math, we get 10,000,000 * 8 bytes = 80,000,000 bytes,
which is the size of each file. So, we see that TC/TT hasn't divided any of
our options by 10, but has opened 10 databases that all have these
options. This can be verified by looking at ttserver's memory usage,
which is approximately 900MB.
That being said, I like the idea of opening the databases in multiple
files, which should make things a little nicer on the file system and
the memory mapped files. The idea now is to split the data across 15
databases, and set bnum=100,000,000 for each of them (once I get my
production server).
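The mul arithmetic above, as a short sketch. It assumes, as observed in this test, that every one of the mul databases gets the full bnum and that each bucket costs 8 bytes; the helper name is made up.

```python
def mul_bucket_bytes(bnum, mul, bytes_per_bucket=8):
    """Return (bytes per database file, bytes across all mul databases)."""
    per_file = bnum * bytes_per_bucket
    return per_file, per_file * mul


# The test above: bnum=10,000,000 with -mul 10.
per_file, total = mul_bucket_bytes(10_000_000, 10)
# The production plan: 15 databases with bnum=100,000,000 each.
plan_per_file, plan_total = mul_bucket_bytes(100_000_000, 15)

print(per_file, total)
print(plan_per_file, plan_total)
```

The 15-database plan comes to the same 12GB of bucket arrays as a single database with a bnum of one and a half billion.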
Now, I'm running quick tests against a VM running on my laptop, so
these numbers are probably going to be significantly lower than on any
decent production machine, but hopefully the ratio is relatively
constant. While running an insert test against this VM with
bnum=10,000,000#lmemb=512#nmemb=512 and mul=10, I was able to insert
7,000,000 records in about one minute. I timed it in groups of 700,000
records, and each group had more or less similar times (I didn't see any
noticeable increase in times between records 0-700,000 and records
6,300,000-7,000,000).
Running the same insert test against a VM with
#bnum=100,000,000#lmemb=512#nmemb=512 and mul=1 (similar
configuration, just in a single file) took about 53 seconds, which is
slightly faster. The insert batches were consistently a little bit
faster, which suggests that single files might be better for smaller
databases (I'm only using 7,000,000 records)...
Running the same two tests against the same VM, but inserting
70,000,000 rows, produces 582 seconds for 69,120,000 inserts with
mul=10 (with fairly constant insert times for batches of 700,000
records) and 588 seconds for 69,120,000 inserts with mul=1 and
bnum=100,000,000. Overall the times were relatively similar between
the two runs; however, I noticed that the batch times were much more
consistent with the mul=10 run. The mul=10 run was pretty constant
with 5.5-ish second batches, with the occasional spike up to 7.5-8.0
seconds (in which I'm assuming TC is doing an msync). The mul=1 run, on
the other hand, showed much more variance between test runs, with
times from 4.8-8.5 seconds pretty much scattered throughout. I'm not quite
sure how this plays out, but I think the consistency aspect sells me
on using a multiple database for huge databases, as I'm still
relatively sure that it will increase the efficiency of the msync
operations...
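For comparison, the run times quoted above can be turned into inserts per second. This is a sketch: "about one minute" is taken as 60 seconds, and the labels are shorthand for the configurations described in the text.

```python
# (records inserted, elapsed seconds) as quoted in the post.
runs = {
    "7M, ten files (bnum=10,000,000 each)": (7_000_000, 60),  # "about one minute"
    "7M, single file (bnum=100,000,000)": (7_000_000, 53),
    "69.12M, ten files (bnum=10,000,000 each)": (69_120_000, 582),
    "69.12M, single file (bnum=100,000,000)": (69_120_000, 588),
}

rates = {name: records / seconds for name, (records, seconds) in runs.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:,.0f} inserts/s")
```

At the larger size the two layouts land within about 1% of each other, so the difference that matters in these tests is the steadier batch times, not the overall rate.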
Hope that helps some people
Loren Van Spronsen
On Mar 19, 6:57 am, Flinn Mueller <theflinns...@gmail.com> wrote:
> On Mar 19, 2010, at 5:13 AM, Vinicius Tinti wrote:
>
> > Hi,
>
> > Another thing that I use is a simple multiplexer. Instead of using a single database,
> > let's suppose that I use five. Then before any operation on any of the databases I need to
> > hash the key to discover which database it should be in, using a hash table and a hash
> > function (e.g. hash key 'loren'; it returns 5, so it should be in database 5). I don't really
> > know if it is the best way to boost databases, but it avoids one database becoming
> > huge. Huge databases are slower than small ones (it makes sense, but again I
> > can't prove it). TT seems to have built-in functionality for this, but I have no idea how to use it:
>
> > in the ttserver section: "-mul num : specify the division number of the multiple database mechanism."
>
> -mul splits the file into n files. As I understand it, this splits your DB into multiple DBs and provides simple hashing to manage which DB your key goes into. This is transparent to a Tokyo Tyrant client since it's built right into the TCADB interface. That's entirely internal though and not available as a user function.
>
> > I don't really know what it means. Another very important thing that may
> > help: your hash function needs to be a very good one, fast, with a good range and low
> > cost. Consistent hashing may help with adding and removing databases. I also want
> > to benchmark (one day I will, I hope) kumofs, which Flinn has reported to
> > this group: http://github.com/etolabo/kumofs.
>
> Tokyo Cabinet also has more complex hash ring functionality in TCCHIDX. If you were building your own Tokyo Tyrant client I believe you could use TCCHIDX, but again you still must deal with managing a hash ring, which is always a pain point on the client. Another pain point with the native C Tokyo Tyrant interface I've run into is how TCRDB handles (or doesn't handle) a dead connection. When you've got a ring of clients you need to know when one of them is dead so you can manage your ring.
>
> > Cheers,
> > Vinicius
Just one more follow-up on my test box (the one with the single
7200RPM hard drive, 2GB of RAM and the dual-core CPU). Its
database is now up to about 1.4 billion records, each with a 16-byte
key and an 8-byte value. I'm still able to do inserts at 40-50 thousand
per second, and I just got a get-range system up and running. I'm able
to retrieve about 36,000 sequential keys/values in about half a
second, which is pretty spectacular considering the machine. That's
more than 1MB of raw data in the database coming back in half a
second. I'm hoping that speeds up even further when I move up to
server-class hardware.
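A quick sketch of the range-read numbers above. It assumes the 16-byte keys and 8-byte values stated here plus the 5-byte per-record overhead quoted earlier in the thread, and takes "about half a second" as exactly 0.5 s.

```python
RECORDS = 36_000
SECONDS = 0.5          # "about half a second"
PAYLOAD_BYTES = 16 + 8  # key + value
ON_DISK_BYTES = 29      # payload plus the 5-byte overhead quoted earlier

records_per_second = RECORDS / SECONDS
payload_total = RECORDS * PAYLOAD_BYTES
on_disk_total = RECORDS * ON_DISK_BYTES

print(records_per_second, payload_total, on_disk_total)
```

The key/value payload alone is a little under 1MB. The "more than 1MB" figure matches the on-disk size that includes the per-record overhead.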
Thanks for all your help,
Loren Van Spronsen
Since there was a topic about the multiple abstract database (-mul), I did
a quick translation of Mikio's blog article related to the topic.
http://tokyocabinetwiki.pbworks.com/29_using_multiple_abstract_database
There were certain parts which I did not understand while I was
translating, so some of them may not make much sense.
Let me know if there are any parts hard to understand. I will try to
revise.
Thanks.
Makoto
The parts on @ and % are a little unclear. Reading the tcadbmulmisc code, it appears that if you don't prefix getlist and putlist, the command will execute on all ADB files; prefixing will use the consistent hashed key to execute the command only on the ADB file relevant to the argument.
Haven't been calling sync as of yet (not manually at least). I suppose
I should throw that at the end of all my test cases to get actual
results... I might re-run some tests with that in place. Work has
taken a bit of a detour from the Tokyo Tyrant side of things, so
it might be a month or so before I get into actually implementing my
production database...
Loren Van Spronsen