Is RethinkDB for me? Very high write performance of small documents.


s...@heather.sh

Nov 13, 2014, 6:11:50 AM
to reth...@googlegroups.com, mjm...@york.ac.uk
Hi there,

I'm trying to find a database solution for a system I'm working on, but couldn't find specific metrics on RethinkDB.

We're looking to store JSON documents, maximum 100 bytes in size (although usually less), in near real time (anything written needs to be query-able within 2 seconds of being inserted).

We're looking to be able to support 500,000 inserts per second, across multiple shards (each insert is a max 100-byte JSON document), and then to be able to query this data from the JSON (i.e. the key for the data can be within the JSON, rather than just the id of the document in the database).

We also want to be able to read these documents at a rate of around 2 million gets per second. But I think this is less of an ask than the inserts :)

I couldn't see any specific metrics for RethinkDB, and getting our test harness set up takes quite a bit of time. I was wondering if anyone who knows RethinkDB well could let me know their thoughts on this: whether they think RethinkDB is suitable, and *roughly* how many shards they estimate we would need for this performance.

We're attracted to RethinkDB because of its elasticity - we need a system that's scalable, both up and down, and we like how simple it is to add servers to a cluster and rebalance/redistribute documents between shards.

Thanks!

Sam

Daniel Mewes

Nov 13, 2014, 12:10:28 PM
to reth...@googlegroups.com
Hi Sam,

RethinkDB sounds like a good fit feature-wise. In the default mode, every write is immediately visible to subsequent reads. Alternatively, you can use "outdated" reads (see the `useOutdated` flag http://www.rethinkdb.com/api/javascript/table/ ) to boost read scalability. In that case you can get a bit of a delay, but typically less than a second.
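
Roughly, with the Python driver, an outdated read would look something like the sketch below (the table name and host are placeholders, and I'm assuming the Python driver's `use_outdated` keyword as the equivalent of the JavaScript `useOutdated` option):

import rethinkdb as r

conn = r.connect('localhost', 28015)

# Default read: sees all acknowledged writes immediately.
doc = r.table('events').get('some-id').run(conn)

# "Outdated" read: may lag slightly behind the newest writes, but can be
# served by any replica, which helps read scalability.
doc = r.table('events', use_outdated=True).get('some-id').run(conn)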

We haven't published any benchmarks yet, because we are still optimizing many aspects of RethinkDB and performing good benchmarks is not easy. We are going to do so soon.
I think the only way to find out is to run a test yourself with your own data, since performance is going to depend considerably on the details of your workload.

Very roughly speaking I expect that you will need at least 10 shards on fast machines (with SSD storage) to get in the right throughput range.


http://rethinkdb.com/docs/troubleshooting/#my-insert-queries-are-slow.-how-can-i-speed-t might also be interesting for you when you set up your benchmark.


- Daniel




Sam Heather

Nov 13, 2014, 1:51:20 PM
to reth...@googlegroups.com, mjm...@york.ac.uk
Hi Daniel,

Thank you for the reply - useful information! We're currently weighing up Couchbase vs RethinkDB.

Best,

Sam

Sam Heather

Nov 17, 2014, 10:21:20 AM
to reth...@googlegroups.com
Hi Daniel,

We've run a benchmark on RethinkDB today vs the same benchmark on MongoDB. I'd like to get your opinion on the results, please.

I've pasted the simple code below that we are using to do our inserts. On MongoDB, we consistently get between 15,000 and 25,000 inserts per second. However, on RethinkDB, we get much less. The average is probably around 1k (70% of the time the insert rate is 0, then in the gaps there are spikes up to 3,000 inserts per second). We're using all the optimisations suggested in the link you gave in your last message. Do you have any thoughts as to why RethinkDB is performing so much slower than Mongo on the same hardware? A screenshot of the performance graph is pasted below.

Thanks,

Sam


import rethinkdb as r

# Single connection, registered as the default connection for .run().
r.connect('db2.pwserv.me', 28015).repl()

# One-time table setup (already created):
# r.db('test').table_create('tv_shows').run()

x = {
    "position": {
        "type": "Point",
        "coordinates": [100.001, 100.001]
    },
    "sessionId": "15",
    "type": "ping",
    "userTime": 1416085847,
    "serverTime": 1416085839
}

# Insert the same document 100,000 times, one query per document,
# with soft durability and without waiting for replies.
for i in range(100000):
    r.table('tv_shows').insert(x).run(durability='soft', noreply=True)



dan...@rethinkdb.com

Nov 17, 2014, 6:56:57 PM
to reth...@googlegroups.com
Hi Sam,

Are you using an SSD or a rotational hard drive?

I haven't gotten around to actually running your code yet; I will do that later.

Here are a few things off the top of my head that can have an impact:

- Queries on a single connection are run one after another in RethinkDB. You'll probably get much better throughput by using multiple concurrent connections.
- Another optimization is batching writes together. You can pass an array of documents to insert, which is much more efficient (good batch sizes are 100-500 documents at a time); see the sketch after this list. This might not reflect your actual production workload, though.
- It's possible that the writes are disk bound. Which write concern are you using with MongoDB? By default, MongoDB might accumulate a much larger amount of unwritten data in RAM than RethinkDB before queries start getting slower. Unless the disks are very slow, I doubt this is the bottleneck here though.
- What is the cache size RethinkDB is running with? You can find the cache size of a running RethinkDB server in its log (through the web UI). It will be dynamically configured on startup. I recommend setting it manually by specifying the `--cache-size <size in MB>` command line argument when starting RethinkDB. It should probably be at least half the RAM in your machine.
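
To make the first two points concrete, here is a rough sketch (not a tuned benchmark; host, table name, batch size, and thread count are just placeholders) that batches documents and spreads inserts over several connections with the Python driver:

import threading
import rethinkdb as r

HOST, PORT, TABLE = 'localhost', 28015, 'tv_shows'
DOC = {"sessionId": "15", "type": "ping", "userTime": 1416085847}

def insert_worker(num_batches, batch_size=200):
    # One connection per thread, since queries on a single connection
    # are executed one after another.
    conn = r.connect(HOST, PORT)
    batch = [DOC] * batch_size
    for _ in range(num_batches):
        # One insert query per batch instead of one per document.
        r.table(TABLE).insert(batch).run(conn, durability='soft', noreply=True)
    conn.close(noreply_wait=True)

threads = [threading.Thread(target=insert_worker, args=(100,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()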


Best,
  Daniel

dan...@rethinkdb.com

Nov 17, 2014, 10:17:25 PM
to reth...@googlegroups.com
Hi Sam,

I've tried your test code in the meantime and I can reproduce the behavior you are seeing. Increasing concurrency doesn't seem to help either.

I think this is a bug in RethinkDB, as it doesn't combine write transactions before sending them to disk the way it should.
We will investigate this and try to provide a fix as soon as possible. I'll let you know once we have an update.

Thank you for bringing this to our attention. Performance optimization is still an ongoing process in RethinkDB, and this is going to help us make it better.


-  Daniel





dan...@rethinkdb.com

Nov 18, 2014, 2:36:04 PM
to reth...@googlegroups.com
Sam, I've opened https://github.com/rethinkdb/rethinkdb/issues/3348 on our issue tracker to keep track of the performance problem.

If you're on SSDs, you probably can boost your insert throughput by using multiple concurrent connections by the way. At least that worked in my tests.

pzoln...@gmail.com

Nov 27, 2014, 8:45:14 AM
to reth...@googlegroups.com
On Tuesday, November 18, 2014 8:36:04 PM UTC+1, dan...@rethinkdb.com wrote:
> Sam, I've opened https://github.com/rethinkdb/rethinkdb/issues/3348 on our issue tracker to keep track of the performance problem.
>
> If you're on SSDs, you probably can boost your insert throughput by using multiple concurrent connections by the way. At least that worked in my tests.

I am bulk importing 3.6 million documents like the one below via `rethinkdb import` - getting no more than 1,000 writes/s.

{ "tlc": "AAH", "mhc": "AAH00002", "room": "DZ", "ext_code": "AAH305", "currency": "EUR", "meal": "F", "checkin": "20141125", "adt": 2, "chd": 0, "prices":[{"stay": 1, "total": 42.38},{"stay": 2, "total": 84.77},{"stay": 3, "total": 127.16},{"stay": 4, "total": 169.54}]}

Also, doing a .count() afterwards takes 22s!

That doesn't seem right... I am on a Mac with an SSD drive.

dan...@rethinkdb.com

Dec 1, 2014, 1:33:04 PM
to reth...@googlegroups.com, pzoln...@gmail.com
Hi,
Such a slow insert throughput is indeed odd. It's possible that `rethinkdb import` doesn't choose the batch size very well. Can you try increasing the number of concurrent clients for `rethinkdb import` through the `--clients NUM_CLIENTS` command line option? 64 might be a good value to try.
If this doesn't help, there might be something else wrong with either `rethinkdb import` or the server. We would have to investigate that.

Another thing worth checking is the server's cache size. If you go to the web UI and the "Logs" page, you will find an entry such as "Using cache size of ... MB". RethinkDB by default configures its cache size based on the available RAM on the machine when it is started, but that's not always reliable. You might get better results by increasing the cache size through the `--cache-size <MB>` parameter to the RethinkDB server. Good values are in the range of half the amount of RAM in your machine.

A small cache could also explain the slow count, as it might have been necessary to load a lot of data from the SSD for the query.
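
For reference, the two settings mentioned above would be used roughly like this (file name, table, client count, and cache size are only examples):

# more concurrent import clients
rethinkdb import -f hotels.json --table test.hotels --clients 64

# start the server with an explicit 4 GB cache (e.g. half of 8 GB RAM)
rethinkdb --cache-size 4096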


Best,
  Daniel

thomas....@gmail.com

Jun 3, 2015, 11:30:59 AM
to reth...@googlegroups.com, mjm...@york.ac.uk
Hi,

Does anybody here know if a solution to this was provided in the end?
The thread finishes with no answer, fix, or official RethinkDB comparison to give us a definitive response. I'm working to adopt RethinkDB, but the performance concern has been a blocker for stakeholders.

Daniel Mewes

Jun 3, 2015, 3:41:34 PM
to reth...@googlegroups.com
I don't know for sure what caused this. My best guess would be that the cache size was too small.
We've also switched to kqueue on OS X in 1.16, which would speed up the count, assuming that it was in memory.

We regularly get write throughputs much higher than this and will be publishing an official performance report soon.

Best,
  Daniel

ulpian...@gmail.com

Jun 16, 2015, 12:12:47 PM
to reth...@googlegroups.com
I too would be extremely interested in this. I ran some benchmarks a few weeks/months ago against Mongo with 10 million documents, and RethinkDB was around the same 70% mark slower (I was testing joins with RethinkDB). *With no indexing on either DB.

* I ran the query through the RethinkDB web console.

I now want to test it against MongoDB v3, as they state on their site that the new version is much faster than 2.6. I expect it will then be much faster than RethinkDB.

I really want to use RethinkDB, but I need the performance.

Ulpian

Daniel Mewes

Jun 16, 2015, 7:25:40 PM
to reth...@googlegroups.com
Hi Ulpian, that sounds like it might be a somewhat different workload.
How did you perform the join? How big were the involved data sets?

I recommend measuring performance with one of the client drivers, and not directly in the Data Explorer. The Data Explorer isn't that great for performance testing, since a) for certain queries it only retrieves the first part of the result set, which will impact the timing, and b) the Data Explorer itself is very slow in encoding queries and decoding query results, which in some cases can have quite a significant impact on the total execution time.
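
As a rough illustration of timing from a driver instead (the table name and filter here are made up), something like this forces the whole result set to be retrieved before stopping the clock:

import time
import rethinkdb as r

conn = r.connect('localhost', 28015)

start = time.time()
# list() drains the cursor, so the measurement covers the full result set,
# not just the first batch that the Data Explorer would display.
rows = list(r.table('tv_shows').filter({"type": "ping"}).run(conn))
elapsed = time.time() - start

print("%d rows in %.3f s" % (len(rows), elapsed))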

- Daniel

Samuel Goldenbaum

Jun 18, 2015, 7:38:09 AM
to reth...@googlegroups.com, ulpian...@gmail.com
Ran some tests:
  • 10m docs inserted
  • JavaScript drivers
  • batched 200 records at a time - best practice from docs
  • durability: soft (don't wait for write acknowledgement)
  • Single server
RethinkDB 2.0.3:   29m 03s
ArangoDB 2.5.5:     5m 52s

Daniel Mewes

Jun 19, 2015, 5:57:03 PM
to reth...@googlegroups.com
Hi Samuel, thanks for sharing your results!

Do you know by any chance how ArangoDB does caching and whether it limits the amount of unpersisted data? The only thing I could find was this post https://www.arangodb.com/2012/03/avocadodb-memory-management-and-consistency/ which mentions that ArangoDB is using memory mapped files, but that post is rather old and might be outdated.

The reason I wonder about this is because it's not clear to me if ArangoDB would accumulate large amounts of unsaved data in memory before actually writing them to disk or not. I'm not saying that this is necessarily the reason for the performance difference. Just something worth keeping in mind.

On what hardware did you run the benchmark, and how large were the individual documents? Is there a chance you could put the benchmark script(s) up somewhere so we can investigate this?

- Daniel

Samuel Goldenbaum

Jun 20, 2015, 5:27:38 AM
to reth...@googlegroups.com
Hi Daniel

Busy with some tests in more scenarios and will share the results with you first. From what I understand, Arango does use memory-mapped files and loads entire indexes into memory for performance. I am, however, testing writes at this stage, as they are the most important for one of my scenarios. I'm testing with disk-sync acknowledgement both on and off. Acknowledging the disk sync before returning to the client only increases the total time by a few seconds.

Will share more detail later

sjmu...@gmail.com

Sep 8, 2015, 3:57:46 AM
to RethinkDB
Any updates here? Like others, it's really tough for us to consider RethinkDB when performance is lagging so far behind other solutions. We need to know if there are any reliable workarounds currently available to improve write throughput, and/or upcoming performance enhancements arriving in future 2.x releases.

Daniel Mewes

Sep 8, 2015, 3:07:15 PM
to RethinkDB Google Group
We have made a couple of improvements in the meantime, though you still need to use a few tricks, different from MongoDB, to get the same insert rate.

With a single client using Sam's (s...@heather.sh) benchmark script from the first post, the insert rate on SSDs and Linux with RethinkDB 2.1.3 is around 3K/s.
Using 10 concurrent clients it climbs to an average of 14K inserts per second.

Similar speeds should be achievable by batching inserts in a single client, so I expect that inserting 10m documents like in Samuel Goldenbaum's test will take in the range of ~12 minutes.

Of course the exact speed is going to depend a lot on the underlying storage, the hardware used, and the size of the documents being inserted.



Samuel Heather

Jan 21, 2018, 10:45:04 AM
to RethinkDB
Hi Daniel,
Would it be possible for you to edit the above post and remove my email address, please?
Thanks,
Sam



V

Feb 1, 2020, 4:56:00 AM
to RethinkDB
Is there any update on this issue?

Gábor Boros

Feb 1, 2020, 5:27:06 AM
to RethinkDB
Hello V,

Although I was not there at that time, what I can see is the following:

In this thread Daniel referenced a post about a performance report which was never linked here. I found the post on the blog and it is available at https://rethinkdb.com/blog/rethinkdb-performance-report. Other than that, more than 3 years have passed since then. As far as I know - but correct me if I'm wrong - the performance has become better and a lot more features have arrived. Although I cannot run the benchmarks, based on the feedback of the last several months on Discord, the DB is performing well.

Regards,
Gabor

V

Feb 2, 2020, 4:45:26 AM
to reth...@googlegroups.com
Thank you for the info! I hope they update it with new performance numbers
