From the guys at Foundation DB (which I hadn't heard of). I've
provided a few sections of the article here, omitting about half of
it. Well worth checking out the parts I didn't include here.
When we started building FoundationDB five years ago, my co-founder and
I wrote down our big hairy goals for the distributed key-value database
we were dreaming up:
Fast, strict ACID transactions
Even during failure scenarios
Even across multiple keys on different nodes
Read latencies below 500 microseconds
Commit (write) latencies below 2 milliseconds
Scalability to 100,000,000 random reads per second
Scalability to 10,000,000 random writes per second (!)
I knew we were setting ourselves up for a big challenge. Seven years
prior I had prototyped an analytics tool on top of an RDMBS. I don’t
think I ever got it to run faster than about 100 writes per second. I’m
sure with the exotic hardware of the day I could have hit 1,000.
Unfortunately, I needed a lot more and eventually gave up on RDBMSs,
instead building a custom column-oriented database optimized for table
A new architecture
Today, we ship FoundationDB Key-Value Store 3.0 and change all that.
For 3.0 we are delivering a breakthrough all-new “transaction engine”
with a scalable design and no single master machine in the path of
transactions. The transaction engine is the heart of FoundationDB,
executing all transactions, checking them against each other, and
ensuring speedy application of all writes. There are three major
sub-components in the transaction engine, each of which has been
rewritten for 3.0:
Proxies stage incoming client transactions as they go through the
transaction commit process and provide global backpressure to deal with
filling queues during saturating workloads.
Resolvers keep track of the complex modification history of keys and
ranges of keys, working together to enforce serializable transaction
isolation. (Yes, we pass Jepsen, the new, harder Jepsen, and Hermitage)
Transaction logs both durably log and mux/de-mux the mutations from the
incoming transaction stream. They stay in constant communication with
the storage servers to push updates to the storage subsystem.
The commit (write) path is roughly:
Clients send their transaction to the proxies on commit
Proxies check transaction isolation using the resolvers
Proxies send transactions that pass isolation checks to the transaction
Transaction logs write and flush transactions to disk
Transaction logs report success to the clients (via the proxies)
Transaction logs send database updates to the storage nodes
Hold onto your hats. With FoundationDB’s new transaction engine this
test averages 14,400,000 random writes per second. Or, as I like to
I hope you agree that this is an incredible result. And it’s made even
more impressive because we are hitting this number on a fully-ordered,
fully-transactional database with 100% multi-key cross-node
transactions. We haven’t heard of a database that even comes close to
these performance numbers with those guarantees. Oh, and in the public
cloud, with all its usual communications and noisy-neighbor challenges.
Let’s put 14.4 MHz in context:
It is 1,396 times faster than the event-record 618,725 TPM
(tweet-per-minute) rate when Germany won the 2014 World Cup.
It is 4.2 times faster than Facebook’s OLTP row update per second
numbers in Nov. 2010. (They now have ~2.5 times as many users.)
If I had started that old RDBMS doing 100 random writes per second
thirteen years ago, FoundationDB 3.0 would catch up in 48 minutes.
It’s gratifying for the whole team here to hit our ambitious initial
goal after five hard years of theory, simulation, and engineering!
Postscript: lessons learned
The massive project to re-invent the transaction engine in FoundationDB
could have been a second system syndrome disaster. I believe it was
only possible because of our simulation and testing systems, including
our programming language Flow. If you are interested in knowing more
about how these tools work, and how they’ve enabled us to make bold,
confident engineering decisions five years into a complex project,
please enjoy this presentation by my colleague, Will Wilson.
RS Wood <r...@therandymon.com