When we insert a new row, we need to change its parent to point to
that row. Since we do this in an append-only fashion, we copy the
parent node to the end of the file (after the new row). Now, we need
to change the grandparent to point to the new version of the parent,
and so on up to the root. Thus, an insert requires O(log n) copies.
This means that inserting n elements into an empty database writes
O(n log n) nodes to disk, with a rather large constant factor (versus
the optimal 2n - 1 node writes for a fully balanced binary tree). This
cries out for garbage collection, for which we have a rather
intelligent (if I do say so myself) scheme, which will be implemented
in the next few days.
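To make the path copying concrete, here is a minimal sketch (a toy
in-memory list standing in for the append-only file, a plain binary
tree standing in for our index; all names are mine, not RethinkDB's).
Each insert appends a new leaf plus a fresh copy of every node on the
path up to the root, which is exactly the O(log n) copies above:

```python
# Toy sketch of append-only path copying. Nodes are immutable; every
# update appends new versions of the touched path to the end of "log".
class Node:
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value = key, value
        self.left, self.right = left, right  # offsets into log, or None

log = []  # append-only storage; stale node versions linger until GC

def append(node):
    log.append(node)
    return len(log) - 1  # offset of the freshly written node

def insert(root_idx, key, value):
    """Insert into the tree rooted at log[root_idx]; return the new
    root's offset. Copies O(depth) nodes and touches nothing else."""
    if root_idx is None:
        return append(Node(key, value))
    n = log[root_idx]
    if key < n.key:
        return append(Node(n.key, n.value,
                           insert(n.left, key, value), n.right))
    elif key > n.key:
        return append(Node(n.key, n.value,
                           n.left, insert(n.right, key, value)))
    else:
        return append(Node(key, value, n.left, n.right))  # overwrite
```

After inserting n keys the log holds well more than n nodes, since
every old version of every copied path is still on disk; that surplus
is what the garbage collector has to reclaim.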
> I might disagree with your claim that it is OK to queue writers
> because there are few long-running transactions. Even short-running
> transactions encounter disk reads and may benefit from being run
> concurrently to hide IO latency.
A good point, but at the moment, we don't have the capability to allow
multiple writers without horrible corruption. In order to provide a
consistent view of the database for a second writer, we would need to
have a way for the second writer to see what the first writer is about
to write. Instead, we are planning to provide STM for writes, forcing
a rollback if the writes conflict, but STM is a difficult task that we
just haven't tackled yet.
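For what it's worth, the rollback-on-conflict idea can be sketched as
plain optimistic concurrency over per-key version counters. This is a
toy model of the technique, not our actual design (the class and
method names are invented for illustration):

```python
# Toy optimistic (STM-style) write concurrency: a writer commits only
# if no other writer has committed a conflicting change since its
# snapshot was taken; otherwise it must roll back and retry.
import threading

class ConflictError(Exception):
    pass

class Store:
    def __init__(self):
        self._data = {}      # key -> value
        self._versions = {}  # key -> number of commits touching key
        self._lock = threading.Lock()

    def begin(self):
        """Start a transaction: snapshot the current versions."""
        with self._lock:
            return dict(self._versions)

    def commit(self, snapshot, writes):
        """Apply writes atomically, or raise if a key changed since
        the snapshot (the caller rolls back and retries)."""
        with self._lock:
            for key in writes:
                if self._versions.get(key, 0) != snapshot.get(key, 0):
                    raise ConflictError(key)
            for key, value in writes.items():
                self._data[key] = value
                self._versions[key] = self._versions.get(key, 0) + 1
```

Two transactions that write the same key race to commit; the loser
gets a ConflictError and retries against the winner's state, which is
the rollback-if-writes-conflict behavior described above.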
> Finally, I hope that you don't limit the implementation of concurrent
> writers to the use of transactional memory.
I am extremely tired, so perhaps I will understand what you mean
better after some sleep, but so far I'm not sure what you're trying to
say. If you could elaborate, that would be great. ;-)
That being said, as above, I think our current plan is to use STM for
concurrent writes, but we're open to suggestions, and I'll talk to
Slava tomorrow to see if he has any other ideas.
Will GC decrease the rate at which pages are dirtied? Or just clean up
the dirty pages faster?
>> I might disagree with your claim that it is OK to queue writers
>> because there are few long-running transactions.
> This claim comes from Michael Stonebraker's paper
> (http://db.cs.yale.edu/vldb07hstore.pdf). It's true that short-running writes
> that encounter disk reads would benefit from concurrent execution, but
> if there are more read queries in the pipeline, we could just
> interleave with those (of course this opens a question of slightly
> starving the writers). In the near future we will implement proper
> write concurrency support, but we're appealing to Stonebraker's
> argument for now.
H-Store is an in-memory DBMS and not subject to stalls for disk reads.
RethinkDB will have stalls in that case.
>> Finally, I hope that you don't limit the implementation of concurrent
>> writers to the use of transactional memory.
> We'll do full transactions. We usually refer to "transactional memory"
> because they use almost the same techniques we'll be using to
> implement transactions, and most people these days know a lot about
> transactional memory and fairly little about how databases implement
> transactions.
My hope is based on my assumption that there are few to no examples of
production-quality server code written using STM or HTM.
Our append-only indexing scheme is O(log n) space per modification (I
think this is what Leif was trying to say). Every time we modify a
leaf of the index tree, we have to copy the entire branch from the
leaf all the way up to the root. This is what causes the overhead.
We're working on garbage collection now, so hopefully in a day or two
this will cease to be a problem.
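As a rough illustration of why GC fixes the space overhead: in an
append-only index only the nodes reachable from the newest root are
live, so a compaction pass can copy those into a fresh file and drop
every stale version. This is a toy sketch of that idea, not the
actual scheme we're implementing:

```python
# Toy compaction for an append-only node log: copy the subtree
# reachable from the current root into a new log; everything left
# behind (old roots, superseded path copies) is garbage.
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def compact(log, root_idx):
    """Return (new_log, new_root_idx) containing only live nodes."""
    new_log = []

    def copy(idx):
        if idx is None:
            return None
        n = log[idx]
        node = Node(n.key, copy(n.left), copy(n.right))
        new_log.append(node)
        return len(new_log) - 1  # offset in the compacted log

    return new_log, copy(root_idx)
```

After compaction the log size is proportional to the live tree again,
so the O(n log n) worst case only applies between GC passes.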
Is that still true with a GC implementation?
>> H-Store is an in-memory DBMS and not subject to stalls for disk reads.
>> RethinkDB will have stalls in that case.
> I'm not sure why this is true. If there is a queue of writers with a
> single writer writing at a time, and multiple concurrent readers, the
> overall system won't stall - while there is a read going on from disk,
> multiple other readers should do CPU work and read relevant data from
> the caches. Why do you think there would be a stall? (I guess this
> also depends on what you define to be a stall - I suppose the entire
> process waiting for disk IO?)
By stalls, I mean that a transaction doing writes blocks on a read.
For example it is doing 'UPDATE foo set x = x + 1'. I assume that by
using a queue for writers you mean that only one write transaction
runs at a time.
>> My hope is based on my assumption that there are few to no examples of
>> production-quality server code written using STM or HTM.
> There are few to no examples of successful append-only databases
> either :) But this is definitely a great point, and we've thought
> about this quite a bit. Ultimately, we might try out a number of
> different schemes, and settle on one that makes the most sense.
There is the Berkeley DB Java Edition, but you have to look really
hard to find the whitepaper that describes it. It is worth reading.