questions about 'Rethinking Database Storage' whitepaper

questions about 'Rethinking Database Storage' whitepaper MarkCallaghan 7/28/09 7:15 AM
What is the cause for 'index data structures generate 2GB of data per
index for every 50MB of data'? Is this from having multiple copies of
data? Or from too much metadata per row? Or from something specific to
your algorithm?

I might disagree with your claim that it is OK to queue writers
because there are few long-running transactions. Even short-running
transactions encounter disk reads and may benefit from being run
concurrently to hide IO latency.

Finally, I hope that you don't limit the implementation of concurrent
writers to the use of transactional memory.
Re: questions about 'Rethinking Database Storage' whitepaper Leif Walsh 7/28/09 7:34 AM
On Tue, Jul 28, 2009 at 7:15 AM, MarkCallaghan<mdca...@gmail.com> wrote:
> What is the cause for 'index data structures generate 2GB of data per
> index for every 50MB of data'? Is this from having multiple copies of
> data? Or from too much metadata per row? Or from something specific to
> your algorithm?

When we insert a new row, we need to change its parent to point to
that row.  Since we do this in an append-only fashion, we copy the
parent node to the end of the file (after the new row).  Now, we need
to change the grandparent to point to the new version of the parent,
and so on up to the root.  Thus, an insert requires O(log n) copies.
This means that inserting n elements into a blank database causes on
the order of n log n nodes to be written to disk, with a rather large
constant factor (versus the 2n - 1 nodes a fully balanced binary tree
would need in the optimal case).  This begs for garbage collection,
for which we
have a rather intelligent (if I do say so myself) scheme, which will
be implemented in the next few days.
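
To make the copying concrete, here is a minimal sketch of append-only
path copying for a toy binary search tree stored in a log file.  To be
clear, the node layout and all the names below are invented for
illustration - this is not our actual on-disk format:

    import struct

    # Append-only path copying, illustrative only.  Each node is written
    # at the end of the file and children are referenced by file offset,
    # so changing a leaf forces a fresh copy of every ancestor up to the
    # root: O(log n) node writes per insert.
    NODE = struct.Struct("<qqq")  # key, left offset, right offset (-1 = none)

    def append_node(f, key, left, right):
        off = f.seek(0, 2)  # append-only: always write at end of file
        f.write(NODE.pack(key, left, right))
        return off

    def read_node(f, off):
        f.seek(off)
        return NODE.unpack(f.read(NODE.size))

    def insert(f, root, key):
        """Insert key into the tree at 'root'; return the new root offset."""
        if root == -1:
            return append_node(f, key, -1, -1)
        k, left, right = read_node(f, root)
        if key < k:
            new_left = insert(f, left, key)
            return append_node(f, k, new_left, right)   # copied ancestor
        else:
            new_right = insert(f, right, key)
            return append_node(f, k, left, new_right)   # copied ancestor

    with open("toy.db", "w+b") as f:
        root = -1
        for key in [5, 3, 8, 1, 4]:
            root = insert(f, root, key)
        # 5 live nodes, but 11 on disk: the extra 6 are copied ancestors.
        print("nodes on disk:", f.seek(0, 2) // NODE.size)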

> I might disagree with your claim that it is OK to queue writers
> because there are few long-running transactions. Even short-running
> transactions encounter disk reads and may benefit from being run
> concurrently to hide IO latency.

A good point, but at the moment, we don't have the capability to allow
multiple writers without horrible corruption.  In order to provide a
consistent view of the database for a second writer, we would need to
have a way for the second writer to see what the first writer is about
to write.  Instead, we are planning to provide STM for writes, forcing
a rollback if the writes conflict, but STM is a difficult task that we
just haven't tackled yet.
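
For what it's worth, the STM idea boils down to the usual optimistic
read-validate-retry loop.  Here is a toy sketch - every name in it is
invented, it is single-process, and it makes no claim to match what we
will actually build:

    import threading

    # Toy STM-style write transactions: each writer records the versions
    # it read, validates them under a global lock at commit time, and
    # retries from scratch if anything changed underneath it.
    class Store:
        def __init__(self):
            self.data = {}       # key -> value
            self.version = {}    # key -> monotonically increasing version
            self.lock = threading.Lock()

        def transact(self, fn):
            while True:                      # retry loop
                read_set = {}                # key -> version observed
                write_set = {}               # key -> buffered new value

                def read(key):
                    read_set[key] = self.version.get(key, 0)
                    return write_set.get(key, self.data.get(key))

                def write(key, value):
                    write_set[key] = value

                fn(read, write)              # run the transaction body
                with self.lock:              # validate and commit atomically
                    if all(self.version.get(k, 0) == v
                           for k, v in read_set.items()):
                        for k, v in write_set.items():
                            self.data[k] = v
                            self.version[k] = self.version.get(k, 0) + 1
                        return               # committed
                # a key changed under us: discard both sets and retry

    store = Store()
    store.transact(lambda read, write: write("x", (read("x") or 0) + 1))
    print(store.data["x"])  # 1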

> Finally, I hope that you don't limit the implementation of concurrent
> writers to the use of transactional memory.

I am extremely tired, so perhaps I will understand what you mean
better after some sleep, but so far I'm not sure what you're trying to
say.  If you could elaborate, that would be great. ;-)

That being said, as above, I think our current plan is to use STM for
concurrent writes, but we're open to suggestions, and I'll talk to
Slava tomorrow and see if he had any other ideas.

--
Cheers,
Leif

Re: questions about 'Rethinking Database Storage' whitepaper Slava Akhmechet 7/28/09 2:44 PM
On Jul 28, 7:15 am, MarkCallaghan <mdcal...@gmail.com> wrote:
> What is the cause for 'index data structures generate 2GB of data per
> index for every 50MB of data'?
Our append-only indexing scheme is O(log n) space per insert (I think
this is what Leif was trying to say). Every time we modify a leaf of
the index tree, we have to copy the entire branch from the leaf all
the way up to the root. This is what causes the overhead. We're
working on garbage collection now, so hopefully in a day or two this
will cease to be a problem.

> I might disagree with your claim that it is OK to queue writers
> because there are few long-running transactions.
This claim comes from Michael Stonebraker's paper
(http://db.cs.yale.edu/vldb07hstore.pdf). It's true that short-running
writes that encounter disk reads would benefit from concurrent
execution, but
if there are more read queries in the pipeline, we could just
interleave with those (of course this opens a question of slightly
starving the writers). In the near future we will implement proper
write concurrency support, but we're appealing to Stonebraker's
argument for now.
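
To make the interleaving concrete, here is a toy sketch of the
single-writer-queue idea, with threads standing in for whatever event
loop we end up using; all the names are invented:

    import queue, threading

    # One writer thread drains a queue of write jobs one at a time, while
    # any number of readers run concurrently against the last committed
    # root.  In an append-only design readers never see partial writes,
    # so they don't need to queue at all.
    write_queue = queue.Queue()
    committed_root = {"offset": 0}  # readers only see committed state

    def writer_loop():
        while True:
            job = write_queue.get()
            if job is None:
                break
            new_root = job()                     # may block on disk I/O
            committed_root["offset"] = new_root  # publish the new root
            write_queue.task_done()

    def reader(name):
        # Readers never wait on the writer queue.
        print(name, "sees root at offset", committed_root["offset"])

    threading.Thread(target=writer_loop, daemon=True).start()
    write_queue.put(lambda: 42)  # a write job returning the new root offset
    write_queue.join()
    for i in range(3):
        threading.Thread(target=reader, args=("reader-%d" % i,)).start()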

> Finally, I hope that you don't limit the implementation of concurrent
> writers to the use of transactional memory.
We'll do full transactions. We usually refer to "transactional
memory" because it uses almost the same techniques we'll be using to
implement transactions, and most people these days know a lot about
transactional memory and fairly little about how databases implement
transactions.
Re: questions about 'Rethinking Database Storage' whitepaper MARK CALLAGHAN 7/28/09 10:12 PM
On Tue, Jul 28, 2009 at 2:44 PM, coffeemug<coff...@gmail.com> wrote:
>
> On Jul 28, 7:15 am, MarkCallaghan <mdcal...@gmail.com> wrote:
>> What is the cause for 'index data structures generate 2GB of data per
>> index for every 50MB of data'?
> Our append-only indexing scheme is O(log n) space per insert (I think
> this is what Leif was trying to say). Every time we modify a leaf of
> the index tree, we have to copy the entire branch from the leaf all
> the way up to the root. This is what causes the overhead. We're
> working on garbage collection now, so hopefully in a day or two this
> will cease to be a problem.

Will GC decrease the rate at which pages are dirtied? Or just clean up
the dirty pages faster?

>
>> I might disagree with your claim that it is OK to queue writers
>> because there are few long-running transactions.
> This claim comes from Michael Stonebraker's paper
> (http://db.cs.yale.edu/vldb07hstore.pdf). It's true that short-running
> writes that encounter disk reads would benefit from concurrent
> execution, but
> if there are more read queries in the pipeline, we could just
> interleave with those (of course this opens a question of slightly
> starving the writers). In the near future we will implement proper
> write concurrency support, but we're appealing to Stonebraker's
> argument for now.

H-Store is an in-memory DBMS and not subject to stalls for disk reads.
RethinkDB will have stalls in that case.

>
>> Finally, I hope that you don't limit the implementation of concurrent
>> writers to the use of transactional memory.
> We'll do full transactions. We usually refer to "transactional
> memory" because it uses almost the same techniques we'll be using to
> implement transactions, and most people these days know a lot about
> transactional memory and fairly little about how databases implement
> transactions.

My hope is based on my assumption that there are few to no examples of
production-quality server code written using STM or HTM.

--
Mark Callaghan
mdca...@gmail.com

Re: questions about 'Rethinking Database Storage' whitepaper Slava Akhmechet 7/29/09 4:38 AM
On Jul 28, 10:12 pm, MARK CALLAGHAN <mdcal...@gmail.com> wrote:
> Will GC decrease the rate at which pages are dirtied? Or just clean up
> the dirty pages faster?
In an append-only scenario only one page is dirtied - the last one (I
assume you're referring to the page cache). The GC won't have an
effect on this; it's just a copying garbage collector that will walk
the index trees (and data rows) and compactly copy this data over to
a new file to reclaim space. Perhaps I'm misunderstanding the
question?
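
To sketch what I mean by a copying collector (again, invented names
and layout, not our real code): trace the live nodes from the current
root and append them, compacted, into a fresh file, leaving all the
stale ancestor copies behind:

    import struct

    NODE = struct.Struct("<qqq")  # key, left offset, right offset (-1 = none)

    def read_node(f, off):
        f.seek(off)
        return NODE.unpack(f.read(NODE.size))

    def append_node(f, key, left, right):
        off = f.seek(0, 2)
        f.write(NODE.pack(key, left, right))
        return off

    def compact(old, new, root):
        """Copy the live subtree at 'root' into 'new'; return its offset."""
        if root == -1:
            return -1
        key, left, right = read_node(old, root)
        new_left = compact(old, new, left)    # children first, bottom-up
        new_right = compact(old, new, right)
        return append_node(new, key, new_left, new_right)

    # Usage: new_root = compact(old_file, new_file, old_root), then
    # atomically switch readers to the new file and delete the old one.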

> H-Store is an in-memory DBMS and not subject to stalls for disk reads.
> RethinkDB will have stalls in that case.
I'm not sure why this is true. If there is a queue of writers with a
single writer writing at a time, and multiple concurrent readers, the
overall system won't stall - while there is a read going on from disk,
multiple other readers should do CPU work and read relevant data from
the caches. Why do you think there would be a stall? (I guess this
also depends on what you define to be a stall - I suppose the entire
process waiting for disk IO?)

> My hope is based on my assumption that there are few to no examples of
> production-quality server code written using STM or HTM.
There are few to no examples of successful append-only databases
either :) But this is definitely a great point, and we've thought
about this quite a bit. Ultimately, we might try out a number of
different schemes, and settle on one that makes the most sense.
Re: questions about 'Rethinking Database Storage' whitepaper MARK CALLAGHAN 7/29/09 7:31 AM
On Wed, Jul 29, 2009 at 4:38 AM, coffeemug<coff...@gmail.com> wrote:
>
> On Jul 28, 10:12 pm, MARK CALLAGHAN <mdcal...@gmail.com> wrote:
>> Will GC decrease the rate at which pages are dirtied? Or just clean up
>> the dirty pages faster?
> In an append-only scenario only one page is dirtied - the last one (I
> assume you're referring to the page cache). The GC won't have an
> effect on this; it's just a copying garbage collector that will walk
> the index trees (and data rows) and compactly copy this data over to
> a new file to reclaim space. Perhaps I'm misunderstanding the
> question?

> Our append-only indexing scheme is O(log n) space per insert (I think
> this is what Leif was trying to say). Every time we modify a leaf of
> the index tree, we have to copy the entire branch from the leaf all
> the way up to the root. This is what causes the overhead. We're
> working on garbage collection now, so hopefully in a day or two this
> will cease to be a problem.

Is that still true with a GC implementation?

>
>> H-Store is an in-memory DBMS and not subject to stalls for disk reads.
>> RethinkDB will have stalls in that case.
> I'm not sure why this is true. If there is a queue of writers with a
> single writer writing at a time, and multiple concurrent readers, the
> overall system won't stall - while there is a read going on from disk,
> multiple other readers should do CPU work and read relevant data from
> the caches. Why do you think there would be a stall? (I guess this
> also depends on what you define to be a stall - I suppose the entire
> process waiting for disk IO?)

By stalls, I mean that a transaction doing writes blocks on a read.
For example, it is doing 'UPDATE foo set x = x + 1'. I assume that by
using a queue for writers you mean that only one write transaction
runs at a time.

>
>> My hope is based on my assumption that there are few to no examples of
>> production-quality server code written using STM or HTM.
> There are few to no examples of successful append-only databases
> either :) But this is definitely a great point, and we've thought
> about this quite a bit. Ultimately, we might try out a number of
> different schemes, and settle on one that makes the most sense.

There is the Berkeley DB Java Edition, but you have to look really
hard to find the whitepaper that describes it. It is worth reading.

--
Mark Callaghan
mdca...@gmail.com

Re: questions about 'Rethinking Database Storage' whitepaper Slava Akhmechet 7/29/09 5:14 PM
On Jul 29, 7:31 am, MARK CALLAGHAN <mdcal...@gmail.com> wrote:
> Is that still true with a GC implementation?
Yes. The GC will reclaim all the wasted space (so the indexing scheme
will take up linear space instead of O(n log n)), but everything else
will remain the same.
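
As a back-of-envelope check - with every number assumed, since the
real node size depends on the implementation - the shape of the
blow-up and of the savings looks like this:

    import math

    # All numbers are assumed, purely to show the orders of magnitude.
    n = 500000       # assumed rows in 50MB (~100 bytes per row)
    node_size = 200  # assumed bytes per on-disk index node

    # Without GC: each insert rewrites ~log2(n) ancestors.
    written = n * math.ceil(math.log2(n)) * node_size
    print("%.1f GiB written without GC" % (written / 2**30))  # ~1.8 GiB

    # After GC only the ~n live nodes remain.
    print("%.2f GiB live after GC" % (n * node_size / 2**30))  # ~0.09 GiB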

> By stalls, I mean that a transaction doing writes blocks on a read.
Yes, this is correct. However, while the write transaction blocks,
there will be other readers running concurrently, so the overall
system will still perform useful work. This is good enough for many
use cases now, but certainly needs to be fixed in the future.

> There is the Berkeley DB Java Edition, but you have to look really
> hard to find the whitepaper that describes it. It is worth reading.
Thanks, I will look it up.