Reducing metadata operations, a first step

Daniel Phillips

Nov 12, 2007, 3:44:21 AM
to zuma...@googlegroups.com
Reducing metadata operations in general

The initial ddsnap metadata design was "the simplest thing that could
possibly work" at the expense of optimization, a design decision I do
not regret given the high cost of even a small number of correctness
issues in the code. Now that the simple design has stabilized it is
time to do some optimization.

With the current design, each write to a snapshotted origin requires 8
IO transfers (not necessarily in this order):

1. read snapshotted origin data from origin
2. write snapshotted origin data to snapshot store
3. write bitmap block to journal (snapshot store copyout allocation)
4. write btree leaf to journal (pointer to copyout data addition)
5. write commit block to journal
6. write journalled bitmap block to final destination
7. write journalled btree leaf to final destination
8. write new origin data

In short, the current snapshot algorithm multiplies the IO traffic for
a single chunk write by a factor of 8 and introduces 7 new seeks. It
is clear that reducing the number of metadata writes will benefit both
solid state disks (by reducing bandwidth) and rotating media (by
reducing seeks).

Eliminating most bitmap updates in particular

Of the 5 metadata writes (3 through 7) it is quite easy to amortize away
the bitmap updates (3 and 6 above). The idea is to record allocations
in the journal commit block and flush all those bits out to the actual
bitmaps just once per journal wrap. This turns the per-transaction
bitmap updates into one efficient write pass through the bitmap table
per journal wrap, with just a single seek and, in the typical case,
many bit updates consolidated per block. Depending on various factors
(for example, journal size) savings from this optimization should
approach two out of eight IO operations per origin write, or roughly
25%.
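The scheme above can be sketched in C. The structure names, sizes, and
the in-memory bitmap are illustrative stand-ins, not ddsnap's actual
on-disk format:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sizes, for illustration only. */
#define BITMAP_BITS 4096          /* bits tracked by the stand-in bitmap */
#define MAX_ALLOCS  64            /* allocations recorded per commit block */

struct commit_block {
	unsigned count;                 /* allocations in this transaction */
	uint64_t allocated[MAX_ALLOCS]; /* chunk addresses allocated */
};

static unsigned char bitmap[BITMAP_BITS / 8]; /* stands in for on-disk bitmaps */

/* Record an allocation in the commit block instead of dirtying a
 * bitmap block in the journal transaction. */
static void commit_record_alloc(struct commit_block *commit, uint64_t chunk)
{
	assert(commit->count < MAX_ALLOCS);
	commit->allocated[commit->count++] = chunk;
}

/* Once per journal wrap: fold all recorded allocations into the
 * bitmaps in one pass, consolidating many bit updates per block. */
static void flush_allocs_to_bitmap(const struct commit_block *commit)
{
	for (unsigned i = 0; i < commit->count; i++) {
		uint64_t chunk = commit->allocated[i];
		bitmap[chunk >> 3] |= 1 << (chunk & 7);
	}
}
```

The point of the sketch is the shape of the data flow: per-transaction
work touches only the commit block, and the bitmap table is written in
a single batched pass per wrap.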

To implement this idea, the ddsnapd code is amended as follows:

- In the commit block, borrow a bit from the physical block address to
indicate that the block was allocated in the commit. Only
allocations are handled this way, not deallocations, since
deallocation is already a bulk operation. This requires auditing all
accesses to those addresses.

- Dirty bitmap blocks are not added to the journal transaction.
Instead, newly allocated data blocks are marked as such in the
journal commit block.

- On journal wrap (detected at commit time) dirty bitmap blocks are
flushed to disk (or alternatively, journalled to disk for slightly
more safety).

- On journal replay, any "new allocation" flags in commit blocks are
entered into the respective bitmap blocks, dirtying them.
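The address-bit trick in the first point might look like this in C.
The flag position and 64-bit width are assumptions for illustration;
the only real requirement is that ddsnap's physical chunk addresses
leave one bit spare:

```c
#include <assert.h>
#include <stdint.h>

/* Borrow the top bit of a (hypothetically 64-bit) physical chunk
 * address to flag "allocated in this commit".  Every consumer of the
 * address must strip the flag first, which is why all accesses to
 * these addresses need auditing. */
#define NEW_ALLOC_FLAG (1ULL << 63)

static inline uint64_t mark_new_alloc(uint64_t addr)
{
	return addr | NEW_ALLOC_FLAG;   /* tag: allocated in this commit */
}

static inline int is_new_alloc(uint64_t addr)
{
	return (addr & NEW_ALLOC_FLAG) != 0;
}

static inline uint64_t plain_address(uint64_t addr)
{
	return addr & ~NEW_ALLOC_FLAG;  /* strip tag before any disk math */
}
```

On replay, any address testing true with is_new_alloc would have its
bit entered into the corresponding bitmap block, dirtying it.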

The old, simple, stable code will be kept as a compilation option.

This idea may be extended to btree leaf operations as well, potentially
eliminating another two metadata writes per origin write. This is
somewhat more challenging because a scheme needs to be implemented to
describe the changes to the leaves, and at replay time, to recognize
when the changes have or have not been applied.
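One way to make leaf replay recognize already-applied changes is to
log logical records and check for the key before applying. This is a
sketch under assumed names and layout, not ddsnap's actual leaf
format:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical logical record: "chunk X now points at exception Y". */
struct leaf_record {
	uint64_t chunk;      /* logical chunk on the origin */
	uint64_t exception;  /* copied-out chunk in the snapshot store */
};

/* Tiny stand-in for a btree leaf. */
struct leaf {
	unsigned count;
	struct leaf_record entries[16];
};

static int leaf_has(const struct leaf *leaf, uint64_t chunk)
{
	for (unsigned i = 0; i < leaf->count; i++)
		if (leaf->entries[i].chunk == chunk)
			return 1;
	return 0;
}

/* Replay must recognize whether a logged change already reached the
 * leaf before the crash; probing for the key makes replay idempotent,
 * so the same journal record can be applied any number of times. */
static void replay_leaf_record(struct leaf *leaf, const struct leaf_record *rec)
{
	if (leaf_has(leaf, rec->chunk))
		return;              /* change already applied, skip */
	assert(leaf->count < 16);
	leaf->entries[leaf->count++] = *rec;
}
```

The harder part, as noted above, is designing the record format so
that every kind of leaf change can be described and recognized this
way.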

In terms of urgency, the bitmap optimization ranks below the completion
of the handful of usability issues we have stacked up for the 0.5
release, and likely ranks below manual snapshot store expansion. On
the other hand, showing some efficiency gains early, as in the next few
weeks, helps answer the question "will we ever do something about our
worst-case write performance?" The answer is yes, starting now.

Regards,

Daniel
