Thoughts

Greg Young

Oct 30, 2013, 9:32:50 AM
to jour...@googlegroups.com
I spent half an hour or so reading through the code today and I just wanted to put some thoughts/comments/suggestions up for discussion. Overall the code base looks pretty good (except, of course, that it's in Java).

1) Compaction seems incapable of compacting multiple files into a single file. This would be a really nice feature, as otherwise you end up with 10k tiny files :)

2) I saw multi-threaded compaction and got really scared: this can even kill the ability to write, due to write amplification. More likely you want to be able to slow down compaction under load. This also ties into #1, as compactions are usually not single-file (though still naively parallelizable).
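
Even something as dumb as this goes a long way (just a sketch with made-up names, not your code):

    // A deliberately simple throttle for the compaction thread: cap its
    // copy rate at bytesPerSecond so foreground writes keep I/O headroom.
    class CompactionThrottle {
        private final long bytesPerSecond;
        private final long startNanos = System.nanoTime();
        private long totalBytes = 0;

        CompactionThrottle(long bytesPerSecond) {
            this.bytesPerSecond = bytesPerSecond;
        }

        // Call after each chunk copied during compaction; sleeps when the
        // compactor is ahead of its byte budget.
        void onBytesCopied(long n) throws InterruptedException {
            totalBytes += n;
            long elapsedMillis = (System.nanoTime() - startNanos) / 1_000_000;
            long expectedMillis = totalBytes * 1000 / bytesPerSecond;
            if (expectedMillis > elapsedMillis) {
                Thread.sleep(expectedMillis - elapsedMillis); // back off
            }
        }
    }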

3) On the way things are made durable: this is a trade-off between latency and throughput. In another system I have worked on, we came up with a dead simple way of balancing the two that may be useful: queue batches. When the writer finishes a write and the queue is empty, fsync immediately; if it isn't empty, keep taking writes off the queue for up to a period of time (we base that period on how long an fsync takes).
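
Roughly, the sync loop looks like this (a sketch with made-up names, not code from the system I mentioned). An idle system gets minimum latency, a loaded one amortises each fsync over a growing batch:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    class BatchingSyncer implements Runnable {
        private final BlockingQueue<ByteBuffer> queue = new LinkedBlockingQueue<>();
        private final FileChannel channel;
        private long lastSyncNanos = 1_000_000; // initial guess: 1 ms

        BatchingSyncer(FileChannel channel) { this.channel = channel; }

        void submit(ByteBuffer record) throws InterruptedException {
            queue.put(record);
        }

        @Override
        public void run() {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    List<ByteBuffer> batch = new ArrayList<>();
                    batch.add(queue.take()); // block until there is work
                    // Queue empty => loop exits at once => immediate fsync.
                    // Queue busy => keep draining for about one fsync's time.
                    long deadline = System.nanoTime() + lastSyncNanos;
                    ByteBuffer next;
                    while (!queue.isEmpty()
                           && (next = queue.poll(deadline - System.nanoTime(),
                                                 TimeUnit.NANOSECONDS)) != null) {
                        batch.add(next);
                    }
                    for (ByteBuffer b : batch) channel.write(b);
                    long start = System.nanoTime();
                    channel.force(false); // one fsync for the whole batch
                    lastSyncNanos = System.nanoTime() - start;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    }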

4) With the use of byte[], my guess is it will take down the JVM under load with larger messages.

5) Quite a few places delete data; we have a rule never to delete without first making a copy (in case of bugs, etc.). A reordered write (or a bad sector) could cause an entire journal chunk to be deleted.
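
The rule is cheap to implement, something like this (hypothetical names, not your code):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    // Instead of unlinking a chunk, move it into a trash directory so a
    // compaction bug can never destroy the only copy.
    final class SafeDelete {
        static void delete(Path chunk, Path trashDir) throws IOException {
            Files.createDirectories(trashDir);
            Files.move(chunk, trashDir.resolve(chunk.getFileName()),
                       StandardCopyOption.REPLACE_EXISTING);
        }
    }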

6) When swapping in temporary files during compaction, it's highly OS-specific where you need fsyncs (there are none now, so this can corrupt a database). E.g. does closing the handle sync? Does a copy or rename ensure an fsync? We found a ton of bugs around this when doing power-pull tests on various operating systems.
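
For reference, a sketch of a Linux-flavoured "durable swap" (made-up names; step 3 is the one everybody forgets, and the directory-sync trick does not work on Windows, which is exactly the kind of OS-specific difference I mean):

    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;
    import java.nio.file.StandardOpenOption;

    final class DurableSwap {
        static void replace(Path tmp, Path target) throws IOException {
            // 1) Flush the new file's contents before it becomes visible.
            try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
                ch.force(true);
            }
            // 2) Atomically rename it into place (POSIX rename replaces the target).
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
            // 3) fsync the parent directory too, or a crash can undo the rename.
            try (FileChannel dir = FileChannel.open(target.getParent(),
                                                    StandardOpenOption.READ)) {
                dir.force(true);
            }
        }
    }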

7) Adding a footer to the log with an incrementing hash might help detect bad data (if you write it incrementally in the log, you might even be able to salvage the data).
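
A sketch of the framing I mean (made up, not your format): each record ends with a CRC32 footer that also covers the previous record's footer, so replay can spot a torn or bit-rotted record and stop at the last good one:

    import java.nio.ByteBuffer;
    import java.util.zip.CRC32;

    final class ChecksummedFramer {
        private int previousCrc = 0;

        // Frame a record as [length][payload][crc32].
        ByteBuffer frame(byte[] payload) {
            CRC32 crc = new CRC32();
            // Chain in the previous footer, byte by byte.
            crc.update(previousCrc >>> 24);
            crc.update(previousCrc >>> 16);
            crc.update(previousCrc >>> 8);
            crc.update(previousCrc);
            crc.update(payload, 0, payload.length);
            int footer = (int) crc.getValue();

            ByteBuffer record = ByteBuffer.allocate(4 + payload.length + 4);
            record.putInt(payload.length).put(payload).putInt(footer);
            record.flip();
            previousCrc = footer;
            return record;
        }
    }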

Cheers,

Greg

Sergio Bossa

Nov 2, 2013, 6:36:04 AM
to jour...@googlegroups.com
Hi Greg,

sorry for this late response, and thanks for the feedback!

See my comments below.

> 1) Compaction seems incapable of compacting multiple files into a single file. This would be a really nice feature, as otherwise you end up with 10k tiny files :)

Correct, feel free to open an issue.

> 2) I saw multi-threaded compaction and got really scared: this can even kill the ability to write, due to write amplification. More likely you want to be able to slow down compaction under load. This also ties into #1, as compactions are usually not single-file (though still naively parallelizable).

Correct, another way compaction should be improved is by
controlling/throttling I/O.
Again, feel free to open a new issue, even if right now compaction
improvements are not at the top of my list (but contributions are
more than welcome!).

> 3) On the way things are made durable: this is a trade-off between latency and throughput. In another system I have worked on, we came up with a dead simple way of balancing the two that may be useful: queue batches. When the writer finishes a write and the queue is empty, fsync immediately; if it isn't empty, keep taking writes off the queue for up to a period of time (we base that period on how long an fsync takes).

Right now durability is in developers' hands: they can pick the
level that is right for them, and I think that is good enough for now.
The main problem is for those who want immediate durability by
syncing every single write: in that case the performance loss is
huge, due to the disk write plus the batching overhead. There, I
think a time-based approach would be more useful: collect all writes
happening within N milliseconds, write them as a single batch, and
only then acknowledge by returning from the write method (rather
than batching and returning immediately, as happens right now). Like
you said, it's a trade-off between latency and throughput (this
solution would hurt latency), but I have worked with high-throughput
systems that run quite well this way.
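
A minimal sketch of what I have in mind (hypothetical names, not current Journal.IO code), where a background timer would call flush() every N milliseconds:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.CountDownLatch;

    class GroupCommitLog {
        private final FileChannel channel;
        private List<ByteBuffer> pending = new ArrayList<>();
        private CountDownLatch synced = new CountDownLatch(1);

        GroupCommitLog(FileChannel channel) { this.channel = channel; }

        // Writers enqueue, then block until the batch holding their record
        // has been written and fsynced: that is the delayed acknowledgement.
        void write(ByteBuffer record) throws InterruptedException {
            CountDownLatch myBatch;
            synchronized (this) {
                pending.add(record);
                myBatch = synced;
            }
            myBatch.await();
        }

        // Called by a background timer every N milliseconds.
        void flush() throws IOException {
            List<ByteBuffer> batch;
            CountDownLatch toRelease;
            synchronized (this) {
                if (pending.isEmpty()) return;
                batch = pending;
                toRelease = synced;
                pending = new ArrayList<>();
                synced = new CountDownLatch(1);
            }
            for (ByteBuffer b : batch) channel.write(b);
            channel.force(false);
            toRelease.countDown(); // wake every writer in the batch
        }
    }

Writers pay up to one window of latency, but the journal does one
fsync per window instead of one per write.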

> 4) With the use of byte[], my guess is it will take down the JVM under load with larger messages.

Correct, streaming APIs should be added in future versions.
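
Just to sketch the shape such an API could take (not an actual
planned signature):

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    // Copy in fixed-size chunks instead of materialising the whole
    // record as one byte[], so large messages never need a large heap
    // allocation.
    final class StreamingWrites {
        static void write(InputStream payload, FileChannel channel)
                throws IOException {
            byte[] chunk = new byte[8192];
            ByteBuffer buf = ByteBuffer.wrap(chunk);
            int n;
            while ((n = payload.read(chunk)) != -1) {
                buf.position(0).limit(n);
                while (buf.hasRemaining()) {
                    channel.write(buf);
                }
            }
        }
    }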

> 5) Quite a few places delete data; we have a rule never to delete without first making a copy (in case of bugs, etc.). A reordered write (or a bad sector) could cause an entire journal chunk to be deleted.

Do you mean log file deletion? For that case we provide (primitive
for now) journal archiving: did you try it?

> 6) When swapping in temporary files during compaction, it's highly OS-specific where you need fsyncs (there are none now, so this can corrupt a database). E.g. does closing the handle sync? Does a copy or rename ensure an fsync? We found a ton of bugs around this when doing power-pull tests on various operating systems.

Yes, low-level disk behaviour is a kind of black magic, and more
tests should be done: do you have any suggestions?

> 7) Adding a footer to the log with an incrementing hash might help detect bad data (if you write it incrementally in the log, you might even be able to salvage the data).

You mean when the log file is "finalized"?

Thanks again for the feedback, looking forward to your answers, and
contributions too :)

--
Sergio Bossa
http://www.linkedin.com/in/sergiob