About why fractal trees won't fragment (external fragmentation)


zhiyuan yang

Oct 8, 2015, 10:56:48 PM
to tokudb-user
Hi,

On the Percona website there is a blog post which says fractal trees won't fragment, without explanation. Then I found some slides about TokuMX that say fractal trees avoid fragmentation by using a large node size to mitigate seek cost. But I'm wondering whether we can do the same thing (use a large node size) in a B-tree. It seems this option hasn't gotten much notice, which means there must be some reason for not doing it. Can anyone give an explanation? Thanks!

(pages 46-49)

Thanks!
Zhiyuan

Leif Walsh

Oct 8, 2015, 11:23:08 PM
to zhiyuan yang, tokudb-user
When you talk about fragmentation on disk, there are two concerns: space amplification due to internal or external fragmentation, and loss of data locality which hurts range query performance. Admittedly, the loss of data locality is maybe not the most obvious use of the term "fragmentation" and it has confused more people than just you. 

InnoDB uses small, fixed-size pages, so it exhibits internal space fragmentation and loses lots of data locality as it ages. With a default 16KB page size and 10ms seeks, your range queries are dominated by seek time and you get an effective 16KB/10ms = 1.6MB/s in the worst case of totally randomized page placement.

TokuDB uses large, variable-sized pages, so it exhibits external space fragmentation, but the large page size means you can't lose that much data locality. 

The back-of-the-envelope calculation is that you want the seek latency to be about equal to the latency of streaming a page off disk at full sequential bandwidth:

For a spinning disk, assume seek and rotational latency is about 10ms, and sequential bandwidth is about 100MB/s. With a 1MB page, it takes 10ms to seek and 10ms to read, which means you get an effective range query throughput of 1MB/20ms = 50MB/s, even if your pages are in totally random order. TokuDB assumes 4x compression is a safe bet and chooses a default 4MB page size before compression. 
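
A quick sketch of that arithmetic, purely illustrative, using the 10ms seek and 100MB/s figures above:

# Toy model: effective range-scan throughput when every page read
# costs one seek plus a sequential transfer (worst case: pages laid
# out in totally random order on a spinning disk).
SEEK_S = 0.010       # ~10ms seek + rotational latency
BANDWIDTH = 100e6    # ~100MB/s sequential bandwidth

def effective_throughput(page_bytes):
    transfer_s = page_bytes / BANDWIDTH
    return page_bytes / (SEEK_S + transfer_s)

for label, size in (("16KB", 16e3), ("1MB", 1e6), ("4MB", 4e6)):
    print(f"{label} page: {effective_throughput(size) / 1e6:.1f} MB/s")
# 16KB page: 1.6 MB/s
# 1MB page: 50.0 MB/s
# 4MB page: 80.0 MB/s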

Hope this helps!

--
Cheers,
Leif



Zhiyuan Yang

Oct 8, 2015, 11:45:18 PM
to Leif Walsh, tokudb-user
Thanks for your explanation! Actually, I learned the term ‘external fragmentation’ from the SQL Server community and was also confused, because most uses of ‘fragmentation’ relate to space allocation, not locality problems :-)
I didn’t know TokuDB uses not only large but also variable-size pages. Now it seems clear: if we just used large, fixed-size pages in InnoDB, we would get all kinds of space fragmentation plus loss of data locality, which isn’t desirable at all.

Thanks!
Zhiyuan

Zhiyuan Yang

Oct 8, 2015, 11:56:28 PM
to Leif Walsh, tokudb-user
Sorry, but one thing in my last response is not correct: if we use a large page size in InnoDB, we get internal space fragmentation, not external fragmentation, and we avoid the loss of locality. This means we could also improve scan performance at the cost of space. So it seems my confusion isn't resolved: why don't people use a large node size for B-trees (like in InnoDB)? Is the space cost of internal fragmentation too high to be acceptable?

Thanks!
Zhiyuan

Leif Walsh

Oct 9, 2015, 12:15:04 AM
to Zhiyuan Yang, tokudb-user
Large pages in a b-tree like InnoDB cause very bad write amplification:

Suppose you only write a few rows to a leaf page before it needs to be flushed to disk (for eviction or checkpoint). You still have to serialize and write the whole page. So if your writes are, say, 1KB and your pages are 16KB, you have a "write amplification" factor of 16. But if your pages are 1MB, then your write amplification factor is 1000!

Write amplification is a measure of how many actual bytes are written to disk in proportion to the number of "user data bytes" written to the database. 
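
As a minimal sketch of that definition (same hypothetical numbers as above):

# Write amplification when flushing a b-tree leaf: the whole page is
# serialized and written no matter how few bytes actually changed.
def write_amplification(page_bytes, dirty_bytes):
    return page_bytes / dirty_bytes

print(write_amplification(16e3, 1e3))   # 16.0: 1KB of new rows in a 16KB page
print(write_amplification(1e6, 1e3))    # 1000.0: the same 1KB in a 1MB page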

In TokuDB, because writes are buffered high in the tree, we usually overwrite a large portion of a page when we do need to flush it. So, the per-node write amplification (total size / size of changes applied) is very low. But, in a b-tree a write only happens once, at the leaf (for all practical purposes), while in a fractal tree (much like an LSM tree), a single key write does get duplicated at each level as it's flushed down the tree. 

For random-access workloads larger than RAM, a fractal tree will have much better (lower) write amplification than a b-tree. For sequential or heavily focused workloads (e.g. high-α Pareto distributions), b-trees do very well (because they can accumulate many writes per page before needing to flush), but TokuDB has optimizations for these types of workloads ("promotion") and actually does quite well for those workloads as well, by skipping the upper levels and injecting messages close to the leaves of the tree. I am not aware of any LSM-tree with similar optimizations. 

This is a very long story to say: b-trees exhibit catastrophic write amplification with large page sizes. But write-optimized data structures like fractal trees and LSM-trees can afford large pages because their amortization benefits ameliorate the write amplification issues, and so reap the fragmentation benefits of large pages. 
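
To put toy numbers on that comparison (my illustrative assumptions, not TokuDB's actual cost analysis):

# Toy model of total write amplification for a random-access,
# larger-than-RAM workload. All numbers are illustrative assumptions.
def btree_amp(page_bytes, dirty_bytes):
    # Each key is written once, at the leaf, but the whole page is
    # rewritten for a small number of dirty bytes.
    return page_bytes / dirty_bytes

def fractal_amp(levels, node_bytes, batch_bytes):
    # Each key is rewritten once per level as it's flushed down, but
    # every flush applies a large batch of buffered changes at once.
    return levels * (node_bytes / batch_bytes)

print(btree_amp(1e6, 1e3))          # 1000.0 for a 1MB b-tree page
print(fractal_amp(4, 1e6, 250e3))   # 16.0: 4 levels, ~25% of each node changed per flush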

--
Cheers,
Leif

Zhiyuan Yang

Oct 9, 2015, 1:19:32 AM
to Leif Walsh, tokudb-user
Wow, this metric makes a lot of sense and clears up my confusion. Thank you so much!

Thanks!
Zhiyuan