Thoughts about compact in wiredtiger

61 views
Skip to first unread message

bai...@gmail.com

unread,
Jan 11, 2022, 10:04:44 PM1/11/22
to wiredtiger-users
Hi, all.

Lately I'm learning how compact works in WiredTiger. I've read https://source.wiredtiger.com/develop/arch-compact.html and session_compact.c 's leading comments. After that I feel I've known something,  and don't have a clear picture in my mind.

What I do know is compact, similar to autovacuum in sqlite, can shrink the file and they seem to have a very different approach, although they all truncate the file from end.

For sqlite, db maintains the freelist of unused pages of the file. When transaction commits, sqlite will try to move valid pages from the end of file to free pages, then truncating the file. Of course, the parent's pointer to relocated pages needs to be updated. Sounds simple. (Some references:

But for compact in WiredTiger, it seems quite complex, connected with btree layer, block manager, and checkpoints. 

I'm trying to ask some specific questions:

1. why does the action TRUNCATE (`__wt_block_extlist_truncate`) happen inside the checkpoint path( `__wt_block_checkpoint` -> `__ckpt_process`) instead of placed after the third checkpoint in `__compact_worker` ? After all, it is COMPACT not CHECKPOINT meant to truncate the file acoording to API semantics, right ?
( regular CHECKPOINT also truncate file ? )

2. Maybe the avail list in WiredTiger resembles the freelist pages in sqlite ?
sqlite chooses to shrink file at the commit of transaction whereas wiredtiger uses CHECKPOINT
to control the release of free blocks (maybe because the end blocks are referenced by older checkpoints
or named checkpoints, so they cann't be free)

3. What is the fundamental reason that wiredtiger makes compaction (vacuum in other db term)
dependent on CHECKPOINT ?

Keith Bostic

unread,
Jan 12, 2022, 12:31:08 PM1/12/22
to wiredtiger-users
On Tuesday, January 11, 2022 at 7:04:44 PM UTC-8 bai...@gmail.com wrote:
 
For sqlite, db maintains the freelist of unused pages of the file. When transaction commits, sqlite will try to move valid pages from the end of file to free pages, then truncating the file. Of course, the parent's pointer to relocated pages needs to be updated.

WiredTiger also maintains a freelist of unused pages, the trickiness is figuring out when a page can be added to the freelist. A page in a WiredTiger file can be referenced by some number of checkpoints, that is, a block that is unchanged between serial checkpoints will be referenced by all of them, and blocks don't become free until all of the checkpoints that reference them have been discarded. This makes compaction relatively difficult: if you want to rewrite a block, you have to figure out the set of parent pages (one per checkpoint that references the block), and update all of them.
 
But for compact in WiredTiger, it seems quite complex, connected with btree layer, block manager, and checkpoints.

Yes; WiredTiger was intended to support multiple block managers and there's layering for that reason. 
 
I'm trying to ask some specific questions:

1. why does the action TRUNCATE (`__wt_block_extlist_truncate`) happen inside the checkpoint path( `__wt_block_checkpoint` -> `__ckpt_process`) instead of placed after the third checkpoint in `__compact_worker` ? After all, it is COMPACT not CHECKPOINT meant to truncate the file according to API semantics, right ?
( regular CHECKPOINT also truncate file ? )

The problem is recovery. WiredTiger is no-overwrite, so if you rewrite a leaf page you have to rewrite its parent, and it's parent's parent and so on up to the root. If you crash after rewriting the leaf but before the root has been rewritten, everything has to be put back together, and a checkpoint is the mechanism WiredTiger uses for doing that. Compaction doesn't know anything about checkpoints, so all it can do is rewrite a leaf page and then let the checkpoint mechanism deal with writing everything that is necessary so the tree can be recovered after failure.
 
2. Maybe the avail list in WiredTiger resembles the freelist pages in sqlite ?
sqlite chooses to shrink file at the commit of transaction whereas wiredtiger uses CHECKPOINT
to control the release of free blocks (maybe because the end blocks are referenced by older checkpoints
or named checkpoints, so they cannot be free)

Yes, the avail list is the list of pages currently available for use.

I know nothing about sqlite, so I can't really comment. For WiredTiger at least, commit in a logged system doesn't do anything other than flush the write-ahead-log, for performance reasons it doesn't necessarily write any pages at all into the data file, so there's no possibility of shrinking the data file at that time.

And, yes, as you said, even if a block is removed in one checkpoint doesn't mean that it's not still in use in a different checkpoint.
 
3. What is the fundamental reason that wiredtiger makes compaction (vacuum in other db term)
dependent on CHECKPOINT ?

I think the fundamental reason is because WiredTiger's mechanism for recovery is checkpoint. All compaction can do is shuffle blocks around, checkpoint has to run before those changes can be considered durable. 
Reply all
Reply to author
Forward
0 new messages