Compaction


Matthieu Rakotojaona

Mar 8, 2012, 2:21:14 PM3/8/12
to us...@couchdb.apache.org
Hello everyone,

I discovered CouchDB a few months ago, and decided to dive in just
recently. I don't want to be long, but CouchDB is amazing. True offline
mode/replication, JSON over HTTP, MVCC, MapReduce and other concepts
widened my horizon of how to solve a problem, and I'm really grateful.

There is one point I find sad, though: the documentation available on
the interwebs is somewhat scarce. Sure, you can find your way because
CouchDB is so easy, but there's one point in particular that I found to
be especially undocumented: compaction.

Basically all I could find was:
* If you want to compact your db:
> POST /db/_compact
* If you want to compact the views of a design document:
> POST /db/_compact/designname
(which seems to say that you can only compact all your views at once
or none, but not a particular one)
* Automatic compaction is deliberately absent (apparently seen as
unneeded), and a number of people run it with cron jobs
* The real effect of a compaction (i.e. the space you are actually
going to reclaim) seems to be unknown to many people. Someone (I don't
remember your name, but thank you) came up with a patch to display the
data_size, which is the real size of your data on disk; this looks hackish.
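For concreteness, here's a rough sketch of what those two calls look like from Python's standard library (the server address and the db/design doc names are made up; as far as I know the _compact endpoints also want admin credentials and a Content-Type of application/json):

```python
# Sketch: triggering the compaction endpoints mentioned above.
# "mydb" and "myddoc" are placeholder names, not anything real.
import http.client

def compact_request(db, ddoc=None):
    """Return the (method, path) pair for a compaction request."""
    path = "/%s/_compact" % db
    if ddoc is not None:
        path += "/" + ddoc  # compacts all the views of that design doc
    return ("POST", path)

def trigger(conn, db, ddoc=None):
    """Fire the request on an open http.client.HTTPConnection."""
    method, path = compact_request(db, ddoc)
    conn.request(method, path, headers={"Content-Type": "application/json"})
    return conn.getresponse().status  # 202 when the task is accepted

# conn = http.client.HTTPConnection("localhost", 5984)
# trigger(conn, "mydb")            # compact the database
# trigger(conn, "mydb", "myddoc")  # compact that design doc's views
```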

And here comes the initial purpose of my mail. I just added a few
documents to my db (1.7M+) and found that the disk_size gives me ~2.5 GB,
while the data_size is around 660 MB. From what I read, a compaction is
supposed to leave you with data_size ~= disk_size; yet, after numerous
compactions, it doesn't shrink a bit.
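A quick way to quantify that, using the two fields from this message:

```python
# How much of the file is (in theory) reclaimable by compaction,
# from the disk_size and data_size fields of GET /db.
def fragmentation(disk_size, data_size):
    """Fraction of the file that is not live data."""
    return (disk_size - data_size) / float(disk_size)

# The numbers above: ~2.5 GB file, ~660 MB of live data,
# i.e. roughly three quarters of the file is overhead.
frac = fragmentation(2.5e9, 660e6)
```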

I suppose the problem is exactly the same with views; I'm building it at
the moment, so I will test it later.

I would also like to understand the process of compaction. All I could
see was:

1. CouchDB parses the entire DB, fetching only the last (or the few
last, depending on parameters) revision of each document
2. it assembles them in a db.compact.couch file
3. when finished, db.compact.couch replaces db.couch

So I wondered:
* Can you launch a compaction, halt it and continue it later?
* If yes, can you move the temporary db.compact.couch file somewhere
else and link to it so that CouchDB thinks nothing has changed?

Thank you,

--
Matthieu RAKOTOJAONA

Paul Davis

Mar 8, 2012, 4:39:53 PM3/8/12
to us...@couchdb.apache.org
On Thu, Mar 8, 2012 at 1:21 PM, Matthieu Rakotojaona
<matthieu.r...@gmail.com> wrote:
> Hello everyone,
>
> I discovered couchDB a few months ago, and decided to dive in just
> recently. I don't want to be long, but couchDB is Amazing. True offline
> mode/replication, JSON over HTTP, MVCC, MapReduce and other concepts
> widened my horizon of how to solve a problem, and I'm really grateful.
>
> There is a point though that I find sad : the documentation available on
> the interwebs are somewhat scarce. Sure you can find yourself because
> couchDB is so easy, but there's a particular point that I found to be
> especially undocumented : compaction.
>
> Basically all I could find was :
> * If you want to compact your db :
>        > POST /db/_compact
> * If you want to compact your design :
>        > POST /db/_compact/designname
>        (which seems to say that you can only compact all your views at once
>        or none, but not a particular one)

Slightly more specific: Compaction for views is done for all the views
in the specified design document. Also, view compaction is in general
much more efficient than database compaction.

> * Although specially designed like that, the absence of automatic
>  compaction is seen as unneeded, and a number of people run it with
>  cron jobs

There's an auto compactor in trunk now.

> * The real effect of a compaction (ie the real size you are going to
>        earn) seems to be unknown by many people. Someone (I don't remember
>        your name, but thank you) came with a patch to display the data_size,
>        which is the real size of your data on disk; this looks hackish.
>

Which part looks hackish?

> And the initial purpose to my mail comes here. I just added a few
> documents in my db (1.7M+) and found that the disk_size gives me ~2.5 GB,
> while the data_size is around 660 MB. From what I read, a compaction is
> supposed to leave you with data_size ~= disk_size; yet, after numerous
> compaction, it doesn't shrink a bit.
>

I bet you have random document ids which will indeed cause the
database file to end up with a significant amount of garbage left
after compaction. I'll describe why below.

> I suppose the problem is exactly the same with views; I'm building it at
> the moment, so I will test it later.
>

Technically yes, but in general no. More below.

> I also would like to understand the process of compaction. All I could
> see was :
>
> 1. couchdb parses the entire DB, fetching only the last (or the few
>         last, from parameters) revision of each document
> 2. it assembles them in a db.compact.couch file
> 3. when finished, db.compact.couch replaces db.couch
>

In broad strokes. Currently, CouchDB compacts like so:

1. Iterate over docs in order of the update_sequence
2. Read document from the id_btree
3. Write doc to both the update sequence and id indexes in the compaction file
4. When finished, delete the .couch file and rename .couch.compact -> .couch

It's a bit more complicated than that due to buffering of docs to
improve throughput and whatnot, but those are the important details.
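A toy model of those four steps (the data structures are stand-ins, not CouchDB's actual btrees):

```python
# Toy model of the compaction loop described above.
def compact(seq_index, id_index):
    """seq_index: list of (update_seq, doc_id); id_index: dict doc_id -> doc."""
    new_seq_index, new_id_index = [], {}
    for seq, doc_id in sorted(seq_index):    # 1. walk in update_seq order
        doc = id_index[doc_id]               # 2. one id-index lookup per doc
        new_seq_index.append((seq, doc_id))  # 3. write both indexes of the
        new_id_index[doc_id] = doc           #    .compact file together
    return new_seq_index, new_id_index       # 4. then rename over the old file
```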

The issue is twofold. First, reading the docs in order of the update
sequence and then fetching them using the id btree means we're
incurring a btree lookup per doc. There's a patch in BigCouch that
addresses this by duplicating a record in both trees. It's been shown
to yield significant speedups for both compaction and replication at
the expense of storing more data (basically it has two copies of the
revision tree, but importantly does not duplicate the actual JSON body
of the document). While not directly size related in itself, it leads
us to the second issue.

Namely, writing both indexes simultaneously introduces garbage into
the .compact file if the order of document ids in the update_seq is
random. I.e., if you wrote the same documents to one database with ids
that were monotonically increasing (say, "%0.20d" % i) and to another
with random document ids and then compacted both, the random ids would
use significantly more disk space after compaction (as well as take
longer to compact).

The issue here is that when we update the id tree with random doc ids
we end up rewriting more of the internal nodes (append-only storage),
which causes more garbage to accumulate. All hope is not lost, though.
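A little simulation of that effect, with a deliberately crude model of an append-only index (batched flushes, fixed-size pages; none of this is CouchDB's real file layout):

```python
import random

def page_writes(ids, batch=100, page=50):
    """Toy append-only id index: inserts are flushed in batches, and each
    flush appends a fresh copy of every page (a run of `page` adjacent
    keys) it touches. More page writes means more garbage in the file."""
    live = []                                    # all keys seen, sorted
    writes = 0
    for i in range(0, len(ids), batch):
        chunk = ids[i:i + batch]
        live = sorted(set(live) | set(chunk))
        rank = {k: r for r, k in enumerate(live)}
        writes += len({rank[k] // page for k in chunk})  # pages touched
    return writes

random.seed(1)
seq_ids = ["%020d" % i for i in range(2000)]     # like "%0.20d" % i above
rnd_ids = list(seq_ids)
random.shuffle(rnd_ids)
# sequential ids touch ~2 pages per flush; random ids touch dozens
```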

There's a second set of two patches in BigCouch that I wrote to
address this specifically. The first patch changes the compactor to
use a temporary file for the id btree. Then just before compaction
finishes, this tree is streamed back into the .compact file (in sorted
order so that internal garbage is minimized). This helps tremendously
for databases with random document ids (sorted ids are already
~optimal for this scheme). The second patch in the set uses an
external merge sort on the temporary file which helps speed up the
compaction.
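In sketch form, with Python's heapq.merge standing in for the external merge sort and in-memory lists standing in for the temporary files:

```python
import heapq

def sorted_runs(pairs, run_size=1000):
    """Spill the input into individually sorted runs (temp files in the real thing)."""
    return [sorted(pairs[i:i + run_size]) for i in range(0, len(pairs), run_size)]

def stream_id_index(pairs, run_size=1000):
    """K-way merge of the runs, yielding (doc_id, doc) pairs in sorted order
    so the id tree can be streamed into the .compact file with minimal
    internal garbage."""
    return list(heapq.merge(*sorted_runs(pairs, run_size)))
```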

Depending on the dataset these improvements can have massive gains for
post-compaction data sizes as well as time required for compaction. I
plan on pulling these back into CouchDB in the coming months as we
work on merging BigCouch back into CouchDB so hopefully by end of
summer they'll be in master for everyone to enjoy.

As to views, they don't really require these improvements because
their indexes are always streamed in sorted order. So it's both fast
and close-ish to optimal. Although somewhere I had a patch that
changed the index builds to be actually optimal, based on ideas from
Filipe, but as I recall it wasn't a super huge win so I didn't actually
commit it.

> So I wondered :
> * Can you launch a compaction, halt it and continue it later ?

While you can resume a compaction, there's no API for pausing or
canceling one. There's actually a really neat way in Erlang to do
this that we've occasionally mentioned adding to the active tasks API,
but no one has gotten around to it.

> * If yes, can you move the temporary db.compact.couch file somewhere
>        else and link to it so that couchdb thinks nothing has changed ?
>

I'm not sure what you mean here.

Filipe David Manana

Mar 10, 2012, 1:05:51 PM3/10/12
to us...@couchdb.apache.org
On Thu, Mar 8, 2012 at 9:39 PM, Paul Davis <paul.jos...@gmail.com> wrote:
>
> There's a second set of two patches in BigCouch that I wrote to
> address this specifically. The first patch changes the compactor to
> use a temporary file for the id btree. Then just before compaction
> finishes, this tree is streamed back into the .compact file (in sorted
> order so that internal garbage is minimized). This helps tremendously
> for databases with random document ids (sorted ids are already
> ~optimal for this scheme). The second patch in the set uses an
> external merge sort on the temporary file which helps speed up the
> compaction.
>
> Depending on the dataset these improvements can have massive gains for
> post-compaction data sizes as well as time required for compaction. I
> plan on pulling these back into CouchDB in the coming months as we
> work on merging BigCouch back into CouchDB so hopefully by end of
> summer they'll be in master for everyone to enjoy.
>
> As to views, they don't really require these improvements because
> their indexes are always streamed in sorted order. So it's both fast
> and close-ish to optimal. Although somewhere I had a patch that
> changed the index builds to be actually optimal based on ideas from
> Filipe but as I recall it wasn't a super huge win so I didn't actually
> commit it.

Yes, about half a year ago I wrote some code to build btrees bottom up
into a new file while folding them from the source file (see [1]).
This ensures the final btree has a fragmentation of 0%, besides
speeding up the compaction process (not always faster, but at least
it's never slower than the old approach).
It's been used in Couchbase for view compaction since then and has
been working perfectly fine.
I haven't adapted it to CouchDB's code yet.

As for databases, we use it together with a temporary file and
external disk sorting (Erlang's file_sorter module) as well (see [2]).
Maybe it's exactly the same approach as you mentioned; however, our
file format is very different from CouchDB's. Besides guaranteeing
0% fragmentation, it's also much faster for the random-IDs case.

[1] - https://github.com/fdmanana/couchdb/commit/45a2956e0534c853d58169d7fd2cea23b3978c03

[2] - https://github.com/couchbase/couchdb/commit/f4f62ac6
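For readers following along, a very rough Python sketch of the bottom-up idea (the node layout is invented; the real code is in the commits above):

```python
def build_bottom_up(sorted_keys, fanout=3):
    """Pack full leaves from the already-sorted stream, then build each
    parent level from the level below; no node is ever rewritten, so the
    finished tree carries 0% internal fragmentation."""
    if not sorted_keys:
        return []
    level = [sorted_keys[i:i + fanout] for i in range(0, len(sorted_keys), fanout)]
    while len(level) > 1:  # build parents over the previous level
        level = [level[i:i + fanout] for i in range(0, len(level), fanout)]
    return level[0]
```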

--
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."

Matthieu Rakotojaona

Mar 10, 2012, 2:01:40 PM3/10/12
to us...@couchdb.apache.org
Hello,

Wow, thank you for the very comprehensive answer.

On Thu, Mar 8, 2012 at 10:39 PM, Paul Davis <paul.jos...@gmail.com> wrote:
>> And the initial purpose to my mail comes here. I just added a few
>> documents in my db (1.7M+) and found that the disk_size gives me ~2.5 GB,
>> while the data_size is around 660 MB. From what I read, a compaction is
>> supposed to leave you with data_size ~= disk_size; yet, after numerous
>> compaction, it doesn't shrink a bit.
>>
>
> I bet you have random document ids which will indeed cause the
> database file to end up with a significant amount of garbage left
> after compaction. I'll describe why below.

Yup. I already had my own ids, but they were not ordered as I read
through the file. Now that CouchDB stores my rows with its own
generated IDs (with the 'sequential' algorithm), the new size of my
whole DB shrank down to 500 MB. Very neat.
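(For reference, that id algorithm is chosen in the server config, e.g. in local.ini:)

```ini
; local.ini — have CouchDB generate monotonically increasing doc ids
[uuids]
algorithm = sequential
```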

>> * If yes, can you move the temporary db.compact.couch file somewhere
>>        else and link to it so that couchdb thinks nothing has changed ?
>>
>
> I'm not sure what you mean here.

In case I see that I will run out of storage space, like what happened
to me, I would like the .compact file to be created and used on another
disk, but I didn't see this in the config file. So I thought something
like this would do the trick:

1. Launch compaction
2. Pause it (actually, stop the server for now)
3. Move the .compact created file somewhere else, and symlink to it
4. Continue compaction

This flow could also be useful if we want to use an SSD to do a
(faster) compaction, later writing the DB back to a classic HDD.

I resorted to mounting a directory from my data disk onto
/var/lib/couchdb, which I'm not really proud of.

--
Matthieu RAKOTOJAONA

Paul Davis

Mar 10, 2012, 2:25:51 PM3/10/12
to us...@couchdb.apache.org

On one hand this makes a lot of sense; on the other, though, it might
cause a bit of an issue for people. If we allow people to specify a
different directory that ends up on a different disk, then the atomic
rename that we rely on becomes a possibly quite lengthy copy between
two devices. Since this swap is serialized in the couch_db_updater
code, it would render a database unresponsive to any traffic during
that possibly lengthy copy. It's possible that we could have a two-step
process, but that would require a bit more trickery and I'm not sure
it'd be worth it in the general case.

Matthieu Rakotojaona

Mar 10, 2012, 2:58:43 PM3/10/12
to us...@couchdb.apache.org

Ok, I see what you mean. This kind of modification would be useful for
admins, as it speeds up the whole process, but it poses the risk of a
moment of unavailability for the DB users.

Thanks for the explanations!

--
Matthieu RAKOTOJAONA
