Reducing blobserver fsync() calls?


Theodore Ts'o

Oct 2, 2016, 2:48:31 PM
to Camlistore
Hi,

I was noticing that it was taking "a while" to do a test backup of 76 GB, so I started digging into why, and it appears that the blobserver issues an fsync() call after each object is received from the network, as well as after appending each blob to a pack file. This might make sense if we were using the blobserver as, say, a back-end store for a mail server, where you want to make sure you won't lose an object, even after a power failure, before you send that SMTP 250 code. But if you're doing a backup of millions and millions of objects, those fsync's are going to be expensive. And if you do crash, well, we can just restart the backup, so making sure the bytes are solidly on iron oxide one object at a time seems a bit wasteful. (Especially since if you crash before the permanode is written, the client is going to have to restart the whole backup from scratch anyway.)

It would be fairly easy to add a config parameter which turns off fsync's entirely, or defers them until the server has gone idle or until N seconds have passed, whichever comes first. But it occurs to me that might not be the best way to do things. Would it make more sense if there were some way for the client to tell the server, "everything coming down this HTTP/2 link doesn't need to be treated as 'precious'"? Then backups could be treated one way, while other Camlistore clients that need more careful treatment of their data could still get it.

I'm sure I could do the first without a huge amount of difficulty[1], but if the second is more likely to be accepted as a code contribution, some advice about how to design such a protocol enhancement, and how to code it up, would be greatly appreciated.

[1]  Even if it would be "Ted's second non-trivial Go code patch".  :-)

Cheers,

-- Ted

Theodore Ts'o

Oct 2, 2016, 3:21:29 PM
to Camlistore


On Sunday, October 2, 2016 at 2:48:31 PM UTC-4, Theodore Ts'o wrote:
 (Especially since if you crash before the permanode is written, the client is going to have to restart the whole backup from scratch anyway.)

One thought --- as an automated heuristic, if the blobserver receives a stream of unsigned blobs, it doesn't need to fsync() them. After all, any objects which aren't referenced by a permanode are subject to GC treatment. So if you crash and then run a GC, any immutable, non-signed objects that were uploaded just before the crash would be GC'ed anyway. Hence, there's no point in treating them as precious objects that have to be fsync'ed before the client upload is acknowledged. So what could be done is: when the first signed object is received, the blobserver could issue a sync(2) call, and then write all of the signed objects using fsync(2).

If we did this, the next obvious optimization would be to tune the writeback interval for the disk in question to be 2-3 minutes, instead of the usual 30 seconds. I noticed that objects were getting written as loose files, and then repacked into pack files approximately every 2 minutes or so. All modern file systems do delayed allocation, which means that if we're not fsync'ing the loose files, they won't get flushed to disk; so if they are written into the pack file and then deleted within the writeback interval, the loose files will never get written to disk at all. This would double Camlistore's effective write throughput to the disk, since we wouldn't be writing each byte being backed up twice --- once to the loose file, and a second time to the pack file.

Cheers,

- Ted

P.S.  I assume there are good reasons why we can't just stream the objects straight to the pack file, which is what git does?    I noticed there were some comments about wanting to rearrange the objects so they would be in an optimal order for later access.  Is that right?

Mathieu Lonjaret

Oct 3, 2016, 10:52:39 AM
to camli...@googlegroups.com
Leaving the fsync question aside for now (I'd need to think about it
some more, and I'm hoping Brad will reply in the meantime anyway), and
answering your question about why we don't pack all incoming blobs
directly:
packed blobs are supposed to help with sequential access of files, so
they're basically a re-assembling of all the small blobs of a file
into one (or more, if needed) zip. For, I think, efficiency reasons,
blobs/files which are under 512 bytes in size do not get stored in a
pack, so they stay forever in the loose blobserver. That is at least
one reason why you can't just stream all blobs directly to the packed
blobserver.

Brad Fitzpatrick

Oct 3, 2016, 12:26:36 PM
to camli...@googlegroups.com
I've actually been thinking that sync should be an explicit part of the protocol so higher levels can decide the atomicity that they require.

Then we make everything async by default, but all blob storage implementations must support a sync (or "Flush"?) operation. And then camput and other tools would be sure to do a sync at the end before they return success. Or maybe they even get a flag (defaulting to --sync=true?) to let the caller control it.

Thoughts? And on naming?


--
You received this message because you are subscribed to the Google Groups "Camlistore" group.
To unsubscribe from this group and stop receiving emails from it, send an email to camlistore+unsubscribe@googlegroups.com.

zimbatm

Oct 3, 2016, 1:51:48 PM
to camli...@googlegroups.com

Hi,

Wouldn't the blob commit state also have to be shared in case two clients upload the same blob at the same time? Otherwise one client might upload a tree of blobs and lose the subset that's been uploaded by another client.



clive boulton

Oct 3, 2016, 2:17:24 PM
to camli...@googlegroups.com
Handling two (or more) clients: writes append to a log but are marked as speculative?

c/o Tango: Distributed Data Structures over a Shared Log - Mahesh Balakrishnan et al. 
http://www.cs.cornell.edu/~taozou/sosp13/tangososp.pdf

 


Theodore Tso

Oct 3, 2016, 2:39:31 PM
to camli...@googlegroups.com
> For, I think, efficiency reasons,
> blobs/files which are under 512 bytes in size do not get stored in a
> pack, so they stay forever in the loose blobserver.

I'm not sure I understand the efficiency argument. Certainly from a space-efficiency perspective, it would be much preferable to store it in a pack --- otherwise we would be wasting at least 7.5k per small object. From a performance-efficiency perspective, it would depend on whether the small file was "hot" or not, I suppose. But I can't think of any reason why a small file would automatically be hotter than a large one. And even if it is hot, we could cache it after the first access.

Certainly if we are doing a backup, the file will be cold, cold, cold, and in the case of a symlink tree, we could have a huge number of smallish blobs that would really do better packed into a packfile.

Cheers,

- Ted




Theodore Tso

Oct 3, 2016, 2:41:01 PM
to camli...@googlegroups.com
Or maybe as an optional argument to the put operation that requests whether or not the blob write should be flushed, with a config option to define what happens in the default case when the caller doesn't specify one way or the other?

- Ted

Mathieu Lonjaret

Oct 3, 2016, 6:32:13 PM
to camli...@googlegroups.com
Are you saying that instead of adding a Sync/Flush method to the
blobserver.Storage interface (which is what I believe Brad is
proposing), you'd add an argument to the blobserver.ReceiveBlob
method?

Or were you talking more specifically about how it would translate for
higher level tools like camput?

Theodore Ts'o

Oct 4, 2016, 12:10:23 AM
to camli...@googlegroups.com
On Tue, Oct 04, 2016 at 12:31:52AM +0200, Mathieu Lonjaret wrote:
> Are you saying that instead of adding a Sync/Flush method to the
> blobserver.Storage interface (which is what I believe Brad is
> proposing), you'd add an argument to the blobserver.ReceiveBlob
> method?

I'm suggesting that we do both things. If the client has a way of
explicitly asking for the fsync() to be skipped while it is uploading
objects to the server, the client might also want to have a way of
asking the server, "please issue a sync(2) or otherwise make sure
everything I sent w/o the fsync being requested has been flushed to
disk".

This gives full control to the client, which I think is what Brad was
suggesting. The server could also have a config option which
describes what should happen if the client doesn't say explicitly one
way or another.

> Or were you talking more specifically about how it would translate for
> higher level tools like camput?

I wasn't thinking about that at all. My initial thoughts are that
camput should have command line options that would allow the user (or
shell script) to specify the behavior, again with perhaps some
defaults that could be controlled by a client config.

I have started thinking, however, that if we want Camlistore to have
proper full-backup functionality which is comparable with other
backup solutions (even simple backup solutions like "rsync" and "tar",
never mind more advanced tools like Areca and Bacula), we will need to
make a program separate from camput, and maybe it's better to start
sooner rather than later.

Maybe it would start as a fork of camput, but then I would want to add
different defaults, add exclude / exclude-file support, maybe have the
ability to start camlistore servers on private ports to deal with
backups to externally mounted USB drives (which aren't always
connected to the laptop), etc. I'd also want to have automated tags
so the user isn't having to manually specify a large number of useful
tags (backup, hostname:callcc, backup-config:homedir, etc.)

But that's a whole separate discussion; my main concern with this mail
thread was that the blobserver is writing every byte being backed up
twice, with fsync()s to guarantee that we are wasting a huge amount of
disk write bandwidth (as well as flash write endurance on SSDs), and
in the backup use case it's really not adding any value.

Hence, it would be useful to update the protocol as Brad suggested,
since that's needed as a prerequisite for non-embarrassing
performance numbers when compared to other backup solutions. Whether
the initial implementation is done as a shell script on top of camput,
or as a clone and specialization of camput implemented in Go, is
really a separate issue.

Cheers,

- Ted

Brad Fitzpatrick

Dec 29, 2017, 3:53:28 PM
to camli...@googlegroups.com
Following up a year later, ...


I'd like to work on this soon-ish.

