Improving fsck / par2 multicore behaviour


Alex Selby

Jun 3, 2014, 10:38:10 PM
to bup-...@googlegroups.com
Hi Everyone,

I've just been experimenting with bup for the first time. It looks
very nice and as far as I can see offers something not available
elsewhere. I have a few comments/questions. (I'll put them in
separate threads for simplicity.)

First on "fsck -g" / par2 speed:

The implementation of par2 is (surprisingly) CPU bound.  At least
it is on my (pretty normal, modern CPU) hardware.  I.e., the disk
can throw packfiles at par2 much (10x) faster than par2 can compute
checksums. Of course this is nothing to do with bup itself: bup is
just using the external par2 program, which appears to run rather
slowly.

Probably par2 could be replaced with something better, but of
course the no-effort way to improve performance is to use more
cores.  There is a comment in 'bup help fsck' under -j warning that
if you add too many cores, the disk ends up seeking too much, as
indeed it does.

As the par2 and git verify-pack processes simultaneously try to
fill their buffers, the disk continually jumps backwards and
forwards as it loads a bit of each packfile. But this isn't
necessary because the processes for each packfile don't interact
with each other.  If it could be arranged so that one par2 process
has exclusive use of the disk and does a big sequential load while
the others are computing, then it would all run much faster and you
could use as many cores as you like.

But I'm not sure there is a way to directly control the input
buffer size for par2 and git verify-pack. They seem (on my OS,
anyway) to be reading from the disk in 128k chunks (probably set
by stdio or the device readahead or something), which is too
small. Ideally this buffer should be large compared with
(sequential disk speed)*(seek time) ~= (for me) 140MB/s*5ms =
0.7MB, so a few MB would be good. But you presumably don't want to
complicate bup by bundling patched versions of git and par2, so you
don't have much control here.
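
For concreteness, the break-even arithmetic can be sketched as a toy
model (the 140MB/s and 5ms figures are the ones measured above; the
model ignores OS caching and readahead):

```python
def seek_overhead(chunk_bytes, seq_speed=140e6, seek_time=5e-3):
    """Fraction of disk time lost to seeking when interleaved readers
    force one seek per chunk: each chunk costs seek_time plus its
    sequential transfer time (chunk_bytes / seq_speed)."""
    transfer = chunk_bytes / seq_speed
    return seek_time / (seek_time + transfer)

print(seek_overhead(128 * 1024))   # 128k chunks: the disk mostly seeks
print(seek_overhead(4 * 1024**2))  # 4MB chunks: mostly sequential transfer
```

With 128k chunks the model says well over half the disk's time goes to
seeking, while a few-MB buffer pushes the seek share below 20%, which
is the motivation for the numbers above.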

You can improve the par2 behaviour by using the option -m. -m128
works quite well for bup, since this corresponds to being able to
contain all of the output in RAM for the parameters set by bup (10%
of 1GB).

But really it's more complicated to analyse because the OS caches
disk reads. So after the git verify-pack pass from fsck -g
(assuming this wasn't disabled), the packfile will probably be in
the OS's filecache and the subsequent par2 will go smoothly. If the
par2 processes finish at different times and if there aren't too
many simultaneous processes, then the next git verify-pack might
have exclusive use of the disk and run smoothly (sequentially), but
if you're unlucky it may clash with another git verify-pack.

Anyway, sorry to write so much. All I really wanted to say was that
a couple of simple tweaks improved performance for me, and I think
will probably improve performance on most setups involving non-SSD
hard disks, and hopefully will never slow things down very much.

The -m128 option is added to par2, and the controlling program
prereads the file at a time when it probably has exclusive use of
the disk, so that the OS will hopefully cache the file. This works
provided the number of processes is no more than about half the
memory size in GB (because the packfiles are up to 1GB).

Alex


diff --git a/cmd/fsck-cmd.py b/cmd/fsck-cmd.py
index 794c755..2505ae0 100755
--- a/cmd/fsck-cmd.py
+++ b/cmd/fsck-cmd.py
@@ -44,7 +44,7 @@ def parv(lvl):
         return ['-qq']
 
 def par2_generate(base):
-    return run(['par2', 'create', '-n1', '-c200'] + parv(2)
+    return run(['par2', 'create', '-n1', '-c200', '-m128'] + parv(2)
                + ['--', base, base+'.pack', base+'.idx'])
 
 def par2_verify(base):
@@ -190,6 +190,9 @@ for name in extra:
                 del outstanding[pid]
                 code = code or nc
                 count += 1
+        f=open(name)
+        for b in chunkyreader(f): pass
+        f.close()
         pid = os.fork()
         if pid:  # parent
             outstanding[pid] = 1

Rob Browning

Jun 5, 2014, 5:10:26 PM
to Alex Selby, bup-...@googlegroups.com
Alex Selby <aselb...@gmail.com> writes:

> Anyway, sorry to write so much. All I really wanted to say was that
> a couple of simple tweaks improved performance for me, and I think
> will probably improve performance on most setups involving non-SSD
> hard disks, and hopefully will never slow things down very much.
>
> The -m128 option is added to par2, and the controlling program
> prereads the file at a time when it probably has exclusive use of
> the disk, so that the OS will hopefully cache the file. This works
> provided the number of processes is no more than about half the
> memory size in GB (because the packfiles are up to 1GB).

Quite interesting, though I wonder how it will affect really low RAM
hosts (people seem to be using bup on inexpensive NAS machines, for
example). There, if nothing else, I wonder if you could get in a
situation where the preload actually hurts.

Though past a certain point, bup just isn't friendly to really limited
RAM right now.

Oh, and did you by any chance take a look at the par2 source to see what
the -m option actually does?

And without having thought too hard about it yet -- I wondered how well
it might work to have fsck use non-blocking writes to par2 input pipes,
then fsck could control the read-parallelism (if any), and hopefully
keep all the par2 processes' pipes full. There, I assume fsck would be
select()ing for write on all the pipes, and maintaining some buffered
data.

Though if par2 can't accept data over a pipe, that'd also require either
changes to par2, or implementing something like bash's "<(foo)" process
substitution using named pipes -- which is getting reasonably complex.
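
For what it's worth, the select()-based feeding might look roughly
like this in Python (a sketch only, under the assumption that the
consumers read from stdin, which, as noted, stock par2 does not; `cat`
makes a convenient stand-in consumer):

```python
import os
import selectors
import subprocess

def feed_all(cmds_and_data):
    """Launch one consumer per (cmd, data) pair and multiplex
    non-blocking writes to their stdin pipes with select(), so a
    single controlling process owns the read/write scheduling."""
    sel = selectors.DefaultSelector()
    procs = []
    for cmd, data in cmds_and_data:
        p = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                             stdout=subprocess.DEVNULL)
        os.set_blocking(p.stdin.fileno(), False)
        sel.register(p.stdin.fileno(), selectors.EVENT_WRITE,
                     (p, memoryview(data)))
        procs.append(p)
    while sel.get_map():
        for key, _ in sel.select():
            p, buf = key.data
            try:
                n = os.write(key.fd, buf)  # partial writes are normal
            except BlockingIOError:
                continue
            buf = buf[n:]
            if buf:
                sel.modify(key.fd, selectors.EVENT_WRITE, (p, buf))
            else:
                sel.unregister(key.fd)
                p.stdin.close()  # EOF tells the consumer we're done
    return [p.wait() for p in procs]
```

A real version would also have to refill each buffer from disk in
controlled sequential bursts rather than holding whole files in memory.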

Oh, and I believe there may be an active par2 upstream now, if you
decide to poke around there.

Thanks
--
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4

Alex Selby

Jun 7, 2014, 11:37:25 PM
to bup-...@googlegroups.com, aselb...@gmail.com


On Thursday, 5 June 2014 22:10:26 UTC+1, Rob Browning wrote:
Alex Selby <aselb...@gmail.com> writes:

> Anyway, sorry to write so much. All I really wanted to say was that
> a couple of simple tweaks improved performance for me, and I think
> will probably improve performance on most setups involving non-SSD
> hard disks, and hopefully will never slow things down very much.
>
> The -m128 option is added to par2, and the controlling program
> prereads the file at a time when it probably has exclusive use of
> the disk, so that the OS will hopefully cache the file. This works
> provided the number of processes is no more than about half the
> memory size in GB (because the packfiles are up to 1GB).

Quite interesting, though I wonder how it will affect really low RAM
hosts (people seem to be using bup on inexpensive NAS machines, for
example).  There, if nothing else, I wonder if you could get in a
situation where the preload actually hurts.

Though past a certain point, bup just isn't friendly to really limited
RAM right now.

That's a good point. I was assuming that any computer with multiple
cores is likely to be a fairly modern one with at least 0.5GB spare RAM
per core, but I guess that isn't always so. (The preloading could be
made to operate only if more than one core is selected with the -j
option, so older single-core systems wouldn't be affected.)

Probably the worst case for preloading would be (i) not enough RAM to
cache the input, (ii) using an SSD with low seek times (so multicore
would have worked fine without any preloading), (iii) not amazingly
fast sequential read speed, and (iv) several available cores. Even in this
case, the overhead may only be 20% or so, because par2 is sooooo slow
that it still dominates the (sequential) disk time.

 

Oh, and did you by any chance take a look at the par2 source to see what
the -m option actually does?

I didn't look at it properly, but looking at the messages par2 produces
suggests that this option is mainly affecting par2's output
buffer. Possibly if it knows it can cache its entire output then it
reads the input in one go, though that's a guess.


And without having thought too hard about it yet -- I wondered how well
it might work to have fsck use non-blocking writes to par2 input pipes,
then fsck could control the read-parallelism (if any), and hopefully
keep all the par2 processes' pipes full.  There, I assume fsck would be
select()ing for write on all the pipes, and maintaining some buffered
data.

Though if par2 can't accept data over a pipe, that'd also require either
changes to par2, or implementing something like bash's "<(foo)" process
substitution using named pipes -- which is getting reasonably complex.


It's a nice idea to use fsck to control it. We really want par2 to be
multithreaded in CPU but single-threaded in disk I/O, and that would be
a way to do it. Unfortunately it doesn't look like par2 accepts input
pipes, which as you say makes things messy.

I also tried this (was it what you meant?):

$ mkfifo door
$ (cat input > door) & par2 create -c200 -n1 output door

But par2 doesn't seem to like named pipes as input either as it just
immediately exits. Maybe it likes to be able to seek its input.

I can't think of a way to do it simply and properly if we treat par2
as a black-box program. But if the solution becomes too complicated,
then it would probably make more sense to look into a better replacement
for par2.

A while ago, it was suggested that we use zfec instead of (or as well as) par2.
(https://groups.google.com/d/msg/bup-list/phdIVXCHygc/5nXtKpIk2c8J)
python-zfec is now in Debian stable, so it would be quite convenient to
use, and it is certainly faster.

I think it may have worse error-recovery properties though, because it
uses a smaller finite field than par2. That means (as far as I can see)
that for a given size of parity file, zfec can't cope with as many
randomly-distributed errors as par2. But zfec and par2 should do equally
well with contiguous errors. So when comparing speeds, it may be fairer
to increase the overhead for zfec to compensate for this.

On the other hand, I'd guess the current 10% overhead (100MB parity file
for a 1GB pack file) gives plenty of error recovery. This (using par2)
should cope with up to 100MB of errors (in the best case), or a
randomly-distributed 6% of its ~450k blocks being bad which is quite a
big error rate. You can imagine a hard disk might rarely have a bit
error affecting a single block, possibly due to a RAM error, or it may
die altogether, but I don't know what kind of fault would
lead to 6% of blocks going bad. (So maybe there is a case for lowering
the current overhead from 10%.)

 
Oh, and I believe there may be an active par2 upstream now, if you
decide to poke around there.

I was mainly looking for a simple improvement which didn't involve
bringing par2 into bup, but for the record can you say which is the
active par2 now? I had a quick look but wasn't sure.

Thanks,

Alex

Rob Browning

Jun 8, 2014, 3:38:52 PM
to Alex Selby, bup-...@googlegroups.com
Alex Selby <aselb...@gmail.com> writes:

> I can't think of a way to do it simply and properly if we treat
> par2 as a black-box program. But if the solution becomes too
> complicated, then it would probably make more sense to look into a
> better replacement for par2.

...or working on par2 itself. And I'd much rather help improve par2 if
the alternative turns out to be kludges inside bup. I'd probably poke
around there next.

> I was mainly looking for a simple improvement which didn't involve
> bringing par2 into bup

If we helped improve par2, I probably wouldn't bring par2 in to bup,
just note the newer version in our README.md and support it when it's
available.

> but for the record can you say which is the active par2 now? I had a
> quick look but wasn't sure.

You may want to double check, but from an earlier thread
(cf. "Corruption to PAR2 files can go unobserved",
http://mid.gmane.org/535E7793...@oern.de), I believe it's probably
here:

https://github.com/BlackIkeEagle/par2cmdline

Thanks again

Alex Selby

Jun 8, 2014, 9:35:46 PM
to bup-...@googlegroups.com, aselb...@gmail.com


On Sunday, 8 June 2014 20:38:52 UTC+1, Rob Browning wrote:
Alex Selby <aselb...@gmail.com> writes:

> I can't think of a way to do it simply and properly if we treat
> par2 as a black-box program. But if the solution becomes too
> complicated, then it would probably make more sense to look into a
> better replacement for par2.

...or working on par2 itself.  And I'd much rather help improve par2 if
the alternative turns out to be kludges inside bup.  I'd probably poke
around there next.

If the github link below is the upstream of the current Debian package, and
if it's being actively developed, then that sounds useful.

(Though it may be tricky to get away without using memory equal to the
size of the input file if par2 wants to process its input file in stripes.)


> I was mainly looking for a simple improvement which didn't involve
> bringing par2 into bup

If we helped improve par2, I probably wouldn't bring par2 in to bup,
just note the newer version in our README.md and support it when it's
available.

> but for the record can you say which is the active par2 now? I had a
> quick look but wasn't sure.

You may want to double check, but from an earlier thread
(cf. "Corruption to PAR2 files can go unobserved",
http://mid.gmane.org/535E7793.605...@oern.de), I believe it's probably
here:

  https://github.com/BlackIkeEagle/par2cmdline
 

I looked into par2 and zfec a bit more. This is as I understand it:

zfec uses GF(2^8), which means that it can support up to 256 blocks
in total (data blocks + recovery blocks). A typical 10% overhead
configuration might be 230 data blocks + 23 recovery blocks. (A block
is like a disk in a RAID array.)

par2 uses GF(2^16) which means that it could support up to 65536 data
blocks, though I think it is restricted to 32768 in practice. The 10%
overhead configuration currently used by bup fsck -g is 2000 data blocks
+ 200 recovery blocks.

In principle this should mean that for a given size of parity check file
(10% of 1GB = 100MB in the case of bup packfiles), par2 should have
better error-recovery performance than zfec for randomly distributed
bitflip errors, though similar performance for contiguous errors
(like an entire disk in a RAID array going down).
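
To spell out the arithmetic (the 2000+200 and 230+23 splits are the
configurations above; the 1GB figure is bup's packfile cap):

```python
GB = 10**9

def tolerance(data_blocks, recovery_blocks, filesize=GB):
    """With B recovery blocks, an erasure code can repair up to B bad
    source blocks, however they are distributed. For fixed overhead,
    more (hence smaller) blocks means more *scattered* errors survive,
    while the worst-case contiguous span stays about the same."""
    block = filesize // data_blocks
    return {'block_size': block,
            'max_bad_blocks': recovery_blocks,
            'max_contiguous_bytes': recovery_blocks * block}

par2 = tolerance(2000, 200)   # bup's current par2 settings
zfec = tolerance(230, 23)     # a GF(2^8)-limited zfec split
```

Both configurations repair roughly 100MB of contiguous damage, but
par2's 200 small blocks survive ~9x as many isolated errors as zfec's
23 large ones.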

However, it's not so simple because the par2 decoder (as far as I can
see) doesn't make use of the error-correcting code in the .par2 files to
the fullest extent, and only attempts to correct "whole block" errors.
I think that it may be the same for zfec too, though I can't find a spec
for a high-level utility using python-zfec.

(Strangely, this may mean there are a load of par2 files out there that
par2 has declared unrecoverable that are actually recoverable.)

There is still an advantage for par2 in having more blocks, but unless
I'm missing something there would be a bigger advantage in a better
implementation of the decoder, at least for the random-error case.

Last point: if you make par2 use zfec parameters, then it seems the
speed advantage of zfec mostly disappears. On my machine, par2 in
2000+200 mode (as used by bup fsck -g) runs at 10MB/s of input file,
par2 in 230+23 mode runs at 60MB/s and zfec in 230+20 mode runs at
80MB/s.

Rob Browning

Jun 9, 2014, 11:11:45 AM
to Alex Selby, bup-...@googlegroups.com, aselb...@gmail.com
Alex Selby <aselb...@gmail.com> writes:

> If the github link below is the upstream of the current Debian package, and
> if it's being actively developed, then that sounds useful.
>
> (Though it may be tricky to get away without using memory equal to the
> size of the input file if par2 wants to process its input file in stripes.)

While I don't know exactly what par2 needs (or does right now), another
possibility might be allowing it to handle multiple files in parallel,
shifting the read scheduling/buffering into par2 itself.