
Bug#969463: rsync: fails reproducibly but unexplained on a specific volume, breaks horribly when started by BackupPC


Andreas Feldner (Sep 3, 2020, 9:30:03 AM)
Package: rsync
Version: 3.2.3-2
Severity: important

Dear Maintainer,

* What led up to the situation?

I believe the problem started with the latest update of the rsync package at
the end of August 2020. A previously working BackupPC setup has reproducibly
failed for a specific host and a specific directory tree since then.

* What exactly did you do (or not do) that was effective (or
ineffective)?

I repeatedly restarted the backup from BackupPC; every attempt failed.
I ran a test rsync over ssh directly, without BackupPC; it failed as well.
I checked dmesg for indications of lower-level problems; there are none.

* What was the outcome of this action?

1) With BackupPC

-- Output of
/usr/share/backuppc/bin/BackupPC_zcat /var/lib/backuppc/pc/localhost/XferLOG.z|grep -v -a '^ '
full backup started for directory /var/lib/lxc (baseline backup #131)
Running: /usr/bin/ssh -q -x -l root localhost /usr/bin/rsync --server --sender --numeric-ids \
--perms --owner --group -D --links --hard-links --times --block-size=2048 --recursive \
--one-file-system --ignore-times . /var/lib/lxc/
Xfer PIDs are now 77976
Got remote protocol 31
Negotiated protocol version 28
Sent exclude: tmp
Sent exclude: temp
Sent exclude: *Trash*
Sent exclude: trash*
Sent exclude: backup*
Sent exclude: *.bak
Sent exclude: *.tmp
Xfer PIDs are now 77976,78855
Remote[-7]:
Remote[98]: <<here starts binary content, obviously file content meant to be
transmitted, pushing the log file to several GiB>>
-- end of filtered log


-- Another run, the output looks like:
incr backup started back to 2020-08-19 22:00:03 (backup #127) for directory /var/lib/lxc
Running: /usr/bin/ssh -q -x -l root localhost /usr/bin/rsync --server --sender --numeric-ids --perms --owner --group -D --links --hard-links --times --block-size=2048 --recursive --one-file-system . /var/lib/lxc/
Xfer PIDs are now 23770
Got remote protocol 31
Negotiated protocol version 28
Sent exclude: tmp
Sent exclude: temp
Sent exclude: *Trash*
Sent exclude: trash*
Sent exclude: backup*
Sent exclude: *.bak
Sent exclude: *.tmp
Xfer PIDs are now 23770,24835
Remote[-7]:
Remote[-7]:
Remote[-7]:
<<... many more identical lines>>
Remote[41]: <C0>^@^G^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
Remote[-7]:
Remote[-7]:
Remote[-7]:
<<.... thousands of identical lines>>
<<...binary garbage>>
--



Finally, rsync (and BackupPC_dump) just hang, each waiting for something to happen:

-- Output of strace on rsync's PID:
strace: Process 23781 attached
select(2, [], [1], [], {tv_sec=35, tv_usec=526085}) = 0 (Timeout)
select(2, [], [1], [], {tv_sec=60, tv_usec=0}) = 0 (Timeout)
select(2, [], [1], [], {tv_sec=60, tv_usec=0}) = 0 (Timeout)
select(2, [], [1], [], {tv_sec=60, tv_usec=0}^Cstrace: Process 23781 detached
<detached ...>
--

To recover, the BackupPC service has to be stopped, or BackupPC_dump and/or rsync killed manually.




2) rsync test run

-- rsync stdout
opening connection using: ssh -l root localhost rsync --server --sender -vvvlHogDtprxe.iLsfxCIvu -B2048 --numeric-ids . /var/lib/lxc (12 args)
receiving incremental file list
server_sender starting pid=144226
[sender] make_file(lxc,*,0)
send_file_list done
[sender] pushing local filters for /var/lib/lxc/
<<very much diagnostics of transferring thousands of files without problems>>
got file_sum
set modtime, atime of lxc/nextcloud/rootfs/var/backups/.gshadow.bak.vcUsMX to (1596971541) 2020/08/09 13:12:21, (1599125234) 2020/09/03 11:27:14
renaming lxc/nextcloud/rootfs/var/backups/.gshadow.bak.vcUsMX to lxc/nextcloud/rootfs/var/backups/gshadow.bak
recv_files(lxc/nextcloud/rootfs/var/backups/nextcloud-dir_20200119.tar)
lxc/nextcloud/rootfs/var/backups/nextcloud-dir_20200119.tar
[receiver] _exit_cleanup(code=13, file=log.c, line=245): about to call exit(13)
[generator] _exit_cleanup(code=13, file=io.c, line=1644): about to call exit(13)
-- end of stdout

-- rsync stderr
rsync error: errors with program diagnostics (code 13) at log.c(245) [receiver=3.2.3]
rsync: [generator] write error: Broken pipe (32)
-- end of stderr


3) tar file generation
...just worked

* What outcome did you expect instead?

A copy of the source directory tree at the destination ;-)



-- System Information:
Debian Release: bullseye/sid
APT prefers testing
APT policy: (990, 'testing'), (500, 'stable')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 5.7.0-3-amd64 (SMP w/4 CPU threads)
Kernel taint flags: TAINT_FIRMWARE_WORKAROUND
Locale: LANG=de_DE.UTF-8, LC_CTYPE=de_DE.UTF-8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /bin/bash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages rsync depends on:
ii init-system-helpers 1.58
ii libacl1 2.2.53-8
ii libc6 2.31-3
ii liblz4-1 1.9.2-2
ii libpopt0 1.18-1
ii libssl1.1 1.1.1g-1
ii libxxhash0 0.8.0-1
ii libzstd1 1.4.5+dfsg-4
ii lsb-base 11.1.0
ii zlib1g 1:1.2.11.dfsg-2

rsync recommends no packages.

Versions of packages rsync suggests:
ii openssh-client 1:8.3p1-1
ii openssh-server 1:8.3p1-1
ii python3 3.8.2-3

-- no debconf information

Paul Slootman (Sep 3, 2020, 10:50:03 AM)
On Thu 03 Sep 2020, Andreas Feldner wrote:

> 2) rsync test run
>
> -- rsync stdout
> opening connection using: ssh -l root localhost rsync --server --sender -vvvlHogDtprxe.iLsfxCIvu -B2048 --numeric-ids . /var/lib/lxc (12 args)

How are you invoking rsync? Please give the exact command line,
including what your current working directory is.

Are you actually using ssh to connect to localhost and transferring the
data over that ssh link?
If so, then these lines are very strange in the case (1) text:

> Got remote protocol 31
> Negotiated protocol version 28

I would expect an rsync to negotiate the highest protocol version
possible, so this looks like it's talking to a newer version than
itself.
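Paul's reading can be sketched as follows. This is only a conceptual illustration, not rsync's actual negotiation code: each side announces the highest protocol it speaks, and both then use the smaller of the two numbers, so a negotiated version below the remote's maximum suggests an older implementation on the local side.

```python
LOCAL_MAX_PROTOCOL = 28   # hypothetical older peer, e.g. BackupPC's receiver


def negotiate(remote_max_protocol: int) -> int:
    # Both sides agree on the highest protocol version they both understand.
    return min(LOCAL_MAX_PROTOCOL, remote_max_protocol)


# "Got remote protocol 31" followed by "Negotiated protocol version 28" is
# consistent with a protocol-28 peer talking to rsync 3.2.3 (protocol 31):
assert negotiate(31) == 28
```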

> receiving incremental file list
> server_sender starting pid=144226
> [sender] make_file(lxc,*,0)
> send_file_list done
> [sender] pushing local filters for /var/lib/lxc/
> <<very much diagnostics of transferring thousands of files without problems>>
> got file_sum
> set modtime, atime of lxc/nextcloud/rootfs/var/backups/.gshadow.bak.vcUsMX to (1596971541) 2020/08/09 13:12:21, (1599125234) 2020/09/03 11:27:14
> renaming lxc/nextcloud/rootfs/var/backups/.gshadow.bak.vcUsMX to lxc/nextcloud/rootfs/var/backups/gshadow.bak
> recv_files(lxc/nextcloud/rootfs/var/backups/nextcloud-dir_20200119.tar)
> lxc/nextcloud/rootfs/var/backups/nextcloud-dir_20200119.tar
> [receiver] _exit_cleanup(code=13, file=log.c, line=245): about to call exit(13)
> [generator] _exit_cleanup(code=13, file=io.c, line=1644): about to call exit(13)
> -- end of stdout
>
> -- rsync stderr
> rsync error: errors with program diagnostics (code 13) at log.c(245) [receiver=3.2.3]
> rsync: [generator] write error: Broken pipe (32)
> -- end of stderr

Because of the "errors with program diagnostics" I would have expected
more messages...

It would be most helpful if you could follow the instructions on
https://rsync.samba.org/issues.html, especially the second half of
question 3.


Paul

Pelzi (Sep 3, 2020, 1:00:03 PM)
Hi Paul,

thanks for the fast answer!

"Paul Slootman" pa...@debian.org – 3. September 2020 16:27
[...]

>> opening connection using: ssh -l root localhost rsync --server --sender -vvvlHogDtprxe.iLsfxCIvu -B2048 --numeric-ids . /var/lib/lxc (12 args)
>
> How are you invoking rsync? Please give the exact command line,
> including what you current working directory is.


User is backuppc, working directory is /var/lib/backuppc. Command line:

sh-5.0$ rsync -vvv --rsh=ssh --numeric-ids --perms --owner --group -D --links --hard-links --times --block-size 48 \
--recursive --one-file-system root@localhost:/var/lib/lxc /mnt/testrsync/ >/mnt/testrsync/stdout.txt 2>/mnt/testrsync/stderr.txt

> Are you actually using ssh to connect to localhost and transferring the
> data over that ssh link?

In this case (2), I issued the command from within a screen session. Here, ssh is used internally by rsync (--rsh=ssh). The idea is to switch privileges: the backuppc user is allowed to execute rsync as root via authorized_keys. It is also convenient that the same mechanism works for all hosts to be backed up.
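For reference, the privilege switch described above is typically done with a forced command in root's authorized_keys on the client. The entry below is purely illustrative; the key type, options, and rsync arguments are assumptions, not copied from this setup:

```
# Illustrative forced-command entry in /root/.ssh/authorized_keys:
# the backuppc key may only run the server-side rsync, nothing else.
command="/usr/bin/rsync --server --sender ...",no-port-forwarding,no-pty ssh-ed25519 AAAA... backuppc@backupserver
```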

> If so, then these lines are very strange in the case (1) text:
>
>> Got remote protocol 31
>> Negotiated protocol version 28
>
> I would expect an rsync to negotiate the highest protocol version
> possible, so this looks like it's talking to a newer version than
> itself.

In contrast to case (2), this case is running from within BackupPC. AFAIK, BackupPC_dump directly executes ssh ... rsync --server --sender ... and implements the receiving side itself. So, I would conclude that BackupPC_dump implements protocol version 28.

>> -- rsync stderr
>> rsync error: errors with program diagnostics (code 13) at log.c(245) [receiver=3.2.3]
>> rsync: [generator] write error: Broken pipe (32)
>> -- end of stderr
>
> Because of the "errors with program diagnostics" I would have expected
> more messages...

So did I :-)

> It would be most helpful if you could follow the instructions on
> rsync.samba.org/issues.html, especially the second half of
> question 3.

Thanks for the reference, I will try it later.

Andreas.

Pelzi (Sep 5, 2020, 6:10:03 AM)
Hi again,

[…]

> Because of the "errors with program diagnostics" I would have expected
> more messages...
>
> It would be most helpful if you could follow the instructions on
> https://rsync.samba.org/issues.html, especially the second half of
> question 3.

So I did:

sh-5.0$ rsync -vvv --rsh=ssh --rsync-path=/usr/bin/rsync --numeric-ids --perms --owner --group -D --links --hard-links --times --block-size=2048 --recursive --one-file-system root@localhost:/var/lib/lxc /mnt/testrsync/ >/mnt/testrsync/stdout2.txt 2>/mnt/testrsync/stderr2.txt
sh-5.0$ echo $?
13

With the following results:

-- begin stderr2.txt
rsync error: errors with program diagnostics (code 13) at log.c(245) [receiver=3.2.3]
rsync: [generator] write error: Broken pipe (32)
-- end stderr2.txt

-- tail of stdout2.txt
recv_generator(lxc/nextcloud/rootfs/var/backups/nextcloud-confbkp_20200119/extra-apps/groupfolders/l10n/sr.js,519849)
recv_generator(lxc/nextcloud/rootfs/var/backups/nextcloud-confbkp_20200119/extra-apps/groupfolders/l10n/sr.json,519850)
recv_generator(lxc/nextcloud/rootfs/var/backups/nextcloud-confbkp_20200119/extra-apps/groupfolders/l10n/sv.js,519851)
recv_generator(lxc/nextcloud/rootfs/var/backups/nextcloud-confbkp_20200119/extra-apps/groupfolders/l10n/sv.json,519852)
recv_generator(lxc/nextcloud/rootfs/var/backups/nextcloud-confbkp_20200119/extra-apps/groupfolders/l10n/tr.js,519853)
recv_generator(lxc/nextcloud/rootfs/var/backups/nextcloud-confbkp_20200119/extra-apps/groupfolders/l10n/tr.json,519854)
recv_generator(lxc/nextcloud/rootfs/var/backups/nextcloud-confbkp_20200119/extra-apps/groupfolders/l10n/uz.js,519855)
recv_generator(lxc/nextcloud/rootfs/var/backups/nextcloud-confbkp_20200119/extra-apps/groupfolders/l10n/uz.json,519856)
recv_generator(lxc/nextcloud/rootfs/var/backups/nextcloud-confbkp_20200119/extra-apps/groupfolders/l10n/zh_CN.js,519857)
recv_generator(lxc/nextcloud/rootfs/var/backups/nextcloud-confbkp_20200119/extra-apps/groupfolders/l10n/zh_CN.json,519858)
recv_generator(lxc/nextcloud/rootfs/var/backups/nextcloud-confbkp_20200119/extra-apps/groupfolders/l10n/zh_TW.js,519859)
recv_generator(lxc/nextcloud/rootfs/var/backups/nextcloud-confbkp_20200119/extra-apps/groupfolders/l10n/zh_TW.json,519860)
[receiver] _exit_cleanup(code=13, file=log.c, line=245): about to call exit(13)
[generator] _exit_cleanup(code=13, file=io.c, line=1644): about to call exit(13)
-- end of stdout2.txt

-- tail of sender-side strace
244242 12:26:26 select(2, [0], [1], [0], {tv_sec=60, tv_usec=0}) = 1 (out [1], left {tv_sec=59, tv_usec=999998})
244242 12:26:26 write(1, "\4\200\0\7\364\207\237\312\235\2176X\312\217m\t"..., 32776) = 32776
244242 12:26:26 select(2, [0], [1], [0], {tv_sec=60, tv_usec=0}) = 1 (out [1], left {tv_sec=59, tv_usec=999999})
244242 12:26:26 write(1, "\4\200\0\7\370\37H\305,s\333G42,\261:\"..., 32776) = 24584
244242 12:26:26 select(2, [0], [1], [0], {tv_sec=60, tv_usec=0}) = 2 (in [0], out [1], left {tv_sec=34, tv_usec=971695})
244242 12:26:51 read(0, "", 13405)      = 0
244242 12:26:51 select(2, [], [1], [], {tv_sec=60, tv_usec=0}) = 1 (out [1], left {tv_sec=59, tv_usec=999995})
244242 12:26:51 write(1, "u\226A\256\337\231\325\366\242"..., 8192) = -1 EPIPE (Datenübergabe unterbrochen (broken pipe))
244242 12:26:51 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=244242, si_uid=0} ---
244242 12:26:53 write(2, "rsync: [sender] write error: Broken pipe (32)\n", 46) = -1 EPIPE (Datenübergabe unterbrochen (broken pipe))
244242 12:26:53 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=244242, si_uid=0} ---
244242 12:26:53 rt_sigaction(SIGUSR1, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f563bcb3e30}, NULL, 8) = 0
244242 12:26:53 rt_sigaction(SIGUSR2, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f563bcb3e30}, NULL, 8) = 0
244242 12:26:53 write(2, "rsync error: errors with program diagnostics (code 13) at log.c(245) [sender=3.2.3]\n", 84) = -1 EPIPE (Datenübergabe unterbrochen (broken pipe))
244242 12:26:53 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=244242, si_uid=0} ---
244242 12:26:53 rt_sigaction(SIGUSR1, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f563bcb3e30}, NULL, 8) = 0
244242 12:26:53 rt_sigaction(SIGUSR2, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f563bcb3e30}, NULL, 8) = 0
244242 12:26:53 write(2, "[sender] _exit_cleanup(code=13, file=log.c, line=245): about to call exit(13)\n", 78) = -1 EPIPE (Datenübergabe unterbrochen (broken pipe))
244242 12:26:53 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=244242, si_uid=0} ---
244242 12:26:53 rt_sigaction(SIGUSR1, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f563bcb3e30}, NULL, 8) = 0
244242 12:26:53 rt_sigaction(SIGUSR2, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f563bcb3e30}, NULL, 8) = 0
244242 12:26:53 clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=100000000}, 0x7fff8399a960) = 0
244242 12:26:53 exit_group(13)          = ?
244242 12:26:53 +++ exited with 13 +++
-- end of strace
Note that I drastically shortened the content of the write(1, ...) calls, in the same style in which they were already truncated; I assume it is just file content.

I must admit that I'm none the wiser. The sending half obviously wanted to write exactly the same diagnostic as the receiving half did ("errors with program diagnostics (code 13)") but couldn't, because the communication pipe was already broken. It looks as if the sending side was caught off guard by the receiving side closing the connection, while the receiving side gives no indication of what went wrong beyond that somewhat generic message.
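The "sender could not even write its own error message" behaviour seen in the strace can be reproduced in miniature. The snippet below is a hedged illustration of POSIX pipe semantics, not rsync code: once the reading end is gone, every write fails with EPIPE (raising SIGPIPE unless it is ignored), which is exactly why the sender's diagnostics never arrive.

```python
import errno
import os
import signal

# rsync handles write errors itself; ignore SIGPIPE here so the write
# returns an error instead of killing the process.
signal.signal(signal.SIGPIPE, signal.SIG_IGN)

read_end, write_end = os.pipe()
os.close(read_end)  # the "receiver" exits and closes its side

try:
    os.write(write_end, b"rsync error: ...")
    raised = None
except OSError as exc:
    raised = exc.errno

# Same "Broken pipe (32)" as in the strace output above.
assert raised == errno.EPIPE
os.close(write_end)
```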

Hope this helps anyway!

Best regards,
Andreas.

Andre Klärner (Nov 14, 2020, 4:10:04 AM)
Hi all,

I am also facing this issue and have tried the latest versions of rsync.

The problem was introduced in rsync 3.2.3-1; downgrading to 3.2.2-2
works.

Best regards,
Andre

Philipp Marek (Apr 19, 2021, 3:40:03 AM)
I've got a script that tries to rsync a file to an external hard disk.

-rw------- 1 libvirt-qemu libvirt-qemu 107390828544 15. Apr 15:07
vm.qcow2


In essence,

/usr/bin/time rsync -va --inplace --no-whole-file --block-size=65536
--progress --stats \
/var/lib/libvirt/images/vm.qcow2 /mnt/tmp3/vm.qcow2

That command takes quite a while (I guess checksumming the source file),
then starts transmitting:

sending incremental file list
vm.qcow2
5,477,040,128 5% 237.65MB/s 0:06:58

At that position rsync reproducibly stops transferring data; there is
no error message, the first rsync process just hangs at 100% CPU.

strace of the 3 processes only shows that the other two are waiting
for data on their socket; the "hanging" one is simply CPU-bound in
user space, and /proc/<pid>/stack is empty.


That happens with both 3.1.3-6 and 3.2.3-4.


Next time I'll try to have debug symbols installed and look at the
process via gdb; right now the file is identical, as I copied it via
"cp -a".

Paul Slootman (Apr 19, 2021, 4:10:03 AM)
On Mon 19 Apr 2021, Philipp Marek wrote:
>
> /usr/bin/time rsync -va --inplace --no-whole-file --block-size=65536
> --progress --stats \
> /var/lib/libvirt/images/vm.qcow2 /mnt/tmp3/vm.qcow2

Hi,

Could you also try it without the --block-size=65536 option?


Paul

Philipp Marek (Apr 28, 2021, 4:10:03 AM)

# gdb --args rsync -va --inplace --no-whole-file --progress --stats
/var/lib/libvirt/images/win10.qcow2 win10.qcow2

went through without any problem.
So either "--block-size" is the reason, or some bit pattern in the
previous version of the file.

Touching the file and retrying with checksumming (-c) found nothing to
copy and so succeeded; I'll try gdb once the file has changed.

Philipp Marek (Apr 30, 2021, 2:10:03 AM)
This time I used "--block-size=65536" again, and it hung:


3841 pts/15 S 0:01 \_ -bash
92536 pts/15 R+ 2:36 \_ rsync -va --inplace --no-whole-file
--block-size=65536 --progress --stats
/var/lib/libvirt/images/win10.qcow2 win10.qcow2
92537 pts/15 S+ 1:28 \_ rsync -va --inplace
--no-whole-file --block-size=65536 --progress --stats
/var/lib/libvirt/images/win10.qcow2 win10.qcow2
92538 pts/15 S+ 0:08 \_ rsync -va --inplace
--no-whole-file --block-size=65536 --progress --stats
/var/lib/libvirt/images/win10.qcow2 win10.qcow2


GDB says:

0x00005601c4fd64b0 in hash_search (len=<optimized out>, buf=<optimized
out>, s=<optimized out>, f=<optimized out>) at match.c:206
206 match.c: Datei oder Verzeichnis nicht gefunden.
(gdb) bt
#0 0x00005601c4fd64b0 in hash_search (len=<optimized out>,
buf=<optimized out>, s=<optimized out>, f=<optimized out>) at
match.c:206
#1 match_sums (f=f@entry=4, s=s@entry=0x5601c6465420,
buf=buf@entry=0x5601c6465450, len=107390828544) at match.c:396
#2 0x00005601c4fca6c1 in send_files (f_in=f_in@entry=5,
f_out=f_out@entry=4) at sender.c:389
#3 0x00005601c4fd49ac in client_run (f_in=5, f_out=4,
pid=pid@entry=92537, argc=argc@entry=1, argv=argv@entry=0x5601c643b660)
at main.c:1317
#4 0x00005601c4fb461a in start_client (argv=<optimized out>, argc=1)
at main.c:1580
#5 main (argc=<optimized out>, argv=<optimized out>) at main.c:1812
(gdb) up
#1 match_sums (f=f@entry=4, s=s@entry=0x5601c6465420,
buf=buf@entry=0x5601c6465450, len=107390828544) at match.c:396
396 in match.c
(gdb) info locals
sum_len = <optimized out>
(gdb) print *f
Cannot access memory at address 0x4
(gdb) print *s
$1 = {flength = 107390828544, sums = 0x7f8a6e030010, count = 1638654,
blength = 65536, remainder = 0, s2length = 5}
(gdb) up
#2 0x00005601c4fca6c1 in send_files (f_in=f_in@entry=5,
f_out=f_out@entry=4) at sender.c:389
389 sender.c: Datei oder Verzeichnis nicht gefunden.
(gdb) info locals
fd = 3
s = 0x5601c6465420
mbuf = 0x5601c6465450
st = {st_dev = 48, st_ino = 261, st_nlink = 1, st_mode = 33152, st_uid
= 64055, st_gid = 64055, __pad0 = 0, st_rdev = 0, st_size =
107390828544, st_blksize = 4096, st_blocks = 151051736, st_atim =
{tv_sec = 1619705294,
tv_nsec = 576804175}, st_mtim = {tv_sec = 1619705289, tv_nsec =
236796467}, st_ctim = {tv_sec = 1619705294, tv_nsec = 556804145},
__glibc_reserved = {0, 0, 0}}
fname = "win10.qcow2", '\000' <repeats 3453 times>...
xname = "\000in10.qcow2\000virt/images\000win10.qcow2", '\000' <repeats
1117 times>...
path = 0x5601c643bb40 "/var/lib/libvirt/images"
slash = 0x5601c5005213 "/"
fnamecmp_type = 128 '\200'
iflags = 32776
xlen = -1
file = <optimized out>
phase = 0
max_phase = 2
itemizing = 0
log_code = FLOG
f_xfer = 4
save_io_error = 0
ndx = <optimized out>
j = <optimized out>
(gdb) print f_
f_in f_name f_name_buf f_name_cmp
f_name_has_prefix f_out f_owner_ex f_xfer
(gdb) print f_in
$2 = 5
(gdb) print f_out
$3 = 4

Any other data I can get you?

Philipp Marek (Jun 1, 2021, 3:10:03 AM)
Update: even without --block-size rsync hangs with 100% CPU.

# rsync -va --inplace --no-whole-file --progress --stats
/var/lib/libvirt/images/... /mnt/tmp4/


0x00005619df81a4c0 in hash_search (len=<optimized out>, buf=<optimized
out>, s=<optimized out>, f=<optimized out>) at match.c:206
206 match.c: Datei oder Verzeichnis nicht gefunden.
(gdb) bt
#0 0x00005619df81a4c0 in hash_search (len=<optimized out>,
buf=<optimized out>, s=<optimized out>, f=<optimized out>) at
match.c:206
#1 match_sums (f=f@entry=4, s=s@entry=0x5619e04333a0,
buf=buf@entry=0x5619e04333d0, len=107390828544) at match.c:396
#2 0x00005619df80e6c1 in send_files (f_in=f_in@entry=5,
f_out=f_out@entry=4) at sender.c:389
#3 0x00005619df8189ac in client_run (f_in=5, f_out=4,
pid=pid@entry=668556, argc=argc@entry=1, argv=argv@entry=0x5619e0409660)
at main.c:1317
#4 0x00005619df7f861a in start_client (argv=<optimized out>, argc=1) at
main.c:1580
#5 main (argc=<optimized out>, argv=<optimized out>) at main.c:1812
(gdb) info locals
l = <optimized out>
done_csum2 = 1
hash_entry = <optimized out>
i = 667490
prev = 0x7fbb95ded798
aligned_i = <optimized out>
more = <optimized out>
aligned_offset = <optimized out>
end = 107390697473
k = 131072
want_i = <optimized out>
backup = <optimized out>
s1 = 1310720
sum = <optimized out>
offset = 45707824536
sum2 = "1\205_\250A]\025\370SF\303\362\342>k\243"
s2 = 785055744
map = <optimized out>
offset = <optimized out>
aligned_offset = <optimized out>
end = <optimized out>
k = <optimized out>
want_i = <optimized out>
aligned_i = <optimized out>
backup = <optimized out>
sum2 = {<optimized out> <repeats 16 times>}
s1 = <optimized out>
s2 = <optimized out>
sum = <optimized out>
more = <optimized out>
map = <optimized out>
null_hash = <optimized out>
check_want_i = <optimized out>
done_csum2 = <optimized out>
hash_entry = <optimized out>
i = <optimized out>
prev = <optimized out>
l = <optimized out>
(gdb) info registers
rax 0xffffffff 4294967295
rbx 0x1 1
rcx 0x7fbb94477010 140443623321616
rdx 0x5 5
rsi 0x19aad850 430626896
rdi 0x855fa841 2237638721
rbp 0x7fbb95ded760 0x7fbb95ded760
rsp 0x7ffd76aa33b0 0x7ffd76aa33b0
r8 0x0 0
r9 0x1 1
r10 0x5619e04333a0 94669136671648
r11 0x7fbb95ded798 140443650021272
r12 0x1976750 26699600
r13 0xa2f62 667490
r14 0xa2f62 667490
r15 0xaa4660d98 45707824536
rip 0x5619df81a4c0 0x5619df81a4c0 <match_sums+1328>
eflags 0x283 [ CF SF IF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0


Yeah, that's not that helpful... but that might be:


(gdb) l
warning: Source file is more recent than executable.
201 int32 l;
202
203 /* When updating in-place, the chunk's
offset must be
204 * either >= our offset or identical
data at that offset.
205 * Remove any bypassed entries that we
can never use. */
206 if (updating_basis_file &&
s->sums[i].offset < offset
207 && !(s->sums[i].flags &
SUMFLG_SAME_OFFSET)) {
208 *prev = s->sums[i].chain;
209 continue;
210 }
(gdb) n
211 prev = &s->sums[i].chain;
(gdb) n
213 if (sum != s->sums[i].sum1)
(gdb) n
217 l = (int32)MIN((OFF_T)s->blength,
len-offset);
(gdb) n
218 if (l != s->sums[i].len)
(gdb) n
221 if (DEBUG_GTE(DELTASUM, 3)) {
(gdb) n
227 if (!done_csum2) {
(gdb) n
233 if
(memcmp(sum2,s->sums[i].sum2,s->s2length) != 0) {
(gdb) n
234 false_alarms++;
(gdb) n
235 continue;
(gdb) n
307 } while ((i = s->sums[i].chain) >= 0);
(gdb) n
206 if (updating_basis_file &&
s->sums[i].offset < offset
(gdb) n
211 prev = &s->sums[i].chain;
(gdb) n
213 if (sum != s->sums[i].sum1)
(gdb) n
217 l = (int32)MIN((OFF_T)s->blength,
len-offset);
(gdb) n
218 if (l != s->sums[i].len)
(gdb) n
221 if (DEBUG_GTE(DELTASUM, 3)) {
(gdb) n
227 if (!done_csum2) {
(gdb) n
233 if
(memcmp(sum2,s->sums[i].sum2,s->s2length) != 0) {
(gdb) n
234 false_alarms++;
(gdb) print i
$1 = 667489
(gdb) print s->sums[i].chain
value has been optimized out
(gdb) print false_alarms
$2 = -745410538
(gdb) n
235 continue;
(gdb) n
307 } while ((i = s->sums[i].chain) >= 0);
(gdb) n
206 if (updating_basis_file &&
s->sums[i].offset < offset
(gdb) print i
$3 = 667488
(gdb) b 206
Breakpoint 1 at 0x5619df81a4a3: file match.c, line 206.
(gdb) c
Continuing.

Breakpoint 1, hash_search (len=<optimized out>, buf=<optimized out>,
s=<optimized out>, f=<optimized out>) at match.c:206
206 if (updating_basis_file &&
s->sums[i].offset < offset
(gdb) print i
$4 = 667487
(gdb) c
Continuing.

Breakpoint 1, hash_search (len=<optimized out>, buf=<optimized out>,
s=<optimized out>, f=<optimized out>) at match.c:206
206 if (updating_basis_file &&
s->sums[i].offset < offset
(gdb) print i
$5 = 667486
(gdb) c
Continuing.

Breakpoint 1, hash_search (len=<optimized out>, buf=<optimized out>,
s=<optimized out>, f=<optimized out>) at match.c:206
206 if (updating_basis_file &&
s->sums[i].offset < offset
(gdb) print i
$6 = 667485
(gdb) c
Continuing.

Breakpoint 1, hash_search (len=<optimized out>, buf=<optimized out>,
s=<optimized out>, f=<optimized out>) at match.c:206
206 if (updating_basis_file &&
s->sums[i].offset < offset
(gdb) print i
$7 = 667484


This time rsync was hung at
45,706,764,560 42% 16.99MB/s 0:59:06 ^C
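For readers unfamiliar with the code the backtrace points at: hash_search() in match.c implements rsync's delta-transfer block matching, and the false_alarms counter visible in gdb counts windows whose cheap weak checksum collides with a block's but whose strong checksum does not. The sketch below is a simplified Python illustration of that two-level scheme; the real weak checksum is a rolling sum and the real strong checksum is a truncated MD4/MD5 digest, so both stand-ins here are assumptions for clarity, not rsync's actual functions.

```python
import hashlib


def weak_sum(block: bytes) -> int:
    # Toy stand-in for rsync's rolling checksum (sum1); it is cheap and
    # collides easily, which is what produces "false alarms".
    return sum(block) & 0xFFFF


def strong_sum(block: bytes) -> bytes:
    # Stand-in for the truncated strong digest rsync uses (sum2).
    return hashlib.md5(block).digest()[:5]


def hash_search(data: bytes, block_len: int, sums):
    """Simplified shape of match.c:hash_search(): for every window of the
    new file, look up its weak sum, then confirm candidates with the
    strong sum; weak hits that fail the strong check are false alarms."""
    table = {}
    for idx, (s1, s2) in enumerate(sums):
        table.setdefault(s1, []).append((idx, s2))
    matches = false_alarms = 0
    for offset in range(len(data) - block_len + 1):
        block = data[offset:offset + block_len]
        for idx, s2 in table.get(weak_sum(block), []):
            if strong_sum(block) == s2:
                matches += 1
            else:
                false_alarms += 1
    return matches, false_alarms
```

With highly repetitive data (as in a sparse qcow2 image) almost every window weak-matches almost every block, so the inner confirmation loop runs constantly; that is consistent with the 100% CPU and the huge false_alarms-style counters seen in the debugger.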

martin f krafft (Jun 25, 2021, 11:10:04 PM)

Just popping in to note that I have two Debian unstable clients backing up to the same BackupPC server (v3); one works fine, while the other exhibits a problem similar to this one. I am unsure about the hang and the 100% CPU use, but my XferLOG file is definitely growing indefinitely, and the backups are taking forever.

--
 .''`.  martin f. krafft <madduck@d.o> @martinkrafft
: :'  : proud Debian developer
`. `'`  http://people.debian.org/~madduck
  `-    Debian - when you have better things to do than fixing systems

Paul Slootman (Jun 26, 2021, 8:10:03 AM)
On Sat 26 Jun 2021, martin f krafft wrote:

> Just popping in here the fact that I have two Debian unstable clients
> backing up to the same BackupPC server (v3), and while one works fine, the
> other does exhibit a problem similar to this one. I am unsure about the hang
> and 100% CPU use, but my XferLog file is definitely growing indefinitely,
> and the backups are taking forever.

Thanks for the update.

It most probably is dependent on the data, which means it's not
surprising that one client works fine while the other doesn't.
It also means it's extremely difficult to track down.

Paul

Baptiste Jonglez (Nov 17, 2021, 10:30:03 AM)
Hello,

TL;DR: we are experiencing the same bug using BackupPC, and found that
upstream rsync commit f9bb8f76ee is the culprit. Unfortunately the bug only
occurs on some Debian 11 machines, not all.

We are seeing the same problem when using BackupPC to back up a Debian 11
machine: the backup never finishes, because the rsync process on the target
machine ends up doing nothing at all (not using any CPU). Rsync on the
target machine is version 3.2.3-4+deb11u1 (the standard Debian package).

Debugging with strace on the target Debian 11 machine shows that rsync
scans the whole filesystem as expected, but then gets stuck doing nothing
in a select loop, forever:

strace: Process 4104384 attached
select(1, [0], [], [0], {tv_sec=52, tv_usec=637639}) = 0 (Timeout)
select(1, [0], [], [0], {tv_sec=60, tv_usec=0}) = 0 (Timeout)
select(1, [0], [], [0], {tv_sec=60, tv_usec=0}) = 0 (Timeout)
select(1, [0], [], [0], {tv_sec=60, tv_usec=0}) = 0 (Timeout)
select(1, [0], [], [0], {tv_sec=60, tv_usec=0}) = 0 (Timeout)

Given this behaviour, it seems likely that rsync ends up running the
"noop_io_until_death()" function and loops there forever.

Thanks to a previous comment from Andre, I focused on the git history
between 3.2.2 and 3.2.3 and found this commit:

f9bb8f76ee ("Change daemon variable & simplify some option code")

and specifically this change that introduces a new condition to keep
running noop_io_until_death():

https://github.com/WayneD/rsync/commit/f9bb8f76ee728bd1391a2b4890ce0281457a7bf2#diff-92194f057884b3287a3a6bf84e6e3b2bf433a556b68562799252a091744e7854L920-L921

Reverting f9bb8f76ee on top of the Debian package indeed fixes the issue
for us on this machine.

The strange thing is: on another Debian 11 machine that is configured very
similarly (same rsync version, same BackupPC server) but has different
data, we could not reproduce the bug at all.

I'm not familiar with rsync internals, but my guess from reading the code is:

- for some reason, possibly due to the way rsync is started by backuppc,
daemon_connection is sometimes set to -1 ("daemon via socket")

- when rsync has finished working, it goes into the noop_io_until_death()
loop because of the new condition in 3.2.3, waiting for a signal

- but the signal never comes, so rsync never exits

- this causes backuppc to wait forever for rsync to finish

For reference, this is how rsync is started by BackupPC through SSH:

/usr/bin/ssh -q -x -o StrictHostKeyChecking=no -l backup $CLIENT_HOSTNAME nice -n 19 sudo /usr/bin/rsync --server --sender --numeric-ids --perms --owner --group -D --links --hard-links --times --block-size=2048 --recursive --checksum-seed=32761 . /

I have opened an upstream bug report here: https://github.com/WayneD/rsync/issues/256

Thanks,
Baptiste

Baptiste Jonglez (Nov 17, 2021, 11:30:03 AM)
On Wed, Nov 17, 2021 at 04:14:23PM +0100, Baptiste Jonglez wrote:
> Given this behaviour, it seems likely that rsync ends up running the
> "noop_io_until_death()" function and loops there forever.
>
> Reverting f9bb8f76ee on top of the Debian package indeed fixes the issue
> for us on this machine.

Actually, both these assertions were red herrings :/

Reverting f9bb8f76ee worked for a few backups, but now it's back to the
same hanging issue.

And rsync is not stuck in noop_io_until_death() as I thought: it's actually
stuck trying to read some data from backuppc through the SSH connection.

I have added more details in the upstream bug report: https://github.com/WayneD/rsync/issues/256

In any case, I believe it's completely unrelated to Philipp's issue with
rsync taking 100% CPU. But it's definitely the same issue as the one
reported by Andreas at the beginning of this bug report.

Baptiste

Bastian Schmidt (Nov 19, 2021, 8:10:04 AM)
Hello,

I also had a reproducible failure on a full backup of a Debian 11 PC,
which worked with rsync 3.2.2 but failed with rsync 3.2.3. At some point
during the backup, file contents ended up in the BackupPC transfer log,
and either the backup stalled or became so slow that it never finished.

I stayed with rsync 3.2.2 until I recently stumbled upon
https://github.com/backuppc/backuppc/issues/369#issuecomment-692431546.
The suggested solution of adding the argument --no-msgs2stderr to rsync
fixed the issue for me. Unfortunately, this argument is unknown in older
versions of rsync (e.g. in version 3.1.3, included in Debian 10), so it
has to be added to the configuration depending on the version of rsync
on the PC to be backed up.
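One way to automate that version-dependent configuration is to gate the flag on the client's rsync version. The helper below is a hypothetical sketch: the 3.2 threshold comes from this thread's observation that 3.1.3 rejects the option while 3.2.x accepts it, not from rsync's changelog.

```python
import re


def extra_rsync_args(version_output: str) -> list:
    """Return extra client-side arguments based on `rsync --version` output."""
    m = re.search(r"rsync\s+version\s+(\d+)\.(\d+)", version_output)
    if m and (int(m.group(1)), int(m.group(2))) >= (3, 2):
        return ["--no-msgs2stderr"]
    return []  # e.g. 3.1.3 on Debian 10 does not know the option


assert extra_rsync_args("rsync  version 3.2.3  protocol version 31") == ["--no-msgs2stderr"]
assert extra_rsync_args("rsync  version 3.1.3  protocol version 31") == []
```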

I am unsure whether I really had the same issue as originally posted but
I observed it with the same versions and wanted to share this
information hoping it helps someone.

Bastian

Baptiste Jonglez (Nov 23, 2021, 12:40:03 PM)
Hello Bastian,

Many thanks for the link, it does indeed look like the exact same bug!

So, this turns out to be a longstanding bug in libfile-rsyncp-perl. I see
two ways to fix this problem in Debian:

- update libfile-rsyncp-perl to 0.76 in buster. This seems quite unlikely
but I'm not 100% familiar with Debian policies.

- wait for upstream rsync to (possibly) revert the default behaviour of
rsync, that is, disable --msgs2stderr by default. And then wait for
this updated version of rsync to land in bullseye.
See https://github.com/WayneD/rsync/issues/95#issuecomment-700424731
It also looks quite unlikely.

So I guess we need the workarounds for now :/

Baptiste