bareos 23 - disk backup over 10G network slow


Markus Dubois

Sep 18, 2024, 1:31:29 PM
to bareos-users
Hi,

I'm trying to back up a directly attached client (no switch, no router in between) over a 10G network.

iperf3 shows me:

 [ ID] Interval           Transfer     Bitrate

[  5]   0.00-1.00   sec   393 MBytes  3.30 Gbits/sec                  

[  5]   1.00-2.00   sec   460 MBytes  3.86 Gbits/sec                  

[  5]   2.00-3.00   sec   308 MBytes  2.59 Gbits/sec                  

[  5]   3.00-4.00   sec   397 MBytes  3.33 Gbits/sec                  

[  5]   4.00-5.00   sec   480 MBytes  4.03 Gbits/sec                  

[  5]   5.00-6.00   sec   372 MBytes  3.12 Gbits/sec                  

[  5]   6.00-7.00   sec   476 MBytes  3.99 Gbits/sec                  

[  5]   7.00-8.00   sec   348 MBytes  2.92 Gbits/sec                  

[  5]   8.00-9.00   sec   437 MBytes  3.66 Gbits/sec                  

[  5]   9.00-10.00  sec   441 MBytes  3.70 Gbits/sec                  

[  5]  10.00-10.00  sec   181 KBytes  3.85 Gbits/sec                  

- - - - - - - - - - - - - - - - - - - - - - - - -

[ ID] Interval           Transfer     Bitrate

[  5]   0.00-10.00  sec  4.02 GBytes  3.45 Gbits/sec                  receiver

-----------------------------------------------------------

The backup task runs with LZ4 compression over TLS,

but on average I get this:

Full Backup Job started: 18-Sep-24 15:16 Files=6,075,920 Bytes=1,715,453,454,098 Bytes/sec=113,171,490 Errors=2

This doesn't seem like the best transfer rate.

I'm trying to find a way to optimize this. Any hints?




Bruno Friedmann (bruno-at-bareos)

Sep 19, 2024, 4:37:53 AM
to bareos-users
So besides the fact that TLS has a CPU cost, LZ4 has one too, depending on the type of data.
I would go for a run without LZ4 first to see how CPU-bound your FD is.

Then you may want to tune TLS to only use ciphers that benefit from hardware acceleration, like certain AES variants.

Trying without TLS will give you a better view of how much it costs in percent. You may then want to consider kTLS.

The global Bytes/sec seen at the end is really a naive computation: the amount of data transferred divided by the stop-start timestamp difference.

If you're loading tape, the time to load and unload is included; if the job waits an hour for an operator to label a volume, that counts too.
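Bruno's cipher suggestion could look like the sketch below. The directive name and cipher string are assumptions to verify against the Bareos configuration reference for your version; AES-GCM suites benefit from AES-NI on most x86 CPUs:

```
# bareos-fd.d/director/bareos-dir.conf (illustrative placement and names)
Director {
  Name = bareos-dir
  ...
  TLS Cipher List = "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256"
}
```

`openssl speed -evp aes-128-gcm` on both hosts is a quick way to check whether the hardware-accelerated path is actually being used.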

Markus Dubois

Sep 19, 2024, 6:40:36 AM
to bareos-users
I started a run without TLS and without LZ4 but didn't see much difference, actually.
BTW: the backup is disk-only.

Andreas Rogge

Sep 19, 2024, 6:46:36 AM
to bareos-users
On 18.09.24 at 19:31, Markus Dubois wrote:
> I'm trying to back up a directly attached client (no switch, no router
> in between) over a 10G network
> [ ID] Interval           Transfer     Bitrate
> [  5]   0.00-10.00  sec  4.02 GBytes  3.45 Gbits/sec
> receiver

Honestly, that looks pretty awful.

I just measured two virtual machines on different hosts in different
networks. So the path (that probably also has some switches in between)
looks like this:
VM -> Host -> Router -> Router -> Host -> VM

And the result I got was
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.04  sec  10.7 GBytes  9.16 Gbits/sec  receiver

So there seems to be something wrong with your network.

> the backup task runs with lz4 compression over TLS
Do you have checksums enabled in the Fileset (i.e. "Signature = md5")?
Depending on the checksum used, this can severely impact performance.
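A rough way to see how much the checksum algorithm matters is a single-core throughput test on the client. XXH128 (which Markus switches to later in the thread) is not in Python's standard library, so md5/sha256/blake2b serve as stand-ins here; the point is the spread between algorithms, not the absolute numbers:

```python
import hashlib
import time

# Hash 64 MiB once per algorithm and report MB/s. Zero-filled data is
# fine for this purpose: checksum speed does not depend on content.
data = bytes(64 * 1024 * 1024)

for name in ("md5", "sha256", "blake2b"):
    h = hashlib.new(name)
    start = time.perf_counter()
    h.update(data)
    elapsed = time.perf_counter() - start
    print(f"{name:8s} {len(data) / elapsed / 1e6:7.0f} MB/s")
```

If the slowest algorithm here is below your target backup rate, the signature alone can cap the job.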
>
> but in average i get this:
> Full Backup Job started: 18-Sep-24 15:16 Files=6,075,920
> Bytes=1,715,453,454,098 *Bytes/sec=113,171,490* Errors=2

So when I calculate the average file size on that, I get
1,715,453,454,098 Bytes / 6,075,920 Files = 282,336 Bytes/File or around
282 KB per file on average.
Bareos (like virtually every other file-based backup system) does not
perform at line-speed for very small files. This has a lot of reasons,
but basically boils down to reading and writing a lot of metadata
compared to the actual data being backed up.
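Andreas's arithmetic, spelled out from the job report figures quoted above:

```python
# Numbers taken verbatim from Markus's job report.
total_bytes = 1_715_453_454_098
total_files = 6_075_920
rate_bps = 113_171_490  # Bytes/sec

avg_file = total_bytes / total_files
print(f"average file size: {avg_file:,.0f} bytes")      # ~282,336
print(f"throughput: {rate_bps * 8 / 1e9:.2f} Gbit/s")   # ~0.91 Gbit/s
```

So the job is moving under 1 Gbit/s on a link that iperf3 measured at ~3.45 Gbit/s, and the average file is well under a megabyte.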

> this seems not the "best" transfer rate.
agreed.

> I'm trying to find a way to optimize this. Any hints?
1. Make your network perform properly
2. Use tar or cpio redirected to /dev/null to see how fast your client
could actually discover and read the files - Bareos will never be able
to outperform that value
3. Check the CPU-load of bareos-fd (and probably also bareos-sd) during
the backup, if the processes sit at 100% most of the time, your
performance is CPU-bound in which case you simply need to reduce that to
improve performance (i.e. disable compression, disable checksums or
maybe remove regex expressions from your fileset).
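Hint 2 is easiest with `tar cf - /path > /dev/null` itself, but the same upper-bound measurement can be sketched in Python; `/data` below is a placeholder for the actual backup source:

```python
import os
import time

def read_tree(root: str, chunk_size: int = 1 << 20) -> tuple[int, float]:
    """Walk `root` and read every regular file, discarding the data.
    The resulting MB/s is an upper bound for any file-based backup."""
    total = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                with open(os.path.join(dirpath, name), "rb") as f:
                    while chunk := f.read(chunk_size):
                        total += len(chunk)
            except OSError:
                pass  # skip unreadable files, as tar would (with a warning)
    return total, time.perf_counter() - start

if __name__ == "__main__":
    nbytes, secs = read_tree("/data")  # placeholder path
    rate = nbytes / max(secs, 1e-9) / 1e6
    print(f"{nbytes / 1e6:.0f} MB in {secs:.1f} s = {rate:.0f} MB/s")
```

Run it twice: the second run shows the page-cache-warm rate, the first run the cold rate that a full backup actually sees.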

Hope that helps!

Best Regards,
Andreas
--
Andreas Rogge andrea...@bareos.com
Bareos GmbH & Co. KG Phone: +49 221-630693-86
http://www.bareos.com

Sitz der Gesellschaft: Köln | Amtsgericht Köln: HRA 29646
Komplementär: Bareos Verwaltungs-GmbH
Geschäftsführer: Stephan Dühr, Jörg Steffens, Philipp Storz

Markus Dubois

Sep 19, 2024, 7:06:30 AM
to bareos-users
The iperf result doesn't look at its best because the Bareos director and the storage daemon are installed on a NAS system with an Intel Atom 2.1 GHz 4-core CPU.
On the backup server, CPU is at 30-50% overall.
As mentioned above, I've launched a job without TLS and LZ4; the signature is configured as XXH128.
I can see that the NAS reports 250 MB/s on average and 300 MB/s max during the job runtime.
But I also have, how to put it, speed crashes from 250 MB/s down to 80 MB/s. The network rate is not constantly high.
At the same time, I see this on the Bareos client screen:
 Files=1,212,590 Bytes=177,269,712,013 Bytes/sec=165,672,628 Errors=0

The Bareos FD is running on a 30-core, 192 GB RAM server; I can see 75% CPU usage on one core. Overall, the server (to be backed up) is at 34% CPU.


So, what do you mean by making the network "perform properly"? As mentioned, I have some hardware limitations on the backup server side, but I think even those limitations shouldn't lead to those numbers...
Should I fiddle around with sysctl settings or max network buffers?



Andreas Rogge

Sep 19, 2024, 8:15:36 AM
to bareos...@googlegroups.com
On 19.09.24 at 13:06, Markus Dubois wrote:
> the iperf result doesn't look at its best because the Bareos director
> and the storage daemon are installed on a NAS system with an Intel
> Atom 2.1 GHz 4-core CPU
Maybe that system simply cannot fully utilize 10 GbE.
Did you verify that the NAS system can actually handle a sustained
data-stream of 300+ MB/s?

> On backup server CPU is on 30-50% overall
Overall as in 30-50% of all four cores?
Because that could mean one core is busy with bareos-sd and it simply
cannot work faster.
Also when bareos-dir and the postgresql database are on the same system,
you may run out of resources (CPU, memory or IOPS).

> as mentioned above, i've launched a job without tls and lz4, the
> signature is configured as XXH128
> I can see that the NAS is reporting 250 MB/s in average and 300 MB/s max
> during job runtime.
> But i have also, how to say this properly, speed crashes from 250 MB/s
> to 80 MB/s. The network rate is not constantly high.
So basically your maximum throughput is in the ballpark of what iperf
shows. That's not bad.

Depending on the structure of files you're backing up, the drop to 80
MB/s might be normal. If - for example - you have a set of large files,
these will be backed up around 250 MB/s. But as soon as the FD gets to a
directory with a lot of really small files, the rates will drop.
Even if the FD could keep up with it, the SD will send the file metadata
for every file to the director which will have to put it into the database.
Maybe it is a good idea to run a backup with (almost) only large files
(e.g. some video archive) and one with lots and lots of small files;
you might see a pattern emerge.
If it really is the database inserts, you can enable "Spool Attributes"
on the job. This will spool the metadata and only insert it into the
database when the job is finished.
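As a sketch, that directive sits in the Job resource (the job name here is made up):

```
Job {
  Name = "client1-backup"   # hypothetical name
  ...
  Spool Attributes = yes    # batch metadata inserts until the job ends
}
```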

> the Bareos FD is running on a 30-core, 192 GB RAM server; I can see 75%
> CPU usage on one core. Overall, the server (to be backed up) is at 34% CPU
So I guess we can rule out signatures and compression then, as that's
happening on the FD.

> so, what do you mean by making the network "perform properly"? As mentioned, I
> have some hardware limitations on the backup server side, but I think even
> those limitations shouldn't lead to those numbers...
> Should I fiddle around with sysctl settings or max network buffers?
I really don't know. The only thing I can say for sure is that when
iperf reports only around 3 GBit/s, you definitely won't see a backup
transmitting more than 300 MB/s.
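For completeness, the buffer tuning Markus asks about usually means a sysctl fragment like the one below. The values are illustrative starting points only, to be validated with iperf3 before and after each change:

```
# /etc/sysctl.d/90-10gbe.conf  (illustrative values, not a recommendation)
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 4096 131072 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
```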

Markus Dubois

Sep 19, 2024, 8:36:07 AM
to bareos-users
>Did you verify that the NAS system can actually handle a sustained
>data-stream of 300+ MB/s?

Didn't I already do that with iperf3?

I already have Spool Attributes enabled. PostgreSQL is tuned with pgtune.

>So I guess we can rule out signatures and compression then, as that's
>happening on the FD.

So, as I have enough power on the FD side, would re-enabling LZ4 be an option?

>I really don't know. The only thing I can say for sure is that when
>iperf reports only around 3 GBit/s, you definitely won't see a backup
>transmitting more than 300 MB/s.

So, the value to look for is the rate on the actual network card? There I sometimes get 300 MB/s during job execution.

On the FD job report after 2 hours of runtime I get:
Files=4,775,011 Bytes=1,626,243,569,572 Bytes/sec=239,048,003 Errors=0

which is much better. The last change I made was switching the signature from MD5 to XXH128.

Markus Dubois

Sep 22, 2024, 2:47:11 PM
to bareos-users
I had quite some success with 220.500/s throughput for one full backup, but this was only achieved with Block Checksum = no
on the storage daemon device side. Great?
No, I had errors during restore... :-)
So I reverted, and now I'm back to 105.000/s throughput.

Andreas Rogge

Sep 25, 2024, 9:46:07 AM
to bareos...@googlegroups.com
On 22.09.24 at 20:47, Markus Dubois wrote:
> I had quite some success with 220.500/s throughput for one full backup,
> but this was only achieved with Block Checksum = no
That's pretty bad - it means calculating the CRC32 block checksums
significantly slows down your SD. Which somewhat confirms what I
suspected: the SD running on your Intel Atom is the bottleneck slowing
down the data stream.
We have been discussing and evaluating a faster replacement for CRC32
that is currently used to checksum the blocks. However, I'm not sure
when this will be implemented, if it is implemented at all.
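The CRC32 cost is easy to ballpark on the SD host. Python's zlib CRC32 is not necessarily the same implementation Bareos uses, so treat this as a rough single-core figure rather than a measurement of the SD itself:

```python
import time
import zlib

# Checksum 256 MiB in 1 MiB blocks, loosely mimicking per-block
# checksumming while writing a volume.
block = bytes(1 << 20)
iterations = 256

start = time.perf_counter()
crc = 0
for _ in range(iterations):
    crc = zlib.crc32(block, crc)
elapsed = time.perf_counter() - start
print(f"CRC32: {iterations * len(block) / elapsed / 1e6:.0f} MB/s")
```

If the rate printed on the Atom is in the low hundreds of MB/s, the checksum alone can explain the throughput ceiling Markus saw.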

> on the storage daemon device side. Great?
> No, I had errors during restore... :-)
That's actually weird. Would you mind sharing a joblog? I'd really like
to see where/how it failed. Turning off the block checksums removes a
safety belt, but for sure shouldn't produce data errors.
Basically if it doubles performance for you and you trust your storage
system, it should be okay to disable it.
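For reference, the directive in question lives in the SD's Device resource (the device name here is illustrative):

```
Device {
  Name = FileStorage        # hypothetical name
  ...
  Block Checksum = no       # skips per-block CRC32; only with trusted storage
}
```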

> So I reverted, and now I'm back to 105.000/s throughput
:(

Markus Dubois

Oct 19, 2024, 5:09:56 AM
to bareos-users
I've abandoned the Atom CPU solution and put more beef on it. Unfortunately, I don't have job logs from back then anymore.
At the moment I get 280.000/s throughput, which is much better. I'm improving it further with optimizations on the disk subsystem.
So for the moment, the problem is fixed.

Andreas Rogge

Oct 23, 2024, 6:43:27 AM
to bareos...@googlegroups.com
On 19.10.24 at 11:09, Markus Dubois wrote:
> I've abandoned the Atom CPU solution and put more beef on it.
> Unfortunately, I don't have job logs from back then anymore.
> At the moment I get 280.000/s throughput, which is much better. I'm
> improving it further with optimizations on the disk subsystem.
> So for the moment, the problem is fixed.

Glad to hear that you found a solution. Nevertheless it's a pity it
didn't work out on the Atom machine :(