I would like to summarize in this message some methods that can significantly improve the speed of Bareos backups on a Ceph object store. I also leave a patch for the developers, which will hopefully be included in Bareos.
For the impatient:
- Maximum Block Size = 4194304 (add to Pool definition)
- Compile libradosstriper support into bareos-sd
- Apply the patch to allow objects > 4M
One of the exceptional achievements of Bareos is its speed. Ceph also has very good speed, data redundancy, on-the-fly storage growth and other goodies. But Bareos and Ceph do not perform well together. See for example https://groups.google.com/d/msg/bareos-users/dz_Cb-DxQ0w/WB6v0KR1GQAJ
In that particular case the speed was only 3MB/s, when Rados benchmarking would lead one to expect 70MB/s. I have seen similar situations on two different installations (Ceph Jewel and Ceph Luminous).
1. Fundamental problem - Ceph's elementary piece of data is called an object, and the low-level library that works with objects is called Rados. For performance reasons a Rados object should be 4M-32M in size (the bigger the object, the longer the delay for data recovery in case of a disk/network failure). The elementary data piece of Bareos is the Volume, whose size for a Full backup is usually 25G-50G. The latest version of Ceph makes this situation even worse: earlier versions of Ceph allowed objects of up to 100G in size, but now there is a limit of 128M.
In Bareos language Volume = Rados object, and with such a drastic difference in size there seems to be no solution at all: with a small object size the Bareos database will choke on too many volumes, and with a large object size Ceph will perform poorly.
2. But actually there is a solution, and it is called striper mode. This mode was introduced by CERN, which has gazillions of bytes of physics data and wanted to be as close to the object store as possible (they did not want a block-device abstraction, HTTP, or CephFS in their way, perhaps like the Bareos developers). So they built a mode that stripes big objects into little ones. This way small objects are stored, but they can be referred to as very large chunks of data of almost unlimited size:
root@backup4:~# rados -p backup ls
Full-5409.0000000000000000
Full-5409.0000000000000001
Full-5409.0000000000000002
Full-5409.0000000000000003
.....
hundreds of objects
root@backup4:~# rados -p backup ls --striper
Full-5409
Incr-5822
Just 2 "big striped objects"
And thanks to the Bareos developers this mode can be used. I just introduced small modifications and bug fixes to an overall excellent piece of code.
3. I would argue that:
- The Bareos distribution should have libradosstriper compiled in by default. (Currently libradosstriper is NOT compiled into the binaries distributed on the Bareos website.)
- In the configuration files, Rados striped mode should be the default Bareos mode, and you should have to make an effort to turn it off.
4. The underlying low-level write routine is called rados_striper_write(). It is a blocking call, and it performs well when the block size is several megabytes. The Bareos default block size is 64K, so you can significantly boost performance by setting, in the pool definition:
Maximum Block Size = 4194304
3.39 MB/s   Block Size = 64K
20.44 MB/s  Block Size = 1M
32.90 MB/s  Block Size = 2M
39.95 MB/s  Block Size = 4M (you cannot increase the block size further)
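For reference, this is a Pool-resource directive of the Director configuration; a minimal sketch of where it goes (the pool name and the other directives here are illustrative, not from my actual setup):

```
Pool {
  Name = CephFull                 # hypothetical pool name, adjust to yours
  Pool Type = Backup
  # 4M blocks match the Rados stripe_unit and avoid the 64K-block slowdown
  Maximum Block Size = 4194304
}
```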
5. If you do not apply the patch the Rados Device definition can only be:
Device Options = "conffile=/etc/ceph/ceph.conf,poolname=backup,striped,stripe_unit=4194304,stripe_count=1"
(Attention: no spaces are allowed between the double quotes.)
Any other combination will crash the storage daemon. If you are OK with this striping setting, then you can use the official code.
6. The logic of striper mode is very much the same as in RAID-0. There are 3 parameters that drive it (there is almost no documentation on this in Ceph):
stripe_unit - the stripe size (default = 4M)
stripe_count - how many objects to write in parallel (default = 1)
object_size - when to stop growing an object and create new ones (default = 4M)
For example, if you write 128M of data (128 consecutive 1M pieces) in striped mode with the following parameters (object_size should always be > stripe_unit):
stripe_unit = 8M
stripe_count = 4
object_size = 24M
Then 8 objects will be created (4 of size 24M and 4 of size 8M):
Obj1=24M  Obj2=24M  Obj3=24M  Obj4=24M
00 .. 07  08 .. 0f  10 .. 17  18 .. 1f   <-- consecutive 1M pieces of data
20 .. 27  28 .. 2f  30 .. 37  38 .. 3f
40 .. 47  48 .. 4f  50 .. 57  58 .. 5f
Obj5= 8M  Obj6= 8M  Obj7= 8M  Obj8= 8M
60 .. 67  68 .. 6f  70 .. 77  78 .. 7f
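The layout in the diagram can be reproduced with a few lines of arithmetic. This is my own sketch of the RAID-0-style placement rule (not code from Bareos or libradosstriper):

```python
# Sketch of striper-style placement: for a given byte offset, compute the
# index of the striped sub-object it lands in. All sizes are in bytes.
def object_index(offset, stripe_unit, stripe_count, object_size):
    # An "object set" is a group of stripe_count objects, each up to
    # object_size big; sets are filled one after another.
    object_set_size = object_size * stripe_count
    object_set = offset // object_set_size
    # Within a set, stripe units are distributed round-robin.
    stripe_no = (offset % object_set_size) // stripe_unit
    return object_set * stripe_count + stripe_no % stripe_count

M = 1024 * 1024
# The example above: 128 consecutive 1M pieces with stripe_unit=8M,
# stripe_count=4, object_size=24M.
layout = [object_index(i * M, 8 * M, 4, 24 * M) for i in range(128)]
# Pieces 0x00-0x5f fill objects 0-3 (24M each); 0x60-0x7f go to objects 4-7.
print(layout[0x00], layout[0x08], layout[0x20], layout[0x60], layout[0x7f])
# -> 0 1 0 4 7
```

Mapping the offset back within the object (offset % stripe_unit plus the object's stripe position) works the same way, which is why reads of arbitrary ranges stay cheap.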
So if you have 4 or 16 OSDs and would like to get better performance, you may want to play with these parameters. In my case I have a very modest Ceph installation and did not see any significant improvement.
If you apply the patch you can try different settings for striping, for example:
Device Options = "conffile=/etc/ceph/ceph.conf,poolname=backup,striped,stripe_unit=4194304,stripe_count=2,object_size=33554432"
7. Is there room for further improvement? I believe there is: asynchronous IO mode, which is also supported by libradosstriper but is not used in Bareos. I actually saw speeds of about 100MB/s with it (I only have a 1Gbps network core, so it could be more). Reports by the CERN IT group also suggest that async IO is the way to increase performance.
I hope I can find the time to write this piece, unless someone who is good at async IO programming is willing to step up.
8. I hope this will attract more people who are interested in using Bareos together with Ceph Object Storage.
Alexander.
Great to see some work being done there!
I have been using the RADOS backend for some time now, but with bad performance. Without striping, producing massive objects, I got ca. 8MB/s; after moving to striping, it even dropped to 3-5MB/s. With "rados bench", I can achieve ca. 45MB/s from my storage daemon host.
I came up with almost the same configuration as you:
Device Options = "conffile=/etc/ceph/ceph.conf,poolname=backup-striped,striped,stripe_unit=4194304,stripe_count=1024"
Except the stripe count. Could this be the reason for my bad performance?
Which of these options can be changed without rendering the already written volumes unreadable?
I use Bareos 16.2.6 (built from source) on EL7 with Ceph 12.2.1, with 8 OSDs across two hosts.
I plan on upgrading to 17.2 as soon as it is released...
Thanks
Martin
In the meantime, I upgraded to Bareos 17.2 (built from the Git branch "bareos-17.2"), and I am running Ceph 12.2.3.
My new devices with striping look like this:
Device {
Name = InfrastructureBackup
Device Type = Rados
Media Type = RadosFile
Archive Device = "Rados Device 2"
Device Options = "conffile=/etc/ceph/ceph.conf,poolname=backup-fra1-infrastructure,striped,stripe_unit=4194304,stripe_count=16"
Maximum Block Size = 4194304
Minimum Block Size = 2097152
LabelMedia = yes;
Random Access = Yes;
AutomaticMount = yes;
RemovableMedia = no;
Maximum Concurrent Jobs = 4
}
I still get only 4-11MB/s. And lately I get random SIGSEGVs within libradosstriper when trying to label a new volume:
Thread 2 (Thread 0x7f81f6ffd700 (LWP 204442)):
#0 0x00007f822ffa2eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1 0x00007f822ff2005d in _L_lock_14730 () from /usr/lib64/libc.so.6
#2 0x00007f822ff1d163 in malloc () from /usr/lib64/libc.so.6
#3 0x00007f822fecf898 in _nl_make_l10nflist () from /usr/lib64/libc.so.6
#4 0x00007f822fecd64a in _nl_find_domain () from /usr/lib64/libc.so.6
#5 0x00007f822fecce8e in __dcigettext () from /usr/lib64/libc.so.6
#6 0x00007f8232059637 in signal_handler (sig=11) at signal.c:152
#7 <signal handler called>
#8 0x00007f822ff185db in malloc_consolidate () from /usr/lib64/libc.so.6
#9 0x00007f822ff1a3a5 in _int_malloc () from /usr/lib64/libc.so.6
#10 0x00007f822ff1d10c in malloc () from /usr/lib64/libc.so.6
#11 0x00007f822ff1f65c in posix_memalign () from /usr/lib64/libc.so.6
#12 0x00007f8230b420a6 in ceph::buffer::list::append(char const*, unsigned int) () from /usr/lib64/librados.so.2
#13 0x00007f8230e2d734 in rados_striper_write () from /usr/lib64/libradosstriper.so.1
#14 0x00007f82324d21c7 in rados_device::write_object_data (this=this@entry=0x7f8208002ac8, offset=<optimized out>, buffer=<optimized out>, count=294) at backends/rados_device.c:391
#15 0x00007f82324d222a in rados_device::d_write (this=0x7f8208002ac8, fd=<optimized out>, buffer=<optimized out>, count=<optimized out>) at backends/rados_device.c:416
#16 0x00007f82324b8927 in DEVICE::write (this=0x7f8208002ac8, buf=0x7f81cc454ec0, len=len@entry=294) at dev.c:1190
#17 0x00007f82324b2fba in DCR::write_block_to_dev (this=this@entry=0x7f81cc040718) at block.c:596
#18 0x00007f82324bb8e4 in write_new_volume_label_to_dev (dcr=dcr@entry=0x7f81cc040718, VolName=VolName@entry=0x7f81cc040830 "InfrastructureIncrVolume-8503", PoolName=PoolName@entry=0x7f81cc0408b0 "InfrastructureIncrToDiskPool", relabel=relabel@entry=false)
at label.c:433
#19 0x00007f82324be75c in DCR::try_autolabel (this=0x7f81cc040718, opened=<optimized out>) at mount.c:771
#20 0x00007f82324bf9fa in DCR::mount_next_write_volume (this=this@entry=0x7f81cc040718) at mount.c:232
#21 0x00007f82324b8fc1 in fixup_device_block_write_error (dcr=dcr@entry=0x7f81cc040718, retries=retries@entry=4) at device.c:120
#22 0x00007f82324b39eb in DCR::write_block_to_device (this=this@entry=0x7f81cc040718) at block.c:410
#23 0x00007f82324c2d45 in DCR::write_record (this=this@entry=0x7f81cc040718) at record.c:668
#24 0x000000000040857c in do_append_data (jcr=jcr@entry=0x7f81cc00f148, bs=bs@entry=0x17be8e8, what=what@entry=0x41b03c "FD") at append.c:223
#25 0x000000000040f7d4 in append_data_cmd (jcr=0x7f81cc00f148) at fd_cmds.c:271
#26 0x000000000040fbe9 in do_fd_commands (jcr=0x7f81cc00f148) at fd_cmds.c:227
#27 0x000000000040fde2 in run_job (jcr=jcr@entry=0x7f81cc00f148) at fd_cmds.c:183
#28 0x000000000041091b in do_job_run (jcr=0x7f81cc00f148) at job.c:238
#29 0x000000000040f012 in handle_director_connection (dir=dir@entry=0x17be568) at dir_cmd.c:315
#30 0x0000000000414f78 in handle_connection_request (arg=0x17be568) at socket_server.c:100
#31 0x00007f8232063905 in workq_server (arg=arg@entry=0x621ea0 <socket_workq>) at workq.c:336
#32 0x00007f823204b575 in lmgr_thread_launcher (x=0x17bf708) at lockmgr.c:928
#33 0x00007f82312f4e25 in start_thread () from /usr/lib64/libpthread.so.0
#34 0x00007f822ff9534d in clone () from /usr/lib64/libc.so.6
(Since the Bareos signal handler tries to print a message after catching the signal - calling gettext and malloc from within the handler while the interrupted thread already holds the malloc lock, as frames #0-#6 show - the process does not actually crash, but just freezes on that lock.)
Unfortunately, last night I ran it under valgrind/memcheck and the error did not appear, so I am still on the hunt.
But the performance is still not what I expect, did I choose some bad values in my config?
Cheers,
Martin
Hi!
Thanks! I tried with:
Device Options = "conffile=/etc/ceph/ceph.conf,poolname=backup-fra1-infrastructure,striped,stripe_unit=4194304,stripe_count=8,object_size=33554432"
I am running a big 100GB backup now; I still get only ca. 11MB/s.
With rados bench and similar parameters, I get my 40MB/s:
# rados --pool backup-fra1-infrastructure bench -t 1 -b 4194304 -o 33554432 --striper --run-name test1 180 write
hints = 1
Maintaining 1 concurrent writes of 4194304 bytes to objects of size 33554432 for up to 180 seconds or 0 objects
Object prefix: benchmark_data_ceph-bareos-sd_1261337
sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
  0       1         1         0         0         0            -          0
  1       1        11        10   39.9557        40    0.0970568  0.0908323
  2       1        23        22   43.9704        48    0.0860451  0.0908764
  3       1        33        32   42.6454        40     0.087131  0.0917999
  4       1        43        42   41.9829        40     0.153463  0.0948246
  5       1        52        51   40.7853        36     0.104806  0.0969382
  6       1        63        62   41.3198        44    0.0833655  0.0964519
  7       1        73        72   41.1305        40     0.117676  0.0970234
  8       1        83        82   40.9883        40     0.092857  0.0965837
  9       1        93        92   40.8777        40    0.0959908  0.0972386
 10       1       104       103   41.1892        44    0.0901445  0.0969038
Cheers,
Martin

On 01.03.2018 at 12:33, Alexander Kushnirenko <kushn...@gmail.com> wrote:
Hi, Martin!
Just for reference - here is my conf file. You may benefit from increasing the object_size parameter:
Device {
Name = RadosStorageDevice
Archive Device = "Rados Device"
Device Options = "conffile=/etc/ceph/ceph.conf,poolname=backup,striped,stripe_unit=4194304,stripe_count=2,object_size=33554432"
Device Type = rados
Media Type = RadosObject
Label Media = yes
Random Access = yes
Automatic Mount = yes
Removable Media = no
Always Open = no
}
Alexander
What I have learned about Ceph is that it might benefit from kernel tuning, for example:
vm.swappiness=1
vm.dirty_background_ratio=1
vm.dirty_ratio=10
vm.dirty_expire_centisecs=1000
vm.dirty_writeback_centisecs=25
For hosts running old OSD nodes (those using FileStore, not BlueStore) you can enable more write caching. I got this idea after examining Ceph OSD nodes running on CentOS 7: in nmon I saw that the kernel was not able to simultaneously access the disks and handle network traffic. The numbers above may not be exactly what I used; however, they gave a huge improvement. It would be great if someone could confirm.
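To make such settings persistent across reboots, they usually go into a sysctl.d drop-in; a sketch (the file name is an example, the values are the ones quoted above and should be validated against your own workload):

```
# /etc/sysctl.d/90-ceph-osd.conf  (example file name)
vm.swappiness = 1
vm.dirty_background_ratio = 1
vm.dirty_ratio = 10
vm.dirty_expire_centisecs = 1000
vm.dirty_writeback_centisecs = 25
```

Apply with "sysctl --system" and verify with "sysctl vm.dirty_ratio" etc.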
Ceph is powerful but complex. There should be publicly available guidance from Red Hat; however, here are a few general recommendations:
- 10Gbit networking (or bonded multiple 1Gbit links)
- separate 10Gbit networking for cluster backbone
- good understanding of how Ceph places data across hosts and disks
- SSDs, or at least OSD journals on SSDs
When I was testing Ceph I used to start nmon on every node - this was the simplest solution. For keeping clusters in balance (performance-wise) I'd suggest installing decent monitoring like Prometheus+Grafana. It will help in finding bottlenecks and hotspots.
BTW, what is the point of using S3 on Ceph for Bareos? Going straight with librados should save some infrastructure overhead (if you are not interested in replicating across clusters).
Regards,
Bart