How to get good speed with Rados (Ceph) storage daemon


Alexander Kushnirenko

Oct 16, 2017, 7:21:12 PM
to bareos-users
Hello,

I would like to summarize in this message some methods that can significantly improve the speed of Bareos backups to a Ceph object store. I also attach a patch for the developers, which will hopefully be included in Bareos.

For the impatient:
- Maximum Block Size = 4194304 (add to the Pool definition)
- Compile libradosstriper support into bareos-sd
- Apply the attached patch to allow objects > 4M

One of the exceptional achievements of Bareos is its speed. Ceph also has very good speed, data redundancy, on-the-fly storage expansion and other goodies. But Bareos and Ceph do not perform well together. See for example https://groups.google.com/d/msg/bareos-users/dz_Cb-DxQ0w/WB6v0KR1GQAJ
In that particular case the speed was only 3 MB/s, whereas Rados benchmarking would lead you to expect about 70 MB/s. I have seen similar situations on two different installations (Ceph Jewel and Ceph Luminous).

1. The fundamental problem: Ceph's elementary piece of data is called an object, and the low-level library that works with objects is called Rados. For performance reasons a Rados object should be 4M-32M in size (the bigger the object, the longer the delay for data recovery in case of a disk or network failure). The elementary data piece of Bareos is the Volume, and its size for a Full backup is usually 25G-50G. The latest version of Ceph makes this situation even worse: earlier Ceph versions allowed objects of up to 100G, but now there is a limit of 132M.

In Bareos language Volume = Rados object, and with such a drastic difference in size there seems to be no solution at all. With a small object size the Bareos database will choke on too many volumes, and with a large object size Ceph will perform poorly.

2. But actually there is a solution, and it is called striper mode. This mode was introduced by CERN, who have gazillions of physics data and wanted to stay as close to the object store as possible (they did not want a block device abstraction, HTTP, or CephFS in their way - perhaps like the Bareos developers). So they built a mode which stripes big objects into little ones. This way small objects are stored on disk, but they are referred to as very large chunks of data of almost unlimited size:

root@backup4:~# rados -p backup ls
Full-5409.0000000000000000
Full-5409.0000000000000001
Full-5409.0000000000000002
Full-5409.0000000000000003
.....
hundreds of objects

root@backup4:~# rados -p backup ls --striper
Full-5409
Incr-5822
Just 2 "big striped objects"

And thanks to the Bareos developers this mode can already be used. I just introduced small modifications and bug fixes to an overall excellent piece of code.

3. I would argue that:
- The Bareos distribution should have libradosstriper compiled in by default. (Currently libradosstriper is NOT compiled into the binaries distributed on the Bareos web site.)
- In the configuration files Rados striped mode should be the default, and you should have to make an effort to turn it off.

4. The underlying low-level write routine is rados_striper_write(). It is a blocking call, and it performs well when the block size is several megabytes. The Bareos default block size is 64K, so you can significantly boost performance by setting in the Pool definition (a full Pool example follows the numbers below):
Maximum Block Size = 4194304

3.39 MB/s with Block Size = 64K
20.44 MB/s with Block Size = 1M
32.90 MB/s with Block Size = 2M
39.95 MB/s with Block Size = 4M (you cannot increase the Block Size further)
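
For reference, a minimal Pool resource with this directive could look like the sketch below. Apart from Maximum Block Size the names and values are just placeholders from my setup, not something Bareos requires:

Pool {
  Name = RadosFullPool                 # placeholder name
  Pool Type = Backup
  Label Format = "Full-"               # matches the object names shown above
  Maximum Volume Bytes = 25G           # one Volume = one striped Rados object
  Maximum Block Size = 4194304         # 4M blocks for rados_striper_write()
}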

5. If you do not apply the patch, the Rados Device definition can only be:
Device Options = "conffile=/etc/ceph/ceph.conf,poolname=backup,striped,stripe_unit=4194304,stripe_count=1"
(Attention: no spaces are allowed between the double quotes.)

Any other combination will kill the storage daemon. If you are OK with this striping setting, then you can use the official code.

6. The logic of striper mode is very much the same as in RAID-0. There are 3 parameters that drive it (there is almost no documentation on this in Ceph):

stripe_unit - the stripe size (default = 4M)
stripe_count - how many objects to write in parallel (default = 1)
object_size - when to stop growing an object and start new ones (default = 4M)

For example, if you write 128M of data (128 consecutive 1M pieces, 0x00-0x7f below) in striped mode with the following parameters (object_size should always be > stripe_unit):
stripe_unit = 8M
stripe_count = 4
object_size = 24M
Then 8 objects will be created (4 of 24M and 4 of 8M):

Obj1=24M Obj2=24M Obj3=24M Obj4=24M
00 .. 07 08 .. 0f 10 .. 17 18 .. 1f <-- consecutive 1M pieces of data
20 .. 27 28 .. 2f 30 .. 37 38 .. 3f
40 .. 47 48 .. 4f 50 .. 57 58 .. 5f

Obj5= 8M Obj6= 8M Obj7= 8M Obj8= 8M
60 .. 67 68 .. 6f 70 .. 77 78 .. 7f

So if you have 4 or 16 OSDs and would like to get better performance, you may want to play with these parameters. In my case I have a very modest Ceph installation and did not see any significant improvement.

If you apply the patch you can try different striping settings, for example:
Device Options = "conffile=/etc/ceph/ceph.conf,poolname=backup,striped,stripe_unit=4194304,stripe_count=2,object_size=33554432"

7. Is there room for further improvement? I believe there is. It is called asynchronous IO mode, which is also supported by libradosstriper but is not used in Bareos. I actually saw speeds of about 100 MB/s with it (I only have a 1 Gbps network core, so it could be more). Reports by the CERN IT group also suggest that async IO is the way to increase performance.

https://indico.cern.ch/event/524549/contributions/2185945/attachments/1289528/1919824/CephForHighThroughput.pdf

I hope to try writing this piece myself, unless someone who is good at async IO programming is willing to step up. A rough sketch of the idea follows below.
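
To give an idea of what I mean, an asynchronous write through the libradosstriper C API could look roughly like this. This is not Bareos code; the helper name and the wait-immediately strategy are just for illustration - a real pipeline would keep several completions in flight:

#include <rados/librados.h>
#include <radosstriper/libradosstriper.h>

/* Sketch of a single asynchronous striped write.  'striper' is an existing
 * rados_striper_t, 'volname' the striped object (= Bareos volume) name. */
static int aio_write_block(rados_striper_t striper, const char *volname,
                           const char *buf, size_t len, uint64_t off)
{
   rados_completion_t comp;
   int ret = rados_aio_create_completion(NULL, NULL, NULL, &comp);
   if (ret < 0) return ret;

   ret = rados_striper_aio_write(striper, volname, comp, buf, len, off);
   if (ret < 0) {
      rados_aio_release(comp);
      return ret;
   }

   /* For brevity we wait right away; the point of async IO is of course to
    * queue the next block(s) before waiting on this completion. */
   rados_aio_wait_for_complete(comp);
   ret = rados_aio_get_return_value(comp);
   rados_aio_release(comp);
   return ret;
}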

8. I hope this will attract more people who are interested in using Bareos together with Ceph Object Storage.

Alexander.


bareos-RadosStriper.patch

Jörg Steffens

Oct 18, 2017, 5:24:26 AM
to bareos...@googlegroups.com
Hello Alexander,

thank you very much for improving Bareos and for this excellent
description of the problems and their solutions.

I've a few questions:

So with your adaptations, you achieve a backup (and restore?) speed of
around 40 MB/s, while native access gives you 70 MB/s?

Is the Block Size of 4194304 bytes a general limitation of Bareos, of
Rados, or of the Rados backend of Bareos?

What "Maximum Volume Size" are you using? Does it have influence on the
performance? Do I get it right, that with radosstriper the (virtual)
Object Size is arbitrary?


> 5. If you do not apply the patch, the Rados Device definition can only be:
> Device Options = "conffile=/etc/ceph/ceph.conf,poolname=backup,striped,stripe_unit=4194304,stripe_count=1"
> (Attention: no spaces are allowed between the double quotes.)
>
> Any other combination will kill the storage daemon. If you are OK with
> this striping setting, then you can use the official code.

Do you mean it will make it slow, or that the Storage Daemon will crash?

Is it okay for you, if I add some of your explanation to the Bareos
documentation?

regards,
Jörg
--
Jörg Steffens joerg.s...@bareos.com
Bareos GmbH & Co. KG Phone: +49 221 630693-91
http://www.bareos.com Fax: +49 221 630693-10

Sitz der Gesellschaft: Köln | Amtsgericht Köln: HRA 29646
Komplementär: Bareos Verwaltungs-GmbH
Geschäftsführer:
S. Dühr, M. Außendorf, Jörg Steffens, P. Storz

Alexander Kushnirenko

Oct 18, 2017, 9:01:54 AM
to bareos-users
Hi, Jörg!

>
> I've a few questions:
>
> So with your adaptations, you achieve a backup (and restore?) speed of
> around 40 MB/s, while native access gives you 70 MB/s?
>

Speed is somewhat setup-dependent, so the numbers should be taken with
caution.

Network speed - 108 MB/s (1 Gbps, no jumbo frames on the public network)
Maximum Bareos write speed that I see - 67 MB/s (Bareos log, large database
backup, no compression, disk backup)
Rados benchmark - 100 MB/s (from SSD disk, Ceph redundancy=2, so you write
2 copies; I still have some doubts about what "write complete" means - writing
to memory, writing to disk, or writing to both disks - but Rados provides
signals)
Bareos write to Rados - 34 MB/s (from the Bareos log file, large database, no
compression, 4 MB blocks)
Bareos read from Rados - 26 MB/s (I think this number should be larger, as I
restored to a remote node with slow disks)


>
> Is the Block Size of 4194304 bytes a general limitation of Bareos, of
> Rados, or of the Rados backend of Bareos?
>

There is a segmentation fault in rados_striper_write() if you use 8M, for
example - I tested it with stand-alone code. I also believe there may be
all sorts of trouble in other places when working with such big buffers. But
I think that even with a 64K buffer you should get good results with
async IO. I consider the large buffer a temporary measure.
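
For reference, a stand-alone test of this kind can be as small as the sketch below (error checking stripped; the pool name "backup", the object name and the conffile path are just examples, not what Bareos uses internally). Passing 8388608 as the block size is the kind of run that shows the problem I mention:

#include <stdio.h>
#include <stdlib.h>
#include <rados/librados.h>
#include <radosstriper/libradosstriper.h>

/* Write one block of the given size to a striped object and report the
 * return code.  Usage: ./striper_test [block_size_in_bytes]             */
int main(int argc, char **argv)
{
   size_t block = (argc > 1) ? strtoul(argv[1], NULL, 0) : 4194304;
   char *buf = calloc(1, block);
   rados_t cluster;
   rados_ioctx_t ioctx;
   rados_striper_t striper;

   rados_create(&cluster, NULL);
   rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
   rados_connect(cluster);
   rados_ioctx_create(cluster, "backup", &ioctx);
   rados_striper_create(ioctx, &striper);

   int ret = rados_striper_write(striper, "striper-test", buf, block, 0);
   printf("rados_striper_write(%zu bytes) returned %d\n", block, ret);

   rados_striper_destroy(striper);
   rados_ioctx_destroy(ioctx);
   rados_shutdown(cluster);
   free(buf);
   return ret < 0 ? 1 : 0;
}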

>
> What "Maximum Volume Size" are you using? Does it have influence on the
> performance?


I use Maximum Volume Bytes = 25G. No, I do not see any influence.


> Do I get it right, that with radosstriper the (virtual) Object Size is
> arbitrary?
>
Correct. 25G is certainly possible. I think the limit is far beyond this
- I can try to create 1 TB striped objects if that is of interest. Perhaps
the CERN guys know the exact answer.


>
> > 5. If you do not apply the patch, the Rados Device definition can only be:
> > Device Options = "conffile=/etc/ceph/ceph.conf,poolname=backup,striped,stripe_unit=4194304,stripe_count=1"
> > (Attention: no spaces are allowed between the double quotes.)
> >
> > Any other combination will kill the storage daemon. If you are OK with
> > this striping setting, then you can use the official code.
>
> Do you mean it will make it slow, or that the Storage Daemon will crash?
>

The storage daemon will crash. libradosstriper assumes object_size=4M, so when
you set stripe_unit=8M the code will crash, as it checks assert(stripe_unit <
object_size). I think 32M is a reasonable object size. If you set it to 4M (the
Ceph default) you will have 1.5M objects on a 6TB disk. The Ceph developers say
that is OK, but it sounds like 1M objects start to affect performance. The
maximum object size is 132M, so 32M is somewhere in between. With 32M there are
about 0.2M objects, and I think this is OK, but I do not have that experience
yet. Many Ceph installations have an SSD to store the "object directory", and I
do not want that for backup purposes. Perhaps we need to look at this issue in
more detail.


> Is it okay for you, if I add some of your explanation to the Bareos
> documentation?
>

Oh, absolutely! I hope more people will try to use the Rados object store for
backup, and perhaps we can gain more knowledge there. I still feel I am on thin
ice with this setup.

I would also like to try async IO for Rados, as I saw 100 MB/s speeds with it,
and on a 10G network it should be even more. Do you have examples of async IO
code in Bareos? (I could use them as guidance, unless there is someone much
more experienced than me.)

regards,
Alexander

Jörg Steffens

Oct 18, 2017, 9:16:24 AM
to bareos...@googlegroups.com
Hi Alexander,

thank you again for your detailed explanation.

I will try to add libradosstriper and your patch to the bareos-17.2
branch.

> I would also like to try async IO for Rados, as I saw 100 MB/s speeds with it,
> and on a 10G network it should be even more. Do you have examples of async IO
> code in Bareos? (I could use them as guidance, unless there is someone much
> more experienced than me.)

Another approach to speeding up backup to Ceph is implemented in
https://github.com/bareos/bareos/pull/61

It implements a Bareos S3 backend, and we have used it against the Ceph S3 radosgw.

It is based on libdroplet and AFAIK can be used in async-IO mode.

We are currently evaluating it. Maybe it is also interesting for you.

Jörg Steffens

Nov 8, 2017, 12:26:42 PM
to bareos...@googlegroups.com
Hi Alexander,

an update on this:

> 3. I would argue that:
> - The Bareos distribution should have libradosstriper compiled in by default. (Currently libradosstriper is NOT compiled into the binaries distributed on the Bareos web site.)

libradosstriper is compiled in for some distributions, more specifically
Ubuntu 16.04. After checking this again, I noticed that libradosstriper is
also included in Debian 9, so I enabled it there as well (master and
bareos-17.2).

I also understand that using Bareos with the Rados backend is not useful
without libradosstriper. Therefore I disabled it for older Debian and
Ubuntu distributions (again, master and bareos-17.2 only).

What distribution are you using?

I have also included your patch in
https://github.com/bareos/bareos/commits/bareos-17.2. It will be
propagated to master soon.

> Another approach to speeding up backup to Ceph is implemented in
> https://github.com/bareos/bareos/pull/61
>
> It implements a Bareos S3 backend, and we have used it against the Ceph S3 radosgw.

This has also been integrated into bareos-17.2.

I hope to find time soon to update the documentation.

regards,
Jörg

Martin Emrich

Nov 27, 2017, 8:56:34 AM
to bareos-users
Hi!

Great to see some work being done there!

I have been using the RADOS backend for some time now, but with bad performance. Without striping (producing massive objects) I got ca. 8 MB/s; after moving to striping it even dropped to 3-5 MB/s. With "rados bench" I can achieve ca. 45 MB/s from my storage daemon host.

I came up with almost the same configuration as you:

Device Options = "conffile=/etc/ceph/ceph.conf,poolname=backup-striped,striped,stripe_unit=4194304,stripe_count=1024"

Except the stripe count. Could this be the reason for my bad performance?
Which of these options can be changed without rendering the already written volumes unreadable?

I use Bareos 16.2.6 (built from source) on EL7 with Ceph 12.2.1, with 8 OSDs across two hosts.

I plan on upgrading to 17.2 as soon as it is released...

Thanks

Martin

Martin Emrich

Mar 1, 2018, 5:22:24 AM
to bareos-users
Some updates from my side:

In the meantime I upgraded to Bareos 17.2 (from Git branch "bareos-17.2"), and I am running Ceph 12.2.3.

My new devices with striping look like this:

Device {
  Name = InfrastructureBackup
  Device Type = Rados
  Media Type = RadosFile
  Archive Device = "Rados Device 2"
  Device Options = "conffile=/etc/ceph/ceph.conf,poolname=backup-fra1-infrastructure,striped,stripe_unit=4194304,stripe_count=16"
  Maximum Block Size = 4194304
  Minimum Block Size = 2097152
  LabelMedia = yes;
  Random Access = Yes;
  AutomaticMount = yes;
  RemovableMedia = no;
  Maximum Concurrent Jobs = 4
}

I still get only 4-11 MB/s. And lately I get random SIGSEGVs within libradosstriper when trying to label a new volume:

Thread 2 (Thread 0x7f81f6ffd700 (LWP 204442)):
#0 0x00007f822ffa2eec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1 0x00007f822ff2005d in _L_lock_14730 () from /usr/lib64/libc.so.6
#2 0x00007f822ff1d163 in malloc () from /usr/lib64/libc.so.6
#3 0x00007f822fecf898 in _nl_make_l10nflist () from /usr/lib64/libc.so.6
#4 0x00007f822fecd64a in _nl_find_domain () from /usr/lib64/libc.so.6
#5 0x00007f822fecce8e in __dcigettext () from /usr/lib64/libc.so.6
#6 0x00007f8232059637 in signal_handler (sig=11) at signal.c:152
#7 <signal handler called>
#8 0x00007f822ff185db in malloc_consolidate () from /usr/lib64/libc.so.6
#9 0x00007f822ff1a3a5 in _int_malloc () from /usr/lib64/libc.so.6
#10 0x00007f822ff1d10c in malloc () from /usr/lib64/libc.so.6
#11 0x00007f822ff1f65c in posix_memalign () from /usr/lib64/libc.so.6
#12 0x00007f8230b420a6 in ceph::buffer::list::append(char const*, unsigned int) () from /usr/lib64/librados.so.2
#13 0x00007f8230e2d734 in rados_striper_write () from /usr/lib64/libradosstriper.so.1
#14 0x00007f82324d21c7 in rados_device::write_object_data (this=this@entry=0x7f8208002ac8, offset=<optimized out>, buffer=<optimized out>, count=294) at backends/rados_device.c:391
#15 0x00007f82324d222a in rados_device::d_write (this=0x7f8208002ac8, fd=<optimized out>, buffer=<optimized out>, count=<optimized out>) at backends/rados_device.c:416
#16 0x00007f82324b8927 in DEVICE::write (this=0x7f8208002ac8, buf=0x7f81cc454ec0, len=len@entry=294) at dev.c:1190
#17 0x00007f82324b2fba in DCR::write_block_to_dev (this=this@entry=0x7f81cc040718) at block.c:596
#18 0x00007f82324bb8e4 in write_new_volume_label_to_dev (dcr=dcr@entry=0x7f81cc040718, VolName=VolName@entry=0x7f81cc040830 "InfrastructureIncrVolume-8503", PoolName=PoolName@entry=0x7f81cc0408b0 "InfrastructureIncrToDiskPool", relabel=relabel@entry=false)
at label.c:433
#19 0x00007f82324be75c in DCR::try_autolabel (this=0x7f81cc040718, opened=<optimized out>) at mount.c:771
#20 0x00007f82324bf9fa in DCR::mount_next_write_volume (this=this@entry=0x7f81cc040718) at mount.c:232
#21 0x00007f82324b8fc1 in fixup_device_block_write_error (dcr=dcr@entry=0x7f81cc040718, retries=retries@entry=4) at device.c:120
#22 0x00007f82324b39eb in DCR::write_block_to_device (this=this@entry=0x7f81cc040718) at block.c:410
#23 0x00007f82324c2d45 in DCR::write_record (this=this@entry=0x7f81cc040718) at record.c:668
#24 0x000000000040857c in do_append_data (jcr=jcr@entry=0x7f81cc00f148, bs=bs@entry=0x17be8e8, what=what@entry=0x41b03c "FD") at append.c:223
#25 0x000000000040f7d4 in append_data_cmd (jcr=0x7f81cc00f148) at fd_cmds.c:271
#26 0x000000000040fbe9 in do_fd_commands (jcr=0x7f81cc00f148) at fd_cmds.c:227
#27 0x000000000040fde2 in run_job (jcr=jcr@entry=0x7f81cc00f148) at fd_cmds.c:183
#28 0x000000000041091b in do_job_run (jcr=0x7f81cc00f148) at job.c:238
#29 0x000000000040f012 in handle_director_connection (dir=dir@entry=0x17be568) at dir_cmd.c:315
#30 0x0000000000414f78 in handle_connection_request (arg=0x17be568) at socket_server.c:100
#31 0x00007f8232063905 in workq_server (arg=arg@entry=0x621ea0 <socket_workq>) at workq.c:336
#32 0x00007f823204b575 in lmgr_thread_launcher (x=0x17bf708) at lockmgr.c:928
#33 0x00007f82312f4e25 in start_thread () from /usr/lib64/libpthread.so.0
#34 0x00007f822ff9534d in clone () from /usr/lib64/libc.so.6

(Since the Bareos signal handler tries to print something after catching the signal, the process does not actually crash, but just freezes).

Unfortunately, last night I ran it within valgrind/memcheck and the error did not appear, so I am still on the hunt.

But the performance is still not what I expect. Did I choose some bad values in my config?

Cheers,

Martin

Alexander Kushnirenko

Nov 14, 2018, 5:52:38 PM
to Martin Emrich, bareos-users
Hi,

I would like to confirm that on Debian 9.0, ceph-luminous 12.2.0 and bareos-17.2.4 give reasonable speed for the Rados device (both taken from the official repositories).

Rados benchmarking gives 40 MB/s and I get about 25 MB/s, which is somewhat lower but in the right range.
Device Options = "conffile=/etc/ceph/ceph.conf,poolname=backup,striped,stripe_unit=4194304,stripe_count=2,object_size=33554432"

It is very important to have Maximum Block Size = 4194304 either in the Pool definition or in the Storage Device definition (I'm not sure which one is the "right" place to put this parameter). With the default Block Size = 64K the speed drops by a factor of 3.

Alexander.

On Thu, Mar 1, 2018 at 8:32 PM Martin Emrich <frozenh...@gmail.com> wrote:
Hi!

Thanks! I tried with:

  Device Options = "conffile=/etc/ceph/ceph.conf,poolname=backup-fra1-infrastructure,striped,stripe_unit=4194304,stripe_count=8,object_size=33554432"

I am running a big 100GB backup now; I still get only ca. 11 MB/s.

With rados bench and similar parameters, I get my 40MB/s:

# rados --pool backup-fra1-infrastructure bench -t 1 -b 4194304 -o 33554432 --striper --run-name test1 180 write
hints = 1
Maintaining 1 concurrent writes of 4194304 bytes to objects of size 33554432 for up to 180 seconds or 0 objects
Object prefix: benchmark_data_ceph-bareos-sd_1261337
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       1         1         0         0         0           -           0
    1       1        11        10   39.9557        40   0.0970568   0.0908323
    2       1        23        22   43.9704        48   0.0860451   0.0908764
    3       1        33        32   42.6454        40    0.087131   0.0917999
    4       1        43        42   41.9829        40    0.153463   0.0948246
    5       1        52        51   40.7853        36    0.104806   0.0969382
    6       1        63        62   41.3198        44   0.0833655   0.0964519
    7       1        73        72   41.1305        40    0.117676   0.0970234
    8       1        83        82   40.9883        40    0.092857   0.0965837
    9       1        93        92   40.8777        40   0.0959908   0.0972386
   10       1       104       103   41.1892        44   0.0901445   0.0969038

Cheers,

Martin

On 01.03.2018 at 12:33, Alexander Kushnirenko <kushn...@gmail.com> wrote:

Hi, Martin!

Just for reference - here is my conf file. You may benefit from increasing the object_size parameter.

Device {
  Name = RadosStorageDevice
  Archive Device = "Rados Device"
  Device Options = "conffile=/etc/ceph/ceph.conf,poolname=backup,striped,stripe_unit=4194304,stripe_count=2,object_size=33554432"
  Device Type = rados
  Media Type = RadosObject
  Label Media = yes
  Random Access = yes
  Automatic Mount = yes
  Removable Media = no
  Always Open = no
}

Alexander





Martin



Bartek R

Dec 30, 2018, 12:27:32 PM
to bareos-users
Hi,

What I have learned about Ceph is that it might benefit from kernel tuning:

(example)
vm.swappiness=1
vm.dirty_background_ratio=1
vm.dirty_ratio=10
vm.dirty_expire_centisecs=1000
vm.dirty_writeback_centisecs=25

For hosts running old OSD nodes (those using filesystems, not BlueStore) you can enable more write caching. I got this idea after examining Ceph OSD nodes running on CentOS 7. In nmon I saw that the kernel was not able to simultaneously access the disk and handle network traffic. The numbers above may not be exactly what I used, however they gave a huge improvement. It would be great if someone could confirm.

Ceph is powerful but complex. There should be publicly available guidance from Red Hat, but here are a few general recommendations:

- 10 Gbit networking (or multiple bonded 1 Gbit links)
- separate 10 Gbit networking for the cluster backbone
- a good understanding of how Ceph places data across hosts and disks
- SSDs, or at least OSD journals on SSDs

When I was testing Ceph I used to start nmon on every node - this was the simplest solution. For keeping clusters in balance (performance-wise) I'd suggest installing decent monitoring like Prometheus+Grafana. It will help find bottlenecks and hotspots.

BTW, what is the point of using S3 on Ceph for Bareos? Going straight to librados should save some infrastructure overhead (if you are not interested in replicating across clusters).

Regards,
Bart
