FileIO vs BlockIO latencies & performance


Morgan Robertson

Dec 23, 2013, 3:56:41 PM
to esos-...@googlegroups.com
Hi all,

Although this is applicable to all Linux block storage targets, I thought I'd ask on this group since there's a lot of general knowledge about Linux SAN systems around here, and I use ESOS. :)

I have a few questions regarding the age-old fileio vs. blockio debate.  To recap, if I understand correctly:

1) FileIO uses standard files as the LUN backend; I/O goes through the Linux page cache (RAM caching) and filesystem code paths.  This allows read caching in addition to buffering writes for improved performance.  It can also introduce some funny latency behavior: data read from RAM is very fast, but your initiators can hit 100+ ms at times.  Marcin Czupryniak mentioned the following a few months back:

"I'm personally against file_io (even if in some situations it can boost performance) for several reasons (VFS translation and page caching latencies) which do not occur with block_io or pass-through SCSI."

2) BlockIO uses a block device as the backend.  Reads and writes go directly to the device, bypassing the standard RAM caches.  You tend to get worse throughput than with FileIO, but your latencies are more consistent (though sometimes slower) due to the lack of lockups and 'VFS translation'.

Does this sound right?  I could easily be wrong and would love to understand more.


If so, here are my questions:

1) Why is FileIO latency so bad at times?  Can anyone elaborate on Marcin's VFS and page caching latencies assertion?  I don't know much about it.  From the initiator, I can get great latency most of the time, but it can climb beyond 100 ms under some workloads (synthetic tests/benchmarks).

2) Does anyone know how to use the O_DIRECT flag so that file-backed devices bypass the page cache?

3) Additionally, can FileIO be more dangerous than BlockIO, in that the chances of data corruption increase if power is lost while the SAN is caching writes via the filesystem code paths?  Or do modern journaled filesystems handle this with little issue?

4) Lastly, is there a reason (or any suggestions) as to why BlockIO is so slow from the initiator side with regard to sequential throughput?  If I read/write to the block devices locally on the SAN, I get great sequential performance (1 GB/sec for zeros), but through my SRP or iSCSI/IPoIB initiators it drops to ~300 MB/sec sequential read/write (though my latency is more consistent than with FileIO).

Thanks for your time.  I'm happy to provide full server/environment details if needed, but I thought I'd keep this somewhat general because I think my issues/questions are not a 'problem' with my server, just different settings.  Please also note that I get similar (slightly slower) performance with CentOS 6.3 instead of ESOS.

Kind regards,

Morgan  

Marc Smith

Dec 27, 2013, 10:34:01 PM
to esos-...@googlegroups.com
On Mon, Dec 23, 2013 at 3:56 PM, Morgan Robertson <morganr...@gmail.com> wrote:
> Hi all,
>
> Although this is applicable to all Linux block storage targets, I thought I'd ask on this group since there's a lot of general knowledge about Linux SAN systems around here, and I use ESOS. :)
>
> I have a few questions regarding the age-old fileio vs. blockio debate.  To recap, if I understand correctly:
>
> 1) FileIO uses standard files as the LUN backend; I/O goes through the Linux page cache (RAM caching) and filesystem code paths.  This allows read caching in addition to buffering writes for improved performance.  It can also introduce some funny latency behavior: data read from RAM is very fast, but your initiators can hit 100+ ms at times.  Marcin Czupryniak mentioned the following a few months back:
>
> "I'm personally against file_io (even if in some situations it can boost performance) for several reasons (VFS translation and page caching latencies) which do not occur with block_io or pass-through SCSI."
>
> 2) BlockIO uses a block device as the backend.  Reads and writes go directly to the device, bypassing the standard RAM caches.  You tend to get worse throughput than with FileIO, but your latencies are more consistent (though sometimes slower) due to the lack of lockups and 'VFS translation'.
>
> Does this sound right?  I could easily be wrong and would love to understand more.

I think that mostly sounds right from what I understand regarding those two SCST device handlers. I'm not sure I agree that you tend to get worse throughput with blockio; I think you really need to test both device handlers with your real system and back-end storage. Use fio or IOMeter on the initiator and try to simulate your typical workload for the storage (e.g., if you have lots of random I/O, or lots of sequential reads/writes, try that out with those tools).

Obviously a simulation using fio or IOMeter is not the same as the real world, but it will at least give you some clue, or at least make you feel better that you tried it out. =)
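For example, a rough starting point from a Linux initiator might be something like this (just a sketch: /dev/sdX is a placeholder for your iSCSI- or SRP-attached LUN, the write test is destructive, and the block sizes / queue depths should be adjusted to approximate your real workload):

# Small random writes, direct I/O, moderate queue depth (VM-style workload)
fio --name=randwrite-test --filename=/dev/sdX --rw=randwrite --bs=4k --direct=1 --ioengine=libaio --iodepth=32 --runtime=60 --time_based --group_reporting

# Large sequential reads (streaming/backup-style workload)
fio --name=seqread-test --filename=/dev/sdX --rw=read --bs=1M --direct=1 --ioengine=libaio --iodepth=8 --runtime=60 --time_based --group_reporting

Run the same jobs against a fileio-backed LUN and a blockio-backed LUN and compare both the throughput and the latency percentiles that fio reports.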




> If so, here are my questions:
>
> 1) Why is FileIO latency so bad at times?  Can anyone elaborate on Marcin's VFS and page caching latencies assertion?  I don't know much about it.  From the initiator, I can get great latency most of the time, but it can climb beyond 100 ms under some workloads (synthetic tests/benchmarks).

I'm honestly not sure, but the best place to look is the scst-devel mailing list, and if you don't see any similar questions in the archives, I'd post and ask your question. Including copy/pasted results from one of the I/O tools and your detailed SCST configuration will take you a long way when asking.



> 2) Does anyone know how to use the O_DIRECT flag so that file-backed devices bypass the page cache?

There may be more on this in the SCST documentation (the README in svn) or on the scst-devel mailing list. Remember, ESOS uses vanilla SCST, so any documentation for SCST applies to ESOS (since it's just running/using SCST).
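As a quick illustration of what bypassing the page cache means (this is generic Linux behavior, not an SCST switch -- whether and how vdisk_fileio exposes an O_DIRECT option is what the README/scst-devel would tell you), you can compare buffered vs. direct writes with dd against a scratch file (do not point this at a live LUN backing file; the path below is just a placeholder):

# Buffered write: goes through the page cache and can look unrealistically fast
dd if=/dev/zero of=/mnt/test/scratch.img bs=1M count=1024

# Direct write: O_DIRECT bypasses the page cache, much closer to raw device behavior
dd if=/dev/zero of=/mnt/test/scratch.img bs=1M count=1024 oflag=direct

The first number is largely the speed of RAM until the dirty pages get flushed; the second is closer to what the disks can actually sustain.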



> 3) Additionally, can FileIO be more dangerous than BlockIO, in that the chances of data corruption increase if power is lost while the SAN is caching writes via the filesystem code paths?  Or do modern journaled filesystems handle this with little issue?

I seem to recall something about this in the SCST README as well, or scst-devel archives.

 

> 4) Lastly, is there a reason (or any suggestions) as to why BlockIO is so slow from the initiator side with regard to sequential throughput?  If I read/write to the block devices locally on the SAN, I get great sequential performance (1 GB/sec for zeros), but through my SRP or iSCSI/IPoIB initiators it drops to ~300 MB/sec sequential read/write (though my latency is more consistent than with FileIO).

When you say you're testing the devices locally from "inside" an ESOS host, what tool(s) are you using? Can you show examples? What tools/commands on the initiator side?

In my experience, the I/O scheduler used for the back-end block devices makes a huge difference with SSD back-end storage and vdisk_blockio. I use the 'noop' scheduler ('deadline' seems to give the same/similar performance numbers), but 'cfq' seems to be bad with SSDs, blockio, and random IOPS. I haven't tested with sequential-type I/O and other types of back-end storage, but you might want to check it out. I've also used the pass-through SCSI disk handler, and with that I do not need to adjust the I/O scheduler (I guess because Linux/ESOS is not actually scheduling I/O in that mode since it's passed directly through).
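If it helps, checking and changing the scheduler is just a sysfs read/write (sdX below is a placeholder for your back-end disk, and the setting does not persist across reboots unless you script it):

# Show the available schedulers; the active one is shown in brackets
cat /sys/block/sdX/queue/scheduler

# Switch the back-end disk to noop (or deadline)
echo noop > /sys/block/sdX/queue/scheduler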



> Thanks for your time.  I'm happy to provide full server/environment details if needed, but I thought I'd keep this somewhat general because I think my issues/questions are not a 'problem' with my server, just different settings.  Please also note that I get similar (slightly slower) performance with CentOS 6.3 instead of ESOS.


Check out scst-devel and the SCST README, but you can post real test results here too so we can look. =)


--Marc
 


gray...@gmail.com

Feb 7, 2018, 8:32:34 AM
to esos-users
Hi all.

We are running active-active SCST 3.3 targets with RBD-backed vdisks on an all-SSD Ceph cluster, and we are having big trouble with write latency.

A single RBD performs well: about 25 kIOPS at bs=64k with ~3 ms latency.
Initiators going through SCST can only write about 4-7 kIOPS, with latency of 50-1000 ms and random latency spikes of >2 sec.

Could you please help us solve this?



On Saturday, December 28, 2013, at 6:34:01 AM UTC+3, Marc Smith wrote:

Marc Smith

Feb 7, 2018, 9:15:02 AM
to esos-...@googlegroups.com
On Wed, Feb 7, 2018 at 8:32 AM, <gray...@gmail.com> wrote:
> Hi all.
>
> We are running active-active SCST 3.3 targets with RBD-backed vdisks on an
> all-SSD Ceph cluster, and we are having big trouble with write latency.

Using vdisk_fileio or vdisk_blockio?


>
> A single RBD performs well: about 25 kIOPS at bs=64k with ~3 ms latency.
> Initiators going through SCST can only write about 4-7 kIOPS, with latency of
> 50-1000 ms and random latency spikes of >2 sec.
>
> Could you please help us solve this?

What's raw performance like against the RBD block device from the ESOS
side? What is your SAN medium/type? What's the performance of
initiators across SAN to a vdisk_nullio device?
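If you don't already have one defined, a vdisk_nullio device for testing takes only a few lines in scst.conf -- roughly something like this (the device name and LUN number are arbitrary; a nullio LUN discards writes and returns no real data, so it measures only the target/SAN path):

HANDLER vdisk_nullio {
        DEVICE nulltest {
                dummy 1
        }
}

# ...then add it as an extra LUN under your existing TARGET, e.g.:  LUN 3 nulltest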

--Marc

gray...@gmail.com

Feb 7, 2018, 11:09:19 AM
to esos-users
Hi, Marc

Thank you for your quick answer.

We use vdisk_blockio handler.

Our SAN media type is iSCSI over 10Gb/s Ethernet.


>What's the performance of 
>initiators across SAN to a vdisk_nullio device? 

We haven't tested it yet; we will.

I'm not sure what you mean by "raw performance".  From a guest VM with round-robin VMware initiators it's about 4-7 kIOPS with bs=64k, regardless of random or sequential.
Reads are very good: 9-10 kIOPS with 5-10 ms latency.

The main problem is write latency and sudden latency jitter, even at only 1-2 kIOPS.
Insufficient IOPS is not the problem; high latency and jitter are.

On Wednesday, February 7, 2018, at 5:15:02 PM UTC+3, Marc Smith wrote:

Marc Smith

Feb 7, 2018, 1:24:00 PM
to esos-...@googlegroups.com
On Wed, Feb 7, 2018 at 11:09 AM, <gray...@gmail.com> wrote:
> Hi, Marc
>
> Thank you for your quick answer.
>
> We use vdisk_blockio handler.
>
> Our SAN media type is iSCSI over 10Gb/s Ethernet.
>
>
>>What's the performance of
>>initiators across SAN to a vdisk_nullio device?
>
> We haven't tested it yet; we will.
>
> I'm not sure what you mean by "raw performance".  From a guest VM with
> round-robin VMware initiators it's about 4-7 kIOPS with bs=64k, regardless of
> random or sequential.
> Reads are very good: 9-10 kIOPS with 5-10 ms latency.

Any chance some of the write latency is introduced by the hypervisor
layer? ESXi is definitely not known for having a blazing-fast I/O
stack.


>
> The main problem is write latency and sudden latency jitter, even at only
> 1-2 kIOPS.
> Insufficient IOPS is not the problem; high latency and jitter are.

I'd expect pretty high write latency when using a distributed system
like Ceph, but I'm not sure what's reasonable.

I'd test performance against the RBD block device (as seen by ESOS) with
the 'fio' tool (on the shell). If you have low performance there, then
perhaps you can tune your Ceph setup. If performance is good, I'd
then test performance with 'fio' from remote initiators (use Linux)
against a vdisk_nullio LUN. If performance is bad there, then tweak the
iSCSI configuration (e.g., target, switch, initiator).

If performance is good on both sides, then perhaps some type of
alignment / block size adjustment may benefit performance.
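As a sketch, the two tests could look like the following (both device paths are placeholders -- /dev/rbdN is whichever RBD backs the LUN on the ESOS host, /dev/sdX is the iSCSI-attached nullio LUN on the initiator -- and the local test is destructive to data on the RBD):

# On the ESOS host: 64k random writes straight at the RBD device
fio --name=rbd-local --filename=/dev/rbdN --rw=randwrite --bs=64k --direct=1 --ioengine=libaio --iodepth=32 --runtime=60 --time_based --group_reporting

# On a Linux initiator: the same pattern against a vdisk_nullio LUN, which exercises only the SAN path
fio --name=nullio-remote --filename=/dev/sdX --rw=randwrite --bs=64k --direct=1 --ioengine=libaio --iodepth=32 --runtime=60 --time_based --group_reporting

Comparing the latency percentiles from the two runs should show which side of SCST the extra write latency is coming from.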

--Marc

gray...@gmail.com

Feb 8, 2018, 4:44:19 AM
to esos-users
Hi, Marc.


Please review our configuration and data path:
5 ESXi hosts -> 2x 10Gb iSCSI HW adapters -> 4 active-active paths -> 2 SCST targets -> 2 LUNs (SSD and SAS) -> 2 rbd maps -> round-robin LB VMFS 5.61 (~10 TB) -> 4 Ceph BlueStore storage nodes with 2x LSI 2008 HBAs and Toshiba SSDs

When latency is high we can see the write load:
#ceph -s
84 pgs
    objects: 1894k objects, 4204 GB
    usage:   7541 GB used, 80872 GB / 88413 GB avail
    pgs:     3584 active+clean

  io:
    client:   15938 kB/s rd, 72376 kB/s wr, 461 op/s rd, 7681 op/s wr
    cache:    4088 kB/s flush, 19420 kB/s evict, 5 op/s promote

BS ≈ 72376 kB/s / (7681 op/s ÷ 2) ≈ 19 KB

At the same time, we don't see any serious SSD OSD load:
#ceph osd tree
 -3        5.23947     host storage01
  2   ssd  1.74649         osd.2               up  1.00000 1.00000
  5   ssd  1.74649         osd.5               up  1.00000 1.00000
 10   ssd  1.74649         osd.10              up  1.00000 1.00000
 -7        5.23947     host storage02
 12   ssd  1.74649         osd.12              up  1.00000 1.00000
 18   ssd  1.74649         osd.18              up  1.00000 1.00000
 19   ssd  1.74649         osd.19              up  1.00000 1.00000
-10        5.23947     host storage03
 24   ssd  1.74649         osd.24              up  1.00000 1.00000
 29   ssd  1.74649         osd.29              up  1.00000 1.00000
 34   ssd  1.74649         osd.34              up  1.00000 1.00000
-13        5.23947     host storage04
 38   ssd  1.74649         osd.38              up  1.00000 1.00000
 41   ssd  1.74649         osd.41              up  1.00000 1.00000
 46   ssd  1.74649         osd.46              up  1.00000 1.00000

#ceph osd perf 
osd commit_latency(ms) apply_latency(ms)
 46                  1                 1
 19                  3                 3
 18                  3                 3
 12                  2                 2
 10                  2                 2
  2                  1                 1
  5                 32                32
 24                  1                 1
 29                  2                 2

A single VM on a host can do 5-7 kIOPS of random writes with bs=64K and latency < 5 ms.
More than two VMs share a path to the datastore (limited to ~7 kIOPS), but we use round-robin path load balancing.

So, the problem is somewhere in here:
4 active-active paths -> 2 SCST targets -> 2 LUNs (SSD and SAS) -> 2 rbd maps

How many threads does SCST use when working with an rbd-mapped disk?

SCST config:

# Automatically generated by SCST Configurator v3.3.0-pre1.


HANDLER vdisk_blockio {
        DEVICE sas {
                filename /dev/rbd1
                prod_id SCST
        }

        DEVICE sata {
                filename /dev/rbd2
                prod_id SCST
                threads_num 8
        }

        DEVICE ssd {
                filename /dev/rbd0
                prod_id SCST
        }
}

HANDLER vdisk_nullio {
        DEVICE dummy {
                dummy 1
        }
}

TARGET_DRIVER copy_manager {
        TARGET copy_manager_tgt {
                LUN 0 sas
                LUN 1 sata
                LUN 2 ssd
                LUN 3 dummy
        }
}

TARGET_DRIVER iscsi {
        enabled 1

        TARGET iqn.2006-10.net:ssd {
                enabled 1
                rel_tgt_id 1

                LUN 0 dummy
                LUN 1 ssd
                LUN 2 sas
                LUN 3 sata
        }
}

We will test your recommendations soon.

Any ideas on how we can tweak SCST?


On Wednesday, February 7, 2018, at 9:24:00 PM UTC+3, Marc Smith wrote:

Marc Smith

Feb 8, 2018, 9:27:29 AM
to esos-...@googlegroups.com
The default is the number of logical CPUs in the system, which should be
sufficient. I haven't ever gained much (if anything) when tuning this
parameter.

I'm not sure there is anything in SCST to tweak -- it may well be that
your /dev/rbdN devices (i.e., Ceph) are the bottleneck. Using the fio
tool to test performance on the /dev/rbdN devices would be destructive
to your data, but if you're really curious, I would recommend testing
performance like this on those devices first:

Random write IOPS performance: fio --bs=4k --direct=1 --rw=randwrite --ioengine=libaio --iodepth=64 --name=/dev/fioa (substitute your /dev/rbdN device for /dev/fioa)

You could also try adding additional iSCSI targets (e.g., multiple per
physical link) and use the allowed_portals attribute to control which
NIC they are tied to.
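A rough sketch of what that could look like in scst.conf is below -- the IQNs, portal IPs, and rel_tgt_id values are made up, and the exact attribute name/spelling (allowed_portal vs. allowed_portals) should be checked against the SCST README:

TARGET_DRIVER iscsi {
        enabled 1

        # Target pinned to the first 10Gb interface (attribute name assumed)
        TARGET iqn.2006-10.net:ssd-a {
                enabled 1
                rel_tgt_id 10
                allowed_portal 10.0.0.1
                LUN 0 ssd
        }

        # Second target pinned to the other 10Gb interface
        TARGET iqn.2006-10.net:ssd-b {
                enabled 1
                rel_tgt_id 11
                allowed_portal 10.0.1.1
                LUN 0 ssd
        }
}

With two (or more) targets each restricted to its own portal, the initiators see separate paths to balance across, and SSD vs. SAS traffic could be isolated onto different targets the same way.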

gray...@gmail.com

Feb 8, 2018, 11:03:23 AM
to esos-users
Marc, as I wrote before, against one RBD in Linux we see ~22-25 kIOPS raw performance, so it isn't the bottleneck.
OK, we will try to use more paths and targets, and try to isolate SSD and SAS traffic onto different paths and targets -- thanks!
 

On Thursday, February 8, 2018, at 5:27:29 PM UTC+3, Marc Smith wrote:

gray...@gmail.com

Feb 11, 2018, 4:18:24 PM
to esos-users
Guys, I'm confused.

Marc, the results of the fio RBD tests are very interesting...

It's a lightly loaded cluster; look at this:

root@storage04:~# fio --name iops --rw randwrite --bs 4k --filename /dev/rbd2 --numjobs 12 --ioengine=libaio --group_reporting --direct=1
iops: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=1
...
fio-2.2.10
Starting 12 processes
^Cbs: 12 (f=12): [w(12)] [0.5% done] [0KB/14304KB/0KB /s] [0/3576/0 iops] [eta 03h:42m:23s]]
fio: terminating on signal 2

iops: (groupid=0, jobs=12): err= 0: pid=31552: Mon Feb 12 00:02:39 2018
  write: io=617132KB, bw=9554.8KB/s, iops=2388, runt= 64589msec
    slat (usec): min=3, max=463, avg=11.56, stdev=14.17
    clat (usec): min=930, max=999957, avg=5004.58, stdev=30299.43
     lat (usec): min=938, max=999975, avg=5016.50, stdev=30299.58


root@storage04:~# fio --name iops --rw randwrite --bs 4k --filename /dev/rbd2 --numjobs 12 --iodepth=32 --ioengine=libaio --group_reporting --direct=1 --runtime=100
iops: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=32
...
fio-2.2.10
Starting 12 processes
Jobs: 12 (f=12): [w(12)] [100.0% done] [0KB/49557KB/0KB /s] [0/12.4K/0 iops] [eta 00m:00s]
iops: (groupid=0, jobs=12): err= 0: pid=32632: Mon Feb 12 00:04:51 2018
  write: io=3093.6MB, bw=31675KB/s, iops=7918, runt=100009msec
    slat (usec): min=2, max=973942, avg=1510.77, stdev=18783.09
    clat (msec): min=1, max=1719, avg=46.97, stdev=111.64
     lat (msec): min=1, max=1719, avg=48.49, stdev=113.62
    clat percentiles (msec):
     |  1.00th=[    5],  5.00th=[    8], 10.00th=[   10], 20.00th=[   13],
     | 30.00th=[   16], 40.00th=[   18], 50.00th=[   21], 60.00th=[   25],
     | 70.00th=[   31], 80.00th=[   43], 90.00th=[   68], 95.00th=[  131],
     | 99.00th=[  742], 99.50th=[  824], 99.90th=[  963], 99.95th=[  988],
     | 99.99th=[ 1045]
    bw (KB  /s): min=    6, max= 8320, per=9.51%, avg=3011.84, stdev=1956.64
    lat (msec) : 2=0.07%, 4=0.55%, 10=10.28%, 20=37.47%, 50=35.89%
    lat (msec) : 100=9.47%, 250=3.45%, 500=0.61%, 750=1.26%, 1000=0.92%
    lat (msec) : 2000=0.03%
  cpu          : usr=0.28%, sys=0.58%, ctx=265641, majf=0, minf=146
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=791939/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=32


But I can't see high latency on the SSDs, only 60-85% utilization.

OK, we will increase the SSD OSD op_threads, but how can we increase the SCST iodepth to the rbd back-end?


On Thursday, February 8, 2018, at 7:03:23 PM UTC+3, gray...@gmail.com wrote:

Marc Smith

Feb 11, 2018, 11:59:30 PM
to esos-...@googlegroups.com
These write performance numbers for your Ceph cluster seem dismal... is
that right? It's late and I could be reading that wrong. =)


> But I can't see high latency on the SSDs, only 60-85% utilization.

You can't see high latency where? Are you looking at the stats
directly on the other Linux machines that host your SSD drives and
the Ceph software? If so, then you'd only be looking at the performance /
utilization of the SSDs locally, and not including all of the Ceph
distributed-ness (latency) across the network that you do see when
accessing via the rbdN block device.


>
> OK, we will increase the SSD OSD op_threads, but how can we increase the
> SCST iodepth to the rbd back-end?

I don't understand this request -- it doesn't seem like you're getting good
performance directly to the /dev/rbd2 device, so you certainly
wouldn't get any better performance across the SAN (you'd almost
certainly get less performance than raw, because nothing is free, and
each layer is likely going to add its own overhead).

If you can get X performance out of a device when testing locally on
your ESOS machine using the fio tool, at the very best, I'd say you'd
only get 90% of that performance (X * 0.9) when adding access across
the SAN layer (via SCST or whatever else).

--Marc