what is the iops limit of ESOS


newchan...@gmail.com

unread,
Nov 21, 2014, 1:23:45 PM11/21/14
to esos-...@googlegroups.com
I think the sequential read/write performance should depend solely on the hardware interface, such as the kind of FC and the number of FC ports.

What would be the IOPS limit of ESOS? Has anyone tested this, given enough SSD, enough CPU cores?
Wondering whether the SCST software will be the performance bottleneck?


Steve Jones

unread,
Nov 21, 2014, 1:27:14 PM11/21/14
to esos-...@googlegroups.com
If the ESOS software is the bottleneck in any particular installation, wouldn't it mean that upgrading the motherboard/processor it is running on would again raise the bar? In other words, I'd expect that you really couldn't determine a maximum speed for ESOS as a software package, given that there are a million differences in the hardware it could be running on.


Marc Smith

unread,
Nov 21, 2014, 2:07:26 PM11/21/14
to esos-...@googlegroups.com
Hi,

As you pointed out, the performance of an ESOS system is very much dependent on your configuration: everything from the speed of your back-end storage, to the performance of the ESOS server(s) themselves (CPU -- the SCST threads that process IO run on this system), to the front-end target ports (what's connected to the SAN -- FC, FCoE, InfiniBand, or iSCSI).

And you definitely don't get anything for free: when you introduce other layers into the setup (those that are required: SCST and the front-end SAN target interfaces) you're going to get some overhead. I typically test my back-end storage using the built-in 'fio' tool so I can see what kind of performance I get without the SCST/SAN layers, and then do the same test from a remote initiator (traversing the SAN and SCST layers). I typically see a 5-10% performance loss with the configurations I've done in the past (Fibre Channel and vdisk_blockio mode). Perhaps with other SAN technologies (like InfiniBand) you'd see less overhead -- not sure, you really have to test and see for your particular setup.
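Marc's baseline-then-retest comparison boils down to a simple percentage; a minimal sketch (the 300K/280K figures below are hypothetical placeholders, not measurements from this thread):

```python
def san_overhead_pct(local_iops: float, remote_iops: float) -> float:
    """Percent of IOPS lost when the same workload traverses the SCST/SAN layers."""
    return (local_iops - remote_iops) * 100.0 / local_iops

# Hypothetical example: 300K IOPS with fio locally on ESOS,
# 280K from a remote initiator across the SAN.
print(round(san_overhead_pct(300_000, 280_000), 1))  # -> 6.7
```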

Again, I think it's unrealistic to expect no performance loss. Test/tweak/tune and get a performance number you're happy with -- there is no sense in having a 1-million-IOPS solution when you're only ever going to use 50K IOPS.


--Marc




newchan...@gmail.com

unread,
Nov 21, 2014, 7:52:10 PM11/21/14
to esos-...@googlegroups.com
Target is one million 4K random read IOPS and 0.5 million 4K random write IOPS.

A "5-10% performance loss" is not bad.
I do notice that SCST supports multithreading (although I don't know the maximum thread count).
Also, as Steve said, more CPU power will benefit the software.

I will measure the performance at each layer: single SSD => RAID => FC/iSCSI.

Thanks, guys.


On Friday, November 21, 2014 at 2:07:26 PM UTC-5, Marc Smith wrote:

Marc Smith

unread,
Nov 21, 2014, 8:53:01 PM11/21/14
to esos-...@googlegroups.com
And perhaps the solution is to scale out: if you're not able to obtain the performance you need from a single ESOS instance, use multiple ESOS-based arrays and spread the load across all of them.

I'd be interested to hear your results.


--Marc

newchan...@gmail.com

unread,
Nov 23, 2014, 1:15:50 AM11/23/14
to esos-...@googlegroups.com
Hmm, interesting.

But how would I use multiple ESOS-based arrays?
The system will have only one ESXi host.
Two ESOS instances basically mean two SAN targets.
Say the interface is FC -- are you saying there is an "FC bridge" that can bridge one input channel to multiple output channels, like a RAID controller? I will dig more.


On Friday, November 21, 2014 at 5:53:01 PM UTC-8, Marc Smith wrote:

Marc Smith

unread,
Nov 23, 2014, 1:22:15 PM11/23/14
to esos-...@googlegroups.com
Like they did in this article:
http://www.vmware.com/files/pdf/1M-iops-perf-vsphere5.pdf

I believe they used eight unique arrays in this setup, with the load spread across (60) LUs (volumes). Even after getting your ESOS target side set up to handle ~1 million IOPS, it's still a pretty intense configuration on the initiator (ESXi) side to generate that kind of load.

Again, very interested in how this goes for you, and your results/data. Let me know if I can offer any assistance.


--Marc

newchan...@gmail.com

unread,
Nov 25, 2014, 1:37:08 PM11/25/14
to esos-...@googlegroups.com
Hi Marc,

Currently I am trying four SSDs in my environment.
With fio, I get around 300K write IOPS. But when I run the AS SSD test over FC, it shows only 25K write IOPS.

Although my temporary FC link is 4 Gb, I don't think the speed drop is related to FC bandwidth, because I see almost double that speed for read IOPS.
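The bandwidth reasoning can be checked with quick arithmetic: 25K 4K IOPS is only about 100 MB/s, well under the roughly 425 MB/s of payload a 4 Gb FC link can carry after 8b/10b encoding, so raw link bandwidth is indeed an unlikely culprit. A sketch of the math:

```python
# 25K write IOPS at 4 KiB per IO, versus the payload capacity of a 4 Gb FC link.
block_bytes = 4096
write_iops = 25_000
throughput_mb_s = write_iops * block_bytes / 1e6       # achieved MB/s

# 4 Gb FC signals at 4.25 Gbaud with 8b/10b encoding (8 data bits per 10 line bits).
fc4_payload_mb_s = 4.25e9 * (8 / 10) / 8 / 1e6         # usable MB/s, one direction

print(round(throughput_mb_s), round(fc4_payload_mb_s))  # -> 102 425
```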

I think the problem could be the thread number configuration.
So I edited /etc/scst.conf, changing threads from 4 to 16 and threads_num from 1 to 4.
But each time I reboot the system, it is overwritten back to threads=4 and threads_num=1.
I think this setting is written by an ESOS script during boot, with commands like:
scstadmin -set_dev_attr device -attributes threads_pool_type=per_initiator,threads_num=4
scstadmin -nonkey -write_config /etc/scst.conf

So I used the commands above to set the SCST configuration manually.
After doing this, I tried to restart the SCST service with /etc/init.d/scst restart. Unfortunately, /etc/init.d/scst could not be found.

My questions:
1. How can I restart SCST without rebooting?
2. Is there any way to set the configuration so that it will be preserved across a power cycle -- anything like /etc/esos.conf?

Thanks.



On Sunday, November 23, 2014 at 10:22:15 AM UTC-8, Marc Smith wrote:

Marc Smith

unread,
Nov 25, 2014, 1:57:02 PM11/25/14
to esos-...@googlegroups.com
On Tue, Nov 25, 2014 at 1:37 PM, <newchan...@gmail.com> wrote:
> Hi Marc,
>
> Currently I am trying four SSDs in my environment.
> With fio, I get around 300K write IOPS. But when I run the AS SSD test over FC, it shows only 25K write IOPS.

What device handler are you using? What scheduler are you using for the back-end SCSI block device (in ESOS)?

 

> Although my temporary FC link is 4 Gb, I don't think the speed drop is related to FC bandwidth, because I see almost double that speed for read IOPS.

> I think the problem could be the thread number configuration.
> So I edited /etc/scst.conf, changing threads from 4 to 16 and threads_num from 1 to 4.
> But each time I reboot the system, it is overwritten back to threads=4 and threads_num=1.
> I think this setting is written by an ESOS script during boot, with commands like:
> scstadmin -set_dev_attr device -attributes threads_pool_type=per_initiator,threads_num=4
> scstadmin -nonkey -write_config /etc/scst.conf
>
> So I used the commands above to set the SCST configuration manually.
> After doing this, I tried to restart the SCST service with /etc/init.d/scst restart. Unfortunately, /etc/init.d/scst could not be found.

That setting is not modified by anything ESOS-specific. You probably don't need to modify the threads setting... see the SCST README here: http://sourceforge.net/p/scst/svn/HEAD/tree/trunk/scst/README



> My questions:
> 1. How can I restart SCST without rebooting?

In the shell: /etc/rc.d/rc.scst stop && /etc/rc.d/rc.scst start

 
> 2. Is there any way to set the configuration so that it will be preserved across a power cycle -- anything like /etc/esos.conf?

You should check that your configuration files are syncing properly. After making changes in the shell, run 'conf_sync.sh' or use the TUI (System->Sync Configuration). Check that your system time is accurate -- the script that syncs the files relies on file time stamps to decide what is newest.

After syncing, you can 'mount /mnt/conf' and check the files in /mnt/conf/etc/ (like scst.conf) to make sure everything is accurate and sync is working properly. Unmount that FS when done. Also, if you shut down/reboot properly ("poweroff" or "reboot"), the configuration files will be synced automatically.


--Marc

newchan...@gmail.com

unread,
Nov 25, 2014, 2:53:43 PM11/25/14
to esos-...@googlegroups.com
My settings:
Device handler: vdisk_blockio
threads_pool_type: per_initiator
write_through: 0
scheduler: I did not touch this; I'm using the default setting, which should be CFQ?

I have changed threads_num to 4 for the HANDLER; no IOPS improvement is observed, just as you pointed out.

Any idea what other possibilities I have missed?

Thanks.

On Tuesday, November 25, 2014 at 10:57:02 AM UTC-8, Marc Smith wrote:

Marc Smith

unread,
Nov 25, 2014, 2:55:25 PM11/25/14
to esos-...@googlegroups.com
Try changing the scheduler to 'noop'.

newchan...@gmail.com

unread,
Nov 25, 2014, 3:38:00 PM11/25/14
to esos-...@googlegroups.com
echo noop > /sys/block/sdb/queue/scheduler

Looks like there is a slight improvement, about 1%.


On Tuesday, November 25, 2014 at 11:55:25 AM UTC-8, Marc Smith wrote:

Marc Smith

unread,
Nov 25, 2014, 4:24:24 PM11/25/14
to esos-...@googlegroups.com
So, just a couple quick tests I did on FC...

Target: ESOS trunk_r698; (2) QLogic 8 Gb FC HBAs

SAN: (2) QLogic 8 Gb FC switches

Initiator: Old Dell PE 2950 with (8) cores; (2) QLogic 4 Gb FC HBAs; (1) Emulex 8 Gb FC HBA


Testing on the ESOS side with vdisk_nullio so we can test just the SAN, initiator, and target without worrying about any back-end storage... reads/writes are discarded. So this should show us the maximum of what those three pieces can do (target, SAN, initiator).

I see no difference between reads/writes in this testing, which makes sense, since no real IO is being done on the back-end.
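For reference, a nullio device needs only a minimal stanza in scst.conf -- this is a sketch, assuming the device name nullio_test_1 (the same name that shows up in the LUN mappings further down):

```
HANDLER vdisk_nullio {
        DEVICE nullio_test_1 {
                # No backing storage: reads/writes are discarded,
                # so only the target/SAN/initiator path is exercised.
        }
}
```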


fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --name=task_sdd --filename=/dev/sdd
fio 1.50-rc4
task_sdd: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 1 process
^Cbs: 1 (f=1): [r] [0.1% done] [320.8M/0K /s] [80.2K/0  iops] [eta 02h:25m:36s]

80K 4K IOPS on the first QLogic 4 Gb initiator.


fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --name=task_sdc --filename=/dev/sdc
fio 1.50-rc4
task_sdc: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 1 process
^Cbs: 1 (f=1): [r] [0.1% done] [303.7M/0K /s] [75.8K/0  iops] [eta 02h:31m:49s]

75K 4K IOPS on the second QLogic 4 Gb initiator.


fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --name=task_sdb --filename=/dev/sdb
fio 1.50-rc4
task_sdb: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 1 process
^Cbs: 1 (f=1): [r] [0.1% done] [368.7M/0K /s] [92.2K/0  iops] [eta 02h:05m:40s]

92K 4K IOPS on the Emulex 8 Gb initiator.


fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --name=task_sdb --filename=/dev/sdb --name=task_sdc --filename=/dev/sdc --name=task_sdd --filename=/dev/sdd
fio 1.50-rc4
task_sdb: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
task_sdc: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
task_sdd: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 3 processes
Jobs: 3 (f=3): [rrr] [0.2% done] [739.7M/0K /s] [185K/0  iops] [eta 03h:30m:14s]

185K 4K IOPS across all (3) initiators to the (2) 8 Gb FC target interfaces (to 3 different vdisk_nullio devices).


Now let's take a look at a real volume in the test target (ESOS) system... a RAID0 volume consisting of (6) Intel 36 GB SSDs, write-through, no read-ahead. On the ESOS target side, testing with fio I see this:

fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --name=task_sdb --filename=/dev/sdb
task_sdb: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 1 process
^Cbs: 1 (f=1): [r] [1.9% done] [489.9M/0K/0K /s] [125K/0 /0  iops] [eta 07m:57s]

125K 4K IOPS 100% random, 100% read. Tested on the ESOS host itself.


fio --bs=4k --direct=1 --rw=randwrite --ioengine=libaio --iodepth=64 --name=task_sdb --filename=/dev/sdb
task_sdb: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.0.13
Starting 1 process
^Cbs: 1 (f=1): [w] [0.4% done] [0K/196.9M/0K /s] [0 /50.4K/0  iops] [eta 18m:48s]

50K 4K IOPS 100% random, 100% write. Tested on the ESOS host itself.


Now I'll use the Emulex 8 Gb FC initiator, since it clearly seems to be the better performer in the tests above. The device is mapped as a LUN on one ESOS-side target port (an 8 Gb FC QLogic HBA)...

raspberry ~ # fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --name=task_sde --filename=/dev/sde
fio 1.50-rc4
task_sde: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 1 process
^Cbs: 1 (f=1): [r] [4.2% done] [282.4M/0K /s] [70.6K/0  iops] [eta 13m:33s]

70K 4K IOPS 100% random, 100% read. Test from the Linux initiator.


fio --bs=4k --direct=1 --rw=randwrite --ioengine=libaio --iodepth=64 --name=task_sde --filename=/dev/sde
fio 1.50-rc4
task_sde: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 1 process
^Cbs: 1 (f=1): [w] [1.4% done] [0K/196.4M /s] [0 /49.1K iops] [eta 20m:13s]

49K 4K IOPS 100% random, 100% write. Tested from the Linux initiator.


Now I'll use all (3) initiators (two 4 Gb QLogic, one 8 Gb Emulex) and generate IO against the same SSD volume (as used above), which is mapped as LUNs to the Linux initiator host.


fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=64 --name=task_sde --filename=/dev/sde -name=task_sdf --filename=/dev/sdf -name=task_sdg --filename=/dev/sdg
fio 1.50-rc4
task_sde: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
task_sdf: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
task_sdg: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 3 processes
^Cbs: 3 (f=3): [rrr] [1.9% done] [540.1M/0K /s] [135K/0  iops] [eta 23m:34s]

135K 4K IOPS 100% random, 100% read. Tested from the Linux initiator.


fio --bs=4k --direct=1 --rw=randwrite --ioengine=libaio --iodepth=64 --name=task_sde --filename=/dev/sde -name=task_sdf --filename=/dev/sdf -name=task_sdg --filename=/dev/sdg
fio 1.50-rc4
task_sde: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
task_sdf: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
task_sdg: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
Starting 3 processes
^Cbs: 3 (f=3): [www] [0.2% done] [0K/207.6M /s] [0 /51.9K iops] [eta 59m:58s]    

51K 4K IOPS 100% random, 100% write. Tested from the Linux initiator.


The SCST configuration highlights, so you can see the vdisk_blockio device and target configuration:
--snip--
HANDLER vdisk_blockio {
        DEVICE bio_ssd_test_1 {
                filename /dev/disk-by-id/LUN_NAA-600062b2000312401c07a63a417e2c7d
                rotational 0

                # Non-key attributes
                blocksize 512
                nv_cache 0
                pr_file_name /var/lib/scst/pr/bio_ssd_test_1
                prod_id bio_ssd_test_1
                prod_rev_lvl " 310"
                read_only 0
                removable 0
                t10_dev_id e0158f98-bio_ssd_test_1
                t10_vend_id SCST_BIO
                thin_provisioned 0
                threads_num 1
                threads_pool_type per_initiator
                tst 1
                usn e0158f98
                vend_specific_id e0158f98-bio_ssd_test_1
                write_through 0
        }
}
        TARGET 21:00:00:24:ff:01:1c:88 {
                HW_TARGET

                enabled 1
                rel_tgt_id 1

                # Non-key attributes
                addr_method PERIPHERAL
                black_hole 0
                cpu_mask ffffffff,ffffffff
                explicit_confirmation 0
                io_grouping_type auto
                node_name 20:00:00:24:ff:01:1c:88

                GROUP raspberry {
                        LUN 0 nullio_test_1 {
                                # Non-key attributes
                                read_only 0
                        }
                        LUN 1 bio_ssd_test_1 {
                                # Non-key attributes
                                read_only 0
                        }

                        INITIATOR 21:00:00:1b:32:87:cf:00

                        # Non-key attributes
                        addr_method PERIPHERAL
                        black_hole 0
                        cpu_mask ffffffff,ffffffff
                        io_grouping_type auto
                }

                GROUP raspberry_emulex {
                        LUN 0 nullio_test_2 {
                                # Non-key attributes
                                read_only 0
                        }
                        LUN 1 bio_ssd_test_1 {
                                # Non-key attributes
                                read_only 0
                        }

                        INITIATOR 10:00:00:00:c9:99:03:c3

                        # Non-key attributes
                        addr_method PERIPHERAL
                        black_hole 0
                        cpu_mask ffffffff,ffffffff
                        io_grouping_type auto
                }
        }
--snip--


The ESOS host is a Supermicro with (24) Intel cores. To sum up the results, we definitely see the performance is FC-interface bound. Comparing local write vs. remote (FC initiator via SCST) write, we see almost no performance loss. The local vs. remote read test was definitely limited by the FC interface. Not sure why we didn't see the same 92K IOPS number we got with the vdisk_nullio test (instead we got 70K); perhaps the 92K was a fluke. None of these values are averages over time -- they are the numbers sustained on screen... not an in-depth test.

When we kick it up and run all (3) initiators against both FC target interfaces, we clearly see there is no performance loss (or next to none). We actually got better numbers on the remote read test than locally... the VD may have still been initializing on the MegaRAID controller when I did the first local test.


My advice: leave all of the defaults (threads_num, write_through, etc.) as they are and test. Get your baseline performance numbers, then tweak and test again. Record the numbers and see what happens.


Hope this helps.


--Marc

Marc Smith

unread,
Nov 25, 2014, 4:47:47 PM11/25/14
to esos-...@googlegroups.com
And just to follow up on the difference (the increase in performance) in the local vs. remote read test: I re-tested the SSD volume locally (in ESOS) and got the same number I was getting remotely:

fio --bs=4k --direct=1 --rw=randread --ioengine=libaio --iodepth=128 --name=task_sdb --filename=/dev/sdb
task_sdb: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=128
fio-2.0.13
Starting 1 process
^Cbs: 1 (f=1): [r] [13.3% done] [525.6M/0K/0K /s] [135K/0 /0  iops] [eta 06m:11s]


--Marc

newchan...@gmail.com

unread,
Nov 27, 2014, 12:42:50 AM11/27/14
to esos-...@googlegroups.com
For simplicity, this time I tested 2 SSDs:

1. Create a device using vdisk_nullio => test just the SAN, initiator, and target without worrying about back-end storage.
   Map it to both target ports, FC1 and FC2.
   (1-a) 4K read IOPS @FC1: randread 54K
   (1-b) 4K read IOPS @FC2: randread 51K
   (1-c) 4K read IOPS @(FC1+FC2): randread 105K

   (2-a) 4K write IOPS @FC1: randwrite 56K
   (2-b) 4K write IOPS @FC2: randwrite 55K
   (2-c) 4K write IOPS @(FC1+FC2): randwrite 100K

2. Create a device using vdisk_blockio => include the back-end storage.
   (0-a) 4K read IOPS, local fio: randread 140K
   (0-b) 4K write IOPS, local fio: randwrite 57K

   (1-a) 4K read IOPS @FC1: randread 52K
   (1-b) 4K read IOPS @FC2: randread 50K
   (1-c) 4K read IOPS @(FC1+FC2): randread 90K

   (2-a) 4K write IOPS @FC1: randwrite 39K
   (2-b) 4K write IOPS @FC2: randwrite 39K
   (2-c) 4K write IOPS @(FC1+FC2): randwrite 50K

Now it seems a bit more normal.
Based on #1, one FC-4G port gives me only around 50K IOPS. (I think I have bought a low-quality FC cable.)
I guess I need to get a better FC cable first.
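Before swapping cables, a Little's-law sanity check may be worth doing: concurrency = IOPS x latency, so with iodepth=64 and ~54K IOPS each IO is taking roughly 1.2 ms, which points more at per-command latency (target CPU, interrupts, per-port queuing) than at the cable itself. Illustrative arithmetic only:

```python
# Little's law: outstanding IOs = IOPS * average latency
# => average latency = iodepth / IOPS
iodepth = 64
iops = 54_000                      # observed nullio randread on one FC-4G port
avg_latency_ms = iodepth / iops * 1000
print(round(avg_latency_ms, 2))    # -> 1.19
```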

Thanks for your kind reply. That really helps!
I really appreciate it.



On Tuesday, November 25, 2014 at 1:47:47 PM UTC-8, Marc Smith wrote: