ESOS Project Status & Updates (March 2013)


Marc Smith

Mar 10, 2013, 11:49:53 PM
to esos-...@googlegroups.com
Hi,

I wanted to give a quick update on the ESOS project, and an overview
of the features/enhancements that have been added during the last six
months.

New features:
- InfiniBand subnet manager now included (OpenSM)
- Added a TUI create-support-package option for creating a config file /
log file archive
- Added new "LUN layout" dialog to see an overview of devices/targets/LUNs
- If a failed or missing ESOS USB flash drive is detected, an email is
sent to the configured email address, along with an archive of the
configuration files for that host
- Distributed Replicated Block Device (DRBD)
- Logical Volume Manager (LVM2)
- Linux Software RAID (md)
- Pacemaker + Corosync cluster stack (configuration tool: crmsh)
- Added the mhVTL project (virtual tape library) along with the SCST
modules for tape drives/changers
- Support for in-line data de-duplication using the lessfs project
(can be used for vdisk_fileio file systems, and for VTL file system
backing)
- Fibre Channel over Ethernet (FCoE) target support
- Improved functionality for the conf_sync.sh script
- System rc/init enhancements including the ability to enable/disable
certain services
- Added the 'fio' tool -- very useful for measuring performance and, in the
case of ESOS, for troubleshooting performance issues (see the example
after this list)
- Linux kernel and SCST refresh (updated versions)
- New TUI functionality: Create/remove file systems, create/remove
virtual disk files (for vdisk_fileio devices), and a number of new
"status" dialogs for new features (DRBD, LVM2, CRM, etc.)
- ESOS is no longer required to be built from source -- we now have
binary package releases that are built and uploaded to Google Code
automatically via Buildbot
- New install script and support for additional local RAID controllers
and CLI tools (LSI MegaRAID, Adaptec AACRAID, Areca, 3ware RAID, and
HP Smart Array controllers)
- Date & Time TUI dialog
- A number of software packages that are mostly dependencies for other
packages (including real bash, and Python)
- Lots of other bug/issue fixes and software package additions
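
For anyone new to fio, a minimal invocation along the lines of what the fio
item above refers to might look like this; the vdisk path and job parameters
are placeholders, not ESOS defaults:

# 60-second 4k random read check against a scratch virtual disk file
fio --name=randread-check --filename=/mnt/vdisks/test.img --size=4G \
    --rw=randread --bs=4k --iodepth=32 --ioengine=libaio --direct=1 \
    --runtime=60 --time_based --group_reporting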

What's next:
- At this point, I would like to finish validating/testing all of the
new features and functionality mentioned above, and add any other
obvious requirements
- Finish updating the wiki documentation for all of the above
- Work on adding/implementing SCST-related resource agents
- A few other minor housekeeping tasks (valgrind on the TUI, man page
updates, etc.)

Once the above is finished, I'd like to call it done and copy trunk into a
new stable branch that will only receive bug fixes; new features will
continue to be added to trunk. I'm hoping this will be possible in the next
1-2 months.

Please let me know if there are any glaring features/functionality
that need to be added, or additional requests that should be
considered.


--Marc

Valentin Atanassov

Mar 11, 2013, 3:43:33 PM
to esos-...@googlegroups.com
Amazing work so far. All of the basic functionality is there, including FCoE. FCoE in particular works 100%, although it needs manual configuration. A few points: if the TUI were upgraded to include 'edit' and 'view' commands on most menus, plus menus for RAID, LVM, and HA (Corosync, Pacemaker, etc.), it would be perfect. A web GUI would be an added bonus for beginners. All that said, ESOS is perfectly usable as it is right now!

Todd Hunter

Mar 11, 2013, 5:17:10 PM
to esos-...@googlegroups.com

Agreed, great job!

I have been running an earlier version for 4-5 months now as an iSCSI SAN for several Hyper-V VMs.  I have not had a single problem since getting the config down. The box has been up for 3 months without a reboot, everything is stable, and the VMs are humming along fine.

I need to download the new version; I want to try it out over InfiniBand.

Valentin Atanassov

Mar 11, 2013, 5:24:54 PM
to esos-...@googlegroups.com
Today I had some time to play with a Thecus N16000. I replaced its OS with ESOS 411 and installed a QLogic QLE220. Now I have a perfectly working FC target on a Thecus NAS.

Todd Hunter

Mar 16, 2013, 11:07:56 PM
to esos-...@googlegroups.com

Marc,

Loaded up r412 today.  The load and install went fine and I was up and running in no time.  You have been busy -- a lot of new features that I am looking forward to trying out.  It's really coming along.

I am finally getting around to testing InfiniBand with ESOS.  Is there IPoIB support in ESOS?

I'm using Windows Server 2012 and it doesn't look like the drivers include SRP. 

Todd

Marc Smith

Mar 19, 2013, 1:34:10 PM
to esos-...@googlegroups.com
Hi Todd,

I am not very familiar with the InfiniBand stuff, but I believe SCST
would need to support this, and I'm pretty sure it doesn't.


--Marc

Todd Hunter

Mar 20, 2013, 1:43:01 PM
to esos-...@googlegroups.com

I understand.  That said, I know IPoIB is part of most Linux distros.  Would it be possible to add it?  IPoIB offers IP communication that can exceed 10Gb Ethernet.

Todd



Marc Smith

Mar 20, 2013, 2:32:47 PM
to esos-...@googlegroups.com
I'll need to look into it more. Would you then use iSCSI over IPoIB?

Todd Hunter

Mar 20, 2013, 3:15:34 PM
to esos-...@googlegroups.com
Yes,  iSCSI over IPoIB
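
For context, SCST's iSCSI target does not care whether the target IP lives on
an Ethernet NIC or an IPoIB interface, so exporting a LUN over IPoIB should
look the same as over plain Ethernet. A rough sketch using the SCST 2.x
config-file style follows; the device name, file path, and IQN are made up,
and scstadmin option names may vary between SCST releases:

# Write a minimal SCST config and apply it
cat > /etc/scst.conf <<'EOF'
HANDLER vdisk_fileio {
    DEVICE disk01 {
        filename /mnt/vdisks/disk01.img
    }
}
TARGET_DRIVER iscsi {
    enabled 1
    TARGET iqn.2013-03.esos.example:disk01 {
        enabled 1
        LUN 0 disk01
    }
}
EOF
scstadmin -config /etc/scst.conf

The resulting target would then simply be reachable at whatever IP address is
configured on the IPoIB interface.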

Jon Busey

Mar 21, 2013, 6:36:43 AM
to esos-...@googlegroups.com
To be clear, SCST and InfiniBand (henceforth IB) work very well together.  A number of enterprise storage vendors are really just Linux controllers running the IB stack underneath with LVM, sometimes MD, and almost always SCST.  The IB is used to communicate between highly available nodes and as the target ports (i.e. the front-facing side your clients use SRP to hit).  This use of IB for targets is the modern alternative to FC.

So what's really needed to make a storage OS IB-capable is OFED.  There are vendor-specific distros (Mellanox has its own OFED) and of course the community one you get with most distros.  I have not attempted to compile this against uClibc but would like to try.  See https://www.openfabrics.org -- OpenSM is part of this.

The problem with iSCSI over IPoIB is that you do not take advantage of RDMA.  IPoIB is typically used to negotiate a channel, at which point RDMA ensues; see, for example, Lustre.  Note also that throughput on modern IB (FDR @ 56Gbps) *destroys* anything iSCSI.  RDMA has no TCP, i.e. no retransmits, congestion control, etc. to "slow it down" (i.e. ensure communication stability).  The MTU is 64k and it's almost all data (you get 54-ish Gbps effective data rate on 56Gbps channel speeds).  That's several GB/s per port, and it actually works in practice.  FDR IB is 40Gbps in Ethernet mode, so even running your newer IB cards in eth mode gets you great speeds, but they pale in comparison to sticking with native IB and using SRP for initiators.  Note that QDR, while 40Gbps RDMA, is only 10Gbps in eth mode.
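
As an aside, the standard OFED diagnostics are a quick way to confirm what the
fabric is actually delivering before layering targets on it; these are common
infiniband-diags commands, and the exact output fields vary by stack version:

ibstat      # per-port state, physical state, and rate as seen by the HCA
ibstatus    # condensed view of the same (link_layer, rate)
ibhosts     # hosts visible on the fabric; needs a running subnet manager (e.g. OpenSM)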

Sorry to hijack this thread, Marc and Todd, but I think the description of ESOS is a fantastic read (I haven't *used* it yet), and it is poised to give a lot of enterprise storage vendors a run for their money.

I hope this helps,

Cheers!

Jon

Marc Smith

Mar 21, 2013, 9:38:48 AM
to esos-...@googlegroups.com
On Thu, Mar 21, 2013 at 6:36 AM, Jon Busey <jonb...@gmail.com> wrote:
> To be clear, SCST and Infiniband (henceforth IB) work very well together. A
> number of enterprise storage vendors are really simply Linux controllers
> running the IB stack underneath with LVM, sometimes MD, and almost always
> SCST. The IB is uses to communicate between highly available nodes and as
> the target ports (i.e. the front-facing side your clients use srp to hit).
> This use of IB for targers is the modern alternative to FC.

Yes, good explanation. At our institution, we have an upcoming project
where we will be using ESOS to create a two node disk array cluster.
We will be using InfiniBand Sockets Direct Protocol (SDP) as the
replication link for DRBD between the two hosts with LVM on top, and
Fibre Channel front-end target ports (via SCST of course).
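
A minimal sketch of what such a DRBD resource might look like, assuming
DRBD 8.3-style syntax and its optional address-family keyword; the hostnames,
backing devices, and IPs below are invented:

# Hypothetical resource file for the replication link
cat > /etc/drbd.d/r0.res <<'EOF'
resource r0 {
    on esos-a {
        device    /dev/drbd0;
        disk      /dev/md0;
        address   sdp 10.10.10.1:7789;   # address family "sdp"
        meta-disk internal;
    }
    on esos-b {
        device    /dev/drbd0;
        disk      /dev/md0;
        address   sdp 10.10.10.2:7789;
        meta-disk internal;
    }
}
EOF

Swapping "sdp" for "ipv4" (and pointing the addresses at the IPoIB interface
IPs) would be the fallback if SDP turns out not to be workable.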


>
> So what's really needed to make a storage OS IB capable is OFED. There are
> vendor specific distros (mellanox has an OFED) and of course the community
> one you get with most distros. I have not attempted to compile this against
> ulibc but would like to attempt. See https://www.openfabrics.org OpenSM is
> part of this.

ESOS actually already includes a number of parts from the OFED project
(including OpenSM). Additional items may need to be added to support
the use of SDP mentioned above, but I haven't got there yet.


>
> The problem with iSCSI using IPoIB is that you do not take advantage of
> RDMA. IPoIB is typically used to negotiate a channel, at which point RDMA
> ensues. See for example Lustre. Note also that throughput on modern IB
> (FDR @ 56Gbps) *destroys* anything iSCSI. RDMA has no TCP, i.e. no
> retransmit, congestion control, etc. to "slow it down" (i.e. ensure
> communication stability). The MTU is 64k and it's almost all data (you get
> 54ish Gbps effective data rate on 56Gpbs channel speeds). That's several
> GB/s per port and actually works in practice. FDR IB is 40Gbs in ethernet
> mode, so even running your newer IB cards in eth mode gets you great speeds,
> but they pale in comparison to sticking with native IB and using srp for
> initiators. Note that QDR, while 40Gbps RDMA, is only 10Gbps in eth mode.
>
> Sorry to highjack this thread, Marc and Todd, but I think the description of
> ESOS is a fantastic read (I haven't *used* it yet) and poised to give a lot
> of enterprise storage vendors a run for their money.

Not a problem; thank you for the very nice explanation. =)


--Marc

Todd Hunter

Mar 21, 2013, 1:08:24 PM
to esos-...@googlegroups.com

IPoIB is what I have been dealing with, and it seems to be a universal option for connecting InfiniBand to storage devices.  I know it lacks RDMA and is not ideal, but it is well supported in Windows 2012, along with a limited set of other protocols as outlined in the driver documentation.  If there is a better option, I am certainly open to it.

http://www.mellanox.com/related-docs/prod_software/MLNX_VPI_WinOF_Release_Notes_v4.2.pdf

There is a lot of talk about using Windows Storage Server 2012 with SMB 3.0 and InfiniBand for storage of Hyper-V clusters, but that all avoids discussing some of the basic drawbacks of Windows: security, the frequent updates, and the reboots needed to keep the OS stable.  The Core version addresses some of that, but I am not 100% sold.


Todd


Jon Busey

Mar 21, 2013, 10:03:02 PM
to esos-...@googlegroups.com
Thanks 


On Thursday, March 21, 2013 1:08:24 PM UTC-4, Todd Hunter wrote:

> IPoIB is what I have been dealing and seems to be a universal for connecting Infiniband to storage devices.  I know it lacks RDMA and is not ideal, but is well supported in Windows 2012 along with limited other protocols as outlined in the driver documentation.  If there is a better option toI am certainly open to it.

Todd--I think you will be hooked on RDMA applications once you've had a taste.  I know that IPoIB is fast when you're used to ethernet, even 10GbE (if you have FDR, that is, since QDR is only 10GbE anyway), but RDMA is the heroin of the interconnect world.  I look for every opportunity to go ethernet-free now.
 
IPoIB is more of an afterthought: it is possible, but many applications cannot use it due to the strange MAC addresses (in Linux they are 20 bytes) and the lack of a full implementation.  An IB card in eth mode (if you have a switch that supports it) is certainly handier.

> http://www.mellanox.com/related-docs/prod_software/MLNX_VPI_WinOF_Release_Notes_v4.2.pdf

> There is a lot of talk about using Windows Storage Server 2012 with SMB 3.0 and Infiniband for storage of Hyper-V clusters but that all avoids discussion of some of the basic drawbacks of Windows, being security, the frequent updates and the reboots to keep the OS stable.  The Core version addresses some of that but I am not 100% sold.


Thanks for that link -- I had not read anything about the Windows implementation of OFED, only that 2008 wasn't worth the trouble and that 2012 is really killer in every way, right out of the box.

I just want to leave you with a final word of endorsement: IB to storage is hotter than anything I've seen in a long time.  Think back to when everything was FC and then SAS hit.  Suddenly the VxWorks or FPGA-based controllers with arbitrated loops weren't the only game in town.  You could use any Linux server and a supported card with a tray of drives to get affordable HA.  IB is hitting the storage industry in the same way.  No more Fibre Channel anything.  There are data centers now with no Ethernet and no storage fibre, just IB all throughout, from the PXE booting to the RDMA NFS server to the 20GB/s Lustre filesystems... sorry, I get really excited about this.

If you ever get a chance to swap an iSCSI LUN over to SRP, please do a speed test on either end and *watch your CPU* as you do so.
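
One rough way to run that comparison from the initiator side, assuming a
scratch LUN you can safely read from (the device name is a placeholder):

# In one terminal: per-CPU utilization every second (mpstat is part of sysstat)
mpstat -P ALL 1

# In another: a streaming read from the LUN
dd if=/dev/sdX of=/dev/null bs=1M count=4096 iflag=direct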

And Marc:

> At our institution, we have an upcoming project 
> where we will be using ESOS to create a two node disk array cluster. 
> We will be using InfiniBand Sockets Direct Protocol (SDP) as the 
> replication link for DRBD between the two hosts with LVM on top, and 
> Fibre Channel front-end target ports (via SCST of course). 

You can later swap that fibre to IB if you choose.  Even with 16Gbps fibre, you can't get the same throughput as one FDR port.   No LIP resets, but the same multipath, lsscsi, and scsi_rescan tools you're already using.  

I like the SDP for DRBD, btw ;-)

Cheers!
Jon

Todd Hunter

Mar 22, 2013, 5:56:58 PM
to esos-...@googlegroups.com

Spoke with Mellanox and they confirmed that SRP and SDP are not currently supported in Windows 2012, so for now it's IPoIB.  I am trying to get some info out of them as to whether they will be added back into the next version of the drivers. 

Jon Busey

Mar 22, 2013, 8:18:01 PM
to esos-...@googlegroups.com
The person who told me it was ideal is also a Mellanox employee.  He might have been referring to the 2.0 beta, though; I'll have to ask.  He did say that the key changes are in SR-IOV and not IPoIB or SRP (referring to Linux, that is), and not to worry about the status of the beta.

I hope this helps,
Jon

Marc Smith

Mar 25, 2013, 10:41:31 PM
to esos-...@googlegroups.com
Hi,

So, it turns out that IB SDP is deprecated; see this post:
http://comments.gmane.org/gmane.network.openfabrics.enterprise/5371

A couple of other interesting posts specific to IPoIB, SDP, and DRBD:

From Florian Haas:
--snip--
We serve customers that use both, and in general recent distributions
support both OFED (for IB) and 10 GbE quite well. If your main pain
point is latency, you'll want to go with IB; if it's throughput,
you're essentially free to pick and choose -- although of course _not_
having to install any of the OFED libraries may be a plus for 10 GbE.
Cost of switches is usually not much of a factor in the decision, as
most people tend to wire their DRBD clusters back-to-back, but if
you're planning on a switched topology you may have to factor that in,
also.

Both IB and 10 GbE do require a fair amount of kernel and DRBD tuning
so that DRBD can actually max them out. Don't expect to be able to use
your distro's standard set of sysctls, and default DRBD config, and
then everything magically goes a million times faster.

Generally speaking, also don't expect too much of a performance boost
when using SDP (Sockets Direct Protocol) over IB. In general, we've
found that the performance effect in comparison to IPoIB is negligable
or even negative, but that's fine -- chances are you'll likely max out
your underlying storage hardware with IPoIB anyhow. :) SDP is also
currently suffering from a module refcount issue that is fixed in git
(http://git.drbd.org/gitweb.cgi?p=drbd-8.3.git;a=commit;h=c2c2067c661c7cba213b0301e2b39f17c1419e51)
but as yet unreleased, so that's a bit of an SDP show-stopper too...
but as pointed out, IPoIB does do the trick nicely.
--snip--
http://lists.linbit.com/pipermail/drbd-user/2012-April/018331.html

From a Linbit employee (Kavan Smith):
--snip--
It all depends on what you are looking to accomplish and what hardware
you are comparing.

Newer Infiniband QDR cards are rated at 40 Gbit/s which are a bit
quicker than the 10GbE alternative.

Today, IPoIB provides great performance, but would be better with native
RDMA support. 10GbE is a great solution, but you really miss out on low
latency high bandwidth capabilities that Infiniband brings to the table.
Right now, IPoIB is the best solution with DRBD if you want to exceed
current 10GbE benchmarks.

We have a tech guide that will be announced this week, but since you
asked so kindly, please check this out at your leisure:

http://www.linbit.com/en/education/tech-guides/infiniband-and-drbd-technical-guide/

Also...stay tuned... :)

In regards to how DRBD is going to support this in the future:

LINBIT is working to develop native RDMA support for Infiniband (this
will make DRBD on Infiniband much much quicker), but we still need
assistance from the community to make this feature-set possible.

HA and DRBD experts. We could always use the help! -
feedback at linbit.com if you would like to assist in developing or
sponsoring this feature for the DRBD Community. Don't just claim you're
an expert, show it! :)
--snip--
http://lists.linbit.com/pipermail/drbd-user/2012-May/018335.html

So, it sounds like IPoIB can work quite well, especially if tuned
correctly. Yes, it's not RDMA, but if your application doesn't support
RDMA, what else are you going to do? Initially there was no SRP
support for VMware ESXi 5.x, but it was added back in eventually:
http://communities.vmware.com/thread/393784?start=30&tstart=0
Perhaps SRP support in Windows Server 2012 will be added at some point
in the future.
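
To make the "tuned correctly" part a bit more concrete, these are the kinds of
knobs the posts above are referring to. This is only a sketch with 8.3-era
DRBD option names and example values, not recommendations:

# Raise the socket buffer ceilings used by the replication link
cat >> /etc/sysctl.conf <<'EOF'
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
EOF
sysctl -p

# ...and in the DRBD resource definition (8.3-style option names):
#   net    { sndbuf-size 0; max-buffers 8000; max-epoch-size 8000; }
#   syncer { rate 300M; al-extents 3389; }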


--Marc

Jon Busey

Mar 26, 2013, 4:48:52 PM
to esos-...@googlegroups.com
Not to throw another monkey wrench in there, but I see you have really focused on IPoIB rather than the other option of straight-up eth mode.  I think eth mode makes more sense for DRBD.  Since a port cannot be in both eth mode and RDMA mode (i.e. for Mellanox you load mlx4_en for eth mode...), you have to choose.  And your switch must support, and be in, eth mode.  IPoIB makes sense if you only have one port or one bundle (i.e. a bonded set) and need some of the traditional IB (i.e. RDMA) features.  If you just need a DRBD link, eth mode seems perfect.

IPoIB has some extra overhead and gives you the strange 20-byte addresses.  Eth mode makes your IB card show up as a normal Ethernet card (i.e. eth2, eth3, or whatever).  If DRBD isn't yet taking advantage of RDMA, then using a direct-connect cable with both ports on your Linux servers in eth mode is what I'd try first.  This is my armchair opinion... I'm not an expert ;-(

Note that eth mode for QDR is 10GbE, and for FDR it's 40GbE.  Both are nominally less than the respective IPoIB potentials of 40Gb/s (32Gb/s after overhead) on QDR and 56Gb/s (54-ish after overhead) on FDR.
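
For completeness: on mlx4-based ConnectX VPI cards the per-port protocol can
typically be flipped through sysfs. This is an mlx4-specific assumption, the
PCI address below is a placeholder (find yours with lspci), and vendor OFED
builds may provide their own tools for this:

# Show and change the protocol of port 1 ("ib", "eth", or "auto")
cat /sys/bus/pci/devices/0000:05:00.0/mlx4_port1
echo eth > /sys/bus/pci/devices/0000:05:00.0/mlx4_port1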

I hope this helps,

Jon