Raw disk throughput of an Isilon cluster


Youssef Ghorbal

Nov 21, 2013, 9:08:15 AM
to isilon-u...@googlegroups.com
While tracking down a "slow SMB read" problem reported by multiple OSX users, I came across some findings I'd like to share and discuss with you.

The problem reported initially was quite simple: an OSX client (10.6, 10.7, 10.8) cannot read from our Isilon cluster faster than ~8MB/s (~100Mb/s for network guys) but it can write at ~50MB/s (~400Mb/s) (using SMB) => read speed is no good at all.

At first we investigated the client, then the network, then the Isilon config, etc. Everything was fine.
In the process we noticed that NFS was not affected. We had decent read speeds (~70MB/s) that we could push to 90MB/s with some tuning.
We also noticed that Windows 7 was able to read at wirespeed using SMB (same files and same network as the Macs).

We ended up firing up iperf and playing with it between the Isilon cluster and the Mac, and that's where things started to get weird.

1 - iperf with default options was able to saturate the 1Gb/s link: nothing special here.
2 - iperf with the -F option (reading data from a file and sending it over the wire) had shitty throughput (~8MB/s): something is wrong here.

At this point we started suspecting the disk subsystem. Since we have 3 tiers of nodes (and even some SSDs), we created 40GB files on the different storage tiers. We tested different buffer sizes (-l option) and different TCP window sizes (-w option), and we also disabled Nagle to avoid delays (-N option). We came up with these figures:

- iperf with -F reading a 40GB file on an NL node: ~8MB/s tops
- iperf with -F reading a 40GB file on an X node: ~8MB/s tops
- iperf with -F reading a 40GB file on an S node: ~75MB/s tops
- iperf with -F reading a 40GB file on an SSD disk: ~110MB/s tops

We got the same figures even between two nodes of the same Isilon cluster (even over the InfiniBand backend), no matter which node types acted as iperf client and server (X, NL, S).
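For reference, a sketch of the kind of iperf invocation behind the figures above (the IP address, file path and buffer/window sizes here are just examples, not our exact values):

# on the receiving side (the Mac, or another Isilon node)
iperf -s -w 256k

# on the sending Isilon node: read the 40GB test file and push it over the wire,
# with Nagle disabled (-N) and an explicit buffer length (-l)
iperf -c 192.0.2.10 -F /ifs/data/test/40g.bin -l 128k -w 256k -N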

My conclusions are:
1 - raw reads on Isilon are "very" slow (far below what you would expect from SATA/SAS/SSD throughput)
2 - Windows 7 and NFS on OSX get more throughput because NFS and SMB2 (the default on Windows 7) natively support some sort of pipelining, allowing the system to accomplish a deterministic read ahead. The client is basically reading from the L2 cache systematically.
3 - OSX (before 10.9) only supports SMBv1, which does not have any sort of pipelining built in. It basically sends a 60k read request and waits until it has received all the data before issuing the next one. The system is constantly reading from disks, which leads to "shitty" throughput because requests are not satisfied from the L1/L2 cache. iperf behaves the same way.

My questions are:
- Do you think that Isilon clusters are bound by these throughput limits by design?
- Are these "normal" figures, or are they the symptom of a deeper issue to investigate further?

[context]
GNA (global namespace acceleration) is enabled (on SSDs embedded in X nodes)
Data is in "concurrent" access mode (i.e. SmartCache is in action)
5 X nodes (24GB RAM)
4 NL nodes (12GB RAM)
10 S nodes (48GB RAM)

Saker Klippsten

Nov 21, 2013, 9:41:32 AM
to isilon-u...@googlegroups.com
1. What version of OneFS?
2. Have you tested yet with 10.9? Just curious what iperf shows. Is it on par with Windows SMB2?

We have a bunch of 10.8 transfer machines. Their sole purpose is to transfer content from Thunderbolt and FireWire drives to our Isilon cluster, specifically the NL nodes via a SmartPools folder policy, and we get 60MB/s plus over SMB and, as you know, better over NFS.

We are running OneFS 7.0.4 (about to upgrade to the latest).

Saker Klippsten | CTO | Zoic Studios

Rob Peglar

Nov 21, 2013, 11:09:42 AM
to isilon-u...@googlegroups.com
All Mac versions should be running with TCP delayed acks disabled. 

sudo sysctl -w net.inet.tcp.delayed_ack=0

This is your friend, especially for SMB.  See discussions.apple.com. Has been known since 2005. 
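If it does help, a minimal way to make it survive reboots (assuming these OS X releases still read /etc/sysctl.conf at boot, which is my understanding):

# append the setting to /etc/sysctl.conf so it is applied at boot
echo "net.inet.tcp.delayed_ack=0" | sudo tee -a /etc/sysctl.conf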

Peter Serocka

Nov 21, 2013, 11:35:41 AM
to isilon-u...@googlegroups.com
Have you checked the Isilon & Mac Best Practices paper?
It suggests a couple of good tweaks and is more
recent than 2005 ;-)

On Thu 21 Nov '13, at 22:08, Youssef Ghorbal <youssef...@gmail.com> wrote:

>
>
> - iperf with -F reading a 40GB file on an NL node: ~8MB/s tops
> - iperf with -F reading a 40GB file on an X node: ~8MB/s tops
> - iperf with -F reading a 40GB file on an S node: ~75MB/s tops
> - iperf with -F reading a 40GB file on an SSD disk: ~110MB/s tops

Nevertheless:
that's truly amazing (assuming that the network paths
to the nodes and the nodes' background loads
are equivalent for NL/X nodes vs. S nodes).

A few thoughts:

Can you send the outputs of
isi statistic client -nall --long
for these scenarios? (using SMB)

By analysing op rates, request sizes, latencies
and throughput rates in one context I hope we
can see what is going on,
and what makes the big difference here.

> We had these same figures even between two nodes
> on the same Isilon cluster (and even using the infiniband backend)
> and no matter what is the iperf client and server (X, NL, S)

That’s weird because you also said that with NFS everything is fine.

What happens if you simply read the file(s)
locally on the cluster with ‘cat’ or ‘dd bs=1024k’?

You can even do that as a cross-test:
run the cat or dd command on an S node
accessing data on the X or NL pool, and vice versa.
That would confirm whether the problem is
on the disk side (or the actual target node pool).
(The same is possible with SMB mounting, of course.)
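A minimal sketch of that cross-test (the file paths are only examples; pick files that actually live on the target pool):

# logged in on an S node, read a file that lives on the NL (or X) pool
dd if=/ifs/data/nl_pool/bigfile of=/dev/null bs=1024k

# and the reverse: logged in on an NL (or X) node, read a file on the S pool
dd if=/ifs/data/s_pool/bigfile of=/dev/null bs=1024k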

Finally I would try to check the
cache and prefetch hit rates,
to see the difference between your actual
S nodes and X/NL nodes:
Maybe your S nodes have enough RAM for caching,
while the X/NL nodes' caches are too busy.
Or prefetching is very bad on the X/NL nodes,
disabled? or too much fragmentation?

isi_cache_stats (-v) interval

is a great tool for this (reports for one node though),
and requires a pretty idle cluster to see a
clear signal from the test load.
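For example (the 5-second interval is arbitrary; run it on the node you are testing against):

isi_cache_stats -v 5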


And totally agree with Saker, OneFS 7 + OSX 10.9 is a completely new game…

Cheers

— Peter


Dan Pritts

Nov 21, 2013, 1:06:10 PM
to isilon-u...@googlegroups.com
Youssef Ghorbal wrote:
While tracking down a "slow SMB read" problem reported by multiple OSX users, I came across some findings I'd like to share and discuss with you.
Hi -

I can't really comment from personal experience, but perhaps this is useful.

When I asked our implementation engineer about recommendations for accessing the cluster from a Mac, he consulted some higher-level support team. The final answer: I was told in no uncertain terms "use NFS, not SMB".

This was pre-10.9.

danno
--
Dan Pritts
ICPSR Computing & Network Services
University of Michigan
+1 (734)615-7362

Youssef Ghorbal

Nov 21, 2013, 2:32:23 PM
to isilon-u...@googlegroups.com
On Thursday, November 21, 2013 3:41:32 PM UTC+1, Saker Klippsten wrote:
1. What version of OneFS?

v6.5.5.22

2. Have you tested yet with 10.9? Just curious what iperf shows. Is it on par with Windows SMB2?

I've tried 10.9 with SMB2 (which is now the default); I get 35MB/s read speeds when reading from NL/X and 75MB/s when reading from S and SSD.
My analysis here is that even with SMB2, OSX does not activate TCP window scaling in any of the tests I did. The connection ends up using a 64K window size, which is quite limiting. Windows 7, on the other hand, uses TCP window scaling up to a 1MB window, which lets it reach 1Gb/s.
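For reference, this can be checked on the Mac side; the interface name en0 and port 445 are just examples, and the rfc1323 sysctl governs window scaling and timestamps on these OSX releases as far as I know:

# 1 means RFC 1323 extensions (window scaling + timestamps) are enabled
sysctl net.inet.tcp.rfc1323

# watch the SYN/SYN-ACK of a fresh SMB connection and look for the "wscale" option
sudo tcpdump -n -i en0 'tcp port 445 and (tcp[tcpflags] & tcp-syn) != 0'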

We have a bunch of 10.8 transfer machines. Their sole purpose is to transfer content from Thunderbolt and FireWire drives to our Isilon cluster, specifically the NL nodes via a SmartPools folder policy, and we get 60MB/s plus over SMB and, as you know, better over NFS.

The Mac is writing data to the Isilon here; that's a different story. I don't have issues writing files.

Youssef Ghorbal

Nov 21, 2013, 2:34:03 PM
to isilon-u...@googlegroups.com
On Thursday, November 21, 2013 5:09:42 PM UTC+1, Rob Peglar wrote:
All Mac versions should be running with TCP delayed acks disabled. 

sudo sysctl -w net.inet.tcp.delayed_ack=0

This was my first shot, but it had no effect (other than exploding CPU usage on the Mac); the default value works just fine.

Youssef Ghorbal

Nov 21, 2013, 2:38:11 PM
to isilon-u...@googlegroups.com
I can't really comment from personal experience, but perhaps this is useful.

When I asked our implementation engineer about recommendations for accessing the cluster from a Mac, he consulted some higher-level support team. The final answer: I was told in no uncertain terms "use NFS, not SMB".

NFS is not a viable solution for home directory access on a campus (ID mapping is a nightmare).

Youssef

Erik Weiman

Nov 21, 2013, 2:39:49 PM
to isilon-u...@googlegroups.com
Please upgrade to OneFS 7.x to be able to handle "large MTU", as Windows calls it. Basically, OneFS 6.5 uses SMB v2.002 (which is what Vista SP1 and Server 2008 non-R2 use), while 7.0 and 7.1 use SMB v2.1, which introduced support for 1MB windows.

--
Erik Weiman 
Sent from my iPhone 4

Youssef Ghorbal

Nov 21, 2013, 2:53:56 PM
to isilon-u...@googlegroups.com
On Thursday, November 21, 2013 8:39:49 PM UTC+1, Erik Weiman wrote:
Please upgrade to OneFS 7.x to be able to handle "large MTU", as Windows calls it. Basically, OneFS 6.5 uses SMB v2.002 (which is what Vista SP1 and Server 2008 non-R2 use), while 7.0 and 7.1 use SMB v2.1, which introduced support for 1MB windows.

That's what I'm planning to do sooner or later.
I'm just trying to understand the hard limits of the cluster and the use cases that lead to hitting them (in order to avoid them in my own use cases and data workflows).
My initial investigation was around SMB, but I ended up reproducing the issue with plain iperf.

Youssef

Youssef Ghorbal

Nov 21, 2013, 3:08:56 PM
to isilon-u...@googlegroups.com
On Thursday, November 21, 2013 5:35:41 PM UTC+1, Pete wrote:
Have you checked the Isilon & Mac Best Practices paper?
It suggests a couple of good tweaks and is more
recent than 2005 ;-)

Already checked that.
 
On Thu 21 Nov '13, at 22:08, Youssef Ghorbal <youssef...@gmail.com> wrote:

>
>
> - iperf with -F reading a 40GB file on an NL node: ~8MB/s tops
> - iperf with -F reading a 40GB file on an X node: ~8MB/s tops
> - iperf with -F reading a 40GB file on an S node: ~75MB/s tops
> - iperf with -F reading a 40GB file on an SSD disk: ~110MB/s tops

Nevertheless:
that's truly amazing (assuming that the network paths
to the nodes and the nodes' background loads
are equivalent for NL/X nodes vs. S nodes).

A few thoughts:

Can you send the outputs of
isi statistic client -nall --long
for these scenarios? (using SMB)

I'll try to do that by tomorrow.

By analysing op rates, request sizes, latencies
and throughput rates in one context I hope we
can see what is going on,
and what makes the big difference here.

> We had these same figures even between two nodes
> on the same Isilon cluster (and even using the infiniband backend)
> and no matter what is the iperf client and server (X, NL, S)

That’s weird because you also said that with NFS everything is fine.

Yeah, my take is that NFS (and SMB2) work fine because they use pipelining. Instead of sending one read request (offset + bytes to read), they send many at once. The first read is done from disk, and while that data is being sent over the wire the system has already loaded the other requests from disk (into RAM), so they are served from cache when their turn comes. That's what I call deterministic read ahead.
 
What happens if you simply read the file(s)
locally on the cluster with ‘cat’ or ‘dd bs=1024k’?

dd bs=1024k if=/my/file of=/dev/null

On NL/X files I get ~50MB/s
On S files I get ~270MB/s
On SSD files I get ~500MB/s  

=> It's much better indeed. In fact the network is not involved here.

You can even do that as a cross-test:
run the cat or dd command on an S node
accessing data on the X or NL pool, and vice versa.
That would confirm whether the problem is
on the disk side (or the actual target node pool).
(The same is possible with SMB mounting, of course.)

Finally I would try to check the
cache and prefetch hit rates,
to see the difference between your actual
S nodes and X/NL nodes:
Maybe your S nodes have enough RAM for caching,
while the X/NL nodes' caches are too busy.
Or prefetching is very bad on the X/NL nodes,
disabled? or too much fragmentation?

I'll give it a shot tomorrow.
 
isi_cache_stats (-v) interval

is a great tool for this (reports for one node though),
and requires a pretty idle cluster to see a
clear signal from the test load.

It's not an idle cluster, unfortunately.
 
And totally agree with Saker, OneFS 7 + OSX 10.9 is a completely new game…

Clear !

Youssef 

Youssef Ghorbal

Nov 27, 2013, 8:51:05 AM
to isilon-u...@googlegroups.com

> Yeah, my take is that NFS (and SMB2) work fine because they use pipelining. Instead of sending one read request (offset + bytes to read), they send many at once. The first read is done from disk, and while that data is being sent over the wire the system has already loaded the other requests from disk (into RAM), so they are served from cache when their turn comes. That's what I call deterministic read ahead.
>
> > What happens if you simply read the file(s)
> > locally on the cluster with ‘cat’ or ‘dd bs=1024k’?
>
> dd bs=1024k if=/my/file of=/dev/null
>
> On NL/X files I get ~50MB/s
> On S files I get ~270MB/s
> On SSD files I get ~500MB/s
>
> => It's much better indeed. In fact the network is not involved here.

After some digging I found out that the problem came from iperf itself. I don't know why, but there is some magic interaction between iperf (or the way it reads data), the network, and the filesystem.

Netcat, on the other hand, gives more realistic throughput (practically the same as the dd results above).

People, if you want to test throughput with disks in the loop, Netcat is the way.
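A minimal sketch of the netcat approach, with the disk in the read path (the IP, port and path are examples; depending on the nc flavour, the listener may need -l -p 5001 instead of -l 5001):

# on the receiving Mac: listen and discard
nc -l 5001 > /dev/null

# on the Isilon node: read the file from disk and stream it over the wire;
# dd prints the effective throughput when it finishes
dd if=/ifs/data/test/40g.bin bs=1024k | nc 192.0.2.10 5001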

As for the initial problem ("slow SMB read on OSX"), support advised disabling TCP inflight limiting (net.inet.tcp.inflight.enable = 0), which lets OSX reach 30MB/s read throughput.
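For reference, the corresponding command on the Mac (and the same /etc/sysctl.conf trick mentioned earlier applies if you want it to persist):

sudo sysctl -w net.inet.tcp.inflight.enable=0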

On 10.9 (and OneFS 6.5) I see a slight bump in throughput, but nothing mind blowing. Others pointed out earlier that it should be better under OneFS 7.x.

Last but not least, with Windows 7 (and OneFS 6.5) one is able to achieve 1Gb/s wirespeed throughput.

Youssef Ghorbal

