Isilon small files performance


rgera...@gmail.com

unread,
Jun 25, 2013, 8:51:07 AM6/25/13
to isilon-u...@googlegroups.com
Hi,

We have a cluster with 4 NL400 nodes mainly for our PACS (Medical Imaging) system to store long term data on.

The data structure the PACS supplier(s) use is kinda crappy: they create thousands of directories, each holding thousands of small files. For example, an "ls" on such a directory takes 2 minutes (if not served from cache). Is there any way to speed that up, besides investing in metadata acceleration solutions (SSDs)?

Remco Gerards
LUMC Leiden
Netherlands

Steven Kreuzer

unread,
Jun 25, 2013, 9:12:40 AM6/25/13
to isilon-u...@googlegroups.com
What version of OneFS are you running? Isilon made some significant changes to how small files are handled in OneFS 7. Are you mounting the volume via NFS? If so, you might see some improvement with the rdirplus mount option set.
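For reference, a minimal sketch of what that looks like on a Linux client (the export path and mount point here are made up, and option defaults vary by distro, so check nfs(5)):

# mount the Isilon export over NFSv3 with READDIRPLUS enabled
mount -t nfs -o vers=3,rdirplus isilon-cluster:/ifs/pacs /mnt/pacs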

Peter Serocka

unread,
Jun 25, 2013, 11:06:01 AM6/25/13
to isilon-u...@googlegroups.com
Remco:

- yes, rdirplus is a must

- not sure about OneFS 7; it has reduced latency, but it's not clear whether
it helps for metadata. Actual findings, anyone?

- if you are really desperate, you can try keeping metadata in the
cache by running a find or ls -lR on (selected) directories every few
minutes (while the metadata is still in cache, of course :) This has helped us
once with a single but pretty large software directory (like /shared/bin/)
until SSD-enabled nodes arrived. Probably less suitable for a complete
imaging archive, though. Check the current cache coverage (in
seconds!) with:
isi statistics query -nall -snode.ifs.cache.oldest_page_age
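If you go down the warming route, a crontab entry along these lines would be a minimal sketch (the path and the 10-minute interval are only examples; it helps only while the metadata still fits in the cache):

# re-read directory metadata every 10 minutes so it stays warm in the cluster cache
*/10 * * * * ls -lR /ifs/pacs/archive > /dev/null 2>&1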


Cheers

Peter

LinuxRox

unread,
Jun 25, 2013, 1:31:52 PM6/25/13
to isilon-u...@googlegroups.com
Does rdirplus help with directory listings only, or are there other performance benefits? Google did not turn up much on "rdirplus".



Blake Golliher

unread,
Jun 25, 2013, 1:36:12 PM6/25/13
to isilon-u...@googlegroups.com

ReaddirPlus is an NFSv3-specific operation that returns file attributes along with the directory entries, so reading through large directories goes a little faster (many caveats apply of course, YMMV and all that).


I'd add that using HTTP access on the Isilon might be a way to abstract some of those problems away.  If the application provides a predictable directory structure, pointing curl at URLs might be easier than using "cd" into a path and "ls" to pull up a file.
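Something like this, as a rough sketch (the hostname and path are invented, and it assumes HTTP access has been enabled on the cluster):

# fetch one image directly by its known path - no directory listing involved
curl -O http://isilon-cluster/ifs/pacs/archive/2013/06/005033C2.DCM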




rgera...@gmail.com

unread,
Jun 25, 2013, 1:46:11 PM6/25/13
to isilon-u...@googlegroups.com
We are running OneFS 7.0.2.1 and share it through CIFS. Currently dealing with 400 million files spread over subdirectories with approx 10,000 files in one directory. Obviously there isn't a problem when you request a file, but don't try browsing it ;)
 
I also wonder how much it would improve if you were to use SSD metadata acceleration, and what exactly is required for accelerating metadata only. Does that also require 3 nodes, or is there a more friendly way to achieve that?

Peter Serocka

unread,
Jun 25, 2013, 11:22:03 PM6/25/13
to rgera...@gmail.com, isilon-u...@googlegroups.com
On Jun 26, 2013, at 01:46, rgera...@gmail.com wrote:

We are running OneFS 7.0.2.1 and share it through CIFS. Currently dealing with 400 million files spread over subdirectories with approx 10,000 files in one directory. Obviously there isn't a problem when you request a file, but don't try browsing it ;)

Correct, the B-tree metadata structures make
accessing a single file by its name pretty fast.

The problem occurs when listing a large number of files,
because the whole B-tree is spread across many blocks that have to
be read non-sequentially (more like a "random" pattern).

And there is the overhead in the NFS protocol,
somewhat reducible by rdirplus (= readdirplus).

Also the client might try to sort the list
alphabetically - thus, not requesting a sort might
speed things up further (like with "ls -f").
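For example (the path is a placeholder; timings will obviously depend on your directory sizes and cache state):

# default listing: the client sorts all names before printing
time ls /ifs/pacs/somedir > /dev/null
# unsorted listing in raw directory order, skipping the sort step
time ls -f /ifs/pacs/somedir > /dev/null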


 
I also wonder how much it would improve if you were to use SSD metadata acceleration, and what exactly is required for accelerating metadata only. Does that also require 3 nodes, or is there a more friendly way to achieve that?

About a factor of ten. The exact settings, aka "SSD strategies",
depend on whether SmartPools is licensed/used or not,
but they go along with the protection settings (per SmartPools policy, if used).

Just select "Metadata on SSD" (OneFS 7 lets you choose
read-only or read/write for the metadata on SSD).

Read-only means that SSD is used only for reading the
metadata, while it always gets written to spinning disk as well for protection.
So the speedup is only for reading, not for writing (like creating files).
For OneFS 6, this is the only way to have metadata on SSD.
Unless you choose full data on SSD as the strategy, of course...

If you have "enough" (a few %) SSD in your cluster,
you can enable "global" metadata acceleration - even your
NL pools can benefit from it. But in many clusters the
total SSD capacity is much less than 1% of the whole cluster,
so no luck.

Peter




Peter Serocka
CAS-MPG Partner Institute for Computational Biology (PICB)
Shanghai Institutes for Biological Sciences (SIBS)
Chinese Academy of Sciences (CAS)
320 Yue Yang Rd, Shanghai 200031, China





Peter Serocka

unread,
Jun 25, 2013, 11:25:10 PM6/25/13
to rgera...@gmail.com, isilon-u...@googlegroups.com
On Jun 26, 2013, at 01:46, rgera...@gmail.com wrote:


 
I also wonder how much it would improve if you were to use SSD metadata acceleration, and what exactly is required for accelerating metadata only. Does that also require 3 nodes, or is there a more friendly way to achieve that?

Forgot to mention this: yes, it requires 3.
The friendly option is the "global" acceleration (see my other post).

Peter



Peter Serocka

unread,
Jun 25, 2013, 11:29:50 PM6/25/13
to Blake Golliher, isilon-u...@googlegroups.com
On Jun 26, 2013, at 01:36, Blake Golliher wrote:


ReaddirPlus is an NFSv3-specific operation that returns file attributes along with the directory entries, so reading through large directories goes a little faster (many caveats apply of course, YMMV and all that).


I'd add that using HTTP access on the Isilon might be a way to abstract some of those problems away.  If the application provides a predictable directory structure, pointing curl at URLs might be easier than using "cd" into a path and "ls" to pull up a file.


Accessing a file by path+name is fast on both protocols!

If you can do without listing in HTTP, then you (your application)
can do so via NFS, too. Say you keep your own metadata in
a DB and access files only by path+name. There seems to be a
strong trend towards object storage these days -- this is why...
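As a toy illustration of that pattern (sqlite3 standing in for whatever index your application keeps; the database, table and paths are made up):

# look up the full path in your own index, then read the file directly -
# no directory listing ever hits the filer
path=$(sqlite3 pacs_index.db "SELECT path FROM images WHERE accession_no = '005033C2';")
cp "$path" /tmp/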


Peter

scott

unread,
Jun 25, 2013, 11:38:40 PM6/25/13
to isilon-u...@googlegroups.com
I'm also running a few PACS systems on a 4 NL node Isilon cluster on 6.5.5.11.  For comparison, if I ssh into my cluster, cd into a random (uncached) PACS directory and run:

# time ls

...many files...
 
ls -z  0.00s user 0.27s system 2% cpu 12.070 total
isilon-1# ls |wc -l
    2971

the system took 12 seconds to return the list of 2971 files.

The current Isilon job you have running will make a difference.  I currently have a MultiScan job running at LOW.  

-Scott 

rgera...@gmail.com

unread,
Jun 26, 2013, 4:46:11 AM6/26/13
to isilon-u...@googlegroups.com
I am in the 1-2 minute department with 5-10k files. We have a huge initial sync running, but I paused that. The only running job is a MediaScan on LOW.

Remco

On Wednesday, June 26, 2013 at 05:38:40 UTC+2, scott wrote the following:

Peter Serocka

unread,
Jun 26, 2013, 6:17:16 AM6/26/13
to rgera...@gmail.com, isilon-u...@googlegroups.com
On Jun 26, 2013, at 16:46, rgera...@gmail.com wrote:

I am in the 1-2 minute department with 5-10k files. We have a huge initial sync running, but I paused that. The only running job is a MediaScan on LOW.

Remco

3k files in 12 sec = 250 files/sec (Scott)

10k files in 2 mins = 83 files/sec (Remco)


Remco, did you test on the Isilon like Scott, or on a client? 


My findings are closer to Scott's (also on the Isilon):

4 x NL108 (3T SATA, with MultiScan running on SLOW):

150k files in 8 mins = 310 f/s  (and an NDMP level 1 "differential" job scanned that in 5 minutes...)


And compared to SSD:

3 x X200 (1T SATA plus metadata-read on SSD):

3.9 million files in 15 mins = 4300 f/s

=> matches the rule of thumb, metadata on SSD is 10x faster than on SATA


Peter 
  




rgera...@gmail.com

unread,
Jun 26, 2013, 6:22:50 AM6/26/13
to isilon-u...@googlegroups.com, rgera...@gmail.com
Yes, I tested it on one of the nodes in the cluster. I will wait for the sync to complete and try again.

Possible relevant other settings:

Protection scheme: +2:1
Nodes: 4 + 1 backup accelerator
Data access pattern: Concurrency

Cluster is about 70% full and data is synced with SyncIQ to another cluster. Also, the PACS data is protected through SmartLock (WORM) in enterprise mode.

I have also calculated the protection overhead (% difference between with and without overhead) and that's a stunning 58%, where normal office data is approx 27%.

Remco

On Wednesday, June 26, 2013 at 12:17:16 UTC+2, Pete wrote the following:

Peter Serocka

unread,
Jun 26, 2013, 6:44:10 AM6/26/13
to rgera...@gmail.com, isilon-u...@googlegroups.com
On Jun 26, 2013, at 18:22, rgera...@gmail.com wrote:

Yes, I tested it on one of the nodes in the cluster. I will wait for the sync to complete and try again.

If it was paused there shouldn't be much difference.

No idea whether WORM affects read performance... 
You might check without WORM for reference.

that's a stunning 58%

As this thread is about "small" files, your small files
have most probably become replicated rather than striped...

You can check with "isi get ..."

With +2:1 protection you accept that two disks may fail.

Thus, small files get replicated to _three_ disks
(isi get -> "3x").

In other words, 1/3 of the consumed space is actual data,
and 2/3 = 67% is overhead.

You are seeing 58% overhead (if that was your calculation),
a bit better than 67%, probably due to a few larger files
that got efficiently striped rather than replicated.

Peter

PS 
We also work with bio-images over here, and
would assume that dealing with small files would
be rather rare in this field...

rgera...@gmail.com

unread,
Jun 26, 2013, 6:57:50 AM6/26/13
to isilon-u...@googlegroups.com, rgera...@gmail.com
Hi,

Thanks for the quick replies. I did some tests:

WORM vs NOWORM: no change

I checked a small (4 KB) and a larger (6.5 MB) file with isi get:

Ampere-1# ls -l 005033C2.DCM
xxxxxxxxxxxxxxxxxxxxxxxxxxxx 6620360 Jun 17 16:57 005033C2.DCM
Ampere-1# isi get 005033C2.DCM
 POLICY  LEVEL PERFORMANCE COAL  FILE
default  6+2/2 concurrency on    005033C2.DCM


Ampere-1# ls -al 001273C2.DCM
xxxxxxxxxxxxxxxxxxxxxxxxxxxx  3174 Jun 16 02:44 001273C2.DCM
Ampere-1# isi get 001273C2.DCM
 POLICY  LEVEL PERFORMANCE COAL  FILE
default  6+2/2 concurrency on    001273C2.DCM


Going lower than +2:1 would be kinda foolish in my eyes. Would changing from concurrency to random help my issues?

Remco

On Wednesday, June 26, 2013 at 12:44:10 UTC+2, Pete wrote the following:

Peter Serocka

unread,
Jun 26, 2013, 7:24:48 AM6/26/13
to rgera...@gmail.com, isilon-u...@googlegroups.com
On Jun 26, 2013, at 18:57, rgera...@gmail.com wrote:

Hi,

Thanks for the quick replies. I did some tests:

WORM vs NOWORM: no change

Good (and what we had hoped for, of course...)



I checked a small (4 KB) and a larger (6.5 MB) file with isi get:


Ok, simple "isi get" doesn't provide the insight we needed.
"isi get -g" shows much more, BUT currently I can't
show how to read the relevant information about replication.
Sorry for that.

Here is what actually works ;)

isilon-1# du -Ah x y
8.0K    x
 38M    y

isilon-1# du -h x y 
 26K    x
 50M    y

(-A for "apparent" size as ls -l, otherwise shows actual storage comsumption)

One clearly sees the threefold overhead for the small file x,
and largely better ratio for file y.
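To put a number on the overhead for a whole tree rather than single files, this should work on the cluster as well (the path is just an example):

du -Ash /ifs/pacs/archive    # apparent (logical) size of the whole tree
du -sh /ifs/pacs/archive     # actual on-disk usage, including protection overhead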


Since going lower than +2:1 would be kinda foolish in my eyes

Fully agree.

 Would changing from concurrency to random help my issues ?

Don't think so; my understanding is that it affects data, not metadata.

Peter

rgera...@gmail.com

unread,
Jun 26, 2013, 9:40:37 AM6/26/13
to isilon-u...@googlegroups.com, rgera...@gmail.com
Would concurrency -> random take a few % off the overhead? If so, any idea how much?


On Wednesday, June 26, 2013 at 13:24:48 UTC+2, Pete wrote the following:

Peter Serocka

unread,
Jun 26, 2013, 9:53:39 AM6/26/13
to isilon-u...@googlegroups.com, rgera...@gmail.com
I can't see how that would avoid triplicating the smallest files...

P.

Jason Davis

unread,
Jun 26, 2013, 10:10:47 AM6/26/13
to isilon-u...@googlegroups.com, rgera...@gmail.com
From what I understand, changing the data access pattern would only affect SmartCache prefetching aggressiveness and data placement (striping).






Keith Nargi

unread,
Jun 26, 2013, 12:33:21 PM6/26/13
to isilon-u...@googlegroups.com, isilon-u...@googlegroups.com, rgera...@gmail.com
That's correct.  Random will not change the mirroring of a file.  If you have a cluster with SmartPools you could create a policy that protects files smaller than 128k at 2x instead of n+2:1 (which gives you 3 copies), but I do want to err on the side of caution with doing that.  At 2x you could effectively lose 2 drives or nodes and not have the file available, whereas 3x mirroring at n+2:1 will ensure that you still have the data available.

Sent from my iPhone

scott

unread,
Jun 27, 2013, 2:41:51 PM6/27/13
to isilon-u...@googlegroups.com
In an earlier post I shared my results listing PACS files on a 4 NL node cluster similar to yours.  In my tests I was listing about 246 files per second.
 
I have a second cluster with 4 NL nodes and 4 X-series nodes, with the options for metadata on SSD and global namespace acceleration switched on.  Only about 0.17% (point one seven percent) of my total storage is on SSD, which is below the optimal 2% mark for global namespace acceleration.  I have the same MultiScan job running at LOW.
 
On this cluster I was able to list 526 files per second.  Bumping MultiScan up to MEDIUM causes the files-per-second count to drop to around 375/second; MEDIUM is where I normally run this job, and that rate is acceptable to my users.
 
-Scott

rgera...@gmail.com

unread,
Jun 30, 2013, 3:13:14 PM6/30/13
to isilon-u...@googlegroups.com
I notice some huge differences that I have a hard time explaining.

The PACS data is present on 2 identical Isilon clusters. Both are still in a test phase and almost not being used.

On one cluster it takes 2 minutes to ls a directory, while the same (SyncIQ'd) dir on the other cluster takes 10 seconds. The main difference between them is snapshots: the "slow" one has approx 60 snapshots (1 month / 2 per day) on it, while the other one has none. When I copy that same directory to a non-snapshotted folder I get the same 10-second results. Now I understand how deletes are slowed down by snapshots, but how does this work?

Am I making an error somewhere?

Remco



On Thursday, June 27, 2013 at 20:41:51 UTC+2, scott wrote the following:

Peter Serocka

unread,
Jul 1, 2013, 4:09:00 AM7/1/13
to isilon-u...@googlegroups.com
Are these revertible snapshots (SnapRevert, OneFS 7)?

Peter



Luc Simard

unread,
Jul 2, 2013, 4:29:50 AM7/2/13
to isilon-u...@googlegroups.com, isilon-u...@googlegroups.com
Can you cover the node types for the source cluster and the target?

Can you describe your file layout: deep or shallow directories, number of files per directory, average file size per directory? Do you make use of GNA?

Per cluster, please provide the output of:
# isi stat -d
# isi_for_array -s 'isi_hw_status -i'
# uname -a
# isi pkg info



Luc Simard - 415-793-0989
Messages may contain confidential information.
Sent from my iPhone

rgera...@gmail.com

unread,
Jul 2, 2013, 4:42:47 AM7/2/13
to isilon-u...@googlegroups.com
Hi,

Both clusters are 4x NL400.

File layout: Typical office data and PACS data from 2 PACS systems. The data I discussed earlier is the PACS with the very small files (I think the average is about 80-100 KB, and there are approx 200 million files).
Directory structure: Office data goes very deep. PACS data is shallow, with 5-10k files per directory.

We are not using GNA.

I have attached files with the output of the commands. Currently deleting a lot of checkpoints on AMPERE.

Remco

On Tuesday, July 2, 2013 at 10:29:50 UTC+2, lsimard wrote the following:
AMPERE.txt
VOLTA.txt

Luc Simard

unread,
Jul 2, 2013, 4:58:45 AM7/2/13
to isilon-u...@googlegroups.com, isilon-u...@googlegroups.com
The NL product is the deep archive model. If you feel this is not meeting your expectations, I would do the following, in order:

- Regardless, upgrade to 7.0.1.6 or better. You are aware of the known published EMC ETAs; if not, have a look on support.emc.com.

- Work with the GS Support team for a closer inspection of the source and target clusters under a performance case investigation. You may have other things happening here which would take very long to troubleshoot by email. It's quite possible nothing is wrong, or you may need to revisit some practices or configuration.

- Raise the issue with your SE/TC and Sales team; they can be creative there as well. I would investigate the GNA path (not avail on NL).


Luc Simard - 415-793-0989
Messages may contain confidential information.
Sent from my iPhone
<AMPERE.txt>
<VOLTA.txt>

rgera...@gmail.com

unread,
Jul 2, 2013, 5:04:04 AM7/2/13
to isilon-u...@googlegroups.com
We are running 7.0.2.1, which is higher than 7.0.1.6 :)



On Tuesday, July 2, 2013 at 10:58:45 UTC+2, lsimard wrote the following:

Luc Simard

unread,
Jul 2, 2013, 5:08:39 AM7/2/13
to isilon-u...@googlegroups.com, isilon-u...@googlegroups.com
My mistake, I just re-read the text files provided; you already have 7.0.2.1.



Luc Simard - 415-793-0989
Messages may contain confidential information.
Sent from my iPhone
On Jul 2, 2013, at 1:42, rgera...@gmail.com wrote:

<AMPERE.txt>
<VOLTA.txt>

Peter Serocka

unread,
Jul 2, 2013, 5:24:25 AM7/2/13
to rgera...@gmail.com, isilon-u...@googlegroups.com
On Jul 2, 2013, at 17:04, rgera...@gmail.com wrote:

We are running 7.0.2.1, which is higher than 7.0.1.6 :)

He said "better" ;-)))



The stowed jobs don't look good to me.
SnapshotDelete running for 15 hours?
(There is that thread where I advocated overlapping
snapshots, but I have never seen 15 hours. That was on NL108,
6.5.5.x and 100+ million files.)

You might want to watch the progress of SnapshotDelete
with isi job status -v (if that's still the syntax for 7.0)
to see whether there is reasonable progress and some end in sight.

We found it better to have MediaScan NEVER interrupted
by SnapshotDelete. (On 6.5, MediaScan can get Sisyphus'ed
at some phase by recurrent SnapshotDeletes.)

When your SnapshotDelete will have finished,
the stowed jobs will run in the order of priority:
MultiScan, QuotaScan, MediaScan

MultiScan might take days to weeks, even
if the cluster appears to be balanced.

So you will wait a long time for QuotaScan
to proceed. Unless the cluster is highly
unbalanced, I'd prefer to have QuotaScan
run next, then finish MediaScan (2-3 days
when run on MEDIUM impact), then let MultiScan
proceed in the background (LOW Impact and low prio).

Whenever MultiScan runs, monitor it carefully;
(at least on 6.5) it likes to get silently
"System canceled" on a mere disk stall incident,
without further notice and without a new attempt.

That said, your recent experiment with copying
a dir out of a snapshot area to a fresh area
on the same cluster is still a mystery to me.
Or was the SnapshotDelete already running at
that time?

Cheers

Peter

rgera...@gmail.com

unread,
Jul 2, 2013, 5:30:42 AM7/2/13
to isilon-u...@googlegroups.com, rgera...@gmail.com
The SnapshotDelete is deleting a lot of snapshots (approx 16 TB): Progress: Deleted 19643147/31684008 LINs and 34/158 snapshots. It was not running when I did that test; I will test again when the job has completed and all snapshots are gone.

I am removing the snapshots and changing the schedules (influenced by the other thread on that subject).

I will follow your advice on the job order and raise the impact of the MediaScan job (it has been running for 8 days on LOW) to get it finished.


On Tuesday, July 2, 2013 at 11:24:25 UTC+2, Pete wrote the following:

Peter Serocka

unread,
Jul 2, 2013, 5:43:41 AM7/2/13
to rgera...@gmail.com, isilon-u...@googlegroups.com
Thanks -- seems there is hope for the deletes to end... and good luck!

Peter