Isilon small files performance


rgera...@gmail.com

unread,
Jun 25, 2013, 8:51:07 AM6/25/13
to isilon-u...@googlegroups.com
Hi,

We have a cluster with 4 NL400 nodes mainly for our PACS (Medical Imaging) system to store long term data on.

The data structure the PACS supplier(s) use is kinda crappy: they create thousands of directories, each holding thousands of small files. For example, an "ls" on such a directory takes 2 minutes (if not served from cache). Is there any way to speed that up, besides investing in metadata acceleration solutions (SSDs)?

Remco Gerards
LUMC Leiden
Netherlands

Steven Kreuzer

unread,
Jun 25, 2013, 9:12:40 AM6/25/13
to isilon-u...@googlegroups.com
What version of OneFS are you running? Isilon made some significant changes to how small files are handled in OneFS 7. Are you mounting the volume via NFS? If so, you might see some improvement with the rdirplus mount option set.
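For reference, a minimal sketch of what that looks like on a Linux client (the export path and mount point here are made up, and option defaults vary by distro, so check nfs(5)):

# mount the Isilon export over NFSv3 with READDIRPLUS enabled
mount -t nfs -o vers=3,rdirplus isilon-cluster:/ifs/pacs /mnt/pacs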

Peter Serocka

unread,
Jun 25, 2013, 11:06:01 AM6/25/13
to isilon-u...@googlegroups.com
Remco:

- yes, rdirplus is a must

- not sure about OneFS 7; it has reduced latency, but it's not clear whether
it helps for metadata. Actual findings, anyone?

- if you are really desperate, you can try keeping metadata in the
cache by running a find or ls -lR on (selected) directories every few
minutes (while the metadata is still in cache, of course :) This has helped us
once with a single but pretty large software directory (like /shared/bin/)
until SSD-enabled nodes arrived. Probably less suitable for a complete
imaging archive, though. Check the current cache coverage (in
seconds!) with:
isi statistics query -nall -snode.ifs.cache.oldest_page_age
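If you go down the warming route, a crontab entry along these lines would be a minimal sketch (the path and the 10-minute interval are only examples; it helps only while the metadata still fits in the cache):

# re-read directory metadata every 10 minutes so it stays warm in the cluster cache
*/10 * * * * ls -lR /ifs/pacs/archive > /dev/null 2>&1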


Cheers

Peter

LinuxRox

unread,
Jun 25, 2013, 1:31:52 PM6/25/13
to isilon-u...@googlegroups.com
Does rdirplus help with directory listings only, or are there other performance benefits? Google did not turn up much on "rdirplus".



Blake Golliher

unread,
Jun 25, 2013, 1:36:12 PM6/25/13
to isilon-u...@googlegroups.com

ReaddirPlus is an NFSv3-specific operation that returns file attributes along with the directory entries, so reading through large directories goes a little faster (many caveats apply of course, YMMV and all that).


I'd add that using HTTP access on the Isilon might be a way to abstract some of those problems away.  If the application provides a predictable directory structure, pointing curl at URLs might be easier than using "cd" into a path and "ls" to pull up a file.
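Something like this, as a rough sketch (the hostname and path are invented, and it assumes HTTP access has been enabled on the cluster):

# fetch one image directly by its known path - no directory listing involved
curl -O http://isilon-cluster/ifs/pacs/archive/2013/06/005033C2.DCM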




rgera...@gmail.com

unread,
Jun 25, 2013, 1:46:11 PM6/25/13
to isilon-u...@googlegroups.com
We are running OneFS 7.0.2.1 and share it through CIFS. Currently dealing with 400 million files spread over subdirectories with approx 10,000 files in one directory. Obviously there isn't a problem when you request a file, but don't try browsing it ;)
 
I also wonder how much it would improve if you were to use SSD metadata acceleration, and what exactly is required for accelerating metadata only. Does that also require 3 nodes, or is there a more friendly way to achieve that?

Peter Serocka

unread,
Jun 25, 2013, 11:22:03 PM6/25/13
to rgera...@gmail.com, isilon-u...@googlegroups.com
On Jun 26, 2013, at 01:46, rgera...@gmail.com wrote:

We are running OneFS 7.0.2.1 and share it through CIFS. Currently dealing with 400 million files spread over subdirectories with approx 10,000 files in one directory. Obviously there isn't a problem when you request a file, but don't try browsing it ;)

Correct, the B-tree metadata structures make
accessing a single file by its name pretty fast.

The problem occurs when listing a large number of files,
because the whole B-tree is spread across many blocks that have to
be read non-sequentially (more like a "random" pattern).

And there is the overhead in the NFS protocol,
somewhat reducible by rdirplus (= readdirplus).

Also the client might try to sort the list
alphabetically - thus, not requesting a sort might
speed things up further (like with "ls -f").
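For example (the path is a placeholder; timings will obviously depend on your directory sizes and cache state):

# default listing: the client sorts all names before printing
time ls /ifs/pacs/somedir > /dev/null
# unsorted listing in raw directory order, skipping the sort step
time ls -f /ifs/pacs/somedir > /dev/null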


 
I also wonder how much it would improve if you were to use SSD metadata acceleration, and what exactly is required for accelerating metadata only. Does that also require 3 nodes, or is there a more friendly way to achieve that?

About a factor of ten. The exact settings, aka "SSD strategies",
depend on whether SmartPools is licensed/used or not,
but they go along with the protection settings (per SmartPools policy, if used).

Just select "Metadata on SSD" (OneFS 7 lets you choose
read-only or read/write for the metadata on SSD).

Read-only means that SSD is used only for reading the
metadata, while it always gets written to spinning disk as well for protection.
So the speedup is only for reading, not for writing (like creating files).
For OneFS 6, this is the only way to have metadata on SSD.
Unless you choose full data on SSD as the strategy, of course...

If you have "enough" (a few %) SSD in your cluster,
you can enable "global" metadata acceleration - even your
NL pools can benefit from it. But in many clusters the
total SSD capacity is much less than 1% of the whole cluster,
so no luck.

Peter




Peter Serocka
CAS-MPG Partner Institute for Computational Biology (PICB)
Shanghai Institutes for Biological Sciences (SIBS)
Chinese Academy of Sciences (CAS)
320 Yue Yang Rd, Shanghai 200031, China





Peter Serocka

unread,
Jun 25, 2013, 11:25:10 PM6/25/13
to rgera...@gmail.com, isilon-u...@googlegroups.com
On Jun 26, 2013, at 01:46, rgera...@gmail.com wrote:


 
I also wonder how much it would improve if you were to use SSD metadata acceleration, and what exactly is required for accelerating metadata only. Does that also require 3 nodes, or is there a more friendly way to achieve that?

Forgot to mention this: yes, it requires 3.
The friendly option is the "global" acceleration (see my other post).

Peter



Peter Serocka

unread,
Jun 25, 2013, 11:29:50 PM6/25/13
to Blake Golliher, isilon-u...@googlegroups.com
On Jun 26, 2013, at 01:36, Blake Golliher wrote:


ReaddirPlus is an NFSv3-specific operation that returns file attributes along with the directory entries, so reading through large directories goes a little faster (many caveats apply of course, YMMV and all that).


I'd add that using HTTP access on the Isilon might be a way to abstract some of those problems away.  If the application provides a predictable directory structure, pointing curl at URLs might be easier than using "cd" into a path and "ls" to pull up a file.


Accessing a file by path+name is fast on both protocols!

If you can do without listing in HTTP, then you (your application)
can do so via NFS, too. Say you keep your own metadata in
a DB and access files only by path+name. There seems to be a
strong trend towards object storage these days -- this is why...
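As a toy illustration of that pattern (sqlite3 standing in for whatever index your application keeps; the database, table and paths are made up):

# look up the full path in your own index, then read the file directly -
# no directory listing ever hits the filer
path=$(sqlite3 pacs_index.db "SELECT path FROM images WHERE accession_no = '005033C2';")
cp "$path" /tmp/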


Peter

scott

unread,
Jun 25, 2013, 11:38:40 PM6/25/13
to isilon-u...@googlegroups.com
I'm also running a few PACS systems on a 4 NL node Isilon cluster on 6.5.5.11.  For comparison, if I ssh into my cluster, cd into a random (uncached) PACS directory and run:

# time ls

...many files...
 
ls -z  0.00s user 0.27s system 2% cpu 12.070 total
isilon-1# ls |wc -l
    2971

the system took 12 seconds to return the list of 2971 files.

The current Isilon job you have running will make a difference.  I currently have a MultiScan job running at LOW.  

-Scott 

rgera...@gmail.com

unread,
Jun 26, 2013, 4:46:11 AM6/26/13
to isilon-u...@googlegroups.com
I am in the 1-2 minute department with 5-10k files. We have a huge initial sync running, but I paused that. The only running job is a MediaScan on LOW.

Remco

On Wednesday, June 26, 2013 at 05:38:40 UTC+2, scott wrote the following:

Peter Serocka

unread,
Jun 26, 2013, 6:17:16 AM6/26/13
to rgera...@gmail.com, isilon-u...@googlegroups.com
On Jun 26, 2013, at 16:46, rgera...@gmail.com wrote:

I am in the 1-2 minute department with 5-10k files. We have a huge initial sync running, but I paused that. The only running job is a MediaScan on LOW.

Remco

3k files in 12 sec = 250 files/sec (Scott)

10k files in 2 mins = 83 files/sec (Remco)


Remco, did you test on the Isilon like Scott, or on a client? 


My findings are closer to Scott's (also on the Isilon):

4 x NL108 (3T SATA, with MultiScan running on SLOW):

150k files in 8 mins = 310 f/s  (and an NDMP level 1 "differential" job scanned that in 5 minutes...)


And compared to SSD:

3 x X200 (1T SATA plus metadata-read on SSD):

3.9 million files in 15 mins = 4300 f/s

=> matches the rule of thumb, metadata on SSD is 10x faster than on SATA


Peter 
  




rgera...@gmail.com

unread,
Jun 26, 2013, 6:22:50 AM6/26/13
to isilon-u...@googlegroups.com, rgera...@gmail.com
Yes, I tested it on one of the nodes in the cluster. I will wait for the sync to complete and try again.

Possible relevant other settings:

Protection scheme: +2:1
Nodes: 4 + 1 backup accelerator
Data access pattern: Concurrency

Cluster is about 70% full and data is synced with SyncIQ to another cluster. Also, the PACS data is protected through SmartLock (WORM) in enterprise mode.

I have also calculated the protection overhead (% difference between with and without overhead) and that's a stunning 58%, where normal office data is approx 27%.

Remco

On Wednesday, June 26, 2013 at 12:17:16 UTC+2, Pete wrote the following:

Peter Serocka

unread,
Jun 26, 2013, 6:44:10 AM6/26/13
to rgera...@gmail.com, isilon-u...@googlegroups.com
On Jun 26, 2013, at 18:22, rgera...@gmail.com wrote:

Yes, I tested it on one of the nodes in the cluster. I will wait for the sync to complete and try again.

If it was paused there shouldn't be much difference.

No idea whether WORM affects read performance... 
You might check without WORM for reference.

that's a stunning 58%

As this thread is about "small" files, your small files
have most probably become replicated rather than striped...

You can check with "isi get ..."

With +2:1 protection you accept that two disks may fail.

Thus, small files get replicated to _three_ disks
(isi get -> "3x").

In other words, 1/3 of the consumed space is actual data,
and 2/3 = 67% is overhead.

You are seeing 58% overhead (if that was your calculation),
a bit better than 67%, probably due to a few larger files
that got efficiently striped rather than replicated.

Peter

PS 
We also work with bio-images over here, and
would assume that dealing with small files would
be rather rare in this field...

rgera...@gmail.com

unread,
Jun 26, 2013, 6:57:50 AM6/26/13
to isilon-u...@googlegroups.com, rgera...@gmail.com
Hi,

Thanks for the quick replies. I did some tests:

WORM vs NOWORM: no change

I checked a small (4 KB) and a larger (6.5 MB) file with isi get:

Ampere-1# ls -l 005033C2.DCM
xxxxxxxxxxxxxxxxxxxxxxxxxxxx 6620360 Jun 17 16:57 005033C2.DCM
Ampere-1# isi get 005033C2.DCM
 POLICY  LEVEL PERFORMANCE COAL  FILE
default  6+2/2 concurrency on    005033C2.DCM


Ampere-1# ls -al 001273C2.DCM
xxxxxxxxxxxxxxxxxxxxxxxxxxxx  3174 Jun 16 02:44 001273C2.DCM
Ampere-1# isi get 001273C2.DCM
 POLICY  LEVEL PERFORMANCE COAL  FILE
default  6+2/2 concurrency on    001273C2.DCM


Going lower than +2:1 would be kinda foolish in my eyes. Would changing from concurrency to random help my issues?

Remco

On Wednesday, June 26, 2013 at 12:44:10 UTC+2, Pete wrote the following:

Peter Serocka

unread,
Jun 26, 2013, 7:24:48 AM6/26/13
to rgera...@gmail.com, isilon-u...@googlegroups.com
On Jun 26, 2013, at 18:57, rgera...@gmail.com wrote:

Hi,

Thanks for the quick replies. I did some tests:

WORM vs NOWORM: no change

Good (and what we had hoped for, of course...)



I checked a small (4 KB) and a larger (6.5 MB) file with isi get:


Ok, simple "isi get" doesn't provide the insight we needed.
"isi get -g" shows much more, BUT currently I can't
show how to read the relevant information about replication.
Sorry for that.

Here is what actually works ;)

isilon-1# du -Ah x y
8.0K    x
 38M    y

isilon-1# du -h x y 
 26K    x
 50M    y

(-A for "apparent" size as ls -l, otherwise shows actual storage comsumption)

One clearly sees the threefold overhead for the small file x,
and largely better ratio for file y.
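To put a number on the overhead for a whole tree rather than single files, this should work on the cluster as well (the path is just an example):

du -Ash /ifs/pacs/archive    # apparent (logical) size of the whole tree
du -sh /ifs/pacs/archive     # actual on-disk usage, including protection overhead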


Since going lower than +2:1 would be kinda foolish in my eyes

Fully agree.

 Would changing from concurrency to random help my issues ?

Don't think so; my understanding is that it affects data, not metadata.

Peter

rgera...@gmail.com

unread,
Jun 26, 2013, 9:40:37 AM6/26/13
to isilon-u...@googlegroups.com, rgera...@gmail.com
Would concurrency -> random take a few % off the overhead? If so, any idea how much?


On Wednesday, June 26, 2013 at 13:24:48 UTC+2, Pete wrote the following:

Peter Serocka

unread,
Jun 26, 2013, 9:53:39 AM6/26/13
to isilon-u...@googlegroups.com, rgera...@gmail.com
I can't see how that would avoid triplicating the smallest files...

P.

Jason Davis

unread,
Jun 26, 2013, 10:10:47 AM6/26/13
to isilon-u...@googlegroups.com, rgera...@gmail.com
From what I understand, changing the data access pattern would only affect SmartCache prefetching aggressiveness and data placement (striping).






Keith Nargi

unread,
Jun 26, 2013, 12:33:21 PM6/26/13
to isilon-u...@googlegroups.com, isilon-u...@googlegroups.com, rgera...@gmail.com
That's correct.  Random will not change the mirroring of a file.  If you have a cluster with SmartPools you could create a policy that protects files smaller than 128k at 2x instead of n+2:1 (which gives you 3 copies), but I do want to err on the side of caution with doing that.  At 2x you could effectively lose 2 drives or nodes and not have the file available, whereas 3x mirroring at n+2:1 will ensure that you still have the data available.

Sent from my iPhone

scott

unread,
Jun 27, 2013, 2:41:51 PM6/27/13
to isilon-u...@googlegroups.com
In an earlier post I shared my results listing PACS files on a 4 NL node cluster similar to yours.  In my tests I was listing about 246 files per second.
 
I have a second cluster with 4 NL nodes and 4 X-series nodes, with the options for metadata on SSD and global namespace acceleration switched on.  Only about 0.17% (point one seven percent) of my total storage is on SSD, which is below the optimal 2% mark for global namespace acceleration.  I have the same MultiScan job running at LOW.
 
On this cluster I was able to list 526 files per second.  Bumping MultiScan up to MEDIUM causes the files-per-second count to drop to around 375/second; MEDIUM is where I normally run this job, and that rate is acceptable to my users.
 
-Scott

rgera...@gmail.com

unread,
Jun 30, 2013, 3:13:14 PM6/30/13
to isilon-u...@googlegroups.com
I notice some huge differences that I have a hard time explaining.

The PACS data is present on 2 identical Isilon clusters. Both are still in a test phase and almost not being used.

On one cluster it takes 2 minutes to ls a directory, while the same (SyncIQ'd) dir on the other cluster takes 10 seconds. The main difference between them is snapshots: the "slow" one has approx 60 snapshots (1 month / 2 per day) on it, while the other one has none. When I copy that same directory to a non-snapshotted folder I get the same 10-second results. Now I understand how deletes are slowed down by snapshots, but how does this work?

Am I making an error somewhere?

Remco



On Thursday, June 27, 2013 at 20:41:51 UTC+2, scott wrote the following:

Peter Serocka

unread,
Jul 1, 2013, 4:09:00 AM7/1/13
to isilon-u...@googlegroups.com
Are these revertible snapshots (SnapRevert, OneFS 7)?

Peter



Luc Simard

unread,
Jul 2, 2013, 4:29:50 AM7/2/13
to isilon-u...@googlegroups.com, isilon-u...@googlegroups.com
Can you cover the node types for the source cluster and the target?

Can you describe your file layout: deep or shallow directories, number of files per directory, average file size per directory? Do you make use of GNA?

Per cluster, please provide the output of:
# isi stat -d
# isi_for_array -s 'isi_hw_status -i'
# uname -a
# isi pkg info



Luc Simard - 415-793-0989
Messages may contain confidential information.
Sent from my iPhone

rgera...@gmail.com

unread,
Jul 2, 2013, 4:42:47 AM7/2/13
to isilon-u...@googlegroups.com
Hi,

Both clusters are 4x NL400.

File layout: Typical office data and PACS data from 2 PACS systems. The data I discussed earlier is the PACS with the very small files (I think the average is about 80-100 KB, and there are approx 200 million files).
Directory structure: Office data goes very deep. PACS data is shallow, with 5-10k files per directory.

We are not using GNA.

I have attached files with the output of the commands. Currently deleting a lot of checkpoints on AMPERE.

Remco

On Tuesday, July 2, 2013 at 10:29:50 UTC+2, lsimard wrote the following:
AMPERE.txt
VOLTA.txt

Luc Simard

unread,
Jul 2, 2013, 4:58:45 AM7/2/13
to isilon-u...@googlegroups.com, isilon-u...@googlegroups.com
The NL product is the deep archive model. If you feel this is not meeting your expectations, I would do the following, in order:

- Regardless, upgrade to 7.0.1.6 or better. You are aware of the known published EMC ETAs; if not, have a look on support.emc.com.

- Work with the GS Support team for a closer inspection of the source and target clusters under a performance case investigation. You may have other things happening here which would take very long to troubleshoot by email. It's quite possible nothing is wrong, or you may need to revisit some practices or configuration.

- Raise the issue with your SE/TC and Sales team; they can be creative there as well. I would investigate the GNA path (not avail on NL).


Luc Simard - 415-793-0989
Messages may contain confidential information.
Sent from my iPhone
<AMPERE.txt>
<VOLTA.txt>

rgera...@gmail.com

unread,
Jul 2, 2013, 5:04:04 AM7/2/13
to isilon-u...@googlegroups.com
We are running 7.0.2.1, which is higher than 7.0.1.6 :)



On Tuesday, July 2, 2013 at 10:58:45 UTC+2, lsimard wrote the following:

Luc Simard

unread,
Jul 2, 2013, 5:08:39 AM7/2/13
to isilon-u...@googlegroups.com, isilon-u...@googlegroups.com
My mistake, I just re-read the text files provided; you already have 7.0.2.1.



Luc Simard - 415-793-0989
Messages may contain confidential information.
Sent from my iPhone
On Jul 2, 2013, at 1:42, rgera...@gmail.com wrote:

<AMPERE.txt>
<VOLTA.txt>

Peter Serocka

unread,
Jul 2, 2013, 5:24:25 AM7/2/13
to rgera...@gmail.com, isilon-u...@googlegroups.com
On Jul 2, 2013, at 17:04, rgera...@gmail.com wrote:

We are running 7.0.2.1, which is higher than 7.0.1.6 :)

He said "better" ;-)))



The stowed jobs don't look good to me.
SnapshotDelete running for 15 hours?
(There is that thread where I advocated overlapping
snapshots, but I have never seen 15 hours. That was on NL108,
6.5.5.x and 100+ million files.)

You might want to watch the progress of SnapshotDelete
with isi job status -v (if that's still the syntax for 7.0)
to see whether there is reasonable progress and some end in sight.

We found it better to have MediaScan NEVER interrupted
by SnapshotDelete. (On 6.5, MediaScan can get Sisyphus'ed
at some phase by recurrent SnapshotDeletes.)

When your SnapshotDelete will have finished,
the stowed jobs will run in the order of priority:
MultiScan, QuotaScan, MediaScan

MultiScan might take days to weeks, even
if the cluster appears to be balanced.

So you will wait a long time for QuotaScan
to proceed. Unless the cluster is highly
unbalanced, I'd prefer to have QuotaScan
run next, then finish MediaScan (2-3 days
when run on MEDIUM impact), then let MultiScan
proceed in the background (LOW Impact and low prio).

Whenever MultiScan runs, monitor it carefully;
(at least on 6.5) it likes to get silently
"System canceled" on a mere disk stall incident,
without further notice and without a new attempt.

That said, your recent experiment with copying
a dir out of a snapshot area to a fresh area
on the same cluster is still a mystery to me.
Or was the SnapshotDelete already running at
that time?

Cheers

Peter

rgera...@gmail.com

unread,
Jul 2, 2013, 5:30:42 AM7/2/13
to isilon-u...@googlegroups.com, rgera...@gmail.com
The SnapshotDelete is deleting a lot of snapshots (approx 16 TB): Progress: Deleted 19643147/31684008 LINs and 34/158 snapshots. It was not running when I did that test; I will test again when the job has completed and all snapshots are gone.

I am removing the snapshots and changing the schedules (influenced by the other thread on that subject).

I will follow your advice on the job order and raise the impact of the MediaScan job (it has been running for 8 days on LOW) to get it finished.


On Tuesday, July 2, 2013 at 11:24:25 UTC+2, Pete wrote the following:

Peter Serocka

unread,
Jul 2, 2013, 5:43:41 AM7/2/13
to rgera...@gmail.com, isilon-u...@googlegroups.com
Thanks -- seems there is hope for the deletes to end... and good luck!

Peter