Full Pool - Should I Be Concerned?


gtjones

Apr 11, 2013, 10:33:17 AM
to isilon-u...@googlegroups.com
I have a pool that filled up yesterday after I changed the file pool policies, which are based on access time. Should I be concerned about a pool filling when we have pool spillover turned on? We do have a VHS size of 21 TB on the full pool to help with any drive failures. The pool in question is the X40066 pool.

Here are the stats:
Cluster Storage:  HDD                 SSD
Size:             948T (988T Raw)     11T (11T Raw)
VHS Size:         40T
Used:             705T (74%)          1.8T (16%)
Avail:            244T (26%)          9.2T (84%)

                         Throughput (bps)     HDD Storage      SSD Storage
 Name              Health In   Out  Total | Used / Size      |Used / Size
-------------------+----+-----+-----+-----+------------------+-----------------
 performance       | OK | 1.5G| 4.2G| 5.7G|  173T/ 374T( 46%)|    (No SSDs)
 ssd               | OK | 2.8G| 3.9G| 6.7G|   17T/  61T( 29%)| 330G/ 1.2T( 26%)
 x40066            | OK | 1.3G| 2.9G| 4.1G|  514T/ 514T(>99%)| 1.5T/ 9.7T( 15%)
-------------------+----+-----+-----+-----+------------------+-----------------

Luc Simard

Apr 11, 2013, 10:40:57 AM
to isilon-u...@googlegroups.com
Definitely.

Which version of OneFS are you running at this time?

Ideally your maximum is 88%; that gives you enough headroom to stay afloat when you have snapshots enabled and in use. In your case this is cause for concern: if your SmartPools policies are set to migrate files to your x400 storage pool, you are in a 'no disk space' situation, so make sure you have spillover enabled.

Then I would start working to move content back to the s200/x200 pool for the short term, and/or chase down heavy users and clean some content off the x400 pool ASAP.
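
A rough sketch of the CLI side, for reference. Note this is 7.x-style syntax; on the 6.x releases the same settings live under SmartPools in the WebUI, so treat the exact subcommands and flags as assumptions and check the built-in help first:

    # Pool fill levels (the same view as the stats table above):
    isi status -p

    # Global SmartPools settings, including spillover and the VHS
    # reservation (assumed 7.x-style command; verify locally):
    isi storagepool settings view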




Chris Pepper

Apr 11, 2013, 10:59:52 AM
to isilon-u...@googlegroups.com
Note that VHS has two problems: it's too big, and it doesn't work.

1) We find that VHS reserves twice as many of each HDD drive type in the cluster as we specify. Isilon Support gave me a complicated equation, but it comes out to double. In a cluster with 1 TB and 2 TB drives, VHS set to 1 drive should reserve 3 TB (1 × 1 TB + 1 × 2 TB), but it actually (at least as of OneFS v6.5) reserves twice that much (6 TB).

2) VHS is intended to provide space for emergencies. If you run out of space, VHS should provide a buffer of extra space so the cluster can keep functioning. But as far as I can tell, VHS space is never actually *used* by the system. You need to *manually* remove or decrease VHS to return space to the disk pools before it offers any value. So when we had a full cluster which needed to migrate data, nothing worked right until I temporarily disabled VHS.
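
For what it's worth, "temporarily disabled VHS" looked roughly like the sketch below. This is 7.x-style syntax and the flag names are assumptions to verify on your release; on v6.5 we actually made the change through the SmartPools settings in the WebUI:

    # Record the current reservation before touching anything:
    isi storagepool settings view

    # Release the VHS reservation back to the pools (assumed flags):
    isi storagepool settings modify --virtual-hot-spare-limit-drives 0 \
        --virtual-hot-spare-limit-percent 0

    # ...free space, let jobs finish, then restore the reservation:
    isi storagepool settings modify --virtual-hot-spare-limit-drives 1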

Chris

Peter Serocka

Apr 11, 2013, 11:50:10 AM
to isilon-u...@googlegroups.com

On Thu, 11 Apr 2013, at 22:59, Chris Pepper wrote:
> 2) VHS is intended to provide space for emergencies. If you

Only for disk failures, actually. Providing spare space to a filled-up
pool can be done only manually, as reported. (One might write a script
for those kinds of emergencies...)
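
(A minimal sketch of such a script, suitable for cron. The parsing of `isi status -p` is an assumption, since the column layout differs between OneFS releases, and the threshold and mail address are placeholders to adapt:)

    #!/bin/sh
    # Warn when any pool's HDD usage reaches a threshold.
    THRESHOLD=88   # percent; the ceiling Luc suggested above

    full=$(isi status -p | awk -F'|' -v t="$THRESHOLD" '
        # Field 6 is the "HDD Used / Size" column in our output.
        NF >= 6 && $6 ~ /%/ {
            name = $1; gsub(/ /, "", name)
            pct = $6
            sub(/^.*\(>? */, "", pct)   # keep what follows "(" or "(>"
            sub(/%.*/, "", pct)         # drop "%)" and anything after
            if (pct + 0 >= t) printf "pool %s at %s%%+ used\n", name, pct
        }')

    [ -n "$full" ] && echo "$full" \
        | mail -s "Isilon pool(s) nearly full" storage-admins@example.com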

Can someone share any further experience:
with spillover configured and available,
are there user-noticeable effects when a pool runs full
(other than the possibly slower performance of the spillover pool)?
In other words, can a full pool behave unexpectedly
despite the spillover?

Good luck to gtjones.

Peter

Chris Pepper

Apr 11, 2013, 12:09:23 PM
to isilon-u...@googlegroups.com
We had a disk failure. The cluster thrashed for days without making any real progress getting out of the degraded state until I manually set VHS to 0.

There are also job priority problems. SnapshotDelete wasn't prioritized high enough, so I had to manually pause FlexProtect, set the sysctl to enable other jobs to run with FlexProtect outstanding, kick off SnapshotDelete, and then restore the sysctl and resume FlexProtect.
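
(For the archive, the dance was roughly the sketch below. I'm deliberately not reproducing the sysctl name from memory; the bracketed name is a placeholder, so get the real one from Isilon Support. `isi job` syntax also differs between 6.x and 7.x:)

    # Job names/IDs come from `isi job status`:
    isi job pause FlexProtect

    # Placeholder sysctl: the real name that allows other jobs to run
    # while FlexProtect is outstanding came from Support.
    sysctl efs.<allow-jobs-with-flexprotect>=1

    isi job start SnapshotDelete
    # ...wait for SnapshotDelete to free space...

    sysctl efs.<allow-jobs-with-flexprotect>=0
    isi job resume FlexProtect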

We were able to free additional space by running SmartPools to move snapshots into another pool, but we gave up using SmartPools this way because it created too much overhead and took too long.

Job priorities were improved from v6.0 to v6.5 but still not quite right.

Chris

Luc Simard

Apr 11, 2013, 4:19:57 PM
to isilon-u...@googlegroups.com
GTJones,

I just had a chat with your SE; please talk to him for advice. I believe the Isilon team can assist you in optimizing your current configuration.

LS


Peter Serocka

Apr 11, 2013, 11:33:55 PM
to isilon-u...@googlegroups.com
Thanks for sharing this.

We also found that SmartPools is too slow, and too
busy scanning irrelevant pools. We usually run it only
on individual subtrees.
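
(On newer releases there is a CLI form of this; a sketch, with the caveats that 7.x-style syntax is assumed, it is not available like this on 6.5, and the path is just an example:)

    # Apply the configured file pool policies to a single subtree,
    # instead of letting the SmartPools job walk the entire cluster:
    isi filepool apply /ifs/data/some/subtree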

Job priorities (6.5.5.*): SnapshotDelete used to
keep MediaScan from ever finishing on our cluster. ;-(

In general, during MediaScan's repair phase, each
SnapshotDelete cancels any progress and makes the
repair phase start over...

We have now set SnapshotDelete's priority lower than MediaScan's,
and MediaScan's policy to medium. Thus, MediaScan finishes
within just 3 days (NL108), during which no SnapshotDeletes
take place, which is OK for us.
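
(In command form this is roughly the following: a sketch in 7.x-style syntax, which is an assumption for our 6.5.5 clusters, where we set it through the job engine configuration. Note that OneFS priorities are numeric, with 1 the highest:)

    # Check current priorities and impact policies first:
    isi job types list -v

    # Run MediaScan at medium impact, and give SnapshotDelete a
    # numerically larger (i.e. lower) priority than MediaScan's:
    isi job types modify MediaScan --policy MEDIUM
    isi job types modify SnapshotDelete --priority 6   # example value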

Peter

Peter Serocka
CAS-MPG Partner Institute for Computational Biology (PICB)
Shanghai Institutes for Biological Sciences (SIBS)
Chinese Academy of Sciences (CAS)
320 Yue Yang Rd, Shanghai 200031, China
pser...@picb.ac.cn





Luc Simard

Apr 12, 2013, 2:19:59 AM
to isilon-u...@googlegroups.com
If SmartPools is perceived as scanning irrelevant pools, then we should revisit the cluster's configuration and optimize it.

MediaScan only checks media; it's not a cluster-critical job and has no impact on workflow. It only checks whether a drive is exceeding the ECC threshold and a few other values; if a drive shows too many errors, it will be marked for ejection via FlexProtect.

SnapshotDelete has a higher priority setting; given that it runs on a regular basis, it will maintain proper management based on the policy setting. Do not delete snapshots out of order; always go from oldest to newest.

If you feel you are not getting adequate performance from the job engine, I recommend you open a case with Support; there are possibilities for fine-tuning.

Luc Simard - 415-793-0989
simard.j...@gmail.com
Messages may contain confidential information.
Sent from my iPhone

Peter Serocka

Apr 12, 2013, 3:14:44 AM
to Luc Simard, isilon-u...@googlegroups.com

On 12 Apr 2013, at 14:19, Luc Simard wrote:

> If SmartPools is perceived as scanning irrelevant pools, then we should revisit the cluster's configuration and optimize it.

It's by design, as confirmed by Isilon, because it operates on
"the cluster" rather than on the pools individually.

So even if you have just a perf and an nl pool,
and a single rule for migrating inactive files
to the nl pool, SmartPools will scan both pools to find
inactive files (according to the rule), only to find
most of them already on nl...

SmartPools also does some balancing, somewhat similar to AutoBalance,
which can produce further unexpected load. This is also confirmed by Isilon.

> MediaScan only checks media; it's not a cluster-critical job and has no impact on workflow. It only checks whether a drive is exceeding the ECC threshold and a few other values; if a drive shows too many errors, it will be marked for ejection via FlexProtect.
>

I just feel better if the monthly scheduled MediaScan jobs
do finish properly. MediaScan surely was built in on purpose,
and with a multi-100 TB single filesystem I want everything
checked and fixed as designed.


> SnapshotDelete has a higher priority setting; given that it runs on a regular basis, it will maintain proper management based on the policy setting. Do not delete snapshots out of order; always go from oldest to newest.

Yes, asking for deletion of individual snapshots
out of the order they were taken produces avoidable overhead.

But holding SnapshotDelete at rest (by priority, or by policy/schedule)
should not change the order; the snapshot usage just piles
up for that period (3 days for us, but that's fine).

>
> If you feel you are not getting adequate performance from the job engine, I recommend you open a case with Support; there are possibilities for fine-tuning.

Thanks, we have been through this (it took us most of 2012, 'thanks'
to additional disk stall issues), and this is where we have arrived.

Some occasional disk stalls still keep MultiScan (AutoBalance/Collect)
from ever finishing properly :-((( but that is another story...
and yes, most likely another case.

Our Isilon cluster is the main NAS storage over here,
and while it has proven pretty stable from the users' view,
controlling and maintaining the mechanisms
under the hood (media/balance/locks/alerts/notifications)
takes much more work than expected.

Thanks for your supportive feedback anyway!

Peter

Chris Pepper

Apr 12, 2013, 8:37:03 AM
to isilon-u...@googlegroups.com
On Apr 12, 2013, at 2:19 AM, Luc Simard <simard.j...@gmail.com> wrote:

> If SmartPools is perceived as scanning irrelevant pools, then we should revisit the cluster's configuration and optimize it.

No, it was moving snapshots from X nodes to NL nodes. It was doing real work but imposing unacceptable load on the cluster.

> MediaScan only checks media; it's not a cluster-critical job and has no impact on workflow. It only checks whether a drive is exceeding the ECC threshold and a few other values; if a drive shows too many errors, it will be marked for ejection via FlexProtect.

But MediaScan needs to run eventually, and we have several times encountered situations where it never completes.

> SnapshotDelete has a higher priority setting; given that it runs on a regular basis, it will maintain proper management based on the policy setting. Do not delete snapshots out of order; always go from oldest to newest.

How could I even influence this? I just run SnapshotDelete, which removes snapshots in whatever order it chooses.

> If you feel you are not getting adequate performance from the job engine, I recommend you open a case with Support; there are possibilities for fine-tuning.

Oh, I have! Techs are happy to discuss how to manipulate priorities on our clusters, but nobody seems interested in problems with the default priorities chosen -- aside from the changes introduced in v6.5.

Peter Serocka

Apr 12, 2013, 8:44:30 AM
to Chris Pepper, isilon-u...@googlegroups.com
Thanks Chris...

How did you finally deal with MediaScan?

Peter

Chris Pepper

Apr 12, 2013, 8:52:24 AM
to Peter Serocka, isilon-u...@googlegroups.com
Peter,

We set up a new OFF_HOURS policy so MediaScan could run as long as necessary without blocking other maintenance. Also, some revision of v6.5 reduced the frequency with which group changes initiate MediaScan.
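
(For reference, on later releases attaching such a policy is a one-liner; a sketch assuming 7.x-style syntax, since we built the policy itself in the v6.5 WebUI:)

    # OneFS ships a default OFF_HOURS impact policy; pointing MediaScan
    # at it lets the job run for weeks without competing with daytime work:
    isi job types modify MediaScan --policy OFF_HOURS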

Chris

Peter Serocka

Apr 12, 2013, 9:07:13 AM
to Chris Pepper, isilon-u...@googlegroups.com
I see. Perhaps keep an eye on MediaScan: if it finds ECC errors,
there will be a surprisingly long repair phase (5/7),
which is sensitive to interruptions. Any single SnapshotDelete
will restart the repair phase with all prior repair progress ignored.
If the interval between SnapshotDeletes is shorter than the
repair phase, it can hardly succeed.
(Observed, and confirmed by support.)

Peter

Chris Pepper

Apr 12, 2013, 9:25:43 AM
to Peter Serocka, isilon-u...@googlegroups.com
Peter,

That's broken. Thanks for the information, which none of the many techs I spoke to had.

Chris