Jobs (MultiScan/Collect/AutoBalance), Priorities, & Stalls

Chris Pepper

Oct 5, 2011, 2:32:51 PM
to isilon-u...@googlegroups.com, jerry
Is anyone else having trouble with priorities? We have a couple of
6.0.3.10 clusters, and on both we have had a lot of trouble with MultiScan.
Whenever MultiScan is running, deletion of snapshotted files changes
from an instantaneous operation to a slow one, and our users get NFS hangs.

We got MultiScan to complete once and then turned it off. We then had
the same problem with Collect. MultiScan combines Collect and
AutoBalance, and if MultiScan runs, Collect & AutoBalance are skipped. We
disabled Collect, although we will probably need to run it manually at
some point.
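
For the record, when we do get around to running Collect manually, it
should just be a matter of kicking the job off from the CLI, roughly
like this (I'm writing this from memory, so double-check the exact
syntax on your release):

  # Kick off a one-time Collect run and then check on it.
  # (Command form is from memory -- verify it against your OneFS version.)
  isi job start Collect
  isi job status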

Then we started having trouble with need. need doesn't have the same
effect on deletion, but it does appear to scan all (used?) blocks in the
cluster, and takes a long time -- 6 days before we aborted last time.

Unfortunately, drive stalls prevent MultiScan, Collect, and AutoBalance
from completing. We eventually got a recipe from Isilon to log stalls
but suppress launching the jobs which stalls normally trigger, and that
has helped.

When Collect ran, we discovered that it is prioritized above
SnapshotDelete and SmartPools, both of which we use to reclaim space in
our more constrained disk pool. That disk pool ran out of space because
of all the old unreclaimed snapshots and because data was not being
migrated out via SmartPools, and we had to do a lot of juggling.

The solution here was to reprioritize AutoBalance below (i.e., give it
a higher priority number than) SnapshotDelete, SmartPools, and
QuotaScan, so those other jobs can reclaim space and AutoBalance can
then resume.
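
For anyone wanting to do the same thing, the change itself is just
giving AutoBalance a larger priority number. The exact command depends
on the OneFS release -- the sketch below uses the newer "isi job types"
CLI form, and the value 6 is only an example, so check the syntax and
defaults on your own version:

  # Push AutoBalance below SnapshotDelete, SmartPools, and QuotaScan by
  # giving it a larger priority number (larger number = lower priority).
  # The command form and the value 6 are illustrative, not gospel.
  isi job types modify AutoBalance --priority 6
  isi job types list        # confirm where AutoBalance now sits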

Chris

Richard

Nov 29, 2011, 5:05:50 PM
to isilon-u...@googlegroups.com
How many stalls do you have? At this point I'm seeing one every couple of days (we had a lot more until we swapped out a marginal drive).

Isilon is telling me that stalls may be a vibration issue - they want me to test a command to lower the fan speed as the fans in our 72000X nodes apparently vibrate too much and can cause occasional disk errors. If the command helps they will send someone to swap out all of our fans. Frankly I doubt that the fans cause as much vibration as our Lieberts do but it's worth investigating.

"grep stalled /var/log/messages" finds the stalls. The last argument (e.g., stalled: 1:15) is the node number and logical drive number of the drive. "isi device" run on the node shows the mapping to the actual bay. Lots of events from the same drive indicate that it needs to be replaced.

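A quick-and-dirty way to see which drive is the repeat offender (this assumes the node:drive pair really is the last field of the log line, as above):

  # Tally stall events per node:drive pair; the pair with the largest
  # count is the drive to look at replacing.
  grep stalled /var/log/messages | awk '{print $NF}' | sort | uniq -c | sort -rn
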
Chris Pepper

Nov 29, 2011, 5:16:14 PM
to isilon-u...@googlegroups.com, Richard
On 11/29/11 5:05 PM, Richard wrote:
> How many stalls do you have? At this point I'm seeing one every couple
> of days (we had a lot more until we swapped out a marginal drive).

Our main cluster's /var/log/messages actually shows 3 different kinds
of "stalled" events since March 24th, mostly against
bam_safe_write.c:7164. It looks like our *drive* stalls ended in August
when we replaced the last troublesome drive.

We also got *many* 150ms delay warnings until we raised the threshold
to 300ms, and still see some.


> Isilon is telling me that stalls may be a vibration issue - they want me
> to test a command to lower the fan speed as the fans in our 72000X nodes
> apparently vibrate too much and can cause occasional disk errors. If the
> command helps they will send someone to swap out all of our fans.
> Frankly I doubt that the fans cause as much vibration as our Lieberts do
> but it's worth investigating.

We replaced a bunch of fans and upgraded our drive firmware, which
helped. Note that Isilon has no way to upgrade drive firmware without
node downtime, so we sent back our shelf spare drives (which were
downrev) to get current drives.

Chris

J. Lasser

Nov 30, 2011, 6:01:31 PM
to isilon-u...@googlegroups.com, Richard
It's probably worth mentioning here that, between 6.0.0 and 6.5.3, we
had incorrect drive-stall interrupt code that generated many false
stalls under high load. Those bugs are fixed in 6.5.4, and we also
fixed a number of issues around post-stall job kickoff to reduce false
stalls further.

For all 6.x customers, I would highly commend the latest 6.5.4 GA
release, in no small part because of these fixes.
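
If you want to confirm what a cluster is currently running before
scheduling anything, something along these lines from any node should
do it:

  # Report the installed OneFS release on every node.
  # (isi_for_array runs the quoted command cluster-wide.)
  isi_for_array 'isi version'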

Jon Lasser

--
Jon Lasser                     j...@lasser.org                      206-326-0614
. . . and the walls became the world all around . . .  (Maurice Sendak)

Jerry Uanino

Nov 30, 2011, 7:49:32 PM
to isilon-u...@googlegroups.com, Richard
It's also worth mentioning my previous thread, where I was whining about hangdumps filling /var/crash; those turned out to be due to excessive drive stalls.
I've replaced 3 suspect drives and we'll see if that fixes it.
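
In the meantime, a rough way to keep an eye on how full /var/crash is getting on each node (isi_for_array just runs the quoted command cluster-wide):

  # Watch hangdump space usage across the cluster; growing numbers suggest
  # the stalls (and the dumps they trigger) are still happening.
  isi_for_array 'du -sh /var/crash'
  isi_for_array 'df -h /var/crash'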

Richard Kunert

Nov 30, 2011, 8:28:00 PM
to isilon-u...@googlegroups.com
Yes, I know I need to find a way to fit in the upgrade to 6.5.4. It's only a matter of scheduling an outage for about 100 servers…

Richard

jerry

Dec 2, 2011, 3:47:52 PM
to Isilon Technical User Group
100?!
We have 720!