We got MultiScan to complete once and then turned it off. We then had
the same problem with Collect. MultiScan combines Collect and
AutoBalance, and if MultiScan runs, Collect and AutoBalance are skipped.
We disabled Collect, although we will probably need to run it manually
at some point.
Then we started having trouble with need. need doesn't have the same
effect on deletion, but it does appear to scan all (used?) blocks in the
cluster, and takes a long time -- 6 days before we aborted last time.
Unfortunately, drive stalls prevent MultiScan, Collect, and AutoBalance
from completing. We eventually got a recipe from Isilon that logs stalls
but suppresses the jobs that stalls normally trigger, and that has
helped.
When Collect ran, we discovered that it is prioritized above
SnapshotDelete and SmartPools, both of which we use to reclaim space in
our more constrained Disk Pool. The disk pool ran out of space because
old snapshots were not being reclaimed and data was not being migrated
out via SmartPools, and we had to do a lot of juggling.
The solution here was to reprioritize AutoBalance below (i.e. give it a
higher priority number than) SnapshotDelete, SmartPools, and QuotaScan,
so those other jobs can reclaim space first and AutoBalance can then
resume.
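
To illustrate the reordering, here is a minimal Python sketch. It only models the rule described above (a lower priority number runs first). It does not call any Isilon command, and the priority numbers are made-up placeholders, not the cluster's real settings.

```python
# Toy model of job priorities: lower number = runs first.
# All numeric values below are hypothetical, chosen only for illustration.

def run_order(priorities):
    """Return job names sorted by priority number, then by name for ties."""
    return sorted(priorities, key=lambda job: (priorities[job], job))

# Hypothetical "before" state: AutoBalance outranks the space-reclaiming jobs.
before = {"AutoBalance": 4, "SnapshotDelete": 6, "SmartPools": 6, "QuotaScan": 6}

# "After" state: AutoBalance is given a higher number so it runs last.
after = dict(before, AutoBalance=7)

print(run_order(before))
print(run_order(after))
```

With AutoBalance moved to the largest number, the three space-reclaiming jobs sort ahead of it, which is the effect we were after.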
Chris
Our main cluster's /var/log/messages actually shows 3 different kinds
of "stalled" events since March 24th, mostly against
bam_safe_write.c:7164. It looks like our *drive* stalls ended in August
when we replaced the last troublesome drive.
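
For anyone who wants to tally their own logs the same way, a rough Python sketch follows. It assumes only that the relevant lines contain the word "stalled" and, where present, a `file.c:line` reference. The sample lines are invented placeholders and are not real OneFS output, so the pattern may need adjusting against an actual /var/log/messages.

```python
import re
from collections import Counter

# Matches a C source reference such as "bam_safe_write.c:7164".
SOURCE_REF = re.compile(r"(\w+\.c):(\d+)")

def count_stalls(lines):
    """Count lines mentioning 'stalled', grouped by source reference."""
    counts = Counter()
    for line in lines:
        if "stalled" not in line.lower():
            continue
        match = SOURCE_REF.search(line)
        if match:
            key = f"{match.group(1)}:{match.group(2)}"
        else:
            key = "(no source reference)"
        counts[key] += 1
    return counts

# Invented placeholder lines, NOT real log output.
sample = [
    "placeholder: operation stalled at bam_safe_write.c:7164",
    "placeholder: operation stalled at bam_safe_write.c:7164",
    "placeholder: drive stalled",
    "placeholder: unrelated message",
]

for key, n in count_stalls(sample).most_common():
    print(n, key)
```

To run it against a real log, pass an open file object instead of `sample`, for example `count_stalls(open("/var/log/messages", errors="replace"))`.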
We also got *many* 150ms delay warnings until we raised the threshold
to 300ms, and still see some.
> Isilon is telling me that stalls may be a vibration issue - they want me
> to test a command to lower the fan speed as the fans in our 72000X nodes
> apparently vibrate too much and can cause occasional disk errors. If the
> command helps they will send someone to swap out all of our fans.
> Frankly I doubt that the fans cause as much vibration as our Lieberts do
> but it's worth investigating.
We replaced a bunch of fans and upgraded our drive firmware, which
helped. Note that Isilon has no way to upgrade drive firmware without
node downtime, so we sent back our shelf spare drives (which were
downrev) to get current drives.
Chris
For all 6.x customers, I would highly recommend the latest 6.5.4 GA
release, in no small part because of these fixes.
Jon Lasser
--
Jon Lasser j...@lasser.org 206-326-0614
. . . and the walls became the world all around . . . (Maurice Sendak)
Richard