Long running jobs


Lloyd

May 6, 2013, 12:44:53 PM
to isilon-u...@googlegroups.com
Since upgrading from 6.5.x.x to 7.1.0.3, all the admin jobs have been very slow to finish.
Of late it is AutoBalance that we are fighting with.  It has been running for 8 days on this attempt.
It gets to phase 2, and after a while the Progress field stops changing.
Does that mean it's stuck? 
Or is there a better way to monitor progress of Autobalance?

Chris Pepper

May 6, 2013, 1:42:20 PM
to isilon-u...@googlegroups.com
Under OneFS v6 it's "isi job status -v". I don't know if that still works in v7.

Chris

Andrew Stack

May 6, 2013, 1:56:45 PM
to isilon-u...@googlegroups.com
The command isi job status -v still works in 7 and it is the best way to monitor job progress.  How many nodes do you have and what type(s)?
--
Andrew Stack
Sr. Storage Administrator
Genentech

Saker Klippsten

May 6, 2013, 1:58:58 PM
to isilon-u...@googlegroups.com
The output of these commands would be very handy:


isi stat -q -d
isi job status -v
isi_for_array -s "uptime"     (to get load averages)

Jason Davis

May 6, 2013, 2:14:43 PM
to isilon-u...@googlegroups.com
Should still work. 

I ran into similar issues when rolling from 7.0.1.2 to 7.0.1.4. Canceling that MultiScan job (this bad boy ran for 6 days!) and then running the jobs manually with isi job start collect and isi job start autobalance seems to have worked on my cluster.

Afterwards, running MultiScan seems to be working. That said, I only have 60 TB of data on my cluster, so times will of course vary.

Also check whether you see any "Block" entries in /var/log/messages:

grep "Block" /var/log/messages

Add isi_for_array, of course, to make this run across the cluster.
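For example, something like this should run it on every node (just a sketch; note the single quotes so the inner double quotes survive, and it assumes /var/log/messages is the log path on each node, as above):

isi_for_array -s 'grep "Block" /var/log/messages'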


Saker Klippsten

May 6, 2013, 2:34:54 PM
to isilon-u...@googlegroups.com
Also, if you are on 7.0.1.4 or earlier, I would upgrade ASAP to OneFS 7.0.1.5. There are a number of bug fixes.
This one is the most important, as it affects your data:

** #103260
If a FlexProtectLin job was running for a smartfailed drive when a second drive failed, data on the second drive was not repaired.


https://support.emc.com/docu46984_OneFS-7.0.1.5-Release-Notes.pdf?language=en_US




Lloyd

May 6, 2013, 2:41:12 PM
to isilon-u...@googlegroups.com
Yes, isi job status -v still works, but after a few days the "Progress" field doesn't change.
For example, it may say "Processed 32458883 lins..." and stay that way for days.

Lloyd

May 6, 2013, 2:43:52 PM
to isilon-u...@googlegroups.com
The cluster is 8 nodes: 5 S200s and 3 X200s.
The S200s are the ones out of balance.


Jason Davis

May 6, 2013, 2:44:14 PM
to isilon-u...@googlegroups.com
Yeah, saw that... it gives me the jitters. Will be attempting a rolling update soon :)

Lloyd

May 6, 2013, 2:48:17 PM
to isilon-u...@googlegroups.com

PRIISI-1# isi stat -q -d
Cluster Name: PRIISI
Cluster Health:     [  OK ]
Cluster Storage:  HDD                 SSD
Size:             157T (157T Raw)     1.8T (1.8T Raw)
VHS Size:         0
Used:             61T (39%)           125G (7%)
Avail:            96T (61%)           1.7T (93%)

                           Throughput (bps)  HDD Storage      SSD Storage
Name               Health|  In   Out  Total| Used / Size     |Used / Size
-------------------+-----+-----+-----+-----+-----------------+-----------------
s200_13tb_400gb-ssd|  OK | 9.0M|  35M|  44M|  48T/  59T( 81%)| 125G/ 1.8T(  7%)
  _48gb-ram        |     |     |     |     |                 |
x200_36tb_12gb     |  OK |  55M| 8.1M|  63M|  12T/  97T( 13%)|    (No SSDs)
-------------------+-----+-----+-----+-----+-----------------+-----------------
PRIISI-1# isi job status -v
Running jobs:
Job                        Impact Pri Policy     Phase Run Time
-------------------------- ------ --- ---------- ----- ----------
AutoBalance[52814]         Low    4   LOW        2/5   8d 21:06
        Progress: Processed 32458883 lins; 0 zombies and 0 errors
Paused and waiting jobs:
Job                        Impact Pri Policy     Phase Run Time   State
-------------------------- ------ --- ---------- ----- ---------- -------------
MediaScan[50899]           Low    8   LOW        1/7   0:00:00    Waiting
        Progress: n/a
FSAnalyze[52868]           Low    6   LOW        1/2   0:00:00    Waiting
        Progress: n/a
QuotaScan[52824]           Low    6   LOW        1/2   0:00:00    Waiting
        Progress: n/a
SmartPools[52637]          Low    6   LOW        1/2   0:00:37    Waiting
        Progress: Processed 35604 LINs and 31 GB: 31018 files, 4586 directories;
        0 errors Block Based Estimate: 20h 41m 1s Remaining (0% Complete)
MultiScan[53244]           Low    4   LOW        1/4   0:00:00    Waiting
        (Actions: Collect, AutoBalance)
        Progress: n/a
No failed jobs.
Recent job results:
Time            Job                        Event
--------------- -------------------------- ------------------------------
05/05 00:00:06  ShadowStoreDelete[53247]   Succeeded (LOW)
05/05 01:07:34  SnapshotDelete[53248]      Succeeded (MEDIUM)
05/05 04:36:40  SnapshotDelete[53249]      Succeeded (MEDIUM)
05/05 10:39:15  SnapshotDelete[53250]      Succeeded (MEDIUM)
05/05 12:10:10  SnapshotDelete[53251]      Succeeded (MEDIUM)
05/05 20:24:08  SnapshotDelete[53252]      Succeeded (MEDIUM)
05/05 20:44:22  SnapshotDelete[53253]      Succeeded (MEDIUM)
05/06 01:14:49  SnapshotDelete[53254]      Succeeded (MEDIUM)
PRIISI-1#
PRIISI-1:  2:47PM  up 57 days, 13:39, 4 users, load averages: 0.54, 0.71, 0.71
PRIISI-2:  2:47PM  up 57 days, 13:30, 0 users, load averages: 0.61, 0.32, 0.22
PRIISI-3:  2:47PM  up 57 days, 13:37, 1 user, load averages: 0.07, 0.21, 0.24
PRIISI-4:  2:47PM  up 2 days, 5 mins, 0 users, load averages: 0.20, 0.23, 0.24
PRIISI-5:  2:47PM  up 57 days, 13:07, 3 users, load averages: 0.32, 0.30, 0.26
PRIISI-6:  2:47PM  up 30 days, 20:04, 0 users, load averages: 2.46, 0.89, 0.36
PRIISI-7:  2:47PM  up 30 days, 19:49, 0 users, load averages: 4.93, 4.71, 3.31
PRIISI-8:  2:47PM  up 30 days, 19:40, 0 users, load averages: 2.02, 1.52, 1.15

Lloyd

May 6, 2013, 2:52:17 PM
to isilon-u...@googlegroups.com
I do see some "Blocked" entries on node 1. What does that mean?
The most recent one is over a week ago, though.

Saker Klippsten

May 6, 2013, 3:01:01 PM
to isilon-u...@googlegroups.com
When you say "out of balance," do you mean that just on the S200s the data is not spread evenly across the 5 of them?
What is the output of "isi stat -q"?

Or are you saying there is more data on the S200s and there should be more on the X200s? If that's the case, I would pause or kill the AutoBalance job, change the priority of the SmartPools job to, say, 3, and then start SmartPools, which (if your SmartPools policies are set up properly) will move the data to the X200s. A rough sketch of the commands is below.

Then MultiScan will come back around, which runs a Collect and an AutoBalance at the same time, and see if that works.
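Something like this, just a rough sketch (I don't remember offhand the exact flag for raising a job's priority on 7.x, so check the isi job usage output before running anything; the job names match the ones in your status listing):

isi job pause autobalance
isi job start smartpools
isi job status -v     (to confirm SmartPools is now the one running)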


-s

Lloyd

May 6, 2013, 5:09:14 PM
to isilon-u...@googlegroups.com
Data on the S200 "tier" is not spread evenly across all 5 S200 nodes. The output is below; the S200s are nodes 1-5.
 
PRIISI-1# isi status -q
Cluster Name: PRIISI
Cluster Health:     [  OK ]
Cluster Storage:  HDD                 SSD
Size:             157T (157T Raw)     1.8T (1.8T Raw)
VHS Size:         0
Used:             61T (39%)           126G (7%)
Avail:            96T (61%)           1.7T (93%)
                   Health  Throughput (bps)  HDD Storage      SSD Storage
ID |IP Address     |DASR |  In   Out  Total| Used / Size     |Used / Size
-------------------+-----+-----+-----+-----+-----------------+-----------------
  1|10.252.0.50    | OK  | 2.0K|   25| 2.0K|  10T/  12T( 84%)|  25G/ 367G(  7%)
  2|10.252.0.51    | OK  | 1.1M| 247K| 1.3M| 9.6T/  12T( 81%)|  25G/ 367G(  7%)
  3|10.252.0.52    | OK  | 394K|  320| 394K|  10T/  12T( 84%)|  25G/ 367G(  7%)
  4|10.252.0.53    | OK  | 483M|  27M| 509M| 9.4T/  12T( 79%)|  25G/ 367G(  7%)
  5|10.252.0.54    | OK  | 2.7M| 754K| 3.4M| 9.5T/  12T( 80%)|  25G/ 367G(  7%)
  6|10.252.0.60    | OK  |    0| 360K| 360K| 4.2T/  32T( 13%)|    (No SSDs)
  7|10.252.0.61    | OK  |    0|  13M|  13M| 4.2T/  32T( 13%)|    (No SSDs)
  8|10.252.0.62    | OK  |  92M|  512|  92M| 4.2T/  32T( 13%)|    (No SSDs)
-------------------+-----+-----+-----+-----+-----------------+-----------------
Cluster Totals:          | 579M|  41M| 620M|  61T/ 157T( 39%)| 126G/ 1.8T(  7%)

Erik Weiman

May 6, 2013, 5:13:41 PM
to isilon-u...@googlegroups.com
While they aren't perfectly balanced, Isilon has spec'd AutoBalance to balance to within 5% on each node.
You can get imbalanced when drives get failed out.

Is there a reason you are running AutoBalance on its own rather than using MultiScan, which runs Collect and AutoBalance together?

--
Erik Weiman 
Sent from my iPhone 4

Lloyd

May 6, 2013, 5:26:26 PM
to isilon-u...@googlegroups.com
It was running after a disk was replaced; we speculate that they started it when replacing the disk. I've gotten conflicting answers from support (they do confirm it will not start all by itself). They said AutoBalance was shooting for a 1% difference. One guy said to cancel it and let the nightly MultiScan handle this, and later told us they recommended running AutoBalance. Support also told us we need to get that pool below 80% used or we will have performance problems, but it can't carry out our file pool policy because the SmartPools job is waiting on AutoBalance.

Trinh Tran

May 6, 2013, 5:45:33 PM
to isilon-u...@googlegroups.com
Sounds like the job is hung to me. It may be worth canceling it and letting it start on its own again at its next scheduled run.
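For what it's worth, the cancel should be roughly this (a sketch; you can also reference the job ID instead, e.g. the 52814 shown in the status output above):

isi job cancel autobalance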

Trinh Tran

Sent from my iPhone