Sync Snapshots - Problems

scott

Dec 21, 2016, 8:06:40 PM

to Isilon Technical User Group

Hello folks -

I'm having trouble with the time it takes SnapshotDelete to clean up snapshots that are created during sync operations.

I first noticed the trouble on my 'fast/small' 20TB cluster: it should have had 75% free space, but it filled up because the sync policy was running too often. [There were few changes but many tiny files.]

I'd like to prevent this torpedo from sinking my big/main cluster, but the command to check fails to complete -

On the fast/small cluster this is my command:

isi snapshot snapshots list --state=deleting --format=table -v

On the big/main cluster, the same command fails with:

CLI timeout exceeded while waiting for the server to respond; the request still may have completed. Use the --timeout options to adjust the timeout and try again.

I'm hoping someone knows a trick to obtain a list of snapshots in 'deleting' state so I can figure out which job is the troublemaker.

Anyone else fighting this one?

Thanks

Scott

Saker Klippsten

Dec 21, 2016, 8:23:01 PM

to isilon-u...@googlegroups.com

What does "isi job history" show you?

Example of my output

JFK-1 # isi job history

Job events:
Time            Job                        Event
--------------- -------------------------- ------------------------------
12/21 15:09:19 MultiScan[31605]           Running (HIGH)
12/21 15:09:17 SnapshotDelete[31744]      Succeeded (MEDIUM)
12/21 15:09:17 SnapshotDelete[31744]      Phase 1: end delete
12/21 15:08:26 SnapshotDelete[31744]      Phase 1: begin delete
12/21 15:08:26 SnapshotDelete[31744]      Running (MEDIUM)
12/21 15:08:25 MultiScan[31605]           Waiting
12/21 15:08:23 SnapshotDelete[31744]      Waiting
12/21 13:08:14 MultiScan[31605]           Running (HIGH)
12/21 13:08:12 SnapshotDelete[31743]      Succeeded (MEDIUM)
12/21 13:08:11 SnapshotDelete[31743]      Phase 1: end delete

S

--
You received this message because you are subscribed to the Google Groups "Isilon Technical User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isilon-user-group+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

scott

Dec 21, 2016, 8:38:19 PM

to Isilon Technical User Group

On Wednesday, December 21, 2016 at 5:23:01 PM UTC-8, Saker Klippsten wrote:

> What does "isi job history" show you?

mbisilon-2# isi_classic job history --limit 200

Job events:
Time            Job                        Event
--------------- -------------------------- ------------------------------
12/21 17:11:05 SnapshotDelete[33228]      Policy MEDIUM -> HIGH
12/21 09:54:21 TreeDelete[33233]          Succeeded (MEDIUM)
12/21 09:54:20 TreeDelete[33233]          Phase 1: end delete
12/21 09:35:44 TreeDelete[33233]          Phase 1: begin delete
12/21 09:35:43 TreeDelete[33233]          Running (MEDIUM)
12/21 09:35:42 TreeDelete[33232]          Succeeded (MEDIUM)
12/21 09:35:41 TreeDelete[33232]          Phase 1: end delete
12/21 09:17:06 TreeDelete[33233]          Waiting
12/21 09:16:50 TreeDelete[33232]          Phase 1: begin delete
12/21 09:16:50 TreeDelete[33232]          Running (MEDIUM)
12/21 09:16:49 TreeDelete[33232]          Waiting
12/18 08:31:59 FSAnalyze[33231]           Phase 1: begin scan
12/18 08:31:58 FSAnalyze[33231]           Running (MEDIUM)
12/18 08:31:58 FSAnalyze[33231]           Waiting
12/18 08:31:57 FSAnalyze[33212]           Succeeded (MEDIUM)
12/18 08:31:54 FSAnalyze[33212]           Phase 2: end merge
12/18 08:09:37 FSAnalyze[33212]           Running (MEDIUM)
12/18 08:09:37 FSAnalyze[33212]           Waiting
12/18 06:22:09 FSAnalyze[33212]           Running (MEDIUM)
12/18 06:22:09 FSAnalyze[33212]           Waiting
12/18 04:35:33 FSAnalyze[33212]           Running (MEDIUM)
12/18 04:35:33 FSAnalyze[33212]           Waiting
12/18 02:48:38 FSAnalyze[33212]           Running (MEDIUM)
12/18 02:48:37 SnapshotDelete[33228]      Running (MEDIUM)
12/18 02:47:57 FSAnalyze[33212]           Waiting
12/18 02:47:57 SnapshotDelete[33228]      Waiting
12/18 02:40:58 FSAnalyze[33212]           System Paused
12/18 02:40:55 SnapshotDelete[33228]      System Paused
12/18 02:40:48 FSAnalyze[33212]           Running (MEDIUM)
12/18 02:40:48 SnapshotDelete[33228]      Running (MEDIUM)
12/18 02:40:13 FSAnalyze[33212]           Waiting
12/18 02:40:13 SnapshotDelete[33228]      Waiting
12/18 02:34:34 SnapshotDelete[33228]      System Paused
12/18 02:34:06 SnapshotDelete[33228]      Running (MEDIUM)
12/18 02:34:06 SnapshotDelete[33228]      Waiting
12/18 02:30:54 FSAnalyze[33212]           Running (MEDIUM)
12/18 02:30:54 SnapshotDelete[33228]      Running (MEDIUM)
12/18 02:29:06 SnapshotDelete[33228]      Waiting
12/18 02:28:31 FSAnalyze[33212]           Running (MEDIUM)
12/18 02:28:30 SnapshotDelete[33228]      Running (MEDIUM)
12/18 02:27:50 FSAnalyze[33212]           Waiting
12/18 02:27:50 SnapshotDelete[33228]      Waiting
12/18 02:20:40 SnapshotDelete[33228]      System Paused
12/18 02:20:10 SnapshotDelete[33228]      Running (MEDIUM)
12/18 02:20:10 SnapshotDelete[33228]      Waiting
12/18 02:15:07 FSAnalyze[33212]           Running (MEDIUM)
12/18 02:15:07 SnapshotDelete[33228]      Running (MEDIUM)
12/18 02:12:39 SnapshotDelete[33228]      Waiting
12/18 02:12:38 FSAnalyze[33212]           Waiting
12/18 02:12:22 FSAnalyze[33212]           Running (MEDIUM)
12/18 02:12:21 SnapshotDelete[33228]      Running (MEDIUM)
12/18 02:11:41 FSAnalyze[33212]           Waiting
12/18 02:11:41 SnapshotDelete[33228]      Waiting
12/18 02:03:14 FSAnalyze[33212]           System Paused
12/18 02:03:12 SnapshotDelete[33228]      System Paused
12/18 02:02:58 FSAnalyze[33212]           Running (MEDIUM)
12/18 02:02:57 SnapshotDelete[33228]      Running (MEDIUM)
12/18 02:02:18 FSAnalyze[33212]           Waiting
12/18 02:02:18 SnapshotDelete[33228]      Waiting
12/18 01:54:37 FSAnalyze[33212]           System Paused
12/18 01:54:36 SnapshotDelete[33228]      System Paused
12/18 01:54:24 FSAnalyze[33212]           Running (MEDIUM)
12/18 01:54:24 SnapshotDelete[33228]      Running (MEDIUM)
12/18 01:53:44 FSAnalyze[33212]           Waiting
12/18 01:53:44 SnapshotDelete[33228]      Waiting
12/18 01:45:08 FSAnalyze[33212]           System Paused
12/18 01:45:04 MultiScan[33229]           System Cancelled
12/18 01:45:04 SnapshotDelete[33228]      System Paused
12/18 01:45:02 MultiScan[33229]           Running (MEDIUM)
12/18 01:44:54 FSAnalyze[33212]           Running (MEDIUM)
12/18 01:44:54 SnapshotDelete[33228]      Running (MEDIUM)
12/18 01:44:20 MultiScan[33229]           Waiting
12/18 01:44:20 FSAnalyze[33212]           Waiting
12/18 01:44:20 SnapshotDelete[33228]      Waiting
12/18 01:36:22 MultiScan[33229]           System Paused
12/18 01:36:15 FSAnalyze[33212]           System Paused
12/18 01:36:15 SnapshotDelete[33228]      System Paused
12/18 01:35:33 MultiScan[33229]           Waiting
12/18 01:35:33 FSAnalyze[33212]           Waiting
12/18 01:35:33 SnapshotDelete[33228]      Waiting
12/18 01:28:12 SnapshotDelete[33228]      System Paused
12/18 01:28:11 FSAnalyze[33212]           System Paused
12/18 01:28:11 MultiScan[33229]           System Paused
12/18 01:27:34 FSAnalyze[33212]           Running (MEDIUM)
12/18 01:27:33 SnapshotDelete[33228]      Running (MEDIUM)
12/18 01:26:54 MultiScan[33229]           Waiting
12/18 01:26:54 FSAnalyze[33212]           Waiting
12/18 01:26:54 SnapshotDelete[33228]      Waiting
12/18 01:18:34 MultiScan[33229]           System Paused
12/18 01:18:31 FSAnalyze[33212]           System Paused
12/18 01:18:31 SnapshotDelete[33228]      System Paused
12/18 01:17:48 MultiScan[33229]           Waiting
12/18 01:17:48 FSAnalyze[33212]           Waiting
12/18 01:17:48 SnapshotDelete[33228]      Waiting
12/18 01:10:08 SnapshotDelete[33228]      System Paused
12/18 01:10:04 MultiScan[33229]           System Paused
12/18 00:09:53 MultiScan[33229]           Running (MEDIUM)
12/18 00:09:50 FSAnalyze[33212]           Running (MEDIUM)
12/18 00:09:50 SnapshotDelete[33228]      Running (MEDIUM)
12/18 00:09:35 SnapshotDelete[33228]      Waiting
12/18 00:09:35 MultiScan[33229]           Waiting
12/18 00:09:00 MultiScan[33229]           Running (MEDIUM)
12/18 00:09:00 SnapshotDelete[33228]      Running (MEDIUM)
12/18 00:08:39 SnapshotDelete[33228]      Waiting
12/18 00:08:36 MultiScan[33229]           Waiting
12/18 00:05:38 MultiScan[33229]           Running (MEDIUM)
12/18 00:05:14 FSAnalyze[33212]           Running (MEDIUM)
12/18 00:05:14 SnapshotDelete[33228]      Running (MEDIUM)
12/18 00:03:38 SnapshotDelete[33228]      Waiting
12/18 00:03:34 MultiScan[33229]           Waiting
12/18 00:00:32 MultiScan[33229]           Running (MEDIUM)
12/18 00:00:31 ShadowStoreDelete[33230]   Succeeded (LOW)
12/18 00:00:31 ShadowStoreDelete[33230]   Phase 1: end delete
12/18 00:00:29 ShadowStoreDelete[33230]   Phase 1: begin delete
12/18 00:00:28 ShadowStoreDelete[33230]   Running (LOW)
12/18 00:00:28 MultiScan[33229]           Waiting
12/18 00:00:27 ShadowStoreDelete[33230]   Waiting
12/17 23:08:56 FSAnalyze[33212]           Running (MEDIUM)
12/17 23:08:56 FSAnalyze[33212]           Waiting
12/17 21:26:53 FSAnalyze[33212]           Running (MEDIUM)
12/17 21:26:53 FSAnalyze[33212]           Waiting
12/17 19:44:51 FSAnalyze[33212]           Running (MEDIUM)
12/17 19:44:51 FSAnalyze[33212]           Waiting
12/17 18:02:50 FSAnalyze[33212]           Running (MEDIUM)
12/17 18:02:50 FSAnalyze[33212]           Waiting
12/17 16:20:50 FSAnalyze[33212]           Running (MEDIUM)
12/17 16:20:50 FSAnalyze[33212]           Waiting
12/17 14:38:49 FSAnalyze[33212]           Running (MEDIUM)
12/17 14:38:49 FSAnalyze[33212]           Waiting
12/17 12:56:47 FSAnalyze[33212]           Running (MEDIUM)
12/17 12:56:47 FSAnalyze[33212]           Waiting
12/17 11:14:43 FSAnalyze[33212]           Running (MEDIUM)
12/17 11:14:43 FSAnalyze[33212]           Waiting
12/17 09:32:40 FSAnalyze[33212]           Running (MEDIUM)
12/17 09:32:40 FSAnalyze[33212]           Waiting
12/17 07:50:35 FSAnalyze[33212]           Running (MEDIUM)
12/17 07:50:35 FSAnalyze[33212]           Waiting
12/17 06:08:33 FSAnalyze[33212]           Running (MEDIUM)
12/17 06:08:33 FSAnalyze[33212]           Waiting
12/17 04:27:10 MultiScan[33229]           Running (MEDIUM)
12/17 04:26:37 FSAnalyze[33212]           Running (MEDIUM)
12/17 04:26:37 SnapshotDelete[33228]      Running (MEDIUM)
12/17 04:26:22 SnapshotDelete[33228]      Waiting
12/17 04:26:22 FSAnalyze[33212]           Waiting
12/17 04:26:22 MultiScan[33229]           Waiting
12/17 02:44:20 FSAnalyze[33212]           Running (MEDIUM)
12/17 02:44:20 FSAnalyze[33212]           Waiting
12/17 01:02:20 FSAnalyze[33212]           Running (MEDIUM)
12/17 01:02:20 FSAnalyze[33212]           Waiting
12/16 23:20:16 FSAnalyze[33212]           Running (MEDIUM)
12/16 23:20:16 FSAnalyze[33212]           Waiting
12/16 21:38:15 FSAnalyze[33212]           Running (MEDIUM)
12/16 21:38:15 FSAnalyze[33212]           Waiting
12/16 19:56:15 FSAnalyze[33212]           Running (MEDIUM)
12/16 19:56:15 FSAnalyze[33212]           Waiting
12/16 18:14:15 FSAnalyze[33212]           Running (MEDIUM)
12/16 18:14:15 FSAnalyze[33212]           Waiting
12/16 16:32:17 FSAnalyze[33212]           Running (MEDIUM)
12/16 16:32:17 FSAnalyze[33212]           Waiting
12/16 14:50:15 FSAnalyze[33212]           Running (MEDIUM)
12/16 14:50:15 FSAnalyze[33212]           Waiting
12/16 13:08:15 FSAnalyze[33212]           Running (MEDIUM)
12/16 13:08:15 FSAnalyze[33212]           Waiting
12/16 11:26:14 FSAnalyze[33212]           Running (MEDIUM)
12/16 11:26:14 FSAnalyze[33212]           Waiting
12/16 09:44:11 FSAnalyze[33212]           Running (MEDIUM)
12/16 09:44:11 FSAnalyze[33212]           Waiting
12/16 08:02:11 FSAnalyze[33212]           Running (MEDIUM)
12/16 08:02:11 FSAnalyze[33212]           Waiting
12/16 06:20:11 FSAnalyze[33212]           Running (MEDIUM)
12/16 06:20:11 FSAnalyze[33212]           Waiting
12/16 04:38:11 FSAnalyze[33212]           Running (MEDIUM)
12/16 04:38:11 FSAnalyze[33212]           Waiting
12/16 02:56:07 FSAnalyze[33212]           Running (MEDIUM)
12/16 02:56:07 FSAnalyze[33212]           Waiting
12/16 01:14:01 MultiScan[33229]           Phase 1: begin lin scan and mark
12/16 01:14:00 MultiScan[33229]           Running (MEDIUM)
12/16 01:13:26 FSAnalyze[33212]           Running (MEDIUM)
12/16 01:13:26 SnapshotDelete[33228]      Running (MEDIUM)
12/16 01:13:26 MultiScan[33229]           Waiting
12/16 01:12:41 SnapshotDelete[33228]      Waiting
12/16 01:11:57 SnapshotDelete[33228]      System Paused
12/16 01:11:55 MultiScan[33227]           System Cancelled
12/16 00:18:57 FSAnalyze[33212]           Running (MEDIUM)
12/16 00:18:57 FSAnalyze[33212]           Waiting
12/15 22:36:52 FSAnalyze[33212]           Running (MEDIUM)
12/15 22:36:52 FSAnalyze[33212]           Waiting
12/15 20:54:51 FSAnalyze[33212]           Running (MEDIUM)
12/15 20:54:51 FSAnalyze[33212]           Waiting
12/15 19:12:50 FSAnalyze[33212]           Running (MEDIUM)
12/15 19:12:50 FSAnalyze[33212]           Waiting
12/15 17:30:50 FSAnalyze[33212]           Running (MEDIUM)
12/15 17:30:50 FSAnalyze[33212]           Waiting
12/15 15:48:48 FSAnalyze[33212]           Running (MEDIUM)
12/15 15:48:48 FSAnalyze[33212]           Waiting
12/15 14:06:47 FSAnalyze[33212]           Running (MEDIUM)
12/15 14:06:47 FSAnalyze[33212]           Waiting
12/15 12:24:45 FSAnalyze[33212]           Running (MEDIUM)
12/15 12:24:45 FSAnalyze[33212]           Waiting
12/15 10:42:45 FSAnalyze[33212]           Running (MEDIUM)
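
One rough way to see which jobs are being interrupted most often in a history like the one above is to count events per job. The sketch below is plain Python, not an Isilon tool. It assumes the line layout shown in this output (date, time, `Job[id]`, event text), and the names `LINE_RE` and `count_events` are made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   12/18 02:40:55 SnapshotDelete[33228]      System Paused
# Header and separator lines do not match and are skipped.
LINE_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<job>\w+)\[(?P<job_id>\d+)\]\s+(?P<event>.+?)\s*$"
)


def count_events(lines, event="System Paused"):
    """Count how often each (job name, job id) logged the given event."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("event") == event:
            counts[(match.group("job"), int(match.group("job_id")))] += 1
    return counts


# A few lines copied from the history above.
SAMPLE = """\
12/18 02:40:58 FSAnalyze[33212]           System Paused
12/18 02:40:55 SnapshotDelete[33228]      System Paused
12/18 02:40:48 FSAnalyze[33212]           Running (MEDIUM)
12/18 02:34:34 SnapshotDelete[33228]      System Paused
12/18 01:45:04 MultiScan[33229]           System Cancelled
""".splitlines()

if __name__ == "__main__":
    for (job, job_id), n in count_events(SAMPLE).most_common():
        print(f"{job}[{job_id}]: {n} x System Paused")
```

Saving the full history to a file and passing its lines to `count_events` would give the totals for the whole period. Calling it with `event="Waiting"` counts the Waiting entries instead.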

Adam Fox

Dec 21, 2016, 10:13:39 PM

to isilon-u...@googlegroups.com

It looks like the job is getting pre-empted a lot. I assume you are running at least 7.1 so you can run more than one job at a time. If that's not the case, upgrade to a release that has the newer Job Engine (JE) and that will probably fix it. But if you are on a modern OneFS, this usually occurs when two jobs that need to run are in the same exclusion set. However, SnapshotDelete is not in an exclusion set, which implies that you either have 3 other jobs running at a higher priority or you have a FlexProtect job running, which blocks all other jobs when it needs to run.

You may want to keep an eye on your active jobs and see if it's worth tweaking the priority (which is very different from impact policy) to keep it running.
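
To make the priority point concrete, here is a toy model of the behaviour described above. It is not how the Job Engine is implemented. It assumes a limit of three concurrent jobs (implied by the "3 other jobs" remark), that a lower number means higher priority, and that jobs sharing an exclusion set cannot run together. The priorities and set memberships in `queue` are illustrative rather than read from a cluster, and the FlexProtect case that blocks everything is not modelled.

```python
from dataclasses import dataclass

MAX_CONCURRENT = 3  # assumption, implied by "3 other jobs running at a higher priority"


@dataclass(frozen=True)
class Job:
    name: str
    priority: int                        # assumption: lower number = higher priority
    exclusion_sets: frozenset = frozenset()


def pick_running(jobs, max_concurrent=MAX_CONCURRENT):
    """Greedy toy model: admit jobs in priority order, skipping any job that
    shares an exclusion set with a job that has already been admitted."""
    running = []
    busy_sets = set()
    for job in sorted(jobs, key=lambda j: j.priority):
        if len(running) >= max_concurrent:
            break
        if job.exclusion_sets & busy_sets:
            continue  # blocked by a higher-priority job in the same set
        running.append(job)
        busy_sets |= job.exclusion_sets
    return [j.name for j in running]


# Illustrative queue; priorities and exclusion sets are hypothetical.
queue = [
    Job("MultiScan", 4, frozenset({"restripe", "mark"})),
    Job("SmartPools", 6, frozenset({"restripe"})),
    Job("FSAnalyze", 6),
    Job("SnapshotDelete", 6),
    Job("ShadowStoreDelete", 2),
]

if __name__ == "__main__":
    print(pick_running(queue))
```

With this queue the three slots go to ShadowStoreDelete, MultiScan and FSAnalyze, so SnapshotDelete waits even though it is in no exclusion set. Giving SnapshotDelete a priority of 5 in the same model lets it take the third slot ahead of FSAnalyze.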

Just a thought. Hope this helps.

-- Adam Fox

adam...@yahoo.com

Alistair Stewart

Dec 22, 2016, 3:42:59 AM

to isilon-u...@googlegroups.com

Consider pausing FSAnalyze until you have cleared down your snapshots. This job is scheduled to run every night at 10pm so you might also want to consider running it only once per week, say on Friday nights. This job provides your file data to InsightIQ so you'll just lose a bit of granularity if you do this.

OneFS 7.2 is much more efficient at deleting snapshots so upgrading would be of benefit if you are running an earlier release.

Al...

scott

Dec 22, 2016, 11:34:10 AM

to Isilon Technical User Group

I am running 7.2 but so far haven't seen any improvements with jobs.

I agree - job preemption seems to be the root of the problem. FSAnalyze takes a month to complete - and we're 'flying blind' in a lot of ways until it does. SnapshotDelete and FSAnalyze cannot run at the same time, and if FSAnalyze cannot complete, the cluster will fill up.

Peter Serocka

Dec 22, 2016, 11:51:20 AM

to isilon-u...@googlegroups.com

What is the general situation on the cluster
in terms of CPU load, disk IOPS/latencies
and disk stalls or other events causing group changes?

Do you have SSDs for metadata acceleration?

— Peter

scott

Dec 22, 2016, 12:48:17 PM

to Isilon Technical User Group

On Thursday, December 22, 2016 at 8:51:20 AM UTC-8, Pete wrote:

> What is the general situation on the cluster
> in terms of CPU load, disk IOPS/latencies
> and disk stalls or other events causing group changes?

Healthy. Looked at isi_gather_info with IsilonAdvisor and everything is looking good.

> Do you have SSDs for metadata acceleration?

No. We were running this cluster with X200 (fast) nodes spilling into NL400 (storage) nodes, but the jobs weren't completing, so we split things out. We have considered adding SSDs to each node, but getting over the 2%? mark for metadata acceleration was cost-prohibitive.

The problem seemed to appear around the time we added a couple of new shares which may have lots of small files. This really seems to be the Achilles heel of ifs. When the LIN count / TB gets too high, the jobs have trouble keeping up.

Kenneth Van Kley

Dec 22, 2016, 12:50:52 PM

to Isilon Technical User Group

isi_classic snapshot usage | grep delete

Kenneth Van Kley

Dec 22, 2016, 12:52:45 PM

to Isilon Technical User Group

We're running FSAnalyze weekly. Waste of time to run it daily. Once you get a lot of files, the FSAnalyze job starts taking multiple days to run anyway.

Peter Serocka

Dec 22, 2016, 1:38:01 PM

to isilon-u...@googlegroups.com

I’m not yet sure whether the Advisor is sensitive enough
for performance-related issues, and for impacts of group changes.

It’s like a car with all sorts of electronic
component health checks, but still it’s the driver
who experiences the actual driving conditions…

For example, if your cluster continuously runs at
90+ percent CPU load, or disk latencies stick
at 10s of milliseconds, the OneFS jobs do have
a hard time. Monitoring those performance metrics
with InsightIQ (or isi statistics) is an effort
well spent imho.

Your job events (earlier post) were showing lots of
System Paused items, and even a MultiScan job got
System *Cancelled*. I’d really get my hands
dirty in the system logs to find out the
causes, like group changes, and sort them out.

As for the metadata and LINs/TB: yeah -- my old
rule of thumb (OneFS 6.5, no SSDs, cluster NOT
all the time running at 100% CPU+disk performance)
used to be:

"Number of LINs (files+dirs) in a pool
divided by the number of disk drives in the pool
should be less than 1 million."

(not phrasing it as "LINs per disk",
because that might lead to confusion
on whether multiple metadata copies
are already counted in or not.)

With SSDs and modern OneFS the ratio
in the rule of thumb has improved
perhaps by one, but not multiple, orders
of magnitude; with the larger benefit
coming from the SSDs and some benefit from
the better code in 7.x and 8.x.
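
As a worked example of the rule of thumb above, the check is just a division. The sketch below assumes you already know the LIN count and the drive count for a pool. The function names are made up, and the pool figures in the example are hypothetical, not taken from any cluster in this thread.

```python
def lins_per_drive(lin_count: int, drive_count: int) -> float:
    """Number of LINs (files + dirs) in a pool divided by the drives in the pool."""
    if drive_count <= 0:
        raise ValueError("drive_count must be positive")
    return lin_count / drive_count


def within_rule_of_thumb(lin_count: int, drive_count: int,
                         limit: float = 1_000_000) -> bool:
    """True if the pool stays under the quoted limit of 1 million per drive
    (the figure given for OneFS 6.5 without SSDs)."""
    return lins_per_drive(lin_count, drive_count) < limit


if __name__ == "__main__":
    # Hypothetical pool: 144 drives holding 200 million LINs.
    ratio = lins_per_drive(200_000_000, 144)
    print(f"{ratio:,.0f} LINs per drive, within rule: "
          f"{within_rule_of_thumb(200_000_000, 144)}")
```

For pools with SSDs on a newer OneFS, the `limit` argument can be raised to reflect the improvement described above, which is stated there as perhaps one order of magnitude.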

— Peter

Adam Fox

Dec 22, 2016, 1:40:04 PM

to isilon-u...@googlegroups.com

Ok, as I understand it, FSA and SnapshotDelete should be able to run at the same time. Here's a chart from the job engine white paper:

So if they can't, you may want to check with support.

If you can get to 8.x, FSA will be greatly improved as well. Your first run will be long as usual, but your subsequent runs should be much faster. I've seen this improve 80-90%. You'll also need IIQ 4.x at that time but together they really help the FSA run time.

-- Adam Fox

adam...@yahoo.com

Chris Pepper

Dec 22, 2016, 1:55:42 PM

to isilon-u...@googlegroups.com

Check your file counts per pool and SSD utilization. If you have enough capacity you should ask your sales team for a GNA threshold variance so you can enable it. We are using a lot more SSD capacity since we started using SSDs to cache data for our NL400s, but it worked out.

Chris

scott

Dec 22, 2016, 6:10:02 PM

to Isilon Technical User Group

On Thursday, December 22, 2016 at 10:55:42 AM UTC-8, Chris Pepper wrote:

> Check your file counts per pool and SSD utilization. If you have enough capacity you should ask your sales team for a GNA threshold variance so you can enable it. We are using a lot more SSD capacity since

No SSD in this cluster.

I think I might almost have things working -

I killed the 10-day SnapshotDelete job and paused FSAnalyze (as it keeps fighting with SnapshotDelete).

I tarred a few directories that had many tiny files that didn't need to be accessed right away, then kicked off SnapshotDelete again, and it's making progress more quickly.

Steve Bogdanski

Jan 23, 2017, 11:06:11 AM

to Isilon Technical User Group

I had some recent issues with FSAnalyze jobs failing to complete. Support gave me the resolution below for my issue, but I'm not sure whether it applies to your circumstances:

This is a known issue with the code version you are on along with FSA failing, an explanation is listed below.

FSAnalyze fails with 'database is locked' due to isi_job_d restarting and briefly having 2 locks on the database simultaneously. What usually causes isi_job_d to restart is the worker manager queue gets filled up with messages sent from the coordinator to the manager every minute when checking cluster load to throttle the job engine. The size of the message queue is 100. After 100 minutes this queue gets full and causes isi_job_d to stop and restart.

We can do a workaround to avoid the issue in the future. What we need to do is to increase the sysctl timeout rate. The default for the setting below is 60 seconds; we want to increase that timeout to 240 seconds. If you are available for a Webex we can make that change.

Command that identifies the timeout rate:

isi_gconfig -t job-config core.load_balance_interval_sec=
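
The effect of the change follows from the numbers in the support explanation: a queue of 100 messages with one message added per interval. The arithmetic below assumes, as that explanation implies, that nothing drains the queue in the meantime. The function name is made up for illustration.

```python
QUEUE_SIZE = 100  # message queue size quoted in the support explanation


def minutes_until_full(interval_sec: float, queue_size: int = QUEUE_SIZE) -> float:
    """Minutes until the queue fills when one message arrives per interval
    and none are removed (the simplifying assumption stated above)."""
    return queue_size * interval_sec / 60


if __name__ == "__main__":
    for interval in (60, 240):
        print(f"{interval:>3} s interval -> full after "
              f"{minutes_until_full(interval):.0f} minutes")
```

At the default of 60 seconds this gives the 100 minutes mentioned above. At 240 seconds the same queue takes 400 minutes to fill, so under this assumption the restarts would come less often rather than stop.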

saurabh chaudhary

Jan 25, 2017, 7:25:44 AM

to isilon-u...@googlegroups.com

We had the same issue with two of our Isilons, but rather than going ahead with changing the default value, we changed the FSA job schedule from daily to weekly [it now runs every Friday outside business hours]. The only impact we see from the move to weekly is that the data sets in our InsightIQ tool are now weekly instead of daily.

Thanks and Regards,

Saurabh Chaudhary
