AutoBalance running time

Jean-Baptiste Denis

Jan 6, 2015, 6:08:12 AM
to isilon-u...@googlegroups.com
Hi,

We've got 14 already-balanced X400 nodes and we've just integrated a new one.
We've launched an AutoBalanceLin job:

# isi_classic job status -v

Running jobs:
Job Impact Pri Policy Phase Run Time
-------------------------- ------ --- ---------- ----- ----------
AutoBalanceLin[9683] Low 4 LOW 1/3 16:10:08
Progress: Processed 4404967 LINs and approx. 8412 GB: 4039552 files,
365280 directories; 0 errors
LIN Estimate based on LIN count of 36 done on Jan 6 04:56:01 2015
LIN Based Estimate: N/A Remaining (>99% Complete)
Block Based Estimate: 1941h 29m 36s Remaining (0% Complete)

That's 8412 GB in 16 hours, about 150 MB/s. At that rate, 1 PB would take approx. 2000 hours, almost 3 months...
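
To spell out the arithmetic:

8412 GB / 16.17 h ≈ 520 GB/h ≈ 148 MB/s
1,048,576 GB (1 PB) / 520 GB/h ≈ 2016 h ≈ 84 days ≈ 2.8 months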

We are running OneFS 7.1.1.1:

# isi version
Isilon OneFS v7.1.1.1 B_7_1_1_84(RELEASE)

Our DSE says that the AutoBalance job can be quite slow and that this is normal. I
could change the impact to MEDIUM and see what happens, but I'm more concerned
about the fact that this is considered "normal". How could that be? We are
basically saved by the fact that our OneFS version can run multiple jobs in
parallel (imagine AutoBalance being interrupted by SnapshotDelete, or another
job simply being delayed for months, just for the sake of AutoBalance).

I can't wait for the day when I have a performance problem (even an unrelated
one) and support tells me that it's normal because I've got an AutoBalance in
progress and I have to wait months before it ends. In the meantime, my other X400
nodes could fill up: I've got 224 TB left on my X nodes, and writing just 2 TB
per day (about 24 MB/s, not something I'd call heavy use) over the estimated
AutoBalance run time would fill them.
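
Spelling out those numbers: 2 TB/day ≈ 2 × 1,048,576 MB / 86,400 s ≈ 24 MB/s, and 224 TB / 2 TB per day = 112 days, the same order of magnitude as the estimated AutoBalance run time.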

I'm really afraid of ending up (again) in a messy situation =)

What do you think?
What are you observing on your side?
How do you manage data balancing when you add a node to your Isilon cluster?

Should I just forget about it and consider it not as important as I think it
is (even though, given the scenario above, I'm convinced it is)?

Thank you for your input.

Jean-Baptiste

Peter Serocka

Jan 6, 2015, 6:46:17 AM
to isilon-u...@googlegroups.com
Hi Jean-Baptiste:

Job progress can be highly uneven over time;
it's too early to panic right now…

I wonder why there is no valid LIN count?
Can you run the LinCount job manually and then restart AutoBalanceLin?
(to get the LIN progress as a second metric.)
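
From memory, something along these lines should kick it off; I haven't
double-checked the syntax on 7.1, where the old job CLI lives under
isi_classic, so verify the job name on your release:

# isi_classic job start lincount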

And as usual, things also depend on CPU load
and disk IO saturation. With a fresh node just
added, you can see the effect of file layout fragmentation
by comparing the IO request sizes (isi statistics drive)
between old nodes and the new one: if the old nodes
read and write with significantly smaller sizes, and already
max out in the 100 IO/s range, there is not much to do
about it…
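
For example (the exact columns vary a little between OneFS versions;
compare the average request size per drive on the old nodes vs. the
new one):

# isi statistics drive --nodes=all --type=sata --orderby=opsout | head -n 20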

Cheers

— Peter

Jean-Baptiste Denis

Jan 6, 2015, 8:13:41 AM
to isilon-u...@googlegroups.com
On 01/06/2015 12:45 PM, Peter Serocka wrote:
> Hi Jean-Baptiste:

Hi Peter,

> Job progress can be highly uneven over time;
> it's too early to panic right now…

OK.

> I wonder why there is no valid LIN count?

Don't know.

> Can you run the LinCount job manually and then restart AutoBalanceLin?
> (to get the LIN progress as a second metric.)

I know from a previous run that we've got approximately 400 million LINs.

> And as usual, things also depend on CPU load
> and disk IO saturation. With a fresh node just
> added, you can see the effect of file layout fragmentation
> by comparing the IO request sizes (isi statistics drive)
> between old nodes and the new one: if the old nodes
> read and write with significantly smaller sizes, and already
> max out in the 100 IO/s range, there is not much to do
> about it…

Makes total sense. Ops/in are between 10 and 60 per second across all nodes, but
the old nodes are under heavy load on Ops/out (up to 900 on some SATA nodes,
which seems too high to be true, but it gives a feel for the actual load).

OK. So our cluster is quite loaded =)

Thank you very much for your answer!

Jean-Baptiste

Dan Pritts

Jan 7, 2015, 9:30:01 AM
to isilon-u...@googlegroups.com
Jean-Baptiste Denis wrote:
> Makes total sense. Ops/in are between 10 and 60 per second across all nodes, but
> the old nodes are under heavy load on Ops/out (up to 900 on some SATA nodes,
> which seems too high to be true, but it gives a feel for the actual load).
A 7200 RPM drive can typically do something like 75 IOPS (Wikipedia
says "75-100"). At that rate, 900 IOPS would be maxing out 12 disks.

Sounds like you are under heavy load, or these old nodes have a lot of
fragmentation.

thanks
danno
--
Dan Pritts
ICPSR Computing & Network Services
University of Michigan

Adam Fox

Jan 7, 2015, 9:36:57 AM
to isilon-u...@googlegroups.com
Keep in mind that Ops/in and Ops/out are protocol ops, not disk IOPS. Of course, in many cases the actual disk I/O could be worse, but there is no real correlation, especially when you account for cache.

-- Adam Fox
adam...@yahoo.com

Peter Serocka

Jan 8, 2015, 2:14:36 AM
to 'Adam Fox' via Isilon Technical User Group

On Jan 7, 2015, at 22:36, 'Adam Fox' via Isilon Technical User Group wrote:

> Keep in mind that Ops/in and Ops/out are protocol ops, not disk IOPS. Of course, in many cases the actual disk I/O could be worse, but there is no real correlation, especially when you account for cache.
>
> -- Adam Fox
> adam...@yahoo.com

Not necessarily...

With OneFS we have:

isi statistics client (& protocol) --> NAS (NFS,SMB,...) protocol ops

isi statistics heat --> OneFS filesystem ops

isi statistics drive --> "IOPS" per drive as observed through SCSI/SATA protocols.
What actually happens between platter and head, stays in the drive...


500+ small reads/s can be seen on SATA drives with OneFS,
without overloading, but only with AutoBalance etc. running.

This appears to be a "lucky case", and could
only be explained by examining the on-disk file system layout.
Think of lots of metadata that can be read
at low latency, i.e. without much head seeking
(same "cylinder" in old days' speak).

Cheers

-- Peter




Peter Serocka
CAS-MPG Partner Institute for Computational Biology (PICB)
Shanghai Institutes for Biological Sciences (SIBS)
Chinese Academy of Sciences (CAS)
320 Yue Yang Rd, Shanghai 200031, China
pser...@picb.ac.cn





Neproshennie

Feb 20, 2015, 1:21:30 PM
to isilon-u...@googlegroups.com, jbd...@pasteur.fr
Is there a reason why you elected a LIN-based job over a regular job? A LIN-based job will sequentially process every LIN on the cluster, whereas a non-LIN-based job will look for data that needs to be balanced. AutoBalance has multiple stages that seem to repeat themselves; it will do most of the heavy lifting initially, then go back and do a more granular balance after that. AutoBalance can take a long time: I've seen clusters take days to a week to complete a balance, but that was also in production, competing with client I/O.

When adding a node to a cluster, you do want AutoBalance to run; otherwise the cluster will target writes at the drives with the fewest inodes, in this case the new node, and it will be bombarded with writes. What I've done in the past is set up specific schedules where AutoBalance runs at medium impact at night and low impact during the day, and pauses to let other jobs, such as SnapshotDelete, run.

As others have said, performance will also be impacted by how busy the other drives are, but also by how full each drive is, what type of operations are hitting it, and whether you have SSDs. Typically, you can estimate the number of IOPS a SATA drive can handle with:

1000 ms per second / average seek time in ms

So that gives us:

1000 / 5 = 200 IOPS

Why use a 5 ms seek in this calculation? Simply because the cluster does write-leveling across all drives, so there would be no reason for a drive to do a full sweep of the platter until the drive is full. That calculation is the best-case scenario; if you're running drives at >80% capacity, then using the manufacturer's full seek time gives you more of a worst-case scenario:

1000 / 8.3 ≈ 120 IOPS

Most of the time you would just multiply the resulting per-drive IOPS by the number of SATA disks you have, but there are a few things to take into consideration:

1) Each node can access a maximum of 8 drives simultaneously at any given moment, due to the storage controller. Generally this is not a problem, since writes are distributed among the nodes in the cluster; typically one block goes to one disk pool, and each node can have multiple disk pools, so writes are distributed quite well.
2) Write operations are always prioritized over read operations.
3) If there are no SSDs, metadata reads and writes go directly to the spindles. The I/O scheduler can't distinguish between a data block and a metadata block; it just sees a read or a write, so refer to #2.
4) If there are SSDs, metadata read acceleration is usually enabled by default, which still leaves metadata writes hitting the spindles.
5) Snapshots and sync jobs are very metadata intensive, as is having large quantities of snapshots on a single path, or nested snapshots.
6) Client I/O will hurt the speed at which data is balanced. Keep in mind that one client write operation does not mean one filesystem write: for one client write there will be a data block write, multiple parity block writes, a metadata write, and metadata parity writes (see the worked example after this list).

...and so forth.
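
As a worked example for #6 (assuming, for illustration, a protection level with two FEC blocks per stripe; your actual protection setting may differ): a single small client write can turn into one data block write + two parity block writes + a metadata update + the metadata's own parity writes, i.e. five or more disk writes for a single protocol-level write.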

So when running AutoBalance, data isn't just copied over and away you go: data does get moved, but the protection blocks get moved as well, so a node that a data block is being read from may also be writing parity blocks, to a different disk pool, on the same node.

Also keep in mind that 'isi statistics ...' commands report averages over given periods of time and should always be treated as a guide, not gospel. Try not to rely on the busy percentage, as it depends on a lot of factors and is generally not accurate. With the industry-standard calculations I've provided, plus a consideration of what is happening on the cluster and how it is configured, you can more usefully look at the output of (remove the pipe to head if you want to see them ALL):

isi statistics drive --nodes=all --type=sata --orderby=opsin | head -n 20

isi statistics drive --nodes=all --type=sata --orderby=opsout | head -n 20

The moral of the story is that all good things come to those who wait. Unfortunately, AutoBalance is one of the longer-running jobs, but the end result is worth it. You may want to create custom schedules for the job to run at differing priorities. Also, the job's estimate of the time remaining can fluctuate significantly: if there is competing I/O for the particular LIN it is working on, or if it has to traverse a large number of snapshots, progress will be slowed down.

There have also been instances where AutoBalance is working on a LIN that is part of a snapshot, and if that snapshot is removed while AutoBalance is working on it, the job can stall on that particular LIN. That's when support would look for busy vnodes associated with the job engine while AutoBalance is trying to run. Another benefit of running AutoBalance instead of AutoBalanceLin is that non-LIN-based jobs support checkpoint files, so the job wouldn't have to start over from the absolute beginning (unless there was a significant change). Metadata acceleration would also greatly improve AutoBalance's performance in determining which LINs need to be worked on.

Just some food for thought.

--Jamie Ivanov

Peter Serocka

Feb 28, 2015, 4:16:16 AM
to isilon-u...@googlegroups.com, Jean-Baptiste Denis
Hi Jean-Baptiste,

How has it been going with the rebalancing?

We have recently updated our mostly-NL400 cluster
from 7.0.2.9 to 7.1.1.2,
so this is the first time we are running
the new job engine; and so far the changes
have been for the better.

A normal MultiScan was started after
the upgrade (no nodes added at this time,
so just normal housekeeping, and probably
some metadata updates for GNA).

With the LOW impact setting, the visible impact was
really not noticeable, but the expected
run time was also quite high, 3 weeks or so.

With the MEDIUM impact setting, the impact
was somewhat noticeable (in terms of NFS latency),
but not severe. And the job load really did prove
adaptive to the general NAS load.
The job finished within 3 days (again, no
data rebalancing to new nodes here).

That was much faster than similar runs
in the past, so the improvements are really
visible for us.


Did you experiment with the LOW and MEDIUM
impact settings on your cluster?

-- Peter

