IBM pSeries Shutdown/Reboot - Recommended Frequency

steven_nospam at Yahoo! Canada

Oct 14, 2010, 10:31:06 AM

Hi All,

This question has gone around a few times over the years with me
getting a variety of answers from both IBM and non-IBM personnel. Just
want to see what most people are doing nowadays with the latest
servers.

The question was: How often should an IBM pSeries server running AIX 5
or AIX 6 be rebooted? Is there a recommended practice?

Now, having worked with the IBM RS/6000 systems since about 1991, I've
always come down to one answer that I've been given from various
sources over those years: It depends on the applications that are
running on the server and how efficiently coded they are. If the
software is coded right, you may go a long time without needing a
reboot. On the other hand, if there are memory leaks and other issues,
a reboot may be required more frequently.

Our company manages something like seventy AIX servers from the old
43Ps up to a new P750 recently purchased. I have seen many a server
run fine for 180+ days and had one system that was not rebooted for
over 500 days. But on the other hand, we've also had servers that
started behaving poorly after about 40 days.

So for our sites, we started implementing a policy of recommending a
reboot sometime after every 1000 hours of use (about 45 days). But
this is not always convenient or followed when you have a production
server that needs to be operational 24/7.
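
As a quick check of the figure above: 1000 hours works out to a little under 42 days, so "about 45 days" is a round approximation. A minimal sketch of the conversion (the function name is only illustrative):

```python
HOURS_PER_DAY = 24

def hours_to_days(hours):
    """Convert a number of hours of uptime to days."""
    return hours / HOURS_PER_DAY

# 1000 / 24 = 41.67 (rounded to 41.7 below)
print(round(hours_to_days(1000), 1))  # prints 41.7
```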

So I ask my extended AIX newsgroup family:
- Do you have any guidelines you follow in your business or department
regarding reboot frequency?
- Is there an official recommendation in a document from IBM anywhere
for pSeries systems running AIX?

Here are the typical setups we have:

SystemA: OLTP, AIX 5.3, no RDBMS, COBOL software using ISAM files.

SystemB: OLTP, AIX 5.3, Informix 11.50, COBOL software.

SystemC: OLTP, AIX 5.3, Oracle 10g, COBOL software

I will be continuing my search at IBM's websites and may even place a
call to them to see what they advise. But it is also nice to hear what
other people are doing, especially if you also have Informix or Oracle
running on the AIX platform.

Thx in advance

SteveN

Tony

Oct 14, 2010, 1:16:25 PM

In comp.unix.aix, "steven_nospam at Yahoo! Canada" <steven...@yahoo.ca>
wrote:

>- Do you have any guidelines you follow in your business or department
>regarding reboot frequency?

Yes, we only reboot UNIX servers as an absolute last resort during
problem resolution, or where we are forced to by patching. Never on a
schedule.

We've had AIX servers run for over 1500 days (in the days when ML updates
usually didn't require reboots).

These days it's less common to see such uptimes because full TLs often
require reboots (or at least recommend them, and I'm not about to risk
ignoring that recommendation).

>I will be continuing my search at IBM's websites and may even place a
>call to them to see what they advise. But it is also nice to hear what
>other people are doing, especially if you also have Informix or Oracle
>running on the AIX platform.

We've got Oracle, Informix, DB2, SAP, etc., etc., etc. No scheduled
reboots. I strongly argue against reboots to resolve issues as well.

For this reason, I hate zombied processes which have died during kernel
calls (and so, can't respond to kill -9's).
--
Tony Evans
Saving trees and wasting electrons since 1993
blog -> http://perceptionistruth.com/
books -> http://www.bookthing.co.uk/
[ anything below this line wasn't written by me ]

Uwe Auer

Oct 14, 2010, 2:30:18 PM

Hi,

We never reboot for reasons other than:
- being forced to by installing new TLs, drivers or other software upgrades
- as a last(!!!) resort on problems

To my knowledge there is no official recommendation from IBM to reboot your
servers. I doubt you'll find any.

If you got systems behaving poorly after 40 days, find the cause! Implementing a
"reboot frequency"-policy is not the way to solve application software problems.

My opinion about a 45-day reboot policy as a standard: not worth
spending 5 minutes of effort on such an idea.

Regards,
Uwe Auer

Hank M. Higher

Oct 14, 2010, 11:50:11 PM

On 14.10.2010 16:31, steven_nospam at Yahoo! Canada wrote:

> The question was: How often should an IBM pSeries server running AIX 5
> or AIX 6 be rebooted? Is there a recommended practice?

Rebooting UNIX servers as an attempt at problem solving reveals a poor
understanding of the problem.
As a scheduled habit, its only benefit is the entertainment of this
newsgroup.

Jan Gerrit Kootstra

Oct 15, 2010, 12:42:55 AM

Hank M. Higher wrote:

Hank,

What do you mean by "rebooting as an attempt for problem solving reveals
poor understanding of the problem"?

1) A filesystem corruption in /. How to solve that on a running system?
I do not know how to do that. Please tell me.

2) 100% memory and swap space use due to application memory leaks.

Scheduled reboots when there are no problems: I would not do that either.

Kind regards,

Jan Gerrit Kootstra

Michael Kraemer

Oct 15, 2010, 1:58:29 AM

steven_nospam at Yahoo! Canada wrote:
> Hi All,

> The question was: How often should an IBM pSeries server running AIX 5
> or AIX 6 be rebooted? Is there a recommended practice?

Yes: if there's no problem, don't fix it.

Tony

Oct 15, 2010, 3:32:56 AM

My take on Hank's comment is that rebooting where you're not sure what the
problem is, or pre-emptive reboots, are poor solutions. There are always
situations where you may need to reboot a server to resolve a specific
fault (e.g. zombied DB2 processes holding TCP/IP ports open - if someone
can tell me how to fix that without a box reboot I'd be chuffed).

steven_nospam at Yahoo! Canada

Oct 15, 2010, 11:00:25 AM

Thanks guys, and for the record, I totally agree. If you can identify
and resolve a problem that is associated with the software, then by
all means that's the route to go. But sometimes you are living with a
system that has some restrictions, and you are not entirely in control
of the issue. Example: I have software that was purchased for use with
our proprietary software. The developer went out of business, and we
are slowly phasing it out by either developing our own replacement or
finding a new resource for that "component". But until the system is
perfected, rolled out, and replaces the old stuff, we may see memory
leaks, performance issues, and such. As most IT staff know, the cost of
unplanned downtime is a lot higher than planned downtime, and so I'd
rather do a planned downtime on a weekend than have 50-100 people
unable to work for 20-40 minutes as I reboot an ailing production
server.

So I am taking the suggestions and agree with you. It's just something I
wanted to be sure was not common to a certain platform or when a
certain RDBMS is installed.

Thanks.

SteveN

Jose Pina Coelho

Oct 15, 2010, 11:54:22 AM

Jan Gerrit Kootstra <jan.g...@kootstra.org.uk> wrote in news:66ba9
$4cb7dc61$4df909b6$27...@news.chello.nl:
> [...]

> 2) 100% memory and swapspace use due to application memoryleaks.

Those are solved by restarting the applications.

Jose Pina Coelho

Oct 15, 2010, 12:08:45 PM

Tony <to...@darkstorm.invalid> wrote in
news:i99032$gjg$1...@matrix.darkstorm.co.uk:

> My take on Hank's comment is that rebooting where you're not sure what
> the problem is, or pre-emptive reboots, are poor solutions.

They aren't solutions at all, they're workarounds.

Granted that if you have an AIX 4.3.3, already at the latest level, with an
I/O card designed by Cthulhu and an application coded by a shoggoth,
scheduled reboots may be your only remaining option (*).

> There are always situations where you may need to reboot a server to
> resolve a specific fault (e.g. zombied DB2 processes holding TCP/IP ports
> open - if someone can tell me how to fix that without a box reboot I'd be
> chuffed).

IIRC, a zombie process has no resources at all (or at least should not
have), since it's just the PID entry in the process table, waiting for the
parent to read the return code.

This can be worked around by killing the parent, which re-parents the
child to init (PID 1). init then reads the return code, causing the
zombie PID to disappear.
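
The reaping step described above can be sketched in a few lines. This is only an illustration on a generic UNIX system, not something taken from DB2 or AIX internals; the function name and the exit code 7 are arbitrary choices. The child exits at once, the parent deliberately delays its wait() so the child sits in the process table as a zombie, and the entry goes away as soon as the return code is read:

```python
import os
import time

def demo_zombie_reaping(exit_code=7):
    """Fork a child that exits at once, observe the zombie, then reap it."""
    pid = os.fork()
    if pid == 0:
        # Child: exit immediately, skipping Python cleanup handlers.
        os._exit(exit_code)

    # Parent: give the child time to exit. It is now a zombie, because
    # nobody has read its return code yet.
    time.sleep(0.2)

    # Signal 0 sends nothing; it only checks that the PID still exists.
    try:
        os.kill(pid, 0)
        present_before_wait = True
    except ProcessLookupError:
        present_before_wait = False

    # Reading the return code is what removes the process-table entry.
    _, status = os.waitpid(pid, 0)
    code = os.WEXITSTATUS(status)

    try:
        os.kill(pid, 0)
        present_after_wait = True
    except ProcessLookupError:
        present_after_wait = False

    return present_before_wait, code, present_after_wait

if __name__ == "__main__":
    print(demo_zombie_reaping())
```

On a typical system the PID should be present before the wait and gone after it, which is why a zombie whose parent never calls wait() lingers until that parent exits and init takes over.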

On the other hand, a "hung" process is usually waiting for some I/O and
will be totally unresponsive until it returns from the system call.

In both cases, a solution is opening a PMR on the faulty product to get the
code provider to fix the code.

And yes, during production it's frequently faster to just type "reboot" on
the LPAR.

* - Until you have money to replace it with a sane solution.

Tony

Oct 15, 2010, 12:59:34 PM

In comp.unix.aix, "steven_nospam at Yahoo! Canada" <steven...@yahoo.ca>
wrote:

>on the issue. Example: I have software that was purchased for use with
>our proprietary software. The developer went out of business, and we
>are slowly phasing it out by either developing our own replacement or
>finding a new resource for that "component". But until the system is
>perfected, rolled out, and replaces the old stuff, we may see memory
>leaks, performance issues, and such. As most IT staff know the cost of
>unplanned downtime is a lot higher than planned downtime, and so I'd
>rather do a planned downtime on a weekend than have 50-100 people
>unable to work for 20-40 minutes as I reboot an ailing production
>server.

Why do you need to reboot AIX for an application memory leak? Why not just
recycle the application? In my experience, this is more likely to be an
issue with shared memory segments being left behind and they can be cleared
down manually.
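
One way to act on this is to restart only the leaking application once its memory use stays above a limit, rather than rebooting AIX. A minimal sketch of the decision logic, assuming memory readings are already being collected by some other means (for example from ps). The function name, the 2048 MB limit, and the sample readings are all hypothetical:

```python
def should_recycle(samples_mb, limit_mb, min_samples=3):
    """Return True when the last `min_samples` readings all exceed `limit_mb`.

    Requiring several consecutive readings avoids restarting the
    application because of a single short-lived spike.
    """
    if len(samples_mb) < min_samples:
        return False
    recent = samples_mb[-min_samples:]
    return all(reading > limit_mb for reading in recent)

# Hypothetical readings in MB, oldest first.
readings = [900, 1500, 2100, 2200, 2300]
print(should_recycle(readings, limit_mb=2048))  # prints True
```

The actual restart command is left out on purpose, since it depends entirely on the application being recycled.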

Jan Gerrit Kootstra

Oct 15, 2010, 3:22:37 PM

Jose Pina Coelho wrote:

Jose,

Tell me how. In my experience, 100% swap space usage blocked the login
process, even for root.

I had to power-cycle the LPAR.

Kind regards,

Jan Gerrit

Tony

Oct 15, 2010, 5:05:00 PM

In comp.unix.aix, Jan Gerrit Kootstra <jan.g...@kootstra.org.uk> wrote:

>Jose Pina Coelho wrote:
>> Jan Gerrit Kootstra <jan.g...@kootstra.org.uk> wrote in news:66ba9
>> $4cb7dc61$4df909b6$27...@news.chello.nl:
>>> [...]
>>> 2) 100% memory and swapspace use due to application memoryleaks.
>>
>> Those are solved by restarting the applications.

>Tell me how, 100% swapspace usage blocked the login process was my
>experience, even for root.
>I had to powercycle the LPAR.

Yep, that can be an issue. Sometimes a console login will work where an ssh
session won't, but it depends on which daemons have been killed.

Jose Pina Coelho

Oct 15, 2010, 6:19:51 PM

Jan Gerrit Kootstra <jan.g...@kootstra.org.uk> wrote in news:38ba9
$4cb8aa90$4df909b6$11...@news.chello.nl:

Jan, at times a reboot/reset is the only workaround available; however, the
solution is to keep the system from getting into that state. WLM is your
friend.
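
Alongside WLM, a simple early warning on paging space can help catch the problem before logins start failing. The sketch below is not WLM itself, only a monitoring aid. It assumes the two-line summary that `lsps -s` prints on AIX (a header line, then total size and percent used). The sample text, the function names, and the 80% threshold are illustrative assumptions, not captured output:

```python
import re

def parse_percent_used(lsps_output):
    """Extract the 'Percent Used' value from `lsps -s`-style text."""
    match = re.search(r"(\d+)\s*%", lsps_output)
    if match is None:
        raise ValueError("no percentage found in input")
    return int(match.group(1))

def paging_space_alert(lsps_output, threshold=80):
    """Return True when paging space usage is at or above `threshold` percent."""
    return parse_percent_used(lsps_output) >= threshold

# Illustrative sample in the assumed `lsps -s` layout.
SAMPLE = """Total Paging Space   Percent Used
      4096MB              85%
"""

print(paging_space_alert(SAMPLE))  # prints True
```

In practice the text would come from running the command on a schedule, with the alert going to whoever can restart the offending application before swap is exhausted.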