Help about server which hangs during restart

Omar Muñoz

Feb 17, 2008, 12:34:55 PM

to informix-l...@iiug.org, inform...@iiug.org

Hi.

Once, when we tried to bring a database online after
shutting it down without problems as part of a weekly
task, the database hung for about an hour.
The online.log looked like this:

08:19:13 On-Line Mode
08:19:13 Affinitied VP 1 to phys proc 4
08:27:11 VP Notify mechanism incomplete after 5 minutes. This can be due to slow network file access. Will try 12 more times
08:35:11 VP Notify mechanism incomplete after 5 minutes. This can be due to slow network file access. Will try 11 more times
08:43:09 VP Notify mechanism incomplete after 5 minutes. This can be due to slow network file access. Will try 10 more times
08:51:08 VP Notify mechanism incomplete after 5 minutes. This can be due to slow network file access. Will try 9 more times

After that, the database crashed, and online.log showed:

09:54:56 notifyvp(): vp 3, pid 22717 of class 0 didn't rcv
09:54:56 notifyvp(): vp 4, pid 22718 of class 0 didn't rcv
09:54:56 notifyvp(): vp 5, pid 22719 of class 0 didn't rcv
09:54:56 notifyvp(): vp 6, pid 22720 of class 0 didn't rcv
09:54:56 notifyvp(): vp 7, pid 22721 of class 0 didn't rcv
09:54:56 notifyvp(): vp 8, pid 22722 of class 0 didn't rcv
09:54:56 notifyvp(): vp 9, pid 22723 of class 0 didn't rcv
09:54:56 Assert Failed: mt_notifyvp timed out
09:54:56 IBM Informix Dynamic Server Version 9.40.FC7
09:54:56 Who: Session(1, informix@orion, 0, 35930e028)
Thread(7, main_loop(), 3592cc028, 1)
File: mt.c Line: 11121
09:54:56 stack trace for pid 22637 written to /respaldo2/informix/tmp/siisa/af.3ef58cf
09:54:56 See Also: /respaldo2/informix/tmp/siisa/af.3ef58cf, shmem.3ef58cf.0
09:56:12 Error writing '/respaldo2/informix/tmp/siisa/shmem.3ef58cf.0' errno = 28
09:56:12 mt.c, line 11121, thread 7, proc id 22637, mt_notifyvp timed out.
09:56:14 The Master Daemon Died
09:56:15 PANIC: Attempting to bring system down

We were finally able to bring Informix online only by
rebooting the entire server (a Sun server running
Solaris 9). Even so, I'd like to ask what causes this
kind of error. This is an Informix 9.40.FC7 database.

Thanks in advance

Omar Munoz


Sebastian, Norma J.

Feb 17, 2008, 12:42:58 PM

to Omar Muñoz, informix-l...@iiug.org, inform...@iiug.org

Did you do anything about the network messages?
Informix told you there were network problems at 8:27 AM and didn't crash until almost 10 AM.
Did you try to fix or figure out the network problem in the 1.5 hours before Informix gave up?
My guess is that some important part of Informix is on a shared/network drive and Informix could not get to it.

_______________________________________________
Informix-list mailing list
Inform...@iiug.org
http://www.iiug.org/mailman/listinfo/informix-list

da...@smooth1.co.uk

Feb 17, 2008, 4:27:33 PM

On 17 Feb, 17:42, "Sebastian, Norma J." wrote:

No, Norma, this does not have to be a network issue; you are wrong. All
it means is that the VPs timed out communicating with each other.
Perhaps one of the threads was hung in an OS call, or there is an
Informix bug. DO NOT MAKE ASSUMPTIONS.

There is a known VP notify issue in 9.40.UC7:

http://www-1.ibm.com/support/docview.wss?rs=630&context=SSGU8G&context=SSHPYE&dc=D600&uid=swg21233887&loc=en_US&cs=utf-8&lang=en

Patch to 9.40.UC7W1X1 to fix. PS When will 9.40.xC8 be out?

Omar, there could be some other issue. Who knows, it could be a hardware
issue, an OS issue or an Informix issue.

You have DUMPSHMEM enabled, but if you check /usr/include/sys/errno.h
on your system you will find that errno 28 is ENOSPC (no space left on
device). Either there is not enough space under
/respaldo2/informix/tmp/siisa/ to dump shared memory, or it is trying
to write a file larger than 2 GB on a file system that is not largefile
enabled.
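
For anyone who wants to cross-check the errno value without opening errno.h, here is a minimal Python sketch (an illustration only, not something anyone in this thread ran). It looks up error number 28 in the platform's errno table and reports how much space is available on a directory. The function names describe_errno and free_bytes are made up for this example; on Solaris and Linux, 28 maps to ENOSPC.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an OS error number on this platform."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


def free_bytes(path):
    """Bytes available to a non-root process on the file system holding path."""
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize


# 28 is the errno reported in the "Error writing ... shmem" log line.
print(describe_errno(28))

# "." is a placeholder; point this at the directory the dump is written to,
# for example /respaldo2/informix/tmp/siisa in the log above.
print(free_bytes("."))
```

Comparing the free space with the total shared memory size gives a rough idea of whether a full dump could ever fit there.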

What did onstat -g stk all give? Did you strace/truss the server
pids?
Were there any issues in the OS logs?

You need to do more investigation when the problem happens.
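
One way to notice the problem while it is still in progress, so that diagnostics can be collected before the assert, is to scan online.log for the warning text. The following is only a sketch under assumptions: the message strings are copied from the log excerpt at the top of this thread, each message is assumed to sit on a single line in the real file, and the function name scan_online_log is hypothetical.

```python
import re

# Patterns built from the message text quoted in the original post.
NOTIFY_RE = re.compile(
    r"^(\d\d:\d\d:\d\d)\s+VP Notify mechanism incomplete.*?Will try (\d+) more times"
)
ASSERT_RE = re.compile(r"^(\d\d:\d\d:\d\d)\s+Assert Failed: (.*)$")


def scan_online_log(lines):
    """Yield (time, kind, detail) for VP notify warnings and assert failures."""
    for line in lines:
        line = line.rstrip("\n")
        m = NOTIFY_RE.match(line)
        if m:
            # detail is the number of retries the server says remain
            yield m.group(1), "notify", int(m.group(2))
            continue
        m = ASSERT_RE.match(line)
        if m:
            yield m.group(1), "assert", m.group(2)


sample = [
    "08:19:13 On-Line Mode",
    "08:27:11 VP Notify mechanism incomplete after 5 minutes. "
    "This can be due to slow network file access. Will try 12 more times",
    "09:54:56 Assert Failed: mt_notifyvp timed out",
]
for event in scan_online_log(sample):
    print(event)
```

In practice the lines would come from the open online.log file rather than a sample list, and the first "notify" event would be the cue to run onstat and truss against the server processes.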

Omar Muñoz

Feb 17, 2008, 9:19:33 PM

to da...@smooth1.co.uk, inform...@iiug.org

Hi.

At first I thought something similar to Norma, so I
edited sqlhosts to use IP addresses instead of host
names in order to avoid name resolution, but that
didn't help, and anyway everything is on the same
machine.

You're right about the shared memory dump, David. My
dump partition didn't fill up, but since shared
memory is pretty big (13 GB in total) I don't think I
had large file support on it. I'm going to talk about
it with the OS administrator. Anyway, the assert
failure happened before that.
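
To put a number on the large file question: without large file support, a file cannot grow past 2^31 - 1 bytes, the largest offset that fits in a signed 32-bit value. A tiny sketch of the comparison, assuming the dump is roughly as large as shared memory itself and reading "13 GB" as 13 GiB (both are assumptions made for illustration; needs_largefile is a made-up name):

```python
# Largest file size reachable with a signed 32-bit file offset.
LARGEFILE_LIMIT = 2**31 - 1


def needs_largefile(size_bytes):
    """True if a file of this size cannot be written without large file support."""
    return size_bytes > LARGEFILE_LIMIT


# 13 GB of shared memory, taken here as 13 GiB (assumption).
shmem_bytes = 13 * 1024**3
print(needs_largefile(shmem_bytes))
```

With these figures the dump is several times over the limit, so a file system mounted without large file support could not hold it regardless of free space.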

Sadly, I didn't run any onstat monitoring at that
time, but I guess some information might be in the
af file. What should I look for?

I'm going to look into the patch you sent me, even
though, according to the link, this issue supposedly
affects Solaris 8 and we have Solaris 9.

Thanks a lot to everyone

Omar Munoz


Jack Parker

Feb 17, 2008, 9:49:53 PM

to Omar Muñoz, da...@smooth1.co.uk, inform...@iiug.org

Doesn't seem like a network issue to me. I've never seen anything like
this, but my impression was that it could not affinitize, i.e. it could
not grab a resource that it wanted, in this case a CPU. Rebooting cleared
it up, so IMHO something was hanging onto the CPU that prevented IDS from
grabbing it.

cheers
j.

Sane ego te vocavi. Forsitan capedictum tuum desit.

da...@smooth1.co.uk

Feb 19, 2008, 5:03:10 PM

On 18 Feb, 02:49, "Jack Parker" <jack.park...@verizon.net> wrote:

"impression was that it could not affinitize." ??

If the system calls that pin a CPU VP to a CPU fail, then I would
expect an error in the online.log.

"something was hanging onto the CPU that prevented IDS from grabbing
it". What? I've never heard such complete bollocks!
Nothing can "hang on" to a CPU. I expect either an OS bug or an
Informix bug (SPARC Solaris normally reports hardware issues
pretty reliably to the OS logs). Without more info we cannot tell
which.