Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Frequent HDR failures: -25572 (oserr = 0 or 4) generally needing restart (11.50FC7 on SLES)

874 views
Skip to first unread message

Humphrey D

unread,
Apr 22, 2013, 3:50:19 AM4/22/13
to
Hi All,

I'm more a developer than a DBA, but after our DBA left, I seem to have inherited the job of supporting a number of servers running development databases on a free version of Informix - Innovator-C 11.50.FC7IE

The development boxes are running as HDR pairs, and we've never used any of the other MACH-11 options (SDS, RSS).
Increasingly, we seeing HDR interruptions, and the odd thing is on some blades, HDR generally recovers after a minute or two, whilst on others it generally fails requiring intervention!


There's a lot of log extracts below, but I guess my basic question is whether the -25582 "Network connection is broken" errors are genuinely related to the network (I suspect not, when the HDR nodes are two blades in an enclosure) or indicative of an error in the database engine itself (does the "oserr = 0" indicate that with no OS error, it's an Informix glitch?)

Could anyone suggest further diagnosis I could do? Sometimes we see a single pair fail twice in a few hours, sometimes they'll be fine for weeks. I'm pretty sure there's one or two pairs that have never failed in the past year...

Example of HDR failing but recovering itself on a secondary (DB02):
07:19:59 DR: ping timeout
07:19:59 DR: Receive error
07:19:59 ASF Echo-Thread Server: asfcode = -25582: oserr = 4: errstr = : Network connection is broken.
System error = 4.
07:19:59 DR_ERR set to -1
07:19:59 DR: Warning - Proxy Subsystem not terminated
07:20:00 DR: Turned off on secondary server
07:21:56 DR: Secondary server connected
07:21:57 DR: Secondary server needs failure recovery

07:21:58 DR: Failure recovery from disk in progress ...
07:22:12 Logical Log 51874 Complete, timestamp: 0x9ebe17c4.
07:22:14 B-tree scanners disabled.
07:22:15 DR: HDR secondary server operational




Example of failure on a secondary (DB32):

01:08:36 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 6735, Llog used 0
01:23:12 DR: ping timeout
01:23:12 DR: Receive error
01:23:12 ASF Echo-Thread Server: asfcode = -25582: oserr = 4: errstr = : Network connection is broken.
System error = 4.
01:23:12 DR_ERR set to -1
01:23:12 DR: Warning - Proxy Subsystem not terminated
01:23:13 DR: Turned off on secondary server




Slightly different report on another secondary (DB62):

02:31:01 Logical Log 14935 Complete, timestamp: 0x291208d.
02:34:09 DR: ping timeout
02:34:09 DR: Receive error
02:34:09 ASF Echo-Thread Server: asfcode = -25582: oserr = 4: errstr = : Network connection is broken.
System error = 4.
02:34:09 DR_ERR set to -1
02:34:10 SMX thread is exiting because of network error code -25582
02:34:10 DR: Turned off on secondary server




In all the above, "oserr = 4" means an interrupted system call, yes? Does this mean that it's an OS issue that tripped Informix, rather than a bug in the database engine?


We also get similar failures, but with "oserr = 0" - does this mean nothing bad at OS level, so it's likely a bug in Informix?

This from DB62:
06:27:43 DR: ping timeout
06:27:43 DR: Receive error
06:27:43 ASF Echo-Thread Server: asfcode = -25582: oserr = 0: errstr = : Network connection is broken.
06:27:43 DR_ERR set to -1
06:27:44 SMX thread is exiting because of network error code -25582
06:27:44 DR: Turned off on secondary server

Or, am I reading too much into the "oserr =" part of the logged errors?



I guess we're pretty limited in what we may discover, as we don't have an IBM support contract, but what I *have* managed to find is that whenever we see HDR fail and stay failed, than an "onstat –g cpu" on the primary always shows a terminated HDR ping thread:
92 dr_prping 3cpu 03/09 06:27:34 2.4990 336294 terminated


Trawling ibm.com gave me an APAR with broadly similar symptoms, but for leter version of Informix:
IC85187: AFTER HDR PING TIMEOUT, PRIMARY CAN'T RECONNECT, ERROR RECEIVED CONNECTION REQUEST FROM REMOTE SERVER WHEN DR ISNT OFF


And we do occasionally get that error message if we restart the secondary node, then we have to cycle Informix:
Primary (DB23):
12:48:35 DR: ping timeout

12:51:09 Checkpoint Completed: duration was 0 seconds.
12:51:09 Fri Apr 12 - loguniq 31282, logpos 0x91b5204, timestamp: 0x4b805a8 Interval: 87467
12:51:09 Maximum server connections 318
12:51:09 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 9059, Llog used 35583

13:06:12 Maximum server connections 318
13:06:12 Checkpoint Statistics - Avg. Txn Block Time 0.000, # Txns blocked 0, Plog used 3772, Llog used 10694

13:09:46 DR: Received connection request from remote server when DR is not Off
[Local type: Primary, Current state: FAILED]
[Remote type: Secondary]

13:10:34 Checkpoint Completed: duration was 0 seconds.
13:10:34 Fri Apr 12 - loguniq 31282, logpos 0xbb7d018, timestamp: 0x4bdc4b5 Interval: 87469



Secondary (DB22)
12:49:02 DR: ping timeout
12:49:02 DR: Receive error
12:49:02 ASF Echo-Thread Server: asfcode = -25582: oserr = 107: errstr = : Network connection is broken.
System error = 107.
12:49:02 DR_ERR set to -1
12:49:02 DR: Warning - Proxy Subsystem not terminated
12:49:03 DR: Turned off on secondary server
13:09:29 Shutdown Mode
13:09:30 (4115609) connection rejected - no calls allowed for sqlexec
13:09:30 listener-thread: err = -27002: oserr = 0: errstr = : No connections are allowed in quiescent mode.
13:09:30 (4115610) connection rejected - no calls allowed for sqlexec
13:09:30 listener-thread: err = -27002: oserr = 0: errstr = : No connections are allowed in quiescent mode.
13:09:30 Quiescent Mode
13:09:30 IBM Informix Dynamic Server Stopped.

13:09:36 IBM Informix Dynamic Server Started.
13:09:36 Segment locked: addr=0x44000000, size=1511919616
13:09:36 Segment locked: addr=0x9e1e1000, size=629145600

13:09:37 Event alarms enabled. ALARMPROG = '/opt/informix/etc/alarmprogram_101.sh'
13:09:37 Booting Language <c> from module <>
13:09:37 Loading Module <CNULL>
13:09:37 Booting Language <builtin> from module <>
13:09:37 Loading Module <BUILTINNULL>
13:09:43 DR: DRAUTO is 0 (Off)
13:09:43 DR: ENCRYPT_HDR is 0 (HDR encryption Disabled)
13:09:43 Event notification facility epoll enabled.
13:09:44 IBM Informix Dynamic Server Version 11.50.FC7IE Software Serial Number AAA#B000000
13:09:44 listener-thread: err = -25572: oserr = 98: errstr = : Network driver cannot bind a name to the port.
System error = 98.
13:09:44 sql_listener: ASF_LISTEN failed
13:09:44 Attempting to bring listener thread down.

Do the above pair of errors on DB23/DB22 mean that the port for HDR was still held open after I took Infomrix offline? I cycled Informix on primary and secondary again, and there was nothing odd in the logs, and this pair has been fine since...

Sorry for such a rambling first post - but if anyone can tell me what the likely problem is, or suggest additional Informix or Unix diagnostits I should do next time, I'd really be grateful!

Thanks very much.
Humph

Art Kagel

unread,
Apr 22, 2013, 9:42:27 AM4/22/13
to Humphrey D, inform...@iiug.org
Humph:

Here's what I have for you:
  • Errno 4 is indeed interrupted system call, so possibly an OS error started the trouble.
  • Errno 0 is bad Ju-Ju, yes probably a bug
  • Errno 104 indicates that the client at the other end of an outgoing pipe disconnected, so probably the secondary was encountering the same problem as the primary.  That would point to some kind of network glitch.  In the network itself or in the OS?  Hard to tell.
  • The dr_prping thread going defunct is worrying but you still don't know if this is cause or symptom.
  • You CAN get paid support for the Innovator-C installations you use for DEV and it is very inexpensive (less than $2000 IB, but I haven't looked at the pricing lately).  So, that is a viable option.

Art

Art S. Kagel
Advanced DataTools (www.advancedatatools.com)
Blog: http://informix-myview.blogspot.com/

Disclaimer: Please keep in mind that my own opinions are my own opinions and do not reflect on my employer, Advanced DataTools, the IIUG, nor any other organization with which I am associated either explicitly, implicitly, or by inference.  Neither do those opinions reflect those of other individuals affiliated with any entity with which I am affiliated nor those of the entities themselves.


_______________________________________________
Informix-list mailing list
Inform...@iiug.org
http://www.iiug.org/mailman/listinfo/informix-list

0 new messages