We have a live production installation with 250+ clients connected via
an OpenROAD 4.1 application; the back end is AIX 5.2. There have been no
h/w changes for 2 years, no upgrade to Ingres (II 2.6/0305 (rs4.us5/00)
patch 10384), and no CBF changes or PC changes (NODES) that we've been
told about. The OpenROAD application has not changed for some months. We
have load balancing with 5 iigcc's starting (one interesting point is
that only one of them now appears to be clocking up CPU time
when looking at ps -ef output).
Two days ago the following error started appearing in the Ingres
errlog :-
Y8707E ::[37528 IIGCC, 0000000d]: Wed Feb 7 16:03:18 2007
E_CLFE05_BS_CONNECT_ERR Unable to make outgoing connection.
Y8707E ::[37528 IIGCC, 0000000d]: System communication error:
Socket is not connected.
After a restart of Ingres everything works OK for an hour or so.
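
To see how the connection errors are spread over the day, the entries
can be tallied per hour. The following is only a sketch:
connect_errs_by_hour is a made-up helper name, and it assumes each
errlog.log entry sits on a single line with a timestamp in the form
shown above (the entries quoted here have been wrapped by the mailer).

```shell
# Count E_CLFE05_BS_CONNECT_ERR entries per day and hour.
# Reads errlog.log text on stdin. Assumption: one entry per line, with a
# "Wed Feb 7 16:03:18 2007" style timestamp somewhere on that line.
connect_errs_by_hour() {
    awk '/E_CLFE05_BS_CONNECT_ERR/ {
             for (i = 3; i <= NF; i++)
                 if ($i ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/) {
                     # month, day of month, and the hour part of HH:MM:SS
                     key = $(i-2) " " $(i-1) " " substr($i, 1, 2) "h"
                     n[key]++
                     break
                 }
         }
         END { for (k in n) print k, n[k] }' | sort
}

# Usage on the server (not run here):
#   connect_errs_by_hour < $II_SYSTEM/ingres/files/errlog.log
```

A sudden jump in the hourly count after a restart would show roughly
when the fault sets in, which is easier to compare against the AIX
error log than a raw list of messages.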
Another company supports the AIX server. We do have Ingres access
(but not root), and they are saying this is an Ingres problem.
We do also have access to errpt and there is an error showing :-
LABEL: GOENT_LINK_DOWN
IDENTIFIER: DED8E752
Date/Time: Thu 8 Feb 08:03:23 2007
Sequence Number: 515851
Machine Id: 005C90DF4C00
Node Id: Y8707E
Class: H
Type: TEMP
Resource Name: ent1
Resource Class: adapter
Resource Type: 14106902
Location: U0.1-P1-I1/E1
VPD:
Product Specific.( ).......10/100/1000 Base-TX PCI-X Adapter
Part Number.................00P4501
FRU Number..................00P4501
EC Level....................H12511
Manufacture ID..............YL1021
Network Address.............000255535899
ROM Level (alterable).......GOL002
Description
ETHERNET DOWN
Probable Causes
CABLE
CSMA/CD ADAPTER
Failure Causes
LINK TIMEOUT
Recommended Actions
CHECK CABLE AND ITS CONNECTIONS
Detail Data
FILE NAME
line: 191 file: goent_limbo.c
PCI ETHERNET STATISTICS
0000 0005 0063 0853 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0001
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000
0000 0000 0000 0000 0000 0000 0000 0002 0000 0001 0000 0001 0000 0000
0000 0000
0000 0000 0000 0000 0000 0000 0000 BB80 00F0 0249 0C00 0000 0000 01A0
0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
DEVICE DRIVER INTERNAL STATE
3333 3333 0000 0000 0000 0000
SOURCE ADDRESS
0002 5553 5899
We are being told this is a physical NIC which is not being used and
has nothing to do with the problem.
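
Whether the link-down event is a one-off or keeps recurring can be
checked by tallying the AIX error log per label and resource. A rough
sketch, assuming errpt -a output laid out like the entry above;
errpt_summary is a hypothetical helper name, not an AIX command.

```shell
# Tally errpt entries by LABEL and Resource Name.
# Reads `errpt -a` style text on stdin; prints "label resource count".
# Assumption: each entry has a "LABEL:" line followed later by a
# "Resource Name:" line, as in the entry quoted above.
errpt_summary() {
    awk '$1 == "LABEL:" { label = $2 }
         $1 == "Resource" && $2 == "Name:" { n[label " " $3]++ }
         END { for (k in n) print k, n[k] }' | sort
}

# Usage on the server (not run here):
#   errpt -a | errpt_summary
```

If the ent1 count keeps climbing on the days the Ingres errors appear,
that would be worth putting to the people supporting the server.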
Whilst the fault is present you cannot access IPM (except with -s) or
iinamu. Even when you first re-start Ingres, although you can start
IPM you cannot access the "Server" option?
I've seen a number of old posts on here on the subject, but little in
the way of responses, so I assume this is either a silly question,
too vague, or not a well known subject. Funnily enough, those looking
after the server have told us they've requested an engineer to look at
the server (probably 24-48 hours away...). My knowledge of TCP/IP is
nil, but from what I've read this looks like a problem related to
TCP sockets, i.e. a server problem. Any suggestions would be most
welcome as there are 250 users that can't work.
Steve
Paul
The config.dat parameters have not changed.
Ingres is started in batch (by us) as Ingres.
Netstat shows a number of users connected (when any of them can
connect), but the bulk of them are using a single port. Out of 194
users 180 will be against the second port (18065 in our case) and the
rest split over the other 4.
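
The per-port split can be counted straight from netstat rather than by
eye. This is a minimal sketch: count_by_port is a made-up helper name,
and it assumes `netstat -an` style lines where the fourth field is the
local address and the last field is the connection state.

```shell
# Count ESTABLISHED TCP connections per local port.
# Reads `netstat -an` style output on stdin; prints "port count" lines,
# busiest port first. Assumption: field 4 is the local address, written
# either as addr.port (AIX style) or addr:port.
count_by_port() {
    awk '$NF == "ESTABLISHED" {
             port = $4
             sub(/.*[.:]/, "", port)   # keep what follows the last "." or ":"
             n[port]++
         }
         END { for (p in n) print p, n[p] }' | sort -k2,2nr
}

# Usage on the server (not run here):
#   netstat -an | count_by_port
```

The output includes every local port with established connections, so
the five iigcc listen ports still have to be picked out of it.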
One significant factor that's since come to light is that the
licence has been out of date since 01/01/07 (this is not supported by
us). The CA site indicates there is some problem, but what it
results in is somewhat vague:
http://supportconnectw.ca.com/public/ca_common_docs/lic_expire.asp
From what's been said there, it looks to me as if CA do not know the
implications, but I think this needs to be fixed a.s.a.p. There is an
engineer looking at the AIX server as I write....
>Whilst the fault is present you cannot access IPM (except with -s) or
>iinamu. Even when you first re-start Ingres, although you can start
>IPM you cannot access the "Server" option?
Sounds like the Name Server is getting hosed somehow. Is it
running? Can you connect to the DBMS server directly?
(find out what the port number for the dbms server is, by
looking in errlog.log if necessary, and
export II_DBMS_SERVER=portnumber
sql some-database
locally on the aix box.)
If you can connect to the DBMS server directly, it sounds like a
stuck name server. It's not immediately clear to me why
the name server would be hanging, but it is odd that this started
all of a sudden.
Am I right in thinking that existing connections are OK, it's
just that no new ones can be made when this happens?
It does sound as if something changed on at least some of the
clients, such that you're burying one of the comm servers instead
of using them all. I don't see how that would be related to the hang.
(GCC/GCN is not my area of expertise though.)
Karl
> We has a live production installation with 250+ clients connected via
> an openroad 4.1 application, the b/e is AIX 5.2 .
<snip>
>We have load balancing with 5 iigcc's starting (one interesting point is
> the fact we now only appear to have one of them clocking up CPU time
> when looking at a ps -ef output).
Steve, what does the client netutil look like? Do you have merged
vnode entries on each client (II0, II1, II2, etc.)? Is each client
netutil configured separately, or do you use some kind of funky, shared
network client installation? I know you say that nothing has changed,
but if the installation is shared and one user changed the netutil
configuration, it might have affected all clients.
If you do have merged netutil entries, can you do a test connection to
each gcc within netutil from a remote client?
Another thing ... this is a long shot, but are you by any chance using
installation passwords? If so, check the following file:
$II_SYSTEM/ingres/files/name/IILTICKET_*. Does it seem really large and
growing? Is the iigcn process consuming an inordinate amount of CPU?
If so, do this:
ingstop -iigcn
rm $II_SYSTEM/ingres/files/name/IILTICKET_*
ingstart -iigcn
Then get back in touch with me. There is a known Ingres bug that can
bring the name server to a screeching halt when you have frequent
connections and use installation passwords: used installation password
"tickets" aren't correctly "cleaned up", causing the IILTICKET file to
grow forever. When it reaches "critical mass" (in my case, that was a
couple of hundred MB), iigcn CPU use soars and the GCN can no longer
respond adequately to remote connection requests. I wouldn't think
that 250 users would be enough to cause you problems (actually, I
wouldn't even think that you would need five iigcc's), but anything
is possible.
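
Before stopping the name server, the size of the ticket files can be
checked without touching them. A minimal sketch: big_ticket_files is a
hypothetical helper, and the byte limit passed to it is an arbitrary
figure chosen by the caller, not a documented Ingres threshold.

```shell
# List IILTICKET_* files in a directory whose size exceeds a byte limit.
# Prints "path size" for each offender; prints nothing if none are found.
# Both the directory and the limit are supplied by the caller.
big_ticket_files() {
    dir=$1
    limit=$2
    for f in "$dir"/IILTICKET_*; do
        [ -f "$f" ] || continue              # glob matched nothing
        size=$(wc -c < "$f" | tr -d ' ')     # size in bytes
        if [ "$size" -gt "$limit" ]; then
            echo "$f $size"
        fi
    done
}

# Usage on the server (not run here), flagging anything over ~100 MB:
#   big_ticket_files "$II_SYSTEM/ingres/files/name" 100000000
```

Running it twice a few hours apart also shows whether a file is still
growing.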
These are the only things that come to mind. If you are desperate and
really adventurous, you can activate GCN tracing, which might shed
some additional light on what's going on.
export II_GCA_LOG="mytracefile.out"
export II_GCN_TRACE=5
ingstop -iigcn
ingstart -iigcn
Good luck!
Jim Gramling
Rio de Janeiro
Still could be a ticket issue on the server side if you are using
installation passwords. Try bypassing the name server as Karl
suggested; also, take a look at "show comsvr *" in iinamu to see if
all five gcc's are actually registered.
Regards,
Jim Gramling
Rio de Janeiro
Hi Jim
Thanks for the help. What you said is very interesting, as some symptoms
are very similar. However, the problem does seem to be on the server
side. We have an overnight batch job on there that shuts down Ingres
each night as one of its first steps and ultimately kills off any
attached on-line users. The various jobs it then runs have failed
(hung) at various points on each of the last 3 nights.
The problem is more prevalent during the on-line day when the OpenROAD
users try to connect (this fails with what I assume are the resulting
"outbound" errors in the server errlog.log; these are very numerous
during the day). After a restart of Ingres the problem does not
appear for anywhere from 5 to 40 minutes, but even
in that time, although all the users seem to be working fine, if
you run IPM it gets into the first menu but you cannot run
"Server_List"; it just hangs at "Loading server list". I suspect in fact
the problem IS present when Ingres starts; it is just not obvious to
the end users until they can't log in again after an application
time-out or when they come back from lunch. This is a government site
and the environment is pretty well locked down, so there is little they
can do as users to mess things up.
I've tried by-passing the name server when the problem is present and
it just hangs.
I know it seems odd, but I still wonder if this may be linked to the
licence problem I posted about, which CA notified in Dec 06. CA have been
very vague in the document as to what the ramifications of this
problem are (it looks a bit deliberately vague to me). Half-baked it
might be, but could it be that the old licence "rules" have sneaked back
into SOME of the Ingres products on CERTAIN platforms from version to
version?
Currently the plan is to put in the new licence and, if that fails, raise
it with CA, but obviously your suggestion on tracing will need to be
tried (when I can next get access...).
Thanks again,
Steve
Hi
Eventually got the chance to run iinamu on this server whilst the
problem was present, and I could get into it. When I ran show comsvr it
hung there for a while then timed out with E_GC0020_TIME_OUT Time_out
expiration: service request incomplete. This is still a major
problem. Ingres Corp are looking at it under issue 115344, but any
further ideas would be most welcome. Still looking to get GCN_TRACE
run at some point.
Steve
Hi
Thought it worth updating this. Ingres suggested the following to see
if it would get us up and running and check out if TCP/IP sockets were
the problem :-
<<<
I would therefore suggest that we switch from using TCP/IP sockets for
inter-process communication to unix sockets. In order to do this, do
the following:
ingstop
ingsetenv II_GC_PROT unix
ingsetenv II_GC_PORT /tmp/ingsockets/
(optional - make sure this directory exists and is writeable by
Ingres. Also make sure you include the trailing /. If you don't set
II_GC_PORT then the socket files are created in /tmp)
ingstart
>>>
Since doing this (24hrs now) there have been no reported problems, so
as far as the end users are concerned this is fixed.... We've asked
those looking after the AIX server to investigate the TCP/IP AIX side
of things.
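
For anyone repeating the steps quoted above, the II_GC_PORT value can
be sanity-checked before running ingsetenv. This is only a sketch of
the three conditions in the quoted note (trailing slash, directory
exists, directory writeable); check_socket_dir is a made-up helper and
not part of Ingres.

```shell
# Check a candidate II_GC_PORT value before running ingsetenv.
# Prints "ok: <dir>" and returns 0 if the value looks usable, otherwise
# prints a reason and returns non-zero.
check_socket_dir() {
    dir=$1
    case $dir in
        */) ;;                                        # trailing slash present
        *)  echo "missing trailing /: $dir"; return 1 ;;
    esac
    [ -d "$dir" ] || { echo "not a directory: $dir"; return 1; }
    [ -w "$dir" ] || { echo "not writable: $dir"; return 1; }
    echo "ok: $dir"
}

# Usage (not run here), as the user that owns the installation:
#   check_socket_dir /tmp/ingsockets/
```

The writeable test only means something when run as the account that
starts Ingres, since it checks the permissions of the calling user.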
Steve