I have a Tomcat app using a v4r5 AS/400 as a database server via JDBC.
This will work fine for days at a time, and then we'll get a call from
the users saying that the system's frozen. When we investigate we find
that one of the QZDASOINIT jobs appears to have gone into a loop - it's
clocking up CPU seconds like fury, but doing no I/O. The job doesn't
seem to be holding any locks, and there are other QZDASOINIT jobs
sitting idle, but all remote SQL has stopped. I have seen it stop the
entire system for half an hour with nothing timing out, so it's
obviously not a normal locking problem.
If we end the QZDASOINIT job, that sometimes cures the problem, but
sometimes it just causes another QZDASOINIT job to start exhibiting the
same behaviour. The only reliable fix we've been able to find is to
end ALL the QZDASOINIT jobs - this seems to resolve the problem for a
day or two.
Has anybody seen anything like this, or can you suggest something that
might help to alleviate it? I'm tearing my hair out...
Wayne
I haven't noticed any similarities in the jobs - when this happens I
have a live application completely locked, so my priorities tend to be
different! I'll see if I can spot anything next time it happens.
Thanks
I have seen this on two different systems: the first at V5R1, the
latest a very large system at V5R3. My compiles were taking a long
time, so I checked the system and saw the QZDASOINIT job taking all
the CPU. When I told operations about it they killed it right away.
Apparently they see it often enough that killing the job is a routine
procedure.
-Steve
Possible. I know the first time I saw it on V5R1 it was a bug because
I could reproduce it. In this recent case I am just relaying what the
operator did. The job had been running for over an hour on the test
partition - I doubt it was running legitimately.
-Steve
Any ideas out there why this could be happening? Could this have
something to do with our current OS version? Do we need a PTF? We've
got a new iSeries box with V5R3. I'll test this out there and see if I
can recreate the situation.
Erick
Looks like it is building a rather huge access path on a box with a slow CPU.
The database monitor (it has to be started for all jobs before the problem
happens, so expect tons of data!) should show up the problem.
Dieter
I'll fire up DBMON and check it out, though, just in case.
Incidentally, I reconfigured the 400 to start no more than 5 QZDASOINIT
jobs, and to use each one only once. I similarly configured Tomcat to
use only 5 pooled connections, and to allow only one idle connection.
The result of this has been pretty much as you'd expect - the
QZDASOINIT jobs die and respawn very frequently, and the application
feels very slightly slower overall, in that things that used to happen
instantly (i.e., within a second) now sometimes show a short pause (i.e.,
1-2 seconds). This is not great, but we have had no recurrence of the
looping problem since - I'll take the performance hit over complete
system lockup every two days quite happily!
I would like to get to the bottom of it though - this application is my
baby and I don't want it artificially slowed down by a load of job
start overhead, particularly if it might only be alleviating a problem
that will crop up anyway...
Wayne
Just another idea: do you use extended dynamic (a property of the connect
string)? There have been problems with this in more than one version of
the Toolbox drivers. If you do, either delete the package rather frequently
or disable extended dynamic; it's nearly useless anyway.
Dieter
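For anyone wanting to try the extended dynamic suggestion, here is a rough sketch of a connect string with the property switched off explicitly. It assumes the Toolbox driver's usual URL layout (jdbc:as400://system/library;property=value) and its "extended dynamic" property. The host name myas400, the library MYLIB and the class As400Url are made up for illustration. As far as I know the property is off by default, so this only changes anything if your existing connect string turns it on.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: only builds the URL string, it does not open a connection.
class As400Url {

    // Assumed layout: jdbc:as400://host/library;key=value;key=value
    static String build(String host, String library, Map<String, String> props) {
        StringBuilder sb = new StringBuilder("jdbc:as400://")
                .append(host)
                .append('/')
                .append(library);
        for (Map.Entry<String, String> e : props.entrySet()) {
            sb.append(';').append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    // Connect string with extended dynamic (SQL package support) disabled.
    static String withoutExtendedDynamic(String host, String library) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("extended dynamic", "false");
        return build(host, library, props);
    }

    public static void main(String[] args) {
        // myas400 and MYLIB are placeholders, not real names.
        System.out.println(withoutExtendedDynamic("myas400", "MYLIB"));
    }
}
```

The resulting string would go wherever the Tomcat datasource currently gets its JDBC URL. If you would rather keep extended dynamic on, the other option above (deleting the package from time to time) needs no change on the Java side.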