QZDASOINIT looping problems?

wayne_r

Oct 19, 2005, 6:13:28 PM

Hi,

I have a Tomcat app using a v4r5 AS/400 as a database server via JDBC.
This will work fine for days at a time, and then we'll get a call from
the users saying that the system's frozen. When we investigate we find
that one of the QZDASOINIT jobs appears to have gone into a loop - it's
clocking up CPU seconds like fury, but doing no I/O. The job doesn't
seem to be holding any locks, and there are other QZDASOINIT jobs
sitting idle, but all remote SQL has stopped. I have seen it stop the
entire system for half an hour with nothing timing out, so it's
obviously not a normal locking problem.

If we end the QZDASOINIT job, that sometimes cures the problem, but
sometimes it just causes another QZDASOINIT job to start exhibiting the
same behaviour. The only reliable fix we've been able to find is to
end ALL the QZDASOINIT jobs - this seems to resolve the problem for a
day or two.

Has anybody seen anything like this, or can you suggest something that
might help to alleviate it? I'm tearing my hair out...

Wayne

Jack Kingsley II

Oct 19, 2005, 7:00:13 PM

What OS version and what Client Access version? Sounds like a PTF candidate.
Are you using an IP address or a DNS name to get to the 400? Do the jobs
that lock up have any similarities, such as files in use?

wayne_r

Oct 20, 2005, 5:19:44 AM

The OS release is V4R5M0 - it's a customer's machine, and they're
stubborn about upgrading. I've asked them to check PTF levels. I
don't believe client access is part of the equation - we're using the
JDBC driver that comes as part of JTOpen 4.8. We point at the 400's IP
address.

I haven't noticed any similarities in the jobs - when this happens I
have a live application completely locked, so my priorities tend to be
different! I'll see if I can spot anything next time it happens.

Thanks

Steve Richter

Oct 20, 2005, 8:07:39 AM

I have seen this on two different systems: the first at v5r1, the
latest a very large system at v5r3. My compiles were taking a long
time, so I checked the system and saw the QZDASOINIT job taking all
the CPU. When I told operations about it they killed it right away.
Apparently they see it often enough that killing the job is a routine
procedure.

-Steve

jac...@tampabay.rr.com

Oct 20, 2005, 10:40:24 AM

Steve, you're killing a job; is it possible that someone is running
remote SQL or the like against large files and so degrading the system?
It could be the same thing for wayne_r.

Steve Richter

Oct 20, 2005, 11:26:30 AM

Possible. I know the first time I saw it on v5r1 it was a bug, because
I could reproduce it. In this recent case I am just relaying what the
operator did. The job had been running for over an hour on the test
partition - I doubt it was running legitimately.

-Steve

Erick

Oct 20, 2005, 8:03:12 PM

I've seen this problem happen. We have a Java application in a
WebSphere Application Server using a JDBC (JDBC driver ver 4.5)
connection to the iSeries (OS V4R5M0). In our case, the problem is
triggered by a stored procedure that accesses tons of data from
different tables. I'm able to recreate the problem after recompiling
this stored procedure. The first run after the recompile causes all
other QZDASOINIT jobs started from the web app to freeze, and their
status shows TIMW. It also affects other interactive jobs that use one
of the big files accessed by the stored procedure. I don't see any
locking problems, since none of the jobs go into LCKW status. The only
solution we have right now is to end the stored procedure and rerun it
during off-hours. Once it gets over the first run, everything seems to
work without a hitch.

Any ideas out there why this could be happening? Could this have
something to do with our current OS version? Do we need a PTF? We've
got a new iSeries box with V5R3. I'll test this out there and see if I
can recreate the situation.

Erick

Dieter Bender

Oct 21, 2005, 4:44:33 AM

Hi,

Looks like a rather huge access path being built, and a box with a poor
CPU. The database monitor (started for all jobs before the problem
happens - expect tons of data!) should show up the problem.

Dieter

wayne_r

Oct 21, 2005, 5:46:13 PM

I can see that this might be true in Erick's case, but my application
only has five or six core tables, and none of them contain more than
about 50,000 rows at the moment. Unless I have the mother of Cartesian
joins hiding away in my SQL that has somehow never been seen in testing
(which, of course, is a possibility), I can't see how DB2/400 would be
stopped dead for 30 minutes by ANY query over such modest data volumes.

I'll fire up DBMON and check it out, though, just in case.

Incidentally, I reconfigured the 400 to start no more than 5 QZDASOINIT
jobs, and to use each one only once. I similarly configured Tomcat to
use only 5 pooled connections, and to allow only one idle connection.
The result of this has been pretty much as you'd expect - the
QZDASOINIT jobs die and respawn very frequently, and the application
feels very slightly slower overall, in that things that used to happen
instantly (ie, within a second) now sometimes show a short pause (ie,
1-2 seconds). This is not great, but we have had no recurrence of the
looping problem since - I'll take the performance hit over complete
system lockup every two days quite happily!

I would like to get to the bottom of it though - this application is my
baby and I don't want it artificially slowed down by a load of job
start overhead, particularly if it might only be alleviating a problem
that will crop up anyway...

Wayne
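
To illustrate why the settings Wayne describes (at most 5 pooled
connections, at most 1 idle) make the QZDASOINIT jobs die and respawn so
often, here is a minimal model of a bounded pool in plain Java. It is a
sketch under assumptions, not Tomcat's actual pool: the class
TinyPoolSketch, the FakeConnection stand-in, and the choice to throw
when the pool is exhausted (a real pool would typically wait) are all
hypothetical and exist only for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of a connection pool with a cap on borrowed
// connections and a cap on idle ones. Not Tomcat's implementation.
final class TinyPoolSketch {
    // Stand-in for a JDBC connection; in the thread's setup each real
    // connection would be served by one QZDASOINIT job on the host.
    static final class FakeConnection {
        boolean closed = false;
    }

    private final int maxActive; // most connections borrowed at once
    private final int maxIdle;   // most connections kept open while unused
    private final Deque<FakeConnection> idle = new ArrayDeque<>();
    private int active = 0;
    int opened = 0;      // total connections ever created
    int closedCount = 0; // total connections closed on return

    TinyPoolSketch(int maxActive, int maxIdle) {
        this.maxActive = maxActive;
        this.maxIdle = maxIdle;
    }

    synchronized FakeConnection borrow() {
        if (active >= maxActive) {
            // Simplification: fail fast instead of waiting for a free slot.
            throw new IllegalStateException("pool exhausted");
        }
        FakeConnection c = idle.pollFirst();
        if (c == null) {          // nothing idle, so open a new one
            c = new FakeConnection();
            opened++;
        }
        active++;
        return c;
    }

    synchronized void giveBack(FakeConnection c) {
        active--;
        if (idle.size() < maxIdle) {
            idle.addFirst(c);     // keep it open for reuse
        } else {
            c.closed = true;      // over the idle cap: close it
            closedCount++;
        }
    }

    public static void main(String[] args) {
        TinyPoolSketch pool = new TinyPoolSketch(5, 1);
        FakeConnection a = pool.borrow();
        FakeConnection b = pool.borrow();
        FakeConnection c = pool.borrow();
        pool.giveBack(a);
        pool.giveBack(b);
        pool.giveBack(c);
        // prints "opened=3 closed=2"
        System.out.println("opened=" + pool.opened + " closed=" + pool.closedCount);
    }
}
```

With a burst of three concurrent requests, only one connection survives
the return; the next burst has to open two new ones. Combined with
single-use server jobs on the 400, that is consistent with the short
pauses Wayne reports.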

Jack Kingsley II

Oct 21, 2005, 6:04:19 PM

Wayne, is this using any Java? Could you have a runaway end-user request
that got kicked out and left the port open? What about looking at the
subsystem that the jobs are running in? Just looking for more clues.
Another thing that might be good to know: what are you launching remotely
to create the job on the 400? A performance tools capture might be
worthwhile as well.

Dieter Bender

Oct 22, 2005, 5:30:44 AM

wayne,

Just another idea: do you use extended dynamic (a property of the
connection string)? There have been problems with this in more than one
version of the Toolbox drivers. If so, delete the package rather
frequently, or disable extended dynamic; it's nearly useless anyway.

Dieter
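
For anyone wanting to try Dieter's suggestion, a minimal sketch of
building a JTOpen-style connection string with extended dynamic switched
off. The "extended dynamic" property name and the semicolon-separated
URL form are how the Toolbox JDBC driver is normally configured, but
check the property list for your driver version. The class name
As400UrlSketch, the helper buildUrl, and the host 192.0.2.10 are
hypothetical. As far as I know the driver's default is already false, so
this only matters if the existing URL sets the property to true.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; it only builds the URL string
// and needs no driver on the classpath.
final class As400UrlSketch {
    // Produces jdbc:as400://host;name=value;name=value
    static String buildUrl(String host, Map<String, String> props) {
        StringBuilder url = new StringBuilder("jdbc:as400://").append(host);
        for (Map.Entry<String, String> e : props.entrySet()) {
            url.append(';').append(e.getKey()).append('=').append(e.getValue());
        }
        return url.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the properties in insertion order.
        Map<String, String> props = new LinkedHashMap<>();
        props.put("extended dynamic", "false"); // turn off SQL package support
        // 192.0.2.10 is a placeholder address, not a real host.
        System.out.println(buildUrl("192.0.2.10", props));
    }
}
```

The resulting string would be passed to DriverManager.getConnection
along with the user and password, or set as the url attribute of the
Tomcat data source.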