
suddenly, storage limit was exceeded


Serg Anitshenko

Aug 16, 2000
Hi,

I never thought this situation could take place. But...
Last night, the storage on one of my test systems filled up.
I clearly remember the used storage was at 75%
when I left my office. But at two in the morning the system went down.
The used storage was 99% (last message in the history log).
After the IPL the system was at 72%. I cannot find any trace of huge
objects being created on the system by examining the audit journal.
I checked everything I knew: journal receivers, often-used files and objects.

What can I use to find any traces???
I don't want to live in the office. :)

serg.


Bryan Douglas-Henry

Aug 16, 2000
I had a case similar to this (fortunately caught it before the system fell
over)...

We had a CL program running in batch, which called two queries.
The first query created a work file in QTEMP for the second query to report
on (summary of a summary)...
The first query had a bad join and created a 60 GB work file in QTEMP.
This space was then made available again when the job ended (QTEMP is used
only for that job and gets deleted when the job ends).

Regards,

Bryan


"Serg Anitshenko" <a...@fuib.com> wrote in message
news:8nef80$22n$1...@leda.fuib.com...

Jeffrey Flaker

Aug 16, 2000
Check the QHST logs (DSPLOG).

Beverly

Aug 16, 2000
This happened to me a few months ago - a user had a query run a report with
the *NOMAX limit. Check the output queues for something HUGE.. BevH

"Serg Anitshenko" <a...@fuib.com> wrote in message
news:8nef80$22n$1...@leda.fuib.com...

Hei

Aug 17, 2000
Could there have been a looping program generating files in QTEMP?

Jan Gerrit Kootstra

Aug 17, 2000
Serg.,


If you want to be able to monitor the behaviour of your system 24 hours a
day, 7 days a week, without living in your office, you could try a
monitoring tool like Gensys, made by SPS.

Look at www.gensys.nl, where you'll find some information about the
concept of the tool. At our company we have had great experience with the
tool.


Regards,


Jan Gerrit Kootstra
Pink Elephant Business Online Services

Charles R. Pence

Aug 17, 2000
Serg Anitshenko wrote:
> I have never thought that this situation can take place. But...
> Last night, the storage on one of my test system was filled.

> I remember clearly the percent of used storage was 75%
> when I left my office. But at two in the morning the system went off.
> The used storage was 99% (last message in history log).
> After the IPL the system was at 72%. I cannot find any trace of huge
> objects being created on the system by examining the audit journal.
> I checked everything I knew: journal receivers, often-used files and
> objects
>
> What can I use to find any traces???
> I don't want to live in the office. :)

The quickest way to narrow down the information IMO is by *USRPRF;
DSPUSRPRF to *OUTFILE, and compare to a previous instance. IIRC,
the storage threshold condition can be monitored by a break handling
program on a [user created] QSYS/QSYSMSG *MSGQ on old releases;
otherwise refer to WRKSYSVAL QSTG*. Such a program can spool all
kinds of information for later review: WRKACTJOB, WRKSYSSTS, WRKDSKSTS
and even start a performance trace to probably track exactly which
jobs are currently taking up storage. With a programmed response,
no reason to be in the office at 02.00 :-)
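
The *USRPRF comparison could be sketched roughly as below. This is only a
sketch: the library OBJMON and the file names STGTODAY/STGLAST are made up,
and the exact storage-used column name should be taken from the outfile's
model file, QADSPUPB.

```
/* Snapshot every profile into an outfile (model file QADSPUPB)           */
DSPUSRPRF  USRPRF(*ALL) TYPE(*BASIC) OUTPUT(*OUTFILE) +
             OUTFILE(OBJMON/STGTODAY)

/* Compare against yesterday's snapshot with Query/400 or SQL and look    */
/* for profiles whose owned storage jumped, e.g. (check the field names   */
/* against QADSPUPB on your release):                                     */
/*   SELECT A.UPUPRF FROM OBJMON/STGTODAY A, OBJMON/STGLAST B             */
/*    WHERE A.UPUPRF = B.UPUPRF                                           */
/*      AND A.<storage-used> > 2 * B.<storage-used>                       */

/* Keep today's snapshot as tomorrow's baseline                           */
CPYF       FROMFILE(OBJMON/STGTODAY) TOFILE(OBJMON/STGLAST) +
             MBROPT(*REPLACE) CRTFILE(*YES)
```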

In my experience the problem typically originates in a poorly designed
query which produced a large result; the query failed abnormally
[a joblog should be available from around when 100% was near], and
the storage was freed. From your description, however, it appears more
to be a job creating [probably in an unexpected loop condition] a
large amount of temporary objects; objects whose storage is reclaimed
during an IPL -- might there be a similar joblog?? I cannot infer much,
because I didn't understand "the system went off."

I find that a review of WRKSPLF *ALL OUTPUT(*PRINT), and of the spool
data generated during the high-percentage period, is sometimes worthwhile.

Regards, Chuck
All comments provided "as is" with no warranties of any kind whatsoever.

Richard Jackson

Aug 17, 2000
Just a possibility: did someone leave database monitor running?

If it happens again, drop me a note. There are ways ...

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Richard Jackson
Mailto:richard...@richardjackson.net
http://www.richardjacksonltd.com
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

"Serg Anitshenko" <a...@fuib.com> wrote in message
news:8nef80$22n$1...@leda.fuib.com...

Serg Anitshenko

Aug 18, 2000
"Charles R. Pence" <crp...@vnet.ibm.com> wrote in message
news:399C13CA...@vnet.ibm.com...

>
> The quickest way to narrow-down the information IMO is by *USRPRF;
> DSPUSRPRF to *OUTFILE, and compare to a previous instance. IIRC,
> the storage threshold condition can be monitored by a break handling
> program on a [user created] QSYS/QSYSMSG *MSGQ on old releases;
> otherwise refer to WRKSYSVAL QSTG*. Such a program can spool all
> kinds of information for later review: WRKACTJOB, WRKSYSSTS, WRKDSKSTS
> and even start a performance trace to probably track exactly which
> jobs are currently taking up storage. With a programmed response,
> no reason to be in the office at 02.00 :-)

Yeah... I have a little alarm system that sends short messages to a
mobile phone or a beeper, but when system storage reaches 90% the MSF
jobs end and the messages aren't sent to the SMTP server... So even
though I was ready to go to my office, I didn't know about the problem.
And the night staff were asleep at the moment....... :)

> In your description however, it appears more
> to be a job creating [probably in an unexpected loop condition] a
> large amount of temporary objects; objects whose storage is reclaimed
> during an IPL -- might be a similar joblog?? I can not infer well,
> because I did't understand "the system went off."

The system ended all subsystems and proposed to take a dump; after that
it did an IPL.
Now I am looking into the dump, and it seems I need to check all 200 jobs
in the dump manually. Are there any tools, other than SST, to examine
the dump information????

>
> I find a review of WRKSPLF *ALL OUTPUT(*PRINT) and review of spool
> data generated during the high % time sometimes is worthwhile.
>
> Regards, Chuck
> All comments provided "as is" with no warranties of any kind whatsoever.

serg.


Richard Jackson

Aug 18, 2000
Dump analysis using Display/Alter is really hard. If I were you, I would
prepare myself to accept the idea that you won't figure it out this time.

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Richard Jackson
Mailto:richard...@richardjackson.net
http://www.richardjacksonltd.com
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

"Serg Anitshenko" <a...@fuib.com> wrote in message

news:8njhbm$agk$1...@leda.fuib.com...

Serg Anitshenko

Aug 19, 2000
"Richard Jackson" <richard...@attglobal.net> wrote in message
news:399da...@news1.prserv.net...

> Dump analysis using Display/Alter is really hard. If I were you, I would
> prepare myself to accept the idea that you won't figure it out this time.
>

Why not? Only time will help me.

serg.

Dirk Dedapper

Aug 19, 2000
There is a command, GO DISKTASKS, that collects information and generates
a listing of all objects stored in every library. This could give you the
opportunity to find the largest objects.

Also check the QRPLOBJ library....
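
For reference, the commands behind that menu can also be run directly.
A sketch from memory -- check the prompts on your release:

```
RTVDSKINF                      /* collect disk space data; long-running,  */
                               /* usually submitted to batch overnight    */
PRTDSKINF  RPTTYPE(*LIB)       /* report space used by library            */
PRTDSKINF  RPTTYPE(*OBJ)       /* or by object, to spot the single        */
                               /* largest objects                         */
```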

Serg Anitshenko wrote in message <8nef80$22n$1...@leda.fuib.com>...

Richard Jackson

Aug 19, 2000
I don't understand what you mean, "only the time will help me". Does this
mean that you only need to know the time when the dump was taken? I'm
sorry, I don't understand.

Let's consider what a main storage dump is, the dump contains and does not
contain, and how you might use it to debug the problem. The dump is taken
because an exception occurred that terminates machine processing. There
aren't many exceptions like that (hundreds) but they are all interesting. A
main storage dump contains the contents of main storage and the CPU state
when the exception occurred. If you select the correct option, it may also
contain some virtual main storage.

You might ask, "doesn't the execution of the main storage dump code overlay
the current CPU state?" Yes and no. The "machine exception" errors are
recognized by the hardware. When the exception is signaled, the hardware
saves the CPU state onto a microcode-level call stack then starts the
exception handling and the dump program. The main storage dump program
knows how to reference the saved information.

Let me use a couple of examples:

Suppose that object X (a database file) is the thing you want to know about
because it was growing uncontrollably.

Suppose that some job inserted one record into X, that X needed to be
extended to accept the new record, that there was no more space available,
that the storage management space extend routine signaled the machine
exception, and the dump is taken as indicated above.

Now suppose the same conditions and that the same job inserted billions of
records into file X while occasionally sending a message to the QSYSOPR
message queue. This time, storage management is extending QSYSOPR when it
runs out of space.

Let's compare the two examples.

Approach number 1: find the object that caused the failure - In the first
case, we will see some routine name like "#SM_extend" running and pointing
to object X, and the job containing pointers to X and QSYSOPR. (There is
some storage management program that does this and it will have a name
something like "#SM_extend" but I made up this name.) In the second case,
we see "#SM_extend" running and pointing to QSYSOPR, and the job containing
pointers to X and QSYSOPR. If we conclude that X was the problem in the
first case, is it valid to conclude that QSYSOPR is the problem in the
second case? No, it isn't. The dump will contain thousands of pointers to
objects, but the dump won't tell us how big any of those objects are. And
even if it could tell us which object had grown a lot in the last two
hours, I wouldn't know where in the dump to look for that.

Approach number 2: find an object that is active in the dump that is too big
right now - You might want to go back and check each object in the dump.
When you looked at an object, perhaps you could recognize an object that
should be small but is currently huge. Many of the pointers in the dump
will point to temporary objects. Temporary objects are deleted when you
IPL. You can't check their sizes at all but if you could, you wouldn't be
checking the size of the ones that existed at the time of the failure. The
ones that you see now are new, created since the IPL.

Other problems:

Object X might be closed and all pointers to it could be removed from
memory.

All the pointers to object X could be on memory page I and page I could be
part of the PAG pushed out to disk and not present in the dump.

So how does one discover what object is growing?

There is a toolset called SMTRACE. Among other things, it will capture
calls to the storage management programs that create segments, extend
segments, truncate segments, and destroy segments and the objects that they
are operating on at each call. You would probably be stunned by the number
of calls to these routines on a busy system.

First, the good news. If you can recreate the scenario and run this tool,
you can probably figure out what is extending and causing the problem.

Second, the bad news. You have indicated that you don't know what
conditions cause this problem. There aren't very many people who can
perform the analysis. It may not be obvious what you have to change to make
the growth stop. The tool uses some disk space and could turn a marginal
situation into a bad situation.

I talk too much. Is any of this information useful to you?

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Richard Jackson
Mailto:richard...@richardjackson.net
http://www.richardjacksonltd.com
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

"Serg Anitshenko" <a...@fuib.com> wrote in message

news:8nm1bo$ed0$1...@leda.fuib.com...

Serg Anitshenko

Aug 21, 2000
Hi Richard,

"Richard Jackson" <richard...@attglobal.net> wrote in message

news:399eb...@news1.prserv.net...


> I don't understand what you mean, "only the time will help me". Does this
> mean that you only need to know the time when the dump was taken? I'm
> sorry, I don't understand.
>

< skip...>

> Approach number 2: find an object that is active in the dump that is too
> big right now - You might want to go back and check each object in the
> dump. When you looked at an object, perhaps you could recognize an
> object that should be small but is currently huge. Many of the pointers
> in the dump will point to temporary objects. Temporary objects are
> deleted when you IPL. You can't check their sizes at all but if you
> could, you wouldn't be checking the size of the ones that existed at the
> time of the failure. The ones that you see now are new, created since
> the IPL.
>

< skip...>

>
> I talk too much. Is any of this information useful to you?
>

Thanks for your letter.
I think that in my situation there were some temporary objects that were
deleted at the last IPL. I don't yet know exactly which job owned these
objects, but it seems it was a Java application that uses objects in the
IFS. Now, as a first step, I have reduced the maximum allowed storage for
the user profiles that start those applications.
I think the memory dump must contain some information about links to the
object that was growing. The goal of the analysis is only to find any
links to, or the names of, these objects.
And the time I mentioned is the time this analysis will take.
Sorry for my English.

Richard Jackson

Aug 21, 2000
The main storage dump contains a very large number of pointers so it
probably does contain a pointer to the object that created the problem. The
trick is to figure out which object/pointer is the failing one. That will
probably require some detailed knowledge about operating system components
and might require compile listings.

If you have a small number of possible failing applications, find a place
inside each one where it creates a file or increases the size of a file
in the IFS - open or fprintf or the appropriate verbs for your language
of choice. At the point where the object is created or increased in size,
send a "trace point message" identifying the trace point sending the
message, the date, and the time to a message queue or IFS file. Watch the
file/message queue. If something goes crazy, you will see hundreds of
messages from the same trace point very close together in time. Once you
figure out which trace point is causing the problem, figure out why that
function is looping or being called so often. This technique is crude but
effective.
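
As a sketch of that trace-point idea in CL (the program name TRCPNT, the
message queue MYLIB/TRACEQ, and the trace-point id are all made up for
illustration):

```
PGM        PARM(&POINT)
  DCL      VAR(&POINT) TYPE(*CHAR) LEN(10)
  /* One message per event; the message queue timestamps each arrival,  */
  /* so a runaway caller shows up as a burst from a single trace point. */
  SNDMSG   MSG('Trace point' *BCAT &POINT) TOMSGQ(MYLIB/TRACEQ)
ENDPGM
```

Call it just before every spot that creates or extends a file, e.g.
CALL PGM(MYLIB/TRCPNT) PARM('EXTEND01'), and watch the queue with
DSPMSG MSGQ(MYLIB/TRACEQ).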

Regarding your comment, "Only time will help me" - if you are willing to
work hard for a long time, you will eventually figure it out. That approach
has always worked for me.


--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Richard Jackson
Mailto:richard...@richardjackson.net
http://www.richardjacksonltd.com
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

"Serg Anitshenko" <a...@fuib.com> wrote in message

news:8nqtde$buu$1...@leda.fuib.com...

Charles R. Pence

Aug 21, 2000
Serg Anitshenko wrote:
> "Charles R. Pence" <crp...@vnet.ibm.com> wrote in message
> > The quickest way to narrow-down the information IMO is by *USRPRF;
> > DSPUSRPRF to *OUTFILE, and compare to a previous instance. IIRC,
> > the storage threshold condition can be monitored by a break handling
> > program on a [user created] QSYS/QSYSMSG *MSGQ on old releases;
> > otherwise refer to WRKSYSVAL QSTG*. Such a program can spool all
> > kinds of information for later review: WRKACTJOB, WRKSYSSTS, WRKDSKSTS
> > and even start a performance trace to probably track exactly which
> > jobs are currently taking up storage. With a programmed response,
> > no reason to be in the office at 02.00 :-)
>
> Yeah... I have a little alarm system that sends short messages to a
> mobile phone or a beeper, but when system storage reaches 90% the MSF
> jobs end and the messages aren't sent to the SMTP server... So even
> though I was ready to go to my office, I didn't know about the problem.
> And the night staff were asleep at the moment....... :)

Because you have a dependency on MSF, and the very condition you need to
detect is one that can disable MSF, you should remove the condition that
disables MSF. Use STRSST, work with disk units, configure disk units,
and then work with the ASP threshold. Set the ASP threshold higher than
the WRKSYSVAL QSTG* settings.
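
On releases that have them, the WRKSYSVAL QSTG* values can be set so the
warning fires while MSF is still alive. A sketch; the 10% figure below is
only an example, and the value formats should be checked on your release:

```
WRKSYSVAL  SYSVAL(QSTG*)                  /* review the storage limits    */

/* Warn while 10% of auxiliary storage is still unused, i.e. well before  */
/* the 90%-used point where the MSF alert path itself dies.               */
CHGSYSVAL  SYSVAL(QSTGLOWLMT) VALUE('10.0000')
CHGSYSVAL  SYSVAL(QSTGLOWACN) VALUE('*MSG')   /* CPF0907 to QSYSOPR and,  */
                                              /* if present, QSYS/QSYSMSG */
```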
