I just arrived at a shop where there are monthly performance problems.
At every end of the month, endusers are running reports (business
objects) and Peoplesoft batches.
Problem is, that there is no overview with the DBA-team of the
database use, e.g. :
- a nightbatch was accidentally started during daytime, the DBA team
was unable to figure this out. (enduser found out about this one)
- an other batch exeeded it's running time by more than 200% - DBA
team found out when looking for something else.
- the same report is ran by different users
- there are no ad hoc queries , but non of the batches and reports are
examined for performance.
etc.
I wonder how to get control over this situation and introduce a
proactive approach.
E.g. , I need an overview of the database use, and as a second target,
be able to quickly identify reports/ batches that are running amok.
If anyone has pointers on how they've dealt with such a situation, i
would be glad to hear.
Solaris, Oracle 9.2.0.5, app server
Thank you for your time,
H.
> hello everyone,
>
> I just arrived at a shop where there are monthly performance problems.
> At every end of the month, endusers are running reports (business
> objects) and Peoplesoft batches.
> Problem is, that there is no overview with the DBA-team of the
> database use, e.g. :
>
> - a nightbatch was accidentally started during daytime, the DBA team
> was unable to figure this out. (enduser found out about this one)
>
That seems a bit surprising. We have had similar problems at times and
our DBAs have always been able to determine what was consuming all the
resources.
> - an other batch exeeded it's running time by more than 200% - DBA
> team found out when looking for something else.
>
> - the same report is ran by different users
>
> - there are no ad hoc queries , but non of the batches and reports are
> examined for performance.
>
> etc.
>
> I wonder how to get control over this situation and introduce a
> proactive approach.
>
> E.g. , I need an overview of the database use, and as a second target,
> be able to quickly identify reports/ batches that are running amok.
>
> If anyone has pointers on how they've dealt with such a situation, i
> would be glad to hear.
>
>
> Solaris, Oracle 9.2.0.5, app server
>
> Thank you for your time,
1. If you have access to the code and if its using PL/SQL or Java, consider
adding dbms_application_info statements. this can help the DBAs identify
what is going on.
2. See if you can schedule the regular reports rather than allowing end
users to execute them whenever they want. for example, if you know that
some end users need a particular report on the last Thursday of each
month and if it doesn't need to be run immediately, schedule the report
to run during a low demand period, such as 3am (watch out for backups
etc).
What we ended up doing was implementing another layer between the app
server and the database. this layer established a hierarchical form of
access control. End users were still able to request reports, but unless
they had the necessary access rights, their report would be scheduled to
run during a low demand period. some senior staff had the authority to
execute a report immediately. In most cases, staff had no problems with
having to wait for a report (it did require some change management and
some users complained because they werre now required to 'think
ahead'. However, the outcome did mean that they actually got their
reports and usually got them at an expected time. Previously, there was
a lot of variation - once they got use to having to 'order' reports
earlier, they appreciated the fact that the reports turned up in a more
predictable way).
Don't just consider technical solutions to the problem. Often, such
problems are based around poor business practices. For example, we had a
problem where the payroll report was taking far too long to run and it
was getting worse every month. We did everything we could to tune things
and get it running fast enough, but we soon reached a point of
diminishing returns.
Further investigation found that the report was doing payment analysis
for over 40,000 employees. We only have around 5,000. Some more analysis
showed that the HR department had adopted procedures that didn't fit
well with how the software worked. Rather than terminating a position
when someone left, they updated it so that the employees records
appeared to represent an active employee who was not getting paid at the
moment. they had adopted this procedure because we have quite a high
turn-over of casual staff and they wanted to make the re-hiring of
casual staff easier. As a result, the payruns were taking longer and
longer to execute. The situation became worse after an application
upgrade. It turns out, the previous version had a bug that was fixed in
the new version. The fixing of the bug resulted in correct processing of
employee pay entitlements and as we had what appeared to be 40,000 staff
that could be entitled to pay, the system now processed all those
records.
this was a difficult one to resolve because it coincided with an
applicaiton upgrade and a move to a new version of Oracle running under
RAC. As we had moved to a new version of oracle and now had more
processing power, the problem was not picked up in testing - reports
were initially running within acceptable limits. However, after a few
months, the processing time had really blown out. There were also some
failures on the part of the HR systems administrator who was neglecting
to run monthly 'clean-up' jobs. Again, a process problem rather than a
technical problem.
--
tcross (at) rapttech dot com dot au
I do not find the fact that the DBA's have difficulty finding the
problem surprising. Often by the time word of a performance problem
reaches the DBA the problem can be gone.
There are several ways to attack an overall month-end performance
problem. One method would be to schedule or manually execute a series
of short statspack snapshots during a problem time period. Then the
high logical IO SQL could be exaimined from the reports. Tying
specific SQL statements to source programs can however be difficult.
You do have the option of generating outlines to tune the identified
statements.
Another approach would be to monitor the SGA during a problem period
looking for high cost SQL and since the SQL is currently running tie
it to the source program. Now you know the source so you can apply
coding changes directly, if desired and supported.
Identifying key jobs based on the customer input and also on schedule
dependencies and setting SQL trace on for these jobs is probably a
more logical approach. The trace files can give you very specific
information about the critical tasks and the bottlenecks in the
tasks. Getting customer input can sometimes help with customer
relations. The customer can be more willing to live with a problem if
the customer knows you are actively working on solving the problem
even if it is going to take some time to resolve the issues.
HTH -- Mark D Powell --
>=10g. I know you don't want to hear that, and you do have some things available like statspack, as Mark suggested.
However, I swear I heard the alleluia chorus the first time I looked
at the v$ views and AWR in 10g.
We recently completed a total model of our application based on
resource utlization using the AWR. We determined our highest CPU
consuming statements over a 90 day window of half hour AWR snapshots.
Some people think that is too infrequent, but for modeling it is
perfect.
As noted, we tuned the bad code, and for the remainder, built a model
that allows us to predict capacity requirements down to within 3% or
so of host utilization based on a given workload. We did this by
determining the largest total CPU consumer statements over time, and
talked to the developers to understand what drives their execution
based on business logic. We then used that to plug into our model and
predict CPU.
For example, we may average 16 milliseconds of CPU per execution to
execute a given statement 200,000 times per half hour period. Once we
understand the origin of that statement and its relationship to
others, we can predict how often it will be executed based on a given
business workload. From there we can predict our CPU utilization
(3,200 seconds).
If the business load increases, we can use high level business metrics
to predict host utilization based on the frequently executed
statements and their average utilization.
The issue with 9i is that some of the elements of the v$ structures
just aren't present in that version, such as CPU_ELAPSED. That one
was an absolute goldmine for us. I have long thought that the
instrumentation of Oracle's kernel is what sets them apart, and this
is a perfect example.