DB: Oracle Database 10.2.0.1.0, Enterprise Edition
OS: MS Windows Server 2003, Enterprise Edition (5.2.3790), SP1
App: Visual Studio 2003 (7.1.6030) C#, ODP.Net 10.2.0.1.0
I had been tasked with fixing a pretty severe bug a few weeks ago in a
couple of stored procedures in our application schema. The bug is fixed
now and the procs functional, but in the process of implementing the fix
I ran into an unexpected and (what I consider) strange speed/hang issue.
I essentially want to know if anyone has some insight into why the proc
hung based on their knowledge of how oracle handles cursors+loops+etc
under the hood. Details below.
This particular database schema design has the concept of an anchor or
parent table that is the container for multiple "versions" of a dataset.
The 2 procedures in question take in an old version id and a new version
id and each copy a different subset of that given version's data into
the new "version" (e.g., selecting from and inserting into the same
tables). All stored procs and tables etc are in the same SCHEMA. Also,
this is a development database, and I'm the only one logged into it
while this dev/testing is going on.
Headers (NOTE, names have been changed throughout):
create or replace procedure PR_Copy_Groups (P_To_Version_ID in RAW,
P_From_Version_ID in RAW)
create or replace procedure PR_Copy_Other_Groups (P_To_Version_ID in RAW,
P_From_Version_ID in RAW)
The "PR_Copy_Groups" copies one table's data, and the
"PR_Copy_Other_Groups" copies 3 tables' deep worth of data, but the
top-level table (first table) is defined exactly the same as the one
table that "PR_Copy_Groups" handles. The bug was in the initial select
statement only, for both procedures (the initial select is exactly the
same in both procs, only the "from TABLE" is different between the two).
The old version of the procedures used "Open Cursor --> loop and insert
row by row" to do the copying. Because "Other_Groups" is copying 3 tables
deep worth of data, that does in fact mean it had 3 nested loops (yuck).
Initially, I just changed the first select statement because that's all
the bug was -- it was selecting the wrong data and then later inserting
it -- and left the loop(s) alone (for now).
The select statement change for "PR_Copy_Groups" worked just fine in
testing. After applying the fix to the "PR_Copy_Other_Groups" procedure
and successfully compiling it, in testing the copying of "Other_Groups"
started hanging and simply never finished, when it used to run/finish
just fine before, even with the 3 loops and the old (incorrect) select
statement. (I let it run for 30min and it never did finish)
So I tested the copying from the application by doing the following in
the "PR_Copy_Other_Groups" proc:
I commented out the entire body of the proc -- it finishes
I commented out the two inner loops -- it finishes
I commented out only the innermost loop -- it finishes
I uncommented all three loops with the new SELECT -- it does NOT finish
I go back to the old bad SELECT with all 3 loops -- it finishes
I am baffled about how that initial select can change a stored procedure
that does easily finish running into one that does not at all, when the
initial select is only run once, isn't really slow, and does in fact
work properly (as is shown by PR_Copy_Groups and the commenting out
tests). The dataset is also not very big (see below the pseudocode).
Here's an overview of how the proc initially was written, the fix that
caused the problem, and the final solution, using pseudo-PL/SQL (note
the PSEUDO here, to keep it from becoming unbearably long).
------------------ ( OLD ) -------------------
PROCEDURE SCHEMA.PR_Copy_Other_Groups ( ... )
BEGIN
open Cur1 for
select *
from SCHEMA.TB_OTHER_GROUP
where VERSION_ID = P_From_Version_ID
;
loop until Cur1 done
-- IN param, out param
PR_Copy_CLOB_Tbl_Entry( Old_CLOB_ID, New_CLOB_ID );
INSERT INTO SCHEMA.TB_OTHER_GROUP ( ..., MY_CLOB_ID )
VALUES ( ..., from Cur1, New_CLOB_ID )
Open Cur2 for SELECT (...) from SCHEMA.TB_INNER1 [...];
loop until Cur2 done
INSERT INTO SCHEMA.TB_INNER1 [...] VALUES (..., from Cur2);
Open Cur3 for SELECT (...) from SCHEMA.TB_INNER2 [...];
loop until Cur3 done
INSERT INTO SCHEMA.TB_INNER2 [...] VALUES (..., from Cur3);
end loop;
end loop;
end loop;
END;
------------------------------------------------
--------------- ( FIX w/hang ) ---------------
PROCEDURE SCHEMA.PR_Copy_Other_Groups ( ... )
BEGIN
open Cur1 for
select (some stuff),
e1.ID ENTITY1_ID, e2.ID ENTITY2_ID, e3.ID ENTITY3_ID, e4.ID ENTITY4_ID
from SCHEMA.TB_OTHER_GROUP g
LEFT OUTER JOIN SCHEMA.TB_ENTITY oe1 on oe1.ID = g.ENTITY1_ID
LEFT OUTER JOIN SCHEMA.TB_ENTITY e1 on e1.NAME = oe1.NAME
and e1.VERSION_ID = P_To_Version_ID
LEFT OUTER JOIN SCHEMA.TB_ENTITY oe2 on oe2.ID = g.ENTITY2_ID
LEFT OUTER JOIN SCHEMA.TB_ENTITY e2 on e2.NAME = oe2.NAME
and e2.VERSION_ID = P_To_Version_ID
LEFT OUTER JOIN SCHEMA.TB_ENTITY oe3 on oe3.ID = g.ENTITY3_ID
LEFT OUTER JOIN SCHEMA.TB_ENTITY e3 on e3.NAME = oe3.NAME
and e3.VERSION_ID = P_To_Version_ID
LEFT OUTER JOIN SCHEMA.TB_ENTITY oe4 on oe4.ID = g.ENTITY4_ID
LEFT OUTER JOIN SCHEMA.TB_ENTITY e4 on e4.NAME = oe4.NAME
and e4.VERSION_ID = P_To_Version_ID
where g.VERSION_ID = P_From_Version_ID
;
loop until Cur1 done
[ same as above, nothing changed here ]
end loop;
END;
------------------------------------------------
--------------- ( FINAL FIX ) ----------------
PROCEDURE SCHEMA.PR_Copy_Other_Groups ( ... )
BEGIN
INSERT INTO SCHEMA.TB_OTHER_GROUP ( ... )
SELECT ( ... ) from [ that big new select above ]
;
INSERT INTO SCHEMA.TB_INNER1 ( ... )
SELECT ( ... ) from SCHEMA.TB_INNER1
where VERSION_ID = P_From_Version_ID
;
INSERT INTO SCHEMA.TB_INNER2 ( ... )
SELECT ( ... ) from SCHEMA.TB_INNER2
whre VERSION_ID = P_From_Version_ID
;
END;
------------------------------------------------
Row counts for the above tables:
TB_GROUP: 2500
TB_OTHER_GROUP: 1500
TB_ENTITY: 200
TB_INNER1: 2000
TB_INNER2: 8000
------------------------------------------------
I can provide some actual code snippets with changed names where needed
but all the code (tables + stored procs) for this whole thing is really
a ton of code and and it would take hours to desensitize it all if I
were to post everything relevant (the old Copy Other Groups proc is 200
lines by itself). Just ask if you'd like to see some particular snippet
or xplans or init params or whatever.
Anyway, like I said earlier, I am hoping, for my own curiosity, someone
here can provide some under-the-hood insight into how oracle processes
cursors+loops (or whatever else) that would cause a previously-working
procedure to hang just by changing an exterior (to the loops) SELECT
statement that is only run once per procedure invocation.
Thanks for reading, if you managed to make it this far. :)
--
"Care must be exorcised when handring Opiticar System as it is apts to
be sticked by dusts and hand-fat." --Japanese Translators
"Keep your fingers off the lens." --Elton Byington, English Translator
>Anyway, like I said earlier, I am hoping, for my own curiosity, someone
>here can provide some under-the-hood insight into how oracle processes
>cursors+loops (or whatever else) that would cause a previously-working
>procedure to hang just by changing an exterior (to the loops) SELECT
>statement that is only run once per procedure invocation.
You are guilty of two cardinal sins in your second version
1 You are using outer joins all over the place
2 You are processing records instead of sets.
Ad 1:
an outer join should really be an exception.
If you have outer joins all over the place, likely your design is
utterly wrong.
Let's take for a moment a very simple example.
select e.deptno, e.empname, d.deptname
from emp left outer join dept
What is going to happen? Emp is going to be the driving table, which
is joined to dept by a nested loops join or *worse* a MERGE JOIN
CARTESIAN.
You should have reviewed the EXPLAIN PLAN for this particular
statement, and you would probably have noticed it uses MERGE JOIN
CARTESIAN all over the place (and heaps of temporary segment and 100
percent of your CPU).
THIS is what you perceive as a 'hang' and your server is experiencing
this as a full blown knockout.
A result of using OUTER JOINs everywhere.
Outer joins are EVIL.
2 Oracle has a seperate SQL engine and a separate PL/SQL engine.
This means: for every SQL statement you execute inside PL/SQL you get
a *context switch*.
So this is why you should not code procedurally (using records) what
you could have done non-procedurally (using INSERT SELECT processing a
set).
If you insist on doing it procedurally you should at least have used
BULK COLLECT and BULK INSERT.
Both BULK COLLECT and BULK-INSERT have been there since 8i, however as
many developers didn't read the books of Steven Feuerstein, because
they think PL/SQL is equally 'smart' as T-SQL, BULK COLLECT and BULK
INSERT are under used.
What you have now is
1 a 'hang' caused by outer join
2 a completely unscalable procedure.
Happens all the time. People like myself earn their money from
eradicating those procedures.
Hth
--
Sybrand Bakker
Senior Oracle DBA
Sybrand is correct: Cursor loops are hopelessly obsolete and
inefficient. For demos using array processing go to Morgan's
Library at www.psoug.org and look up "Array Processing."
--
Daniel A. Morgan
Oracle Ace Director & Instructor
University of Washington
damo...@x.washington.edu (replace x with u to respond)
Puget Sound Oracle Users Group
www.psoug.org
>You are guilty of two cardinal sins in your second version
Guess I better not die until I've gone to confession, then...
>1 You are using outer joins all over the place
>2 You are processing records instead of sets.
>Ad 1:
>an outer join should really be an exception.
>If you have outer joins all over the place, likely your design is
>utterly wrong.
The design of the tables is fine. Unfortunately you're just going to
have to trust me on that one (even though I know you don't want to :).
The database models a very complex system that is rather proprietary
(we're not talking commercial-proprietary here either), so I can't
really give sufficient insight into it to allow you to believe me.
Anyway, I can eliminate 4 of the outer joins from that SELECT statement.
Go ahead and roar at me for that.
>Let's take for a moment a very simple example.
>select e.deptno, e.empname, d.deptname
>from emp left outer join dept
>What is going to happen? Emp is going to be the driving table, which
>is joined to dept by a nested loops join or *worse* a MERGE JOIN
>CARTESIAN.
>You should have reviewed the EXPLAIN PLAN for this particular
>statement, and you would probably have noticed it uses MERGE JOIN
>CARTESIAN all over the place (and heaps of temporary segment and 100
>percent of your CPU).
I looked at the xplan.
With 4 outers it is 10% of the CPU
With 8 outers it is 11% of the CPU
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
| 0 | SELECT STATEMENT | | 1 | 347 | 31 (10) | 00:00:01 |
| 0 | SELECT STATEMENT | | 34 | 11798 | 39 (11) | 00:00:01 |
On a larger dataset, I expect the CPU may be used more. However, I don't
expect the dataset to get very large (less than 5 people will be using
the apps that talk to this db sporatically in a production environment,
and they historically have only needed to once every few months). Yes, I
do know the adage about assumptions. However if I were to place a bet,
I'm betting that I'm right in this case.
>THIS is what you perceive as a 'hang' and your server is experiencing
>this as a full blown knockout.
>A result of using OUTER JOINs everywhere.
>Outer joins are EVIL.
Well, they are also sometimes necessary. Filling in the four ENTITY
fields is not required, and I still want to grab the GROUP entries even
if none of them are filled in. However the e1/e2/e3/e4 can be INNER
joins.
So anyway, I can agree that outer joins are bad where they can be
avoided. I'll see if the money man will let me change the procedures
again even though I believe that particular bug has been closed out.
>2 Oracle has a seperate SQL engine and a separate PL/SQL engine.
>This means: for every SQL statement you execute inside PL/SQL you get
>a *context switch*.
Okay, that is interesting. I am assuming this has significant overhead,
or at least enough that when processing large amounts of data row-by-row
it can really, really increase the run time unnecessarily.
>So this is why you should not code procedurally (using records) what
>you could have done non-procedurally (using INSERT SELECT processing a
>set).
Well like I said, I was tasked to fix the bug. The cursor/loop stuff was
already there. And I ended up replacing it all with a INSERT SELECT
(like you can see in the original post).
>If you insist on doing it procedurally you should at least have used
>BULK COLLECT and BULK INSERT.
>Both BULK COLLECT and BULK-INSERT have been there since 8i, however as
>many developers didn't read the books of Steven Feuerstein, because
>they think PL/SQL is equally 'smart' as T-SQL, BULK COLLECT and BULK
>INSERT are under used.
Okay, so let me see if I understand you correctly here. If I'm going to
keep the loops instead of INSERT INTO SELECT FROM, then I should've used
the BULK COLLECT/BULK INSERT and thrown it into an "array" and then
processed the array from there? But otherwise, it's better to use INSERT
INTO SELECT FROM in all cases.
I did make a change that re-introduced another cur/loop (not nested
though) in the final fix that I didn't list in the posted pseudo-code.
It doesn't use the BULK COLLECT/BULK INSERT though. I'll see if I can
still change that about the procedures, if the money man says yes to
the above changes too.
>What you have now is
>1 a 'hang' caused by outer join
>2 a completely unscalable procedure.
Okay. Well, Oracle is a learning process. Every little bit of insight
helps. Thank you for your comments.
>Happens all the time. People like myself earn their money from
>eradicating those procedures.
You forgot to work "lazy programmers" in there somewhere. :)
The BULK stuff I'm not really familiar with but it seems very
interesting. I was actually planning to look into it soon since the
newer project I'm working on is forcing me to use SQL embedded in the
app in order to interact with the DB, and I noticed there was a BULK
INSERT option with ODP.Net. In my experience, per-row transactions via
ODP.Net are cripplingly slow, and have a ton of overhead and I really
want to avoid that if possible.
However, that aside, I will see if I am allowed to make that change in
these two recent stored procs (see my latest reply to Sybrand).
The rule of thumb is that row-by-row can be expected to take around 4
times as long as the equivalent single SQL statement.
The CPU estimate is not the only thing to look at in the execution
plan. Perhaps the optimizer has picked the wrong join order or access
path, possibly indicating an issue in the stats or an inefficiency in
the SQL, such as a NOT IN predicate against a nullable key.
For non-obvious PL/SQL performance issues the first thing I would do
is run the procedure using DBMS_PROFILER. (IDEs such as PL/SQL
Developer have a button for this but you can work it from the command
line if you have to.) Also while it appears to be hanging, query v
$session, v$session_wait, v$session_wait_history and v$session_longops
to see what it's actually up to.
Unless an exception is generated or the table has constraints, indexes,
triggers, etc. Then it can 10:1 or more.