RAC internals - GLOBAL ENQUEUE SERVICES DEADLOCK DETECTED

Dusan Bolek

unread,

Nov 2, 2005, 2:44:13 AM11/2/05

to

We're encountering many of "Global Enqueue Services Deadlock detected"
messages in the alert log of one of our database instances (two node
RAC database) together with quite big global cache cr request waits in
STATSPACK report. The problem is that the only other available
information is trace file that looks like:

*** 2005-09-19 10:37:01.153
Global Wait-For-Graph(WFG) at ddTS[0.0] :
BLOCKED 70000016f956fd8 5 [0x2b90011][0x5f20],[TX] [131321,1285] 1
BLOCKER 70000016b5ff288 5 [0x2b90011][0x5f20],[TX] [131317,1159] 1
BLOCKED 70000016b1c8440 5 [0x3fe001d][0x1a1d],[TX] [131317,1159] 1
BLOCKER 700000199574ea0 5 [0x3fe001d][0x1a1d],[TX] [131321,1285] 1

and sometimes also like:

----------enqueue------------------------
lock version : 10665
Owner node : 0
grant_level : KJUSEREX
req_level : KJUSEREX
bast_level : KJUSERNL
notify_func : 0
resp : 7000001b13711d8
procp : 700000160d71fa0
pid : 5365832
proc version : 39
oprocp : 0
opid : 0
gid : 618
xid : 67532 66652
dd_time : 0.0 secs
dd_count : 0
timeout : 0.0 secs
On_timer_q? : N
On_dd_q? : N
lock_state : GRANTED
Open Options : KJUSERDEADLOCK
Convert options : KJUSERNOQUEUE
History : 0x9a514395
Msg_Seq : 0x0
res_seq : 0

The problem is that it looks like no one including local Oracle support
knows what is the meaning of values in this tracefile. We are also
pursuing this with Oracle Worldwide Support, but it can take a lot of
time. Is there someone who investigated this kind of trace file before?
Search on Metalink showed only a couple of similar, but never answered
questions.

Thank you

Dusan Bolek

unread,

Nov 2, 2005, 2:47:11 AM11/2/05

to

Database version is: 9.2.0.6, OS: AIX 64-bit 5.2.

Forgot to mention that. Sorry.

Dusan

Alexey Sergeyev

unread,

Nov 2, 2005, 5:36:53 AM11/2/05

to

Hi Dusan!

Yea, we 've experienced a problem like that. Due to some issues with
bitmap indexes we were getting periodically ORA-00060, with very similar
traces (i got those by setting '60 trace name ERRORSTACK level 1' event).
The traces weren't too informative, and i opened a TAR with the question
about more accurate diagnostic. The Oracle WWS suggestion was: "...
systemstate level 10 is what you need, though the trace information is
really **only to be read by Oracle** (there are no docs on how to read
it)...". So, i'm afraid, in a similar way your traces are useful for the
Oracle Support only...

Alexey Sergeyev

"Dusan Bolek" <spa...@seznam.cz> wrote in message
news:1130917453.7...@g44g2000cwa.googlegroups.com...

yon...@yahoo.com

unread,

Nov 2, 2005, 1:53:12 PM11/2/05

to

Dusan Bolek wrote:
> We're encountering many of "Global Enqueue Services Deadlock detected"
> messages in the alert log of one of our database instances (two node
> RAC database) together with quite big global cache cr request waits in
> STATSPACK report. The problem is that the only other available
> information is trace file that looks like:
>
> *** 2005-09-19 10:37:01.153
> Global Wait-For-Graph(WFG) at ddTS[0.0] :
> BLOCKED 70000016f956fd8 5 [0x2b90011][0x5f20],[TX] [131321,1285] 1
> BLOCKER 70000016b5ff288 5 [0x2b90011][0x5f20],[TX] [131317,1159] 1
> BLOCKED 70000016b1c8440 5 [0x3fe001d][0x1a1d],[TX] [131317,1159] 1
> BLOCKER 700000199574ea0 5 [0x3fe001d][0x1a1d],[TX] [131321,1285] 1

Dusan,

Metalink Note:262226.1 offers some interpretation. Your process 131321
on node 1 is waiting for a TX lock in mode 5 (the two hex numbers,
0x2b90011 and 0x5f20, may be id1 and id2 in v$lock, respectively).
Process 131317 on the same node is holding it. Unlike in a single
instance deadlock graph, a Global Wait-For-Graph is missing the current
SQL. The good news to you is that your database is capable of producing
this deadlock involving only one instance. So set cluster_database to
false and try to reproduce the deadlock to get a better trace file.

Note:181489.1 has an interesting section about global cache cr request.
But I think it's unrelated to the above deadlock.

>
> and sometimes also like:
>
> ----------enqueue------------------------
> lock version : 10665
> Owner node : 0
> grant_level : KJUSEREX
> req_level : KJUSEREX
> bast_level : KJUSERNL
> notify_func : 0
> resp : 7000001b13711d8
> procp : 700000160d71fa0
> pid : 5365832

That looks like lkdebug output and is about a different instance, a
different process. But confirm with Oracle support.

Yong Huang

Dusan Bolek

unread,

Nov 3, 2005, 6:50:03 AM11/3/05

to

Thank you, that looks like something. I will try to catch a deadlock
message in the alert log as quickly as possible, then find user
responsible (via pid) and from V$SQL the statement that could be
causing this trouble. It is not 100% reliable way, but could be
usefule.

Dusan

Jonathan Lewis

unread,

Nov 3, 2005, 7:37:23 AM11/3/05

to

"Dusan Bolek" <spa...@seznam.cz> wrote in message

news:1131018603.2...@f14g2000cwb.googlegroups.com...

If the deadlock is on mode 5 enqueues at both
ends, it is most likely to be the foreign key locking
issue, possibly with a referential integrity constraint
declared as 'on delete cascade' but not covered by
a suitable index.

Look of foreign keys without indexes, and then
look for code that deletes the parent, or updates
the columns of the indexes protecting the parent key.

--
Regards

Jonathan Lewis

http://www.jlcomp.demon.co.uk/cbo_book/ind_book.html
Cost Based Oracle: Fundamentals
Now available to pre-order.

http://www.jlcomp.demon.co.uk/faq/ind_faq.html
The Co-operative Oracle Users' FAQ

http://www.jlcomp.demon.co.uk/appearances.html
Public Appearances - schedule updated 4th Sept 2005

Dusan Bolek

unread,

Nov 3, 2005, 11:39:45 AM11/3/05

to

Thank you, Jonathan. That looks very interesting. Unfortunately, the
code accessing this database is too complex to be investigated
in this way. However, I have got from dba_constraints all referential
constraints without associated index that were created with ON
DELETE CASCADE. There are twelve of them and I will ask the guy
responsible for this database to create appropriate indexes for
all of them. Maybe that will solve the problem.

--
Dusan Bolek

Dusan Bolek

unread,

Nov 4, 2005, 10:45:55 AM11/4/05

to

> Metalink Note:262226.1 offers some interpretation. Your process 131321
> on node 1 is waiting for a TX lock in mode 5 (the two hex numbers,
> 0x2b90011 and 0x5f20, may be id1 and id2 in v$lock, respectively).
> Process 131317 on the same node is holding it.

The strange thing is that I can find none of the processes that are
participating in a deadlock.
When checked against V$PROCESS, even immediately after a new dead lock
emerges in trace file,
there is no row returned from that table.

--
Dusan Bolek

yon...@yahoo.com

unread,

Nov 4, 2005, 3:24:17 PM11/4/05

to

You're right. I did some testing and find that the two numbers,
[131321,1285], in your case, do not in any way denote a process
(Note:262226.1 says the first number is PID). Instead they correspond
to transaction_id0 and transaction_id1 of v$ges_blocking_enqueue,
respectively (or the same in v$dlm_locks). Documentation says they're
lower and upper 4 bytes of the transaction identifier where the lock
belongs to. I can't find more information about it. Perhaps for our
purpose, we can conceptually think of the combination of the two
numbers, i.e. a transaction identifier, as a process identifier.

By the way, I do see the SQL involved in the global deadlock (tested in
9.2.0.7.0 on Linux):
...
*** 2005-11-04 13:38:33.199
user session for deadlock lock 0x7553ab14
...
Current SQL Statement:
update test set a = :"SYS_B_0" where a = :"SYS_B_1"
Global Wait-For-Graph(WFG) at ddTS[0.28] :
BLOCKED 0x7553ab14 5 [0xf001d][0x8353],[TX] [2162689,7995] 0
...

In any case, follow Jonathan's practical advice.

Yong Huang

dus...@gmail.com

unread,

Nov 6, 2005, 1:46:59 PM11/6/05

to

Yes, I have got info from Oracle that in some cases this numbers should
be related to transaction id instead of process id. However, I do not
know how it is encoded, because I'm also using Log Miner to find
offending SQL, but still with no success. :-(

--
Dusan Bolek

yon...@yahoo.com

unread,

Nov 6, 2005, 6:16:52 PM11/6/05

to

Using log miner is the way to go, unless you enable SQL trace
beforehand. Although it may not help solve your problem, can you ask
Oracle when the number(s) is/are PIDs? It may be that with two numbers
in brackets (as in your case), they're transaction IDs, and if it's a
single number as in Bug:3977157, it's PID. AIX can have surprisingly
high PIDs. A message at
http://unix.derkeiler.com/Mailing-Lists/AIX-L/2004-02/0059.html
says its max value is > 67 million. Solaris is only up to 999999 and
Linux 32k.

Please post back anything interesting you and Oracle support find.
Thanks.

Yong Huang

Dusan Bolek

unread,

Nov 7, 2005, 2:02:07 AM11/7/05

to

The Oracle support response was:

These values are:
<BLOCKED|BLOCKER> <lockp> <cvt|held mode> <res name> <pid|did|txn_id>
<node>

The second to last field (<pid|did|txn_id>) could be a pid or the txn
id depending
on if we have an XID based deadlock search flag.

However, as I already said. The mentioned txn id is not the transaction
id that could be found in a Log Miner output. :-(

--
Dusan Bolek

Dusan Bolek

unread,

Nov 9, 2005, 4:47:59 AM11/9/05

to

It is getting more and more strange. We encountered today a bunch of
problems that looks like a normal deadlock caused by application
behaviour (lot of users deadlocking each other). In this case as you
can see the mode is just same (5 on both ends) as with the previously
mentioned deadlocks. The only difference is that these are occurying
between instances. The strange is that these ones were accompanied with
the SQL that caused the problem. So I'm not sure if in our case your
statement "If the deadlock is on mode 5 enqueues at both ends, it is
most likely to be the foreign key locking issue" is correct.

See below:

*** 2005-11-09 08:16:10.087
user session for deadlock lock 70000016a0b29b8
pid=825 serial=32833 audsid=14358159 user: 425/USER
O/S info: user: wasadmin, term: , ospid: 1323120, machine: cn67b1
program: /was5/WebSphere/AppServer/java/bin/java@cn67b1 (
application name: /was5/WebSphere/AppServer/java/bin/java@cn67b1 (,
hash value=0
Current SQL Statement:
DELETE FROM SESSION_DATA WHERE SESSION_ID = :1 AND DATA_KEY = :2
user session for deadlock lock 70000016c0b07b0
pid=867 serial=37699 audsid=14358272 user: 425/USER
O/S info: user: wasadmin, term: , ospid: 1323120, machine: cn67b1
program: /was5/WebSphere/AppServer/java/bin/java@cn67b1 (
application name: /was5/WebSphere/AppServer/java/bin/java@cn67b1 (,
hash value=0
Current SQL Statement:
DELETE FROM SESSION_DATA WHERE SESSION_ID = :1 AND DATA_KEY = :2
Global Wait-For-Graph(WFG) at ddTS[0.eff6] :
BLOCKED 70000016a0b29b8 5 [0x20a0008][0x1cb52],[TX] [66361,527468] 0
BLOCKER 70000016f469718 5 [0x20a0008][0x1cb52],[TX] [131371,47259] 1
BLOCKED 70000016f469718 5 [0x20a0008][0x1cb52],[TX] [131371,47259] 1
BLOCKER 70000016bc15980 5 [0x20a0008][0x1cb52],[TX] [131217,103403] 1
BLOCKED 70000016b47f828 5 [0xbc001d][0x6ed23],[TX] [131217,103403] 1
BLOCKER 70000016d0b11a0 5 [0xbc001d][0x6ed23],[TX] [131469,20450] 1
BLOCKED 70000016d0b11a0 5 [0xbc001d][0x6ed23],[TX] [131469,20450] 1
BLOCKER 70000016d0b8a18 5 [0xbc001d][0x6ed23],[TX] [131394,51017] 1
BLOCKED 70000016d0b8b98 5 [0x400005][0x6914],[TX] [131394,51017] 1
BLOCKER 70000016c0b0648 5 [0x400005][0x6914],[TX] [66403,574442] 0
BLOCKED 70000016c0b07b0 5 [0x3d0017][0x82a7],[TX] [66403,574442] 0
BLOCKER 70000016a0b2400 5 [0x3d0017][0x82a7],[TX] [66361,527468] 0

--
Dusan Bolek

Jonathan Lewis

unread,

Nov 10, 2005, 4:54:12 AM11/10/05

to

"Dusan Bolek" <spa...@seznam.cz> wrote in message

news:1131529679....@z14g2000cwz.googlegroups.com...

> It is getting more and more strange. We encountered today a bunch of
> problems that looks like a normal deadlock caused by application
> behaviour (lot of users deadlocking each other). In this case as you
> can see the mode is just same (5 on both ends) as with the previously
> mentioned deadlocks. The only difference is that these are occurying
> between instances. The strange is that these ones were accompanied with
> the SQL that caused the problem. So I'm not sure if in our case your
> statement "If the deadlock is on mode 5 enqueues at both ends, it is
> most likely to be the foreign key locking issue" is correct.
>

My error - my comment is only appropriate for TM enqueues
in mode 5; not TX in mode 5.

On the other hand, I've never seen a TX in mode 5 - I had thought
that modes 4 and 6 were the only possibilities for TX. So I wonder
if we are being misled by the Metalink note.

Two thoughts - application oriented only:
Does this code come from a logoff trigger ?

Is webapplication server configured to drop a number of connections
if they have been idle for a few minutes.

The code looks like an attempt to record a logoff in the database -
but in RAC, you can use the same session_id from different nodes,
so maybe this code is simply not RAC-aware and causes a genuine
deadlock somehow.

--
Regards

Jonathan Lewis

http://www.jlcomp.demon.co.uk/faq/ind_faq.html

The Co-operative Oracle Users' FAQ

http://www.jlcomp.demon.co.uk/cbo_book/ind_book.html
Cost Based Oracle: Fundamentals

http://www.jlcomp.demon.co.uk/appearances.html
Public Appearances - schedule updated 4th Nov 2005

Jonathan Lewis

unread,

Nov 10, 2005, 4:57:12 AM11/10/05

to

<yon...@yahoo.com> wrote in message
news:1131135857....@g47g2000cwa.googlegroups.com...

> Dusan Bolek wrote:
>> > Metalink Note:262226.1 offers some interpretation. Your process 131321
>> > on node 1 is waiting for a TX lock in mode 5 (the two hex numbers,
>> > 0x2b90011 and 0x5f20, may be id1 and id2 in v$lock, respectively).
>> > Process 131317 on the same node is holding it.
>>

> You're right. I did some testing and find that the two numbers,
> [131321,1285], in your case, do not in any way denote a process
> (Note:262226.1 says the first number is PID). Instead they correspond
> to transaction_id0 and transaction_id1 of v$ges_blocking_enqueue,
> respectively (or the same in v$dlm_locks). Documentation says they're
> lower and upper 4 bytes of the transaction identifier where the lock
> belongs to. I can't find more information about it. Perhaps for our
> purpose, we can conceptually think of the combination of the two
> numbers, i.e. a transaction identifier, as a process identifier.
>
> By the way, I do see the SQL involved in the global deadlock (tested in
> 9.2.0.7.0 on Linux):
> ...
> *** 2005-11-04 13:38:33.199
> user session for deadlock lock 0x7553ab14
> ...
> Current SQL Statement:
> update test set a = :"SYS_B_0" where a = :"SYS_B_1"
> Global Wait-For-Graph(WFG) at ddTS[0.28] :
> BLOCKED 0x7553ab14 5 [0xf001d][0x8353],[TX] [2162689,7995] 0
> ...
>
> In any case, follow Jonathan's practical advice.
>
> Yong Huang
>

Given that the documentation and notes are wrong about the
process id / transaction id - might the comment about the
mode 5 lock also be wrong ?

Could you see a transaction holding or requesting a TX mode 5
in the few seconds before the deadlock was reported ?

yon...@yahoo.com

unread,

Nov 10, 2005, 4:23:12 PM11/10/05

to

I just tested again. I do not see lock mode 5. When the two sessions
involving a deadlock were hung, the session that later detected ORA-60
had this in v$lock:

SID TY ID1 ID2 LMODE REQUEST CTIME
BLOCK
--- -- ---------- ---------- ---------- ---------- ----------
----------
133 TX 327689 60114 6 0 72
2
133 TM 114196 0 3 0 72
2
133 TX 852009 33959 0 6 60
0

and these in v$ges_blocking_enqueue:

HANDLE GRANT_LEV REQUEST_L OPEN_OPT_DEADLOCK WHICH_QUEUE STATE
-------- --------- --------- ----------------- ----------- -----
68D149C0 KJUSEREX KJUSEREX 1 1 GRANTED
68D157E4 KJUSERNL KJUSEREX 1 2 OPENING

After this session detected ORA-60, followed by a few seconds delay,
the last row in v$lock output (where LMODE=0) is gone and the second
row in v$ges_blocking_enqueue is also gone. The lmd0 process trace file
shows:

Global Wait-For-Graph(WFG) at ddTS[0.5e] :
BLOCKED 0x68d157e4 5 [0xd0029][0x84a7],[TX] [2621441,14115] 0
BLOCKER 0x6d6964bc 5 [0xd0029][0x84a7],[TX] [1835010,4352] 1
BLOCKED 0x6c633798 5 [0x50009][0xead2],[TX] [1835010,4352] 1
BLOCKER 0x68d149c0 5 [0x50009][0xead2],[TX] [2621441,14115] 0

So you're right. According to v$lock, there's no mode 5 lock involved.
In fact, showing 5 on all four lines in this Wait-For-Graph plus the
fact that 5 is the only number I see in all WFGs I find makes me wonder
if this number is hardcoded in Oracle code.

Yong Huang

Dusan Bolek

unread,

Nov 11, 2005, 7:30:10 AM11/11/05

to

This code definitely came from an application server. They are storing
all application servers sessions there. It looks like there was a lock
on this table so none can login in or logout. Our application vendor is
still investigating its code, it looks like there are more issues to be
solved. However, I'm glad that I have been able to get some meaningful
information from a deadlock trace file. By getting txn id from the
trace file, I can find the offending SQL in a log miner output and that
is helping a lot to solve deadlock related problems.
Another interesting thing about the deadlock tracing that I've found is
that if a deadlock occurs between two transactions on the same node
there is no SQL in the generated tracefile, but if these transactions
were on different nodes it would leave an SQL in the trace file on one
of the affected nodes. Do not know why, but it looks like it works this
way.

--
Dusan Bolek