Effect of listener on existing connections?

groups....@gmail.com

unread,

Jun 13, 2008, 9:40:24 AM6/13/08

to

Hi all,

Oracle support is not giving us satisfactory results. Perhaps you can
give some answers?

We've recently upgraded our system to Oracle 10.2.0.3.0, running on
Solaris (sparc) 10 inside a ZFS zone (our previous system was Oracle
9.2.0.4.0 running on sparc Solaris 8, and was running on that for the
last 5 years). Since the upgrade 6 weeks ago, we've had two instances
where our applications (running in the same O/S environment on a
different node on the cluster) have locked up - existing connections
to Oracle become unresponsive when executing SQL (with no error
message - they just block), and attempts to create new connections are
met with the error:

"ORA-03135: connection lost contact".

The first time this happened, the outage lasted for about 5 minutes,
then it went away, and execution proceeded normally. The second time
it happened it lasted for 25 minutes until we were able to intervene
manually, and "fixed" the problem by restarting the instance &
listener. During the time of the outage, there were no error messages
in the alert.log, listener.log or /var/adm/messages. However, a few
minutes after normal operation was restored (the first time), and
right as we were restarting the instance (the second time), we saw
these messages appear in the alert.log:

"WARNING: inbound connection timed out (ORA-3136)".

We also see these messages appearing with regularity in our listener
log, during all times of operation (not just in proximity to the
outage):

"WARNING: Subscription for node down event still pending"

We opened an SR with Oracle Support, but so far, I'm unimpressed with
their response. They've told me nothing that I already found on
Google from searching for those error messages - namely that we need
to add some lines to our listener.ora & sqlnet.ora:

listener.ora:
INBOUND_CONNECT_TIMEOUT_LISTENER = 0
SUBSCRIBE_FOR_NODE_DOWN_EVENT_LISTENER=OFF

sqlnet.ora:
SQLNET.INBOUND_CONNECT_TIMEOUT = 0

We've made these changes, but I have low confidence that they will
actually solve the problem (and I've told Oracle as much) for the
following reasons:

1. The SUBSCRIBE_FOR_NODE_DOWN_EVENT_LISTENER value is to address the
issue of the listener locking up if you're not using ONS. However,
how does a blocked listener explain the fact that apps with existing
connections to the db become blocked? My understanding (and I could
be wrong here) is that once you're connected to the instance, there is
no further authentication that needs to be performed. Our in-house
experiments also show that we can "kill -STOP" the listener and apps
with existing connections continue to perform normally.

2. The INBOUND_CONNECT_TIMEOUT value is to address the issue with
clients that are "slow to authenticate", but the "WARNING: inbound
connection timed out (ORA-3136)" message appears AFTER the crisis
interval - in the recent
case, it appeared 20 minutes after our app became blocked. I would
expect to see it within 60 seconds of the block, as that's what the
value currently is.

3. All apps trying to establish new connections (including sqlplus,
running on the same node as the instance) received the login error,
not just "certain apps" as described in the Oracle tech note
274303.1. Why would an app like sqlplus, running on the same box as
the server (which was the case here) need more than 1 minute to
authenticate when logging in? Even our own apps shouldn't be taking
long to authenticate.

Should this event occur again, I don't see how Support will be able to
resolve it, as they haven't asked me for any additional info. I want
to know from them what steps I need to take to gather information so
that they can REALLY fix the problem, or give me an answer that
unambiguously addresses the issue, rather than just googling on error
numbers. So far my requests for a clear action plan from them have
been met with the email equivalent of a blank, slack-jawed stare.

=========

So, my basic questions are:

1. Can listener unavailability cause existing client connections to
become unresponsive?

2. If the answer to #1 is "no", Is there some way I can escalate this
issue within Support to get to a analyst who actually understands how
the Oracle server works, and is capable of doing something other than
typing search queries into metalink?

Thanks,

-S

gym dot scuba dot kennedy at gmail

unread,

Jun 13, 2008, 10:06:00 AM6/13/08

to

<groups....@gmail.com> wrote in message
news:a8f8b4b1-adbb-4916...@f36g2000hsa.googlegroups.com...

When you try to make a connection the Listener spawns a server process
connects you to that process and then gets out of the way. On a test system
you can confirm this by running sqlplus to the test system. After you have
sucessfully connected shut down the listener on the test system. Your
sqlplus session should continue to function normally. In another sqlplus
session if you try to connect you will see that the listener is unavailable.
So the answer to #1 is No. The asnwer to number 2 is to escalate the SR.
Ask to give it a higher priority. I am assuming this is a production
system. If this system is critical for your business then escalate it.
That will help you get someone more compotent. You can always back off the
priority and you should retain the same tech.
Good luck.
Jim

Stephen...@gmail.com

unread,

Jun 13, 2008, 10:54:33 AM6/13/08

to

Is there a way to escalate that doesn't relate to the system currently
being in a failure state? The only ecalation I know about is to make
it a "Sev 1", which essentially means I'm on-call 24/7 until the issue
is closed, which isn't really the case - that seems to be used for
cases where the system has failed and you need every possible support
resource in order to get the system running again. The system is
running fine now, but I believe it could fail again at any time.

On Jun 13, 10:06 am, "gym dot scuba dot kennedy at gmail"
<kenned...@verizon.net> wrote:
> <groups.brob...@gmail.com> wrote in message

Robert Klemme

unread,

Jun 13, 2008, 12:16:23 PM6/13/08

to

On 13.06.2008 15:40, groups....@gmail.com wrote:
> We've recently upgraded our system to Oracle 10.2.0.3.0, running on
> Solaris (sparc) 10 inside a ZFS zone (our previous system was Oracle
> 9.2.0.4.0 running on sparc Solaris 8, and was running on that for the
> last 5 years). Since the upgrade 6 weeks ago, we've had two instances
> where our applications (running in the same O/S environment on a
> different node on the cluster) have locked up - existing connections
> to Oracle become unresponsive when executing SQL (with no error
> message - they just block), and attempts to create new connections are
> met with the error:
>
> "ORA-03135: connection lost contact".

I have zero experience with ZFS zones but in your situation I would try
to look that way - just a gut feeling. Did you try OS tools to
investigate the issue? Maybe you have a configuration issue which leads
to resource conflicts escalating. Maybe opening a case with Sun could
help as well.

Kind regards

robert

joel garry

unread,

Jun 13, 2008, 1:34:47 PM6/13/08

to

Have you check TCP issues? Do you have lots of ports used? What kind
of timeout settings do you have? Do you see any FIN_WAIT type port
blocking when the issue occurs, or lasting long times? Have you
modified sqlnet to use larger buffer sizes? What kind of ping
response do you get during the problem times?

Just some questions I would ask, I'd be wondering about the OS not
being nice to Oracle.

jg
--
@home.com is bogus.
"That's strange..." - what you don't want to hear trying to track down
denied insurance claims.

AGT

unread,

Jun 13, 2008, 10:32:52 PM6/13/08

to

<groups....@gmail.com> wrote in message
news:a8f8b4b1-adbb-4916...@f36g2000hsa.googlegroups.com...

Writes:

>> Oracle support is not giving us satisfactory results. Perhaps you can
>> give some answers?
>>
>> We've recently upgraded our system to Oracle 10.2.0.3.0, running on
>> Solaris (sparc) 10 inside a ZFS zone (our previous system was Oracle
>> 9.2.0.4.0 running on sparc Solaris 8, and was running on that for the
>> last 5 years). Since the upgrade 6 weeks ago, we've had two instances
>> where our applications (running in the same O/S environment on a
>> different node on the cluster) have locked up - existing connections
>> to Oracle become unresponsive when executing SQL (with no error
>> message - they just block), and attempts to create new connections are
>> met with the error:
>> "ORA-03135: connection lost contact".

How hard would it be to eliminate the zoning..?
I dont think this is related nor ZFS but if you could test w/o
these changes then youd know for sure.

I dont know why you would do this in the first place. Zones are
appropriate for some things and you generally get better
throughput from ZFS over UFS but zones just stir up the pot for me.

Dedicate the box(es) to Oracle only - dont even make a special
project for it - just use default. Keep things simple as possible.

Maybe you have reasons for all this fancy overhead but so far I see none : >

groups....@gmail.com

unread,

Jun 16, 2008, 5:48:55 PM6/16/08

to

On Jun 13, 10:32 pm, AGT <usenetpersonger...@gmail.com> wrote:
> <groups.brob...@gmail.com> wrote in message

The zones are here to stay. We sell a turnkey solution that runs on
self-contained hardware, so our apps plus the database all live on one
box (really a cluster - two boxes, with the one as a failover node).
The zones make it much easier to administer & monitor all the
components of the system (db + apps) with a unified mechanism.

Additionally, we perform our hot backups using zones & snapshots,
which is much faster and less intrusive than what we used to have
running rman - our backup window is now a second or two (while the
snapshot is taken), at which point the snapshot of the db zone can be
backed up at any point in the subsequent 24 hours. This works much
better for us, as different clients have different backup strategies.
It's simpler to let them point to a net-mountable volume that contains
all the files they need to archive to whatever backup strategy they're
using for their enterprise.

In any case, we haven't figured out how to replicate the scenario yet
- it only occurs in the production environment, and never showed up
during our testing. We typically tested a months worth of operation
at an accelerated rate (anywhere from 4x to 40x the normal speeds, so
tests finished in 2 - 7 days). We could start running some tests at
1x speeds, but given the intermittent rate of failure, it would be
hard to draw any conclusions from a non-zone-based system that didn't
fail after running for a month or two.

We may as well kick off a several-month simulation, though, in case we
start to see this problem occur with regularity.

Robert Klemme

unread,

Jun 17, 2008, 2:17:40 AM6/17/08

to

On 16.06.2008 23:48, groups....@gmail.com wrote:

> In any case, we haven't figured out how to replicate the scenario yet
> - it only occurs in the production environment, and never showed up
> during our testing. We typically tested a months worth of operation
> at an accelerated rate (anywhere from 4x to 40x the normal speeds, so
> tests finished in 2 - 7 days). We could start running some tests at
> 1x speeds, but given the intermittent rate of failure, it would be
> hard to draw any conclusions from a non-zone-based system that didn't
> fail after running for a month or two.
>
> We may as well kick off a several-month simulation, though, in case we
> start to see this problem occur with regularity.

I don't know Solaris too well but is there any chance to have some
monitoring run on the production box that exhibited the error over the
course of a week and try to capture circumstances of the error surfacing
that way? Maybe you can collect some network statistics along with
other data and later analyze it and find the error.

Kind regards

robert

mpachec...@hotmail.com

unread,

Jun 17, 2008, 9:07:55 AM6/17/08

to

Not sure if it will help or if it is even related but the default
value of the environment parameter USE_SHARED_SOCKET changed from 9i
to 10g. In 9i the default was FALSE and now it is TRUE. Do you use any
kind of gateway to other databases or a Firewall in production?

AGT

unread,

Jun 17, 2008, 11:41:58 AM6/17/08

to

On Tue, 17 Jun 2008 08:17:40 +0200, Robert Klemme wrote:

> On 16.06.2008 23:48, groups....@gmail.com wrote:
>> In any case, we haven't figured out how to replicate the scenario yet
>> - it only occurs in the production environment, and never showed up
>> during our testing. We typically tested a months worth of operation
>> at an accelerated rate (anywhere from 4x to 40x the normal speeds, so
>> tests finished in 2 - 7 days). We could start running some tests at
>> 1x speeds, but given the intermittent rate of failure, it would be
>> hard to draw any conclusions from a non-zone-based system that didn't
>> fail after running for a month or two.

Sounds like

>> We may as well kick off a several-month simulation, though, in case we
>> start to see this problem occur with regularity.

You could or see below

> I don't know Solaris too well but is there any chance to have some
> monitoring run on the production box that exhibited the error over the
> course of a week and try to capture circumstances of the error surfacing
> that way?

What if it all happens in millisecond or less...
Hard to capture that. Even with dtrace

> Maybe you can collect some network statistics along with
> other data and later analyze it and find the error.

Theres many a release of "Solaris" for many HW platforms. All of them must
be patched regularly especially if 'something funny is going on". One of
my clients utterly refuses to patch the OS claiming that would introduce a
"change" yet they patch everything else Oracle without question.. As I
said I have serious doubts that zoning or ZFS would cause this, but a
missing kernel/driver patch or a simple tweak in /etc/system might make
the ghost go away. For near instant rollback I suggest enabling live
upgrade - just in case a "change" is problematic