
Troubleshooting NIC / HBA errors in Solaris 10


underh20

Mar 15, 2011, 11:00:37 PM

Our Sun T5220 server "atlanta" runs Solaris 10. NIC failure and repair
messages appear in the message log regularly. Are there issues with the
network, or with the HBAs?

What are the commands and best ways to troubleshoot the network and the
HBAs? "aggr-1" is the virtual aggregate interface of two physical
interfaces on the server. "atlanta" is one of the two nodes in the Sun
Cluster. Thanks, Bill

Mar 1 12:24:07 atlanta in.mpathd[309]: [ID 23451 daemon.error] NIC
failure detected on aggr-1 of group ipmp0
Mar 1 12:24:07 atlanta Cluster.PNM: [ID 23452 daemon.notice] sc_ipmp0:
state transition from OK to DOWN.

Mar 1 12:24:23 atlanta in.mpathd[309]: [ID 300212 daemon.error] NIC
repair detected on aggr-1 of group ipmp0
Mar 1 12:24:23 atlanta Cluster.PNM: [ID 234234 daemon.notice]
sc_ipmp0: state transition from DOWN to OK.
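
(A first look at what in.mpathd and the cluster agent think of the group
can be had with something like the following; this is a sketch, where the
group and instance names are taken from the log lines above, and scstat
ships with Sun Cluster:

# ifconfig -a                               (look for FAILED/INACTIVE flags on the group's interfaces)
# scstat -i                                 (IPMP group status as seen from each cluster node)
# grep in.mpathd /var/adm/messages | tail   (how often the failure/repair flaps occur)
)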

Paroksha

Mar 16, 2011, 1:08:31 AM

It's a NIC failure. Check the output of these commands:

# fmadm faulty     (it should show you the faulted component)

# prtdiag -v

If the NIC is onboard, you may have to replace the motherboard, but
before that, try connecting it with a new cable.
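
(If fmadm faulty comes back empty, the underlying FMA telemetry may still
have something; fmdump is part of the same Solaris 10 fault-management
tooling. A sketch:

# fmdump                (diagnosed fault events, if any)
# fmdump -eV | tail     (raw error reports, including ones not yet diagnosed as a fault)
)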

underh20

Mar 16, 2011, 12:46:38 PM

The fmadm faulty command returns nothing. Does that mean all the
network parts in the server, including the HBAs, are good?
prtdiag -v doesn't show any errors either.

If we are to swap in different cables, do we swap them out one at a
time, since the aggregate virtual interface is built on two physical
connections? Is there any command or step we will need when swapping
out these cables?

Thanks, Bill
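
(A conservative way to do the swap, as a sketch, assuming the two
aggregated ports are e1000g0 and e1000g2 as in the dladm output later in
the thread: move one cable at a time, and between swaps confirm that the
moved member has rejoined the aggregate before touching the second cable:

# dladm show-dev      (per-port link state, speed, duplex)
# dladm show-aggr     (each member should be back to "up" and "attached")
)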

Cydrome Leader

Mar 16, 2011, 6:02:33 PM

underh20 <underh20.s...@gmail.com> wrote:
> Our Sun T5220 server "atlanta" runs Solaris 10. NIC failure and repair
> messages appear in the message log regularly. Are there issues with the
> network, or with the HBAs?
>
> What are the commands and best ways to troubleshoot the network and the
> HBAs? "aggr-1" is the virtual aggregate interface of two physical
> interfaces on the server. "atlanta" is one of the two nodes in the Sun
> Cluster. Thanks, Bill

what does the link status show?

dladm show-dev

it might be a bad cable or switch port. I find I have to swap more cables
than I do network adapters, so I look there first.
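
(Per-interface error counters are another quick read on cable and port
health; nonzero or growing error columns usually point at layer 1. A
sketch:

# netstat -i      (watch the Ierrs/Oerrs/Collis columns over time)
)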

underh20

Mar 16, 2011, 6:55:09 PM

On Mar 16, 3:02 pm, Cydrome Leader <prese...@MUNGEpanix.com> wrote:

We have the following output from the "dladm" commands. How does it
look? Thanks, Bill

# dladm show-dev
e1000g0 link: up speed: 1000 Mbps duplex: full
e1000g1 link: up speed: 1000 Mbps duplex: full
e1000g2 link: up speed: 1000 Mbps duplex: full
e1000g3 link: unknown speed: 0 Mbps duplex: half
e1000g4 link: unknown speed: 0 Mbps duplex: half


# dladm show-link
e1000g0 type: non-vlan mtu: 1500 device: e1000g0
e1000g1 type: non-vlan mtu: 1500 device: e1000g1
e1000g2 type: non-vlan mtu: 1500 device: e1000g2
aggr-1 type: non-vlan mtu: 1500 aggregation: key 1

# dladm show-aggr
key: 1 (0x0001) policy: L4      address: 0:21:28:4f:f8:3c (auto)
    device     address            speed      duplex  link  state
    e1000g0    0:21:28:4f:f8:3c   1000 Mbps  full    up    attached
    e1000g2    0:21:28:4f:f8:3e   1000 Mbps  full    up    attached
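
(Since aggr-1 is an LACP-keyed aggregation, the LACP view can confirm
that the switch side is participating properly; Solaris 10's dladm
documents an -L option for this, though this is hedged in case the
installed dladm predates it:

# dladm show-aggr -L      (LACP mode, timer, and partner state per member port)
)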


Paroksha

Mar 17, 2011, 9:10:18 AM

I have faced a similar problem once. Please check your kernel level;
this is a bug at the OS level.

The solution was to update the system.

Since you are talking about IPMP, it is surely the same issue. If
possible, apply the latest Recommended patch cluster.

We faced the same issue with IPMP, where the interface was automatically
switching over, the same kind of thing you are facing now. We tried all
the usual things; I don't remember exactly what the kernel level was.

The problem was solved after the update. It worked for us.
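
(For comparing patch levels, the usual Solaris 10 commands are, as a
sketch:

# cat /etc/release            (the Solaris 10 update release)
# uname -v                    (running kernel, e.g. Generic_144488-06)
# showrev -p | grep 144488    (which revisions of the kernel patch are installed)
)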

underh20

Mar 17, 2011, 5:21:16 PM

Interesting point. We have the following Solaris 10 release and kernel
patch level. Do they look similar to yours before or after the patch
was applied? Thanks, Bill

Oracle Solaris 10 9/10
kernel level: 144488-06

tim....@gmail.com

Mar 18, 2011, 12:17:34 PM

Another thing to consider is network congestion. Depending on the
configuration of IPMP, it is possible that a busy router, or the target
of the ping in a probe-based configuration, is dropping the ping
packets. That could cause the interface to be failed over and then, as
the traffic goes away, fail back. Try increasing the value in
/etc/default/mpathd to FAILURE_DETECTION_TIME=20000 (20 seconds) or
more.
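
(To see which tunables are active and whether probe-based detection has
a sane target, a sketch; in.mpathd probes routers on the interface's
subnet by default, so the default router is the usual suspect:

# grep -v '^#' /etc/default/mpathd     (active IPMP tunables)
# netstat -rn                          (the default router is a likely probe target)
# ping <default-router-IP>             (does the probe target itself drop pings? hypothetical address)
)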

underh20

Mar 20, 2011, 1:17:28 AM

On Mar 18, 9:17 am, "tim.w...@Inklingresearch.com" <tim.w...@gmail.com> wrote:
> On Mar 15, 11:00 pm, underh20 <underh20.scubadiv...@gmail.com> wrote:
>
> > Our Sun T5220 server "atlanta" runs Solaris 10. NIC failure and repair
> > messages appear in the message log regularly. Are there issues with the
> > network, or with the HBAs?
>
> > What are the commands and best ways to troubleshoot the network and the
> > HBAs? "aggr-1" is the virtual aggregate interface of two physical
> > interfaces on the server. "atlanta" is one of the two nodes in the Sun
> > Cluster. Thanks, Bill
>
> > Mar 1 12:24:07 atlanta in.mpathd[309]: [ID 23451 daemon.error] NIC
> > failure detected on aggr-1 of group ipmp0
> > Mar 1 12:24:07 atlanta Cluster.PNM: [ID 23452 daemon.notice] sc_ipmp0:
> > state transition from OK to DOWN.
>
> > Mar 1 12:24:23 atlanta in.mpathd[309]: [ID 300212 daemon.error] NIC
> > repair detected on aggr-1 of group ipmp0
> > Mar 1 12:24:23 atlanta Cluster.PNM: [ID 234234 daemon.notice]
> > sc_ipmp0: state transition from DOWN to OK.
>
> Another thing to consider is network congestion. Depending on the
> configuration of IPMP, it is possible that a busy router, or the target
> of the ping in a probe-based configuration, is dropping the ping
> packets. That could cause the interface to be failed over and then, as
> the traffic goes away, fail back. Try increasing the value in
> /etc/default/mpathd to FAILURE_DETECTION_TIME=20000 (20 seconds) or
> more.

The current value of FAILURE_DETECTION_TIME is 10000. Do we need to
restart any process or program after increasing the value to 20000?
Thanks, Bill

#
# Time taken by mpathd to detect a NIC failure in ms. The minimum time
# that can be specified is 100 ms.
#
FAILURE_DETECTION_TIME=10000

tim....@gmail.com

Mar 20, 2011, 11:17:36 AM

I believe you do need to restart mpathd.
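
(For what it's worth, the Solaris 10 in.mpathd man page says the daemon
rereads /etc/default/mpathd when it receives a SIGHUP, so after editing
the file something like this should apply the change without a full
restart:

# pkill -HUP in.mpathd
)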

Cydrome Leader

Mar 22, 2011, 1:19:02 PM

underh20 <underh20.s...@gmail.com> wrote:

> On Mar 16, 3:02 pm, Cydrome Leader <prese...@MUNGEpanix.com> wrote:
>> underh20 <underh20.scubadiv...@gmail.com> wrote:
>> > Our Sun T5220 server "atlanta" runs Solaris 10. NIC failure and repair
>> > messages appear in the message log regularly. Are there issues with the
>> > network, or with the HBAs?
>>
>> > What are the commands and best ways to troubleshoot the network and the
>> > HBAs? "aggr-1" is the virtual aggregate interface of two physical
>> > interfaces on the server. "atlanta" is one of the two nodes in the Sun
>> > Cluster. Thanks, Bill

Looks OK; there's nothing goofy like 100Mb half duplex or nonsense like
that. Does the switch they're connected to show that the ports are
happy?
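
(From the host side, the per-driver kstat counters give a finer-grained
view of the same question; a sketch, where instance 0 corresponds to
e1000g0 from the earlier output:

# kstat -p e1000g:0 | grep -i err      (driver-level error and drop statistics)
)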
