DETAIL:
At 10:02 this morning we started experiencing problems with the Cisco
7513 router in UC (uca.noc.uwo.ca) which provides service to about 40
campus subnets. The initial symptom was our traffic monitor
application (MRTG) complaining that it could not contact the router.
Brian responded to this and tried to contact the router via telnet and
was unsuccessful. He then tried to access it via the serial console
port. He was unable to get a command prompt but was seeing a stream of
error messages reporting insufficient memory to run processes in the
router. We then started losing connectivity with networks served by
this router.
At this point he called me in as it was looking like a hardware memory
failure and it was going to take a power cycle in order to get the
router into a state where we could even access it to troubleshoot. I
went on-site and power cycled it at about 10:55. When it rebooted it
griped about a memory parity problem in the RSP (that's the Route
Switch Processor - otherwise known as the CPU for the box) in slot 6
and automatically rebooted itself again to make the RSP in slot 7 the
master. It appeared to come up normally on the slot 7 RSP and as far
as we could see all nets were back on-line and functioning. We
concluded that the RSP in slot 6 had developed a memory failure and I
pulled the RSP from the backplane so that it could not cause further
problems.
I was paged again at 11:45 with a report that the problem was back. I
started troubleshooting and found that it was basically the same as
before - we couldn't access uca either by telnet or direct serial.
Same error messages as the first time. We were not losing contact with
attached subnets yet though. I opened an urgent priority 1 case with
Cisco TAC at 12:34 by email. Received an automated reply at 12:45
saying the case had been dispatched to an engineer and we would be
contacted.
Further poking and prodding while waiting for the call from Cisco led
me to find unusually high input traffic levels on 4 networks:
221/222/223/224. These are all in the Faculty of Engineering and are
all on the same community of switches. I contacted Tim Hunt in
Engineering and he found that he couldn't even get his Optivity network
monitor to run. I logged into their 2 main segmenting switches and a
brief look at the traffic statistics pointed me to L1AESB1-9port08 as
having an extremely high level of input broadcast traffic. This port
connects to L7ESB1-9 so I tried to access that switch to see if I
could track this down to a single port there, but because of the
broadcast traffic level being handled by this switch I was not able to
access it. I shut down L1AESB1-9port08 and the broadcast traffic
immediately went away. After about 2 minutes I was able to get access
to the uca router again and the error messages stopped.
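For anyone wanting to script the kind of port hunt described above, here
is a minimal sketch in Python. It is not what we ran; I did this by
eye from the switch statistics. It assumes you already have two
readings of a cumulative input-broadcast packet counter per port, taken
a known number of seconds apart, and that the counter is 32 bits wide.
The port names, the counter values, and the 1000 packets/second
threshold are all made up for illustration.

```python
# Sketch only: flag ports whose input broadcast rate looks like a storm.
# Assumes cumulative 32-bit per-port counters sampled twice (hypothetical data).

COUNTER_MODULUS = 2 ** 32  # assumed counter width; adjust for 64-bit counters


def broadcast_rates(before, after, interval_s):
    """Return packets/second per port from two cumulative counter samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    rates = {}
    for port, new_count in after.items():
        old_count = before.get(port)
        if old_count is None:
            continue  # port missing from the first sample; skip it
        # Modulo arithmetic handles a single counter wrap between samples.
        delta = (new_count - old_count) % COUNTER_MODULUS
        rates[port] = delta / interval_s
    return rates


def suspect_ports(rates, threshold_pps):
    """Return ports at or above the threshold, busiest first."""
    hot = [port for port, rate in rates.items() if rate >= threshold_pps]
    return sorted(hot, key=lambda port: rates[port], reverse=True)


if __name__ == "__main__":
    # Hypothetical samples taken 60 seconds apart.
    first = {"port08": 1000, "port09": 500}
    second = {"port08": 601000, "port09": 800}
    rates = broadcast_rates(first, second, 60)
    for port in suspect_ports(rates, threshold_pps=1000):
        print(port, rates[port])
```

The threshold is a judgment call. During an event like this one the
offending port stands out by orders of magnitude, so the exact number
matters less than comparing ports on the same switch against each other.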
I waited about 10 minutes and enabled the port again while keeping a
close eye out for the problem to come back. It didn't start up right
away and I was able to log onto L7ESB1-9 and look at the traffic
history there. This immediately pointed to port 13 as having extremely
high input traffic. I checked our database and found that this port
connected to ZEL76692 in ESB 2050 and phoned Tim with this. The
problem started again after about 2 minutes so I shut down
L7ESB1-9port13 and, sure enough, it cleared immediately. We were
eventually able to track the source of the problem down to one of 4
machines on a hub which someone had installed in that room. Tim
disconnected it and I enabled the port and all is well with the world
again. Tim will be following up to sort out the problem on their
machine.
Never did hear from a Cisco TAC engineer so at 13:59 I emailed them
and told them to ignore the case.
PEOPLE: Glen Marrier, Brian Borowski, and Tim Hunt.
SCHEDULE:
This all transpired on Tuesday Jun 22 starting at 10:02. The problem
affecting the rest of the campus was resolved at 12:58 and the
ultimate cause was tracked down and removed at about 13:35.
RECOVERY: That's what this was.
USER NOTIFICATION:
This message goes to the SOA list, the NOC, ITS Operations, and the
Help Desk.