SYSTEM: Campus Backbone Network
SUBSYSTEM: Reznet
SUMMARY:
Shortly before 15:00 on Sep 23 Glen was notified by Kim Knowles that a
large number of Reznet users in Huron College were not able to connect
to anything. She reported that the problem had started around 14:15
and it appeared to her that all of Huron's Reznet was affected. At
about that time we had inserted a switch into the connection between
nsca (the Cisco 7513 router that handles all of the Reznets) and
L3NSC00-1 (the Bay Accelar switch that nsca gets its Backbone
connectivity from) in order to provide an access point for
troubleshooting a different problem. Because of the coincidence in
timing we immediately removed the switch, although we could not
understand how it could be causing the problem. Sure enough, the
problem persisted after it was removed.
In order to try to define the exact nature of the problem we needed to
have someone at a PC on the affected network so we called Kim back to
see if she could get on to a Reznet user's PC and call us from there.
She did and we were able to determine that there was no physical
problem as she could ping and telnet to hosts on the network if she
used an IP address. However, when she tried to access anything using
an IP name the connection failed. This pointed us to a possible DNS
server problem. Kim had to leave at this point and we were confident
that the problem was narrowed down to the point that we could sort it
out.
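The check Kim ran by hand (reach a host by IP address, then try the
same host by name) can be sketched as a small script. This is only an
illustration, not something we used on the day: the function name
classify_failure, the default telnet port 23, and the result labels
are all assumptions made for the example.

```python
import socket


def classify_failure(name, ip, port=23,
                     resolve=socket.gethostbyname,
                     connect=socket.create_connection,
                     timeout=3.0):
    """Return 'unreachable', 'name-resolution', or 'ok'.

    Hypothetical helper: 'resolve' and 'connect' are injectable so the
    logic can be exercised without touching a real network.
    """
    # Step 1: connect by IP address. Failure here means a path or host
    # problem, not a naming problem.
    try:
        conn = connect((ip, port), timeout)
        conn.close()
    except OSError:
        return "unreachable"

    # Step 2: the address works, so now try the name. A failure here
    # points at name resolution (the DNS server, or the path to it).
    try:
        resolve(name)
    except OSError:
        return "name-resolution"

    return "ok"
```

A 'name-resolution' result matches what Kim saw. As the rest of this
report shows, it does not by itself say whether the DNS server is down
or merely unreachable from the client's network.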
Ed started looking at the DNS side of things and came up blank. The
DNS server was definitely up and working.
We started looking at traffic through the nsca router and got
sidetracked for a while by a red herring. Ed couldn't see any traffic
at all from subnet 219 (Huron Reznet) in the netflow data for nsca. It
took a while to sort out that there appears to be a setup problem in
the netflow data collection which is causing it to ignore net 219
completely (Ed to follow up on Friday on this one).
Back to the drawing board. We could not come up with any potential
theories and needed to step back and walk through things again. This
meant we needed someone on a PC at Huron's Reznet again. Called Stacey
at the Huron Help Desk and she was able to find a user, Dave
Forestell, who could work with us for a short while. After much head
scratching we were able to determine that machines from Huron had
connectivity to everything except 129.100.2.12 which is the primary
DNS server ns1.uwo.ca. Strange - we should have been getting
complaints from everywhere if there was a problem with it.
We started looking harder at ns1. We could definitely get to it from
other nets on campus. Tried pinging it from one of the switches on the
Huron Reznet - not working! At last we had a test we could use for
this problem that wouldn't require us to tie up Dave on the phone
(thanks for your help Dave!). Tried to access 129.100.2.12 from one of
the switches on the Elgin Hall Reznet - also not working! Hmm - Ed
realized that 129.100.2.12 is a second logical address on the same
physical interface as 129.100.2.60 so we tried pinging 129.100.2.60
from switches in both Huron and Elgin and that worked! Checked ARP
cache entries in nsca, L3NSC00-1, and L1NSC2-2 for 129.100.2.12 and
129.100.2.60 - correct in all locations. Tried pinging 129.100.2.12
from nsca - it works.
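To keep the ping results straight, they can be laid out as a small
table of source, destination and outcome. The sketch below only
records the tests described above; the names OBSERVATIONS and
summarize are made up for the example, and "other campus nets" stands
for the unspecified networks we tested ns1 from.

```python
# (source, destination, reachable) for each test described above.
OBSERVATIONS = [
    ("other campus nets", "129.100.2.12", True),
    ("Huron Reznet switch", "129.100.2.12", False),
    ("Elgin Hall Reznet switch", "129.100.2.12", False),
    ("Huron Reznet switch", "129.100.2.60", True),
    ("Elgin Hall Reznet switch", "129.100.2.60", True),
    ("nsca", "129.100.2.12", True),
]


def summarize(observations):
    """Group sources by destination into 'ok' and 'failed' lists."""
    summary = {}
    for source, destination, reachable in observations:
        entry = summary.setdefault(destination, {"ok": [], "failed": []})
        entry["ok" if reachable else "failed"].append(source)
    # Sort so the output does not depend on the order tests were run.
    for entry in summary.values():
        entry["ok"].sort()
        entry["failed"].sort()
    return summary


if __name__ == "__main__":
    for destination, entry in sorted(summarize(OBSERVATIONS).items()):
        print(destination, "ok:", entry["ok"], "failed:", entry["failed"])
```

Laid out this way, the only failures are from the Reznet switches to
129.100.2.12, while the same switches reach 129.100.2.60 on the same
physical interface and nsca itself reaches 129.100.2.12.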
We had seen a similar problem some time back where access to some
addresses was being blocked by an Accelar switch in the Backbone so we
moved the nsca connection to a new port on L3NSC00-1 - problem
persisted. This was really looking like the problem was in nsca and
that we might have to reboot it. Rebooting nsca would have disastrous
results for the whole Campus Backbone though so we decided to try
manually clearing all of the internal tables and caches possible
without bringing it down. Cleared the ARP cache first and the problem
disappeared immediately.
This is a real puzzle. We had already confirmed that the ARP entry for
129.100.2.12 in nsca was correct and if there was a problem with the
entry then we should not have been able to ping to 129.100.2.12 from
nsca at all.
PEOPLE: Glen Marrier and Ed Gibson, with help from Kim Knowles and
Dave Forestell at Huron College (thanks folks!).
SCHEDULE: Thursday September 23, 1999 from about 14:15 to 17:45.
PEOPLE AFFECTED:
Reznet users in Huron College and Elgin Hall for sure and possibly
other Reznet locations as well.
RECOVERY: That's what this documents.
USER NOTIFICATION:
This message goes to Kim Knowles and Dave Forestell (Huron College),
Brad McMillan (ITS Reznet coordinator) and the ITS Help Desk.