Posted at 2024-01-19 15:55:13 GMT by Andrews & Arnold
_Started: 2024-01-19 15:50:00_
This is a summary and update regarding the problems we've been having
with our network, causing line drops for some customers, interrupting
their Internet connections for a few minutes at a time. It carries on
from the earlier, now out of date, post:
https://aastatus.net/42577
We are not only an Internet Service Provider.
We also design and build our own routers under the FireBrick brand.
This equipment is what we predominantly use in our own network to
provide Internet services to customers. These routers are installed
between our wholesale carriers (e.g. BT, CityFibre and TalkTalk) and
the A&A core IP network. This type of router is called an "LNS", which
stands for L2TP Network Server.
FireBricks are also deployed elsewhere in the core, providing our L2TP
and Ethernet services, as well as facing the rest of the Internet as
BGP routers to multiple Transit feeds, Internet Exchanges and CDNs.
Throughout the entire existence of A&A as an ISP, we have been running
various models of FireBrick in our network.
Our newest model is the FB9000. We have been running a mix of
prototype, pre-production and production variants of the FB9000 within
our network since early 2022.
As can sometimes happen with a new product, at a certain point we
started to experience some strange behaviour; essentially the hardware
would lock up and "watchdog" (and reboot) unpredictably.
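For readers unfamiliar with the term, a watchdog is a timer that a
running system must reset ("kick") regularly; if the system hangs and
the kicks stop, the timer expires and forces a reboot. The sketch below
is a toy Python simulation of that general idea only. The class name,
function name, timeout and timestamps are illustrative assumptions and
do not represent FireBrick firmware.

```python
class Watchdog:
    """Toy watchdog timer: expires if not kicked within `timeout` seconds."""

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout
        self.last_kick = now

    def kick(self, now):
        # A healthy system calls this regularly to show it is still alive.
        self.last_kick = now

    def expired(self, now):
        return now - self.last_kick > self.timeout


def simulate(kick_times, timeout, end_time):
    """Return the time at which the watchdog would force a reboot, or None.

    `kick_times` are the (hypothetical) moments at which the system
    managed to kick the watchdog before hanging.
    """
    dog = Watchdog(timeout)
    for t in sorted(kick_times):
        if dog.expired(t):
            break  # this kick arrives too late; the reboot already happened
        dog.kick(t)
    deadline = dog.last_kick + timeout
    return deadline if deadline < end_time else None


# System kicks at t=1, 2, 3 and then hangs: reboot is forced at t=5.
print(simulate([1, 2, 3], timeout=2, end_time=10))
# System keeps kicking every second: no reboot within the window.
print(simulate(range(1, 10), timeout=2, end_time=10))
```

The point of the sketch is that the watchdog only knows that the kicks
stopped, not why, which is part of what makes a hardware hang harder to
investigate than a software crash.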
Compared to a software 'crash', a hardware lock-up is very hard to
diagnose, as little information is obtainable when it happens. By
contrast, if the FireBrick software ever crashes, a 'core dump' is
posted with specific information about where the software problem
happened, which makes it a lot easier to find and fix.
After intensive work by our developers, the cause was identified as
(unexpectedly) something to do with the NVMe socket on the motherboard.
At design time, we had included an NVMe socket connected to the PCIe
pins on the CPU, for undecided possible future uses. We did not
populate the NVMe socket, though. The hanging issue completely cleared
up once an NVMe device was installed, even though it was not used for
anything at all.
As a second approach, the software was then modified to force PCIe to
be switched off, so that we would not need to install NVMe devices in
all the units.
This certainly did solve the problem in our test rig (which consists of
multiple FB9000s, PCs to generate traffic, switches etc.). For several
weeks, FireBricks which had formerly been hanging often in
"artificially worsened" test conditions stopped hanging altogether,
becoming extremely stable.
So, we thought the problem was resolved. And, indeed, in our test rig
we still have not seen a hang. Not even once, across multiple FB9000s.
However...
We did then start seeing hangs in our live prototype units in
production (causing dropouts to our broadband customers).
At the same time, the FB9000s we have elsewhere in our network, not
running as LNS routers, are stable.
We are still working on pinpointing the cause of this, which we think
is highly likely to be related to the original (now solved) problem.
Further work...
Over the next 1-2 weeks we will be installing several extra FB9000 LNS
routers. We are installing these with additional low-level monitoring
capabilities in the form of JTAG connections from the main PCB so that
in the event of a hardware lock-up we can directly gather more
information.
The enlarged pool of LNSs will also reduce the number of customers
affected if there is a lock-up of one LNS.
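As a rough illustration of that arithmetic, the Python sketch below
assumes customer sessions are spread evenly across the LNS pool. The
customer and LNS counts are made-up figures for illustration, not our
real numbers, and the function name is hypothetical.

```python
def customers_per_lns(total_customers, lns_count):
    """Worst-case customers on one LNS, assuming an even spread of sessions."""
    if lns_count < 1:
        raise ValueError("need at least one LNS")
    # Integer ceiling division: one LNS may carry the remainder.
    return -(-total_customers // lns_count)


# Hypothetical figures only: 12,000 customers on 4 LNSs, then on 6.
before = customers_per_lns(12_000, 4)
after = customers_per_lns(12_000, 6)
print(before, after)  # 3000 2000
```

Under those assumptions, a lock-up of a single LNS would interrupt a
third fewer customers after the pool is enlarged from 4 to 6.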
We obviously do apologise for the blips customers have been seeing. We
do take this very seriously, and are not happy when customers are
inconvenienced.
We can imagine some customers might also be wondering why we bother to
make our own routers, and not just do what almost all other ISPs do,
and simply buy them from a major manufacturer. This is a fair question.
At times like this, it is a question we ask ourselves!
Ultimately, we do still firmly believe the benefits of having the
FireBrick technology under our complete control outweigh the
disadvantages. CQM graphs are still almost unique to us, and these
would simply not be possible without FireBrick. There have also been
numerous individual cases where our direct control over the firmware
has enabled us to implement individual improvements and changes that
have benefitted one or many customers.
Many times over the years we have been able to diagnose problems with
our carrier partners, which they themselves could not see or
investigate. This level of monitoring is facilitated by having
FireBricks.
But in order to have finished FireBricks, we have to develop them. And
development involves testing, and testing can sometimes reveal
problems, which then affect customers.
We do not feel we were premature in introducing prototype FireBricks
into our network, having had them under test, not routing live customer
traffic, for an appropriate period beforehand.
But some problems can only reveal themselves once a "real world" level
and nature of traffic is being passed. This is unavoidable, and whilst
we do try hard to minimise disruption, we still feel the long term
benefits of having FireBricks more than offset the short term problems
in the late stages of development. We hope our detailed view on this is
informative, and even persuasive.
_Update expected: 2024-01-22 13:00:00_
URL:
https://aastatus.net/apost.cgi?incident=42608
--
AAISP Status Feed
URL:
https://aastatus.net/