[Status] [PEW] Broadband: Work to help resolve recent LNS problems (Updated 18 Jan) (Open)

Andrews & Arnold

Jan 19, 2024, 10:56:04 AM

Posted at 2024-01-19 15:55:13 GMT by Andrews & Arnold

_Started: 2024-01-19 15:50:00_

This is a summary and update regarding the problems we've been having
with our network, causing line drops for some customers, interrupting
their Internet connections for a few minutes at a time. It carries on
from the earlier, now out of date, post: https://aastatus.net/42577

We are not only an Internet Service Provider.

We also design and build our own routers under the FireBrick brand.
This equipment is what we predominantly use in our own network to
provide Internet services to customers. These routers are installed
between our wholesale carriers (e.g. BT, CityFibre and TalkTalk) and
the A&A core IP network. The type of router is called an "LNS", which
stands for L2TP Network Server.

FireBricks are also deployed elsewhere in the core, providing our L2TP
and Ethernet services, as well as facing the rest of the Internet as
BGP routers to multiple Transit feeds, Internet Exchanges and CDNs.

Throughout the entire existence of A&A as an ISP, we have been running
various models of FireBrick in our network.

Our newest model is the FB9000. We have been running a mix of
prototype, pre-production and production variants of the FB9000 within
our network since early 2022.

As can sometimes happen with a new product, at a certain point we
started to experience some strange behaviour: essentially, the hardware
would lock up, trigger the "watchdog" and reboot, unpredictably.
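
For readers unfamiliar with the term, the watchdog behaviour described
above can be sketched as a toy simulation. This is a generic
illustration of how a watchdog timer works, not the FireBrick
implementation; the class name, the tick-based timing and the timeout
value are all assumptions made for the example.

```python
class Watchdog:
    """Toy watchdog: fires if it is not kicked within timeout_ticks."""

    def __init__(self, timeout_ticks: int) -> None:
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks

    def kick(self) -> None:
        # Healthy software resets the countdown regularly.
        self.remaining = self.timeout_ticks

    def tick(self) -> bool:
        # Advance time by one tick; return True if the watchdog fired.
        self.remaining -= 1
        if self.remaining <= 0:
            self.remaining = self.timeout_ticks
            return True
        return False


def simulate(hang_at: int, total_ticks: int, timeout_ticks: int = 3) -> list[int]:
    """Return the ticks at which the (hypothetical) watchdog forced a reboot."""
    watchdog = Watchdog(timeout_ticks)
    fired_at = []
    hung = False
    for t in range(total_ticks):
        if t == hang_at:
            hung = True          # the system locks up and stops kicking
        if not hung:
            watchdog.kick()
        if watchdog.tick():
            fired_at.append(t)
            hung = False         # the forced reboot clears the hang
    return fired_at


print(simulate(hang_at=5, total_ticks=12))
```

The point of the sketch is that the watchdog only knows the system
stopped responding. It restores service, but records nothing about why
the hang happened.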

Compared to a software 'crash', a hardware lock-up is very hard to
diagnose, as little information is obtainable when it happens. If the
FireBrick software crashes, by contrast, a 'core dump' is posted with
specific information about where the software problem happened, which
makes it a lot easier to find and fix.

After intensive work by our developers, the cause was identified as
(unexpectedly) something to do with the NVMe socket on the motherboard.
At design time, we had included an NVMe socket connected to the PCIe
pins on the CPU, for undecided possible future uses. We did not
populate the NVMe socket, though. The hanging issue completely cleared
up once an NVMe drive was installed, even though it was not used for
anything at all.

As a second approach, the software was then modified to force the PCIe
interface to be switched off, so that we would not need to install NVMe
drives in all the units.

This certainly did solve the problem in our test rig (multiple FB9000s,
PCs to generate traffic, switches, etc.). For several weeks, FireBricks
which had formerly been hanging often in "artificially worsened" test
conditions stopped hanging altogether, becoming extremely stable.

So, we thought the problem was resolved. And, indeed, in our test rig
we still have not seen a hang. Not even once, across multiple FB9000s.

However...

We did then start seeing hangs in our live prototype units in
production (causing dropouts for our broadband customers).

At the same time, the FB9000s we have elsewhere in our network, not
running as LNS routers, are stable.

We are still working on pinpointing the cause of this, which we think
is highly likely to be related to the original (now solved) problem.

Further work...

Over the next 1-2 weeks we will be installing several extra FB9000 LNS
routers. We are installing these with additional low-level monitoring
capabilities in the form of JTAG connections from the main PCB so that
in the event of a hardware lock-up we can directly gather more
information.

The enlarged pool of LNSs will also reduce the number of customers
affected if there is a lock-up of one LNS.
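
As a rough illustration of that last point: if sessions were spread
evenly across the pool (an assumption; the real distribution is not
stated here), a single hung LNS would drop about 1/N of them. The
function name and the pool sizes below are illustrative only.

```python
from fractions import Fraction


def share_affected(pool_size: int) -> Fraction:
    """Fraction of sessions dropped when exactly one LNS in the pool hangs.

    Assumes sessions are spread evenly across the pool.
    """
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    return Fraction(1, pool_size)


# Illustrative pool sizes, not a statement of the actual deployment.
for pool_size in (3, 4, 7):
    share = share_affected(pool_size)
    print(f"{pool_size} LNSs: roughly {float(share):.0%} of sessions per hang")
```

Under that assumption, growing a pool from three to seven LNSs would
cut the share of customers hit by a single hang from about a third to
about a seventh.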

We obviously do apologise for the blips customers have been seeing. We
do take this very seriously, and are not happy when customers are
inconvenienced.

We can imagine some customers might also be wondering why we bother to
make our own routers, and not just do what almost all other ISPs do,
and simply buy them from a major manufacturer. This is a fair question.
At times like this, it is a question we ask ourselves!

Ultimately, we do still firmly believe the benefits of having the
FireBrick technology under our complete control outweigh the
disadvantages. CQM graphs are still almost unique to us, and these
would simply not be possible without FireBrick. There have also been
numerous individual cases where our direct control over the firmware
has enabled us to implement individual improvements and changes that
have benefitted one or many customers.

Many times over the years we have been able to diagnose problems with
our carrier partners, which they themselves could not see or
investigate. This level of monitoring is facilitated by having
FireBricks.

But in order to have finished FireBricks, we have to develop them. And
development involves testing, and testing can sometimes reveal
problems, which then affect customers.

We do not feel we were irrationally premature in introducing prototype
FireBricks into our network, having had them under test, not routing
live customer traffic, for an appropriate period beforehand.

But some problems can only reveal themselves once a "real world" level
and nature of traffic is being passed. This is unavoidable, and whilst
we do try hard to minimise disruption, we still feel the long term
benefits of having FireBricks more than offset the short term problems
in the late stages of development. We hope our detailed view on this is
informative, and even persuasive.

_Update expected: 2024-01-22 13:00:00_

URL: https://aastatus.net/apost.cgi?incident=42608

--
AAISP Status Feed
URL: https://aastatus.net/

Andrews & Arnold

Jan 19, 2024, 2:08:05 PM

Update #1: 2024-01-19 19:02:42 GMT

| We do not feel we were irrationally premature in introducing prototype

Andrews & Arnold

Jan 22, 2024, 11:56:05 AM

Update #2: 2024-01-22 16:55:23 GMT

> _Update 22 Jan 2024 16:55:23_ Action point: Replacement of three LNSs
> on Tuesday 23rd January: https://aastatus.net/42609 _Update expected:
> 2024-01-23 17:00:00_

Andrews & Arnold

Jan 27, 2024, 5:40:05 AM

Update #3: 2024-01-27 10:33:15 GMT

| * Action point: Replacement of three LNSs on Tuesday 23rd January:
> https://aastatus.net/42609
>
> * Action point: Work on Z.Witless from Sunday 28th January:
> https://aastatus.net/42614
>
> _Update expected: 2024-01-29 13:00:00_

Andrews & Arnold

Feb 1, 2024, 11:32:05 AM

Update #4: 2024-02-01 16:30:24 GMT

| https://aastatus.net/42609 _Semi-completed_

* Action point: Work on Z.Witless from Sunday 28th January:

| https://aastatus.net/42614 _Completed_

|
>
> _Update 1 Feb 2024 16:30:24_ As of 1st February two out of three of our
> 'Witless' LNSs have been fitted with NVMe drives and JTAG debugging
> capabilities. If/when they have a hardware lock-up we'll be able to
> gain a bit more of an insight into the cause. The third LNS has not,
> but it has been stable with an uptime of 78 days. _Update expected:
> 2024-02-05 13:00:00_

Andrews & Arnold

Feb 2, 2024, 5:40:06 AM

Update #5: 2024-02-02 10:38:01 GMT

| _Update 22 Jan 2024 16:55:23_ _Work being carried out:_

* Action point: Replacement of three LNSs on Tuesday 23rd January:
https://aastatus.net/42609 _Semi-completed_

* Action point: Work on Z.Witless from Sunday 28th January:
https://aastatus.net/42614 _Completed_

> * Action point: Install a new LNS in to the pool. 2nd February:
> https://aastatus.net/42615 _Completed_

|
| _Update 2 Feb 2024 10:30:00_ _Latest Summary, as of 2nd February:_
| Three out of four of our FB9000 LNSs have been fitted with NVMe drives
| and JTAG debugging capabilities. If/when they have a hardware lock-up
| we'll be able to gain a bit more of an insight into the cause. The
| fourth LNS has not, but it has been stable with an uptime of 79 days. _Update

Andrews & Arnold

Feb 2, 2024, 9:40:06 AM

Update #6: 2024-02-02 14:38:54 GMT

> * Action point: Upgrade A, B, C Gormless to FB9000: Work starting
> from Saturday 3rd February: https://aastatus.net/42616

Andrews & Arnold

Feb 5, 2024, 3:32:05 PM

Update #7: 2024-02-05 20:30:10 GMT

> _Update 5 Feb 2024 20:30:10_ _5th Feb:_ Both Z and Y have hung in
> recent days (Saturday 3rd and Monday 5th) - we are currently analysing
> the data from the various cache and memory systems that we were able to
> retrieve from the hardware whilst it was in its hung state. _Update
> expected: 2024-02-07 13:00:00_

Andrews & Arnold

Feb 9, 2024, 11:32:05 AM

Update #8: 2024-02-09 16:29:31 GMT

| https://aastatus.net/42609 _Completed_

* Action point: Work on Z.Witless from Sunday 28th January:
https://aastatus.net/42614 _Completed_

* Action point: Install a new LNS in to the pool. 2nd February:
https://aastatus.net/42615 _Completed_

* Action point: Upgrade A, B, C Gormless to FB9000: Work starting
| from Saturday 3rd February: https://aastatus.net/42616 _Completed_

> * Action point: Spread customer connections over the new LNSs: 10th
> and 11th February: https://aastatus.net/42620

>
_Update 2 Feb 2024 10:30:00_ _Latest Summary, as of 2nd February:_
Three out of four of our FB9000 LNSs have been fitted with NVMe drives
and JTAG debugging capabilities. If/when they have a hardware lock-up
we'll be able to gain a bit more of an insight into the cause. The
fourth LNS has not, but it has been stable with an uptime of 79 days.

_Update 5 Feb 2024 20:30:10_ _5th Feb:_ Both Z and Y have hung in
recent days (Saturday 3rd and Monday 5th) - we are currently analysing
the data from the various cache and memory systems that we were able to
retrieve from the hardware whilst it was in its hung state. _Update
| expected: 2024-02-11 13:00:00_

Andrews & Arnold

Feb 9, 2024, 11:56:05 AM

Update #9: 2024-02-09 16:51:10 GMT

| _Update 9 Feb 2024 16:50:00_ _Latest Summary, as of 9th February:_ We
| now have a larger pool of FB9000 LNSs. Six out of seven of them have
| been fitted with NVMe drives and JTAG debugging capabilities. If/when
| they have a hardware lock-up we'll be able to gain a bit more of an
| insight into the cause. The seventh LNS has not, but it has been
> stable with an uptime of 86 days.
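
As a quick consistency check on the uptimes quoted across these
updates (78 days on 1st February, 79 on 2nd February, 86 on 9th
February), the sketch below works backwards to the implied last boot
date. It assumes the figures are whole days as quoted; the exact time
of day is not given, so the result is approximate. The variable and
function names are illustrative.

```python
from datetime import date, timedelta

# (date the uptime was reported, uptime in whole days), as quoted in the updates.
reports = [
    (date(2024, 2, 1), 78),
    (date(2024, 2, 2), 79),
    (date(2024, 2, 9), 86),
]


def implied_boot_dates(reports: list[tuple[date, int]]) -> set[date]:
    """Return the set of boot dates implied by each (report date, uptime) pair."""
    return {reported - timedelta(days=uptime) for reported, uptime in reports}


boot_dates = implied_boot_dates(reports)
print(sorted(boot_dates))  # a single date means the reports agree
```

All three reports imply the same boot date in mid-November 2023, which
is consistent with the unit not having restarted between updates.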

Andrews & Arnold

Feb 10, 2024, 9:00:05 AM

Update #10: 2024-02-10 13:57:05 GMT

Andrews & Arnold

Feb 11, 2024, 3:48:05 PM

Update #11: 2024-02-11 20:44:43 GMT

| and 11th February: https://aastatus.net/42620 _Completed_

Andrews & Arnold

Feb 16, 2024, 8:32:04 AM

Update #12: 2024-02-16 13:28:00 GMT

| _Update 16 Feb 2024 13:27:00_ _Work being carried out:_

* Action point: Replacement of three LNSs on Tuesday 23rd January:
https://aastatus.net/42609 _Completed_

* Action point: Work on Z.Witless from Sunday 28th January:
https://aastatus.net/42614 _Completed_

* Action point: Install a new LNS in to the pool. 2nd February:
https://aastatus.net/42615 _Completed_

* Action point: Upgrade A, B, C Gormless to FB9000: Work starting
from Saturday 3rd February: https://aastatus.net/42616 _Completed_

* Action point: Spread customer connections over the new LNSs: 10th
and 11th February: https://aastatus.net/42620 _Completed_

> * Action point: Software upgrades and spread customer connections
> over the LNSs: 17th and 18th February: https://aastatus.net/42626

>
_Update 9 Feb 2024 16:50:00_ _Latest Summary, as of 9th February:_ We
now have a larger pool of FB9000 LNSs. Six out of seven of them have
been fitted with NVMe drives and JTAG debugging capabilities. If/when
they have a hardware lock-up we'll be able to gain a bit more of an
insight into the cause. The seventh LNS has not, but it has been
stable with an uptime of 86 days.

_Update 5 Feb 2024 20:30:10_ _5th Feb:_ Both Z and Y have hung in
recent days (Saturday 3rd and Monday 5th) - we are currently analysing
the data from the various cache and memory systems that we were able to
retrieve from the hardware whilst it was in its hung state. _Update
| expected: 2024-02-19 13:00:00_