NROD & NRDP Service Disruptions


David Higginbottom

Jan 5, 2018, 7:09:32 AM
to A gathering place for the Open Rail Data community

Dear Users,

 

As you are probably aware, we had a number of issues on the NRDP and NROD feeds over the 2017 Christmas and New Year period.

 

These issues started on 21/12/17 when we performed our planned restarts of the servers.

 

The restarts were performed to pre-empt a hardware switchover planned by our data centre provider, Amazon AWS. By doing this early and in a controlled manner, we were trying to minimise the possible downtime window and ensure we had resources on hand to resolve any issues that might occur when the hardware switchover took place. We now know that the switchover was planned by Amazon to move the virtualised servers onto hardware that had been patched to protect against the Meltdown and Spectre bugs, although we were not informed of this at the time.

 

Soon after the restart we detected that the AWS patching had caused the automatic time synchronisation to drift on all of our servers. This started to affect any services that relied on time-limited tokens, for example the CIF downloads on NROD. Because of the AWS patches, it was also not possible for us to update the time manually on each server. The only way to bring the time back into sync was to reboot the virtual servers themselves. The drift then started again, but each reboot allowed a few days of normal usage. The rate of drift was inconsistent, so it was not possible for us to predict when a reboot would next be needed.
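For readers wondering why clock drift breaks time-limited tokens, here is a minimal Python sketch. It is purely illustrative and is not the NROD implementation: the function name `token_is_valid`, the 5-minute lifetime, the 10-minute drift and the tolerance parameter are all assumptions chosen for the example. It shows a token that is accepted by a server with a correct clock being rejected by a server whose clock has run ahead.

```python
from datetime import datetime, timedelta, timezone


def token_is_valid(issued_at, lifetime, server_now, tolerance=timedelta(0)):
    """Return True if server_now falls inside the token's validity window.

    Hypothetical helper for illustration only. `tolerance` widens the
    window on both sides to absorb some clock skew.
    """
    not_before = issued_at - tolerance
    expires_at = issued_at + lifetime + tolerance
    return not_before <= server_now <= expires_at


# Assumed values: a token issued at the true time, valid for 5 minutes.
true_now = datetime(2017, 12, 21, 12, 0, 0, tzinfo=timezone.utc)
issued_at = true_now
lifetime = timedelta(minutes=5)

# The client presents the token 30 seconds after it was issued.
presented_at = true_now + timedelta(seconds=30)

# A server with a correct clock sees the token as still valid.
in_sync = token_is_valid(issued_at, lifetime, presented_at)

# A server whose clock has drifted 10 minutes ahead sees it as expired.
drift = timedelta(minutes=10)
drifted = token_is_valid(issued_at, lifetime, presented_at + drift)

# Allowing 15 minutes of skew would accept it again, at the cost of
# tokens effectively living longer than intended.
tolerated = token_is_valid(
    issued_at, lifetime, presented_at + drift, tolerance=timedelta(minutes=15)
)

print(in_sync, drifted, tolerated)
```

A skew tolerance only masks drift up to a fixed bound, so with an unpredictable drift rate the clock itself still has to be corrected.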

 

We also noticed that the same applications were consuming up to 30% more system resources than previously. This was the cause of the latency and extended catch-up times on the NRDP feeds.

 

Our out-of-hours on-call support teams rebuilt a number of servers over the Christmas and New Year break, deploying the software applications onto more efficient and reliable AWS instance types. We needed a period of testing and observation before we could switch the production systems to these servers.

 

The main NROD servers were switched during the downtime on 03/01/2018 and the NRDP servers were switched on 04/01/2018.

Since the switchovers we have had no latency on NRDP and there has been no time drift on NRDP or NROD servers.

 

We still have some work to do on some back-end servers that may result in short interruptions as we run restarts. We'll notify you of these through the @open_rail_feeds Twitter account. This disruption will be minimal, and there should only be a couple of occurrences as we bring the updated instances online.

 

I apologise for any inconvenience caused to you during this period and thank you for your patience while we resolve these complex issues.

 

Kind regards

 

David Higginbottom

Head of Support

Digital Systems Group

CACI Limited

5th Floor 

8 St Paul’s Street

Leeds LS1 2LE


www.caci.co.uk


 

Follow @open_rail_feeds on twitter for service announcements and outage information

 



CACI Limited. Registered in England & Wales. Registration No. 1649776. CACI House, Avonmore Road, London, W14 8TS

Brad Gyton

Jan 5, 2018, 8:03:55 AM
to A gathering place for the Open Rail Data community
Thank you for this update, David.

Much appreciated.

Mike Kynaston

Jan 6, 2018, 7:26:54 AM
to A gathering place for the Open Rail Data community
David,
As somebody who is fairly new to this group, I'd like to thank you for taking the time and trouble not only to update us, but also to provide an explanation of what happened and the actions taken.

Having read back through the archive on a couple of occasions, I'm guessing that communication between Network Rail/CACI and users of the Open Data system has in the past not been as good as it could have been. That's not a criticism, more an observation. But I can understand the frustration felt by us, the 'end users' of the Open Data Feeds, when things go wrong: our own users and customers are likely asking what is going on, and we can say no more than "it's a Network Rail Data Feeds problem that is out of our hands", which makes it sound like we're passing the buck!

So thanks for the detailed update, much appreciated, and let us hope the reliability of the feeds settles down again.

Mike