Hi there!
I'm an interactive developer at the Guardian and I'm interested in looking at
train delays on a service level. To that end, I've been looking for a
comprehensive historical list of train delays over the last year, and was told
that the historical delay attribution data published on Network Rail's website
(http://www.networkrail.co.uk/transparency/datasets/) might be what I'm looking
for.
I apologise in advance for the length of this post; it grew quite quickly! The
TL;DR version is: I'm looking for any advice that might help me understand the
data, or indeed any suggestions as to where I might find a comprehensive list
of train delays.
Here goes!
There is an "Explanation of historic delay attribution" document available,
which doesn't really help much, but here is my interpretation of what some
of the fields mean:
- "Incident Number": An incident ID
- "Incident Reason": What caused the incident
- "Train ID - Affected": The late train's ID
- "Train ID - React": The train that directly caused "Train ID - Affected" to
be late
- "Train ID - Resp": The train that caused the initial delay which set off a
chain of delays
- "Incident Start Datetime": Incident start. I'd expected this to match the
time at which the original delay in the chain happened, but it doesn't.
- "Incident End Datetime": Incident end. I'd expected this to match the time
when the last train in the chain was delayed, but it doesn't.
- "Planned Origin WTT Datetime - Affected": The train's timetabled departure
time from origin
- "Planned Dest WTT Datetime - Affected": The train's timetabled arrival time
at destination
- "Event Datetime": When the delay occurred
- "Reactionary Reason Code": The reason the train was delayed
- "Performance Event Code": Whether the train was delayed or cancelled
(actually well explained in the doc)
- "Planned Origin Location Code - Affected": The train's origin STANOX
- "Planned Dest Location Code - Affected": The train's destination STANOX
- "Start Stanox"/"End Stanox": The delay happened between these two STANOXs
- "PfPI Minutes": The number of minutes the train was delayed
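To make my reading of the fields concrete, here is a minimal Python sketch of how I'm treating each row. The column names are the ones listed above; the timestamp format (DATE_FORMAT) and the sample row are assumptions I've made up for illustration, so adjust them to whatever the published files actually contain.

```python
import csv
import io
from datetime import datetime

# Assumption: timestamps are formatted like "01/04/2014 06:30".
# Change DATE_FORMAT to whatever the published files actually use.
DATE_FORMAT = "%d/%m/%Y %H:%M"

DATETIME_FIELDS = (
    "Incident Start Datetime",
    "Incident End Datetime",
    "Planned Origin WTT Datetime - Affected",
    "Planned Dest WTT Datetime - Affected",
    "Event Datetime",
)


def parse_rows(csv_text):
    """Parse delay attribution CSV text into dicts with typed values."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Convert whichever datetime columns are present and non-empty.
        for field in DATETIME_FIELDS:
            if row.get(field):
                row[field] = datetime.strptime(row[field], DATE_FORMAT)
        if row.get("PfPI Minutes"):
            row["PfPI Minutes"] = float(row["PfPI Minutes"])
        rows.append(row)
    return rows


# Invented sample row, for illustration only (not real data).
SAMPLE = (
    "Incident Number,Train ID - Affected,Incident Start Datetime,"
    "Incident End Datetime,Event Datetime,PfPI Minutes\n"
    "000001,TRAIN-A,01/04/2014 06:00,01/04/2014 07:00,01/04/2014 06:30,3\n"
)

rows = parse_rows(SAMPLE)
delay = rows[0]
print(delay["Train ID - Affected"], delay["Event Datetime"], delay["PfPI Minutes"])
```

Under this reading, each row is one delay to one train ("Train ID - Affected"), tagged with the incident it was attributed to.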
In the hope that it would help, I built a tool to visualise the data, from which
I've included a few links below and an explanation of what the tool shows
(sorry it's not very user friendly!)
============
What the visualisation tool shows:
NOTE: All field names come from the historical delay attribution data.
- 3 coloured blocks indicating train IDs, each unique ID is assigned a random
colour:
1) "Train ID - Affected"
2) "Train ID - React"
3) "Train ID - Resp"
- Grey background: time when the incident occurred, i.e. time between "Incident
Start Datetime" and "Incident End Datetime"
- Blue line: "Planned Origin WTT Datetime - Affected"
- Red line: "Planned Dest WTT Datetime - Affected"
- Black line: "Event Datetime"
- 7 text fields, which you can hover over to get the corresponding meaning:
1) "Reactionary Reason Code"
2) "Performance Event Code"
3) "Planned Origin Location Code - Affected"
4) "Start Stanox"
5) "End Stanox"
6) "Planned Dest Location Code - Affected"
7) "PfPI Minutes"
============
Going on the assumptions I outlined above, I've found a lot of cases where the
data shows things I wasn't expecting, mostly because I don't think I understand
what defines an "incident":
1) Why are there incidents with duplicate rows, i.e. seemingly the same delays,
where only a few values change? E.g. below there are two of every delay, and
only the "Incident Reason" and "Responsible Manager" values seem to change.
http://willpf.co.uk/traindelays/examples.html?incident-615888
2) Why are there incidents where all or some of the delays occur outside of the
incident start/end time?
http://willpf.co.uk/traindelays/examples.html?incident-627441
3) Equally, why do some incidents start/end a long time before/after the last
delay?
http://willpf.co.uk/traindelays/examples.html?incident-654826
4) Why are there incidents with 100s of delayed trains, but seemingly no link
between them?
Same example as 2)
http://willpf.co.uk/traindelays/examples.html?incident-628606
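In case it helps anyone reproduce these, the checks behind questions 1) and 2) can be sketched roughly as below. This is only an illustration of what I mean, not a definitive method: the function names (near_duplicates, outside_window) and the sample rows are made up, and it assumes each row is a dict whose datetime fields have already been parsed.

```python
from collections import defaultdict
from datetime import datetime

# Columns that differed between otherwise identical rows in question 1).
IGNORED = ("Incident Reason", "Responsible Manager")


def near_duplicates(rows, ignored=IGNORED):
    """Group rows that are identical once the ignored columns are dropped."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(sorted((k, v) for k, v in row.items() if k not in ignored))
        groups[key].append(row)
    return [group for group in groups.values() if len(group) > 1]


def outside_window(rows):
    """Return rows whose event time falls outside the incident start/end."""
    return [
        row for row in rows
        if not (row["Incident Start Datetime"]
                <= row["Event Datetime"]
                <= row["Incident End Datetime"])
    ]


def make_row(reason, manager, event):
    """Build an invented row for illustration only (not real data)."""
    return {
        "Incident Number": "000001",
        "Incident Reason": reason,
        "Responsible Manager": manager,
        "Train ID - Affected": "TRAIN-A",
        "Incident Start Datetime": datetime(2014, 4, 1, 6, 0),
        "Incident End Datetime": datetime(2014, 4, 1, 7, 0),
        "Event Datetime": event,
        "PfPI Minutes": 3.0,
    }


rows = [
    make_row("R1", "M1", datetime(2014, 4, 1, 6, 30)),
    # Same delay as above; only reason and manager differ (question 1).
    make_row("R2", "M2", datetime(2014, 4, 1, 6, 30)),
    # Event after the incident window has closed (question 2).
    make_row("R1", "M1", datetime(2014, 4, 1, 8, 15)),
]

duplicate_groups = near_duplicates(rows)
late_events = outside_window(rows)
print(len(duplicate_groups), "duplicate group(s),", len(late_events), "event(s) outside window")
```
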
I note the recently released http://www.mytrainjourney.co.uk/ and corresponding
Historical Performance API, but they only allow single queries rather than a
holistic view of all services. The NROD wiki says that the service is powered
by historical Darwin data. Does anyone know if the historical delay attribution
data is that same data?
Sorry for the incredibly long post, but if anyone has any pointers or ideas
where I can find more information I'd really appreciate it. Hopefully I haven't
overlooked a really obvious place to look for information but if I have,
apologies. Thanks for any help in advance!
Will
P.S. On a side note, I'd like to add that this forum and the NROD wiki have
been an incredible resource, and the dedication you all show to open data is
amazing (I've read some pretty heated arguments in the last few days!).