Schedule duplicates in full schedule

Adrian Hooper

Jan 19, 2017, 11:33:31 AM
to A gathering place for the Open Rail Data community
I've just been refactoring some code that handles the schedule, and came across something I think is a bit odd.
After clearing out my database and re-running the schedule ingestion, everything was running just fine, and then I started to see duplicates in the unique identifier of the schedule (UID, Start Date and STP Indicator, not just the UID).

Is this normal on the full ingestion? I assumed there'd be no need for duplications on this file? As of right now, my code is basically ignoring the duplicate ones and carrying on, but if it's expected, am I safe in ignoring these or do I need to do something else with them?
I know the schedule documentation only mentions 'create' or 'delete' schedules when processing the update file, but I also noticed duplicates appearing there as well. What should I be doing in the event of duplicates? Is a duplicate essentially an update?

David Turner

Jan 19, 2017, 2:34:47 PM
to Adrian Hooper, A gathering place for the Open Rail Data community
By "full ingestion" do you mean within a single full CIF file, or do you mean ingesting each of a sequence of nightly updates? I don't think I've ever seen duplicates within a single file, but they certainly do occur between files. Also deletions of schedules that don't actually seem to exist.

It's possible that this is because the CIF doesn't contain records for trains whose End Date is in the past, so e.g. an STP schedule that expires may be silently deleted and then recreated with the same start date but a newer end date - I've never actually bothered to look. Instead we treat duplicate "Create" records as if the second were a "Replace" (also mentioned in the documentation, not just "Create" and "Delete") and update things accordingly. This seems more sensible than ignoring the second one as it's more up-to-date than the first one.
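A minimal sketch of the "treat a duplicate Create as a Replace" approach described above, assuming records are dicts using the JSON feed's field names and keyed on UID, start date and STP indicator. The `transaction_type` field name and the in-memory `store` dict are assumptions for illustration; a real ingester would more likely do an upsert against its database.

```python
from typing import Dict, Tuple

# (CIF_train_uid, schedule_start_date, CIF_stp_indicator)
ScheduleKey = Tuple[str, str, str]


def schedule_key(record: dict) -> ScheduleKey:
    """Build the identifying key for a schedule record."""
    return (
        record["CIF_train_uid"],
        record["schedule_start_date"],
        record["CIF_stp_indicator"],
    )


def apply_record(store: Dict[ScheduleKey, dict], record: dict) -> str:
    """Apply one schedule record to the store and report what happened."""
    key = schedule_key(record)
    # Assumption: the action is carried in a "transaction_type" field.
    action = record.get("transaction_type", "Create")

    if action == "Delete":
        # Deleting a schedule that was never seen is tolerated, not an error.
        removed = store.pop(key, None)
        return "deleted" if removed is not None else "delete-missing"

    # A second "Create" for an existing key overwrites the first,
    # on the basis that the later record is the more up-to-date one.
    existed = key in store
    store[key] = record
    return "replaced" if existed else "created"
```

Logging the returned status makes it easy to see how often duplicates and deletions of unknown schedules actually occur.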

I'd be interested to hear a proper answer, too!

Cheers,

David




--
You received this message because you are subscribed to the Google Groups "A gathering place for the Open Rail Data community" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openraildata-talk+unsubscribe@googlegroups.com.
To post to this group, send email to openraildata-talk@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Adrian Hooper

Jan 19, 2017, 2:51:15 PM
to A gathering place for the Open Rail Data community, ageh...@gmail.com
Yeah a single CIF file (but the JSON version rather than CIF). What you say about the update files makes perfect sense, and I agree, I should really be replacing not ignoring so will probably adjust that, but still find it odd that I'm seeing duplicates. I've been wondering if it's due to the hash that I'm generating somehow producing a collision, so the input isn't actually the same but it's still producing the same hash. I may drop the hash completely and try again with just the concatenated values, but I'd be surprised if I were getting a collision.
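One way to rule a hash collision in or out is to remember which raw field values produced each hash and flag any hash that maps to more than one distinct set of values. This is only a sketch: the SHA-1 hash and the `|` separator are stand-ins, since the actual hash in use isn't shown here.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

KeyFields = Tuple[str, str, str]


def key_fields(record: dict) -> KeyFields:
    """The raw identifying values: UID, start date and STP indicator."""
    return (
        record["CIF_train_uid"],
        record["schedule_start_date"],
        record["CIF_stp_indicator"],
    )


def hashed_key(fields: KeyFields) -> str:
    """Stand-in hash; substitute whatever hash the ingester really uses."""
    return hashlib.sha1("|".join(fields).encode("utf-8")).hexdigest()


def find_collisions(records: Iterable[dict]) -> Dict[str, Set[KeyFields]]:
    """Return hashes that were produced by more than one distinct key."""
    seen: Dict[str, Set[KeyFields]] = defaultdict(set)
    for record in records:
        fields = key_fields(record)
        seen[hashed_key(fields)].add(fields)
    return {h: keys for h, keys in seen.items() if len(keys) > 1}
```

An empty result means every repeated hash came from genuinely identical field values, i.e. real duplicates rather than collisions. Using the tuple from `key_fields` directly as the key sidesteps the question entirely.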

David Turner

Jan 19, 2017, 3:51:55 PM
to Adrian Hooper, A gathering place for the Open Rail Data community
Hmm, ok, I use the CIF format rather than the JSON one. I see a handful of duplicates in the full JSON schedule but none in the CIF format one:

$ zcat CIF_ALL_FULL_DAILY%2Ftoc-full.gz | jq '.JsonScheduleV1 | {schedule_start_date,CIF_stp_indicator,CIF_train_uid}' -c | sort | uniq -c | sort -n | tail
      1 {"schedule_start_date":"2017-12-09","CIF_stp_indicator":"P","CIF_train_uid":"P52168"}
      2 {"schedule_start_date":"2016-12-11","CIF_stp_indicator":"P","CIF_train_uid":"Y60610"}
      2 {"schedule_start_date":"2016-12-15","CIF_stp_indicator":"P","CIF_train_uid":"H20523"}
      2 {"schedule_start_date":"2016-12-17","CIF_stp_indicator":"P","CIF_train_uid":"H02062"}
      2 {"schedule_start_date":"2016-12-17","CIF_stp_indicator":"P","CIF_train_uid":"H14818"}
      2 {"schedule_start_date":"2017-01-02","CIF_stp_indicator":"P","CIF_train_uid":"P70644"}
      2 {"schedule_start_date":"2017-02-05","CIF_stp_indicator":"O","CIF_train_uid":"Y69696"}
      2 {"schedule_start_date":"2017-02-05","CIF_stp_indicator":"O","CIF_train_uid":"Y69703"}
      2 {"schedule_start_date":"2017-05-21","CIF_stp_indicator":"P","CIF_train_uid":"H37737"}
  42046 {"schedule_start_date":null,"CIF_stp_indicator":null,"CIF_train_uid":null}
$ zcat CIF_ALL_FULL_DAILY%2Ftoc-full.CIF.gz | grep ^BS | cut -c4-15,80 | sort | uniq -c | sort -n | tail
      1 Y76899170114P
      1 Y76902170109C
      1 Y76902170109P
      1 Y76903170114C
      1 Y76903170114P
      1 Y76904170109C
      1 Y76904170109P
      1 Y76907170109P
      1 Y76909170114P
      1 Y76911170108P
$

Not sure what that's all about!
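For anyone without jq to hand, a rough Python equivalent of the first pipeline above, assuming the JSON file holds one JSON object per line. It skips lines that have no `JsonScheduleV1` key (presumably the other record types in the file, which would account for the large all-null row in the jq output) and reports only keys seen more than once. The function names are made up for this sketch.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str, str]


def count_schedule_keys(lines: Iterable[str]) -> Counter:
    """Count (UID, start date, STP indicator) across schedule records."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        schedule = json.loads(line).get("JsonScheduleV1")
        if schedule is None:
            continue  # not a schedule record, so nothing to count
        key = (
            schedule.get("CIF_train_uid"),
            schedule.get("schedule_start_date"),
            schedule.get("CIF_stp_indicator"),
        )
        counts[key] += 1
    return counts


def duplicate_keys(counts: Counter) -> Dict[Key, int]:
    """Keep only the keys that appear more than once."""
    return {key: n for key, n in counts.items() if n > 1}


# Usage against a gzipped download (the path is a placeholder):
#
#   import gzip
#   with gzip.open("toc-full.gz", "rt", encoding="utf-8") as f:
#       for key, n in duplicate_keys(count_schedule_keys(f)).items():
#           print(n, key)
```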

Cheers,

David




Adrian Hooper

Jan 19, 2017, 4:21:00 PM
to A gathering place for the Open Rail Data community, ageh...@gmail.com
Hmm, that's really strange. Maybe there's just an issue in the way the JSON file is converted from the CIF. On the upside, that basically confirms that my code is fine and there actually are duplicates :D so that's a relief.

Tom Lane

Jan 23, 2017, 8:39:48 AM
to A gathering place for the Open Rail Data community, ageh...@gmail.com
Do these take into account the 'bug' where STP of 'P' and 'O' may not be showing correctly?

I believe it is documented on the Wiki.

Adrian Hooper

Jan 23, 2017, 8:51:54 AM
to A gathering place for the Open Rail Data community, ageh...@gmail.com
Should do yeah. I thought it was only on activation messages on the movement feed though? I didn't find any documentation about issues for other feeds?

Tom Lane

Jan 23, 2017, 1:35:26 PM
to A gathering place for the Open Rail Data community, ageh...@gmail.com
Ah so it does. My mistake. :) 