GTFS-RT Proposal: "strict" flag


Andrew Byrd

Nov 27, 2017, 6:25:41 PM
to GTFS-realtime
[Apologies for any duplicates, my initial post appears to have been rejected]

Hello,

A debate ensued on https://github.com/google/transit/pull/65 around the idea that stop time updates may lack either an arrival or departure time. I believe we have reached a consensus that to be coherent and backward compatible, the spec must allow such missing values, especially since much realtime data is currently produced by systems that will continue to lack one or the other of these values for the foreseeable future.

However, certain GTFS-RT producers do not want to require all RT consumers to implement their own delay propagation and other heuristics to deal with missing values. A producer may have access to large amounts of archival operational data, sophisticated prediction models, or other special resources, and may therefore intend to provide consumers with authoritative predictions, relieving them of the need for ad-hoc techniques to smooth over missing data.

This idea emerged during a project in the Netherlands where real-time support was being added to several routing systems at once, including OpenTripPlanner. Significant amounts of extra code were necessary to match partial trip updates to trips, and there was duplicated effort implementing basic, unreliable heuristics to guess how vehicle delays would propagate down a route. Several components created during that project and still in use adhere to the ideas in this proposal.

PROPOSAL: 

This proposal adds a single boolean field named “strict” to the FeedHeader message. (The name is debatable. Alternatives include “unambiguous”, “high_quality”, “complete_data”, etc.)
 
By setting the value of this field to true, the producer indicates its intent to provide complete, unambiguous, high-quality arrival and departure predictions. When this field is true, each TripUpdate message shall include a StopTimeUpdate message for every planned or already visited stop on the corresponding trip. Each of these StopTimeUpdate messages shall include both an arrival and departure StopTimeEvent.

When this field is true, consumers may apply simpler, stricter validation criteria to the incoming messages, eliminating the need for heuristics that fill in missing arrival or departure StopTimeEvents or entire missing StopTimeUpdates.

This new field defaults to false, so that feeds lacking the field will be held to existing less-strict requirements, and consumers will know they must provide mechanisms to fill in missing information in such feeds.
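
To make the completeness rule concrete, here is a minimal sketch of the check a consumer might run when the flag is true. It uses plain Python dataclasses rather than the real protobuf bindings. The names `StopTimeUpdate`, `TripUpdate`, `validate_strict` and the `scheduled_sequences` argument are hypothetical stand-ins for illustration, and `strict` itself is only the proposed field, not part of the current spec.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StopTimeUpdate:
    # Hypothetical stand-in for the protobuf message of the same name.
    stop_sequence: int
    arrival: Optional[int] = None    # predicted arrival, POSIX seconds
    departure: Optional[int] = None  # predicted departure, POSIX seconds


@dataclass
class TripUpdate:
    # Hypothetical stand-in for the protobuf message of the same name.
    trip_id: str
    stop_time_updates: List[StopTimeUpdate] = field(default_factory=list)


def validate_strict(update: TripUpdate, scheduled_sequences: List[int]) -> List[str]:
    """Return violations of the proposed strict-mode rule (empty list if none)."""
    errors = []
    by_sequence = {stu.stop_sequence: stu for stu in update.stop_time_updates}
    for seq in scheduled_sequences:
        stu = by_sequence.get(seq)
        if stu is None:
            # Every stop on the trip must have a StopTimeUpdate.
            errors.append(f"{update.trip_id}: no StopTimeUpdate for stop_sequence {seq}")
            continue
        # Each StopTimeUpdate must carry both an arrival and a departure.
        if stu.arrival is None:
            errors.append(f"{update.trip_id}: stop_sequence {seq} lacks an arrival")
        if stu.departure is None:
            errors.append(f"{update.trip_id}: stop_sequence {seq} lacks a departure")
    return errors
```

Here `scheduled_sequences` is assumed to come from the static stop_times data for the trip. A consumer could reject the message whenever the returned list is non-empty.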

I welcome comments on this proposal before it is finalized.

Regards,
Andrew

Sean Barbeau

Nov 27, 2017, 8:15:50 PM
to Andrew Byrd, gtfs-r...@googlegroups.com, ITD Consortium
Andrew,
Thanks for the proposal! 

Could you explain how this proposal improves upon using the existing GTFS-realtime spec, with a producer simply populating every arrival and departure time for all stop-time-updates for all trips?

Off the top of my head I can't see how this differs from the "strict mode" - in other words, a "strict mode" flag seems redundant in this case. Based on currently defined rules I believe consumers would be required to use all of these predictions provided by the producer - they wouldn't be allowed to create their own.

Sean


On 7 Nov 2017, at 23:09, Sean Barbeau <sjba...@gmail.com> wrote:

Hi all,
There is a GTFS-realtime proposal currently open for voting at:

To summarize the proposal, currently the GTFS-realtime spec has conflicting information over whether an arrival AND a departure prediction must be provided for each stop_time_update. This proposal addresses the conflict by removing the text that says you must provide both an arrival AND departure. The remaining language in the GTFS-realtime spec then clearly states that producers can provide an arrival OR a departure prediction.

Voting will end on Tuesday November 14th at 23:59:59 UTC.

Please vote via comment on this pull request on Github ("+1" or "Yes" if you agree with the proposed change, "-1" or "No" if you disagree with the proposed change and would prefer to leave the documentation as-is):

Full text of the proposal is available at the above link.

Thanks,
Sean

Sean Barbeau, Ph.D.
Center for Urban Transportation Research
University of South Florida



Sean Barbeau

Jan 6, 2018, 11:52:54 AM
to GTFS-realtime
Andrew,
Happy New Year!  Just wanted to follow up on this proposal and my below question, as we still have a conflict in the GTFS-realtime spec we need to resolve.

Thanks,
Sean

Andrew Byrd

Jan 17, 2018, 8:36:46 AM
to gtfs-r...@googlegroups.com, ITD Consortium
Hi Sean,

The idea of the flag is for producers to declare the characteristics of the data they produce to consumers. This applies to both completeness and semantics: a producer may state that it expects to have its output validated more strictly, and also interpreted as having a more precise meaning.

First, validation level: If the consuming app encounters a missing prediction in a trip and the flag is set to true, it will be considered an error and the message rejected. If the flag is set to false or not included (it would default to false for backward compatibility), the consumer should propagate delays downstream without complaining.
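
As a rough illustration of these two behaviours, the following Python sketch rejects an incomplete message when the flag is true and otherwise carries the last known delay forward. The function name `resolve_delays` and the representation of a trip as a list of optional per-stop delays are assumptions made for this example only. Treating stops before the first prediction as on schedule is likewise a simplifying assumption, not something the spec prescribes.

```python
from typing import List, Optional


def resolve_delays(delays: List[Optional[int]], strict: bool = False) -> List[int]:
    """Return one delay (seconds) per stop, or raise if a strict feed has gaps."""
    if strict:
        # Strict feed: a missing prediction is an error, so reject the message.
        if any(d is None for d in delays):
            raise ValueError("strict feed is missing a prediction; rejecting message")
        return list(delays)
    # Non-strict feed: propagate the last known delay downstream.
    resolved = []
    last_known = 0  # assumption: stops before the first prediction run on schedule
    for d in delays:
        if d is not None:
            last_known = d
        resolved.append(last_known)
    return resolved
```

For example, `resolve_delays([None, 60, None, 120])` fills the gaps by propagation, while the same input with `strict=True` raises a `ValueError`.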

Second, semantics: if the flag is set to true, the producer is claiming that the predictions at each stop are a best effort at actual arrival prediction using a suitable algorithm and the best available statistical methods and/or historical data, accounting for timing points etc. Such messages are intended as a lossless interface between specialized software components. When the flag is set to false or absent, delays may have been propagated using basic heuristics or decay functions, or may be provided only at the last observed location of a vehicle. The consumer may decide to conservatively interpret such predictions as lowest-common-denominator output from non-specialized systems such as legacy AVL, and be more careful about presenting such results to users, e.g. when suggesting time-sensitive transfers to passengers based on those predictions.

A single general-purpose consumer app (e.g. mobile trip planner) might want to interpret inputs differently depending on the declared quality of the source, or it might only want to interface with sources that declare a certain clear set of semantics.

Here is a link to the document in which this field was proposed a little over two years ago:
https://docs.google.com/document/d/19Dy6afltgs1ebbxKQGX4jpzWHh--Iw4AOO_rtX1bQoc/edit?usp=sharing

The document is open to comments and already has input from several people.

Regards,
Andrew


Sean Barbeau

Feb 2, 2018, 9:58:41 AM
to GTFS-realtime
Andrew,
Sorry for the delayed response - I posted a reply the same day, but looking at the thread now it appears it never actually posted.

> First, validation level: If the consuming app encounters a missing prediction in a trip and the flag is set to true, it will be considered an error and the message rejected. If the flag is set to false or not included (it would default to false for backward compatibility) the consumer should propagate delays downstream without complaining.

As you imply, I see this strictly (pun intended ;) ) as a validation issue.  I'm happy to add a warning to the gtfs-realtime-validator to capture this.  I don't see this by itself being a reason to add a "strict" flag.

> Second, semantics: if the flag is set to true, the producer is claiming that the predictions at each stop are a best effort at actual arrival prediction using a suitable algorithm and the best available statistical methods and/or historical data, accounting for timing points etc. When the flag is set to false or absent, delays may have been propagated using basic heuristics or decay functions, or may be provided only at the last observed location of a vehicle.

This is great in theory, but I think it will be very challenging to outline this in a specification so it's interpreted the same by all producers and consumers.  What's a "best effort at actual arrival predictions"?  What's a "suitable algorithm"?  What's a "best available statistical method"?

And again, using today's GTFS-realtime spec, if a producer doesn't want a consumer to do any propagation of times (i.e., to show only times that are generated by the producer), then the producer just needs to provide a complete trip_update with all stop_time_updates.  So we don't need any additional fields to handle this case.

> A single general-purpose consumer app (e.g. mobile trip planner) might want to interpret inputs differently depending on the declared quality of the source

This sounds like it's introducing more complexity and ambiguity into the format.  Generally speaking, I'd prefer to move away from allowing different interpretations of data to having concise, well-defined rules surrounding propagation.  We clarified some cases in GTFS-realtime v2.0, and I'd prefer to focus on continuing to close the gap on any remaining gray areas, rather than introducing something that allows for more interpretations.

> Here is a link to the document in which this field was proposed a little over two years ago:
> https://docs.google.com/document/d/19Dy6afltgs1ebbxKQGX4jpzWHh--Iw4AOO_rtX1bQoc/edit?usp=sharing

Thanks for providing this - it looks like it was part of a proposal for DIFFERENTIAL messages.  Would you mind splitting the strict proposal into its own document?  The strict field makes more sense to me in the context of DIFFERENTIAL messages, but not in the context of FULL_DATASET.

Sean

Andrew Byrd

Feb 2, 2018, 2:31:12 PM
to GTFS-realtime
On Friday, February 2, 2018 at 3:58:41 PM UTC+1, Sean Barbeau wrote:
> As you imply, I see this strictly (pun intended ;) ) as a validation issue.  I'm happy to add a warning to the gtfs-realtime-validator to capture this.  I don't see this by itself being a reason to add a "strict" flag.

Yes, it is essentially a validation issue. The flag would exist to declare the expected contents (completeness) of the feed so consumers and validators would know what to expect. It would be a bit strange for validators or consumers to produce large numbers of warnings on input data that was missing most of the stops on every trip, because that’s valid data. It’s only invalid data from the point of view of certain more demanding consumers / systems.
 
> This is great in theory, but I think it will be very challenging to outline this in a specification so it's interpreted the same by all producers and consumers.  What's a "best effort at actual arrival predictions"?  What's a "suitable algorithm"?  What's a "best available statistical method"?

Point taken: these terms are difficult to define, so any producer can claim "state-of-the-art high quality". We can drop this vague semantic part. It’s simply about declaring the expected contents of the feed for validation and clarity.
 
> Thanks for providing this - it looks like it was part of a proposal for DIFFERENTIAL messages.

True, we were originally working on these two ideas together at the same time but they are separable. 
 
> Would you mind splitting the strict proposal into its own document?  The strict field makes more sense to me in context of DIFFERENTIAL messages, but not in context of FULL_DATASET.

If you find that strict mode makes more sense in conjunction with differential updates, but less sense as a standalone proposal, then why would you like to separate out the descriptions of the two ideas? Also, if strict mode makes sense with differential updates, then it would be useful for many large GTFS-RT deployments in general: I believe the Netherlands, Finland, Norway, and maybe Paris are all using differential message-based real-time data. Given widespread usage, we need to get the differential update semantics clarified in the spec.

Regards,
Andrew

Sean Barbeau

Feb 5, 2018, 11:03:50 AM
to GTFS-realtime
> It would be a bit strange for validators or consumers to produce large numbers of warnings on input data that was missing most of the stops on every trip, because that’s valid data. It’s only invalid data from the point of view of certain more demanding consumers / systems.

I view this as similar to missing timestamps for VehiclePositions.  They really should be included, but some systems can't.  It's still valid data if the timestamp is missing, but we still throw a warning for it in the validator.

Similarly, it would be great if transit agencies could provide high-quality stop-level predictions, but many still can't.  Granted, the scope of adding high quality stop-level predictions is very different than a vehicle timestamp.

So, I view this as a tooling issue - we should be able to filter out warnings in validation tools that we're not interested in for whatever reason (e.g., hardware/software doesn't support it, etc).  However, the warning should still exist to enforce that this is desired behavior.

That all being said, if others agree that the "strict" field would be useful and should be adopted, I wouldn't block a proposal for adopting it.

> If you find that strict mode makes more sense in conjunction with differential updates, but less sense as a standalone proposal, then why would you like to separate out the descriptions of the two ideas? Also, if strict mode makes sense with differential updates, then it would be useful for many large GTFS-RT deployments in general: I believe the Netherlands, Finland, Norway, and maybe Paris are all using differential message-based real-time data. Given widespread usage, we need to get the differential update semantics clarified in the spec.

I only suggested separating them because this thread was dedicated to the proposal for the "strict" flag but not differential updates, which is a much larger proposal to tackle.  I totally agree - we need to define DIFFERENTIAL messages and behavior - I actually added the link to the document to the Github issue I opened for DIFFERENTIAL messages at https://github.com/google/transit/issues/84.

So I'm personally open to either approach - you could draft a stand-alone "strict" proposal, or try to tackle the larger DIFFERENTIAL messages proposal.

I'd like to hear from others with their thoughts on this too.

Sean

Paul Harrington

Feb 7, 2018, 6:21:05 PM
to GTFS-realtime
For me the crux of this proposal is:

"When a field named “strict” exists in the FeedHeader message, each TripUpdate message shall include a StopTimeUpdate message for every planned or already visited stop on the corresponding trip. Each of these StopTimeUpdate messages shall include both an arrival and departure StopTimeEvent."

Worded like this, I think it would be a great proposal and I would be in favour of it. Here is why:

My JVM could be running for a few weeks and then suddenly crash with an out-of-memory error. It would have had plenty of memory for a long time, but then something happened to cause resource issues. User-triggered usage is consistent (and low) and I don't think this is the problem. I've noticed that a JVM consumes 5%-7% of the CPU (even without external traffic) and most of this is almost certainly down to processing of TripUpdates.

I've reexamined my TripUpdate implementation, and even if a producer does provide StopTimeUpdates for all stops on a trip, I don't know that (they haven't told me, so I can't assume so).  If sequence numbers are contiguous from the start sequence (e.g. 10, 11, 12, etc.) all the way up to the last stop sequence number, then I don't have issues processing. But let's say they use sequence numbers 10, 20, 30, etc. When I process the StopTimeUpdate with a sequence number of 20, I've noted that the last one was 10. So for sequence numbers 11..19 I have to make a call to the database with the sequence number and trip id to see if it is a valid stop, and return the stop id if it is. So depending on how the feed is produced, you can see how this quickly becomes computationally expensive if there are a lot of trips with large gaps between the sequence numbers. When they are contiguous, no database call is necessary. And if there were gaps but Andrew's strict flag was set to true, then I wouldn't go looking for the missing sequence numbers, as I would know they don't exist. Okay, bad implementation you may say, but that is how it evolved. It would undoubtedly be better to get all stop sequences from the stop_times database table using the trip id and check against this for missing stops, but it is still a database call.
And then when you get to the last StopTimeUpdate, how do you know if it is the last stop or not? You don't, so you have to find the max sequence number for that trip and compare. In strict mode this would not be necessary.
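
The gap-checking cost described above can be sketched roughly as follows. This is only an illustration in Python, not the JVM implementation being discussed: the function name `sequences_to_look_up` is hypothetical, and each returned value stands for one stop_sequence that would need a database check against stop_times. The last-stop check is left out to keep the sketch short.

```python
from typing import List


def sequences_to_look_up(seen: List[int], strict: bool = False) -> List[int]:
    """List the stop_sequence values a consumer would have to check against stop_times."""
    if strict:
        # A strict feed promises that every stop on the trip is present,
        # so gaps in the numbering cannot hide missing stops.
        return []
    missing = []
    for prev, cur in zip(seen, seen[1:]):
        # Every number strictly between two consecutive updates is a candidate stop.
        missing.extend(range(prev + 1, cur))
    return missing
```

With updates numbered 10, 20, 30 the non-strict path yields 18 candidate sequence numbers (11..19 and 21..29), while contiguous numbering or a strict feed yields none.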

Additionally, if the stop you are propagating from does not contain a delay, then you have to make a DB call to get the scheduled time, work out the delta between it and the estimated time, and use this for propagation purposes.

The algorithm also needs to know not to propagate from a skipped stop, so these StopTimeUpdates are not remembered for propagation purposes. Some of you may recall there was a previous discussion on whether or not to propagate skipped stops.
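
The last two points (deriving a delay from the scheduled time, and not propagating from skipped stops) might look something like this in outline. It is a Python sketch under assumed names: `Observation` and `delay_to_propagate` are hypothetical, and `scheduled_time` stands in for the value that would come from the database lookup.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    # Hypothetical per-stop record combining static and realtime data.
    scheduled_time: int                   # from static stop_times, POSIX seconds
    estimated_time: Optional[int] = None  # absolute time from the feed, POSIX seconds
    delay: Optional[int] = None           # delay from the feed, seconds
    skipped: bool = False


def delay_to_propagate(obs: Observation, previous_delay: int) -> int:
    """Pick the delay to carry forward to later stops."""
    if obs.skipped:
        # Never propagate from a skipped stop; keep what we had before.
        return previous_delay
    if obs.delay is not None:
        return obs.delay
    if obs.estimated_time is not None:
        # No explicit delay: derive it from the scheduled time.
        return obs.estimated_time - obs.scheduled_time
    return previous_delay
```

In strict mode none of this would be needed, since every stop would already carry its own arrival and departure prediction.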

The main point is that in strict mode you don't have to worry about the complexities of propagation and code can be far simpler. I will still keep the code I have, but strict allows most of the complex stuff to be bypassed.

Hope this helps. "Strict" is probably something I will implement through a per-feed configuration item anyway, but then I have to gauge by examining a feed whether or not I think it is "strict". It would be more straightforward to have the feed itself tell me.




Paul Harrington

Feb 8, 2018, 8:07:34 AM
to GTFS-realtime
If you wanted a stricter version of strict (one which would make parsing even more straightforward) you could word it as

"When a field named “strict” exists in the FeedHeader message, each TripUpdate message shall include a StopTimeUpdate message for every planned or already visited stop on the corresponding trip. Each of these StopTimeUpdate messages shall include both an arrival and departure StopTimeEvent.  The StopTimeUpdate message will always contain both the stop_id and stop_sequence and the TripUpdate message will always contain a timestamp"

Andrew Byrd

Feb 8, 2018, 1:47:06 PM
to gtfs-r...@googlegroups.com

I agree we would want these additional restrictions in “strict” mode to very unambiguously tie the message to the static data source.

I would say though:
"When a boolean field named “strict” exists in the FeedHeader message and its value is true". Presence of the field is not sufficient, as it could be false. The default value should be false for backward compatibility.

Andrew
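
The distinction between "field present" and "field present and true" can be shown in a few lines. This Python sketch models the FeedHeader as a plain dict purely for illustration. The helper name `is_strict` is hypothetical, and `strict` is the proposed field rather than an existing one.

```python
def is_strict(feed_header: dict) -> bool:
    """Strict mode applies only when the field is present and its value is true."""
    # An absent field falls back to False, matching the backward-compatible default.
    return feed_header.get("strict", False) is True
```

A header without the field and a header with `strict` set to false are treated the same way, so existing feeds keep their current, less strict interpretation.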
