A question to the GTFS community


Michael Forte

Nov 30, 2015, 1:24:37 PM
to Transit Developers, Muhammad Umair Akhtar
Over the past two years, I have been working with GTFS data and have noticed that the data quality ranges from poor to excellent. I'm wondering if there is something that Google isn't clear about in the spec for the data tables that is causing these problems. Also, if there are other common problems people are seeing, I would love to capture them all in one place so they can be addressed.

There are holes in the data and sometimes the data is just wrong. For example, a fairly common one I'm seeing is that adjacent stops for a route may have the same arrival and departure times which is just impossible. This really screws up the view in Google Maps when trying to display it.

So you have a reference point to use, you can see this for TheBus agency in Hawaii.
Vehicle 103 (5587795.1031.847835)
Look at stops 4 and 5 that both have arrivals at 05:47 and departures at 05:48

Here is another example for the Adelaide Metro in Australia.
Vehicle 118 (159606)
Look at stops 3 and 4 that both have identical arrival and departure times at 05:52

Are there any moderator quality checks needed by the GTFS community to say the data can be posted or is there some inherent mistake in the definitions of what these data fields are supposed to contain?

I would just like to hear from everyone since I've been pulling my hair out over this and I really should only be worrying about creating the end user app and not the data quality going into it.

Thanks for any input.

Michael Forte
Forte Web Properties, Inc.

Brian Ferris

Nov 30, 2015, 3:36:20 PM
to Transit Developers, Muhammad Umair Akhtar
Are we talking GTFS or GTFS-realtime here?

For GTFS, I think some agencies (and the scheduling software they use) don't have the ability and/or desire to schedule their arrivals and departures to the exact second. Instead, their times are rounded to the minute. For close consecutive arrivals/departures, this can result in sequential stop-times with the same times.
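
A rough sketch of how a consumer might detect this case, assuming a small, made-up excerpt of stop_times.txt (the trip and stop IDs below are hypothetical, not taken from the feeds mentioned above). It reports neighbouring stops on the same trip whose arrival and departure times are identical:

```python
import csv
import io

# Hypothetical excerpt of stop_times.txt with times rounded to the minute.
SAMPLE = """trip_id,arrival_time,departure_time,stop_id,stop_sequence
t1,05:45:00,05:45:00,A,1
t1,05:47:00,05:48:00,B,2
t1,05:47:00,05:48:00,C,3
t1,05:50:00,05:50:00,D,4
"""

def find_repeated_times(stop_times_csv):
    """Return (trip_id, previous_stop_id, stop_id) for consecutive stops with identical times."""
    rows = list(csv.DictReader(io.StringIO(stop_times_csv)))
    # Compare neighbours in travel order: sort by trip, then numeric stop_sequence.
    rows.sort(key=lambda r: (r["trip_id"], int(r["stop_sequence"])))
    repeats = []
    for prev, cur in zip(rows, rows[1:]):
        same_trip = prev["trip_id"] == cur["trip_id"]
        same_times = (prev["arrival_time"], prev["departure_time"]) == (
            cur["arrival_time"], cur["departure_time"])
        # Skip rows with empty times; those are untimed stops, not duplicates.
        if same_trip and same_times and cur["arrival_time"]:
            repeats.append((cur["trip_id"], prev["stop_id"], cur["stop_id"]))
    return repeats

print(find_repeated_times(SAMPLE))  # [('t1', 'B', 'C')]
```

Whether such a pair counts as an error or just as minute-level rounding is a judgment call for the consuming application.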


--
You received this message because you are subscribed to the Google Groups "Transit Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to transit-develop...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

umair Akhtar

Nov 30, 2015, 3:40:45 PM
to Brian Ferris, Transit Developers
I am talking about GTFS. Oh, I see now: the times are rounded to the minute.
Then let me ask you this: for some stops the difference is within a minute, so they round to the same time, but what about the other stop times? Don't you think that is simply wrong time data?

Aaron Antrim

Nov 30, 2015, 4:14:31 PM
to transit-d...@googlegroups.com
Hello Michael and Transit Developers:

Your question suggests three important needs to me.

1.) We need a best practices or style guide for GTFS, as there are a great many cases where the Spec leaves a lot of freedom and room for interpretation in choices that will have consequences for consuming applications. I am unsure of the best home or process for building a best practices guide.

2.) Tools that help us to discuss GTFS datasets and their particular components will be very helpful for facilitating high-quality communication clarifying best practices around Spec use. One such tool is transitfeeds.com. I find it very useful to quickly point to a data element in a public GTFS dataset in conversation with other developers (e.g. http://transitfeeds.com/p/thebus-honolulu/57/latest). Transit Land (https://transit.land) aims to provide another such tool.

3.) We need more mechanisms for conversation between feed publishers and feed consumers. One such mechanism is the pair of proposed new fields in feed_info.txt, feed_contact_email and feed_contact_url: https://groups.google.com/forum/#!topic/gtfs-changes/iJkdVB1DQnM

--
Aaron Antrim, Principal
Trillium Solutions, Inc.
www.trilliumtransit.com
Portland, Oregon
503.567.8422 ext. 3

Aaron Bannert

Nov 30, 2015, 4:25:00 PM
to transit-d...@googlegroups.com
+1 on all 3 points!

It would be great to have a one-stop-shop for GTFS dataset authors to get their questions answered. This could be a simple FAQ-style wiki (does this group have an official website?). We could answer common edge case questions (for example: how to handle loops, how to handle first and last trips of the day, title conventions, etc…). We might also consider creating a Stack Overflow community for a more interactive and persistent (and searchable) Question-and-Answer forum.

-aaron


Brian Ferris

Nov 30, 2015, 4:48:37 PM
to transit-d...@googlegroups.com
1) Agreed, a Style guide would be useful.  We even tried making this happen (witness the beyond-sparse GTFS Style Guide) but it turns out that getting everyone to agree on best practices is easier said than done.  I ran out of cycles to work on this myself, but I think having someone drive the consensus building process is going to be unavoidable no matter where the resulting document lands.

2) More tools are certainly good.  Getting them all to agree on what exactly is "valid" GTFS will always be a fun task ;)

3) I always thought of transit-developers and gtfs-changes as good places for these kinds of conversations (hey, look at this conversation we are having now) but I'll be the first to admit that searching through the 8 years (!) of gtfs-changes history is maybe not the best way to do things.

Towards that problem, we (big hat-tip to Sean Barbeau) actually tried creating a Stack Overflow site for transportation developers: https://groups.google.com/d/msg/transit-developers/GSpplsQ9Xcw/Rc6l-uffBIwJ

Unfortunately, we weren't able to demonstrate enough usage to get out of the StackExchange incubator.  Maybe that's changed in the past two years?

Ritesh Warade

Nov 30, 2015, 4:55:30 PM
to transit-d...@googlegroups.com

+1 on all 3 from me, especially on point #1.

 

A best practices or style guide is much needed. Having worked with GTFS data of varying ‘quality’ from a number of agencies, I have encountered the need for this many times, and so am volunteering to help.

 

---

Ritesh Warade

Associate Director, IBI Group

617.699.9544 | ritesh...@ibigroup.com

Ritesh Warade

Nov 30, 2015, 5:14:22 PM
to transit-d...@googlegroups.com

I set up a google doc a while ago to collect thoughts regarding what might go into a GTFS best practices guide and shared with a few people. Here it is: https://docs.google.com/document/d/1FeAJNDs-1EdzcQq_daq8_uR0KIug6tzKDxdPxSdi8L4/edit?usp=sharing

 

Please feel free to use this however you see fit.

 

Regards,

 

Ritesh

 

---

Ritesh Warade

Associate Director, IBI Group

617.699.9544 | ritesh...@ibigroup.com

 


Sean Barbeau

Dec 1, 2015, 10:14:08 AM
to Transit Developers
Agreed on all three points made by Aaron.


> We even tried making this happen (witness the beyond-sparse GTFS Style Guide) but it turns out that getting everyone to agree on best practices is easier said than done.

Brian - would we be able to move this GTFS Style Guide to Github, similar to how the GTFS-rt spec was moved there?

We might be able to use the same process with contributions/discussions happening via issues/pull requests to help flesh this out a bit.

Sean


Brian Ferris

Dec 1, 2015, 11:45:45 AM
to Transit Developers
Again, I don't think the communication medium is the hard part here.  That said, I don't have any problem using GitHub if we think it'd help.  I'd only add that I would prefer the published Style Guide to live with the published spec, no matter the underlying update mechanism.


Joa

Dec 2, 2015, 9:52:15 PM
to Transit Developers
The inconsistencies that show up in GTFS feeds are not rooted in GTFS or in a lack of best practices there. As Brian points out, feeds are often exported from scheduling systems. They represent the scheduling practices of the operator:
- Scheduled times for each stop, which works nicely with the GTFS definition of stop times, or
- Consistent scheduled times for rail and BRT only; rubber-tired services are often scheduled by timing point only. To complicate matters, timing points (also called timepoints) may not even be stops. In operational reality, they instruct the driver not to pass the timing point ahead of time. Everything in between is up to the driver's discretion. That explains cases where we see stop times that do not increment from stop to stop: there might be more stops between two timing points than there are minutes of scheduled time between them, at one-minute resolution.
Defining best practices here might be an exercise in futility; the data upstream might not be there in the first place, and it can take substantial effort to change scheduling and planning practices to accommodate GTFS style representation of stop times. (Don't hold your breath)
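
To make the arithmetic concrete, here is a small sketch. It assumes (purely for illustration) that an export tool spaces the intermediate stops evenly between two timing points and then truncates to whole minutes; the times and stop count are invented:

```python
def minute_times(start_min, end_min, n_intermediate):
    """Evenly space stops between two timing points, then truncate to whole minutes.

    start_min and end_min are minutes after midnight for the two timing points.
    """
    n_gaps = n_intermediate + 1
    step = (end_min - start_min) / n_gaps
    return [int(start_min + i * step) for i in range(n_gaps + 1)]

def fmt(minutes):
    """Format minutes after midnight as HH:MM."""
    return f"{minutes // 60:02d}:{minutes % 60:02d}"

# Timing points at 05:47 and 05:50 with five untimed stops in between:
# seven stops have to share four distinct minute values.
times = [fmt(t) for t in minute_times(5 * 60 + 47, 5 * 60 + 50, 5)]
print(times)
# ['05:47', '05:47', '05:48', '05:48', '05:49', '05:49', '05:50']
```
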
JP




Sean Barbeau

Dec 3, 2015, 10:01:24 AM
to Transit Developers
> Defining best practices here might be an exercise in futility;

Possible, but I don't think that should prevent us from trying.  The problem I often encounter is the scheduling/AVL software vendor doing a least-effort dump of their data to GTFS.  Currently the agency has little ammunition to request/require that things be done differently, as technically the vendor is meeting the GTFS spec.  If there were a best practices document, it would give the agency some evidence that the majority of the community sees things differently from the vendor's interpretation.

Sean
 

Ritesh Warade

Dec 3, 2015, 10:12:02 AM
to transit-d...@googlegroups.com

+1 for Sean’s point

And the same is true for GTFS-realtime

 

---

Ritesh Warade

 


Joa

Dec 4, 2015, 9:48:32 AM
to Transit Developers
Vendors. Depends. There is one that I remember who used to see GTFS exports, or any data exports for that matter, as adversarial to their business. So they expressed that in their t&c's and agencies unwittingly just passed that through to signing on the dotted line. In such cases, a GTFS styleguide doesn't give any leverage to the agency and by extension, the dev community. Again, this has nothing to do with GTFS itself.
To me this remains a matter of taking care of the issue up front, by properly defining ownership of data and requirements for data exports in the procurement contract at the agency's end (i.e. the RFP).
Example (contract dumpster deep dive): Look up SFMTA's radio system replacement contract (http://archives.sfmta.com/cms/cmta/documents/4-17-12Item11Radiocontractenclosure.pdf), search for "OWNERSHIP OF DATA" and "Reporting, Archival Data and Support for Queries to Real-time Data".
(Note: The contractor or their team awarded in this case is not that "one")

Jordan Espejo

Dec 4, 2015, 12:37:01 PM
to Transit Developers
This is my exact sentiment. Many times I have asked our vendor to fix discrepancies in both GTFS and GTFS-realtime, but they choose not to comply, which leaves me to fix them manually.

Jordan
SJRTD


On Thursday, December 3, 2015 at 7:01:24 AM UTC-8, Sean Barbeau wrote:

Quentin Zervaas

Dec 7, 2015, 5:48:08 PM
to Transit Developers
I have an experimental feature on transitfeeds.com where it is possible to automate the normalization of feeds.

I only have it running on the SFMTA feed for now:


You can see two versions: their original, and a normalized version. The normalized version is generated by running the original through a transformation pipeline. (As I mentioned, this is currently experimental so the transformation isn't 100% working properly at the moment)

For this particular feed, the normalizing chain looks as follows:


My biggest problem with this though is the enormous amount of effort that would be required to build a list of these rules for all feeds that require them.

Having said that, a system something like this that is made public could help to alleviate the errors / inconsistencies that appear in many feeds

Cheers

Quentin

sabre23t

Dec 7, 2015, 10:40:49 PM
to Transit Developers
On Tuesday, 1 December 2015 05:48:37 UTC+8, Brian Ferris wrote:
> Towards that problem, we (big hat-tip to Sean Barbeau) actually tried creating a Stack Overflow site for transportation developers: https://groups.google.com/d/msg/transit-developers/GSpplsQ9Xcw/Rc6l-uffBIwJ
> Unfortunately, we weren't able to demonstrate enough usage to get out of the StackExchange incubator.  Maybe that's changed in the past two years?

I just had a look over on StackExchange network of sites for transportation related topics and found ...

(1) http://travel.stackexchange.com/ with questions tagged as public-transport, buses, or trains may be best for transportation users type of Q/A. There are hundreds of these questions there.

(2) http://engineering.stackexchange.com/ with questions tagged as transportation may be best for transportation developers type of Q/A. Unfortunately there are fewer than 10 transportation questions there. We could add more. :-)

regards,
sabre23t

sjba...@gmail.com

Dec 8, 2015, 9:40:06 AM
to Transit Developers
Thinking more about the GTFS Style Guide: what if we make the criteria for accepting new entries or modifying existing ones in the GTFS Style Guide a lower threshold than the one for accepting additions to the GTFS spec itself?  For example, a 2/3 majority vote (with perhaps a minimum number of votes?) would be enough to modify the official GTFS Style Guide.

I think this would help give some type of official guidance for the corner cases that people encounter, without the overhead required to modify the spec itself.  The GTFS Style Guide would not be allowed to conflict with the GTFS spec, only provide clarity and/or best practices that aren't specifically outlined in the spec.

Sean

Aaron Antrim

Dec 14, 2015, 3:17:22 PM
to Transit Developers, sjba...@gmail.com
I think you're right that recommendations in the style or best practices guide need a lower barrier to entry than changes to the Specification. Entries in the style guide would largely be driven by the needs of consuming applications, and needs will be particular to particular application types. For example, AVL or arrival estimates systems will have different requirements for block_id than trip planners: trip planning software just benefits from block_id where in-seat transfers are available to customers, while most arrival estimates systems need or benefit from block_id for every record in trips.txt.

Aaron Antrim

Dec 14, 2015, 3:28:59 PM
to Transit Developers
Hello all,

Building off of some of the communication in this thread, I attempted to propose a response to various needs for a style guide, reliable system for disseminating and publishing GTFS datasets, and overall Spec management and communication in a blog post here:

My aim was to document impressions and observations, as well as make proposals for a way forward. I welcome responses and comments!

Vasile Cotovanu

Dec 14, 2015, 4:58:53 PM
to Transit Developers
Hello all,

Speaking of tools, you can also check https://github.com/vasile/GTFS-viz
It's a Ruby script; in a few minutes you get GeoJSON and KML files that let you quickly visualize your dataset, or even a SQLite DB that you can plug into the https://github.com/vasile/transit-map web app.

Vasile

Sean Barbeau

Dec 16, 2015, 9:32:11 AM
to Transit Developers
Aaron,
Nice post!  I agree that these are the critical issues facing the GTFS community today that all deserve further discussion.  Maybe it would be worth posting each separately in new threads here (and maybe after the holidays ;) ) to see if we can consolidate some opinions among the community on paths forward.

Sean

Joris Wu

Dec 29, 2015, 4:44:36 PM
to Transit Developers, umair...@ymail.com
I have been integrating around 700 feeds from all over the world, and see quite a handful having quirks or errors.

One common mistake is wrong coordinates, such as a missing minus sign or, in the case of SNCF France, many coordinates being thousands of km off.

Some do not get the quoting right (doubling quotes within a quoted field), or leave whitespace outside the quotes of a quoted field.
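
For reference, a short sketch of what correct quoting looks like, using Python's standard csv module as one example of a conforming writer and reader. The stop name is invented for illustration:

```python
import csv
import io

# Write a field containing both a comma and double quotes.
buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["stop_id", "stop_name"])
writer.writerow(["S1", 'Main St "Park & Ride", Bay 2'])
text = buf.getvalue()
print(text)
# stop_id,stop_name
# S1,"Main St ""Park & Ride"", Bay 2"

# Reading it back recovers the original value.
rows = list(csv.reader(io.StringIO(text)))
assert rows[1][1] == 'Main St "Park & Ride", Bay 2'

# Whitespace before the opening quote changes the meaning: with default
# settings the quotes are then treated as literal characters of the field.
bad = list(csv.reader(io.StringIO('S1, "Main St"\n')))
print(bad)  # [['S1', ' "Main St"']]
```
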

Some leave many rows in stop_times without time.

Many use all-capitals for names, which is not presentable. Some use numbers for names, some use extensive comments in names. To summarize, many use name fields in a way that they are not presentable as such to end users.

It seems to me that some feeds list trips from an operator point of view, resulting in suggested transfers where passengers perceive a continuous connection. The block_id is often either not provided or not meant to be used to reconstruct the routes from a passenger point of view (which is complicated anyhow). GTFS does not have a well-defined way to define two routes with a single vehicle serving both, flipping headsigns and letting passengers stay on board.

The biggest issue from a transit app point of view is that many feeds are outdated at the time of publishing. I do consider this an error, as you do not know the schedule past the published period. Some are years old. Some publish feeds with a validity of only a few weeks. But it seems most just forget to update them in time. Some appear to have published a feed once, then abandoned the project.

Lastly, there is no authoritative list of current, official feeds with current URLs. I know we have gtfs-data-exchange, transitfeeds.com and several others, but none is complete, authoritative and up to date. I found many feeds not on these lists by organic searching. The latest addition, which I mentioned to the transitfeeds maintainer, was Athens in Greece. Trilliumtransit houses many feeds, yet is not willing to provide a directory.

Regarding best practices :

I would recommend strongly preferring calendar.txt and using calendar_dates.txt only for exceptions, especially for larger feeds, as the file size will otherwise become very large. Needless to say, 700 combined feeds will become huge even without such an issue.

I highly recommend having an agency_id field, and it would be best if there is a worldwide list to make them unique.

Recommendation for the GTFS format itself :

As hinted, I recommend making feed_info.txt mandatory.

For international users, I would suggest adding an 'international_name' column for stop and route names in e.g. English, leaving the 'stop_name' and 'route_name' columns for local language / script.

To support proper integration / aggregation, it would be best if agency_id fields could be made unique.

Joris

Joris Wu

Dec 29, 2015, 9:06:32 PM
to Transit Developers, umair...@ymail.com
One more issue that I encounter in some feeds :

An individual trip as seen by a passenger is sometimes represented as a route. So when a PDF timetable shows a single route 'line 3B from Manukau to Auckland Airport' with a handful of individual trips, the GTFS contains a separate route per trip. I expect the passenger view of a route is here not maintained. That in turn is a weakness in the GTFS spec in that it does not spell out whether a concept is meant from a passenger point of view or not.

Joris

Andrew Byrd

Dec 30, 2015, 5:05:01 AM
to transit-d...@googlegroups.com, umair...@ymail.com

> On 29 Dec 2015, at 22:44, Joris Wu <jor...@gmail.com> wrote:
> I have been integrating around 700 feeds from all over the world, and see quite a handful having quirks or errors.

Hi Joris,

You raise a lot of valid points and it is helpful to have lists of common errors when working on GTFS quality control software. These problems should be checked by any GTFS validator or feed cataloging / integration system.

I know all these problems can be frustrating for a data consumer (I am one) but realistically, since GTFS feeds are sourced from a huge number of organizations with varying technical expertise (or none at all) we will continue to get problematic feeds for the foreseeable future. One important expectation behind GTFS is that the consumer does more work than the producer, which encourages the producers to participate in sharing data with us.

In my view, the only thing that can really be done is automated quality checks providing clear, useful reports to the feed creator. We need a new generation of feed validation software that is more sophisticated and can even check for signs of problems across different versions of the same feed. Of course, even if we can detect and very clearly report on these errors many feed producers will have no idea how to fix them or no means to do so. As an analogy, imagine someone tells you your word processor is exporting invalid PDF files and they can’t read them. You may not be able to do anything about it but beg the word processor people for a bug fix. But still, flagging the errors is an essential first step.

More comments on your specific categories of errors in-line below.

> One common mistake is wrong coordinates, such as a missing minus sign or, in the case of SNCF France, many coordinates being thousands of km off.

This is often due to stops missing positions, which are automatically assigned the position (0, 0) in some coordinate system, which is then de-projected to geographic coordinates. You end up with stops at the origin point of the coordinate system. For French data this yields stops in northern Africa, and for Dutch data it’s near Euro-Disney in France.
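
A validator could catch both cases (a dropped minus sign and stops placed at a projection origin) with a plain bounding-box test. The sketch below is only an illustration: the function name, the labels it returns, and the rough Honolulu bounding box are assumptions, and a real tool would more likely derive the expected area from the feed itself.

```python
def check_stop(lat, lon, bbox):
    """Classify a stop position against an expected service area.

    bbox is (min_lat, min_lon, max_lat, max_lon) in decimal degrees.
    """
    min_lat, min_lon, max_lat, max_lon = bbox

    def inside(a, b):
        return min_lat <= a <= max_lat and min_lon <= b <= max_lon

    if inside(lat, lon):
        return "ok"
    # If flipping a sign moves the stop into the area, a minus sign was probably lost.
    if inside(-lat, lon) or inside(lat, -lon) or inside(-lat, -lon):
        return "possible sign error"
    return "outside service area"

# Rough, hand-picked box around Honolulu (an assumption for this example).
HONOLULU = (21.2, -158.3, 21.8, -157.6)

print(check_stop(21.30, -157.85, HONOLULU))  # ok
print(check_stop(21.30, 157.85, HONOLULU))   # possible sign error
print(check_stop(48.85, 2.35, HONOLULU))     # outside service area
```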

> Some leave many rows in stop_times without time.

This is allowed. Many transit operators only specify arrival and departure times at “timepoints”. According to the spec, all missing times are to be interpolated by the consumer.
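
As a minimal sketch of what that interpolation can look like on the consumer side: the version below spaces untimed stops evenly in time between the surrounding timepoints. Even spacing is an assumption made here for brevity; a consumer could equally weight by distance along the route. Empty strings stand in for missing times.

```python
def to_seconds(hms):
    """Convert HH:MM:SS to seconds; hours may exceed 23 for trips running past midnight."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s

def to_hms(sec):
    return f"{sec // 3600:02d}:{sec % 3600 // 60:02d}:{sec % 60:02d}"

def interpolate(times):
    """Fill empty entries that lie between two known times, spacing them evenly."""
    filled = list(times)
    known = [i for i, t in enumerate(times) if t]
    for a, b in zip(known, known[1:]):
        start, end = to_seconds(times[a]), to_seconds(times[b])
        for i in range(a + 1, b):
            frac = (i - a) / (b - a)
            filled[i] = to_hms(round(start + frac * (end - start)))
    return filled

print(interpolate(["08:00:00", "", "", "08:06:00", "", "08:10:00"]))
# ['08:00:00', '08:02:00', '08:04:00', '08:06:00', '08:08:00', '08:10:00']
```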

> It seems to me that some feeds list trips from an operator point of view, resulting in suggested transfers where passengers perceive a continuous connection. The block_id is often either not provided or not meant to be used to reconstruct the routes from a passenger point of view (which is complicated anyhow). GTFS does not have a well-defined way to define two routes with a single vehicle serving both, flipping headsigns and letting passengers stay on board.

There is in fact a standard way to represent this, and you mentioned it: the block ID.
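
A rough sketch of how a consumer might use it, assuming trips have already been joined with their first departure and last arrival (the tuples and IDs below are invented). Trips sharing a block_id are ordered by departure, and each consecutive pair is treated as a candidate for a stay-on-board continuation. A real implementation would also need to check that both trips run on the same service day.

```python
from collections import defaultdict

# Hypothetical trips: (trip_id, route_id, block_id, first_departure, last_arrival)
TRIPS = [
    ("t1", "R10", "b1", "08:00:00", "08:30:00"),
    ("t2", "R20", "b1", "08:32:00", "09:00:00"),
    ("t3", "R10", "b2", "08:15:00", "08:45:00"),
    ("t4", "R30", "",   "08:40:00", "09:10:00"),  # no block_id: nothing to link
]

def in_seat_continuations(trips):
    """Return (trip_id, next_trip_id) pairs for consecutive trips in the same block."""
    by_block = defaultdict(list)
    for trip in trips:
        if trip[2]:  # ignore trips without a block_id
            by_block[trip[2]].append(trip)
    pairs = []
    for block_trips in by_block.values():
        # Zero-padded HH:MM:SS strings sort correctly as text.
        block_trips.sort(key=lambda t: t[3])
        for prev, nxt in zip(block_trips, block_trips[1:]):
            pairs.append((prev[0], nxt[0]))
    return pairs

print(in_seat_continuations(TRIPS))  # [('t1', 't2')]
```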

> The biggest issue from a transit app point of view is that many feeds are outdated at the time of publishing. I do consider this an error, as you do not know the schedule past the published period. Some are years old. Some publish feeds with a validity of only a few weeks. But it seems most just forget to update them in time. Some appear to have published a feed once, then abandoned the project.

This is more of an organizational or political problem than a problem with GTFS. Feeds are often published once to demonstrate “openness” at a PR event, then poorly maintained or forgotten. Often no one has the job of maintaining the GTFS feed and it’s done as a side-job by someone with other obligations. Not much can be done about this in a systematic way, other than GTFS cataloging and integration software that watches out for such problems.

Companies like Google certainly take this step seriously, but that’s a major component of the quality of their applications and we can’t reasonably expect them to share their cleaned data. Other systems of this kind are emerging, but it will be at least a year or so until they are really in place.
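
The simplest such watchdog check can be sketched in a few lines. This version only looks at end_date in calendar.txt; it ignores calendar_dates.txt and any dates in feed_info.txt, which a real cataloging system would also have to consider. The sample rows are made up.

```python
import csv
import io
from datetime import date, datetime

# Hypothetical calendar.txt contents.
CALENDAR = """service_id,monday,tuesday,wednesday,thursday,friday,saturday,sunday,start_date,end_date
WKD,1,1,1,1,1,0,0,20150901,20151215
SAT,0,0,0,0,0,1,0,20150901,20151130
"""

def last_service_date(calendar_csv):
    """Return the latest end_date found in calendar.txt."""
    rows = csv.DictReader(io.StringIO(calendar_csv))
    return max(datetime.strptime(r["end_date"], "%Y%m%d").date() for r in rows)

def days_remaining(calendar_csv, today):
    """Days of scheduled service left; a negative value means the feed has expired."""
    return (last_service_date(calendar_csv) - today).days

print(days_remaining(CALENDAR, date(2015, 12, 29)))  # -14
```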

> Lastly, there is no authoritative list of current, official feeds with current URLs. I know we have gtfs-data-exchange, transitfeeds.com and several others, but none is complete, authoritative and up to date. I found many feeds not on these lists by organic searching. The latest addition, which I mentioned to the transitfeeds maintainer, was Athens in Greece. Trilliumtransit houses many feeds, yet is not willing to provide a directory.

There is no centralized quality, integration and cataloging effort for GTFS feeds. Again, this is a widely recognized need and efforts are made to develop such systems, but the main cost and difficulty here is specialized human labor. Someone needs to pay people to develop and maintain such a system and its content. There are non-negligible ongoing maintenance and management costs. Commercial entities do the work, but they don’t usually want to share their added value with their competitors. It could be done on a volunteer basis, as some have already done, but based on your comments we can conclude that hasn't yielded a satisfactory result yet. An industry consortium or public sector entity would need to step up and run such a system.

> I would highly recommend using calendar.txt, with calendar_dates.txt only for exceptions, especially for larger feeds, as the file size will otherwise become very large. Needless to say, 700 combined feeds will become huge even without such an issue.

Many producers specify calendars using only specific dates and not calendar.txt, including some of the originators of the GTFS format. It’s unlikely to change, as it’s related to how their scheduling software represents their services internally. These calendar files contribute very little to the overall size of the feed so in my experience this is not a problem at all.
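For consumers, the practical consequence is that both files have to be read together, and a service_id may appear in only one of them. A minimal Python sketch of that expansion, assuming the two files are already loaded as strings (the function name is invented for this example):

```python
import csv
import io
from datetime import datetime, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def _parse(yyyymmdd):
    # GTFS dates are written as YYYYMMDD.
    return datetime.strptime(yyyymmdd, "%Y%m%d").date()


def service_dates(calendar_txt, calendar_dates_txt):
    """Map each service_id to the set of dates on which it runs."""
    active = {}
    # Regular weekly patterns from calendar.txt (may be empty or absent).
    for row in csv.DictReader(io.StringIO(calendar_txt)):
        days = active.setdefault(row["service_id"], set())
        d, end = _parse(row["start_date"]), _parse(row["end_date"])
        while d <= end:
            if row[WEEKDAYS[d.weekday()]] == "1":  # weekday(): Monday is 0
                days.add(d)
            d += timedelta(days=1)
    # Explicit dates from calendar_dates.txt: 1 adds service, 2 removes it.
    for row in csv.DictReader(io.StringIO(calendar_dates_txt)):
        days = active.setdefault(row["service_id"], set())
        d = _parse(row["date"])
        if row["exception_type"] == "1":
            days.add(d)
        elif row["exception_type"] == "2":
            days.discard(d)
    return active
```

With this approach a feed that lists every date explicitly and a feed that uses weekly patterns plus exceptions end up in the same in-memory form.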

> I highly recommend having an agency_id field, and it would be best if there is a worldwide list to make them unique.

The notion of making identifiers globally unique is important, but registering globally unique agency IDs is not the solution. All identifiers in GTFS are feed-unique, but a feed can have more than one agency ID and most entities in a GTFS feed are not associated with an agency. In my opinion the true solution is unique feed IDs. We have made a proposal for feed IDs, and several prominent feed producers have started including them. OpenTripPlanner now recognizes feed IDs. Sometime soon I’ll revive this proposal, as both producers and consumers of feed IDs now exist.
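To make the idea concrete, here is a small Python sketch of scoping every identifier by a feed ID when several feeds are merged. The `ScopedId` type, the "feed:local" string form, and the shape of the input are assumptions made for this example, not part of the proposal or of any particular software.

```python
from typing import NamedTuple


class ScopedId(NamedTuple):
    """An identifier that is only unique within one feed, plus that feed's ID."""
    feed_id: str
    local_id: str

    def __str__(self):
        # Illustrative rendering only; the separator is an arbitrary choice.
        return f"{self.feed_id}:{self.local_id}"


def merge_stops(feeds):
    """Merge stops from several feeds without ID collisions.

    `feeds` maps a feed ID to a list of stops.txt rows (dicts).
    """
    merged = {}
    for feed_id, stops in feeds.items():
        for stop in stops:
            key = ScopedId(feed_id, stop["stop_id"])
            if key in merged:
                # A duplicate inside a single feed is a feed error.
                raise ValueError(f"duplicate stop_id within feed: {key}")
            merged[key] = stop
    return merged
```

The same scoping works for trips, routes and services, including entities that have no agency attached, which is why a per-agency registry would not cover them.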

> As hinted, I recommend making feed_info.txt mandatory.

It’s unlikely that non-backward-compatible changes will be made, which would include making something mandatory that has until now not been mandatory.

Again, thanks for the summary of problems you encountered. It is quite useful for working on feed validation / integration software.

Andrew Byrd

Joa

unread,
Dec 30, 2015, 11:58:22 PM12/30/15
to Transit Developers, umair...@ymail.com


On Wednesday, December 30, 2015 at 2:05:01 AM UTC-8, Andrew Byrd wrote:

Hi Joris,


> It seems to me that some feeds list trips from an operator point of view, resulting in suggested transfers whilst passengers perceive a continued connection. The block_id is often either not provided or not meant to be used to reconstruct the routes from a passenger point of view ( which is complicated anyhow). GTFS does not have a well-defined way to define 2 routes, with a single vehicle serving both, flipping headsigns and letting passengers stay.

There is in fact a standard way to represent this, and you mentioned it: the block ID.


Not quite. Blocks and trips in and of themselves have no meaning in a traveler-facing capacity (a couple of exceptions below). Trips are vehicle trips, not passenger trips. A block is an ordered set of trips. Blocks historically are the basis for the paddles that operators pick up at "window dispatch" before they leave the yard to work the block on their shift. Scheduling systems have added a number of twists to this, but at the fundamental level, that's the reason why you'll end up pulling your hair out trying to make sense of trips in a traveler-facing capacity.
In GTFS, block data is optional. Offhand I can identify two cases where missing block data is relevant in traveler information:
1. You cannot predict next bus / train times at a stop on the subsequent trip for vehicles that are still running the previous trip.
2. You miss potentially "valuable" non-transfer passenger connections to other routes where agencies interline mid-route or at transfer centers. That's the "flipping the headsigns and letting passengers stay" scenario.
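Both cases come down to knowing which trip a vehicle runs next. A minimal Python sketch of deriving that from block_id, under some simplifying assumptions: trips are grouped by (service_id, block_id), ordered by their first departure time, and the function and argument names are invented for this example.

```python
from collections import defaultdict


def to_seconds(hms):
    """Convert a GTFS time such as '25:10:00' (may exceed 24h) to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def next_trip_in_block(trips, first_departure):
    """Map each trip_id to the trip the same vehicle runs next.

    `trips` is a list of trips.txt rows (dicts); `first_departure` maps
    trip_id to the departure time at the trip's first stop.
    Trips with no block_id are skipped, since nothing can be inferred.
    """
    blocks = defaultdict(list)
    for trip in trips:
        if trip.get("block_id"):
            blocks[(trip["service_id"], trip["block_id"])].append(trip["trip_id"])
    following = {}
    for trip_ids in blocks.values():
        trip_ids.sort(key=lambda tid: to_seconds(first_departure[tid]))
        for current, nxt in zip(trip_ids, trip_ids[1:]):
            following[current] = nxt
    return following
```

Whether a passenger may actually stay on board between `current` and `nxt` is a separate question that this ordering alone does not answer.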

Andrew Byrd

unread,
Jan 4, 2016, 8:22:50 AM1/4/16
to transit-d...@googlegroups.com, umair...@ymail.com
> On 31 Dec 2015, at 05:58, Joa <Joachim....@gmail.com> wrote:
> Not quite. Blocks and trips in and of themselves have no meaning in a traveler-facing capacity (a couple of exceptions below). Trips are vehicle trips, not passenger trips. A block is an ordered set of trips. Blocks historically are the basis for the paddles that operators pick up at "window dispatch" before they leave the yard to work the block on their shift.

I am not disputing any of the issues Joris raised. Indeed, the block_id is often not supplied or not usable in isolation to determine when passengers can stay on a physical vehicle, passing from one logical trip to another (which we usually refer to as “interlining”). I was only reacting to his final statement that GTFS has no way to represent interlining. Independent of the conventional meaning of the term “block” in transit operations, the GTFS spec clearly states that this field has a traveler-facing meaning: "The block_id field identifies the block to which the trip belongs. A block consists of two or more sequential trips made using the same vehicle, where a passenger can transfer from one trip to the next just by staying in the vehicle."

So in theory, according to the specification, GTFS has a way to express the traveler-facing notion of an in-seat transfer. What it lacks is a way to express the physical vehicle continuity underlying that possibility of interlining (the conventional reading of “block” you cited). I presume that this is because the meaning of block_id was defined before the appearance of GTFS-RT, at a time when there was no other practical use for information about the series of trips served by a single vehicle.

Now, in practice I acknowledge that block_id is not systematically used in the way prescribed by the spec. The blocks are often exported as-is from a scheduling system, even for trips that do not allow passengers to stay on board. Indeed, TriMet (one of the originators of the GTFS format) exports the block ID in its original non-rider-facing sense. We have worked with them to include some extra information that permits us to determine when two trips are interlined. But that system is a stopgap measure using the pick-up type of the final stop in a trip, which has some other awkward side effects. It should be replaced with a proper, well thought-out GTFS extension. I’ve already brought this up with people at TriMet and while it’s not always easy to find time to re-write GTFS export tools, they seemed supportive of switching to a more standard representation.

So I am not saying GTFS’s block and interlining representation is clear, complete, or convenient, and based on my experience using block_ids, we could and should come up with ways to make the system better.

I think the best solution is to redefine block_id to the conventional non-rider-facing meaning, and complement it with a new field that expresses interlining on a trip-by-trip basis. This could be either the trip_id of the following trip, or a boolean field stating whether interlining is enabled to the following trip in the block. The first option is somewhat redundant (“denormalized”) but easy to use, and redundancy has a positive side of allowing sanity checks.

-Andrew

Aaron Antrim

unread,
Jan 4, 2016, 12:14:35 PM1/4/16
to transit-d...@googlegroups.com
On 4 Jan 2016, at 5:22 AM, Andrew Byrd <and...@fastmail.net> wrote:
>
> Now, in practice I acknowledge that block_id is not systematically used in the way prescribed by the spec. The blocks are often exported as-is from a scheduling system, even for trips that do not allow passengers to stay on board. Indeed, TriMet (one of the originators of the GTFS format) exports the block ID in its original non-rider-facing sense. We have worked with them to include some extra information that permits us to determine when two trips are interlined. But that system is a stopgap measure using the pick-up type of the final stop in a trip, which has some other awkward side effects. It should be replaced with a proper, well thought-out GTFS extension. I’ve already brought this up with people at TriMet and while it’s not always easy to find time to re-write GTFS export tools, they seemed supportive of switching to a more standard representation.

I think redefining block_id to match widespread practice (blocks = vehicles) is a good approach. I know that GTFS with block_id representing vehicles is ingested by some arrival estimation systems.

Michael Smith

unread,
Jan 4, 2016, 12:39:52 PM1/4/16
to transit-d...@googlegroups.com
I am in complete agreement that GTFS terms should match those in practice. A block is typically a collection of trips that make up the assignment for a vehicle for a day. Both scheduling systems and real-time information systems use this definition of block. Therefore I think that linking trips to show where a passenger can stay on the vehicle should be indicated in a separate way.

Mike


Joa

unread,
Jan 4, 2016, 10:45:36 PM1/4/16
to Transit Developers, umair...@ymail.com


On Monday, January 4, 2016 at 5:22:50 AM UTC-8, Andrew Byrd wrote:
> On 31 Dec 2015, at 05:58, Joa <Joachim....@gmail.com> wrote:
> Not quite. Blocks and trips in and of themselves have no meaning in a traveler-facing capacity (a couple of exceptions below). Trips are vehicle trips, not passenger trips. A block is an ordered set of trips. Blocks historically are the basis for the paddles that operators pick up at "window dispatch" before they leave the yard to work the block on their shift.

> I am not disputing any of the issues Joris raised.

Same here. I don't want to draw attention away with a discussion of trips and blocks.
The findings that Joris posted reflect my experience with the feed data out there.

 
> I presume that this is because the meaning of block_id was defined before the appearance of GTFS-RT, at a time when there was no other practical use for information about the series of trips served by a single vehicle.

Yes, there is life before GTFS-rt. Real-time tracking and traveler information systems were being built for quite some time before. There are agencies that are at gen three. TriMet built their second system a few years back. For the curious, there is a great study that the FTA did in the mid-90s that you should be able to find referenced from the materials that the EFF used in their efforts to invalidate ArrivalStar's patents.
--------------------
At scale, by and large, we are not going to get anything for block data other than what is industry practice.
I think it would be a great accomplishment to move block_id from optional to mandatory: not just putting it in the spec, but getting everybody on board.

Andrew Byrd

unread,
Jan 5, 2016, 4:25:36 AM1/5/16
to transit-d...@googlegroups.com, umair...@ymail.com
On 05 Jan 2016, at 04:45, Joa <Joachim....@gmail.com> wrote:
>> I presume that this is because the meaning of block_id was defined before the appearance of GTFS-RT, at a time when there was no other practical use for information about the series of trips served by a single vehicle.
> Yes, there is life before GTFS-rt. Real-time tracking and traveler information systems were being built for quite some time before. There are agencies that are at gen three. TriMet built their second system a few years back. For the curious, there is a great study that the FTA did in the mid 90's that you should be able to find referenced from the materials that the EFF used in their efforts to invalidate ArrivalStar's patents.

There was certainly life before GTFS-RT. I remember working with a TriMet real-time arrivals feed several years before GTFS appeared, so open arrivals data predates open schedule data. But I think you’re misunderstanding my intent. When I say “the meaning of block_id was defined before the appearance of GTFS-RT” I don’t mean that the concept of blocks already existed in transit operations. That fact is well-established and I was never questioning it. I meant that block_id (the GTFS field) was defined to mean something else (passenger-facing interlining blocks), and I assume that definition was chosen because GTFS and the trip planners that consumed it had no real-time component. There was no immediate practical use for conventional block information, but there was an immediate practical use for passenger-facing interlining information. Which gives us the problematic definition of block_id we now have.

Anyway, I think the situation is clear now and passenger information systems have clearly evolved to incorporate real-time data. It seems like we all agree that block_id should represent a conventional operational block, so we need to decide on a method to represent rider-facing interlining.

Does anyone have arguments for or against the two proposed methods for representing in-seat transfers / interlining?
1. Boolean field in trips.txt saying whether passengers may stay on board, continuing on the next trip in the block
2. String field in trips.txt explicitly stating the trip_id of the following trip if the passenger may stay on board

Again, I think both would work, with the second being more denormalized/redundant (which allows for more sanity checks). 
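To compare the two options side by side, here is a short Python sketch of how a consumer might read each one. The column names `stay_on_board` and `next_trip_id` are placeholders invented for this example (neither exists in GTFS), and each function assumes it receives one block's trips.txt rows already sorted in operating order.

```python
def interlines_from_boolean(block_trips):
    """Option 1: a flag on a trip links it to the next trip in the block."""
    links = {}
    for current, nxt in zip(block_trips, block_trips[1:]):
        if current.get("stay_on_board") == "1":
            links[current["trip_id"]] = nxt["trip_id"]
    # A flag on the last trip of the block has nothing to point at
    # and is silently ignored here.
    return links


def interlines_from_trip_id(block_trips):
    """Option 2: each trip names its successor explicitly.

    Returns (links, problems). The redundancy lets us cross-check the
    named successor against the block order and report mismatches as
    (trip_id, named_successor, expected_successor) tuples.
    """
    links, problems = {}, []
    order = [trip["trip_id"] for trip in block_trips]
    for i, trip in enumerate(block_trips):
        target = trip.get("next_trip_id")
        if not target:
            continue
        expected = order[i + 1] if i + 1 < len(order) else None
        if target == expected:
            links[trip["trip_id"]] = target
        else:
            problems.append((trip["trip_id"], target, expected))
    return links, problems
```

In this sketch the boolean form cannot detect a producer error on its own, while the explicit trip_id form surfaces it as an entry in `problems`.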

Also, the correct place for this discussion is probably the gtfs-changes group, so I’ll post a message there.

Andrew





