Bus GTFS size & implementation


John Paul N.

Sep 28, 2011, 9:17:42 AM
to mtadeveloperresources
Hello fellow developers,

I read the discussion about the Metro-North GTFS calendar and the GTFS
implementation. While I have not looked at the recent Metro-North
GTFS, I have experience with the Subway, NYCT Bus, MTA Bus (not
recently) and LI Bus feeds. The NYCT Bus GTFS is much larger than
Metro-North's, and for a reason similar to the one pointed out in the
MNRR discussion: the calendar_dates exceptions identify school days AND
holidays AND replace regular service in its entirety. Thus every
regular trip is multiplied by 3 or 4, so to speak, and processing can
be long and complicated. (The Bus GTFS is also large for a different
reason, which I will explain below.) So I fully understand the
developers' frustration at not being able to understand the GTFS
implementation quickly.

I think I have an implementation for school days and holidays that
might work. (I don't know how clearly I will be able to explain this,
but I will try.)

The basic idea is that you are obligated to consider all service
exceptions. I have inferred 4 basic types of service patterns from the
MTA's GTFS and official schedules.

Type 1: School-day only trips (these do not run on holidays)
Type 2: Trips that run only on holidays, supplementing regular service
Type 3: Trips that don't run on holidays, and cannot be considered
regular service
Type 4: All other trips, e.g. regular scheduled trips even on weekday
holidays

Every bus trip has to fall under one of these types. So I suggest a
sample implementation:

========================================
calendar.txt
service_id, monday, tuesday, wednesday, thursday, friday, saturday, sunday, start_date, end_date
A1, 1, 1, 1, 1, 1, 0, 0, 20111001, 20111231
A2, 1, 1, 1, 1, 1, 0, 0, 20111001, 20111231
B1, 0, 0, 0, 0, 0, 1, 0, 20111001, 20111231
B2, 0, 0, 0, 0, 0, 1, 0, 20111001, 20111231
C1, 0, 0, 0, 0, 0, 0, 1, 20111001, 20111231
C2, 0, 0, 0, 0, 0, 0, 1, 20111001, 20111231


The second character in service_id:
1: This trip will operate on its normal schedule for all dates, ONLY
EXCEPTION is holidays
2: This trip will operate on its normal schedule EXCEPT holidays and
also for a certain date(s)
3 and up: This trip will operate on the date in which #2 will not run
OR it is an extra trip that will run on the date of #2

School-day only trips fall under #2. If a trip occurs ONLY when school
is not in session, and is not otherwise regular service, it falls
under #3. In both of these cases, #1 would not be superseded in
calendar_dates.txt. If the day is a holiday (weekend service on a
weekday), the trips that run also fall under #3 and operate under
different schedules altogether, so both #1 and #2 have to be
superseded in calendar_dates.txt.



calendar_dates.txt
service_id, date, exception_type
A2, 20111010, 2 // Columbus Day, modified weekday schedule, schools
not in attendance, but added service elsewhere
A3, 20111010, 1
A2, 20111108, 2 // Election Day, modified weekday schedule: schools
not in attendance, assumes no other added trips, so A3 is not added.
A1, 20111124, 2 // Thanksgiving Day, sunday schedule
A2, 20111124, 2
C1, 20111124, 1
C2, 20111124, 1
A2, 20111125, 2 // Day after Thanksgiving, modified weekday schedule,
different from Columbus Day schedule, added service
A4, 20111125, 1

[The C2 for Thanksgiving may be unnecessary.]

Please discuss and if there are any errors in my logic please correct
me. But if this works, this will go a long way to reduce the GTFS size
and restore all of our sanity with the numerous MTA schedules.

=======================================
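
As a rough illustration of how a consumer would resolve the sample
above, here is a minimal Python sketch. It holds the sample
calendar.txt and calendar_dates.txt rows in memory and applies the
usual GTFS rule: start from the weekday flags, then add services with
exception_type 1 and remove those with exception_type 2. The names
CALENDAR, CALENDAR_DATES and active_services are made up for this
example and are not part of any MTA tool.

```python
from datetime import date

# All six sample services share the same validity range.
START, END = date(2011, 10, 1), date(2011, 12, 31)

# service_id -> weekday flags in GTFS column order (Monday first).
CALENDAR = {
    "A1": (1, 1, 1, 1, 1, 0, 0),
    "A2": (1, 1, 1, 1, 1, 0, 0),
    "B1": (0, 0, 0, 0, 0, 1, 0),
    "B2": (0, 0, 0, 0, 0, 1, 0),
    "C1": (0, 0, 0, 0, 0, 0, 1),
    "C2": (0, 0, 0, 0, 0, 0, 1),
}

# (service_id, date, exception_type): 1 = service added, 2 = removed.
CALENDAR_DATES = [
    ("A2", date(2011, 10, 10), 2),  # Columbus Day
    ("A3", date(2011, 10, 10), 1),
    ("A2", date(2011, 11, 8), 2),   # Election Day
    ("A1", date(2011, 11, 24), 2),  # Thanksgiving Day
    ("A2", date(2011, 11, 24), 2),
    ("C1", date(2011, 11, 24), 1),
    ("C2", date(2011, 11, 24), 1),
    ("A2", date(2011, 11, 25), 2),  # Day after Thanksgiving
    ("A4", date(2011, 11, 25), 1),
]


def active_services(day):
    """Return the set of service_ids in effect on the given date."""
    # Step 1: regular service from the weekday flags in calendar.txt.
    active = {
        service_id
        for service_id, flags in CALENDAR.items()
        if START <= day <= END and flags[day.weekday()]
    }
    # Step 2: apply the exceptions from calendar_dates.txt for that date.
    for service_id, exception_day, exception_type in CALENDAR_DATES:
        if exception_day != day:
            continue
        if exception_type == 1:
            active.add(service_id)
        elif exception_type == 2:
            active.discard(service_id)
    return active
```

With this data, active_services(date(2011, 10, 10)) gives A1 and A3,
and active_services(date(2011, 11, 24)) gives C1 and C2, so the regular
A1 trips never have to be duplicated for Columbus Day.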

Now for the other elephant in the room regarding bus GTFS size: some
of the MTA's conventions drive up the file size further, and I feel
unnecessarily so. In particular, I mean the service_id and trip_id
values in trips.txt, stop_times.txt and all related files.

Reducing the number of trips and stop_times will produce a significant
reduction in the GTFS size, I'd estimate at least 50%, maybe 75%. But
if the above scheme ends up not feasible, the next area to tackle to
reduce the file size is the service_id and trip_id attributes.

Here is a sample typical entry in trips.txt:

route_id,service_id,trip_id,trip_headsign,direction_id,block_id,shape_id
BX11,20110904ER,20110904ER_055400_BX11_0029_BX11_12,W FARMS RD
SOUTHERN BL,0,,BX110029

As you can see, the service_id is 10 characters long, and is for,
let's say, 15 unique service patterns. If you have 15 service
patterns, you don't need 10 characters, just 1 or at most 2.

The trip_id contains the current service_id plus some other
characters, including 2 bus routes, which may or may not be the same
as the route_id. Again, for the number of trips you have for each
route (usually fewer than 1,000 per service day), you don't really
need to represent them with so many characters. I would guess that if
the trip_id were reduced by two-thirds, it would still be recognizable
and distinct enough to identify. Other implementations I see usually
have 10 characters but are easily identifiable for the
consumer/developer.

I realize these values are likely used internally, and I am curious as
to what each set of characters between the underscores means, but if
the MTA does not want to give an explanation, I will respect that. At
least one individual I know uses the full trip_id in his analysis of
the data (for the subway GTFS, which is similar). If that individual
objects to abridging these values, I would suggest the MTA could
produce a file that will bridge between the long and shortened id's.
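
To make the "bridge" idea concrete, here is a minimal Python sketch
under my own assumptions: short ids are simply assigned in first-seen
order, and the bridge is written as a two-column CSV. The function
names, the T1/T2 id format and the column name short_trip_id are
hypothetical and not part of GTFS; the second trip_id in the sample is
invented for illustration.

```python
import csv
import io


def build_bridge(long_ids):
    """Map each distinct long trip_id to a short id, in first-seen order."""
    bridge = {}
    for long_id in long_ids:
        if long_id not in bridge:
            bridge[long_id] = f"T{len(bridge) + 1}"
    return bridge


def bridge_csv(bridge):
    """Render the mapping as CSV text: short id first, original id second."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["short_trip_id", "trip_id"])
    for long_id, short_id in bridge.items():
        writer.writerow([short_id, long_id])
    return out.getvalue()


sample_trip_ids = [
    "20110904ER_055400_BX11_0029_BX11_12",
    "20110904ER_060400_BX11_0029_BX11_13",  # hypothetical second trip
    "20110904ER_055400_BX11_0029_BX11_12",  # repeat, reuses the first short id
]
bridge = build_bridge(sample_trip_ids)
```

Anyone who depends on the full trip_id could join back through such a
file, while everyone else works with the short ids.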



I sincerely hope I have been delicate about all that I have said above
and I hope my recommendations will be taken in earnest. Feel free to
discuss. With smaller but still accurate GTFS files, not only do the
developers benefit, but the MTA does as well, by freeing up bandwidth
and storage for other data.


Thank you,
John Paul

P.S. I have been dealing with a very poor Internet connection in my
home this past week. I will try to respond through my phone, but it
will not be as detailed as the long message above. I am also
nocturnal, so I may not be able to respond until after the end of the
business day.

Adam Ernst

Sep 28, 2011, 9:21:46 AM
to mtadevelop...@googlegroups.com
Hi John,

> The second character in service_id:
> 1: This trip will operate on its normal schedule for all dates, ONLY
> EXCEPTION is holidays
> 2: This trip will operate on its normal schedule EXCEPT holidays and
> also for a certain date(s)
> 3 and up: This trip will operate on the date in which #2 will not run
> OR it is an extra trip that will run on the date of #2

This is a good idea. There's no reason you can't have two otherwise identical rows in calendar.txt with different service_ids, one of which is removed from holidays and one which isn't.

> Now for the other elephant in the room regarding bus GTFS size, some
> of the MTA's conventions drive up the file size further and I feel
> unnecessarily. In particular, the service_id and trip_id in trips.txt,
> stop_times.txt and all related files.

...


> As you can see, the service_id is 10 characters long, and is for,
> let's say, 15 unique service patterns. If you have 15 service
> patterns, you don't need 10 characters, just 1 or at most 2.

This seems less necessary. When I put the feed in a mobile app I put it in a SQLite database and change all internal textual identifiers to ints anyway, which is the most efficient way of doing it. Why don't you consider doing the same?
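
For readers who have not done this before, a minimal Python sketch of the idea follows. It is not the code from any particular app: the table layout and the function name load_trips are assumptions made for illustration. Each textual service_id and trip_id is stored once, and everything else refers to it by an integer key.

```python
import sqlite3


def load_trips(conn, rows):
    """Store (service_id, trip_id) text pairs behind integer keys."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS services (
            id INTEGER PRIMARY KEY,
            service_id TEXT UNIQUE NOT NULL
        );
        CREATE TABLE IF NOT EXISTS trips (
            id INTEGER PRIMARY KEY,
            trip_id TEXT UNIQUE NOT NULL,
            service INTEGER NOT NULL REFERENCES services(id)
        );
    """)
    with conn:  # commit once at the end, roll back on error
        for service_id, trip_id in rows:
            # Insert the long text only the first time it is seen.
            conn.execute(
                "INSERT OR IGNORE INTO services (service_id) VALUES (?)",
                (service_id,),
            )
            (service_key,) = conn.execute(
                "SELECT id FROM services WHERE service_id = ?",
                (service_id,),
            ).fetchone()
            conn.execute(
                "INSERT INTO trips (trip_id, service) VALUES (?, ?)",
                (trip_id, service_key),
            )
```

A stop_times table would then reference trips.id, so the long trip_id string appears once per trip rather than once per stop time.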

Adam

John Paul N.

Sep 28, 2011, 9:37:51 AM
to mtadeveloperresources
Hey Adam,

I am using a SQLite database, and I do spend a lot of time first
shortening the trip id field and then importing into the DB by
producing files of SQL commands through a Java program I use. (I
couldn't get the Transforming GTFS to work for me.) My process and
implementation are probably less than efficient. (I am the only person
working on the app. It's SchedNYC for Android, BTW.) It was only
around the 4th time I had to update the database that processing the
subway GTFS came down to 1 day. The import takes the longest time,
and if a SQL command is wrong, it pushes me back significantly.

Until recently I did not know that CSV can be imported into a SQLite
database through the program I use, SQLite Database Browser. I have
yet to try it, but I think it may be feasible for my established DB
structure. Thanks so much for the suggestion!
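
For anyone in the same position, one alternative to generating files
of SQL commands is to read the CSV and insert the rows in a single
transaction. The sketch below is only an illustration in Python: the
function name import_stop_times and the table layout are assumptions,
and the stop ids and times in the sample are made up. The column names
are the standard GTFS stop_times.txt ones.

```python
import csv
import io
import sqlite3

SAMPLE_CSV = (
    "trip_id,arrival_time,departure_time,stop_id,stop_sequence\n"
    "20110904ER_055400_BX11_0029_BX11_12,05:54:00,05:54:00,STOP_A,1\n"
    "20110904ER_055400_BX11_0029_BX11_12,05:58:00,05:58:00,STOP_B,2\n"
)


def import_stop_times(conn, csv_file):
    """Bulk-load stop_times rows from an open CSV file object."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stop_times ("
        "trip_id TEXT, arrival_time TEXT, departure_time TEXT, "
        "stop_id TEXT, stop_sequence INTEGER)"
    )
    reader = csv.DictReader(csv_file)
    rows = (
        (
            row["trip_id"],
            row["arrival_time"],
            row["departure_time"],
            row["stop_id"],
            int(row["stop_sequence"]),
        )
        for row in reader
    )
    # One transaction for the whole file: a bad row rolls everything back
    # instead of leaving a half-imported table behind.
    with conn:
        conn.executemany("INSERT INTO stop_times VALUES (?, ?, ?, ?, ?)", rows)
```

With a real feed you would pass open("stop_times.txt", newline="")
instead of the in-memory sample.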

The shortened id suggestion is based on the MTA's data needs. Any
unnecessary data (for the public) that can be removed should help
their internal organization. But I agree, the calendar management is
the priority.

Frumin, Michael

Sep 28, 2011, 9:52:47 AM
to mtadevelop...@googlegroups.com
The proposal seems based on the idea that the difference between, say, a schoolday and a non-schoolday is strictly limited to the additions of certain trips on the schooldays, and that all other trips/stops remain the same. Similar for holidays, etc. Or am I missing something?

Assuming I understand correctly, have you looked at all of the schedule data to confirm that this is in fact the case? I suspect that it's not, but that's just a hunch.


Also, one additional piece of context as to why the NYCT bus GTFS is so big -- GTFS doesn't support trips that start before midnight at the beginning of their service day. But our operation and fancy scheduling systems do, and we take advantage of that to optimize the cost basis for bus service. This means that, for example, the Friday service has to be defined separately in the GTFS from the Monday-Thursday service, since the "Saturday" trips that start before midnight (i.e. on Friday) need to be part of the Friday service in the GTFS. Similar complexity for transitions between schooldays and not schooldays, between any kind of day and a holiday, from holidays back to non-holidays, etc. Make sense?
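
A small Python sketch may help show the constraint. GTFS stop times can run past 24:00:00 to continue into the next calendar day, but there is no way to write a time earlier than the start of the service day. The function name gtfs_time_to_datetime is made up for this example, and it treats the service day as starting at midnight; the GTFS reference defines it as "noon minus 12h", which differs only on days when daylight saving time changes.

```python
import re
from datetime import date, datetime, time, timedelta

# H:MM:SS or HH:MM:SS, with hours allowed to exceed 23.
_GTFS_TIME = re.compile(r"(\d{1,3}):(\d{2}):(\d{2})")


def gtfs_time_to_datetime(service_date, gtfs_time):
    """Convert a GTFS time on a service date to a wall-clock datetime."""
    match = _GTFS_TIME.fullmatch(gtfs_time.strip())
    if match is None:
        # Also rejects negative values such as "-0:10:00".
        raise ValueError(f"not a valid GTFS time: {gtfs_time!r}")
    hours, minutes, seconds = (int(group) for group in match.groups())
    if minutes > 59 or seconds > 59:
        raise ValueError(f"not a valid GTFS time: {gtfs_time!r}")
    start_of_service_day = datetime.combine(service_date, time.min)
    return start_of_service_day + timedelta(
        hours=hours, minutes=minutes, seconds=seconds
    )
```

A stop at "25:30:00" on the Friday, Sep 30, 2011 service day resolves to 1:30 AM on Saturday, Oct 1. The reverse case, a "Saturday" trip leaving at 11:50 PM on Friday, has no valid time under a Saturday service_id, which is why it has to be filed under Friday.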

Thanks,
Mike

Adam Ernst

Sep 28, 2011, 9:59:20 AM
to mtadevelop...@googlegroups.com
On Sep 28, 2011, at 9:52 AM, Frumin, Michael wrote:

> This means that, for example, the Friday service has to be defined separately in the GTFS from the Monday-Thursday service, since the "Saturday" trips that start before midnight (i.e. on Friday) need to be part of the Friday service in the GTFS. Similar complexity for transitions between schooldays and not schooldays, between any kind of day and a holiday, from holidays back to non-holidays, etc. Make sense?

It does make sense. That explains a lot.

I think the ideal would be to have three service_ids: one for buses that run M-F regardless, one for buses that run M-Th night, and one for Friday night.

The M-F might only contain buses through say, 10pm (or whenever the schedules diverge). But this way we avoid duplicating the entire schedule for Friday and editing only the buses after 10pm.

There's nothing wrong with the current way of doing it. Is it worth the extra effort for the MTA to make the feed as elegant and simple as possible? Probably not. But if anyone at MTA decides to redo the bus feed, do take John's suggestions into account.

Adam

Frumin, Michael

Sep 28, 2011, 10:05:09 AM
to mtadevelop...@googlegroups.com
Adam, thanks.

I don't even think it would need to be the Friday night trips that start after 10pm. It just would need to be the "Saturday" trips that actually start on Friday.

Just to be clear, your suggestion (and indeed John's suggestion) hinges on the fact that you can actually have multiple service_id's active at any given time/on the same date, even for the same route, correct? I guess I hadn't realized that such a thing was supported in GTFS. Would very much appreciate a definitive answer on this.

GTFS is a critical input into MTA Bus Time, so we will keep this discussion in mind if/when we revisit how GTFS is produced in the context of the Bus Time work.

Thanks,
Mike


Adam Ernst

Sep 28, 2011, 10:08:57 AM
to mtadevelop...@googlegroups.com
> I don't even think it would need to be the Friday night trips that start after 10pm. It just would need to be the "Saturday" trips that actually start on Friday.

Right. 10pm was just an example.

> Just to be clear, your suggestion (and indeed John's suggestion) hinges on the fact that you can actually have multiple service_id's active at any given time/on the same date, even for the same route, correct? I guess I hadn't realized that such a thing was supported in GTFS. Would very much appreciate a definitive answer on this.

That's definitely right, and it's the best way to do it. Just avoid accidentally duplicating service (i.e. having the same bus running in two service ids that are active together). It should make the feed much simpler.

> GTFS is a critical input into MTA Bus Time, so we will keep this discussion in mind if/when we revisit how GTFS is produced in the context of the Bus Time work.

Great!

Adam

iTransitBuddy Support

Sep 28, 2011, 10:28:00 AM
to mtadevelop...@googlegroups.com
Guys,

I use my own proprietary App to convert the GTFS to my format just fine.

I think the only change I would recommend would be to use LIRR's approach and only use calendar_dates to show which service_id's run on a specific day (flag is set to 1's only).  You may run into more service id's and entries but that's perfectly OK.  A lot of other agencies use this approach as well, MTS, NJT, LIRR, etc.

Do the Metro North ppl talk with the LIRR ppl who create their GTFS file?  I think it's more robust and accurate to abandon calendar.txt and use calendar_dates.txt 100%.

Any opinions on this?

Brian Ferris

Sep 28, 2011, 1:34:08 PM
to mtadevelop...@googlegroups.com
Just to confirm, you can definitely have multiple service ids active at any given time, even for the same date and the same route.

There is certainly a trade-off when modeling your GTFS between the complexity of your calendar.txt and calendar_dates.txt entries and the number of trips and stop times you need.  I agree with the assessment that the NYCT MTA Bus GTFS feeds could be dramatically smaller with a more complex calendar scheme.  I've got a transformer that attempts to de-duplicate trips and update the calendar entries appropriately.  It results in pretty dramatic size reductions for the bus feeds:

Manhattan: 79% reduction in number of stop times
Brooklyn: 81% reduction in number of stop times

But as a result, the calendar.txt and calendar_dates.txt are more complex as well.  So, trade-offs.

That said, considering stop times are usually the main complexity barrier / memory hog when dealing with GTFS, having 1/5 the number of stop times, especially for a big agency like the MTA, could be useful.
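
To give a feel for what trip de-duplication means, here is a minimal Python sketch. It is not the OneBusAway transformer's actual algorithm, only an illustration under simple assumptions: two trips count as duplicates when they are on the same route and have identical stop/time sequences. The function name and the sample stop ids and times are hypothetical.

```python
from collections import defaultdict


def merge_duplicate_trips(trips, stop_times):
    """Collapse trips that differ only in service_id.

    trips: list of (trip_id, route_id, service_id)
    stop_times: dict of trip_id -> list of (stop_id, departure_time)
    Returns a list of (kept_trip_id, route_id, sorted service_ids).
    """
    groups = defaultdict(list)
    for trip_id, route_id, service_id in trips:
        # Trips with the same route and the same stop/time sequence
        # land in the same group.
        signature = (route_id, tuple(stop_times[trip_id]))
        groups[signature].append((trip_id, service_id))

    merged = []
    for (route_id, _), members in groups.items():
        kept_trip_id = members[0][0]
        service_ids = sorted({service_id for _, service_id in members})
        merged.append((kept_trip_id, route_id, service_ids))
    return merged


trips = [
    ("t1", "S40", "E"),  # school-day series
    ("t2", "S40", "C"),  # non-school-day series, same stops and times as t1
    ("t3", "S40", "E"),  # runs on school days only
]
stop_times = {
    "t1": [("s1", "07:00:00"), ("s2", "07:10:00")],
    "t2": [("s1", "07:00:00"), ("s2", "07:10:00")],
    "t3": [("s1", "07:30:00"), ("s2", "07:40:00")],
}
merged = merge_duplicate_trips(trips, stop_times)
```

Here t1 and t2 collapse into one trip that must run under both "C" and "E". The calendar side of the trade-off is that a service_id covering the union of those dates then has to exist or be created.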

That said, you don't necessarily have to wait for the MTA to change their feed (they might have good reasons not to).  I added the trip de-duplication / calendar update code as a transformer in the OneBusAway GTFS transformer command line tool, which means you can run the reducer for yourself right now.  Check out the "Merge Trips and Refactor Calendar Entries" how-to at:

http://developer.onebusaway.org/modules/onebusaway-gtfs-modules/curre...


Thanks,
Brian

John Paul N.

Sep 28, 2011, 6:30:23 PM
to mtadeveloperresources
Hi all,

I have read all your comments, and I appreciate your considerations. I
don't think I'll be able to address them all in this post, but I'll
try.

Mike, in all honesty, yes, I did not look at every single trip and
stop_time. I extracted the express bus routes and I was planning to
work on the Staten Island local routes when I realized I had to update
my LI Bus data, and I am in the middle of doing that now. But I was
able to discern an "E" series for weekday school days, "C" series for
weekday non-school days, "D" series for Sunday and "A" series for
Saturday (this is off the top of my head, it may be wrong). When you
see both calendar files, you can confirm. Of the schedules I worked
with, in the NYCT express bus routes, which is of course not affected
by school schedules (except for the X32 that is now discontinued),
there is no difference in all the "E" and "C" service patterns. I
confirmed this with the public PDF schedule. I looked at the S40 a few
weeks ago, and I did see trips (1 or 2?) that are in the "E" series
but not in the "C" series, consistent with the published schedule,
the back of the bus map, and my hypothesis.

The GTFS implementation is correct, as far as I can tell and as it
should be, but I am hoping you, Sarah, or anybody else there can
explain the GENERAL rhyme and reason behind the different service
patterns (not meant to yell, just to emphasize), beyond what is given
in calendar_dates.txt. This is also not a demand, just a question to
raise discussion. As Adam pointed out, the current implementation is
fine in its dissemination of correct data, but it is far from elegant
and efficient.

Because I looked at the express bus schedule only, I did not
immediately notice a difference between Mo-Th and Fr service, but now
that I revisit calendar.txt, it is indeed so. Adam is right, the
preferred structure in this case is: one for buses that run M-F
regardless, one for buses that run M-Th night, and one for Friday
night.

You may also be able to adjust the definition of a service day, i.e.
0500-2859, but based on the discussion from the Metro-North thread,
it's looking less feasible for your organization due to the conflict
between performance metrics and scheduling.

In the end, I say take it or leave it. I offered my suggestions
primarily as a response to the Metro-North GTFS thread (the incorrect
showing of Yankee Stadium trips and the large file sizes), but it does
apply to most of the individual agencies due to the similar patterns
within each one. As long as your GTFS is faithfully accurate, that is
all that matters. The time spent on optimizing may indeed not be the
best use of your time, especially if the MTA has more data coming down
the pipeline. But developers deserve to know why exactly the files are
so large, and that's what I'm hoping to discover in this discussion.

And finally, Mike, since you brought it up, will the BusTime
implementation also expand to express buses?

Adam, I agree with you 100%. (I think...) Just to clarify, my first
name is John Paul. The name has been ingrained within me by my parents
and as a result I'm uncomfortable whenever I'm referred to as "John."
But no hard feelings. If you prefer, you may call me "JP".

iTransitBuddy, using calendar_dates only is best if there are many
service variations. Here, though, the majority of days have normal
service, and that approach solves neither the file size problem nor
the fact that only a few trips are affected.

Brian, I have tried the OneBusAway GTFS transformer but had no
success. I don't know what I'm doing wrong, but I'm using the command
line from Windows and every time I process the bus files, the program
seems to hang, even after half an hour, without any messages saying
that it is processing, though I know the CPU and disk drive are
working. How long does the processing take, if anyone knows?

So, I'm extremely grateful for all the feedback!

Thank you very much,
John Paul

On Sep 28, 1:34 pm, Brian Ferris <bdfer...@google.com> wrote:
> Just to confirm, you can definitely have multiple service ids active at any
> given time, even for the same date and the same route.
>
> There is certainly a trade-off when modeling your GTFS between the
> complexity of your calendar.txt and calendar_dates.txt entries and the
> number of trips and stop times you need.  I agree with the assessment that
> the NYCT MTA Bus GTFS feeds could be dramatically smaller with a more
> complex calendar scheme.  I've got a transformer that attempts to
> de-duplicate trips and update the calendar entries appropriately.  It
> results in pretty dramatic size reductions for the bus feeds:
>
> Manhattan: 79% reduction in number of stop times
> Brooklyn: 81% reduction in number of stop times
>
> But as result, the calendar.txt and calendar_dates.txt are more complex as
> well.  So, trade-offs.
>
> That said, considering stop times are usually the main complexity barrier /
> memory hog when dealing with GTFS, having 1/5 the number of stops times,
> especially for a big agency like the MTA, could be useful.
>
> That said, you don't necessarily have to wait for the MTA to change their
> feed (they might have good reasons not to).  I added the trip de-duplication
> / calendar update code as a transformer in the OneBusAway GTFS transformer
> command line tool, which means you can run the reducer for yourself right
> now.  Check out the  "Merge Trips and Refactor Calendar Entries" how-to at:
>
> http://developer.onebusaway.org/modules/onebusaway-gtfs-modules/curre...

Timmy Douglas

Sep 28, 2011, 5:52:18 PM
to mtadevelop...@googlegroups.com
On Wed, Sep 28, 2011 at 10:28 AM, iTransitBuddy Support <sup...@itransitbuddy.com> wrote:
> Guys,
>
> I use my own proprietary App to convert the GTFS to my format just fine.
>
> I think the only change I would recommend would be to use LIRR's approach and only use calendar_dates to show which service_id's run on a specific day (flag is set to 1's only).  You may run into more service id's and entries but that's perfectly OK.  A lot of other agencies use this approach as well, MTS, NJT, LIRR, etc.
>
> Do the Metro North ppl talk with the LIRR ppl who create their GTFS file?  I think it's more robust and accurate to abandon calendar.txt and use calendar_dates.txt 100%.
>
> Any opinions on this?



I sort of disagree-- my app imports the GTFS text files into a sqlite database and directly runs queries on it. If there are not a ton of exceptions (in calendar_dates.txt) to the standard services listed in calendar.txt, then keeping calendar.txt can mean a big savings in database size. It also means the database is easier to maintain (the database file does not have to be updated as often). On the other hand, the code is a lot simpler if it only has to handle calendar_dates. But if they're just going to release 4 months of data at a time anyways, then yeah, just using calendar_dates would be fine. Since my database is on the user's phone, a space savings is good for me.
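
To show what the extra query complexity looks like, here is a minimal Python/SQLite sketch of a lookup that combines both files. The table layout, the helper names and the sample rows are assumptions for illustration, with dates stored as YYYYMMDD text as they appear in the feed.

```python
import sqlite3
from datetime import datetime

ACTIVE_SERVICES_SQL = """
SELECT service_id FROM calendar
WHERE :day BETWEEN start_date AND end_date
  AND CASE :weekday
        WHEN 0 THEN monday    WHEN 1 THEN tuesday  WHEN 2 THEN wednesday
        WHEN 3 THEN thursday  WHEN 4 THEN friday   WHEN 5 THEN saturday
        ELSE sunday
      END = 1
  AND service_id NOT IN (
        SELECT service_id FROM calendar_dates
        WHERE date = :day AND exception_type = 2)
UNION
SELECT service_id FROM calendar_dates
WHERE date = :day AND exception_type = 1
ORDER BY service_id
"""


def make_sample_db():
    """Build a tiny in-memory database with a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE calendar (
            service_id TEXT PRIMARY KEY,
            monday INTEGER, tuesday INTEGER, wednesday INTEGER,
            thursday INTEGER, friday INTEGER, saturday INTEGER,
            sunday INTEGER, start_date TEXT, end_date TEXT
        );
        CREATE TABLE calendar_dates (
            service_id TEXT, date TEXT, exception_type INTEGER
        );
    """)
    conn.executemany(
        "INSERT INTO calendar VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
        [
            ("A1", 1, 1, 1, 1, 1, 0, 0, "20111001", "20111231"),
            ("A2", 1, 1, 1, 1, 1, 0, 0, "20111001", "20111231"),
            ("C1", 0, 0, 0, 0, 0, 0, 1, "20111001", "20111231"),
        ],
    )
    conn.executemany(
        "INSERT INTO calendar_dates VALUES (?, ?, ?)",
        [
            ("A2", "20111010", 2),
            ("A3", "20111010", 1),
            ("A1", "20111124", 2),
            ("A2", "20111124", 2),
            ("C1", "20111124", 1),
        ],
    )
    conn.commit()
    return conn


def services_on(conn, day):
    """Return the sorted service_ids active on a YYYYMMDD date string."""
    weekday = datetime.strptime(day, "%Y%m%d").weekday()  # Monday is 0
    rows = conn.execute(ACTIVE_SERVICES_SQL, {"day": day, "weekday": weekday})
    return [row[0] for row in rows]
```

With a calendar_dates-only feed, the query shrinks to just the last SELECT, at the cost of one calendar_dates row per service per day.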
 

iTransitBuddy Support

Sep 28, 2011, 10:56:31 PM
to mtadevelop...@googlegroups.com
I also use a SQLite database which is on the users' phones.  My LIRR App, which uses calendar_dates.txt, has a database size of 5 MB, which is very small.  My train queries take about 3-5 seconds.

I also have agencies like Calgary whose SQLite DB is 100 MB, and the performance is still very good.  If I had my choice Metro North would use calendar_dates, but my App can use either.  I think it's logically easier to put it all in calendar_dates because then it's less stress on SQLite to pull out exceptions.

My thoughts...

Brian Ferris

Sep 29, 2011, 2:32:28 AM
to mtadevelop...@googlegroups.com
Just as a benchmark, it takes me a little less than 2 minutes to process the entire Brooklyn bus GTFS, so if it's taking dramatically longer, something is not right.  John, I'll contact you off list to see if we can figure out what's going on.

Brian

John Paul N.

Sep 30, 2011, 7:27:34 AM
to mtadeveloperresources
With Brian's assistance, I was able to figure out what went wrong. I
had to increase the Java heap size (and watch what the limits of my
memory were), something I never had to do before. So I can now process
the MTA's GTFS better than I had been, and I expect to have less
aggravation as a result.

Now is perhaps not the right time to ask for optimizations. The people
who worked with the Metro-North data were complaining about the file
size, and I had to respond because I was working with even larger
files. But I am at peace with it for now, and I am reminded of my
favorite quote from "Harry Potter and the Sorcerer's Stone," said by
Ron to Harry about Hermione: "She really needs to sort out her
priorities." And that's my final word for now.

Thanks,
John Paul

On Sep 29, 2:32 am, Brian Ferris <bdfer...@google.com> wrote:
> Just as a benchmark, it takes me about a little less than 2 minutes to
> process the entire Brooklyn bus GTFS, so if it's taking dramatically longer,
> something is not right.  John, I'll contact you off list to see if we can
> figure out what's going on.
>
> Brian
>