Errors in the latest GTFS data for Metro North and comments on the data explosion

Brett Bergquist

Apr 7, 2011, 8:09:15 AM
to mtadeveloperresources
I am seeing what I believe to be errors in the latest update for Metro
North. For example, trips.txt contains the following trips:

5,2,2983912,South Norwalk - Danbury,1810,0,,21,1
5,2564,3028925,South Norwalk - Danbury,1810,0,,21,1
5,2568,3037649,South Norwalk - Danbury,1810,0,,21,1

Schedule 2 is Tuesday, Wednesday, Thursday, and schedules 2564 and
2568 add in the dates 3/22/2011 and 3/31/2011 respectively. So from this
data, train 1810 only runs on Tuesday, Wednesday, and Thursday, but
the paper schedule does not show this.

The same is true for other Danbury trips.
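
A minimal sketch of how the rows above can be grouped to see which schedules carry a given train. The header row of trips.txt is not shown in this thread, so the column names below are an assumption based on the usual GTFS field names (the last column is labeled "extra" because its meaning is not stated here).

```python
import csv
from collections import defaultdict
from io import StringIO

# Assumed column order; the actual trips.txt header is not shown in the thread.
COLUMNS = ["route_id", "service_id", "trip_id", "trip_headsign",
           "trip_short_name", "direction_id", "block_id", "shape_id", "extra"]

# The three rows quoted above.
TRIPS_SAMPLE = """\
5,2,2983912,South Norwalk - Danbury,1810,0,,21,1
5,2564,3028925,South Norwalk - Danbury,1810,0,,21,1
5,2568,3037649,South Norwalk - Danbury,1810,0,,21,1
"""

def services_by_train(text):
    """Map each trip_short_name (train number) to the service_ids that carry it."""
    result = defaultdict(list)
    for row in csv.DictReader(StringIO(text), fieldnames=COLUMNS):
        result[row["trip_short_name"]].append(row["service_id"])
    return dict(result)

print(services_by_train(TRIPS_SAMPLE))  # {'1810': ['2', '2564', '2568']}
```

With the train mapped to its service IDs, each ID can then be looked up in calendar.txt and calendar_dates.txt to see which days it covers.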

---

Also, I am very troubled by the way that exceptions are now
represented. For example, I see exceptions for schedule 1 that remove a
series of dates:

1,20110405,2
1,20110406,2
1,20110412,2
...

and then see schedule 2578 add these dates right back in. Analyzing
the data, I can see that there are only 20 trips that are in schedule
2578 that are not in schedule 1:

trip_short_name,trip_id,schedule
7611,3060105,2578
7613,3060106,2578
7721,3060117,2578
7725,3060119,2578
7115,3060042,2578
7119,3060044,2578
7121,3060046,2578
7123,3060047,2578
7602,3060100,2578
7124,3060048,2578
7126,3060049,2578
7117,3060043,2578
7128,3060050,2578
7312,3060066,2578
7510,3060089,2578
7514,3060091,2578
7516,3060093,2578
7501,3060085,2578
7823,3060131,2578
7827,3060132,2578

This is determined by the query:

SELECT A.trip_short_name, A.trip_id, S.service_id AS schedule
FROM trips A
JOIN calendar S ON A.service_id = S.service_id
WHERE S.service_id = 2578
  -- keep only trains whose number never appears in schedule 1
  AND A.trip_short_name NOT IN (
    SELECT B.trip_short_name
    FROM trips B
    JOIN calendar S1 ON B.service_id = S1.service_id
    WHERE S1.service_id = 1
  )

which finds those trips that exist in schedule 2578 and are not
present in schedule 1. So only those 20 trips are really exceptions
to the schedule 1 trips and need to be handled with exceptions. The
same applies to schedule 2579, for example.
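
The same comparison can be sketched outside a database. This is only an illustration, assuming the trips have already been loaded as dictionaries; it skips the join to calendar.txt, and the function name and the rows marked hypothetical are made up for the example. Like the query, it compares train numbers only, not stops or times.

```python
def trips_only_in(trips, special_service, base_service):
    """Return (trip_short_name, trip_id) for trips in special_service whose
    train number never appears in base_service."""
    base_names = {t["trip_short_name"] for t in trips
                  if t["service_id"] == base_service}
    return [(t["trip_short_name"], t["trip_id"])
            for t in trips
            if t["service_id"] == special_service
            and t["trip_short_name"] not in base_names]

TRIPS = [
    # Two rows taken from the result list above.
    {"trip_short_name": "7611", "trip_id": "3060105", "service_id": "2578"},
    {"trip_short_name": "7613", "trip_id": "3060106", "service_id": "2578"},
    # Hypothetical train that appears in both schedules.
    {"trip_short_name": "9999", "trip_id": "hyp-base", "service_id": "1"},
    {"trip_short_name": "9999", "trip_id": "hyp-game", "service_id": "2578"},
]

print(trips_only_in(TRIPS, "2578", "1"))
# [('7611', '3060105'), ('7613', '3060106')]
```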

What this does is explode the trips data. In the delivered file,
there are 5189 trips defined. If I clean up the data manually, I can
reduce this to 1147 (I also removed trips that occurred in the past,
like those scheduled for 3/22, 3/31, and 4/2, since these were before
I would distribute the new data). Each of these extra trips also
needs its own stop times. In the data as delivered, there are 64997
stop times; cleaned up, there are 14416. As you can see, that is a
substantial difference. This has an effect on data size, processing
speed, etc. It also has a big effect on looking at the data to match
it to the paper schedules. The paper schedules sure do not have 5189
trips.

Before the new process, the data looked like my cleaned-up data. The
new process seems to be very inefficient in producing the scheduling
exceptions. So before I go writing an algorithm to do what I did
manually, I am wondering if the process that produces the data could
be improved to provide data in a more efficient manner: standard
schedules like Mon-Fri, Sat/Sun, Fri, Sun, etc., and then just the few
exceptions to these schedules.

I can write an algorithm to rebuild the data in this format and remove
about 4/5ths of the data for myself, but probably everyone would benefit
if the delivered data were made more efficient and straightforward.
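
As a quick check of the "about 4/5ths" figure, the reduction can be worked out from the counts quoted above. The helper name is made up for this sketch.

```python
def reduction(before, after):
    """Fraction of records removed when going from `before` to `after`."""
    return (before - after) / before

trips_cut = reduction(5189, 1147)        # trips.txt: delivered vs. cleaned up
stop_times_cut = reduction(64997, 14416) # stop_times.txt: delivered vs. cleaned up

print(f"trips: {trips_cut:.1%}, stop_times: {stop_times_cut:.1%}")
```

Both come out at roughly 78%, so a little under four fifths of the rows in each file are removed.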

Brett

John L

Apr 7, 2011, 8:44:05 AM
to mtadevelop...@googlegroups.com

The exceptions listed are the baseball and/or football schedules. In our current scheduling system, there is no way to determine if a trip has changed a time or stop, so to be conservative, the whole thing is added. I would also point out that most, if not all, Hudson Line expresses from Croton-Harmon have Yankees-East 153rd Street added as a stop, so there is more than just trips being added. This is the way we will be adding special and one-day schedules going forward. This would include holidays and getaways as well.
In our next iteration of scheduling software, this should not be an issue.
You can choose to do what you want with the data set, but if you eliminate a trip of the same name that has added/removed stops or a time change, you run the risk of eliminating legitimate trips.

For sure there should be a weekday 1810 and we will investigate this. I am sure this is related to bussing. However, from the above, 1810 will still be added for special and one day schedules.

I will say this: we, of all developers, know about the large duplication that occurs with special schedules and one-day schedules. We took great care to not provide Monday, Tuesday through Thursday, Friday, Saturday and Sunday base schedules. The same logic for removing duplicated trips cannot simply be applied to the one-off schedules; we have to compare stop and time differences for each train in the schedule. We know the what and how; it is the when that is the question. The information is not wrong, does not break trip logic, and conforms to the GTFS standard.
If this poses a systemic problem to apps, well, the risk gets elevated. So if it does cause problems delivering accurate information in a timely manner, please let me know. I can use that....

Brett Bergquist

Apr 7, 2011, 9:41:15 AM
to mtadevelop...@googlegroups.com

Thanks for the response, John.

Before my algorithm removes anything, it does compare all stops and times because you are correct: if a trip changes in its stops or times, then it truly is not the same trip. So those cannot be removed.

And yes, the trip information is not wrong and does conform. It does add a burden to low-memory and storage-constrained devices, so that is why I am cleaning up the data. When I come up with a programmatic solution that provides the same data in a more condensed format, taking into account all of the considerations that you have listed, would it be useful to provide the application back to you for possible use? This will probably be done in Java but possibly in Perl or PHP.
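
The stop-and-time comparison described above could look something like the following. This is a sketch under stated assumptions, not the actual tool: it is written in Python rather than the languages mentioned, the field names follow the usual GTFS stop_times.txt columns, and the trip IDs, stop IDs, and times in the sample are hypothetical.

```python
from collections import defaultdict

def trip_signatures(stop_times):
    """Map trip_id -> ordered tuple of (stop_id, arrival_time, departure_time)."""
    by_trip = defaultdict(list)
    for st in stop_times:
        by_trip[st["trip_id"]].append(
            (int(st["stop_sequence"]), st["stop_id"],
             st["arrival_time"], st["departure_time"]))
    # Sort by stop_sequence, then drop it so only stops and times are compared.
    return {trip_id: tuple(row[1:] for row in sorted(rows))
            for trip_id, rows in by_trip.items()}

def is_same_trip(signatures, trip_a, trip_b):
    """True only if both trips make the same stops at the same times."""
    return signatures[trip_a] == signatures[trip_b]

def st(trip_id, seq, stop_id, time):
    """Build one hypothetical stop_times row."""
    return {"trip_id": trip_id, "stop_sequence": str(seq), "stop_id": stop_id,
            "arrival_time": time, "departure_time": time}

# Hypothetical data: one base trip, an identical game-day copy, and a
# game-day copy with an added stop.
STOP_TIMES = [
    st("base", 1, "A", "09:00:00"), st("base", 2, "C", "09:30:00"),
    st("game_same", 1, "A", "09:00:00"), st("game_same", 2, "C", "09:30:00"),
    st("game_extra", 1, "A", "09:00:00"), st("game_extra", 2, "B", "09:15:00"),
    st("game_extra", 3, "C", "09:30:00"),
]

sigs = trip_signatures(STOP_TIMES)
print(is_same_trip(sigs, "base", "game_same"))   # True: safe to treat as a duplicate
print(is_same_trip(sigs, "base", "game_extra"))  # False: must be kept
```

Only trips whose signatures match exactly are candidates for removal; a trip with an added stop or a changed time compares as different and stays in the data.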

Joe Hughes

Apr 7, 2011, 9:58:38 AM
to mtadeveloperresources
On Apr 7, 2:41 pm, "Brett Bergquist" <br...@thebergquistfamily.com>
wrote:
> And yes, the trip information is not wrong and does conform.  It does add a
> burden to low memory and storage constrained devices, so that is why I am
> cleaning up the data.  When I come up with a programmatic solution that
> provides the same data in a more condensed format, taking into all of the
> considerations that you have listed, would it be useful to provide the
> application back to you for possible use?   This will probably be done in
> Java but possibly in Perl or PHP.

It seems like the discussion of possible lossless transformations of
GTFS data would be interesting not only to mtadeveloperresources, but
also to the larger transit-developers community [1]—do share whatever
you come up with!

Cheers,
Joe

[1] http://groups.google.com/group/transit-developers

John L

Apr 7, 2011, 11:42:00 PM
to mtadevelop...@googlegroups.com

Most definitely. We do have some developed code, but anything to push it along is helpful. Post it to me if need be, but I need the justification from the posts even more in order to push it, so thanks for the posts!

Still looking at the Danbury stuff and may repost tomorrow, or Monday at the latest.

Brett Bergquist

Apr 8, 2011, 7:39:03 AM
to mtadevelop...@googlegroups.com
Maybe I don't understand something, but I am having trouble figuring out some of the Yankee trip data.

Looking at the New Haven Weekend Yankee PDF, I see trip 7533 which is the first trip for a 1:05 weekend game. If I look in the data, I see:

MacbookPro:google_transit brett$ grep 7533 trips.txt
3,2577,3059567,New Haven - Yankees-E153 St.,7533,1,,6,1
MacbookPro:google_transit brett$

So this trip has service ID 2577. Now if I look at:

MacbookPro:google_transit brett$ grep 2577 calendar.txt
2577,0,0,0,0,0,0,0,20110403,20111007

and

MacbookPro:google_transit brett$ grep 2577 calendar_dates.txt
2577,20110403,1
2577,20110501,1
2577,20110522,1
2577,20110612,1
MacbookPro:google_transit brett$

So my understanding is that trip 7533 only runs on 4/3, 5/1, 5/22, and 6/12. But when I look at the actual Yankees schedule, I see home games on 4/16 and 6/11 (Saturdays) that start at 1:05 PM but do not seem to have entries in the data. Is this correct? It does not seem to match the paper schedule, which makes no reference to exceptions for Saturdays, for example.
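
The reading of calendar.txt and calendar_dates.txt used above can be sketched as follows. It assumes the standard GTFS column order for calendar.txt (service_id, monday through sunday, start_date, end_date) and the standard exception_type meanings (1 adds service on a date, 2 removes it); the function names are made up for the example.

```python
from datetime import date, datetime, timedelta

DAY_COLUMNS = ["monday", "tuesday", "wednesday", "thursday",
               "friday", "saturday", "sunday"]
CALENDAR_COLUMNS = ["service_id"] + DAY_COLUMNS + ["start_date", "end_date"]

def parse(yyyymmdd):
    """Parse a GTFS date string such as 20110403."""
    return datetime.strptime(yyyymmdd, "%Y%m%d").date()

def service_dates(calendar_row, exceptions):
    """Return the set of dates on which a service runs.

    calendar_row: dict keyed by CALENDAR_COLUMNS.
    exceptions: list of (date_string, exception_type) for that service.
    """
    active = set()
    day, end = parse(calendar_row["start_date"]), parse(calendar_row["end_date"])
    while day <= end:
        if calendar_row[DAY_COLUMNS[day.weekday()]] == "1":
            active.add(day)
        day += timedelta(days=1)
    for day_string, exception_type in exceptions:
        if exception_type == "1":    # service added on this date
            active.add(parse(day_string))
        elif exception_type == "2":  # service removed on this date
            active.discard(parse(day_string))
    return active

# The rows quoted above for service 2577.
calendar_2577 = dict(zip(CALENDAR_COLUMNS,
                         "2577,0,0,0,0,0,0,0,20110403,20111007".split(",")))
exceptions_2577 = [("20110403", "1"), ("20110501", "1"),
                   ("20110522", "1"), ("20110612", "1")]

dates = service_dates(calendar_2577, exceptions_2577)
print(sorted(d.isoformat() for d in dates))
print(date(2011, 4, 16) in dates)  # False
```

Because every day flag is zero, the service runs only on the four added dates, all of which are Sundays, and 4/16 is not among them.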


Chris Schoenfeld

Apr 8, 2011, 3:32:04 PM
to mtadevelop...@googlegroups.com, Aaron Donovan
I've always wanted to do an on-site meeting of MNR MTA Developers and
GTFS consumers to sit down and throw our heads together about
processes, distribution, changes, and QA.

I've been trying to solve this problem since 2008, but right now I
feel like a lot of us are all having similar issues and need to have a
working meeting to try to streamline the process.

There is no doubt in my mind that if we were to spend any amount of
time doing this, it would save John's team and GTFS consumers like
myself a lot of anxiety, frustration, and confusion moving forward.


Chris

John L

Apr 8, 2011, 7:37:27 PM
to mtadevelop...@googlegroups.com

There are different schedules for weekday days and evenings, and for weekend days and evenings.

John L

Apr 8, 2011, 7:46:45 PM
to mtadevelop...@googlegroups.com

I am tooling with this on the weekends now, seeing the interest in it ;)

Brett Bergquist

Apr 9, 2011, 12:45:46 PM
to mtadevelop...@googlegroups.com
John, I hate to be a pain, but I am still confused, so if you could help make me smarter, it will be much appreciated.

I go to 
and see that there is a game on Saturday 4/16 at 1:05 against the Rangers.  I then go to:
and for the New Haven Line, Weekend, 1:05 games, there appears a train 7533 departing at 9:46 and arriving at 11:24.  I then go to:
and look for the same train but do not see it.  I then look at the data and there is a 7533 trip listed and this trip is associated with schedule ID 2577.  Schedule 2577 in calendar.txt says it is effective from 4/3 to 10/07 but includes no days of the week selected.  There are 5 service include exceptions with service ID 2577 in calendar_dates.txt but none of these are for 4/16.  

Where is the entry in the GTFS data for this trip for this day? On the printed version of the schedule, I don't see anything anywhere that says train 7533 does not run on this date or that it only runs on specific dates. So I am confused as to what is correct. Is the print version of the schedule correct? Does train 7533 run on 4/16?

See my confusion?

John L

Apr 10, 2011, 9:50:51 AM
to mtadevelop...@googlegroups.com
I think that is an error on our side. Seems like we don't have a Saturday 1:05 PM schedule in there.

Sorry for not understanding the original question.

Just to recap, the following will be added to calendar_dates.txt until we work up a way to combine the like schedules:

Monday 1:05 PM
Monday 7:05 PM 
Tuesday - Thursday 1:05 PM 
Tuesday - Thursday 7:05 PM
Friday 1:05 PM 
Friday 7:05 PM
Saturday 1:05 PM 
Saturday 4:05 PM and 4:10 PM
Saturday 8:05 PM
Sunday 1:05 PM 
Sunday 4:05 PM and 4:10 PM
Sunday 8:05 PM

Which we want to turn into

Monday - Friday 1:05 PM
Monday - Friday 7:05 PM
Saturday - Sunday 1:05 PM
Saturday - 4:05 PM and 4:10 PM
Saturday - Sunday 8:05 PM

Then comes football season, which complicates it further when there are Giants and Jets games on the same weekend as the Yankees. With the lockout, I don't expect the schedules to be released in April as hoped, so we just have to wait for that, but it would look like this in the worst case:

Monday - Friday 1:05 PM
Monday - Friday 7:05 PM 
Saturday - Sunday 1:05 PM
Saturday - Sunday 1:05 PM + Football
Saturday - 4:05 PM and 4:10 PM
Saturday - 4:05 PM and 4:10 PM + Football
Saturday - Sunday 8:05 PM
Saturday - Sunday 8:05 PM + Football

I also cannot just add the trains to a regular schedule like 

1,1,1,1,1,1,0,0,20110401,20110630
2577,0,1,1,1,0,0,0,20110401,20110630

because they really don't run on the regular schedule, which is why the day flags are all zeros.

I also can't just add the additional service for those games into one schedule and leave out the regularly scheduled service, because of the Hudson Line express trains making an added stop on game days only, or any other slight change in time. I can only combine like schedules, as we did for the base schedules.

Also, be glad the Yanks are away on Memorial Day and Fourth of July this year >.< 

So at worst through baseball and football season you will have 8 schedules plus holidays (3 schedules per holiday) for a total of 11 active service types in the GTFS if we update them after the holidays or 17 if we leave them in. This is in addition to the 5 or 6 base schedules for an overall total of 16 - 23 schedules. This is not including the November and December holiday schedules either.
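
The totals above can be reproduced with a few lines of arithmetic. The number of holidays is not stated in the post; three is an assumption inferred from the step from 11 to 17 at 3 schedules per holiday.

```python
game_schedules = 8          # worst case through baseball and football season
schedules_per_holiday = 3
holidays = 3                # assumption: inferred from 17 - 8 = 9 = 3 * 3

# Holiday schedules dropped after each holiday: only one holiday active at a time.
special_if_updated = game_schedules + schedules_per_holiday
# Holiday schedules left in the feed: all holidays stay active.
special_if_left_in = game_schedules + schedules_per_holiday * holidays

lowest_total = special_if_updated + 5    # with 5 base schedules
highest_total = special_if_left_in + 6   # with 6 base schedules

print(special_if_updated, special_if_left_in)  # 11 17
print(lowest_total, highest_total)             # 16 23
```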

Complicated to say the least

Brett Bergquist

Apr 13, 2011, 8:38:31 PM
to mtadevelop...@googlegroups.com
No problem, John. Thanks for taking the time to look!

Wayne

Apr 18, 2011, 7:20:44 PM
to mtadevelop...@googlegroups.com, Aaron Donovan
Been buried the last few weeks, but did want to chime in here...

I think Chris has a good idea, and it could be very beneficial to both sides if we have a well-defined agenda and clearly defined expectations before meeting: what's in scope, out of scope, take-aways, etc.

Part of the struggle, IMHO, sometimes stems from not having a good understanding as to "why" data is tagged the way it is; and I realize it may depend heavily on legacy processes and current limitations - not looking to point fingers, just a better understanding. The point is, if as developers we have a better understanding and/or comfort that the data format will remain consistent AND we better understand the decisions that determine the data set, then we should be able to reduce the time it takes to get new sets implemented.

We definitely want to contribute in streamlining the process.

Other thoughts?
