NYC subway GTFS unreadable.


Susan Donovan

Aug 16, 2018, 7:20:03 PM
to GTFS-realtime

Summary:
-I want to be able to use GET in the esp8266 library to load a text URL that contains the times for the next trains at a particular stop. I can parse any text-based format (JSON, XML, anything), but GTFS isn't formatted as text.
-I'm programming using the Arduino IDE, which uses C/C++
-I can also use PHP to pre-process the data if needed, but can't find a simple way to do this
-I'd rather not have to install a new application or language on my server if possible.
-I don't have an easy way to run Python

How can I get the real-time subway data to my appliance? 
If I must install an app or learn a whole new language or protocol, which one?



For a long time I've wanted to make a transit appliance for my husband that shows the next train, but GTFS has always stopped me. I already got it working for NYC buses and Metro-North.



In the case of the bus and Metro-North, there was a JSON or XML file that I could parse for the next two buses or trains for a given stop ID.

But the subway uses GTFS. I got a developer key and read the documentation:

http://datamine.mta.info/sites/all/files/pdfs/GTFS-Realtime-NYC-Subway%20version%201%20dated%207%20Sep.pdf

Even so, the data look like this:


"
 1.0´‹“€ >

 1.0
 G≥Í“€∞

 36000001£

 2

 109550_G..N20180815*G >
 1G 1815 CHU/CRS#õ‹“€õ‹“€"G26N( >
 E2#±›“€±›“€"G24N( >
E2E2#Ì›“€Ì›“€"G22N( >
 D2D2J
 36000002">
2
109550_G..N20180815*G >



So, seeing this, I get that it's not text. I'm programming an esp8266, a tiny microcontroller. How can I parse this? I looked into creating a web page on my server (where I have PHP) to do the parsing. Google says GTFS works with PHP, but all of the applications I've found require installing a whole lot of stuff on my server, or they use Python, which isn't available on my server.

I feel like I must be missing something. What is the simplest way to do this?


If the MTA's "train time" site were more text-based, I'd just scrape it. I'm that frustrated.










Joachim Pfeiffer

Aug 17, 2018, 12:23:26 AM
to gtfs-r...@googlegroups.com
For the subway feed, MTA uses GTFS-Realtime with extensions for their subway. GTFS-RT is based on a data exchange mechanism called Protocol Buffers. The development workflow has you compile the formal Protocol Buffers definitions of GTFS-RT plus the NYCT extensions into a target codebase, which you integrate into your train departure parser. These definitions, as well as the Protocol Buffers compiler, are open source. I compiled into Java, and the resulting codebase is 3.6 MB. This lets you periodically download the realtime feed data from MTA's web site and decode it into structured, readable data types. All of it is designed for a server-proxy setup that you build and run to feed your clients, i.e. your microcontroller-based devices.
Good news:
1. I compiled the codebase that I still use way back in 2013. GTFS-RT and Protocol Buffers have proven rather stable over time.
2. Once you have this going, you can reuse your work for other agencies that offer realtime data in GTFS-RT.
Not so good news:
1. This is not suitable for individual clients such as smartphone apps, let alone microcontrollers, as far as I can see anyway. Not least because:
2. You need a server proxy, as MTA limits per-service access to their GTFS-RT feed files to no more often than every 30 s, IIRC. Assuming you had hundreds of clients in the field, they would exceed the frequency at which you're allowed/asked to hit MTA's website by a wide margin.
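
A minimal sketch of the server-proxy idea in point 2, assuming the 30-second minimum interval recalled above (not checked against MTA's current terms). Python is used only for illustration. `CachedFeed`, `fetch` and `clock` are hypothetical names; `fetch` stands in for whatever actually downloads the feed bytes.

```python
import time
from typing import Callable, Optional


class CachedFeed:
    """Serve any number of clients from one upstream download per interval."""

    def __init__(self, fetch: Callable[[], bytes], min_interval: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch                # hypothetical: downloads the raw feed
        self._min_interval = min_interval  # assumed limit, in seconds
        self._clock = clock                # injectable so the logic can be tested
        self._data: Optional[bytes] = None
        self._fetched_at: Optional[float] = None

    def get(self) -> bytes:
        now = self._clock()
        stale = (self._fetched_at is None
                 or now - self._fetched_at >= self._min_interval)
        if stale:
            # Only this branch touches the agency's server.
            self._data = self._fetch()
            self._fetched_at = now
        return self._data
```

However many devices call `get()`, the agency sees at most one request per interval; everyone else is served the cached copy.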
JP



--
You received this message because you are subscribed to the Google Groups "GTFS-realtime" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gtfs-realtime+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gtfs-realtime/e575fe24-f177-4ea5-9efd-0fe5dcb1084c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Susan Donovan

Aug 17, 2018, 7:19:11 AM
to GTFS-realtime
Thank you for the clear response. I'm just building this for fun and it will be one of a kind so I don't think having too many queries will be an issue. 

Just to check if I understand: GTFS tries to avoid taxing servers by making app builders download a huge file with everything (and host files that help them make sense of that file) then re-serve the information to their app users or customers.

But what if I just want to post the next two trains at a particular stop on my personal website? GTFS has nothing to help with that?

I guess I'm going to have to set up a whole huge database on my webserver to get two lines of text. 

But I'm still really motivated to make this happen. Can I use PHP and MySQL?

Stefan de Konink

Aug 17, 2018, 7:25:20 AM
to gtfs-r...@googlegroups.com
On Friday, 17 August 2018 13:19:11 CEST, Susan Donovan wrote:
> Thank you for the clear response. I'm just building this for
> fun and it will be one of a kind so I don't think having too
> many queries will be an issue.

It will be an issue, since your single 'fun app' uses the same
resources on the NYC subway server as an intermediate server for 100,000+
people.


> Just to check if I understand: GTFS tries to avoid taxing
> servers by making app builders download a huge file with
> everything (and host files that help them make sense of that
> file) then re-serve the information to their app users or
> customers.
>
> But what if I just want to post the next two trains at a
> particular stop on my personal website? GTFS has nothing to help
> with that?

GTFS and GTFS-RT exchange all information in an efficient way. Unlike
with other services, you can't subscribe to single stops or single
routes. This is also the crux for your embedded implementation.
Theoretically you could process protobuf in C(++) or Lua. But given the
size of the protobuf files, the bandwidth required is probably prohibitive.


> I guess I'm going to have to set up a whole huge database on my
> webserver to get two lines of text.
>
> But I'm still really motivated to make this happen. Can I use PHP and MySQL?

Theoretically you wouldn't have to. GTFS-RT TripUpdates may themselves contain
everything you need, depending very much on what you want to display. You would
just need to filter for the trips/routes you are interested in. A match could
then cause an update to be sent to your microcontroller.

If it is just a countdown to the next bus, this would work; if you want to show
directions, that might be more of a problem.
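
A rough sketch of that filtering step, assuming the feed has already been decoded into nested dicts whose keys mirror the GTFS-realtime field names (`entity`, `trip_update`, `stop_time_update`, `stop_id`, `arrival.time`). `next_arrivals` is a hypothetical helper, the stop IDs are borrowed from the dump earlier in the thread, and the timestamps are made up for illustration.

```python
from typing import Any, Dict, List


def next_arrivals(feed: Dict[str, Any], stop_id: str, now: int,
                  limit: int = 2) -> List[int]:
    """Return whole minutes until the next `limit` arrivals at `stop_id`."""
    times = []
    for entity in feed.get("entity", []):
        trip_update = entity.get("trip_update")
        if trip_update is None:
            continue  # alerts and vehicle positions carry no predictions
        for stu in trip_update.get("stop_time_update", []):
            arrival = stu.get("arrival", {}).get("time")
            if (stu.get("stop_id") == stop_id
                    and arrival is not None and arrival >= now):
                times.append(arrival)
    return [(t - now) // 60 for t in sorted(times)[:limit]]


# Made-up sample data; times are seconds since the epoch.
sample_feed = {"entity": [
    {"id": "1", "trip_update": {"stop_time_update": [
        {"stop_id": "G26N", "arrival": {"time": 1240}},
        {"stop_id": "G24N", "arrival": {"time": 1400}}]}},
    {"id": "2", "trip_update": {"stop_time_update": [
        {"stop_id": "G26N", "arrival": {"time": 1660}}]}},
    {"id": "3", "trip_update": {"stop_time_update": [
        {"stop_id": "G26N", "arrival": {"time": 900}}]}},  # already in the past
    {"id": "4", "alert": {}},
]}
print(next_arrivals(sample_feed, "G26N", now=1000))  # [4, 11]
```

The result is small enough to send to a microcontroller as a line or two of text.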

--
Stefan

Susan Donovan

Aug 17, 2018, 8:00:09 AM
to GTFS-realtime
Thanks Stefan,

This clarifies what's going on and what I'm up against. There is a Protocol Buffers library for C++, but I don't see people having much success getting Arduino programs that use it compiled under 32k.

So, that is out. 

What I can do is write something in PHP. Can PHP parse the GTFS and serve up the two lines of text I need (the next arrival times of a particular train at a particular stop)? I think this is part of the "trip update" data set. Back to reading documentation and other forums.

Thank you for your help. I will post here if I make more progress. 

S.

Joachim Pfeiffer

Aug 17, 2018, 10:13:31 AM
to gtfs-r...@googlegroups.com
Yes, controlling agencies' server loads is at least part of the motivation behind GTFS-RT and how it's used to enable next-bus and next-train predictions. Back when Google started bringing realtime transit data into Maps, they realized they could not scale up by calling agency APIs that query individual stops, as regular developer APIs offer. It's easy to envision how their potential user base would put them at risk of taking down transit agencies' servers, akin to a denial-of-service attack. So they lobbied for GTFS-RT. Today, Google and everybody else download three files: vehicle locations, trip updates relative to the static schedule obtained from the agency's posted GTFS files, and service messages. This provides a snapshot of the full scope of the agency's service provision: all lines, all stops, all buses, all trains (that are realtime enabled). (Side note: nobody on the list please read into this that they're the bad guys or something; we wouldn't have this level of publicly accessible transit data if it weren't for Google pushing for this, and for GTFS before it.) Consumers of GTFS-RT data then integrate this to feed their websites, mobile apps, etc.
You may have noticed the distinction I made between GTFS-RT and GTFS. The former is the binary-only, Protocol Buffers-based format that delivers the realtime data snapshot of vehicle locations and so forth. GTFS feeds are separate from that. GTFS came before GTFS-RT and contains the static schedule of an agency or operator. At minimum, a new GTFS feed is expected to be posted by the agency days before a new schedule goes into effect. GTFS feeds include stops (stations) with their geo-coordinates, stop IDs, stop names, routes, trips, the scheduled stop departure times and so forth. Generally, when producing a next-train time, you need to have this available and processed as well, so you can estimate when a train approaching a station will arrive. This is generally done by correlating the TripUpdate data provided by GTFS-RT with the scheduled data of that trip in GTFS. If there's no TripUpdate for the trip (or, in the case of NYCT Subway, train) in GTFS-RT, it is assumed the train runs according to the GTFS schedule, and you use that. In any case, that's how I process GTFS-RT from agencies across the land. In the case of NYCT Subway there might be a shortcut around all this. I haven't studied their particular implementation enough, but it might be worth a look to see if they include everything you need in GTFS-RT with NYCT extensions, so you can avoid the extra steps of correlating TripUpdates with the static schedule.
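
The correlation described above can be sketched roughly as follows. This is a simplification under stated assumptions: both inputs are taken to be already reduced to absolute arrival times in epoch seconds keyed by (trip_id, stop_id), whereas real GTFS stop_times.txt stores times of day that first have to be resolved against the service date. `best_arrival_times` is a hypothetical name.

```python
from typing import Dict, Tuple

Key = Tuple[str, str]  # (trip_id, stop_id)


def best_arrival_times(scheduled: Dict[Key, int],
                       realtime: Dict[Key, int]) -> Dict[Key, int]:
    """Prefer the realtime prediction; fall back to the static schedule."""
    merged = dict(scheduled)   # start from the timetable (copy, not mutate)
    merged.update(realtime)    # overwrite wherever a TripUpdate exists
    return merged
```

Trips that appear only in the realtime data (for example, added trains) are carried through as well.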
JP


Paul Harrington

Aug 17, 2018, 5:12:39 PM
to GTFS-realtime

As an aside, Joa (sorry for hijacking your thread, Susan): do you know if the MTA feed would work with a standard Protocol Buffers jar? I use protobuf-java-2.6.1.jar. Would parsing TripUpdates and VehiclePositions fail with this version, or would it work, just without the extra information provided by the extension?

Regards Paul.


Stefan de Konink

Aug 17, 2018, 5:15:27 PM
to gtfs-r...@googlegroups.com
On Friday, 17 August 2018 23:12:39 CEST, Paul Harrington wrote:
>
> As an aside Joa (sorry for hijacking your thread Susan) do you
> know if MTA would work with a standard protocol buffer jar ? I
> use protobuf-java-2.6.1.jar . Would it be that the parse of
> TripUpdates and VehiclePositions would fail with this version or
> would it work but just without the extra information provided by
> the extension ?

It will work; it basically ignores fields it doesn't know about.
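
A rough illustration of why that works, assuming nothing beyond the documented Protocol Buffers wire format: every field is prefixed with a tag that encodes its field number and wire type, and the wire type alone tells a parser how many bytes to step over. `read_varint` and `parse_fields` are hypothetical helpers written for this sketch, not part of any protobuf library, and field number 1001 merely stands in for an extension field.

```python
from typing import Dict, List, Set, Tuple, Union

Value = Union[int, bytes]


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a base-128 varint starting at pos; return (value, new_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_fields(buf: bytes, known: Set[int]) -> Dict[int, List[Value]]:
    """Collect raw values of known field numbers; step over everything else."""
    out: Dict[int, List[Value]] = {}
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:                      # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:                    # fixed 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:                    # length-delimited
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:                    # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if field in known:
            out.setdefault(field, []).append(value)
        # Unknown fields are simply dropped: their size was known from the tag.
    return out


msg = (bytes([0x08, 0x96, 0x01])             # field 1, varint 150
       + bytes([0x12, 0x01]) + b"G"          # field 2, 1-byte string
       + bytes([0xCA, 0x3E, 0x02]) + b"xy")  # field 1001, 2-byte payload
print(parse_fields(msg, known={1, 2}))       # {1: [150], 2: [b'G']}
```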

--
Stefan

Nikhil VJ

Sep 18, 2018, 5:13:47 AM
to GTFS-realtime
Hi Susan,

Very interesting project! Consider this approach, and disclaimer: I'm just thinking aloud, have not actually done this yet:

1. At least in Python, I know the GTFS-RT feed converts directly to a hierarchical dict/JSON. You can think of the Protocol Buffers format as a compressed JSON. I'm saying this practically, not "by the book". I guess it would be similar in PHP.

2. Host on some service that lets people host WordPress websites or forums (you may already have an account if you work in PHP+MySQL; they're the cheaper web hosts that give you a cPanel).

3. Write a PHP script to process GTFS-RT's .pb file, and save the output to a .json or .csv or whatever suits you. You can cut out unnecessary info and keep just what you need.

4. The saved file should become available on a public URL.

5. Make your Arduino download that saved file from your website every 30 secs.

6. Two ways of doing the timing:

6.a. Set a cron job from cPanel to download the .pb file every 30 secs or so and then run the .php script. That would be two shell commands separated by a semicolon (;) in your cron job. Or put the commands in a .sh shell script and run it in the job. Then your Arduino only needs to download the .json/.csv from one fixed URL.

6.b. Keep the .pb download within your PHP script itself, and make the Arduino hit your .php script's URL. That triggers the script, which downloads the .pb, processes it and "echo"s your data as output. This would entail some wait time, though, so your Arduino shouldn't time out on the connection too soon.

7. The second option has the advantage that you will pester the transit agency only when your Arduino is running. But this won't work if there are multiple devices out there all hitting your PHP script, so the second option is for single-user setups only, while the first is better suited to multiple users.

8. The first option carries a risk of a device downloading the file while it's being written, so we have to code to protect against that. Make your PHP script do the processing, store the output in a variable, and then do a quick file write at the end. Build some way of checking file integrity on the Arduino side.

9. There are ways to schedule the cron job to run only during daytime hours, on weekdays, etc. Another hack is to place or remove a "lock" or "flag" file on the server, and have the shell script in your cron job check for its presence before running.

10. I didn't mention MySQL here because with GTFS-RT you don't really need to archive the data (you can if you want to). The purpose is to fetch whatever is latest. The feed retains past data until an agency-decided time limit, so even updates that came in more than 30 secs ago will be reflected in the feed until a more recent update replaces them. And seeing that your final output is an LED display, I reckon you'll need only a very small part of the incoming data anyway. You could use your DB in place of the .json/.csv output file discussed above.
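
One way to guard against the half-written file in point 8 is to write to a temporary file and rename it into place, sketched here in Python purely for illustration. `write_json_atomically` is a hypothetical helper, and the 0o644 permission is an assumption about what the web server needs in order to read the file. Separately, note that standard cron schedules no finer than once a minute, so a 30-second cadence would need something like two runs per invocation with a sleep in between.

```python
import json
import os
import tempfile
from typing import Any


def write_json_atomically(path: str, payload: Any) -> None:
    """Write payload so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temporary file in the same directory, so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(payload, tmp)
        # mkstemp creates owner-only files; loosen so the web server can read it.
        os.chmod(tmp_path, 0o644)
        # Swap the finished file into place in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial file, then re-raise
        raise
```

A device that downloads the file during an update gets the previous complete version rather than a truncated one.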

All the very best!

- Nikhil VJ
Pune, India

Busparrot

Sep 18, 2018, 12:22:32 PM
to gtfs-r...@googlegroups.com
