Requested improvements for HTTP access to real-time data: compression and conditional requests


Mark Mentovai

Aug 15, 2024, 2:37:26 PM
to MTA Developer Resources
In experimenting with the real-time GTFS-RT data feeds, it occurred to me that accessing them makes relatively inefficient use of the network. The improvements suggested here would benefit both MTA and developers by reducing the amount of data transferred over HTTP. I predict a 97% decrease in network utilization for fast-polling clients, based on two standardized, well-understood, and easily implemented techniques. Both generally enjoy CDN support, which eases integration into existing infrastructure.

Compression
Although a binary format, the encoded protocol buffers used in GTFS-RT generally compress quite well, owing to the large amount of repetitive data contained in feed messages. Looking at a current set of feeds, the uncompressed size of all 10 feeds together is 817kB. Compressed individually at gzip’s default compression level (6), they total 163kB, an 80% reduction. Even at gzip’s fastest level (1), a 78% reduction is possible. Other compression schemes, such as zstd and brotli, are also viable, each with its own tradeoffs.
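For anyone who wants to repeat this kind of measurement on feeds they have already downloaded, a minimal sketch follows. The sample payload is a made-up repetitive byte string, not real GTFS-RT data, and the function name is only illustrative; real feeds will give different numbers.

```python
import gzip


def compression_saving(data: bytes, level: int) -> float:
    """Return the fraction of bytes saved by gzip at the given level (0.0 to 1.0)."""
    compressed = gzip.compress(data, compresslevel=level)
    return 1 - len(compressed) / len(data)


# Hypothetical stand-in for a feed: highly repetitive, like many feed messages.
sample = b"route_id=A;stop_id=A02N;arrival=1723745846;" * 2000

for level in (1, 6):
    saving = compression_saving(sample, level)
    print(f"gzip level {level}: {saving:.0%} smaller")
```

To measure a real feed, read the downloaded file as bytes and pass it in place of `sample`.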

Compression can be implemented at the HTTP layer with Content-Encoding, providing compressed data to clients that advertise the capability via Accept-Encoding, without impact to clients that do not have this ability.
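On the client side, opting in could look roughly like the sketch below. The URL is a placeholder and the helper names are hypothetical; it assumes the server answers with `Content-Encoding: gzip` only when the client has asked for it, which is what the negotiation described above implies.

```python
import gzip
import urllib.request
from typing import Optional


def build_request(url: str) -> urllib.request.Request:
    """Build a request that advertises gzip support to the server."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})


def decode_body(body: bytes, content_encoding: Optional[str]) -> bytes:
    """Undo the Content-Encoding the server reported, if any."""
    encoding = (content_encoding or "identity").strip().lower()
    if encoding == "identity":
        return body  # server sent the feed uncompressed
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    raise ValueError(f"unsupported Content-Encoding: {encoding}")


def fetch_feed(url: str) -> bytes:
    """Fetch a feed, transparently decompressing it when the server compressed it."""
    with urllib.request.urlopen(build_request(url)) as response:
        return decode_body(response.read(), response.headers.get("Content-Encoding"))
```

A client that never sends `Accept-Encoding` would keep receiving the uncompressed body, so existing consumers are unaffected.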

Considering that many consumers are polling this data frequently, implementing a compression scheme like this could save considerable network bandwidth. For example, polling at 1Hz, the uncompressed data (without overhead) consumes about 6.4Mb/s. At 80% compression, this would shrink to 1.3Mb/s. The bandwidth savings from compression are realized independent of poll rate.

Conditional requests
Choosing a polling interval presents its own problems. Many developers are interested in maximal data freshness, but because the actual time that a feed’s content will be updated is unpredictable, this can only be achieved under the current design by decreasing the polling interval as much as practical, such as to 1Hz. As it is now, each request for a GTFS-RT feed will be fulfilled in its entirety, even when the feed hasn’t changed since the last time a client requested it. Assuming that each feed’s content is updated, on average, every 7½ seconds, over the span of a minute, a client polling at 1Hz will retrieve new data 8 times, and a copy of data that it already has 52 more times.

Conditional requests can be implemented at the HTTP layer with ETag/If-None-Match or Last-Modified/If-Modified-Since. In either case, the server would need to provide information at the HTTP layer that it’s not currently providing (ETag or Last-Modified). An ETag computed as a hash of the feed content would be preferable, because the limited (1-second) timestamp resolution of Last-Modified could mask updates on the rare occasions that a feed is updated very rapidly.
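To make the server side concrete, here is a rough sketch of the decision a server or CDN would make per request. It is not a description of MTA's infrastructure; the function names are hypothetical, and SHA-256 is just one reasonable choice of content hash.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(feed_bytes: bytes) -> str:
    """Derive a strong ETag (a quoted string) from a hash of the feed content."""
    return '"' + hashlib.sha256(feed_bytes).hexdigest() + '"'


def respond(feed_bytes: bytes, if_none_match: Optional[str]) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body) for a GET, honoring If-None-Match."""
    etag = make_etag(feed_bytes)
    headers = {"ETag": etag}
    if if_none_match is not None:
        candidates = [c.strip() for c in if_none_match.split(",")]
        # If-None-Match uses weak comparison, so a W/ prefix is ignored.
        candidates = [c[2:] if c.startswith("W/") else c for c in candidates]
        if "*" in candidates or etag in candidates:
            return 304, headers, b""  # client already has this version
    return 200, headers, feed_bytes
```

Because the ETag depends only on the content, two updates within the same second still produce different validators, which is the advantage over Last-Modified noted above.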

With this in place, a client that continues to poll at 1Hz would only receive new feed content when updated, averaging 8 times per minute, and would receive nothing for the other 52 requests during the minute. Independent of compression, in this example, the client polling at 1Hz would see its network bandwidth reduced to 875kb/s (without accounting for overhead). It should be noted that the savings are not as dramatic with extended polling intervals.
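A polling client built on this might look like the following sketch. It assumes the server returns an ETag and answers 304 Not Modified to a matching If-None-Match, which the feeds do not do today; the URL and the `handle_update` callback in the usage note are placeholders.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional, Tuple


def poll_once(
    url: str,
    etag: Optional[str],
    opener: Callable = urllib.request.urlopen,
) -> Tuple[Optional[bytes], Optional[str]]:
    """Make one conditional request.

    Returns (body, etag). body is None when the server reports that the
    feed is unchanged, in which case the previous etag is returned as is.
    """
    request = urllib.request.Request(url)
    if etag is not None:
        request.add_header("If-None-Match", etag)
    try:
        with opener(request) as response:
            return response.read(), response.headers.get("ETag")
    except urllib.error.HTTPError as error:
        if error.code == 304:  # urllib surfaces 304 Not Modified as an HTTPError
            return None, etag
        raise
```

In a 1Hz loop, the caller would keep the returned ETag between iterations and invoke something like `handle_update(body)` only when `body` is not None.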

Compounding these techniques
Compression and conditional requests can be stacked atop one another. In the example of 1Hz polling, network bandwidth could be reduced to 175kb/s on average, a 97% reduction from the current 6.4Mb/s.
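The arithmetic behind these figures can be reproduced with a short script. The unit convention is an assumption (1kB = 1024 bytes, 1Mb = 1024 × 1024 bits), chosen because it reproduces the 6.4Mb/s baseline; the results land within rounding of the numbers quoted in this post.

```python
TOTAL_FEED_KB = 817         # uncompressed size of all 10 feeds, from the post
COMPRESSED_FRACTION = 0.20  # what remains after 80% compression
POLLS_PER_MINUTE = 60       # polling at 1Hz
UPDATES_PER_MINUTE = 8      # one update every 7.5 seconds on average


def to_mbps(kilobytes_per_second: float) -> float:
    """Convert kB/s to Mb/s, assuming binary units (see the note above)."""
    return kilobytes_per_second * 1024 * 8 / (1024 * 1024)


baseline = to_mbps(TOTAL_FEED_KB)  # every poll transfers every feed in full
compressed = baseline * COMPRESSED_FRACTION
conditional = baseline * UPDATES_PER_MINUTE / POLLS_PER_MINUTE
combined = compressed * UPDATES_PER_MINUTE / POLLS_PER_MINUTE
reduction = 1 - combined / baseline

print(f"baseline:    {baseline:.1f} Mb/s")
print(f"compressed:  {compressed:.1f} Mb/s")
print(f"conditional: {conditional * 1024:.0f} kb/s")
print(f"combined:    {combined * 1024:.0f} kb/s ({reduction:.0%} reduction)")
```

The combined reduction depends only on the two ratios (0.20 × 8/60), so it holds regardless of the unit convention.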

This request has clear benefits to developers consuming the MTA GTFS-RT API, as it enables them to conserve network bandwidth while achieving lower latency.

It has clear benefits to developers and MTA alike, considering that network utilization carries costs to all involved.

I encourage MTA to investigate and adopt these techniques, to reduce inefficiencies inherent in the current design.

Mark

Fisher, Will

Aug 15, 2024, 3:17:32 PM
to mtadevelop...@googlegroups.com

Hi,

This is something we can look into. We don’t see that many consumers asking for data that often, but I can see it being an issue for consumers who do. Compression would be huge, as a start.

Best,

Will

Avi Herman

Aug 15, 2024, 11:12:59 PM
to mtadeveloperresources

Hi,

Great idea, Mark. Using HTTP-level compression and conditional requests could drastically cut down network usage, and implementing these optimizations would streamline data access for developers and consumers alike.

As a Computer Science student at NYU, I'm particularly interested in network efficiency. If the MTA is open to it, I'd love to contribute to this project as a fall intern to help bring these improvements to life. Happy to send over a resume and some personal projects I made with the MTA's API. Would either of you know who I could get in touch with?

Best,
Avi Herman

Stephen Bauman

Aug 21, 2024, 6:22:03 AM
to mtadeveloperresources
The MTA isn't the only organization that produces real-time feeds. There are probably hundreds around the world.

The GTFS-RT spec reduces the burden on developers by letting them use these feeds with a single software approach. There are many open source tools that help developers analyze and repackage multiple GTFS-RT feeds from many different sources. Having the MTA unilaterally deviate from this standard would be a step backwards in compatibility and cooperation among transit operators. Imagine the chaos should thousands of operators opt to unilaterally change their real-time feeds, each to a different standard.

At present, all an operator has to do is produce a protobuf for any of its extensions and all a user has to do is incorporate that protobuf extension to decode the feed. Let's try to keep it that way.

My problem with MTA's GTFS-RT feed is when it deviates from the standard.

It would be nice if subway trip IDs did not change en route, or were the same as the trip IDs for the corresponding trips in the static GTFS, as per spec.

It's nice that the MTA's bus GTFS-RT includes an estimate of the number of passengers on a bus. It would have been better had this extension been registered as per spec, or even documented in a protobuf extension. Instead, the MTA used an extension that had already been registered to another organization.

Dan Jabbour

Aug 21, 2024, 9:02:43 AM
to mtadevelop...@googlegroups.com
Stephen -
These suggestions do not change the data distributed by the MTA; they simply take advantage of HTTP standards for data transmission that have been in place for decades.
Both would be opt-in on the client side, since the server acts only on the headers a client sends with its request. Those who do not wish to take advantage of the improvements would simply make no changes, and everything would stay the same.
Returning an ETag or Last-Modified and respecting If-None-Match / If-Modified-Since would really simplify things for us.
-Dan

Sunny

Aug 22, 2024, 7:31:59 AM
to mtadeveloperresources
We are in the process of re-architecting the infrastructure that serves up the feeds (they are stored in an S3 bucket) and will see if these changes are possible to implement.

To the best of our ability we will try not to change the URLs of the files.
