Caching GTFS-RT?


Sheldon A. Brown

Feb 28, 2017, 12:01:10 PM
to transit-d...@googlegroups.com
Hello All,

I'm tasked with hosting a GTFS-RT feed for a large transit agency.
This feed could potentially be requested several hundred times a
second but will only be updated every 15-30 seconds (for example). I
was looking to see what others have done in cases like this? Things
like:

1) Application level caching -- rolling your own
2) Webserver level caching -- Apache mod_cache or equivalent
3) AWS CloudFront or equivalent
4) Something else?


Any anecdotes and experiences are welcome!

Thanks,

Sheldon

Paul Harrington

Mar 1, 2017, 3:44:52 AM
to Transit Developers
I'm only a consumer, but the GTFS-RT feeds I've seen are not too big. So if I were you, every time I got an updated feed I would read it into a memory buffer and serve it to clients from there.

You would keep serving the old buffer while you read an update into a new buffer, and as soon as the read is complete, discard the old buffer and serve the new one.
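
A minimal sketch of that pattern in Python (untested; FEED_URL, the port, and the refresh interval are placeholders, and it assumes the upstream hands you an already-serialized FeedMessage):

import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

FEED_URL = "https://example.com/upstream/vehicle-positions.pb"  # placeholder
REFRESH_SECONDS = 15                                            # placeholder

_current = b""  # latest serialized FeedMessage; the reference is swapped atomically

def refresh_loop():
    global _current
    while True:
        try:
            with urllib.request.urlopen(FEED_URL, timeout=10) as resp:
                new_buf = resp.read()   # read the update into a new buffer
            _current = new_buf          # swap; the old buffer gets garbage-collected
        except Exception:
            pass                        # keep serving the old buffer if the fetch fails
        time.sleep(REFRESH_SECONDS)

class FeedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = _current                 # snapshot whichever buffer is current
        self.send_response(200)
        self.send_header("Content-Type", "application/x-protobuf")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

threading.Thread(target=refresh_loop, daemon=True).start()
ThreadingHTTPServer(("", 8080), FeedHandler).serve_forever()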

If you still think capacity may be a problem for a single server, you could use something like HAProxy (Linux only) or an nginx reverse proxy with multiple servers as backends.

Regards,
Paul

Sean Barbeau

Mar 1, 2017, 9:44:34 AM
to Transit Developers
It's been a while since I've looked at this in detail (and you probably already know the following :) ), but IIRC the onebusaway-gtfs-realtime-exporter project (https://github.com/OneBusAway/onebusaway-gtfs-realtime-exporter) does exactly what Paul describes: it holds an in-memory representation of the protocol buffer based on the last poll of the data source, and on an HTTP GET from a consumer it writes that buffer to the output stream (it uses an embedded Jetty web server).

We used the onebusaway-gtfs-realtime-exporter library in the two GTFS-rt feeds we've built:
  1. https://github.com/CUTR-at-USF/HART-GTFS-realtimeGenerator
  2. https://github.com/CUTR-at-USF/bullrunner-gtfs-realtime-generator
...and to my knowledge we haven't seen any scalability issues yet, although we're nowhere near hundreds of hits per second. The above feeds can be put behind any kind of load balancer (they support only FULL_DATASET, not INCREMENTAL). There may be split-second differences if a requester hits a different machine in two sequential requests (i.e., the machines aren't guaranteed to be in lock-step), but that shouldn't matter, and I think it should scale fine even at hundreds of hits per second. AWS CloudFront would mainly help if you expect requests from around the world rather than just your local area.
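
For what it's worth, on the consumer side a FULL_DATASET response is self-contained, so it doesn't matter which backend or cached copy answered a given request. A rough Python sketch using the gtfs-realtime-bindings package (FEED_URL is a placeholder):

import urllib.request
from google.transit import gtfs_realtime_pb2  # pip install gtfs-realtime-bindings

FEED_URL = "https://example.com/gtfs-rt/vehicle-positions.pb"  # placeholder

with urllib.request.urlopen(FEED_URL, timeout=10) as resp:
    feed = gtfs_realtime_pb2.FeedMessage()
    feed.ParseFromString(resp.read())

# header.timestamp tells you how fresh this snapshot is, regardless of
# which server (or cache) produced it.
print(feed.header.timestamp, len(feed.entity), "entities")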

Sean

Sheldon A. Brown

Mar 1, 2017, 10:09:00 AM
to transit-d...@googlegroups.com
Thanks, Sean,

The feed will contain several thousand buses combined with agency-wide service alerts, so I could see it being rather large. Combine that with many requests per second, and the network traffic alone gives me concern.

The input to the feed is queue-based, meaning that without some attempt to synchronize output across a load balancer, clients will get very different answers. While you point out that's not such a bad thing, it makes sense to minimize that behavior if possible. The implementation will be inspired by the references you posted above, but due to some integration dependencies it will end up being quite different (still open source, though).

I've read that the combination of CloudFront's expiration policy and perhaps some file-renaming logic creates a read-through cache. Having user request traffic not hit my back end is very appealing to me. I was wondering if others have tried that specifically, or have other equally interesting alternatives.
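
One variant I'm considering (untested sketch in Python; the bucket, key, and TTL are placeholders) is to push each snapshot to S3 with a Cache-Control header and put CloudFront in front of the bucket, so client requests never reach the generator at all:

import boto3

s3 = boto3.client("s3")

def publish_snapshot(serialized_feed: bytes) -> None:
    # Overwrite a fixed key each cycle; CloudFront re-fetches it only when its
    # cached copy expires, so clients never hit the back end directly.
    s3.put_object(
        Bucket="my-agency-gtfs-rt",           # placeholder bucket
        Key="vehicle-positions.pb",           # fixed key; no renaming needed
        Body=serialized_feed,
        ContentType="application/x-protobuf",
        CacheControl="public, max-age=15",    # roughly matches the update interval
    )

With a fixed key and a short max-age, the expiration policy alone handles freshness; the file-renaming trick would only be needed to bust the cache explicitly.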

Sheldon

Tony Laidig

Jun 8, 2018, 3:04:30 PM
to Transit Developers
I wanted to resurrect this thread with a similar question: has anyone had issues distributing GTFS-rt via a CDN?

I'm tempted to recommend an off-the-shelf CDN for an agency that operates at a similar scale to what Sheldon described (thousands of vehicles, likely hundreds of regular requesters), but I wanted to ask the hivemind whether there are any pitfalls in existing implementations.

Contact me directly if you want to share in confidence.

Cheers,
Tony