GTFS-RT trip updates without enough memory for full message


Dro Sohrabian

Dec 9, 2023, 8:36:36 PM
to GTFS-realtime
Hi all,

I am trying to get an Arduino device to fetch the .pb file and parse the GTFS-RT trip updates feed. Unfortunately, the processor can't store the entire decoded result in memory (it has only 192 KB of RAM).

Are there any solutions for streaming the protocol buffer message in chunks to only grab the info you are after? For example, I only want to get one stop's worth of updates. 

I have tried writing code that parses the raw bytes 500 at a time, throwing out data as it goes so everything fits in RAM. It has been quite challenging so far, so I want to see if anyone has other solutions to recommend.

Thank you!

Stefan de Konink

Dec 10, 2023, 3:05:21 AM
to gtfs-r...@googlegroups.com
On Sunday, December 10, 2023 2:36:36 AM CET, Dro Sohrabian wrote:
> Are there any solutions for streaming the protocol buffer message in chunks
> to only grab the info you are after? For example, I only want to get one

<https://github.com/google/transit/issues/84>

We have created some software that feeds an MQTT server all tripUpdates in
pieces. This would allow you to subscribe to a specific topic, for a single
trip. That would still pose some challenges if you are after a single bus
stop.

--
Stefan

andrew byrd

Dec 10, 2023, 11:18:14 AM
to gtfs-r...@googlegroups.com
Hello,

To elaborate on previous comments: There are multiple issues at play here. What Stefan is referring to (in the linked ticket) is a discussion on the "incrementality" field in GTFS-RT. The spec still says it's unsupported and its behavior undefined, but in practice people have been using it for about ten years in a fairly standardized way.

## Incremental Message Reception

GTFS-RT messages are often distributed via HTTP GET: as the consumer, you make a request each time you want to check for updates, and you receive all the messages about all parts of the transit system at once, including the messages you've already seen before.

The incrementality field allows for some other options. There are three aspects to this as I've seen it used in practice:

a)  Differential vs. full-dataset: The incrementality field can take on two values: differential and full_dataset. In full-dataset mode, you'll get one big FeedMessage containing every update for every trip or vehicle. In differential mode, you'll receive updates for each trip or vehicle as they stream in, either individually or in small blocks. This may include a guarantee that an update will be provided on every entity at least once every n minutes, or alternatively the producer sending the full dataset when you first connect, then sending only changes. Once you're in differential mode, this opens up possibilities b and c.
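To make the consumer-side difference concrete, here is a minimal Python sketch. The FULL_DATASET/DIFFERENTIAL values match the FeedHeader.Incrementality enum in gtfs-realtime.proto; the FeedState class and the dict-shaped entities are purely illustrative, not from any real library:

```python
# Sketch of consumer-side state handling for the two incrementality modes.
# FULL_DATASET / DIFFERENTIAL match the GTFS-RT FeedHeader enum values;
# everything else here is an illustrative stand-in.
FULL_DATASET = 0
DIFFERENTIAL = 1

class FeedState:
    def __init__(self):
        self.entities = {}  # entity id -> latest known update

    def apply(self, incrementality, entities):
        if incrementality == FULL_DATASET:
            # A full-dataset message replaces everything we know.
            self.entities = {e["id"]: e for e in entities}
        else:
            # A differential message only touches the entities it mentions;
            # an entity flagged is_deleted drops out of our state.
            for e in entities:
                if e.get("is_deleted"):
                    self.entities.pop(e["id"], None)
                else:
                    self.entities[e["id"]] = e

state = FeedState()
state.apply(FULL_DATASET, [{"id": "trip_1"}, {"id": "trip_2"}])
state.apply(DIFFERENTIAL, [{"id": "trip_2", "delay": 120}])
```

After the differential message, the consumer still knows about both trips but holds the updated delay for trip_2 only.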

b) Poll vs. Push: In practice differential messages are usually distributed individually or in small blocks via a message queue. This means the notifications are pushed by the message queueing system as soon as they arrive, rather than pulled by the consumer via occasional polling. It would in principle also be possible to provide differential updates by polling (like full-dataset updates), with the producer tracking the last time each consumer polled, but I don't know of any such implementations. Combining differential+push means that you can receive vehicle positions and trip updates immediately after the vehicles report their position. In some places vehicles provide updates every few seconds, so their position is genuinely known in real time. For example, a system like this: https://junaan.fi/Helsinki (which I think is consuming a different data format, but from a system that communicates internally using differential streaming GTFS-RT).

c) Filtering: Message queue systems often allow filtering by topic. A continuous stream of immediate small messages provided by (b) is already an improvement over full dataset fetching, but if you only care about one route (for an arrival time display panel for example) you don't want to continuously receive and parse thousands of messages per second looking for the relevant ones. So rather than a "firehose" of every message, you subscribe to a topic that includes only messages for that one route. You then receive a single message every few seconds with the latest predictions for the routes you care about.

The advantages of a system like this are evident, and at least in cases I've seen, differential message passing systems are often considered the only reasonable option when a large (national/metropolitan) realtime passenger information system goes through an intentional and thorough design process.

So: if your data producer provides streaming, filtered differential messages or is planning to, this seems perfect for your use case. In fact it is designed with your use case in mind, with similar needs to arrival displays, bus tracking apps, and journey planner routing instances.

If they do *not* provide streaming filtered differential messages (which is the norm), then you will indeed need to periodically download the entire full-dataset feed, parse the entire thing and scan through it until you find the single message you need :)

## Incremental Parsing

I believe all automatically generated parsers I've seen for protobuf messages are intended to parse an entire message and return a full in-memory representation of that message. In the case of GTFS-RT, the top-level FeedMessage will generally contain the entire dataset as a repeating FeedEntity, so the entire deserialized dataset has to fit into memory (which is the problem you were originally asking about).

I think like many serialization libraries, protobuf was originally intended for tiny messages flying back and forth ("user added these three characters to sentence 12") so the libraries around it don't worry much about memory consumption for the in-memory representation.

What might be better for your situation is an event / callback driven parser, which calls a programmer-supplied callback function right after it finishes parsing each message and doesn't retain the in-memory representation after your callback returns. You could either modify the generated parser or write your own parser that works this way.

I believe the generated parsers are typically recursive descent parsers, defining a separate function that consumes each message type, using the stack to track progress, and calling the right function to consume each chunk of bytes representing each field or sub-message. This kind of parser is also relatively easy to write by hand.

In this kind of parser, it should be possible to modify the (generated or hand-written) code so it calls your callback function whenever it finishes parsing a particular message type (specifically FeedEntity), then deallocates or recycles the memory chunk into which it's deserializing the FeedEntity as it moves down the list. Each FeedEntity would be overwritten or lost as soon as the next one is parsed, but you can consume them on the fly, or make copies of the few you want to keep. The trade-off here is that you will only be able to see the individual messages one by one, and the outer FeedMessage would appear to have a missing or invalid list of FeedEntities.
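A hand-written sketch of that idea in Python (only the FeedEntity field number 2 comes from gtfs-realtime.proto; the function names are my own, and real FeedEntity payloads would still need to be parsed inside the callback):

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, new_pos)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

FEED_ENTITY_FIELD = 2  # FeedMessage.entity in gtfs-realtime.proto

def for_each_entity(buf, callback):
    """Walk the top-level fields of a serialized FeedMessage, handing each
    raw FeedEntity payload to the callback instead of building a list."""
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field, wire_type = tag >> 3, tag & 7
        if wire_type == 0:        # varint field: decode and discard
            _, pos = read_varint(buf, pos)
        elif wire_type == 2:      # length-delimited: sub-message or string
            length, pos = read_varint(buf, pos)
            payload = buf[pos:pos + length]
            pos += length
            if field == FEED_ENTITY_FIELD:
                callback(payload)  # caller inspects or discards this entity
        else:
            # FeedMessage's top-level fields are all length-delimited,
            # so anything else here means we've lost sync.
            raise ValueError(f"unhandled wire type {wire_type}")
```

Each entity's bytes are only alive for the duration of the callback, so peak memory stays bounded by the largest single FeedEntity rather than the whole feed.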

Rather than modifying the function that parses FeedEntity, you'd probably want to modify the function that parses FeedMessage (which contains the repeating FeedEntity field) so it doesn't allocate a giant empty array of FeedEntity in advance. In that function's loop handling the list, after calling the FeedEntity parser function, it would call your callback then free the returned FeedEntity [1]. This seems like a good option on the surface, but the generated protobuf parsers I looked at don't actually generate full parser functions for each message type. They reuse some generic code for every message type, such that it might be hard to get it to behave in this special way for only repeating FeedEntity instances without making the parsers for all other message types into event-driven parsers. Though that wouldn't necessarily be a bad thing...

The alternative is then to write the parser yourself. Depending on your background or interests, you might find this fairly straightforward. There are tons of references on recursive descent parsers, and the protobuf wire format is well documented. One tricky part is variable-width integer encoding. What I'd probably recommend is generating a parser from the protobuf definition, then reusing all its basic number decoding functions and constants, but calling them from hand-written versions of some or all of the actual parsing functions.
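The variable-width integers are less mysterious once written out. Each byte carries 7 payload bits, least-significant group first, with the high bit set on every byte except the last. A minimal encoder/decoder pair for unsigned varints:

```python
def encode_varint(value):
    """Encode a non-negative int as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(buf, pos=0):
    """Inverse of encode_varint; returns (value, position after the varint)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

# 300 = 0b10_0101100 -> low 7 bits first: 0xAC (continuation set), then 0x02
assert encode_varint(300) == b"\xac\x02"
assert decode_varint(b"\xac\x02") == (300, 2)
```

Note that proto2 int32 fields (like arrival delay) encode negative values as very long varints rather than zigzag, so a decoder must tolerate up to ten bytes per varint.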

The other trick is going to be keeping the size of the input buffer bounded. I think what you want there might be a circular buffer, with network packets going in and the event-driven parser reading out of it.
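A toy version of that circular buffer, copying byte by byte for clarity (a real embedded implementation would use bulk copies, and the names here are my own):

```python
class RingBuffer:
    """Fixed-size circular byte buffer: network chunks go in one end and
    the incremental parser consumes from the other. Illustrative sketch."""
    def __init__(self, capacity):
        self.buf = bytearray(capacity)
        self.head = 0    # next write position
        self.tail = 0    # next read position
        self.count = 0   # bytes currently stored

    def write(self, data):
        if len(data) > len(self.buf) - self.count:
            raise BufferError("ring buffer full; parser is falling behind")
        for b in data:
            self.buf[self.head] = b
            self.head = (self.head + 1) % len(self.buf)
        self.count += len(data)

    def read(self, n):
        n = min(n, self.count)
        out = bytearray()
        for _ in range(n):
            out.append(self.buf[self.tail])
            self.tail = (self.tail + 1) % len(self.buf)
        self.count -= n
        return bytes(out)

rb = RingBuffer(8)
rb.write(b"abcdef")      # network packet arrives
assert rb.read(4) == b"abcd"   # parser consumes 4 bytes...
rb.write(b"ghij")              # ...freeing room for the next packet
assert rb.read(6) == b"efghij"
```

The buffer only needs to be as large as the biggest span the parser must see contiguously (one varint or one FeedEntity payload), not the whole feed.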

I hope this gives you some ideas! It was also a good opportunity for me to think through some upcoming documentation for differential mode GTFS-RT. Anyone please correct me if I made any mistakes in there!

Andrew

[1] Note that C protobuf parsers generated by protobuf-c let you specify your own allocator. It should be possible to make a simple slab/arena allocator that lets you undo the last N allocations at near-zero cost in one operation. At each iteration of the loop decoding FeedEntities in the FeedMessage, you could supply the slab allocator to the FeedEntity parser, call your callback, and reset the slab allocator. See: https://github.com/conveyal/vanilla-extract/blob/master/pbf-read.c#L299 However, again I don't see a straightforward way to ensure this happens only on one type of message.



andrew byrd

Dec 10, 2023, 1:05:23 PM
to gtfs-r...@googlegroups.com


On Mon, Dec 11, 2023, at 00:17, andrew byrd wrote:
If they do *not* provide streaming filtered differential messages (which is the norm), then you will indeed need to periodically download the entire full-dataset feed, parse the entire thing and scan through it until you find the single message you need :)

I just realized Stefan might be suggesting that if your producer does not do differential data, you could run the new differential system being built for the Netherlands, have it occasionally ingest a full-dataset feed from your provider, and create your own differential message broker. Then your embedded system can subscribe to only a limited number of messages from your own broker.

-Andrew 

Stefan de Konink

Dec 10, 2023, 2:51:14 PM
to gtfs-r...@googlegroups.com
On Sunday, December 10, 2023 7:04:51 PM CET, andrew byrd wrote:
> I just realized Stefan might be suggesting that if your
> producer does not do differential data, you could run the new
> differential system being built for the Netherlands, have it
> occasionally ingest a full dataset feed from your provider and
> create your own differential message broker. Then your embedded
> system can subscribe to only a limited amount of messages from
> your own broker.

Exactly. But we do not make individual StopUpdates available on a separate
topic, since that would certainly be outside the common use cases. You could
technically do so, obviously, but would then end up with topics structured
as:

tripUpdate/stop_id/trip_id, and then subscribe to tripUpdate/stop_id/#
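For anyone unfamiliar with how such wildcard subscriptions match, here is a simplified Python illustration (this mimics MQTT's `+`/`#` semantics; a real broker implements the full spec, including rules this sketch omits):

```python
def topic_matches(filter_, topic):
    """Minimal MQTT-style topic matching: '+' matches exactly one level,
    '#' (as the final level) matches everything below. Illustrative only."""
    f_parts = filter_.split("/")
    t_parts = topic.split("/")
    for i, f in enumerate(f_parts):
        if f == "#":
            return True            # multi-level wildcard eats the tail
        if i >= len(t_parts):
            return False           # filter is deeper than the topic
        if f != "+" and f != t_parts[i]:
            return False           # literal level mismatch
    return len(f_parts) == len(t_parts)

# Subscribing to a single stop, per the structure above:
assert topic_matches("tripUpdate/stop_42/#", "tripUpdate/stop_42/trip_7")
assert not topic_matches("tripUpdate/stop_42/#", "tripUpdate/stop_99/trip_7")
```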

--
Stefan

Dro Sohrabian

Dec 10, 2023, 10:41:19 PM
to GTFS-realtime
Thank you Andrew and Stefan for your excellent and detailed replies on this. Really appreciate your time. I see there has been dialog swirling for some time about how to roll this type of functionality out. Unfortunately, my data provider (Greater Cleveland RTA) only broadcasts full_dataset. I'm happy to hear the differential spec is moving along in general.

I didn't catch that the MQTT server in the Netherlands was an option, though I agree it doesn't sound like the best solution for the reasons listed. I am interested to learn more about it (and MQTT) nonetheless. How does one subscribe, and is there more documentation on it, Stefan?

In retrospect, what I've written so far is an attempt at writing my own parser like you said. With shoddy pattern-recognition functions and the help of a mini protobuf library, I made progress on reading in 500 bytes at a time and fully parsing out the updates I'm after across an entire message (i.e. 2 or 3 routes of interest for an LED arrival board using an SBC made by Adafruit). But it's not reliable. I believe I ran into variable-width issues exactly as you describe, which, without a deeper understanding of the format, makes finding the boundaries of the recursive message structure a hassle. And one mistake throws off the chain of parsing fixed-size buffers for the remainder of the FeedMessage, as the trimmings (the start of the subsequent message) get handed off to the next iteration. What is a good resource for better understanding how to detect those boundaries when cycling through arbitrarily sized chunks? Are there any existing libraries or references that are good to dig into? Do I need to venture out of Python? Also curious if this sounds impractical or not a good path to pursue! I have to admit I am somewhat out of my depth on aspects of all this, low- and high-level.

-- Dro