Assessing accuracy of real-time arrival estimate systems


Aaron Antrim

Apr 19, 2013, 5:21:40 PM
to transit-d...@googlegroups.com
Does anyone know of a defined/published methodology for assessing and comparing the accuracy of real-time arrival estimate systems?

…i.e.: how to:
* select routes/times of day for sampling
* gather data for arrival estimates and actual arrival times
* compare between predicted and actual times
* statistical process to make assessments and compare systems

It might be useful to have some benchmark data, too -- or range of acceptable performance.

I'm not even sure how necessary this is or how much variation there is between arrival estimate systems, but my immediate thought is that a methodology like this would be pretty handy and that a few people may have developed it (or parts of it).

I saw TCRP Synthesis 48 and Synthesis 73 reports on real-time arrival estimate systems but after a quick scan, I didn't see anything that addressed this question.


-- 
Aaron Antrim
www.trilliumtransit.com
Portland, Oregon

Sean Barbeau

May 14, 2013, 10:05:55 AM
to transit-d...@googlegroups.com
Aaron,
I’m not aware of any published methodologies that evaluate the accuracy of estimated arrival times, in terms of sampling trips/routes/stops.  I’ve seen requirements in RFPs for confidence levels, such as “Estimates shall be within 30 seconds of actual arrival times 95% of the time”, but I haven’t seen precise methods for how that is calculated.  I agree that this would definitely be useful.
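As an illustration of how a requirement like the one quoted above could be checked once prediction errors have been collected, here is a minimal Python sketch. It assumes errors are expressed in seconds as predicted minus actual arrival time; the function names and sample values are hypothetical.

```python
def fraction_within(errors_s, tolerance_s=30.0):
    """Share of predictions whose absolute error is within tolerance_s seconds."""
    if not errors_s:
        raise ValueError("no errors to evaluate")
    hits = sum(1 for e in errors_s if abs(e) <= tolerance_s)
    return hits / len(errors_s)


def meets_requirement(errors_s, tolerance_s=30.0, required_share=0.95):
    """True if at least required_share of errors fall within +/- tolerance_s."""
    return fraction_within(errors_s, tolerance_s) >= required_share


# Hypothetical errors in seconds (predicted minus actual arrival time).
errors = [5, -12, 28, 40, -3, 0, 15, -31, 22, 9]
print(fraction_within(errors))    # 0.8
print(meets_requirement(errors))  # False
```

How the errors are sampled (which routes, stops, and times of day) matters as much as this arithmetic; the requirement text by itself doesn't say.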

From my own experience, I think there are two general methodologies to go about this:
  1. Try to produce reports from the AVL system that compare the actual bus arrival time to the predicted value (i.e., you have a prediction, the bus drives by the stop, and then you have the actual arrival time; then compare the two). However, it’s likely that this isn’t something most systems currently report. Agencies could ask their vendor about the cost of this, but be aware that if it’s the same vendor that installed the system, the results, and how the query is put together, may be biased to make the system look good.
  2. Ground-truth testing – get out into the field and manually look at when the bus arrives at a stop, and compare that to estimate.
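Approach #1 boils down to a join between archived predictions and archived arrivals. A minimal Python sketch, assuming both are keyed by a trip ID and stop ID (a hypothetical schema; real AVL exports will differ):

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Prediction:
    trip_id: str
    stop_id: str
    observed_at: float        # when the prediction was captured (epoch seconds)
    predicted_arrival: float  # predicted arrival time (epoch seconds)


@dataclass(frozen=True)
class Scored:
    prediction: Prediction
    actual_arrival: float
    error_s: float            # predicted minus actual; negative means the bus came later than predicted
    horizon_s: float          # actual arrival minus observation time


def score(predictions: List[Prediction],
          actuals: Dict[Tuple[str, str], float]) -> Tuple[List[Scored], List[Prediction]]:
    """Pair each prediction with the AVL-recorded arrival for the same trip and stop.

    Returns (scored, unmatched); unmatched predictions have no recorded arrival.
    """
    scored: List[Scored] = []
    unmatched: List[Prediction] = []
    for p in predictions:
        actual: Optional[float] = actuals.get((p.trip_id, p.stop_id))
        if actual is None:
            unmatched.append(p)
            continue
        scored.append(Scored(p, actual, p.predicted_arrival - actual, actual - p.observed_at))
    return scored, unmatched
```

Keeping the unmatched predictions rather than silently dropping them matters for the bus-never-arrived question raised at the end of the thread.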
I recently did an ad-hoc evaluation of the estimated arrival times we’re feeding into our OneBusAway system in Tampa using #2. A spreadsheet with the values is attached. The primary purpose was to get a handle on the accuracy of arrival times that would be customer-facing for people using the OneBusAway mobile apps, since the AVL system had been installed for years but had only been used for internal operations.
 
This was an iterative process because of some backend configuration issues, so we assessed accuracy multiple times. One key improvement for us (using an OrbCAD system) was to use the column “predicted_deviation” instead of “deviation”, which was much more accurate.

Process to select trips/stops wasn’t very scientific, other than trying to pick different trips and not continuously sample the same bus at stop after stop (since you’re likely picking up the same error several times).
 
The methodology was to use the OBA Android app and take a screenshot of the app when the bus actually arrived at the stop. The screenshot then tells you everything you need (how far off the estimate was, route name, stop name, etc.), and I then transcribed this into Excel. PowerPoint slides with the screenshots are also attached.
 
I realize that for some agencies this is somewhat chicken-egg, since you might not want to invest in a solution like OneBusAway until you know the quality of the real-time data.  We were already working on a OneBusAway project and the AVL system was already installed, so the overhead for us was fairly low.

Sean
OBA Arrival Time Checks.pptx
OBA Ride Checks.xlsx

Prashant Singh

May 14, 2013, 2:29:41 PM
to transit-d...@googlegroups.com
Some quick thoughts.

Another issue to consider: when a bus is pulling up to a stop, riders don't generally need to ask, "when is the bus arriving?" When a bus is predicted to arrive in 15 minutes, though, a rider might try to do 10 or 12 minutes worth of activity (like getting coffee or walking to the bus stop). In that case, the prediction is important, and I have to imagine the error variance increases with the time to arrival.

I'm really interested in the error vs. time-to-arrival for different stops and routes in a system. If a trip is consistently slower than scheduled, then you'll see a low-variance error that increases with time to arrival. That means the prediction can be smarter, but it also means the schedule should probably be adjusted.
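The error vs. time-to-arrival analysis can be sketched as binning scored predictions by horizon and looking at the mean and spread of the error in each bin. A minimal Python version, assuming (horizon, error) pairs in seconds such as a join of predictions and actual arrivals would produce; the 5-minute bin width is an arbitrary choice:

```python
import statistics
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def error_by_horizon(pairs: Iterable[Tuple[float, float]],
                     bin_s: float = 300.0) -> Dict[int, Tuple[int, float, float]]:
    """Group (horizon_s, error_s) pairs into horizon bins of bin_s seconds.

    Returns {bin_index: (count, mean_error_s, stdev_error_s)}; bin 0 covers
    horizons in [0, bin_s), bin 1 covers [bin_s, 2*bin_s), and so on.
    """
    bins = defaultdict(list)
    for horizon_s, error_s in pairs:
        bins[int(horizon_s // bin_s)].append(error_s)
    summary = {}
    for idx, errs in sorted(bins.items()):
        sd = statistics.stdev(errs) if len(errs) > 1 else 0.0
        summary[idx] = (len(errs), statistics.fmean(errs), sd)
    return summary
```

A mean that moves steadily away from zero while the standard deviation stays small, as the bin index grows, is the "consistently slower than scheduled" pattern described above; a large standard deviation instead points at variability the predictions aren't capturing.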

It's possible to record programmatically the various predicted arrival times through the day by just querying the API on an interval. It would be cool to implement a ground-truth system with a web app or SMS app, so that you can just indicate actual arrival times by texting in a route number and stop ID or similar. Then the two data sets can be compared offline.
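Logging predictions by polling an API could look roughly like the following Python sketch. The `fetch` callable is a stand-in for whatever API is available (for example the OneBusAway REST API or a GTFS-realtime feed); the record shape used here is an assumption, not any real API's format.

```python
import csv
import time
from typing import Callable, Iterable, Tuple

# Hypothetical record shape: (trip_id, stop_id, vehicle_id, predicted_arrival_epoch_s).
Row = Tuple[str, str, str, float]


def poll_predictions(fetch: Callable[[], Iterable[Row]],
                     out_path: str,
                     interval_s: float = 30.0,
                     iterations: int = 10,
                     clock: Callable[[], float] = time.time,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Call fetch() every interval_s seconds and append each prediction to a CSV.

    Each row is stamped with the local time at which it was observed, i.e. what
    a rider would have seen at that moment. Returns the number of rows written.
    """
    written = 0
    with open(out_path, "a", newline="") as f:
        writer = csv.writer(f)
        for i in range(iterations):
            observed_at = clock()
            for trip_id, stop_id, vehicle_id, predicted in fetch():
                writer.writerow([observed_at, trip_id, stop_id, vehicle_id, predicted])
                written += 1
            f.flush()
            if i < iterations - 1:
                sleep(interval_s)
    return written
```

If the feed also carries its own timestamp (GTFS-realtime has one in its `FeedHeader`), logging it alongside the local observation time gives a direct measure of propagation delay.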

I even see value in evaluating the progression of the real-time updates without ground truth. If the predicted arrival is 15 minutes, then 10 minutes later, the predicted arrival should be 5 minutes. If that's consistently untrue, then it affects the rider's experience.
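This consistency check needs no ground truth: for each trip and stop, the absolute predicted arrival time should stay put as the countdown runs down. A minimal Python sketch, assuming logged tuples of (trip_id, stop_id, observed_at, predicted_arrival) in epoch seconds:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def prediction_drift(observations: Iterable[Tuple[str, str, float, float]]
                     ) -> Dict[Tuple[str, str], List[float]]:
    """For each (trip_id, stop_id), list how far the predicted arrival time moved
    between consecutive observations, in seconds.

    A perfectly consistent system keeps the absolute predicted time fixed, so a
    15-minute countdown becomes a 5-minute countdown 10 minutes later and every
    drift value is 0. Positive drift means the prediction slipped later.
    """
    by_key = defaultdict(list)
    for trip_id, stop_id, observed_at, predicted in observations:
        by_key[(trip_id, stop_id)].append((observed_at, predicted))
    drift = {}
    for key, obs in by_key.items():
        obs.sort()  # order by observation time
        drift[key] = [later[1] - earlier[1] for earlier, later in zip(obs, obs[1:])]
    return drift
```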

prashant.



--
You received this message because you are subscribed to the Google Groups "Transit Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to transit-develop...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Barbeau, Sean

May 14, 2013, 3:04:48 PM
to transit-d...@googlegroups.com

Prashant,

Agreed, a ground truth app would be very useful. 

 

I should mention that there is a starting point for that in the OneBusAway “Problem Reporting” feature in the mobile apps, although this feature wasn’t functional in Tampa at the time when we needed to evaluate the system.  The current Problem Reporting feature is also more tailored towards general feedback, not primarily reporting errors in estimated arrival times, although with relatively little work I believe it could be modified and streamlined to meet this need (something like a “bus just arrived” button on the main estimated arrival display screen).  So if someone decides to go down this route of implementing a tool to collect this data, I’d recommend starting with the open-source OneBusAway suite (https://github.com/OneBusAway/onebusaway-application-modules/wiki).

 

There is probably some work in travel time reliability for road networks, for example from the SHRP2 projects (http://www.trb.org/StrategicHighwayResearchProgram2SHRP2/Blank2.aspx), that could inform a more thorough investigation of this problem, from evaluating predictions to user perception of estimates.

 

Sean


Michael Frumin

May 14, 2013, 3:14:31 PM
to transit-developers
Guys, this is great stuff.  Very useful.

But I wonder why a ground-truthing app is needed?  Why not just assume the "actual" arrival time is the time that the AVL system sees the bus get to/pass the stop?

It's more of a post-processing application, and doesn't require human effort to stand out at the stop and click an app.  What am I missing?

Thanks,
Mike

Barbeau, Sean

May 14, 2013, 3:47:39 PM
to transit-d...@googlegroups.com

Mike,

Agreed, that situation is ideal (comparing actual arrival to estimated arrival based on real-time data archived from the AVL system) and is equivalent to the #1 approach in my original email.  If you’re designing a new AVL system this would be the way to go.

 

But not all existing AVL systems provide stop-level precision for AVL position data, and even if they do, it’s not necessarily exposed to developers/analysts. Even with access to a copy of HART’s production AVL database, we had several challenges. HART’s OrbCAD system only logs GPS data at timepoints, which can be every 6 stops or so. You can also have estimate propagation delay issues: the time an AVL estimate is recorded in the agency’s database isn’t necessarily the time it gets shown to riders via an app (this is actually something to consider in #1 too, depending on where you are logging the estimates). We were concerned about this due to the number of links in our information chain, as well as the infrequent HART GPS/estimate updates. Another challenge is that HART’s AVL system doesn’t log historical estimate information, only historical position information, so we would have had to implement our own software/database to log historical estimates.

 

Depending on one’s requirements and AVL features the above may or may not be deal breakers for a fully automated post-processing solution.

 

Sean

Michael Frumin

May 14, 2013, 4:07:42 PM
to transit-developers
Sean, thanks.

But correct me if I'm wrong -- in your powerpoint you said you are refreshing the GTFS-realtime every 15 seconds.  How often does the underlying AVL system update its location for a given bus (I assume it updates locations more frequently than the every-6-stops timepoints)?

I wouldn't be too concerned with what an AVL system does and doesn't log -- as long as you are getting the location of each bus every 15-30 seconds, then you have plenty of information (which you can log/archive yourself) to make a pretty good interpolation of when each bus reached each stop, no?
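The interpolation suggested here can be sketched as follows, assuming each position fix has already been converted to distance traveled along the trip's shape (that map-matching step is not shown) and that the vehicle moves at a constant speed between fixes:

```python
from bisect import bisect_left
from typing import List, Optional, Tuple


def interpolate_arrival(fixes: List[Tuple[float, float]], stop_dist_m: float) -> Optional[float]:
    """Estimate when a vehicle reached a stop by linear interpolation.

    fixes: (timestamp_s, distance_along_trip_m) pairs for one trip, sorted by time,
    with non-decreasing distance. Returns None if the stop lies outside the observed range.
    """
    dists = [d for _, d in fixes]
    i = bisect_left(dists, stop_dist_m)  # first fix at or past the stop
    if i == len(fixes):
        return None  # vehicle not yet observed at or past the stop
    t1, d1 = fixes[i]
    if d1 == stop_dist_m:
        return t1  # a fix lands exactly on the stop
    if i == 0:
        return None  # stop lies before the first fix
    t0, d0 = fixes[i - 1]
    return t0 + (t1 - t0) * (stop_dist_m - d0) / (d1 - d0)
```

The constant-speed assumption is why the fix interval matters: with fixes every 15 to 30 seconds the interpolation error is small, but with fixes only at timepoints several stops apart, dwell time and signal delays get smeared across every intermediate stop.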

Thanks,
Mike

Barbeau, Sean

May 14, 2013, 5:23:58 PM
to transit-d...@googlegroups.com

Mike,

In the latest version, we’re refreshing GTFS-realtime from the OrbCAD database contents every 5 seconds. From what we’ve been told and seen, the underlying AVL system updates each vehicle’s position at timepoints (e.g., every 6 stops or so, depending on the route), so when a given vehicle’s position gets updated varies. This is a constraint of the proprietary AVL radio network (Motorola, circa 2006), which is built for voice traffic, with low-bandwidth data as an add-on. With traffic in certain areas of Tampa, 6 stops can be a long time, especially if you’re waiting at traffic lights (e.g., 2 minutes). After some of the initial predictions seen via the mobile app weren’t great, we tightened up our refresh rate to reduce the propagation latency on our end as much as possible (recall that OBA still needs to refresh its positions from the GTFS-realtime feed, which is every 15 seconds for us currently, and then the app still needs to refresh its data from the OBA REST API). The underlying prediction engine is OrbCAD proprietary, so we don’t know how predictions are calculated (I’m assuming they are based on some historical data, possibly varying over time without necessarily receiving a vehicle position update). From the last round of ground-truth tests (see spreadsheet), accuracy visible to app users seemed reasonable, which is what we were most concerned with at the time, so we didn’t dig any deeper than that.
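The refresh chain described above can be turned into a rough staleness budget. This Python sketch assumes each stage polls the previous one on a fixed interval with an independent, uniformly random phase, and it ignores processing and network time. The 2-minute AVL figure is the one HART reported later in the thread; the app refresh interval is a hypothetical placeholder.

```python
def staleness_s(intervals_s):
    """Worst-case and mean age of a position by the time the rider sees it.

    Under the stated assumptions, each stage adds between 0 and its full
    interval of delay (half of it on average), so the worst case is the sum
    of the intervals and the mean is half of that sum.
    """
    worst = sum(intervals_s.values())
    return worst, worst / 2.0


stages = {
    "AVL position update": 120.0,     # ~2 minutes, as HART reported
    "OrbCAD -> GTFS-realtime": 5.0,
    "GTFS-realtime -> OBA": 15.0,
    "OBA -> app refresh": 30.0,       # assumption; not stated in the thread
}
print(staleness_s(stages))  # (170.0, 85.0)
```

Under these assumptions, tightening the GTFS-realtime refresh from 15 to 5 seconds trims at most 10 seconds from the worst case, while the AVL update interval dominates the total.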

Michael Smith

May 16, 2013, 6:45:31 PM
to transit-d...@googlegroups.com
We have found that several components are needed for prediction verification. An automated system for looking at historic arrival predictions is key because only such a system can thoroughly show how well a system is truly working. It is the only way you can cover all the stops, day and night, weekdays and weekends, etc. A separate system for examining real-time accuracy is also important because it is useful for diagnosing problems in real-time and for managing a system (such as determining when to notify passengers of a problem with SMS). And lastly, a "ground truth" system where predictions are manually verified is still key for giving the transit agency the confidence that the prediction accuracy software is actually providing accurate results. Otherwise the agency will have no idea if the system truly works or not.

Surprisingly, "ground truth" tests are difficult. You would be amazed how much bad data testers can provide. Expect to spend a lot of time creating good processes and then going over the data in great detail to make sure that the testers did what they were supposed to do.
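Some of that review can be automated so that human effort goes to the suspicious records. A minimal Python sketch of a few plausible checks; the record fields, the route-to-stop lookup, the one-minute duplicate window, and the 15-minute review threshold are all assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class GroundTruth:
    tester: str
    route: str
    stop_id: str
    observed_arrival: float   # epoch seconds, from the tester's device clock
    predicted_arrival: float  # what the app showed at that moment, epoch seconds


def qa_flags(records: List[GroundTruth],
             stops_by_route: Dict[str, Set[str]],
             max_abs_error_s: float = 900.0) -> Dict[int, List[str]]:
    """Return {record_index: [problems]} for records that need manual review."""
    flags: Dict[int, List[str]] = {}
    seen = set()
    for i, r in enumerate(records):
        problems = []
        if r.stop_id not in stops_by_route.get(r.route, set()):
            problems.append("route does not serve stop")
        # Same route and stop within roughly the same minute: likely the same bus twice.
        key = (r.route, r.stop_id, round(r.observed_arrival / 60))
        if key in seen:
            problems.append("possible duplicate observation")
        seen.add(key)
        if abs(r.predicted_arrival - r.observed_arrival) > max_abs_error_s:
            problems.append("error larger than review threshold")
        if problems:
            flags[i] = problems
    return flags
```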

Mike

Ken Conaway

May 17, 2013, 10:24:40 AM
to transit-d...@googlegroups.com
What is HART's current polling rate for the ACS system? DC Metro has the same system, and their rate is about every 2 minutes, but they're in the process of switching to another vendor to improve the predictions. They are targeting a 30-second polling rate vs. the current 2-minute rate.

Barbeau, Sean

May 17, 2013, 10:45:17 AM
to transit-d...@googlegroups.com
Ken,
HART has told us that their update rate is also around 2 minutes, or when a bus passes a timepoint. For HART, the bigger issue is the capacity of their radio system in terms of data bandwidth. They are using a proprietary Motorola radio that was designed primarily for voice, with data as an add-on. HART has told us that they are concerned with system reliability when increasing the update rate to any more frequently than it is now, because of the radio channel.

Sean

Michael Smith

May 17, 2013, 11:53:05 AM
to transit-d...@googlegroups.com
I highly recommend measuring the update rate. We have found that for systems from particular vendors, the update rate is far worse than it is supposed to be.
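One way to measure the update rate from logged feed data is to look at the gaps between distinct position timestamps for each vehicle. This Python sketch assumes the log records the vehicle's own position timestamp (for example a GTFS-realtime `VehiclePosition` timestamp) rather than the time of the poll, since gaps between polls would only measure the polling rate:

```python
import statistics
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def update_intervals(reports: Iterable[Tuple[str, float]]) -> Dict[str, Tuple[float, float]]:
    """Measure how often each vehicle's position actually changes in a logged feed.

    reports: (vehicle_id, position_timestamp_s) as logged from repeated polls;
    repeated timestamps (the same fix seen on several polls) are collapsed.
    Returns {vehicle_id: (median_gap_s, max_gap_s)} for vehicles with 2+ fixes.
    """
    stamps = defaultdict(set)
    for vehicle_id, ts in reports:
        stamps[vehicle_id].add(ts)
    result = {}
    for vehicle_id, ts_set in stamps.items():
        ts = sorted(ts_set)
        gaps = [b - a for a, b in zip(ts, ts[1:])]
        if gaps:
            result[vehicle_id] = (statistics.median(gaps), max(gaps))
    return result
```

Looking at the maximum gap, not just the median, shows the long silences that an advertised average rate hides.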

Mike

Joa

May 17, 2013, 1:01:59 PM
to Transit Developers


On May 16, 3:45 pm, Michael Smith <msmithnext...@gmail.com> wrote:
> We have found that several components are needed for prediction
> verification. An automated system for looking at historic arrival
> predictions is key because only such a system can thoroughly show how well
> a system is truly working. It is the only way you can cover all the stops,
> day and night, weekdays and weekends, etc.

The reality of real-time next bus and train times being provided as a
"by-product" of CAD/AVL systems does not support this claim. Assuming
decent poll rates (2/s or better), they work just fine without "looking
at historical arrival predictions".

"HART's OrbCAD system only logs GPS data at timepoints, which can be
every 6 stops or so."
This is more likely a function of the underlying comms system. Latch-on
of transit data onto public safety (read: voice-centric) LMR systems
was being championed for a while, but that arrangement never gained a
reputation for working terribly well. I suppose HART would fall in that
category too, although I am not familiar with their system. There
might be examples where decent polling rates can be accomplished. At
any rate, I believe everybody has given up on that concept, and if
upfront cost is the issue (ref.: FCC drive for spectral efficiency),
there is the concept of getting on public safety networks for voice
and using commercial carriers for data. It is not necessarily the ideal
arrangement, but emergency communications can be covered via public
safety voice (assuming an SLA is in place that assures public safety
doesn't cut off transit), and you make what you get from commercial
carriers work, which might not be terrible in everyday operations.

Joa

May 17, 2013, 1:08:07 PM
to Transit Developers
Correction: 2/min, that is.


Michael Smith

May 17, 2013, 2:00:06 PM
to transit-d...@googlegroups.com
Joa,

You are correct that using public safety/voice channels has not proven to be the best solution. But it should be noted that a large number of AVL systems still use this unfortunate technology and some agencies are still pursuing it (e.g., San Francisco Muni). Therefore it often still needs to be dealt with.

Mike

Joa

May 18, 2013, 11:30:22 AM
to Transit Developers


On May 17, 11:00 am, Michael Smith <msmithnext...@gmail.com> wrote:
> But it should be noted that a large number of AVL systems still use this unfortunate technology and some agencies are still pursuing it (i.e. San Francisco Muni). Therefore it still often needs to be dealt with.
>

Today, things are more nuanced than you paint it. I figure you refer
to MTA's radio system replacement project. Here, transit uses its own,
dedicated spectrum that exclusively runs data for transit, purpose
built to meet the agency's needs, using radio equipment that is built
for this use. With "tack-on" I was referring to cases where efforts
were made to "mix in" transit data with public safety. That's indeed
an unfortunate combination, because that either was made under the
concept that transit data could be squeezed through a trunking control
channel at capacity, or that it could run along with public safety
data, which has a rather different usage profile.
Choosing dedicated and exclusive systems over commercial carrier
services allows agencies to require contractors to provide capacity and
coverage that meet the agency's needs, e.g. coverage inside equipment
storage facilities that have zero or sketchy commercial carrier
coverage. This is important in order to support pull-out into service.
Other cases are tunnels and, generally, areas with poor commercial
carrier coverage. The contracts are structured such that the
contractor agrees to provide the coverage and capacity needed to
support their CAD/AVL system, which provides the required functionality
at a required performance level. (Side-stepping to the OP: while
functional performance requirements are set in stone in the contract,
the methods to measure the performance are typically defined in design
phases, because they depend on the product installed.) There is no
finger-pointing, as the risk of making things work resides with the
contractor, without risk transfers to the agency. When you're on the
contractor's end, it takes cojones to sign up for an arrangement where
the implementation risk stays with you, but there are qualified people
who do it. So why would an agency want to accept anything less and get
into the business of being a system integrator?
Also, good luck getting carriers to do anything for you, and even if
they are supportive, their hands might be tied. Consider this:
commercial carrier networks are "cell" systems that rely on small, low
sites, where efforts to build out can receive fierce push-back from
neighborhood activists. In contrast, transit data can often be
collocated with existing public safety radio sites, or at least placed
"out of the way".
Finally, it doesn't even take a large earthquake to take down
commercial carrier coverage. When there's a large event in Golden Gate
Park, carriers will shuttle in mobile base stations, but these go only
so far. It's reassuring when you know your operations can continue
without being affected by fluctuations in demand from the general
public.

Michael Smith

May 18, 2013, 1:54:47 PM
to transit-developers
Joa,

You have clearly stated the arguments that agencies use to support not using a commercial cellular system. But in working with many transit agencies I have found that those arguments simply do not hold up. The non-cellular based systems simply don't meet the specs. It doesn't matter what the requirements are if the vendor simply cannot meet them. With the non-cellular systems you get low reporting rates, dead zones, poor coverage in garages and tunnels, long-lasting outages when a tower goes down, huge delays due to radio systems requiring separate higher-powered transceivers that neighbors will not benefit from, etc. Plus they cost far more. The data doesn't lie. 

Mike


Joa

May 19, 2013, 12:28:35 AM
to Transit Developers


On May 18, 10:54 am, Michael Smith <msmithnext...@gmail.com> wrote:
> Joa,
>
> You have clearly stated the arguments that agencies use to support not
> using a commercial cellular system. But in working with many transit agencies...

(Shrugs) - since you brought up San Francisco.

Everybody's situation is different; see above ("the concept of getting
on public safety networks for voice and using commercial carriers for
data."). Generally, the FCC-mandated spectral efficiency requirements
in the UHF and 700MHz spectrum will tend to push more agencies onto
commercial carrier turf, because the capital investments become cost
prohibitive. It is the perfect storm in the US: not only are investments
in transportation infrastructure puny (as a fraction of GDP, compared
to other developed countries), but at the same time, costs for equipment
that meets rather onerous (from a transit perspective) spectral
efficiency requirements are considerably higher than they have been
(adjusted for inflation).
That said, it is good practice to frame procurements in a way that
keeps system integration and implementation risk with the contractor.
So if the specs aren't met, the contractor is going to have to come out
and fix things, or compensate the agency in proportion to how far the
delivered performance falls short of what they signed up for.

Michael Frumin

May 19, 2013, 1:04:19 AM
to transit-developers

Fascinating as the debate over commercial cellular vs private radio is (and recognizing that the "right" answer of course varies depending on the many context-specific objectives and constraints), I wonder what any of it has to do with methodologies for validating and evaluating the performance of different bus arrival time prediction algorithms?

Dyer, James

May 19, 2013, 5:23:40 PM
to transit-d...@googlegroups.com
Sean Barbeau noted that he could not check predictions against location
fixes at stops, due to the low frequency of those fixes, and that the
low frequency was a constraint of the radio network. Joachim Pfeiffer
repeated the first sentence of Sean's message and guessed the same
constraint.
<_Jym_>

Michael Smith

May 19, 2013, 8:24:16 PM
to transit-d...@googlegroups.com
Exactly. Communication technology strongly affects the AVL reporting rate which in turn strongly affects both prediction accuracy and the ability to measure the accuracy.

Mike

Michael Frumin

May 19, 2013, 8:34:15 PM
to transit-developers

Agreed.  It affects everything about the AVL system, including many things not mentioned in this thread.

But the update interval of the system is something that is known more or less a priori.  You don't need to think deeply and gather lots of ground truth data or do lots of fancy post processing to evaluate it.  Or even if you did, you would probably discuss it in an email thread with subject along the lines of "evaluating private and commercial radio systems for AVL and real-time arrival estimates."

What I was really liking about this thread was the collective thinking about how to analyze and evaluate a prediction system *given* a certain update interval and radio system performance.  Is anyone interested in getting back to that?

Thanks,
Mike

Rich Fantozzi

May 20, 2013, 12:00:30 PM
to transit-d...@googlegroups.com
Just a quick plug: this would be a very good question for http://area51.stackexchange.com/proposals/49339/open-transportation-technology
The proposal still needs more commitments.



Michael Frumin

Jun 7, 2013, 1:39:51 PM
to transit-developers
So, backing up the discussion to before we started discussing radio systems and update intervals, I wonder if anyone has further thoughts on this discussion in the context of a fixed update interval.  Assume your system advertises the ability to update data for each bus every X seconds, and delivers on that.  What next?  How to formulate a scheme for setting standards, measuring them, etc?

From what I've seen, people tend to measure the size of the confidence interval, at different confidence levels, for predictions made at different amounts of time before the bus actually arrives at the stop. For example, if the confidence interval for predictions made 10 minutes before arrival is measured at +/- 3 minutes at the 95% confidence level, it means that 95% of the time, the arrival time predicted 10 minutes before the actual arrival was within 3 minutes of the actual arrival time.
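Read that way, the "confidence interval" is an empirical quantile of absolute error among predictions made at a given horizon. A minimal Python sketch, assuming (horizon, error) pairs in seconds; the +/- 1-minute window around the target horizon and the nearest-rank quantile are choices made for illustration:

```python
import math
from typing import Iterable, Tuple


def error_bound(pairs: Iterable[Tuple[float, float]],
                horizon_s: float, window_s: float = 60.0,
                level: float = 0.95) -> float:
    """Smallest bound B such that at least `level` of the predictions made about
    horizon_s seconds before the actual arrival had |error| <= B.

    pairs: (horizon_s, error_s). Uses the nearest-rank empirical quantile of the
    absolute error over horizons within +/- window_s of horizon_s.
    """
    errs = sorted(abs(e) for h, e in pairs if abs(h - horizon_s) <= window_s)
    if not errs:
        raise ValueError("no predictions near this horizon")
    rank = math.ceil(level * len(errs))  # 1-based nearest rank
    return errs[rank - 1]
```

For the example in the text, `error_bound(pairs, 600)` returning 180 or less would correspond to the "+/- 3 minutes at 95%" statement.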

That seems like a fairly reasonable framework (though you might choose to adjust any of the particular numeric values), but it leaves me wondering how to actually do the measurements. First off, assume you trust your system to accurately measure and store in a database when a given bus arrived at a given stop. The question then is how to sample the predictions actually produced in order to calculate those confidence interval metrics.

I say "sample" because I doubt you want to store every single prediction the system ever makes.  And sampling should be pretty easy to implement given the presence of all these great API's these days.  So what sampling scheme would you use?  Eg:

* Totally random: randomly pick a stop, randomly pick a bus currently predicted to arrive at that stop, and save the current timestamp and the predicted arrival time.

* Stratified sampling: segment or weight the stops you will sample, and which arrival predictions you will record when sampling those stops.

* Bus-following: randomly pick a bus and follow it along its current trip, saving all predictions offered as to when that bus will arrive at each downstream stop, along with the current timestamp and predicted arrival times.
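The three schemes could be sketched roughly as follows in Python, assuming each poll of a real-time API yields a snapshot of (stop_id, vehicle_id, predicted_arrival) rows; that row shape and the function names are hypothetical:

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical snapshot row from one poll: (stop_id, vehicle_id, predicted_arrival_epoch_s)
Snap = Tuple[str, str, float]


def random_stop_then_bus(snapshot: List[Snap], rng: random.Random) -> Snap:
    """Totally random: pick a stop uniformly, then one of the buses predicted there.

    This weights every stop equally, so an individual prediction at a busy stop
    is less likely to be drawn than one at a quiet stop.
    """
    by_stop: Dict[str, List[Snap]] = defaultdict(list)
    for row in snapshot:
        by_stop[row[0]].append(row)
    stop = rng.choice(sorted(by_stop))
    return rng.choice(by_stop[stop])


def stratified(snapshot: List[Snap], stratum: Callable[[Snap], str],
               per_stratum: int, rng: random.Random) -> List[Snap]:
    """Stratified: up to per_stratum predictions from each stratum (route, time band, ...)."""
    groups: Dict[str, List[Snap]] = defaultdict(list)
    for row in snapshot:
        groups[stratum(row)].append(row)
    picked: List[Snap] = []
    for key in sorted(groups):
        rows = groups[key]
        picked.extend(rng.sample(rows, min(per_stratum, len(rows))))
    return picked


def pick_bus(snapshot: List[Snap], rng: random.Random) -> str:
    """Bus-following, step 1: choose one vehicle to follow for the rest of its trip."""
    return rng.choice(sorted({vehicle for _, vehicle, _ in snapshot}))


def predictions_for(snapshot: List[Snap], vehicle_id: str) -> List[Snap]:
    """Bus-following, step 2: on every later poll, keep all predictions for that vehicle."""
    return [row for row in snapshot if row[1] == vehicle_id]
```

Bus-following produces correlated samples (the same error tends to show up at stop after stop, as noted earlier in the thread), so it may suit studying how predictions evolve along a trip better than estimating system-wide error rates.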

Then, for any of these strategies, compare the samples to the actual arrival times, and generate your confidence intervals.

Also, how do you handle the case where a bus never arrives at a stop it was at some point predicted to arrive at? Say, for example, the system says that a given bus is predicted to arrive at a given stop in 12 minutes, but the bus never arrives because it has a mechanical problem and goes out of service. That sort of thing is obviously impossible to predict, so is it fair to hold it against the prediction system?
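One possible treatment is to keep such predictions out of the error statistics but report them as a separate metric, since a rider still waited for that bus. A minimal Python sketch; the 30-minute grace period after the predicted time and the tuple layout are assumptions:

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (trip_id, stop_id, observed_at_s, predicted_arrival_s)
Pred = Tuple[str, str, float, float]


def classify(preds: List[Pred], actuals: Dict[Tuple[str, str], float],
             now_s: float, grace_s: float = 1800.0) -> Dict[str, List[Pred]]:
    """Split predictions into 'arrived', 'unfulfilled', and 'pending'.

    A prediction is 'unfulfilled' when no arrival was recorded for its trip and
    stop and more than grace_s has passed since the predicted time; otherwise it
    stays 'pending'. The grace period is an arbitrary, assumed threshold.
    """
    out: Dict[str, List[Pred]] = {"arrived": [], "unfulfilled": [], "pending": []}
    for p in preds:
        trip_id, stop_id, _, predicted = p
        if (trip_id, stop_id) in actuals:
            out["arrived"].append(p)
        elif now_s - predicted > grace_s:
            out["unfulfilled"].append(p)
        else:
            out["pending"].append(p)
    return out
```

Error bounds would then be computed from `arrived` only, with `len(unfulfilled) / (len(arrived) + len(unfulfilled))` tracked alongside them.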

Thoughts?

Thanks,
Mike

