Applying astroML's Bayesian histogram bin-determination to bus service frequencies

Emil Vaughan

unread,

Jan 31, 2013, 7:23:16 AM1/31/13

to astroml...@googlegroups.com

I wonder if someone can help me with the following problem.

I'm looking at bus service frequencies in London, and I'm using the astroML hist() function to try to divide the day into different bins according to service frequencies.

For example, the 205 bus runs every half an hour at night, every 5-7 minutes during rush hour, and every 7-8 minutes at other times. In fact, these are the gaps in minutes between departures:

[12, 19, 30, 31, 30, 30, 30, 30, 30, 30, 30, 13, 12, 12, 10, 10, 9, 8, 8, 7, 7, 5, 5, 5, 7, 7, 8, 7, 8, 7, 8, 7, 8, 7, 8, 7, 8, 7, 7, 8, 8, 7, 8, 7, 8, 7, 8, 7, 8, 7, 8, 7, 8, 7, 8, 7, 7, 8, 8, 7, 7, 8, 8, 7, 8, 7, 8, 7, 8, 7, 8, 7, 8, 7, 8, 7, 8, 7, 8, 7, 8, 7, 8, 7, 7, 7, 7, 6, 6, 6, 6, 7, 8, 7, 8, 7, 8, 7, 8, 7, 8, 7, 8, 7, 8, 8, 8, 7, 8, 8, 8, 7, 8, 8, 9, 8, 8, 8, 7, 10, 10, 10, 10, 10, 10, 10, 10, 10, 11, 12, 13, 11, 12, 13, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12]

In terms of time, the 205 bus begins journeys at the following times of day (times are given as seconds since midnight):

[0, 720, 1440, 2580, 4380, 6240, 8040, 9840, 11640, 13440, 15240, 17040, 18840, 19620, 20340, 21060, 21660, 22260, 22800, 23280, 23760, 24180, 24600, 24900, 25200, 25500, 25920, 26340, 26820, 27240, 27720, 28140, 28620, 29040, 29520, 29940, 30420, 30840, 31320, 31740, 32160, 32640, 33120, 33540, 34020, 34440, 34920, 35340, 35820, 36240, 36720, 37140, 37620, 38040, 38520, 38940, 39420, 39840, 40260, 40740, 41220, 41640, 42060, 42540, 43020, 43440, 43920, 44340, 44820, 45240, 45720, 46140, 46620, 47040, 47520, 47940, 48420, 48840, 49320, 49740, 50220, 50640, 51120, 51540, 52020, 52440, 52860, 53280, 53700, 54060, 54420, 54780, 55140, 55560, 56040, 56460, 56940, 57360, 57840, 58260, 58740, 59160, 59640, 60060, 60540, 60960, 61440, 61920, 62400, 62820, 63300, 63780, 64260, 64680, 65160, 65640, 66180, 66660, 67140, 67620, 68040, 68640, 69240, 69840, 70440, 71040, 71640, 72240, 72840, 73440, 74100, 74820, 75600, 76260, 76980, 77760, 78480, 79200, 79920, 80640, 81360, 82080, 82800, 83520, 84240, 84960, 85680]

I'm using the following code to make a histogram

from astroML.plotting import hist
import numpy as np

hist(np.array(arrival_times), bins="knuth", ax=ax, histtype="bar")

However, this only divides the day into two bins. I'd ideally like it to produce a few more bins (say 4), but I couldn't work out how to vary the parameters to the hist function to do this. I don't want all bins to have the same width, which is what lead me to astroML in the first place. Does anyone have any ideas? Thanks,

Emil

Jake Vanderplas

unread,

Jan 31, 2013, 10:43:50 AM1/31/13

to astroml...@googlegroups.com

Hi Emil,
If the resulting histogram has only two bins, then this means that according to the default fitness model of the Bayesian Blocks method, two bins are the optimal data representation.

Just as a quick experiment, use bins=48 (i.e. regular half-hour bins) rather than bins='blocks': you'll see that there's a distinct break around 20000s (that is, about 6am) where the frequency goes from low to high - the frequency of buses is more-or-less constant on either side of this break. That's what the two bins of the 'blocks' method are telling you.

There's an added detail here: the bus schedule data is cyclical, which means the "events" fitness function is not quite correct. The code is designed to be flexible enough that if you come up with a more suitable fitness function, you can implement it and use that within the blocks function (see FitnessFunc and derived classes in the source: https://github.com/astroML/astroML/blob/master/astroML/density_estimation/bayesian_blocks.py ) You might also check appendix C of Scargle 2012 (https://github.com/astroML/astroML/blob/master/astroML/density_estimation/bayesian_blocks.py ) and see if any of those alternative fitness functions seem to make more sense for your data.

If you do end up implementing a different fitness function, please think about contributing it to astroML so others can use it! You can do that through a github pull request.

Another possibility is to tweak the prior of the events fitness function - it's set according to numerical simulations in the Scargle paper. I'm not sure if this quite makes sense for bus arrival times, but it would look something like this:

   from astroML.plotting import hist
   from astroML.density_estimation.bayesian_blocks import bayesian_blocks, Events

   fitness = Events(p0=0.2) # prior calibrated to a false detection rate of 0.2
   bins = bayesian_blocks(arrival_times, fitness=fitness)
   hist(arrival_times, bins=bins, normed=True) # normed should be true for unequal bins

Hopefully that was helpful - good luck with your analysis
   Jake

--
You received this message because you are subscribed to the Google Groups "astroML-general" group.
To unsubscribe from this group and stop receiving emails from it, send an email to astroml-gener...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Jake Vanderplas

unread,

Jan 31, 2013, 11:03:15 AM1/31/13

to astroml...@googlegroups.com

One more thought - I wonder if a histogram is really the best way to represent this data. I'd probably instead just plot headway or bus frequency vs time:

headways = arrival_times[1:] - arrival_times[:-1]
busses_per_hr = 3600. / headways

Jake Vanderplas

unread,

Jan 31, 2013, 11:07:17 AM1/31/13

to astroml...@googlegroups.com

Sorry - accidentally hit send. That should have been:

import matplotlib.pyplot as plt
arrival_times = np.asarray(arrival_times)

headways = arrival_times[1:] - arrival_times[:-1]

   bus_freq = 3600. / headways # convert headways to buses per hour

   plt.subplot(211)
   plt.plot(arrival_times[:-1], headways)
   plt.subplot(212)
   plt.plot(arrival_times[:-1], bus_freq)

You could probably smooth these curves as well, if you prefer.

Hope that helps,

Emil Vaughan

unread,

Jan 31, 2013, 1:06:01 PM1/31/13

to astroml...@googlegroups.com

Hi Jake, thanks for this! The reason for wanting the histogram is just
the binning - I don't want to present the data, I just want to bin the
frequencies into a few time periods. I did some plots like you
suggested and I wonder if smoothing these is the way to go here.

I'll also have a look at making a fitness function.

Emil

Reply all

Reply to author

Forward