Re: [transit-developers] Compression strategies for GTFS data


Brian Ferris

Apr 10, 2013, 5:55:50 PM
to transit-d...@googlegroups.com
You might consider a custom binary format that keeps just the parts of the GTFS feed that you'll actually be using on the phone. There are lots of strategies and tricks you can use, some of which will depend on the phone platform you are targeting. Just keep in mind that GTFS was designed to make it easy for transit agencies to edit with off-the-shelf tools like text editors and spreadsheet programs, at the expense of space efficiency. However, there is nothing that says you can't parse the GTFS into some more efficient form.

Brian


On Wed, Apr 10, 2013 at 2:14 AM, Thomas Gallagher <gall...@gmail.com> wrote:

I am looking to develop a transit app using GTFS static data. One of the constraints I've set for myself is that the app should use minimal mobile data transfers. Therefore, I would like to embed all the data in the app.

My issue is that GTFS data sets are usually quite large (85 MB uncompressed for the city of Sydney, for example). I've done a bit of reverse engineering on other apps out there and found that some of them have managed to compress all that data into a much smaller file (I'm talking about a few MB at most).

Using 7-Zip, I've managed to compress my 85 MB data set down to 5 MB, which is acceptable for me. The next step is to use that 7z file in my app, and that's where I'm stuck. There's no way I'm going to uncompress it and put it in a SQL database, as that would use too much space on the phone. So I was wondering what my other options are.

Thanks


Andrew Byrd

Apr 10, 2013, 9:26:00 PM
to transit-d...@googlegroups.com
Hi Thomas,

Consider that the largest table in a GTFS feed is usually stop_times.txt.

First, departure and arrival times can be represented as a number of
seconds (or minutes, 1/2, 1/4, or 1/8 minutes...) after midnight and
stored directly as integers rather than text. Next, consider that the
time difference between an arrival and a departure time is generally
much smaller than the offset of the arrival time from midnight, so the
dwell time can fit into a narrower data type.
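
The two ideas above might be sketched like this (a toy encoding of my own, not anything from this thread; the `<IB` layout and the one-byte dwell limit are arbitrary assumptions):

```python
import struct

def hms_to_secs(hms):
    """Parse a GTFS HH:MM:SS time; hours may exceed 24 for overnight trips."""
    h, m, s = map(int, hms.split(":"))
    return h * 3600 + m * 60 + s

def pack_stop_time(arrival, departure):
    """Pack one stop time as arrival seconds (uint32) plus dwell (uint8):
    5 bytes instead of two 8-character text fields."""
    arr = hms_to_secs(arrival)
    dwell = hms_to_secs(departure) - arr
    assert 0 <= dwell < 256, "dwell assumed to fit in one byte in this sketch"
    return struct.pack("<IB", arr, dwell)
```

Real feeds with long layovers would need a wider dwell field or an escape value.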

Also, consider that schedule data are often redundant in a way that
general-purpose compression algorithms cannot perceive. Because you have
some insight into the semantics of the data you can spot and exploit
these patterns. Many of the trips in your input GTFS feeds may be exact
copies of other trips shifted in time, or you may be able to shift one
trip in time and subtract the other, storing only small residuals.
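
One way to picture the shift-and-subtract idea (my own sketch; names are made up):

```python
def time_shift_residual(base, other):
    """Express `other` (stop times in seconds) relative to `base` as a single
    shift plus per-stop residuals. If `other` is an exact time-shifted copy of
    `base`, every residual is zero and the trip compresses to almost nothing."""
    shift = other[0] - base[0]
    residuals = [o - (b + shift) for b, o in zip(base, other)]
    return shift, residuals
```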

These are not the only options, but just examples of ways you could
heavily compact the GTFS feed while keeping it in a form that is
directly usable by an application.

-Andrew

Quentin Zervaas

Apr 10, 2013, 9:39:21 PM
to transit-d...@googlegroups.com
A lot can also be done with the shapes data. But I'll let you figure that one out :)

Johnson Chetty

Apr 11, 2013, 12:36:36 AM
to transit-d...@googlegroups.com
On 11 April 2013 06:56, Andrew Byrd <and...@fastmail.net> wrote:
> Hi Thomas,
>
> Consider that the largest table in a GTFS feed is usually stop_times.txt.
>
> First, departure and arrival times can be represented as a number of
> seconds (or minutes, 1/2, 1/4, or 1/8 minutes...) after midnight and
> stored directly as integers rather than text. Next, consider that the
> time difference between an arrival and a departure time is generally
> much smaller than the offset of the arrival time from midnight, so the
> dwell time can fit into a narrower data type.
>
> Also, consider that schedule data are often redundant in a way that
> general-purpose compression algorithms cannot perceive. Because you have
> some insight into the semantics of the data you can spot and exploit
> these patterns. Many of the trips in your input GTFS feeds may be exact
> copies of other trips shifted in time, or you may be able to shift one
> trip in time and subtract the other, storing only small residuals.

Hey,

The above use case is exactly what frequencies.txt aims to address (and it does so very well, IMO).
stop_times.txt is large only because it explicitly lists every trip, but when used with frequencies.txt it really cuts down the size.
frequencies.txt allows one to represent a repetitive group of trips within a timespan.

Has that been tried yet?
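
For reference, a consumer materializes a frequencies.txt row back into explicit trip start times roughly like this (a sketch of the headway-based case, with all times in seconds after midnight):

```python
def expand_frequency(start_time, end_time, headway_secs):
    """Expand one headway-based frequencies.txt row into the implied trip
    start times: one trip every headway_secs from start_time up to end_time."""
    return list(range(start_time, end_time, headway_secs))
```

So a single row covering 06:00-09:00 at a 10-minute headway stands in for 18 explicit trips in stop_times.txt.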
--
Regards,
Johnson Chetty




Joa

Apr 11, 2013, 1:28:25 AM
to Transit Developers
There is a great deal of repetition in stop_times.txt. In database
speak, GTFS is not normalized.
You will find that the route patterns (sequences of stops) are unrolled,
and so are the run times. Extracting the route and run-time patterns and
referencing them from the trips should get you down to between 20 and
50% of the original feed size, depending on how many (or few) patterns
exist. This will also play out nicely when populating a database,
although, as you mention, this will likely go beyond what the DBMSes on
smartphones are designed to handle.
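
The pattern extraction described above can be sketched as (names are mine):

```python
def extract_patterns(trip_stops):
    """trip_stops maps trip_id -> tuple of stop IDs. Returns the distinct
    stop-sequence patterns plus each trip's index into the pattern table,
    so a sequence shared by many trips is stored only once."""
    patterns, index, assignment = [], {}, {}
    for trip_id, stops in trip_stops.items():
        if stops not in index:
            index[stops] = len(patterns)
            patterns.append(stops)
        assignment[trip_id] = index[stops]
    return patterns, assignment
```

The same trick applies to run-time profiles: key them by the tuple of inter-stop travel times instead of stop IDs.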

stefanw

Apr 12, 2013, 10:14:56 AM
to transit-d...@googlegroups.com
Hi Thomas,

I recently found a protocol buffer model for GTFS:

Might be useful.

Cheers
Stefan

Laurent GRÉGOIRE

Apr 12, 2013, 10:43:48 AM
to transit-d...@googlegroups.com
Hi Thomas,

You can exploit certain data patterns specific to transit data to easily
compress it. If you assume that your data will be read-only, you can use
the following ideas:

- Compress strings using dictionaries [1] or interning [2]. Lots of
stop names share the same keywords or prefixes. Compressing route names
is probably not necessary.

- For stop latitude/longitude you do not need a double (64 bits); a float
would suffice. You can reduce storage even further by using the fact
that all stops belong to the same area: store only the delta lat/lon
from the middle point of your set (16 bits may suffice in that case if
the covered area is not too large -- use fixed-point arithmetic).

- Store stop-time deltas, not absolute values, since the delta between
stop points is usually rather small. Round to the minute if you do not
need precision to the second. Use a variable-length code to further
optimize storage [3].

- Compress common trips (identical stop sequence and stop deltas) by
storing only a start time and a common pattern. A more optimized approach
would be to store, for each common stop sequence, a base pattern and a
delta to that pattern; variable-length coding would be even more efficient
here, as the delta to the pattern would probably contain lots of zeros. You
could also re-use a common stop sequence across trips with distinct time
patterns, at the expense of a bit more complexity.

- Intern/dictionary-encode trip headsigns. Usually there are lots of
duplicates here.

- Store pickup/drop-off types in a bit set (8 bits should suffice).

- Re-shuffle stop IDs and use an internal index instead of the original
ID. If not needed externally, you can even drop the original string stop ID.

- The stop sequence is probably not necessary, depending on your storage
(an array index is sufficient).

- Encode only the departure if departure and arrival times are equal;
encode the delta if not. Again, use variable-length encoding.

- Drop trip IDs and use internal indexing; they are probably not needed,
as they are only used as an internal key into stop_times.

- For calendars, encode a start date and a variable-length bit set
representing the set of active days from that start date. Optionally trim
dates in the past or the far future, depending on your application.
The added value of this is that you do not need to compute the effective
set of active days for each calendar during pre-processing or reading.

- For shapes, re-use common segments, and apply the same strategies as
above: use variable-length coding, encode only the delta from the
previous point, use fixed-point arithmetic, re-index shape IDs to an
internal index... Drop shapes entirely if not needed.
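
The delta plus variable-length-code items above might look like this (a sketch of my own using an LEB128-style varint, the same scheme protocol buffers use):

```python
def encode_varint(n):
    """Varint: 7 data bits per byte, high bit set on continuation bytes."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def encode_stop_times(times):
    """Delta-encode a run of stop times (seconds), then varint each delta.
    Small inter-stop deltas (a few minutes) take one byte each."""
    out, prev = bytearray(), 0
    for t in times:
        out += encode_varint(t - prev)
        prev = t
    return bytes(out)

def decode_stop_times(data):
    """Inverse of encode_stop_times: accumulate varint deltas."""
    times, total, cur, shift = [], 0, 0, 0
    for b in data:
        cur |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:
            total += cur
            times.append(total)
            cur, shift = 0, 0
    return times
```

Three stop times that take 24 characters as CSV text fit in 5 bytes this way.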

All this should not have a big impact on reading performance (decoding
should be fast) -- it may even improve it (less memory = a better cache
hit rate, less data = faster processing).
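
The calendar item might be sketched as follows (my own toy layout; the little-endian bit order is an arbitrary choice):

```python
from datetime import date

def encode_calendar(active_days):
    """Encode a set of active service dates as (start_date, bitmap),
    one bit per day counted from the start date."""
    start = min(active_days)
    span = (max(active_days) - start).days + 1
    bits = 0
    for d in active_days:
        bits |= 1 << (d - start).days
    return start, bits.to_bytes((span + 7) // 8, "little")

def is_active(start, bitmap, d):
    """Test a date directly against the bitmap, with no pre-processing."""
    i = (d - start).days
    byte, bit = divmod(i, 8)
    return 0 <= i and byte < len(bitmap) and bool(bitmap[byte] >> bit & 1)
```

A full year of service dates fits in a start date plus 46 bytes.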

HTH,

--Laurent

[1] www.dcc.uchile.cl/~gnavarro/ps/sea11.1.pdf
[2] http://en.wikipedia.org/wiki/String_interning
[3] http://en.wikipedia.org/wiki/Variable-length_code

Andrew Byrd

Apr 12, 2013, 11:15:25 AM
to transit-d...@googlegroups.com
On 04/12/2013 04:43 PM, Laurent GRÉGOIRE wrote:
> - For stop latitude/longitude you do not need double (64 bits), float
> would suffice. You can even reduce amount of storage by using the fact
> that all stops belongs to the same area: store only the delta lat/lon
> from the middle point of your set (16 bits may suffice in that case if
> the covered area is not too large -- use fixed-point arithmetics).

Thanks Laurent for this list of excellent advice.

Note that 32-bit floats concentrate resolution near the prime (0)
meridian. If storing standard lat/lon (rather than deltas from a center
point), resolution is relatively low on the other side of the Earth.

A 32-bit float reserves some bits for the exponent, yielding a
worst-case resolution of something like earth_circumference / 2^23, or
~5 meters. A fixed-point representation ("binary radians") provides
more uniform resolution: earth_circumference / 2^32, or ~1 cm
in the worst case (longitude at the equator).
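
The arithmetic checks out (a quick sanity computation using the equatorial circumference):

```python
EARTH_CIRC_M = 40_075_017  # equatorial circumference in metres

# A 32-bit float keeps a 24-bit significand (23 stored bits plus an implicit
# leading 1), so near +/-180 degrees the spacing between representable
# longitudes is on the order of a 2**-23 fraction of the full circle:
float_worst_res_m = EARTH_CIRC_M / 2**23   # roughly 5 m

# 32-bit fixed point divides the circle into 2**32 equal steps everywhere:
fixed_res_m = EARTH_CIRC_M / 2**32         # roughly 1 cm
```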

-Andrew

Laurent GRÉGOIRE

Apr 12, 2013, 12:46:06 PM
to transit-d...@googlegroups.com
On 12/04/2013 17:15, Andrew Byrd wrote:
> Note that 32-bit floats concentrate resolution near the prime (0)
> meridian. If storing standard lat/lon (rather than deltas from a center
> point) resolution is relatively low on the other side of the Earth.
>
> A 32-bit float reserves some bits for the exponent, yielding a
> worst-case resolution of something like (earth_circumference / 2^23) or
> ~5 meters. A fixed point representation ("binary radians") will provide
> more uniform resolution, earth_circumference / 2**32 or ~1 cm resolution
> in the worst case (longitude at the equator).

To alleviate that, my suggestion was to use the delta between each point
and the "middle" point (either the weighted average of all points, or the
center of the bounding box). This leads to:

1) If using floating-point arithmetic (32-bit floats), you get the
nice property that the closer a point is to the average, the more
precise its location: which fits nicely with the fact that most of
the points of interest (and the most "urban" ones) are probably closer to
your center.

2) If using fixed-point arithmetic, you have constant precision all
over the place. Depending on the size of your bounding box, 32 bits
should give you fair precision (a 100 km square area gives you a
precision of ~0.05 mm).
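
That figure is easy to verify (one plausible reading of it, assuming 2^31 fixed-point steps across the span):

```python
span_m = 100_000                  # a 100 km square area
steps = 2**31                     # assumed step count across the span
res_mm = span_m / steps * 1000    # resolution in millimetres, ~0.05 mm
```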

That said, I'm wondering why we do not always use fixed-point
arithmetic when storing lat/lon. Floating point greatly favors people
living at the equator and/or along the Greenwich meridian :)

--Laurent

Andrew Byrd

Apr 13, 2013, 10:27:52 AM
to transit-d...@googlegroups.com
On 04/12/2013 06:46 PM, Laurent GRÉGOIRE wrote:
> Saying that, I'm wondering why we do not always use fixed-point
> arithmetics when storing lat/lon. Floating-point greatly favor people
> living at the equator and/or along greenwich meridian :)

Well, the floating-point representation is particularly accurate around
a point in the Gulf of Guinea that GTFS feeds indicate is heavily served
by high-speed transit ;)

Consider also that when running on a platform like the JVM that has
silent overflow, integers wrap around from +180 degrees to -180 degrees
(INT_MAX to INT_MIN), which corresponds well to the shape of the Earth.
You don't need to range-check longitude operations! Unfortunately,
latitude is not quite as neat: (int x int) is a torus rather than a
sphere, so you either have to do a Pac-Man wraparound from the North
Pole to the South, or there are two representations for every point on
the Earth's surface.
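
Python integers never overflow, but the JVM-style wraparound can be simulated by masking to 32 bits (a toy model of my own; names are made up):

```python
SCALE = 2**32 / 360  # fixed-point units per degree of longitude

def lon_to_fixed(deg):
    """Store a longitude as a 32-bit fixed-point word ("binary degrees")."""
    return int(round(deg * SCALE)) & 0xFFFFFFFF

def fixed_to_lon(x):
    """Reinterpret the 32-bit word as signed and convert back to degrees."""
    if x >= 2**31:
        x -= 2**32
    return x / SCALE

def add_lon(a, b):
    """Fixed-point addition; the mask makes overflow wrap like a JVM int."""
    return (a + b) & 0xFFFFFFFF
```

Moving 20 degrees east from +170 overflows straight to -170, with no range check.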

Between the more even distribution of precision and this characteristic,
I also wonder why fixed-point lat/lon isn't more common.

-Andrew
