Feed identification and naming

Andrew Byrd

Dec 19, 2013, 4:14:52 AM
to gtfs-c...@googlegroups.com
TL;DR: It is important to reliably match IDs in GTFS-RT messages to
those for GTFS entities in static feeds. GTFS IDs are feed-scoped, but
feeds have no unique ID so there is no name for these scopes. Multiple
feeds may use the same IDs to refer to different entities, and an
entity's ID may change across versions of a single feed. GTFS file names
are currently uninformative. A feed identification and file naming
scheme is proposed which facilitates realtime/static entity matching and
feed versioning.

Problem
=======

The GTFS-RT spec states that every GTFS-RT feed corresponds to a single
GTFS feed. Consider the case where you have a journey planner that uses
a GTFS-RT feed in its routing process, or a real-time API server that
uses GTFS feeds to build its internal model of the transit system. When
you are working with only one GTFS feed and a single corresponding
GTFS-RT feed, all is well because all identifiers are feed-unique, i.e.
they are not split into separate agency namespaces.

Now consider the case where you have two GTFS feeds and two
corresponding GTFS-RT feeds. The two GTFS feeds come from different
agencies, but both use the same set of identifiers for their disjoint
sets of stops and trips (e.g. integers). We know that each RT feed is
associated with a specific GTFS feed, but we have no way of identifying
the objects created from that feed. One strategy for confronting this,
the one adopted by OneBusAway and therefore by software using OBA for
GTFS import such as OpenTripPlanner, is to use an agency ID to identify
a feed. However, agency IDs fail to uniquely identify GTFS feeds
because: a) one feed may contain more than one agency, b) two feeds may
provide information about the same agency, c) two feeds providing data
about different agencies may nonetheless use the same agency ID.

Within a single GTFS feed, there are elements associated with an agency
such as trips, and others which are not such as stops. Even trips (which
have an agency) are required to have feed-unique identifiers independent
of the agency field. In fact, for purposes other than fare computation,
agencies in GTFS are on the same level as route color -- they are
informational, not structural. In database parlance, they are not
included in composite foreign keys. GTFS IDs are naturally feed-scoped,
but feeds have no unique ID so there is no name for these scopes.
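
To make the collision concrete, here is a minimal Python sketch. The feed
names "feed_a" and "feed_b" and the stop data are invented for
illustration and are not part of GTFS. It shows that keying entities by
stop_id alone silently overwrites entities, while keying by a
(feed scope, stop_id) pair keeps them apart.

```python
# Two hypothetical feeds that both number their stops from 1.
feed_a_stops = {"1": "Central Station", "2": "Market Square"}
feed_b_stops = {"1": "Harbour Terminal", "2": "Airport"}


def merge_unscoped(*feeds):
    """Merge stops keyed by stop_id only; later feeds overwrite earlier ones."""
    merged = {}
    for stops in feeds:
        merged.update(stops)
    return merged


def merge_scoped(named_feeds):
    """Merge stops keyed by (feed_name, stop_id); no entity is lost."""
    merged = {}
    for feed_name, stops in named_feeds.items():
        for stop_id, stop_name in stops.items():
            merged[(feed_name, stop_id)] = stop_name
    return merged


unscoped = merge_unscoped(feed_a_stops, feed_b_stops)
scoped = merge_scoped({"feed_a": feed_a_stops, "feed_b": feed_b_stops})
print(len(unscoped), len(scoped))
```

The scoped variant only works if every consumer agrees on the scope
names, which is exactly the piece that is missing today.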

Proposed solution
=================

GTFS feeds should have unique identifiers, and GTFS-RT feeds should
reference that identifier to allow matching of IDs. Here, "GTFS feed"
does not mean a single instance of a feed, but rather a series of
successive releases or versions of a feed containing information about
the same agencies and routes. All versions of a feed would use the same
identifier. Identifiers would be alphanumeric character strings limited
to a certain maximum length. Two feeds with the same ID refer to the
same entities, and the newer one supersedes the older one. The version
ID should encode the order in which feeds were released. We propose that
the version number be the release date in ISO8601 format. If the dash
separators are not used, this is an increasing integer; if the dashes
are retained the dates still follow a lexical ordering.
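
As a rough illustration of the ordering claim, the following Python
sketch formats release dates both ways and sorts them as plain strings.
The helper name version_from_date and the sample dates are assumptions
made for this example only.

```python
from datetime import date


def version_from_date(release: date, dashes: bool = False) -> str:
    """Format a release date as an ISO 8601 version string."""
    return release.isoformat() if dashes else release.strftime("%Y%m%d")


releases = [date(2013, 12, 1), date(2013, 11, 27), date(2014, 1, 5)]

# Plain string sorting matches chronological order in both forms.
compact = sorted(version_from_date(d) for d in releases)
dashed = sorted(version_from_date(d, dashes=True) for d in releases)
print(compact)
print(dashed)
```

The ordering only holds if a feed sticks to one of the two forms; a
dashed version always sorts before a compact one from the same year.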

GTFS feed files are often named "google-transit.zip", so when a GTFS
consumer downloads several feeds they cannot be distinguished by name.
Additionally, GTFS is no longer a Google-specific format. Therefore we
propose a new naming convention: the unique feed ID followed by the
version ID and a standard GTFS file extension. For example,
OVAPI-20131127.gtfs.zip.

There should be a way to know when a GTFS-RT feed is out of sync with
the GTFS feed one is using. The RT feed should contain a feed ID and
version number which match those in the GTFS feed. When the RT feed's
version number changes, this is a sign that a new version of the GTFS
feed is being referenced, and that feed should be fetched and re-loaded.
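
A consumer-side check along these lines might look like the sketch below.
It assumes the RT feed exposes a feed ID and a date-based version string;
GTFS-RT has no such fields today, so FeedRef and sync_status are
hypothetical names used only to illustrate the comparison.

```python
from typing import NamedTuple


class FeedRef(NamedTuple):
    feed_id: str  # proposed feed identifier (hypothetical field)
    version: str  # proposed date-based version, e.g. "20131127"


def sync_status(loaded: FeedRef, announced: FeedRef) -> str:
    """Compare the static feed we loaded with the one the RT feed references."""
    if loaded.feed_id != announced.feed_id:
        return "different-feed"
    if announced.version > loaded.version:
        return "static-feed-outdated"  # fetch and reload the static feed
    if announced.version < loaded.version:
        return "realtime-feed-behind"
    return "in-sync"


loaded = FeedRef("OVAPI", "20131127")
print(sync_status(loaded, FeedRef("OVAPI", "20131201")))
```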

Occasionally one might want the identifiers from two feeds to collide.
One example is the New York City bus feeds, with one feed per borough.
These feeds reference the same stops where trips cross borough
boundaries, using the same IDs (e.g. stops 402130, 402132, 402134).
Perhaps splitting feeds in this way should be discouraged, but where
this practice is deemed necessary a subfeed ID could be employed: feed
identifiers would be split into two parts using a special character
(underscore) and only the first part of the id is used to determine
identifier scope; identifiers used in more than one subfeed of the same
top-level feed would be in the same namespace. The result is essentially
a single large feed split across several files. Thus the first two feeds
in this list would have the same stop, trip, and route namespace, which
would be separate from namespaces of the last two feeds:
NYCTBus_Brooklyn-2013-12-01.gtfs.zip
NYCTBus_Queens-2013-12-01.gtfs.zip
NYSubway-2013-12-01.gtfs.zip
LIRR-2013-12-01.gtfs.zip

A specific feed version is uniquely identified by a tuple of (feed_id,
subfeed_id, version) where subfeed_id can be the empty string. Within a
single series of feeds having the same (feed_id, subfeed_id), versions
are ordered lexically by version string. It is debatable how useful the
subfeed feature would be, since a robust multi-feed GTFS reader would
have an entity deduplication mechanism, and the practice of splitting
feeds is relatively uncommon.
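
The naming scheme could be parsed roughly as follows. This is only a
sketch: it assumes feed IDs are alphanumeric (so the first dash separates
the ID from the version) and that the underscore marks the subfeed, as
described above. The names FeedVersion, parse_feed_filename and
same_namespace are invented for the example.

```python
from typing import NamedTuple

SUFFIX = ".gtfs.zip"


class FeedVersion(NamedTuple):
    feed_id: str
    subfeed_id: str  # empty string when there is no subfeed
    version: str


def parse_feed_filename(name: str) -> FeedVersion:
    """Split e.g. 'NYCTBus_Brooklyn-2013-12-01.gtfs.zip' into its parts."""
    if not name.endswith(SUFFIX):
        raise ValueError(f"not a {SUFFIX} file: {name}")
    stem = name[: -len(SUFFIX)]
    # IDs contain no dash, so the first dash starts the version string.
    full_id, dash, version = stem.partition("-")
    if not dash or not full_id or not version:
        raise ValueError(f"expected <feed_id>-<version>{SUFFIX}: {name}")
    feed_id, _, subfeed_id = full_id.partition("_")
    return FeedVersion(feed_id, subfeed_id, version)


def same_namespace(a: FeedVersion, b: FeedVersion) -> bool:
    """Only the top-level feed_id determines the identifier scope."""
    return a.feed_id == b.feed_id


brooklyn = parse_feed_filename("NYCTBus_Brooklyn-2013-12-01.gtfs.zip")
queens = parse_feed_filename("NYCTBus_Queens-2013-12-01.gtfs.zip")
subway = parse_feed_filename("NYSubway-2013-12-01.gtfs.zip")
print(same_namespace(brooklyn, queens), same_namespace(brooklyn, subway))
```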

Some changes to feed_info.txt would be necessary. A new field feed_id
would be added. Its value must match the first element in the GTFS file
name (including the subfeed ID). The value in the existing feed_version
field should match the version element of the file name. It is
recommended or required that version names be the release date. The feed
validator could verify that URLs, feed IDs, and version numbers match
those in the filename and the registry.
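
The filename check a validator might perform can be sketched in Python as
below. feed_id is the proposed new column and does not exist in
feed_info.txt today; the function names and the demo archive are
assumptions for illustration, and the registry check is left out.

```python
import csv
import io
import zipfile

SUFFIX = ".gtfs.zip"


def read_feed_info(zip_bytes: bytes) -> dict:
    """Return the first data row of feed_info.txt as a dict."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        with archive.open("feed_info.txt") as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            return next(csv.DictReader(text))


def check_against_filename(filename: str, zip_bytes: bytes) -> list:
    """List mismatches between the file name and feed_info.txt."""
    if not filename.endswith(SUFFIX):
        return [f"file name does not end in {SUFFIX}"]
    # The ID part of the name includes the subfeed ID, if any.
    expected_id, _, expected_version = filename[: -len(SUFFIX)].partition("-")
    info = read_feed_info(zip_bytes)
    problems = []
    if info.get("feed_id") != expected_id:
        problems.append("feed_id does not match file name")
    if info.get("feed_version") != expected_version:
        problems.append("feed_version does not match file name")
    return problems


def make_demo_zip(feed_id: str, feed_version: str) -> bytes:
    """Build a tiny in-memory archive holding only feed_info.txt."""
    header = "feed_publisher_name,feed_publisher_url,feed_lang,feed_id,feed_version\n"
    row = f"Example Publisher,http://example.org,en,{feed_id},{feed_version}\n"
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("feed_info.txt", header + row)
    return buffer.getvalue()


print(check_against_filename("OVAPI-20131127.gtfs.zip",
                             make_demo_zip("OVAPI", "20131126")))
```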


Maintaining Uniqueness
======================

Uniqueness of feed IDs could be ensured with a simple lightweight CSV
registry. IDs would be assigned on a first-come first-served basis,
perhaps with some priority given to "official" feeds from transport
authorities. Each identifier would be associated with a feed-producing
organization and a canonical URI where static and realtime feeds are to
be retrieved. Fields in the feed registry could include: feed_id,
organization_name, gtfs_url, gtfsrt_http, gtfsrt_websocket.

The main purpose would be to prevent naming conflicts when new feeds
appear, so registration could even happen passively, with feed users
adding an existing feed's ID to the registry. This would also serve as a
survey of available public GTFS feeds.
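
As a sketch of how little machinery such a registry needs, the Python
below loads a CSV with the fields listed above and refuses to register an
ID twice. The sample rows, the example.org URLs and the function names
are placeholders invented for this example; IDs are compared exactly,
which assumes they are case-sensitive.

```python
import csv
import io

REGISTRY_FIELDS = ["feed_id", "organization_name", "gtfs_url",
                   "gtfsrt_http", "gtfsrt_websocket"]

SAMPLE_REGISTRY = """feed_id,organization_name,gtfs_url,gtfsrt_http,gtfsrt_websocket
ExampleFeed,Example Transit Authority,http://example.org/gtfs.zip,http://example.org/rt.pb,
"""


def load_registry(text: str) -> dict:
    """Index registry rows by feed_id."""
    return {row["feed_id"]: row for row in csv.DictReader(io.StringIO(text))}


def register(registry: dict, entry: dict) -> None:
    """Add an entry on a first-come first-served basis."""
    feed_id = entry["feed_id"]
    if feed_id in registry:
        raise ValueError(f"feed_id already registered: {feed_id}")
    registry[feed_id] = entry


registry = load_registry(SAMPLE_REGISTRY)
register(registry, dict.fromkeys(REGISTRY_FIELDS, "") | {"feed_id": "OtherFeed"})
print(sorted(registry))
```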


Points to clarify
=================

ID form: case-sensitive alphanumeric strings, up to N characters?
Should dash separators be removed from date-based version IDs, yielding
an integer? Consumers are likely to treat version names as strings in
either case. How do we handle a bugfix feed release within the same day?
Maybe an additional field is in order, but a simple letter suffix would
not disrupt the lexical ordering of versions within a day. Adding the
ISO8601 time (THH:MM:SS) would have the same effect, but colons are
generally to be avoided in filenames.
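
A quick Python check of the same-day options, using made-up version
strings: both a letter suffix and a colon-free ISO 8601 basic-format time
("T093000") keep plain string ordering intact.

```python
# Hypothetical versions: two bugfix releases on 2013-11-27, then a new day.
with_letters = ["20131128", "20131127b", "20131127", "20131127a"]
with_times = ["20131128", "20131127T153000", "20131127", "20131127T093000"]

ordered_letters = sorted(with_letters)
ordered_times = sorted(with_times)

# The newest version is simply the lexical maximum.
latest = max(with_letters)
print(ordered_letters, ordered_times, latest)
```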


This change should not significantly complicate the process of producing
and releasing a feed. It amounts to a naming convention and a
(potentially passive) simple registration process. If a feed producer
does not follow the convention, it will be straightforward for a feed
consumer to simply rename files when they are retrieved.

Andrew Byrd, Conveyal
Jorden Verwer, Go About

Frank

Dec 19, 2013, 1:08:34 PM
to gtfs-c...@googlegroups.com
 -1

Having a date in the name of the GTFS url makes two things overly complicated:

    1)  With a date-based naming scheme, the Agency now has to update the url paths on their GTFS web page every time a new feed is generated.  (And no, I can't just have a link to a folder on my website where I place new gtfs files; PCI security audit rules dictate that my website not have endpoints that do raw directory listings ... the folder would have to contain an index.html file, so I'd be required to update that index.html page with new file name on every GTFS publish).

    2) As a consumer of GTFS who has cron scripts that automatically download an Agency's GTFS data, my cron scripts would no longer have a consistent url path to follow.  Without a predictable url path, my automation scripts would have to obtain the GTFS url path by interrogating (screen-scraping) the Agency's GTFS webpage.


jorden...@goabout.com

Dec 19, 2013, 1:36:43 PM
to gtfs-c...@googlegroups.com
Hello Frank,

You make a good point, but perhaps we can find some kind of compromise where there's a canonical URL (the one with the date in it) and also a symlink (or just a copy, if a symlink is unsupported) with a URL ending in something like -latest.gtfs.zip, so that you'd get the best of both worlds.

Regards,

Jorden

Andrew Byrd

Dec 19, 2013, 6:00:18 PM
to gtfs-c...@googlegroups.com
Hi Frank,

I was thinking about this as a feed consumer, thanks for the feed
publisher perspective. Both are valid points, and in any case
information encoded in the file name is easily corrupted. The file
naming scheme is the less important part of the idea, is not essential,
and perhaps should not have been lumped together with the feed
identification component.

The important part is unique feed identifiers and the additional fields
in feed_info.txt, which give a name to the feed-scope and allow matching
RT entities to those created from multiple static feeds.

-Andrew
> --
> You received this message because you are subscribed to the Google
> Groups "General Transit Feed Spec Changes" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to gtfs-changes...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/gtfs-changes/f7e1ef28-f2d3-4a69-87f6-aa29c333d0a6%40googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.

Brian Ferris

Dec 19, 2013, 6:09:45 PM
to gtfs-c...@googlegroups.com
While I think it'd be cool to have a global id system for transit feeds, I've always been a little skeptical that this would be used / maintained over time.  I guess I'd be happy to be proven wrong.

However, for the purpose of connecting GTFS and GTFS-realtime feeds, might it not be simpler to just include a link to the real-time feed in the GTFS feed?


Andrew Byrd

Dec 19, 2013, 6:35:14 PM
to gtfs-c...@googlegroups.com
Hi Brian,

It might be simpler in some ways, but imagine the case where a feed
provider has several realtime endpoints: one is an HTTP pull alerts
feed, one is HTTP pull trip updates, and one is a differential trip
updates websocket. Various feed consumers might want to use one or
several of these sources, each of which will have a different URL.

The basic functional need is to give the feed namespace an unambiguous
name so messages can be matched to the proper one when several are
present. Would the static feed contain a list of all the realtime URLs?
In feed consumer apps, does it make general sense to label the feed
namespace with a realtime data URL, or more bizarrely, a set of realtime
data URLs?

An additional problem is that this implies a single source for a feed.
This is not truly the case with open data, which may be copied, proxied,
transformed, and redistributed. If the realtime data has been filtered
and re-broadcast, its URL may no longer match anything mentioned in the
static feed.

The idea of some worldwide, unique, centralized registry of feed names
might arguably go against the decentralized nature of feed production
and distribution. But if we look at this pragmatically, what really
matters is that within a single coverage area (city, regional, or
continental) there is some widely recognized unique name for each feed
that interacts with other outside data sources. Basically feeds with
realtime data in regions with multiple feeds would truly need to use
feed IDs for this to function. Registration serves mainly to prevent
name collisions, and eventual registration or use of feed IDs outside
the places where it's necessary (multiple feeds + realtime) would just
be extra value, a list of various available feeds.

-Andrew

Andrew Byrd

Dec 19, 2013, 7:23:04 PM
to gtfs-c...@googlegroups.com
On 12/20/2013 12:35 AM, Andrew Byrd wrote:
> that interacts with other outside data sources. Basically feeds with
> realtime data in regions with multiple feeds would truly need to use
> feed IDs for this to function. Registration serves mainly to prevent...

Correction: "Basically _only_ feeds with associated realtime data in
regions with multiple feeds would truly need to use feed IDs for this to
function..."

Matt Conway

Dec 19, 2013, 10:47:23 PM
to gtfs-c...@googlegroups.com

One way to maintain the centralized repository would be to make the IDs reversed domain names a la Java, i.e. BART's feed ID would be gov.bart. I think someone suggested this at TransportationCamp West 2011, but I don't recall who.

Matt

Brian Ferris

Dec 20, 2013, 2:32:23 AM
to gtfs-c...@googlegroups.com
> The basic functional need is to give the feed namespace an unambiguous
> name so messages can be matched to the proper one when several are
> present. Would the static feed contain a list of all the realtime URLs?

Sure.  Why not?  We're talking about maybe 1-3 URLs in practice.  That doesn't seem like an unreasonable amount of data to include in a feed.  If someone modifies a GTFS feed in some way, it will be their decision whether keeping those URLs still makes sense.

> In feed consumer apps, does it make general sense to label the feed
> namespace with a realtime data URL, or more bizarrely, a set of realtime
> data URLs?

I think you are mixing two different problems here:

1) Your desire to match a given GTFS feed with its associated GTFS-realtime feed.  Again, I think this could be solved with less complexity by just putting the URLs in the GTFS feed.

2) Your desire to have a unique, stable namespace/id for each feed such that when implementing transit applications that combine data from multiple feeds, you can generate a unique id for every entity.

I don't argue that #2 isn't useful or important, but I think you can solve #1 without it.  As for the unique namespace, at Google we just configure a unique id by hand when adding a new partner.  This has the nice property of being a stable id, even if the partner's GTFS gets corrupted in some way.  More importantly, it doesn't require any effort by the agency, which does work well with the decentralized nature of open data.

For me, the idea of a global database of transit feeds usually comes up through discussion of transfers and fare systems that cross transit network and agency boundaries.  Aka, it'd be nice to have some way for two transit feeds to reference each other.  Global ids would help, but I think it'd be more practical to favor local coordination.  Aka two agencies working together to link their data could pick some common id that's likely to be globally unique (common examples include their domain name) to help match between feeds, but it doesn't have to be TRULY globally unique (feed consumers could apply some simple geographic constraints to avoid applying connections where they don't exist).

Andrew and others, I'm curious to hear your thoughts on what other kind of applications you think would be possible if we had a global id database for every transit feed.


jorden...@goabout.com

Dec 20, 2013, 4:38:05 AM
to gtfs-c...@googlegroups.com
> Andrew and others, I'm curious to hear your thoughts on what other kind of applications you think would be possible if we had a global id database for every transit feed.
It'd make it much easier for someone to find all GTFS feeds that might be relevant to them.

Another important advantage of our proposal (which I think has been overlooked so far) is being able to deduce from a GTFS-RT feed that the corresponding static GTFS feed is out of date. This is a very useful and cool feature, in my opinion.

Regards,

Jorden

Brian Ferris

Dec 20, 2013, 4:43:27 AM
to gtfs-c...@googlegroups.com
Do you really need a global id system for that?  Once you've established the connection between a GTFS feed and a corresponding GTFS-realtime feed (again, putting the URL in the GTFS seems straightforward), then the combination of the feed_version field of feed_info.txt in the GTFS feed and a similar version field in the GTFS-realtime feed should be enough for that.



Andrew Byrd

Dec 20, 2013, 6:08:15 AM
to gtfs-c...@googlegroups.com
Hi Brian and Jorden,

Indeed we're blending several problems because they happened to arise
together in our experience. We can try to address them more separately.

Yes, I believe you need a global ID system to make these things work
well, but the URL _is_ a global ID system, so it's an option. Using the
URL rather than another ID does conflate the network location of a
single resource with that resource's namespace, and that namespace could
in theory span several feeds from several sources.

The potential problems arise in the case where feed publication for a
region is not under centralized control, where a static feed's producer
does not know or approve of a corresponding real-time feed, or simply
doesn't bother to include its URL in the static feed. I suppose in those
cases it would be reasonable to expect more manual configuration of the
feed consumer application. In the majority of cases feeds will come in
pairs or triplets, all amicably coordinated.

In either case we still need a name by which to identify GTFS entity
namespaces in API conversations. GTFS feeds are allowed to contain data
for multiple agencies, so agency ID is not a viable option. If feeds do
not have predefined unique IDs, feed consumers must invent names for
them and perform extra configuration of their applications.
Automatically using an agency name for this purpose is treating the
special case of GTFS where there is one agency per feed as the norm,
which can be very confusing for the API user. Here I'm hinting at a
specific situation in OpenTripPlanner and OneBusAway, but one which will
be present in other multi-feed consumer applications.

Would it hurt to allow a multi-agency feed producer to optionally
specify a feed ID in their feed? Just introducing the concept of a feed
ID that is separate from an agency ID and allowing that feed ID to be
specified in the feed itself would seem to make things cleaner.

The other source of difficulties is splitting feeds across multiple
files. This creates a situation where a realtime feed could in theory
apply to a family of schedule feeds, and requires one-to-many
relationships in both directions. Perhaps the solution is to simply
avoid splitting feeds into pieces. I've always wondered why someone
would want e.g. the Brooklyn buses without the Queens buses. (The lack
of geographic overlap between the borough feeds is remarkable, but the
constraints placed on practical systems by administrative boundaries are
another subject.)

Whatever we decide on, for most common cases it would be nice to have
fields that associate static and real-time URLs, and those could be in
feed_info.txt in the schedule feed rather than an external registry.
This approach is less fragile in that it eliminates a centrally
maintained resource.

-Andrew


Andrew Byrd

Dec 20, 2013, 6:12:25 AM
to gtfs-c...@googlegroups.com
A lot could be done with a couple of extra fields in feed_info.txt. We
should separately consider all these points:

concise regionally unique feed IDs
ability of feed producers to suggest such a unique ID in the feed
inclusion of RT URLs in schedule feeds
inclusion of schedule URLs in RT feeds
inclusion of version information in schedule and RT feeds
centralized registry of such IDs and URLs

-Andrew

Brian Ferris

Dec 20, 2013, 12:36:44 PM
to gtfs-c...@googlegroups.com
concise regionally unique feed IDs: +1
ability of feed producers to suggest such a unique ID in the feed: +1
inclusion of RT URLs in schedule feeds: +1
inclusion of schedule URLs in RT feeds: ~0
inclusion of version information in schedule and RT feeds: +1
centralized registry of such IDs and URLs: happy to let someone else do the work ;)



Ross Peterson

Dec 20, 2013, 1:48:28 PM
to gtfs-c...@googlegroups.com
concise regionally unique feed IDs: +1

Thinking about this from the perspective of the nascent proposals to extend GTFS for flexible transit services, it is becoming clear that the eligibility data common to many flexible transit providers (i.e. human service, paratransit, community transit providers) should be maintained in a feed separate from their service data (service area polygon, hours of operation, etc.). More on this here. Unique identifiers will greatly aid in matching these data sets.

Ross

Andrew Byrd

Jan 27, 2014, 5:31:52 PM
to gtfs-c...@googlegroups.com
On Friday, December 20, 2013 6:36:44 PM UTC+1, Brian Ferris wrote:
> concise regionally unique feed IDs: +1
> ability of feed producers to suggest such a unique ID in the feed: +1
> inclusion of RT URLs in schedule feeds: +1
> inclusion of schedule URLs in RT feeds: ~0
> inclusion of version information in schedule and RT feeds: +1
> centralized registry of such IDs and URLs: happy to let someone else do the work ;)

Seeing no complaints and some +1s, we plan to implement a subset of this proposal on the producer side in the Dutch OpenOV feeds. feed_id and feed_version fields will be included in feed_info.txt, and corresponding information will be added to the realtime feeds. The realtime fields could be added under OpenOV's 1003 extension. In feeds with only one agency, the feed ID could default to the agency ID.

Consuming the information in OpenTripPlanner is going to involve proposing a patch to OneBusAway.

Including realtime URLs is more of a challenge: OpenOV has separate trip update, vehicle position, and alert feeds available via HTTP, as well as websockets streaming updates. It does not seem wise to introduce a separate feed_info field for every possible GTFS-RT data source type. Any suggestions on how we might handle this?

As for the registry of feed IDs, I could host it. It does not need to be exhaustive or perfect, just serve as a reference and prevent name collisions. 

-Andrew

Stefan de Konink

Jan 27, 2014, 5:42:55 PM
to gtfs-c...@googlegroups.com
On Monday, January 27, 2014 11:31:52 PM CEST, Andrew Byrd wrote:
> Any suggestions on how we might handle this?

What about a repeated string field? That would allow specifying either
separate feeds or an all-in-one feed.

> As for the registry of feed IDs, I could host it. It does not
> need to be exhaustive or perfect, just serve as a reference and
> prevent name collisions.

What about asking Pieter and get this on datahub.io?

Stefan

Andrew Byrd

Jan 28, 2014, 7:25:33 AM
to gtfs-c...@googlegroups.com

On 01/27/2014 11:42 PM, Stefan de Konink wrote:
> On Monday, January 27, 2014 11:31:52 PM CEST, Andrew Byrd wrote:
>> Any suggestions on how we might handle this?
>
> What about a repeated string field? That would either allow to specify
> seperate feeds, or an all in one go feed.

Yes, of course we could just concatenate the URLs together with a
separator character, but then the URLs are somewhat useless because we
don't know which kind of information they point to. The protocol part of
each URL will allow us to distinguish between websockets and HTTP pull,
but a client won't know which feed contains trip updates, alerts,
vehicle positions, or some combination of those.

>> As for the registry of feed IDs, I could host it. It does not need to
>> be exhaustive or perfect, just serve as a reference and prevent name
>> collisions.
>
> What about asking Pieter and get this on datahub.io?

Right, it might be better to host this list as part of some more general
service. I would like to keep the registry as simple as possible though.
Something like a hand-maintained CSV file with columns feed_id,
feed_provider, feed_url, notes.

-Andrew

Brian Ferris

Jan 31, 2014, 4:54:58 AM
to gtfs-c...@googlegroups.com
Maybe it's worth having a separate file?  Something like realtime_feeds.txt with fields like "url" (string) and "trip_updates" "alerts" & "vehicle_positions" (all 1 or 0 to indicate entity types present in the feed).  Might be overkill but maybe not...



Stefan de Konink

Jan 31, 2014, 4:55:39 AM
to gtfs-c...@googlegroups.com
On Friday, January 31, 2014 10:54:58 AM CEST, Brian Ferris wrote:
> Maybe it's worth having a separate file? Something like
> realtime_feeds.txt with fields like "url" (string) and
> "trip_updates" "alerts" & "vehicle_positions" (all 1 or 0 to
> indicate entity types present in the feed). Might be overkill
> but maybe not...

+1

Andrew Byrd

Feb 3, 2014, 5:44:18 AM
to gtfs-c...@googlegroups.com
On 01/31/2014 10:54 AM, Brian Ferris wrote:
> Maybe it's worth having a separate file? Something like
> realtime_feeds.txt with fields like "url" (string) and "trip_updates"
> "alerts" & "vehicle_positions" (all 1 or 0 to indicate entity types
> present in the feed). Might be overkill but maybe not...

+1 this seems like the only way to include all the relevant information.
So the original proposal now becomes:

1. Unique feed identifiers
An optional field "feed_id" is added to feed_info.txt. This field
contains a globally unique identifier for this feed. Here "feed" does
not refer to a single zipped GTFS file, but rather to all such files
that are produced by the same feed provider and intended to cover the
same set of services. When two GTFS files have the same feed_id, one
supersedes the other. This relationship is determined by the
feed_version field and/or the validity dates also found in
feed_info.txt. In feeds with only one agency, if no feed_id is
explicitly provided the feed ID defaults to the only agency ID. In APIs
route_ids, stop_ids etc. will often need to be qualified with a feed_id.
Therefore we propose that the colon character be reserved as a separator
and be forbidden from inclusion in feed IDs. The colon is commonly used
in scoping and for hierarchical relationships in URIs.

2. Feed ID registration
These globally unique IDs are registered in a CSV file called
"feeds.txt" at a stable URL. The registry need not be exhaustive,
but it should serve as a reference and prevent name collisions,
especially in regions where multiple GTFS feed namespaces exist. I have
registered the domain transitfeed.org for this purpose.

3. Corresponding feed_ids in realtime feeds.
The new fields feed_id and feed_version will be added under OpenOV's
1003 extension. Question: is this a good place to add them?

4. Specifying realtime feeds in schedule feeds
A new optional file "realtime_feeds.txt" is added to GTFS feeds, with
the following required fields: "url", "trip_updates", "alerts", and
"vehicle_positions". The "url" specifies the location at which the feed
is found (a string). The other three fields contain a 1 or a 0 to
indicate which entity types are present in the feed. The protocol part
of the URL is sufficient to determine whether this is an HTTP pull
(http://) or streaming differential (ws://) realtime feed.
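
To illustrate points 1 and 4, here is a rough Python sketch of how a
consumer might read realtime_feeds.txt and qualify entity IDs with a
feed_id. The sample rows, the example.org URLs and all function names are
invented for this example, and treating wss:// like ws:// is an extra
assumption not stated above.

```python
import csv
import io
from urllib.parse import urlparse

ENTITY_FLAGS = ("trip_updates", "alerts", "vehicle_positions")

SAMPLE = """url,trip_updates,alerts,vehicle_positions
http://example.org/rt/trips-and-vehicles.pb,1,0,1
ws://example.org/rt/alerts,0,1,0
"""


def parse_realtime_feeds(text: str) -> list:
    """Read realtime_feeds.txt rows into dicts describing each RT source."""
    feeds = []
    for row in csv.DictReader(io.StringIO(text)):
        scheme = urlparse(row["url"]).scheme
        feeds.append({
            "url": row["url"],
            # The protocol part tells HTTP pull apart from streaming.
            "streaming": scheme in ("ws", "wss"),
            "entities": {flag for flag in ENTITY_FLAGS if row[flag] == "1"},
        })
    return feeds


def qualify(feed_id: str, entity_id: str) -> str:
    """Build a feed-scoped ID; the colon is reserved as the separator."""
    if ":" in feed_id:
        raise ValueError("feed IDs must not contain a colon")
    return f"{feed_id}:{entity_id}"


def split_qualified(qualified_id: str) -> tuple:
    """Split on the first colon only, so entity IDs may contain colons."""
    feed_id, _, entity_id = qualified_id.partition(":")
    return feed_id, entity_id


feeds = parse_realtime_feeds(SAMPLE)
print(feeds[0]["entities"], qualify("OVAPI", "402130"))
```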

Any comments or concerns?

-Andrew

Stefan de Konink

unread,
Feb 3, 2014, 6:31:14 AM2/3/14
to gtfs-c...@googlegroups.com
On Monday, February 3, 2014 11:44:18 AM CEST, Andrew Byrd wrote:
> On 01/31/2014 10:54 AM, Brian Ferris wrote:
> 1. Unique feed identifiers
> An optional field "feed_id" is added to feed_info.txt. This
> field contains a globally unique identifier for this feed. Here
> "feed" does not refer to a single zipped GTFS file, but rather
> all such files produced by the same feed provider and
> intended to cover the same set of services. When two GTFS files
> have the same feed_id, one supersedes the other.

These semantics should be defined extremely well, so that feed merging
becomes possible in the tools that are already available for it.

> 2. Feed ID registration
> These globally unique IDs are registered in a CSV file called
> "feeds.txt" at a stable URL. The registry need not be
> exhaustive, but it should serve as a reference and prevent name
> collisions, especially in regions where multiple GTFS feed
> namespaces exist. I have registered the domain transitfeed.org
> for this purpose.

I would suggest going for the 5-star approach. Get this to work as linked
open data. Hence the feed registration gives you a URL. The base URL +
/stop/ + stop_id gives you a semantic relation, similar to your feed. I
would also propose a feed version number.


> The protocol part of the
> URL is sufficient to determine whether this is an HTTP pull
> (http://) or streaming differential (ws://) realtime feed.

I am not sure, because this would directly limit the possible protocols. We
should first standardise http and websockets/pubsub so that they are
considered the only valuable methods. While I am in favor of doing so, I also
see that we are using ZeroMQ as well.

Stefan

Andrew Byrd

unread,
Feb 3, 2014, 7:27:42 AM2/3/14
to gtfs-c...@googlegroups.com
Hi Stefan,

Comments in-line below.

On 02/03/2014 12:31 PM, Stefan de Konink wrote:
> On Monday, February 3, 2014 11:44:18 AM CEST, Andrew Byrd wrote:
>> On 01/31/2014 10:54 AM, Brian Ferris wrote:
>> 1. Unique feed identifiers
>> An optional field "feed_id" is added to feed_info.txt. This field
>> contains a globally unique identifier for this feed. Here "feed" does
>> not refer to a single zipped GTFS file, but rather all such files
>> produced by the same feed provider and intended to cover the same
>> set of services. When two GTFS files have the same feed_id, one
>> supersedes the other.
>
> These semantics should be defined extremely well, so that feed merging
> becomes possible in the tools that are already available for it.

Can you give a specific example of where the given definition would
cause problems? What kind of merging are you thinking of, multiple feeds
from the same provider?

>> 2. Feed ID registration
>> These globally unique IDs are registered in a CSV file called
>> "feeds.txt" at a stable URL. The registry need not be exhaustive,
>> but it should serve as a reference and prevent name collisions,
>> especially in regions where multiple GTFS feed namespaces exist. I
>> have registered the domain transitfeed.org for this purpose.
>
> I would suggest going for the 5-star approach. Get this to work as
> linked open data. Hence the feed registration gives you a URL. The base
> URL + /stop/ + stop_id gives you a semantic relation, similar to your
> feed. I would also propose a feed version number.

It turns out that feeds already have version numbers; that's the only
reason I didn't include the concept in this revision of the proposal.

I don't fully understand the rest of what you are proposing. Would you
like to see an API for fetching summary information and subsets of the
GTFS feeds served up at the same domain as the registry? I am currently
adding an API of this kind to OpenTripPlanner so this is technically
feasible, but I cannot currently offer to deploy and support it for
every feed in the world :) This seems like a separate project to me.

>> The protocol part of the URL is sufficient to determine whether this
>> is an HTTP pull (http://) or streaming differential (ws://) realtime
>> feed.
>
> I am not sure, because this would directly limit the possible protocols. We
> should first standardise http and websockets/pubsub so that they are
> considered the only valuable methods. While I am in favor of doing so, I
> also see that we are using ZeroMQ as well.

Is ZeroMQ still in use for GTFS-RT transport? I was under the impression
we had switched to websockets specifically to avoid having to support
multiple competing methods to do the same thing.

-Andrew

Stefan de Konink

unread,
Feb 3, 2014, 7:33:49 AM2/3/14
to gtfs-c...@googlegroups.com
On Monday, February 3, 2014 1:27:42 PM CEST, Andrew Byrd wrote:
>> These semantics should be defined extremely well, so that feed merging
>> becomes possible in the tools that are already available for it.
>
> Can you give a specific example of where the given definition
> would cause problems? What kind of merging are you thinking of,
> multiple feeds from the same provider?

Scenario: 2014-01-01 - 2014-12-31 is provided. A newer feed has a validity
of 2014-01-02 - 2014-01-05. Is there data for 2014-01-06 - 2014-12-31 now?

Scenario: 2014-01-01 - 2014-12-31 is provided. A new feed with the same
validity is presented for just one route. Should the route be merged or
should the feed be replaced?


> I don't fully understand the rest of what you are proposing.
> Would you like to see an API for fetching summary information
> and subsets of the GTFS feeds served up at the same domain as
> the registry?

No, I would like to have URIs, just URIs, no APIs, that are unique.


> Is ZeroMQ still in use for GTFS-RT transport? I was under the
> impression we had switched to websockets specifically to avoid
> having to support multiple competing methods to do the same
> thing.

I am all in favor of formally standardizing on http/websockets, but that
should be done first, if it has not been done already.

Stefan

Andrew Byrd

unread,
Feb 3, 2014, 7:57:58 AM2/3/14
to gtfs-c...@googlegroups.com
On 02/03/2014 01:33 PM, Stefan de Konink wrote:
> On Monday, February 3, 2014 1:27:42 PM CEST, Andrew Byrd wrote:
>>> These semantics should be defined extremely well, so that feed merging
>>> becomes possible in the tools that are already available for it.
>>
>> Can you give a specific example of where the given definition would
>> cause problems? What kind of merging are you thinking of, multiple
>> feeds from the same provider?
>
> Scenario: 2014-01-01 - 2014-12-31 is provided. A newer feed has a
> validity of 2014-01-02 - 2014-01-05. Is there data for 2014-01-06 -
> 2014-12-31 now?
>
> Scenario: 2014-01-01 - 2014-12-31 is provided. A new feed with the same
> validity is presented for just one route. Should the route be merged or
> should the feed be replaced?

Yes, those are problematic situations. Unfortunately there does not seem
to be any obvious "principle of least surprise" interpretation. Even if
the semantics are not fully specified the proposed system will still
work for the common case: sequential versions, each one covering a
period later in time and fully replacing the previous one.
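
To make the ambiguity concrete, here is a minimal Python sketch of two
possible interpretations applied to Stefan's first scenario. Neither is
proposed as the correct rule. The class and function names, the feed_id
and version strings, and the assumption that the list is ordered by
publication time are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class FeedFile:
    feed_id: str
    feed_version: str
    start: date  # feed_start_date from feed_info.txt
    end: date    # feed_end_date from feed_info.txt


def covers(feed_file, day):
    return feed_file.start <= day <= feed_file.end


def pick_full_replacement(files, day):
    """The newest file replaces all older ones outright.

    `files` is assumed to be in publication order, oldest first.
    """
    if not files:
        return None
    newest = files[-1]
    return newest if covers(newest, day) else None


def pick_overlay(files, day):
    """The newest file that covers the day wins; older files fill the gaps."""
    for feed_file in reversed(files):
        if covers(feed_file, day):
            return feed_file
    return None


# Stefan's first scenario, with hypothetical feed_id and version strings.
old = FeedFile("example", "v1", date(2014, 1, 1), date(2014, 12, 31))
new = FeedFile("example", "v2", date(2014, 1, 2), date(2014, 1, 5))
files = [old, new]

print(pick_full_replacement(files, date(2014, 6, 1)))      # None
print(pick_overlay(files, date(2014, 6, 1)).feed_version)  # v1
```

For a day inside the newer file's range both functions return v2, which
is why the common sequential case works under either reading.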

>> I don't fully understand the rest of what you are proposing. Would you
>> like to see an API for fetching summary information and subsets of the
>> GTFS feeds served up at the same domain as the registry?
>
> No, I would like to have URIs, just URIs, no APIs, that are unique.

Ah, that's a good extension of this idea. I'm not sure whether it
belongs in the GTFS spec proper, but it should be straightforward. If we
reserve the colon character for a feed_id:entity_id separator, you can
make hierarchical URIs with no problem by prepending the entity type in
the same manner:

gtfs:{entity_type}:{feed_id}:{entity_id}

for example:

gtfs:feed:trimet
gtfs:stop:nl:145882
gtfs:route:trimet:78

It's not obvious to me whether the entity type or feed_id should come
first, but the order above allows truncating the URI down to the last
two levels in API calls where it's obvious that for a particular field
we want a gtfs:stop or gtfs:route URI so it feels more natural.
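
A minimal Python sketch of building and parsing such URIs might look
like the following. The names GtfsUri, build_uri and parse_uri are
hypothetical, and allowing a colon inside the entity_id (only the
entity type and feed_id are checked) is an assumption of this sketch.

```python
from typing import NamedTuple, Optional


class GtfsUri(NamedTuple):
    entity_type: str
    feed_id: str
    entity_id: Optional[str]  # None for URIs that name a whole feed


def build_uri(entity_type, feed_id, entity_id=None):
    if ":" in entity_type or ":" in feed_id:
        raise ValueError("colon is reserved as the separator")
    parts = ["gtfs", entity_type, feed_id]
    if entity_id is not None:
        parts.append(entity_id)
    return ":".join(parts)


def parse_uri(uri):
    # maxsplit=3 so that any colon inside entity_id is left untouched.
    parts = uri.split(":", 3)
    if len(parts) < 3 or parts[0] != "gtfs":
        raise ValueError(f"not a gtfs URI: {uri!r}")
    entity_id = parts[3] if len(parts) == 4 else None
    return GtfsUri(parts[1], parts[2], entity_id)


stop = parse_uri("gtfs:stop:nl:145882")
feed = parse_uri("gtfs:feed:trimet")
route_uri = build_uri("route", "trimet", "78")
```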

However, the fact that IDs can change between versions is very
problematic. Linked open data specialists, your comments are welcome!

-Andrew

Andrew Byrd

unread,
Feb 11, 2014, 9:20:01 AM2/11/14
to gtfs-c...@googlegroups.com
On 02/03/2014 11:44 AM, Andrew Byrd wrote:
> 3. Corresponding feed_ids in realtime feeds.
> The new fields feed_id and feed_version will be added under OpenOV's
> 1003 extension. Question: is this a good place to add them?

Brian, on the gtfs-realtime list you recently mentioned reserving a new
non-extension field (field 6 for direction_id). Would it be conceivable
to do the same thing for the proposed GTFS-RT feed_id and feed_version
fields? It seems a bit odd to include them in an OpenOV extension. It
would be a lot cleaner if they were simply fields in the FeedHeader. For
example,

optional string static_feed_id = 4;
optional string static_feed_version = 5;

I'm also not sure whether "static" is the most common / correct prefix
to indicate a corresponding non-realtime GTFS feed.
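
To illustrate how a consumer might use these two fields, here is a
minimal Python sketch. It uses plain dataclasses as stand-ins rather
than the real protobuf bindings, the field names are the provisional
ones above, and Header, LoadedFeed and match_static_feed are
hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Header:
    """Stand-in for a GTFS-RT FeedHeader carrying the two proposed fields."""
    static_feed_id: Optional[str] = None
    static_feed_version: Optional[str] = None


@dataclass
class LoadedFeed:
    """Stand-in for a schedule feed that the consumer has already loaded."""
    feed_id: str
    feed_version: str


def match_static_feed(header, loaded):
    """Return (feed, version_matches) for the header, or (None, False).

    `loaded` maps feed_id to LoadedFeed.
    """
    feed = loaded.get(header.static_feed_id)
    if feed is None:
        return None, False
    return feed, feed.feed_version == header.static_feed_version


# Hypothetical version strings.
loaded = {"trimet": LoadedFeed("trimet", "2014-02")}
same = match_static_feed(Header("trimet", "2014-02"), loaded)
stale = match_static_feed(Header("trimet", "2014-03"), loaded)
unknown = match_static_feed(Header("nl", "2014-02"), loaded)
```

A version mismatch is reported rather than treated as an error, since
what a consumer should do in that case is not specified here.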

-Andrew

Brian Ferris

unread,
Feb 11, 2014, 9:54:04 AM2/11/14
to gtfs-c...@googlegroups.com
I think reserving the fields would be appropriate, as these would be generally useful fields.  As for what to call the fields, I also tend to think that "static" is maybe not the most appropriate prefix.  However, I'm not sure what I'd use instead.  Maybe "gtfs_feed_id" and "gtfs_feed_version"?





--
You received this message because you are subscribed to the Google Groups "General Transit Feed Spec Changes" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gtfs-changes+unsubscribe@googlegroups.com.

Andrew Byrd

unread,
Feb 11, 2014, 10:17:49 AM2/11/14
to gtfs-c...@googlegroups.com
On 02/11/2014 03:54 PM, Brian Ferris wrote:
> I think reserving the fields would be appropriate, as these would be
> generally useful fields. As for what to call the fields, I also tend to
> think that "static" is maybe not the most appropriate prefix. However,
> I'm not sure what I'd use instead. Maybe "gtfs_feed_id" and
> "gtfs_feed_version"?

I agree that we should avoid "static", which is just part of the jargon
that developed around a few specific projects I worked on and is not
very clear.

However, I believe GTFS-RT is perceived as part of GTFS, so gtfs_feed_id
does not necessarily clarify the field's purpose and even looks a bit
redundant.

I don't see any harm in just calling the field "feed_id", but the
corresponding "feed_version" might be confusing. One might expect the
gtfs-rt feed version to change every time it's updated, rather than
every time the corresponding schedule feed is updated.

Maybe schedule_feed and schedule_feed_version? main_feed and
main_feed_version?

-Andrew

Brian Ferris

unread,
Feb 11, 2014, 10:23:52 AM2/11/14
to gtfs-c...@googlegroups.com
I'm ok with "schedule_feed_id" but not so much with "main_feed_id" or just a naked "feed_id".  Between "gtfs_feed_id" and "schedule_feed_id", I'm not sure either is an absolute slam dunk, in that I'm guessing some people will get exactly what the field means and some won't.  But that's ok and I'd ultimately be fine with either.  Anyone else have any thoughts?





Jorden Verwer

unread,
Feb 11, 2014, 11:47:58 AM2/11/14
to gtfs-c...@googlegroups.com

What about parent_feed_id and parent_feed_version?

Jorden

Andrew Byrd

unread,
Mar 31, 2014, 10:08:15 AM3/31/14
to gtfs-c...@googlegroups.com
On Monday, February 3, 2014 12:31:14 PM UTC+1, Stefan de Konink wrote:
On Monday, February 3, 2014 11:44:18 AM CEST, Andrew Byrd wrote:
> On 01/31/2014 10:54 AM, Brian Ferris wrote:
> 1. Unique feed identifiers
> An optional field "feed_id" is added to feed_info.txt. This
> field contains a globally unique identifier for this feed. Here
> "feed" does not refer to a single zipped GTFS file, but rather
> all such files produced by the same feed provider and
> intended to cover the same set of services. When two GTFS files
> have the same feed_id, one supersedes the other.

These semantics should be defined extremely well, so that feed merging
becomes possible in the tools that are already available for it.


One helpful rule would be that feed providers who use a feed_id should make an effort to keep other IDs within their namespace (stops, routes, etc.) consistent across different versions of the feed. 

-Andrew

Andrew Byrd

unread,
Mar 31, 2014, 10:14:27 AM3/31/14
to gtfs-c...@googlegroups.com
On Monday, February 3, 2014 11:44:18 AM UTC+1, Andrew Byrd wrote:
2. Feed ID registration
These globally unique IDs are registered in a CSV file called
"feeds.txt" at a stable URL. The registry need not be exhaustive,
but it should serve as a reference and prevent name collisions,
especially in regions where multiple GTFS feed namespaces exist. I have
registered the domain transitfeed.org for this purpose.

A minimal prototype of this is in place. OpenOV and TriMet have agreed to add feed_ids to their GTFS schedule feeds, and I have added entries for them in the CSV file at http://gtfs.org/feeds.txt

We are now taking steps toward using this information when interpreting GTFS-RT messages in OpenTripPlanner.

Any other feed producers who want to help this proposal advance, please contact me off-list so I can add you to the registry. This would be especially relevant if you are in a large metropolitan region with multiple feeds where GTFS-RT is in use or expected to appear.

-Andrew

Aaron Antrim

unread,
Apr 9, 2014, 2:52:32 PM4/9/14
to gtfs-c...@googlegroups.com
Andrew,

Can you provide guidelines on what should be used for the feed_publisher_name and feed_publisher_url?

One of the reasons I ask is because of multi-agency feeds.  Here is an example.

One feed, humboldtcounty-ca-us (http://trilliumtransit.com/transit_feeds/humboldtcounty-ca-us/gtfs.zip) has several agencies:
Redwood Transit System, Eureka Transit Service, and Arcata & Mad River Transit System.

We manage the single GTFS feed on behalf of these transit agencies.

I assume that I should list this feed once, and include Trillium Solutions, Inc. as the feed publisher?  This would mean that the agency name would not actually be included in feeds.txt.

Excerpt from feeds.txt for humboldtcounty-ca-us below.

feed_id,feed_publisher_name,info_url,feed_publisher_url,feed_url



Andrew Byrd

unread,
Apr 10, 2014, 8:36:50 AM4/10/14
to gtfs-c...@googlegroups.com
On 04/09/2014 08:52 PM, Aaron Antrim wrote:
> Can you provide guidelines on what should be used for the
> feed_publisher_name and feed_publisher_url?
Hi Aaron, my intent was for these to reflect the feed publisher, not the
agencies contained in the feed.
> I assume that I should list this feed once, and include Trillium
> Solutions, Inc. as the feed publisher? This would mean that the
> agency name would not actually be included in feeds.txt.
Correct, the feed should only be listed once and all contact information
should be for Trillium. feeds.txt is not intended to contain full feed
metadata, just to associate feed IDs with feeds and establish whom to
contact about any problems (ID clashes).
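
As an illustration of the clash check, here is a minimal Python sketch
that scans registry contents for feed IDs registered more than once.
The column names are the ones in the feeds.txt header quoted by Aaron;
the function name and every sample row are invented placeholders, and
comparing IDs case-sensitively is an assumption.

```python
import csv
import io
from collections import defaultdict


def find_id_clashes(registry_text):
    """Return {feed_id: [publisher names]} for IDs registered more than once."""
    publishers = defaultdict(list)
    for row in csv.DictReader(io.StringIO(registry_text)):
        publishers[row["feed_id"]].append(row["feed_publisher_name"])
    return {fid: names for fid, names in publishers.items() if len(names) > 1}


# Hypothetical registry contents; all rows are invented placeholders.
SAMPLE = (
    "feed_id,feed_publisher_name,info_url,feed_publisher_url,feed_url\n"
    "alpha,Publisher A,http://example.org/a,http://example.org,http://example.org/a.zip\n"
    "beta,Publisher B,http://example.com/b,http://example.com,http://example.com/b.zip\n"
    "alpha,Publisher C,http://example.net/c,http://example.net,http://example.net/c.zip\n"
)

clashes = find_id_clashes(SAMPLE)
```

The publisher names returned for each clashing ID are the contacts who
would need to resolve it.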

Information about which agencies are included in a feed, as well as
other information like geographic extent can be derived from the feed
itself. Ideally all such metadata will eventually be produced by
automatically downloading, verifying, and processing the feeds, then
supplied in a linked open data API.

-Andrew

Andrew Byrd

unread,
Mar 12, 2016, 12:04:37 PM3/12/16
to General Transit Feed Spec Changes
Hello gtfs-changes list,

About 2 years ago we proposed adding unique identifiers to GTFS feeds. The feed producer includes this feed ID as a new column 'feed_id' in feed_info.txt. Because all the other identifiers in a GTFS feed are feed-unique, qualifying them with this feed_id is enough to make them globally unique. This is very practical in applications or data stores that combine information from many feeds. The feed_ids are preferably short, clearly reflect the feed contents, and do not contain the colon character, which allows us to unambiguously interpret scoped IDs of the form TriMet:1234 or NL:6543.
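
As a small illustration, the following Python sketch combines stops from two feeds that happen to use the same stop_id. The function names and stop names are hypothetical; the only rule taken from the proposal is that a feed_id never contains a colon, so splitting on the first colon is unambiguous.

```python
def qualify(feed_id, entity_id):
    """Build a scoped ID such as 'TriMet:1234'."""
    if not feed_id or ":" in feed_id:
        raise ValueError("feed_id must be non-empty and must not contain ':'")
    return f"{feed_id}:{entity_id}"


def split_scoped(scoped_id):
    """Split on the first colon only, since feed_id cannot contain one."""
    feed_id, sep, entity_id = scoped_id.partition(":")
    if not sep or not feed_id:
        raise ValueError(f"not a scoped ID: {scoped_id!r}")
    return feed_id, entity_id


def merge_stops(feeds):
    """Merge {feed_id: {stop_id: stop_name}} into one dict keyed by scoped ID."""
    merged = {}
    for feed_id, stops in feeds.items():
        for stop_id, name in stops.items():
            merged[qualify(feed_id, stop_id)] = name
    return merged


# Two feeds that both use stop_id "1234"; the stop names are invented.
feeds = {
    "TriMet": {"1234": "Stop A"},
    "NL": {"1234": "Stop B"},
}
merged = merge_stops(feeds)
```

Both stops survive the merge because their keys differ once the feed_id is prepended.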

OVApi (full coverage of the Netherlands), Portland TriMet, and Trillium (producing many feeds in the US) started including feed IDs in all their feeds shortly after that proposal. Mapzen and Conveyal have decided together to consume, use, and display feed_ids in their GTFS editing and management software, and have begun doing so. We plan to collaboratively reserve IDs for any feeds that don't include one, encourage their producers to include a feed_id, and fall back on our reserved feed_ids when producers cannot or do not include them.

There was some talk at that time about making a single centralized feed_id registry to avoid and resolve any ID collisions. This started out as nothing more than a CSV file on Github. Designating an official central ID registry seemed like the weak link in the concept, and I think we can do without it. Now, two years later, I am aware of several projects underway that will produce, exchange, analyze, and serve up large amounts of GTFS data. I expect things will work out with a simple consensus / cooperative system as outlined above. Anyone needing to choose an ID for their new feed can simply look at any catalog / API that tracks feed_ids and make sure their new ID is not already in use. All feed catalogs should end up containing the same information about feed_ids.

Considering the long-term presence of the feed_id column in the wild and several pieces of software that use it, are we ready to add this as an optional field in the spec?

I would submit a pull request to https://github.com/google/transit but it looks like that change process is not ready for prime-time yet.

Andrew

Joris Wu

unread,
Mar 14, 2016, 6:41:26 PM3/14/16
to General Transit Feed Spec Changes
A handful of votes up for this approach. That would also promote the use of a feed_info on its own.

When integrating many, many feeds for an international journey planner, the question often arises where a feed actually comes from.

As feeds both move around between different URLs and are often copied or merged, there is hardly a way to detect whether two feeds come from the same source.

Joris