The issue of multiple, changing GTFS file sets over time is a challenging one.
At Metro Transit (Minneapolis-St. Paul), we publish a new file set every week that contains the next seven weeks of schedules. So a month of GTFS-ride detail data would need to be matched to 4 or 5 different GTFS file sets (but only to a portion of each of those file sets). That gets both large and complex pretty quickly.
Another approach would be to think about the original, public-information GTFS file sets as something completely different from the historical, data-matching GTFS file sets. In other words, we don’t HAVE to match GTFS-ride to the public GTFS files. We can instead, after the fact, create a special GTFS file set just for matching the GTFS-ride data. We would carefully craft that GTFS file set to contain all schedules needed by the GTFS-ride data, but not a lot more than that.
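As a rough sketch of the first step in crafting such a file set, the Python below works out which portion of each weekly file set would be needed to cover one month of GTFS-ride data. It assumes that each service date is best described by the most recent file set published on or before that date. The function name and the publish dates are hypothetical, chosen only for illustration, and are not our actual publication calendar.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import date, timedelta


def assign_dates_to_feeds(publish_dates, first_day, last_day):
    """Group service dates by the file set that should supply them.

    Assumption: a service date is taken from the most recent file set
    published on or before that date.
    """
    publish_dates = sorted(publish_dates)
    portions = defaultdict(list)
    day = first_day
    while day <= last_day:
        # Index of the last publish date that is <= day.
        idx = bisect_right(publish_dates, day) - 1
        if idx < 0:
            raise ValueError(f"no file set published on or before {day}")
        portions[publish_dates[idx]].append(day)
        day += timedelta(days=1)
    return dict(portions)


# Hypothetical weekly publish dates (Mondays) around March 2024.
feeds = [date(2024, 2, 26) + timedelta(weeks=i) for i in range(5)]
portions = assign_dates_to_feeds(feeds, date(2024, 3, 1), date(2024, 3, 31))

for published, days in portions.items():
    print(f"file set {published}: use {days[0]} through {days[-1]}")
```

Each entry tells us which service dates to extract from which file set. Combining those extracts into one file set would still require handling things like colliding trip_id and service_id values across file sets, which this sketch does not attempt.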
On the plus side, this will make the use of the GTFS-ride data easier. I would presume that most of the problems that Aaron identifies could be avoided.
On the negative side, this will make the generation of the complete GTFS-ride + GTFS schedule data even harder. Are there existing tools that consume multiple GTFS file sets and output them as a single combined set? That would be a big help.
Has anyone approached the GTFS data need in this way?
John Levin