Major changes to Transxchange2GoogleTransit

135 views
Skip to first unread message

Daniel Thomas

unread,
Jun 15, 2012, 12:25:39 PM6/15/12
to googletran...@googlegroups.com
I have made some further significant changes to Transxchange2GoogleTransit in order to get it to parse the TNDS correctly. Among other things it now runs significantly more quickly - it takes significantly less time to produce the feed than to run the validator on the feed whereas before it took orders of magnitude more time.
I have also moved quite a lot of code towards the standard java conventions on various things.
Since there are quite a lot of changes and so filing an issue for each one and dealing correctly with dependencies between them would be quite a lot of overhead I have pushed all this to github: https://github.com/drt24/googletransitdatafeed and will keep that up to date with any further changes I make. Of course it is all Apache 2 and I would love it if this could all get upstream.

I have particular concerns about the following changes I have made in that I am not sure that they are correct:
this fixes a bug where null dates were given when a service ran on all bank holidays but I am not convinced  that the solution does not result in a more subtle bug as I am fairly confident that the validation tests don't cover that.

If you have any questions or suggestions regarding the code I would love to hear them.

Daniel

JP

unread,
Jun 16, 2012, 1:52:40 PM6/16/12
to GoogleTransitDataFeed
Daniel,
Thank you for sharing your code. I agree, pushing out the whole
shebang helps with understanding the changes better
I've built your fork of the converter from the sources you posted and
let it run against my validation data sets and a couple of weeks old
published data from TfL. Here are my observations, findings and a few
recommendations:

(1) Your fork flows the content of TXC/<LineName>s into the GTFS/
route_long_name field of routes.txt. The original converter populates
route_short_name instead. I've posted an enhancement request in the
project's issues database to add a switch to the converter
configuration file that controls where the content of TXC/<LineName>
goes.

(2) Where stops do not possess stop coordinates, you've removed the
OpenRequired entries and now leave the stop coordinate fields empty.
OpenRequired was introduced in an effort to help pinpoint deficiencies
of the GTFS output. I believe either way is fine, as the GTFS feed
validator will not pass stops without coordinates in any case (at
least that's been the way in the past)

(3) The format of the NaPTAN stops.csv file that you've used seems to
be different from what I pull once a month and what other users have
(to my knowledge). Version 2, as issue 334 suggests. The difference is
in the stop column names, "AtcoCode" instead of "ATCOCode", or
"Latitude" instead of "Lat", for example. If you do not need the
NaPTAN stop helper function that creates stop names based on the
"usable stop name rules", I recommend you use the NaPTAN Stop Column
Pick feature that can be set up in the converter's configuration file.
This way, you can account for the differences. For you reference, the
original converter assumes the following stop file headline:
"ATCOCode","GridType","Easting","Northing","Lon","Lat","CommonName","Identifier","Direction","Street","Landmark","NatGazID","NatGazLocality","ParentLocality","GrandParentLocality","Town","Suburb","StopType","BusStopType","BusRegistrationStatus","RecordStatus","Notes","LocalityCentre","SMSNumber","LastChanged"
As is, your fork of the converter will not run when the above format
is presented to it, and may folks are still using it.
I will however look into adding a switch in the converter
configuration file to be able to accept the v2 format as well. This
way, v2 can be covered and we have backwards compatibility

(4) You've forked version 1.5 of the original converter. I've released
version 1.6 since. It includes a few changes to the handling of
calendar dates; I do not know whether that might yield improvements on
your end

(5) I let lose the converter fork on the TfL published timetable data,
two weeks old circa. The script, which spends most of the time
executing the converter, ran for 3h11min on my box, which is a 2008
dual Core Mac, 16bit Java mode. By comparison, the original converter
took 3h20min for the same input data set and that was while I was
using the box for some other stuff. This means I do not see the major
performance improvements that you suggest. I suspect you may have a
problem where the converter falls into an exception handler unnoticed,
suggesting much better performance but in reality skipping large junks
of data. This is pure speculation on my part though.

Conclusion:
On a functional level, your fork boils down to changes following
findings (1) and (3) above. I might uncover a few more things as I dig
into it. As in the past, isolated change requests will be introduced
into the main line of the converter. I think I'm not going to go for
the repackaging of the Java classes of the converter however. Your
take on it is nicer than the current structure, but does not yield
significant functional improvements. With that, don't get me wrong, it
is great to see others take a stab at the converter code and get their
hands dirty. Great effort, which I appreciate much in that it helps
polishing the converter.

JP


On Jun 15, 9:25 am, Daniel Thomas <dr...@srcf.net> wrote:
> I have made some further significant changes to Transxchange2GoogleTransit
> in order to get it to parse the TNDS correctly. Among other things it now
> runs significantly more quickly - it takes significantly less time to
> produce the feed than to run the validator on the feed whereas before it
> took orders of magnitude more time.
> I have also moved quite a lot of code towards the standard java conventions
> on various things.
> Since there are quite a lot of changes and so filing an issue for each one
> and dealing correctly with dependencies between them would be quite a lot
> of overhead I have pushed all this to github:https://github.com/drt24/googletransitdatafeedand will keep that up to
> date with any further changes I make. Of course it is all Apache 2 and I
> would love it if this could all get upstream.
>
> I have particular concerns about the following changes I have made in that
> I am not sure that they are correct:https://github.com/drt24/googletransitdatafeed/commit/c5415065bc8c65f...

Daniel Thomas

unread,
Jun 18, 2012, 9:15:14 AM6/18/12
to googletran...@googlegroups.com
On Sat, 2012-06-16 at 10:52 -0700, JP wrote:
> Daniel,
> Thank you for sharing your code. I agree, pushing out the whole
> shebang helps with understanding the changes better
> I've built your fork of the converter from the sources you posted and
> let it run against my validation data sets and a couple of weeks old
> published data from TfL. Here are my observations, findings and a few
> recommendations:
>
> (1) Your fork flows the content of TXC/<LineName>s into the GTFS/
> route_long_name field of routes.txt. The original converter populates
> route_short_name instead. I've posted an enhancement request in the
> project's issues database to add a switch to the converter
> configuration file that controls where the content of TXC/<LineName>
> goes.

I have pushed the changes for making this configurable modulo
configuration file parsing and updated the issue to say that.

> (2) Where stops do not possess stop coordinates, you've removed the
> OpenRequired entries and now leave the stop coordinate fields empty.
> OpenRequired was introduced in an effort to help pinpoint deficiencies
> of the GTFS output. I believe either way is fine, as the GTFS feed
> validator will not pass stops without coordinates in any case (at
> least that's been the way in the past)

I think it is better to not provide data when valid data cannot be
provided rather than to provide data that a naive validator might accept
and which something which could ignore the absence of good data would
try to parse and fail.

> (3) The format of the NaPTAN stops.csv file that you've used seems to
> be different from what I pull once a month and what other users have
> (to my knowledge). Version 2, as issue 334 suggests. The difference is
> in the stop column names, "AtcoCode" instead of "ATCOCode", or
> "Latitude" instead of "Lat", for example. If you do not need the
> NaPTAN stop helper function that creates stop names based on the
> "usable stop name rules", I recommend you use the NaPTAN Stop Column
> Pick feature that can be set up in the converter's configuration file.
> This way, you can account for the differences. For you reference, the
> original converter assumes the following stop file headline:
> "ATCOCode","GridType","Easting","Northing","Lon","Lat","CommonName","Identifier","Direction","Street","Landmark","NatGazID","NatGazLocality","ParentLocality","GrandParentLocality","Town","Suburb","StopType","BusStopType","BusRegistrationStatus","RecordStatus","Notes","LocalityCentre","SMSNumber","LastChanged"
> As is, your fork of the converter will not run when the above format
> is presented to it, and may folks are still using it.
> I will however look into adding a switch in the converter
> configuration file to be able to accept the v2 format as well. This
> way, v2 can be covered and we have backwards compatibility

Version 1 of NaPTAN is not as readily available as version 2, I think it
has been deprecated. It is for example not listed on
http://data.gov.uk/dataset/naptan
I switched the column names as per the information on converting from v1
to v2 in the v2.4 naptan schema. Since there is no overlap between the
two specs perhaps a better solution than another configuration option
would be to do column.equals("AtcoCode") || column.equals("ATCOCode")
etc. as that way both formats will automatically work?

> (4) You've forked version 1.5 of the original converter. I've released
> version 1.6 since. It includes a few changes to the handling of
> calendar dates; I do not know whether that might yield improvements on
> your end

I forked the most recent version in SVN which should be the most recent
version you have been working on? It includes comments about 1.6 and 1.7
and claims a version number of 1.7.5.

> (5) I let lose the converter fork on the TfL published timetable data,
> two weeks old circa. The script, which spends most of the time
> executing the converter, ran for 3h11min on my box, which is a 2008
> dual Core Mac, 16bit Java mode. By comparison, the original converter
> took 3h20min for the same input data set and that was while I was
> using the box for some other stuff. This means I do not see the major
> performance improvements that you suggest. I suspect you may have a
> problem where the converter falls into an exception handler unnoticed,
> suggesting much better performance but in reality skipping large junks
> of data. This is pure speculation on my part though.

When I run using a modified londonconvert.sh I don't get any significant
difference in time taken between the first patch that causes it to start
working and the most recent commit. However when I profiled the code
using visualVM when running on the TNDS (SW region) then the vast
majority of the time taken was spent in the consolidateStops routine,
perhaps it just has far more duplicated stops in it when processed than
the london data does.
Timing the difference between the code before and after the changes to
the consolidation methods:
Old time: 375392 ms
New time: 81350 ms
It appears I exaggerated slightly, only one order of magnitude (and I
haven't rerun these tests enough times to know the variance).

> Conclusion:
> On a functional level, your fork boils down to changes following
> findings (1) and (3) above. I might uncover a few more things as I dig
> into it.
I also removed invalid use of Easting and Northing as latitude and
longitude (you had already partially removed this). Added the
Configuration class which cleans up the API for using
Transxchange2GoogleTransit considerably. There are various other small
fixes to make it validate correctly such as fixes to the validation
reference to make it valid CSV. Admittedly most of the changes are
non-functional ones to cleanup the code in one way or another. I teach
java to undergrads and so can get a little picky about code style.

> As in the past, isolated change requests will be introduced
> into the main line of the converter. I think I'm not going to go for
> the repackaging of the Java classes of the converter however. Your
> take on it is nicer than the current structure, but does not yield
> significant functional improvements.
So what structure would you prefer? Transxchange2GoogleTransit.java
can't really remain in the default package as that violates the java
best practices fairly thoroughly. The new classes need to go somewhere
and they shouldn't really go in transxchange2GoogleTransithandler as
they are not handlers.

> With that, don't get me wrong, it
> is great to see others take a stab at the converter code and get their
> hands dirty. Great effort, which I appreciate much in that it helps
> polishing the converter.

Good :-) I think I have made most of the changes I need to to make
things work for TNDS conversion (the feedvalidator doesn't give me any
errors when I convert the SW region now). Still 29424 warnings though
12236 of which are due to a potential bug in the calendar_dates.txt
output where the same service_id and date gets generated with an
exception_type of 1 and 2 but I haven't sufficiently diagnosed whether
that is a bug or a problem with the TNDS.
signature.asc

Daniel Thomas

unread,
Jun 21, 2012, 8:46:54 AM6/21/12
to googletran...@googlegroups.com
On Mon, 2012-06-18 at 14:15 +0100, Daniel Thomas wrote:
> I think I have made most of the changes I need to to make
> things work for TNDS conversion

It turns out that wasn't quite the case.
To get conversion to actually work on the whole TNDS data set it was
necessary to fix a few more bugs, some of which required major rewrites
to particular files.
In particular the main focus was on getting memory consumption down. It
now runs happily with a max memory of 1GB and is done in 78 minutes
while before it would have easily eaten 10GB. This was mainly due to a
bug in calendar dates where it iterated from the start date of a service
to its end date creating calendar date entries, some services had start
dates of 1900 and end dates of 9999 and so this would easily end up
consuming several GB. I restricted this down to 1 year before now and
five years after now. I haven't got around to making that configurable
yet.
https://github.com/drt24/googletransitdatafeed/commit/7d0362bc64cef5507fbbc968494956c4845c987f#L3R512 has that particular part of the changes.

I also had to do some fairly major work to TransxchangeStops:
https://github.com/drt24/googletransitdatafeed/commit/fc129a918e372911d0e4f8254e46d1b175f3d8c7 to deal with the fact that some of the XML files didn't have all the expected elements and so there were more stop ids than stop names.
Similarly
https://github.com/drt24/googletransitdatafeed/commit/dd83d3ba25810b4304563213cf4e4bcb4c4d5e17 was required to deal with the lack of a name for agencies.

There is definitely some sort of bug in the way that CalendarDates works
in that it can end up generating very many does not run on this day
exceptions rather than saying runs on these two days only. Additionally
a we probably want to do more validation on start end end dates, a start
date of 00010101 is a bit silly, buses aren't going to be invented for
nearly 2000 years at that point.


Daniel

charl...@gmail.com

unread,
Nov 26, 2013, 12:15:28 PM11/26/13
to googletran...@googlegroups.com
Daniel,

I had a thought for an App that could hook into the GO transit App but have traction worldwide. The App would be programmable to alert the user to their stop coming up. Vibration, Alarm, Buzz, the user sets it so that they don't sleep through their stop ! It might apply eventually to subways but reception is a problem, unless the subway stops themselves can act as a trip for the alarm. I'd call it FULL STOP.
Toronto, London, Tokoyo, New York !!
Remember me if you make a million !

Charlie B

Daniel Thomas

unread,
Nov 27, 2013, 4:13:24 AM11/27/13
to googletran...@googlegroups.com, charl...@gmail.com
Hello Charlie B,

There is nothing new under the sun. :-)

Best,

Daniel
Reply all
Reply to author
Forward
0 new messages