GTFSVTOR - an open-source GTFS validator

Laurent GRÉGOIRE

May 29, 2020, 9:24:44 AM
to gtfs-c...@googlegroups.com
Hi all,

We just published on GitHub a beta release of a new open-source (GPL)
GTFS validator, GTFSVTOR. The source code and a bundled release are
available here:
https://github.com/mecatran/gtfsvtor

The goal of this project is to provide a faster implementation of the
current "standard" Python validator, while remaining (when possible)
backward compatible regarding validation rules. It is written in Java.

Feel free to use it, share it, and give feedback. Hope this can help!

--
Laurent Grégoire

Stefan de Konink

May 29, 2020, 9:32:05 AM
to Laurent GRÉGOIRE, gtfs-c...@googlegroups.com
Hi Laurent,

Awesome initiative! I already see some numbers for our Dutch NL feed.

Would you be able to elaborate here on how you achieve those
performance numbers with per-object (fat) stop times, without any use
of basic types?

<https://github.com/mecatran/gtfsvtor/blob/master/src/main/java/com/mecatran/gtfsvtor/model/impl/SimpleGtfsStopTime.java>

Could you also mention how much memory the system uses for the
different example feeds?

--
Stefan

Holger Bruch

May 29, 2020, 11:50:21 AM
to GTFS Changes
Hi Laurent,

Impressive! Thank you very much for the initiative. I applied it to the current Germany feed (a 278MB download) and with 6GB of memory I got the following results on my machine:

Routes | Stops  | Trips   | Times  | Shapes | Feedvalidator | GTFSVTOR
20359  | 560937 | 1945028 | 37700K | 7966K  | ?             | 1m27s

The first run slowed down when memory usage came close to 4GB. A short note in the README with an example of how to supply GTFSVTOR_OPTS would help others.
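
For reference, what I used was along these lines (assuming the bundled start script picks up extra JVM options from the GTFSVTOR_OPTS environment variable, as its name suggests; the feed file name and exact invocation are illustrative):

    GTFSVTOR_OPTS=-Xmx6G ./gtfsvtor germany-gtfs.zip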

The transfers.txt checks complained about duplicated from/to stop_ids, although the entries are distinct once from/to_route_id is taken into account.
I also expected some further errors to be reported which were not; I will take a look into the code/TODOs.

Do you plan an extension mechanism? I have some dataset-specific checks I currently run via SQL statements after importing with gtfs-sql-importer, e.g. stop_id matches an IFOPT pattern, coordinates within an expected bounding box, and the like.

Thanks again, Laurent, for the great tool!

Another question (perhaps rather a new post): How do you report issues you found? For German feeds, I started a GitHub project (https://github.com/mfdz/GTFS-Issues/issues, only in German, sorry) just to document errors (validation errors as well as content issues). It is slowly getting more attention from publishing agencies.

Best regards,
Holger

Laurent GRÉGOIRE

May 29, 2020, 2:34:40 PM
to gtfs-c...@googlegroups.com
Hi Holger,

> Routes | Stops  | Trips   | Times  | Shapes | Feedvalidator | GTFSVTOR
> 20359  | 560937 | 1945028 | 37700K | 7966K  | ?             | 1m27s

Thanks for reporting this! Maybe we can add it to the current
performance stats list?

> The first run slowed down when memory usage came close to 4GB. A short note in the README with an example of how to supply GTFSVTOR_OPTS would help others.

Good idea, will do that.

> The transfers.txt checks complained about duplicated from/to stop_ids, although the entries are distinct once from/to_route_id is taken into account.
> I also expected some further errors to be reported which were not; I will take a look into the code/TODOs.

Thanks for reporting. The aim is to be backward-compatible with the
Python validator when possible. Do not hesitate to create GitHub
issues for each discrepancy, bug or feature request. I will create
some to kick-start the process ;)

> Do you plan an extension mechanism? I have some dataset-specific checks I currently run via SQL statements after importing with gtfs-sql-importer, e.g. stop_id matches an IFOPT pattern, coordinates within an expected bounding box, and the like.

Yes and no, depending on how specific each validation is. Adding
extended validators is indeed possible: a validator can be created
disabled by default, with the ability to enable it via a config file
if needed. Bounding-box validation would be a perfect fit for this,
as would IFOPT-pattern matching. If a validator is really specific to
some GTFS feed/agency/region, though, it would be better to ship it
as an external library (jar file). Having said that, the code is
still beta, so the validator API is still in flux.
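
To give an idea of the shape such an extended check could take, here
is a rough sketch (all type names below are illustrative stand-ins,
not the actual API, which as said is still moving):

    import java.util.List;

    // Illustrative stand-ins for the project's model and report types.
    interface GtfsStop {
        String getId();
        Double getLat();
        Double getLon();
    }
    interface ReportSink {
        void warning(String message);
    }

    // A dataset-specific check, disabled by default and enabled (with
    // its bounding box) via the config file: flag stops outside the
    // expected area.
    class StopBoundingBoxValidator {
        private final double minLat, maxLat, minLon, maxLon;

        StopBoundingBoxValidator(double minLat, double maxLat,
                double minLon, double maxLon) {
            this.minLat = minLat;
            this.maxLat = maxLat;
            this.minLon = minLon;
            this.maxLon = maxLon;
        }

        void validate(List<GtfsStop> stops, ReportSink report) {
            for (GtfsStop stop : stops) {
                Double lat = stop.getLat(), lon = stop.getLon();
                if (lat == null || lon == null)
                    continue; // missing coordinates are another check's job
                if (lat < minLat || lat > maxLat
                        || lon < minLon || lon > maxLon)
                    report.warning("Stop " + stop.getId()
                            + " is outside the expected bounding box: "
                            + lat + "," + lon);
            }
        }
    }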

> Another question (perhaps rather a new post): How do you report issues you found? For German feeds, I started a GitHub project (https://github.com/mfdz/GTFS-Issues/issues, only in German, sorry) just to document errors (validation errors as well as content issues). It is slowly getting more attention from publishing agencies.

I'm not sure I understand correctly. Do you mean an automated way of
creating issues from the validator? Ideally we should be able to
output the issue list in a parseable format, such as JSON, but I
prefer to wait for the code to be a bit more stable before adding this.

HTH,

--Laurent

Laurent GRÉGOIRE

Jun 1, 2020, 9:57:06 AM
to gtfs-c...@googlegroups.com
Hi Stefan,

On Fri, 29 May 2020 at 15:32, Stefan de Konink <ste...@konink.de> wrote:
> Would you be able to elaborate here on how you achieve those
> performance numbers with per-object (fat) stop times, without any
> use of basic types?
> <https://github.com/mecatran/gtfsvtor/blob/master/src/main/java/com/mecatran/gtfsvtor/model/impl/SimpleGtfsStopTime.java>

The default implementations used are the "flyweight" versions:
https://github.com/mecatran/gtfsvtor/blob/master/src/main/java/com/mecatran/gtfsvtor/model/impl/SmallGtfsStopTime.java
https://github.com/mecatran/gtfsvtor/blob/master/src/main/java/com/mecatran/gtfsvtor/model/impl/SmallGtfsShapePoint.java
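
The gist of it, in a simplified form (this is an illustration of the
packing principle, not the actual class layout):

    // Simplified illustration only: times are plain ints (seconds
    // since midnight, -1 when missing) rather than wrapper objects,
    // and repeated strings are shared via interning.
    class PackedStopTime {
        final int arrivalSec;    // -1 = not set
        final int departureSec;  // -1 = not set
        final String stopId;     // interned, shared between stop times
        final int stopSequence;

        PackedStopTime(int arrivalSec, int departureSec,
                String stopId, int stopSequence) {
            this.arrivalSec = arrivalSec;
            this.departureSec = departureSec;
            this.stopId = stopId.intern();
            this.stopSequence = stopSequence;
        }
    }

Each instance then costs roughly one object header plus three ints
and one reference, instead of a handful of boxed fields and
sub-objects.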

> Could you also mention how much memory the system uses for the
> different example feeds?

I do not have precise numbers at hand, but the whole NL feed validates
fine on an 8GB RAM machine using the in-memory DB.

--Laurent

Stefan de Konink

Jun 1, 2020, 11:26:43 AM
to gtfs-c...@googlegroups.com
On Monday, June 1, 2020 3:56:52 PM CEST, Laurent GRÉGOIRE wrote:
> The default implementations used are the "flyweight" versions:

Thanks, still surprisingly big ;) But cool that this does the trick.
I'll certainly want to test this code as a drop-in replacement.

>> Could you also mention how much memory the system uses for the
>> different example feeds?
>
> I do not have precise numbers at hand, but the whole NL feed validates
> fine on an 8GB RAM machine using the in-memory DB.

8GB is affordable, almost a commodity with the latest Pis, but I
still consider this rather big, especially since I know how much we
can compress this data.

The in-memory option, is that only the multimap approach? Nothing external?
<https://github.com/mecatran/gtfsvtor/blob/master/src/main/java/com/mecatran/gtfsvtor/dao/impl/InMemoryDao.java#L46>

--
Stefan

Laurent GRÉGOIRE

Jun 1, 2020, 12:13:56 PM
to gtfs-c...@googlegroups.com
Hi Stefan,

On Fri, 29 May 2020 at 22:35, Stefan de Konink <ste...@konink.de> wrote:
> Was there a reason you took this approach, instead of the R5 stuff?
> Personally I am kind of puzzled whether the way we are doing this
> kind of stuff in Java with objects makes sense at all. I noticed the
> CSV parser is kind of different as well. I think an object facade
> that works in front of a column store could be much more effective:
> every attribute of the object is stored in an array of a simple
> type, and the intermediate object is aware of the datastore and
> holds an index into it. SQL - especially for wrong data - might not
> make sense, but then every CSV value should be a string anyway.

Currently a static table approach would not save enough memory
compared to the current full object creation. The biggest drawback is
the speed penalty of resurrecting the objects in full, as many
validators each do a full scan in turn (block ID overlap, too-fast
travel, first/last stop time check, duplicated trip detection...).
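
To make the trade-off concrete, the column store you describe would
look roughly like this (illustrative names only), and the facade is
exactly what has to be re-created on every scan:

    // Rough sketch of the column-store idea: one primitive array per
    // attribute, plus a thin object facade that knows the store and a
    // row index.
    class StopTimeColumns {
        final int[] arrivalSec;
        final int[] departureSec;
        final int[] stopIndex; // index into a shared stop ID table

        StopTimeColumns(int[] arrivalSec, int[] departureSec,
                int[] stopIndex) {
            this.arrivalSec = arrivalSec;
            this.departureSec = departureSec;
            this.stopIndex = stopIndex;
        }

        // The "resurrection" cost: each full scan by a validator
        // re-creates one facade per row.
        StopTimeView view(int row) {
            return new StopTimeView(this, row);
        }
    }

    class StopTimeView {
        private final StopTimeColumns store;
        private final int row;

        StopTimeView(StopTimeColumns store, int row) {
            this.store = store;
            this.row = row;
        }

        int arrivalSec()   { return store.arrivalSec[row]; }
        int departureSec() { return store.departureSec[row]; }
        int stopIndex()    { return store.stopIndex[row]; }
    }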

Instead of this, I would prefer to refactor the code by: first,
disabling the trip stop times list random-access capability of the
DAO; and second, introducing a third kind of validator
("TripTimesStreamingValidator") which would accept trip + stop times
in streaming mode, reducing the number of times each trip's stop time
list is built and accessed. This would allow:
- with internal memory, aggressive stop time list packing:
variable-length-encoded time deltas, interned common patterns for
deltas, stop ID lists, sequences, shape dist traveled... (see the
sketch below). I think this approach can reduce memory usage a lot,
perhaps by an order of magnitude compared to the current situation.
- with external storage, if needed one day, a much more optimized use
of it, replacing random access by sequential streaming.

I postponed the implementation of this because: first, it has some
impact, especially dropping the stop time list random-access
capability; and second, it has a non-negligible speed penalty when
stop times are not grouped by trip ID (but to be honest, the large
majority of GTFS feeds do group stop times by trip ID; I do not even
remember having seen one that did not).

HTH,

--Laurent