Hi Stefan,
On Fri, 29 May 2020 at 22:35, Stefan de Konink <
ste...@konink.de> wrote:
> Was there a reason you took this approach, instead of the R5 stuff?
> Personally I am kind of puzzled if the way we are doing this kind of stuff
> in Java with objects makes sense at all. I noticed the CSV parser is kind
> of different as well. I think an object facade that works in front of a
> column store could be much more effective. Hence every attribute of the
> attribute is stored in an array of a simple type, the intermediate object
> is aware of datastore and has an index. SQL - especially for wrong data -
> might not make sense, but then every CSV value should be a string anyway.
For now, a static table approach would not save enough memory compared
to the current full object creation. Its biggest drawback is the speed
penalty of rebuilding the objects in full, since many validators each
do a full scan in turn (block ID overlap, too fast travel,
first/last stop time check, duplicated trip detection...).
Instead, I would prefer to refactor the code by: first) disabling the
DAO's random access capability for trip stop times lists, and second)
introducing a third kind of validator
("TripTimesStreamingValidator") which would accept trip + stop times
in streaming mode, reducing the number of times each trip's stop times
list is built and accessed. This would allow:
- using internal memory, an aggressive packing of stop time lists:
variable-length encoding of time deltas, interning of common patterns
for deltas, stop ID lists, sequences and shape dist traveled... I think
this approach can reduce memory usage a lot, possibly by an order of
magnitude compared to the current situation.
- using external storage, if needed one day, in a much more optimized
way, by replacing random access with sequential streaming.
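
A minimal sketch of the variable-length delta encoding mentioned in the
first point, in Java. The class and method names (PackedTimes, pack,
unpack) are hypothetical and not part of the existing code base. It
assumes times are non-negative seconds in non-decreasing order; a feed
with out-of-order times would need a signed (e.g. zigzag) variant.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch: packs stop times (seconds) as varint-encoded deltas. */
final class PackedTimes {
    private PackedTimes() {}

    /** Encodes each delta to the previous time as a base-128 varint. */
    static byte[] pack(int[] times) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int t : times) {
            int delta = t - previous;
            if (delta < 0) {
                // Assumption of this sketch: no negative deltas.
                throw new IllegalArgumentException("times must be non-decreasing");
            }
            // Emit 7 bits per byte, low bits first; high bit set means "more follows".
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write(delta);
            previous = t;
        }
        return out.toByteArray();
    }

    /** Decodes count times by summing the varint deltas back up. */
    static int[] unpack(byte[] packed, int count) {
        int[] times = new int[count];
        int pos = 0;
        int previous = 0;
        for (int i = 0; i < count; i++) {
            int delta = 0;
            int shift = 0;
            int b;
            do {
                b = packed[pos++] & 0xFF;
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += delta;
            times[i] = previous;
        }
        return times;
    }
}
```

Because consecutive stop times on a trip are usually only a few minutes
apart, most deltas fit in one or two bytes instead of a four-byte int.
Note that the packed form can only be read sequentially, which is why
it goes together with dropping random access.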
I postponed the implementation of this because: first) it has some
impact, especially dropping the stop time list random access
capability, and second) it has a non-negligible speed penalty when stop
times are not grouped by trip ID (but to be honest, a large majority of
GTFS feeds do have stop times grouped by trip ID; I do not even
remember having seen one that does not).
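
A rough sketch of how the streaming mode could look, in Java. Only the
name TripTimesStreamingValidator comes from the proposal above; its
signature, StopTimeRow and StopTimeStreamer are hypothetical. It hands
each run of consecutive rows with the same trip ID to the validators
once, and reports when the input turns out not to be grouped by trip
ID, which is the case that would require a slower sort/regroup pass.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical minimal stop time row; not the real model class. */
final class StopTimeRow {
    final String tripId;
    final int stopSequence;
    final int departureSeconds;

    StopTimeRow(String tripId, int stopSequence, int departureSeconds) {
        this.tripId = tripId;
        this.stopSequence = stopSequence;
        this.departureSeconds = departureSeconds;
    }
}

/** Name taken from the proposal; this signature is an assumption. */
interface TripTimesStreamingValidator {
    void validate(String tripId, List<StopTimeRow> stopTimes);
}

final class StopTimeStreamer {
    private StopTimeStreamer() {}

    /**
     * Passes each run of consecutive rows sharing a trip ID to every validator.
     * Returns false as soon as a trip ID reappears after its run has ended,
     * meaning the input is not grouped by trip ID. Runs seen before that point
     * have already been dispatched, so a caller would have to regroup and rerun.
     */
    static boolean stream(Iterable<StopTimeRow> rows,
                          List<TripTimesStreamingValidator> validators) {
        Set<String> finished = new HashSet<>();
        List<StopTimeRow> current = new ArrayList<>();
        String currentTripId = null;
        for (StopTimeRow row : rows) {
            if (!row.tripId.equals(currentTripId)) {
                flush(currentTripId, current, validators);
                if (currentTripId != null) {
                    finished.add(currentTripId);
                }
                if (finished.contains(row.tripId)) {
                    return false; // not grouped by trip ID
                }
                currentTripId = row.tripId;
                current = new ArrayList<>();
            }
            current.add(row);
        }
        flush(currentTripId, current, validators);
        return true;
    }

    private static void flush(String tripId, List<StopTimeRow> stopTimes,
                              List<TripTimesStreamingValidator> validators) {
        if (tripId == null) {
            return; // nothing buffered yet
        }
        List<StopTimeRow> view = Collections.unmodifiableList(stopTimes);
        for (TripTimesStreamingValidator validator : validators) {
            validator.validate(tripId, view);
        }
    }
}
```

With this shape each trip's stop time list is built once and shared by
all validators, and only one trip's list is held in memory at a time.
The set of finished trip IDs is kept only to detect ungrouped input.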
HTH,
--Laurent