"Though compared with the sequence-to-sequence or neural transducer architecture, the hybrid approach is admittedly less appealing
as it is not end-to-end trained, it is still the best performing system
for authors’ practical problems. It also has the advantage that it can
be easily integrated with other knowledge sources (e.g., personalized
lexicon) that may not be available during training."
I think it's funny that the authors feel the need to "apologize" that their system is not end-to-end trained. The reasons they give for using it (better performance, seamless integration of new knowledge sources) are big advantages over end-to-end systems.