The point of parallel data processing is that the CPU constructs the next batch of data (querying the dataset, aggregating examples, applying transformations) while the computation itself (the gradient step) runs on the GPU. This means you only benefit when data processing takes a significant amount of time. In most cases, seq-to-seq models are quite slow on the GPU, while data aggregation is simple and fast, so there is little to overlap.
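To make the overlap concrete, here is a minimal sketch of a background-thread prefetcher (the function name `prefetch` and the toy pipeline are my own illustration, not part of any framework): one thread builds batches and fills a small queue while the consumer, which would run the gradient step, pulls ready batches from it.

```python
import queue
import threading

def prefetch(batches, buffer_size=2):
    """Build batches in a background thread so the consumer
    (the GPU step) does not wait for data preparation.
    `batches` is any iterable yielding ready-to-use batches."""
    q = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the stream

    def producer():
        for batch in batches:
            q.put(batch)  # blocks when the buffer is full
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = q.get()
        if batch is sentinel:
            break
        yield batch

def make_batches():
    # Stand-in for the real query + aggregate + transform work.
    for i in range(5):
        yield [i, i + 1]

for batch in prefetch(make_batches()):
    pass  # the gradient step would run here, overlapped with batch building
```

If the per-batch GPU step dominates, the producer thread simply idles with a full buffer, which is exactly the case where parallel loading buys you nothing.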
I would suggest using the Timing extension to measure what percentage of time is spent on data processing versus the training itself.
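If you want a quick, framework-independent check before reaching for the extension, a crude wall-clock split works too. This is my own sketch (the names `profile_loop`, `next_batch`, and `train_step` are illustrative placeholders, not a real API):

```python
import time

def profile_loop(next_batch, train_step, n_iters=100):
    """Split wall-clock time between batch preparation and the
    training step over n_iters iterations; returns the fraction
    of total time spent in each."""
    data_time = 0.0
    train_time = 0.0
    for _ in range(n_iters):
        t0 = time.perf_counter()
        batch = next_batch()       # CPU-side data preparation
        t1 = time.perf_counter()
        train_step(batch)          # gradient step (GPU in practice)
        t2 = time.perf_counter()
        data_time += t1 - t0
        train_time += t2 - t1
    total = data_time + train_time
    return data_time / total, train_time / total
```

If the data fraction comes out small (say, a few percent), parallel data processing will give you essentially no speedup.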