I ran into the same issue as you, using params_set=transformer_librispeech_v2.
The problem is that when running on TPU, batch_size=16 means one batch = 16 utterances, while by default it uses dynamic batching and I think it ends up using only 1 example per step.
To get a real batch size of 16, you need to add use_fixed_batch_size=True to force it to use 16 samples per step.
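For reference, here is one way I'd wire that in. This is a minimal sketch assuming this is the tensor2tensor transformer_librispeech_v2 hparams set; the derived hparams-set name below is just illustrative. It registers a new hparams set that starts from transformer_librispeech_v2 and flips the flag:

```python
# Sketch: derive an hparams set from transformer_librispeech_v2 with
# dynamic (bucketed) batching disabled, so every training step really
# contains batch_size utterances.
from tensor2tensor.models import transformer
from tensor2tensor.utils import registry


@registry.register_hparams
def transformer_librispeech_v2_fixed_batch():
  """transformer_librispeech_v2 without dynamic batching (illustrative name)."""
  hparams = transformer.transformer_librispeech_v2()
  hparams.use_fixed_batch_size = True  # force exactly batch_size examples per step
  return hparams
```

Then point --hparams_set at the new set, or skip the extra set entirely and just override it from the command line with something like --hparams='use_fixed_batch_size=True'.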
Some numbers from my experiments on a GPU (1080 Ti):
- without use_fixed_batch_size=True: ~10+ steps/sec, 65% accuracy after 1e6 steps, GPU utilization ~50%
- with use_fixed_batch_size=True: ~1.5 steps/sec, 89% accuracy after 170K steps so far, GPU utilization ~90%
Hope this helps.