As far as I understand [1], whether zero-padding alone is enough to implement the transposed convolution depends on which kind of transposed convolution you need: you described the case of a transposed convolution over an unpadded input with unit stride. UpSample2D and ZeroPadding2D can be combined, depending on the particular case, to produce the input map that then gets convolved in the next step. So instead of inserting zeros between the original units (pixels), UpSample2D simply replicates them, i.e. performs nearest-neighbour upsampling. Any required padding can then be handled with ZeroPadding2D, and the convolution filters are finally learned on that enlarged input.
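A minimal sketch of that pipeline in Keras (assuming a TensorFlow backend; the actual layer name in the Keras API is `UpSampling2D`, which is what the text calls UpSample2D, and the concrete shapes and filter counts here are just for illustration):

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(16, 16, 64))

# Replicate each pixel 2x2 (nearest-neighbour upsampling) instead of inserting zeros.
x = layers.UpSampling2D(size=(2, 2))(inputs)          # -> (32, 32, 64)
# Pad the border so the following 'valid' 3x3 convolution keeps the 32x32 size.
x = layers.ZeroPadding2D(padding=(1, 1))(x)           # -> (34, 34, 64)
# Learn the convolution filters on the upsampled, padded map.
x = layers.Conv2D(32, kernel_size=3, padding="valid")(x)  # -> (32, 32, 32)

model = tf.keras.Model(inputs, x)
model.summary()
```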
The performance penalty they refer to in [1] comes from the extra computation caused by enlarging the input with zeros, which a dedicated transposed-convolution operation can avoid. However, UpSample2D does not produce these zero patterns, so its output will most likely differ from the 'performant' way of doing it, and the equivalence from [1] no longer holds. I guess TensorFlow implements the transposed convolution more efficiently, but I haven't looked at the code yet.
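To make the difference concrete, here is a small sketch (plain NumPy, values chosen just for illustration) contrasting the zero-insertion pattern that a stride-2 transposed convolution effectively convolves over with the nearest-neighbour replication that UpSample2D produces:

```python
import numpy as np

x = np.array([[1, 2],
              [3, 4]])

# Zero insertion between the original pixels:
zero_inserted = np.zeros((4, 4), dtype=x.dtype)
zero_inserted[::2, ::2] = x
# [[1 0 2 0]
#  [0 0 0 0]
#  [3 0 4 0]
#  [0 0 0 0]]

# Nearest-neighbour replication (what UpSampling2D does):
replicated = np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)
# [[1 1 2 2]
#  [1 1 2 2]
#  [3 3 4 4]
#  [3 3 4 4]]
```

Convolving the same learned kernel over these two maps generally gives different outputs, which is why the two approaches are not interchangeable.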