Hey Gene,
To clarify: would the most useful response contain both the overall timing of the phrase and the timing of each individual word? That seems like the best fit for most applications. The level of detail could perhaps be controlled by a request parameter, or it could be exposed through a separate API method.
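To make the question concrete, here is a rough sketch of what I mean by a response whose level of detail depends on a request parameter. Every name in it (`WordTiming`, `PhraseTiming`, `build_response`, the `detail` parameter and its values) is made up for illustration and is not the actual API:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WordTiming:
    word: str
    start: float  # seconds from the start of the audio
    end: float


@dataclass
class PhraseTiming:
    text: str
    start: float
    end: float
    words: List[WordTiming] = field(default_factory=list)


def build_response(phrase: PhraseTiming, detail: str = "phrase") -> dict:
    """Hypothetical helper: shape the response by a `detail` request parameter."""
    if detail not in ("phrase", "words"):
        raise ValueError(f"unknown detail level: {detail!r}")

    # The overall phrase timing is always included.
    response = {"text": phrase.text, "start": phrase.start, "end": phrase.end}

    # Per-word timings are added only when the caller asks for them.
    if detail == "words":
        response["words"] = [
            {"word": w.word, "start": w.start, "end": w.end}
            for w in phrase.words
        ]
    return response
```

With this shape, a client that only needs the phrase timing gets a small payload, and one that needs word-level timing asks for `detail="words"`. A separate API method would work just as well; it would simply always return the `words` list.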
As for the specific details of how the output is generated: as the docs mention, it uses deep learning neural networks, so it's not really possible to give a clean formal analysis of how it transforms input to output.
Regards,