Hi Michael,
At present the UV just plays the video. However, the IIIF community is very interested in extending the model to support audio, video and 3D, and as the specification evolves the UV will begin to implement features like this in an interoperable way. The current UV implementation is a simple interim model until there is an agreed standard.
Last week there was a workshop in London to lay the groundwork for extending IIIF beyond images, and a report from it will be published very soon. The report details a large number of use cases, including this one. I'll post the details here as soon as the first report has been assembled.
With an agreed data model, UV can implement feature requests like this and ensure interoperability with others publishing manifests using the same model.
As an interim solution, you could modify the UV's use of mediaElement to render the WebVTT contents. In your manifest, you need a way of stating, within the current UV interim model, that the WebVTT file is available. You could annotate the video with the WebVTT file in the same way that this manifest has an annotation on the video that is a PDF transcript:
The "format" would be text/vtt instead of application/pdf.
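A rough sketch of how a modified viewer might pick the WebVTT file out of such annotations is below. The field names ("@type", "resource", "on") and both helper functions are assumptions for illustration, loosely modelled on the PDF transcript annotation described above. They are not part of the UV's API or of any agreed IIIF model, so adjust them to match whatever your manifest actually contains.

```typescript
// Hypothetical annotation shape: these field names are assumptions,
// not a published IIIF or UV schema.
interface AnnotationBody {
  "@id": string;   // URI of the annotating resource (PDF, WebVTT, ...)
  format: string;  // media type, e.g. "application/pdf" or "text/vtt"
}

interface MediaAnnotation {
  "@type": string;
  resource: AnnotationBody;
  on: string;      // URI of the video being annotated
}

// Build an annotation that attaches a WebVTT file to a video.
function makeVttAnnotation(videoUri: string, vttUri: string): MediaAnnotation {
  return {
    "@type": "oa:Annotation",
    resource: { "@id": vttUri, format: "text/vtt" },
    on: videoUri,
  };
}

// Return the URIs of all WebVTT annotations that target the given video.
function findVttUris(annotations: MediaAnnotation[], videoUri: string): string[] {
  return annotations
    .filter((a) => a.on === videoUri && a.resource.format === "text/vtt")
    .map((a) => a.resource["@id"]);
}
```

Each URI returned by findVttUris could then be given to the player, for example as the src of a track element on the underlying HTML5 video element, which is where the change to the UV's mediaElement setup would come in.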
Tom