This is hard to answer without a bit more detail about your specific setup. Is the problem to detect a delay between the original and transcoded audio signals? Or do you also need to detect the delay between the video and the audio?
If it's the former, you could definitely do that with librosa in a number of ways. DTW is probably overkill, since I wouldn't expect the delay to vary over time; most likely it's just a global offset (e.g., the transcoded audio starts 0.5 seconds later, but otherwise looks more or less identical). The simplest thing to do here is to compute the cross-correlation between the two signals and find the sample offset (lag) corresponding to the peak.
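As a rough sketch of the cross-correlation idea, something like the following should work. It uses plain NumPy and computes the correlation via FFT. The function name `estimate_offset` is just something I made up for illustration, and it assumes both signals are mono and at the same sampling rate (loading both files with `librosa.load` and the same `sr` would give you that).

```python
import numpy as np

def estimate_offset(reference, other, sr):
    """Estimate the global delay of `other` relative to `reference`.

    Hypothetical helper, not part of librosa. Assumes two mono signals
    sampled at the same rate `sr`. A positive lag means `other` starts
    later than `reference`.
    """
    # Zero-pad so the circular correlation computed by the FFT
    # behaves like a linear one.
    n = len(reference) + len(other) - 1
    nfft = 1 << (n - 1).bit_length()

    ref_fft = np.fft.rfft(reference, nfft)
    other_fft = np.fft.rfft(other, nfft)

    # xcorr[k] = sum_n reference[n] * other[n + k]
    xcorr = np.fft.irfft(other_fft * np.conj(ref_fft), nfft)

    # Non-negative lags sit at the start of the array,
    # negative lags wrap around to the end.
    peak = int(np.argmax(xcorr))
    lag = peak if peak < len(other) else peak - nfft
    return lag, lag / sr
```

If the two signals were loaded as `y_orig` and `y_trans`, then `lag, seconds = estimate_offset(y_orig, y_trans, sr)` gives the offset in samples and in seconds. For very long files you may want to run this on just the first minute or so of each signal to keep memory use down.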
If you need to synchronize against video, that's a whole other can of worms, and I'm not sure librosa can do much for you here.