Pilot-tone simulation

Fabrizio Carraro

Nov 28, 2021, 6:50:53 PM

to librosa

I'm trying to simulate the pilot tone that was used in the past to sync film and audio. The camera's speed is not stable, so its motor drives a tone generator whose frequency follows the speed variations. This tone is recorded on one track of a tape recorder. Another track records the actual audio from the microphone.

After the film is developed, perfect picture-audio sync is achieved because the pilot tone's deviations from a reference frequency are used to speed up or slow down the tape recorder. The audio is then not perfectly in pitch, but it is in perfect sync.

I need to do a similar thing. I have an audio file with a slightly varying tone on one track and music on another track. I need to continuously vary the speed (resample at a varying sample rate) so that the pilot tone becomes flat (a constant frequency) and the music track is varied in the same way.

My idea: count the number of cycles of the pilot tone in a not-too-small time interval (200 ms) and compare it with the wanted (reference) number of cycles. Then interpolate the audio track samples accordingly, so that every 200 ms interval contains the reference number of pilot cycles.

I'm sure this is easier to accomplish with some librosa functions. That would save me from writing the resampling algorithm myself. Any ideas?
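A minimal sketch of the counting idea, assuming the pilot track is already a clean, roughly sinusoidal tone centred on zero. The function names, the zero-crossing counter, and the 800 Hz reference in the usage note are illustrative assumptions, not librosa functions:

```python
import numpy as np

def count_rising_crossings(x):
    """Count upward zero crossings (one per cycle of a clean tone)."""
    x = np.asarray(x, dtype=float)
    return int(np.count_nonzero((x[:-1] < 0) & (x[1:] >= 0)))

def speed_ratios(pilot, sr, ref_hz, window_s=0.2):
    """Ratio of measured to reference pilot cycles for each full window."""
    n = int(round(sr * window_s))      # samples per window
    expected = ref_hz * window_s       # reference cycles per window
    ratios = []
    for start in range(0, len(pilot) - n + 1, n):
        cycles = count_rising_crossings(pilot[start:start + n])
        ratios.append(cycles / expected)
    return np.array(ratios)
```

For a pilot that runs 10% fast (880 Hz against an 800 Hz reference) each window yields a ratio of about 1.1. Counting whole cycles limits the resolution to one cycle per window, which is roughly 0.6% for 160 reference cycles in 200 ms.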

Fabrizio Carraro

Nov 28, 2021, 7:06:06 PM

to librosa

Maybe it would be easier to understand if I explain the exact purpose.

I'm grabbing audio from old 8mm sound films that have a magnetic stripe along the side of the image.

I only grab the sound. The images have already been grabbed at 24 frames/sec using separate video equipment.

To grab the sound I use a good old 8mm sound projector. It is good, but its speed is not stable.

I mounted a sensor on the projector motor that gives me exactly 33 cycles per frame, so that when the projector runs at 24 fps the sensor emits roughly 800 Hz (24 x 33 = 792 Hz). But the projector speed is not precise and can vary with time. I record the sound from the film together with the approximate, varying 800 Hz tone from the sensor, so that later I can align the sound perfectly with the 24 fps frame rate.

What is the best way to accomplish that? Interesting, isn't it?

Brian McFee

Nov 29, 2021, 7:37:43 AM

to librosa

This ought to be doable, though I don't think librosa will 100% solve it for you.

If you have the pilot tone reasonably well isolated (even if it's modulating), you could track it with the yin (or pyin) function with the frequency range bounded to 800 Hz +- however much wobble you expect to see. This will give you a time-varying estimate of the frequency, from which you could infer the modulation in the projector speed.

Now, once you have that estimate, correcting the audio is a more delicate process, since it will involve a nonlinear time warping. Librosa's time stretching (phase vocoding) is extremely basic, and only supports uniform time scaling. For non-uniform scaling, I would recommend rubberband. There is a basic python wrapper for rubberband, and it exposes non-uniform time stretching here: https://pyrubberband.readthedocs.io/en/stable/generated/pyrubberband.pyrb.timemap_stretch.html

It sounds like a fun project though! Please report back if you get something working.

Dan Ellis

Nov 29, 2021, 8:05:35 AM

to Brian McFee, librosa

I think Fabrizio wants to keep the pitch shift that comes from resampling, i.e., if the reference track is 10% fast (880 Hz), the audio track will also be fast, and he wants to slow it down, both to make it longer in time, and to lower the pitch.

If that's correct, rubberband is doing too much (modifying the timing without changing the pitch, which is more complicated). I think what's required is time-varying resampling (driven by the pitch track, e.g. from yin as you suggest, or perhaps more precisely as the relevant instantaneous frequency bin from librosa.reassigned_spectrogram).

I once did a project like this as a way to filter out pitch-varying speech by flattening its pitch and then using a fixed comb filter. I used interp4, a 4-point Lagrange interpolator copied from PureData, to quickly make a new time-domain waveform by resampling the original waveform at new sample points calculated by the cumulative sum of the extracted pitch. The waveform 'dm' at this line is pitch-flattened:

https://github.com/dpwe/pitchfilter/blob/master/pitchfilter.m#L42

Unfortunately this project pre-dates librosa, so the code is in Matlab. Translating to Python/numpy should be relatively mechanical, but still nontrivial work.
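As a rough Python/numpy illustration of time-varying resampling driven by a frequency track (this is not a translation of pitchfilter.m): the sketch below swaps the 4-point Lagrange interpolator for plain linear interpolation via np.interp, and assumes a pilot frequency estimate is available for every input sample. The name flatten_speed and its arguments are hypothetical.

```python
import numpy as np

def flatten_speed(audio, pilot_hz, ref_hz):
    """Resample `audio` so a pilot measured at `pilot_hz` would sit at `ref_hz`.

    `pilot_hz` holds one (positive) frequency estimate per input sample.
    """
    audio = np.asarray(audio, dtype=float)
    rate = np.asarray(pilot_hz, dtype=float) / ref_hz   # speed relative to nominal
    # Position on the corrected (output) time axis of every input sample:
    # cumulative sum of the relative speed, starting at 0.
    film_pos = np.concatenate(([0.0], np.cumsum(rate)))[:-1]
    n_out = int(np.floor(film_pos[-1])) + 1
    out_pos = np.arange(n_out)
    in_idx = np.arange(len(audio))
    # Invert the mapping: fractional input position of each output sample.
    in_pos = np.interp(out_pos, film_pos, in_idx)
    # Read the waveform at those fractional positions (linear interpolation).
    return np.interp(in_pos, in_idx, audio)
```

With a constant 880 Hz pilot and an 800 Hz reference, one second of input at 48 kHz comes out about 52800 samples long and an 880 Hz tone in it lands at 800 Hz. Linear interpolation is the crudest choice here; a higher-order interpolator would reduce the high-frequency error.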

DAn.

Fabrizio Carraro

Nov 29, 2021, 11:53:15 AM

to librosa

Yes, the pitch (and duration) have to change following the varying pilot tone.

The project you did, Dan, seems more difficult to solve than mine, since I have the pilot tone, which simplifies everything.

I think the steps may be the following:

1) Clean the pilot tone in order to have a nice sequence of pulses. (I'll lose the phase shift, but 800 Hz is high enough.)

2) The variations will be very slow, so it should be enough to count how many pulses there are in consecutive blocks of one second each. (I'm sure that if the variation is around 0.1% between adjacent blocks the final result will be ok.)

3) Use the ratio between the actual number of pulses (for example 880) and the reference (800 Hz) to resample one second of audio. That means, if I sample at 48 kHz, then the original 48000 samples should become 48000 * (880/800) = 52800.

4) Append all the blocks of interpolated samples. Listen to the result at 48 kHz. Enjoy it.

Step 1 can easily be done using some threshold and gain.

Step 2 can be done just by counting. With 800 pulses/sec there will be an uncertainty of a little more than 0.1% (one pulse in 800), but it should be ok. I don't know if there is a ready-made function for this, but just counting will be easy and safe (no FFT for frequency analysis or similar).

Step 3 can be done with some resampling/interpolation function (not time warping). Here I need your advice.

Step 4 is straightforward.
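Steps 3 and 4 could be sketched roughly as follows, assuming the pulse counts per one-second block from step 2 are already available. The sketch uses plain linear interpolation (np.interp) as a stand-in for a proper resampling function, and the names resample_block and correct_blocks are hypothetical:

```python
import numpy as np

def resample_block(block, n_out):
    """Linearly interpolate `block` onto `n_out` evenly spaced samples."""
    block = np.asarray(block, dtype=float)
    src = np.arange(len(block))
    # Output sample m reads the input at m * len(block) / n_out.
    dst = np.arange(n_out) * (len(block) / n_out)
    return np.interp(dst, src, block)

def correct_blocks(audio, pulse_counts, sr=48000, ref_hz=800):
    """Resample each one-second block by (pulses counted / reference pulses)."""
    out = []
    for i, pulses in enumerate(pulse_counts):
        block = audio[i * sr:(i + 1) * sr]
        if len(block) == 0:
            break                      # more counts than audio: stop
        n_out = int(round(len(block) * pulses / ref_hz))
        out.append(resample_block(block, n_out))
    return np.concatenate(out) if out else np.zeros(0)
```

For example, a block with 880 pulses grows from 48000 to 52800 samples, while a block with exactly 800 pulses passes through unchanged. Because the ratio jumps at block boundaries, this only sounds clean when adjacent blocks differ very little.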

What do you think?

Fabrizio Carraro

Dec 11, 2021, 11:02:18 AM

to librosa

Ok, I got very good results! Initially I tried to get the pilot signal from a magnetic head, but the strong 50 Hz from the motor modulated the signal and caused too much jitter. So I used an optical sensor that detects the opening and closing of the projector's rotating shutter, which gives 72 Hz when the projector runs at 24 frames/sec. The problem is that it runs a bit slower than that and varies a bit with time.

So I applied a threshold to the 72 Hz rectangular wave (which after audio recording looks more like a sawtooth) and counted three teeth at a time to obtain 24 Hz. At 24 fps each frame should be 2000 samples. I count the actual number of samples, take the audio and resample it (I used resampy, thank you Dan). For example, if between three pulses I have 2100 samples, I tell resampy to interpolate them into 2000. Resampy is fast and works well, but for some reason in rare cases it returns one sample less (1999 instead of 2000), so I had to handle this case to keep the numpy arrays from complaining.

Even without continuous interpolation the results are quite good, at least for the quality of audio coming from an old 8mm film. I think they are good because the resampling rate varies only a little between adjacent blocks. But listen to how well it follows extremely big speed variations. The source does not come from an actual 8mm film; I generated both the audio and the pilot signal, simulating what would come from a very, very bad projector. At the end you'll hear errors, since the time window for searching for the rising edge of the pilot tone was exceeded, but most of it is well reconstructed.
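The frame-by-frame procedure can be sketched roughly like this. It is not the actual script: it uses linear interpolation (np.interp) instead of resampy, which makes every output frame exactly 2000 samples long by construction, and the function names and the default threshold are assumptions:

```python
import numpy as np

def rising_edges(pilot, threshold):
    """Indices of samples where the pilot crosses `threshold` going upward."""
    above = np.asarray(pilot) >= threshold
    return np.flatnonzero(~above[:-1] & above[1:]) + 1

def resample_to_length(x, n_out):
    """Linearly interpolate `x` onto exactly `n_out` samples."""
    x = np.asarray(x, dtype=float)
    dst = np.arange(n_out) * (len(x) / n_out)
    return np.interp(dst, np.arange(len(x)), x)

def align_to_frames(audio, pilot, threshold=0.0,
                    teeth_per_frame=3, samples_per_frame=2000):
    """Cut the audio at every third shutter pulse and fix each frame's length."""
    edges = rising_edges(pilot, threshold)[::teeth_per_frame]
    frames = [resample_to_length(audio[a:b], samples_per_frame)
              for a, b in zip(edges[:-1], edges[1:])]
    return np.concatenate(frames) if frames else np.zeros(0)
```

Audio before the first detected edge and after the last one is dropped in this sketch. With a band-limited resampler in place of np.interp, an explicit pad or trim to samples_per_frame would cover the occasional off-by-one length.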
Thanks to you all!
Fabrizio