Before there were neural networks and machine learning to do all our computer talking, speech was generally synthesized by modeling the vocal cords and the resonant cavities of the mouth. There are decades of research into this, and probably the most well-known classic example is SAM, the Software Automatic Mouth, from the early '80s.

Even though I love the way SAM sounds, and just look at him, it wasn't quite what I had in mind for the martians in this game. Something less familiar would be nice for a start. Also, speech synths in this form are finely honed masterpieces of software design, and who's got time for that?


Maybe the problem is that we're working with raw samples in the time domain. Let's switch to the frequency domain and try interpolating there. We want this to work on portable hardware, so first break the audio clips down to a limited set of important frequencies.


I put together some numpy/scipy Python code to do this part. The result is a set of frequencies and amplitudes that can be used to reconstruct the sound with sine waves. Baby's first MP3. Maybe there's something interesting we can do once everything is driven by sine waves instead of individual samples.
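
For the curious, here's a rough C sketch of that idea; the real thing was numpy/scipy, and the names here (Partial, analyze_frame) are made up for illustration. It picks the loudest DFT bins as frequency/amplitude pairs, then rebuilds the clip as a plain sum of sines:

    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    #define SAMPLE_RATE 11025
    #define FRAME_LEN   512   /* analysis window, in samples */
    #define NUM_PEAKS   16    /* the "limited set of important frequencies" */

    typedef struct { float freq, amp; } Partial;

    /* Naive DFT magnitude at bin k. Fine for a sketch; use a real FFT
       for anything serious. */
    static float bin_magnitude(const float *x, int n, int k)
    {
        float re = 0.f, im = 0.f;
        for (int i = 0; i < n; i++) {
            float ph = 2.f * (float)M_PI * k * i / n;
            re += x[i] * cosf(ph);
            im -= x[i] * sinf(ph);
        }
        return sqrtf(re * re + im * im) / n;
    }

    /* Keep only the NUM_PEAKS loudest bins as (frequency, amplitude) pairs. */
    static void analyze_frame(const float x[FRAME_LEN], Partial out[NUM_PEAKS])
    {
        float mag[FRAME_LEN / 2];
        for (int k = 0; k < FRAME_LEN / 2; k++)
            mag[k] = bin_magnitude(x, FRAME_LEN, k);
        for (int p = 0; p < NUM_PEAKS; p++) {
            int best = 0;
            for (int k = 1; k < FRAME_LEN / 2; k++)
                if (mag[k] > mag[best]) best = k;
            out[p].freq = (float)best * SAMPLE_RATE / FRAME_LEN;
            out[p].amp  = 2.f * mag[best]; /* fold in the mirrored half */
            mag[best] = 0.f;               /* don't pick the same bin twice */
        }
    }

    /* Rebuild n samples as a plain sum of sine waves. Phase is thrown
       away; frequencies and amplitudes alone are enough for this test. */
    static void resynthesize(const Partial p[NUM_PEAKS], float *out, int n)
    {
        for (int i = 0; i < n; i++) {
            float s = 0.f, t = (float)i / SAMPLE_RATE;
            for (int k = 0; k < NUM_PEAKS; k++)
                s += p[k].amp * sinf(2.f * (float)M_PI * p[k].freq * t);
            out[i] = s;
        }
    }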


Sounds basically right, if a little muffled. I can hear the martians already. Now try blending between the reconstructed clips by interpolating the sine frequencies. This should hopefully be an improvement on crossfading the samples. Waveform (top) & spectrogram (bottom):


Nope! That makes a good siren, but it doesn't sound like a voice. The peak frequencies can be pretty far apart for each vowel, and there's a "whipping" effect when interpolating between them. I suspect the spectrum is just too sparse and it would need a lot more sine waves to avoid sounding artificial. This rabbit hole sucks. With basically no DSP or speech-theory experience this was a lost cause, and after a fair bit of flailing I put the whole thing down and moved on to other, non-speech stuff.
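
Mechanically, the blend was never the hard part. A sketch of what it amounts to, reusing the made-up Partial type from the sketch above:

    /* Linearly blend two analyzed frames by pairing partials by index.
       Nothing guarantees that peak k of one vowel is the same vocal
       feature as peak k of another, so paired peaks can sweep across
       wide gaps: hence the siren. */
    static void blend_partials(const Partial *a, const Partial *b,
                               float t, Partial *out, int count)
    {
        for (int k = 0; k < count; k++) {
            out[k].freq = a[k].freq + t * (b[k].freq - a[k].freq);
            out[k].amp  = a[k].amp  + t * (b[k].amp  - a[k].amp);
        }
    }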


Almost a year later, I came back to the speech synthesis task. Having forgotten most of the first try, my plan this time was to go even simpler. Why synthesize anything at all? Just string together a bunch of audio clips and call it a day.


Well, maybe. I didn't give it much of a chance. This technique just wasn't grabbing me: one, for being too simple, and two, for having so few limitations. There are infinite ways to prepare and process audio clips for sequential playback. I need constraints, the more the better.


If, like me, you've ever crossed paths with The Talking Moose on a classic Mac, then you know all three requirements are handily met by a traditional speech synthesizer. The Moose is based on MacinTalk, which as far as I can tell is implemented similarly to SAM.


I trashed all my previous code and researched the details of how these synths actually work. The seminal model here is the Klatt speech synthesizer. Dennis Klatt worked out a system of cascading and parallel filters applied to a fundamental waveform plus noise, along with all the complicated articulations necessary to sound like a human voice. Back in 1980.


Brief summary: the human vocal system has cavities that amplify and resonate the vocal cords' vibrations at certain frequency bands. These bands are called formants, and they change based on the shape of the mouth, tongue, lips, palate, etc. Each vowel sound has a different set of formants.


Formants are different from the peaks I was detecting in my first try in that they're independent of the fundamental pitch of the voice, and they have a bandwidth. To get the formant frequencies and bandwidths I ditched my custom Python code and switched to using Praat, which is laser-focused on this exact task.


Note that these are not marking sharp spikes so much as broad hilltops in the spectrum. It's somewhat surprising (to me) that these vowel formant frequencies don't vary much from person to person. From Synthesizing static vowels and dynamic sounds:


With this formant model in hand, I wrote C code for synthesizing vowels using a sawtooth wave passed through a series of resonating filters. I'm familiar with using these kinds of filters in hardware and DAW synths, but what do they look like in actual code? Basically just a weighted sum of the current sample and the previous outputs, where the weights are calculated from your desired filter frequency, resonance, and bandwidth. MusicDSP.org was a great resource while working on this.
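
Here's a minimal sketch of one of these resonators, using the coefficient formulas from Klatt's 1980 paper; the structure and names are mine, not the game's actual code:

    #include <math.h>

    /* One formant resonator: each output is a weighted sum of the current
       input sample and the two previous outputs. The weights come from the
       formant's center frequency and bandwidth (Klatt 1980). */
    typedef struct {
        float a, b, c; /* filter weights */
        float y1, y2;  /* the two previous outputs */
    } Resonator;

    static void resonator_set(Resonator *r, float freq_hz, float bw_hz, float fs)
    {
        float t = 1.f / fs;
        r->c = -expf(-2.f * (float)M_PI * bw_hz * t);
        r->b = 2.f * expf(-(float)M_PI * bw_hz * t)
                   * cosf(2.f * (float)M_PI * freq_hz * t);
        r->a = 1.f - r->b - r->c; /* normalizes gain to 1.0 at DC */
    }

    static float resonator_run(Resonator *r, float x)
    {
        float y = r->a * x + r->b * r->y1 + r->c * r->y2;
        r->y2 = r->y1;
        r->y1 = y;
        return y;
    }

    /* Usage sketch: an AH-ish vowel from a sawtooth pushed through three
       cascaded resonators. Formant values are textbook ballparks, not the
       game's actual numbers. */
    static void synth_ah(float *out, int n)
    {
        Resonator f1 = {0}, f2 = {0}, f3 = {0};
        resonator_set(&f1,  700.f, 110.f, 11025.f);
        resonator_set(&f2, 1220.f, 120.f, 11025.f);
        resonator_set(&f3, 2600.f, 150.f, 11025.f);
        float phase = 0.f;
        for (int i = 0; i < n; i++) {
            phase += 110.f / 11025.f;      /* 110Hz fundamental */
            if (phase >= 1.f) phase -= 1.f;
            float saw = 2.f * phase - 1.f; /* naive sawtooth */
            out[i] = resonator_run(&f3,
                     resonator_run(&f2,
                     resonator_run(&f1, saw)));
        }
    }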


Pretty much exactly like SAM, and glad for it. Nothing like a desperate third attempt to slide the goalposts right up. The AH-EE-AH blend that didn't work before sounds fine now when interpolating the filter frequencies:
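
The same lerp that failed on raw spectral peaks works here because it's the filter targets being interpolated, not the output spectrum directly. A sketch reusing the Resonator above; the one-pole glide is my assumption, not necessarily what the game does:

    /* Glide one formant toward the next vowel's target frequency and
       refresh the filter weights. Call per block, or per sample if you
       can afford it; rate sets the glide speed (0..1). */
    static void glide_formant(Resonator *r, float *freq_hz, float target_hz,
                              float rate, float bw_hz, float fs)
    {
        *freq_hz += (target_hz - *freq_hz) * rate; /* simple one-pole glide */
        resonator_set(r, *freq_hz, bw_hz, fs);
    }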


The next step was to integrate a noise source to synthesize the 't', 'ch', 's', 'k', and other non-voiced sounds. Unlike the vowels, these use a separate, parallel filter bank. I made some good progress here before realizing that (A) this part is much trickier, since it requires careful modulation to sound right, and (B) the end result would be better-sounding human speech, which I wasn't really after.


The final vowel synth has parameters for overall speed, input waveform, fundamental pitch (including sub-oscillator, vibrato, and randomized LFO), and formant frequencies. A set of these parameters defines the basic sound of a voice.
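
As a sketch, that parameter set might bundle up like this; the field names and formant count are my invention, since the devlog lists the parameters but not the code:

    typedef struct {
        float speed;            /* overall speaking rate */
        int   waveform;         /* input waveform selection */
        float pitch_hz;         /* fundamental pitch */
        float sub_osc_level;    /* sub-oscillator mix */
        float vibrato_hz;
        float vibrato_depth;
        float lfo_random_depth; /* randomized LFO on the pitch */
        float formants[5][2];   /* per-formant {frequency, bandwidth};
                                   the top two get pinned, see below */
    } VoiceParams;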


One trick I found to get slightly less muffled output was to fix the two highest-frequency formants at 4kHz and 6kHz. This is essentially an HF boost, and it adds a noticeable crispness when running at the synth's relatively low 11kHz sample rate.


An interesting discovery is that creating intelligible words is still possible with this limited vowel-only synth. Quick starts and stops are almost enough to fake a few consonants. The brain fills in the rest, I guess.


Even in its simplified vowel-only form, the synth still has a fair bit of complex C code. I wrote everything initially using floating-point math, as a sane person would. Unfortunately, the performance requirements of running at 11kHz were too much for the hardware, mostly because of the floats. Once it was working okay, I refactored the inner loops to use fixed-point math.


Testing on the arm64 Playdate hardware, I found that int64 arithmetic is the fastest, then int32, then float. To keep the memory cache unstressed I settled on int32 with an S.15.16 fixed-point format. Ideally you'd want more bits after the binary point when dealing with mostly -1.0 to 1.0 audio samples, but in this case the range also needed to cover several multiples of the sample rate, so with a unified format I couldn't slide the point very far left.
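
A minimal sketch of that format, assuming ordinary C inline helpers rather than whatever the game actually uses:

    #include <stdint.h>

    /* S.15.16: 1 sign bit, 15 integer bits, 16 fractional bits in an
       int32. The +/-32767 integer range covers several multiples of the
       11kHz sample rate, at the cost of fractional precision for the
       mostly -1.0 to 1.0 audio samples. */
    typedef int32_t fx;
    #define FX_SHIFT 16
    #define FX_ONE   (1 << FX_SHIFT)

    static inline fx    fx_from_float(float f) { return (fx)(f * FX_ONE); }
    static inline float fx_to_float(fx a)      { return (float)a / FX_ONE; }

    /* Widen to int64 for the multiply so the 48-bit intermediate can't
       overflow; this also plays to the int64-is-fastest result above. */
    static inline fx fx_mul(fx a, fx b)
    {
        return (fx)(((int64_t)a * b) >> FX_SHIFT);
    }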


Speaking of performance, one might ask how SAM and MacinTalk could run perfectly fine on 40-year-old computers. Based on Tyomitch's reverse engineering, it seems that, besides lots of clever optimisations, they summed pre-baked formant waveforms instead of running the math-heavy filter code. Sounds a bit like my first attempt up there.
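
As a rough sketch of that idea, with placeholder table sizes and contents rather than SAM's actual data or layout:

    #include <stdint.h>

    #define TABLE_LEN 256

    /* One cycle of each formant's ringing response, rendered offline
       (e.g. by running the filter once and capturing its output). */
    static int8_t formant_wave[3][TABLE_LEN];

    /* Per-sample synthesis is then just lookups and adds: each formant
       runs its own phase accumulator at its own rate, retriggered in
       sync with the fundamental pitch. */
    static int16_t prebaked_sample(uint32_t phase[3], const uint32_t inc[3],
                                   const uint8_t amp[3])
    {
        int16_t s = 0;
        for (int f = 0; f < 3; f++) {
            s += (formant_wave[f][phase[f] >> 24] * amp[f]) >> 8;
            phase[f] += inc[f];
        }
        return s;
    }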


Once I was happy with the audio output the next step was to hook it into the game's visuals. Knowing I'd want the martians to eventually talk, I've been drawing three frames of animation for each mouth since the beginning:


In-game, the vowel synth keeps track of which formant sets are used in a word, and the game logic can query which one is currently playing. The full voiced list of a e i o u m r is reduced down to a o m for mouth frame selection. Add a little shake and Bob's your uncle.
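
One plausible shape for that reduction; the devlog gives the two lists but not the mapping, so the grouping here is my guess:

    enum { MOUTH_A, MOUTH_O, MOUTH_M }; /* the three drawn frames */

    /* Map the currently-playing voiced sound to a mouth frame. */
    static int mouth_frame(char sound)
    {
        switch (sound) {
            case 'a': case 'e': case 'i': return MOUTH_A; /* open */
            case 'o': case 'u':           return MOUTH_O; /* rounded */
            default:                      return MOUTH_M; /* m, r: closed */
        }
    }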


I spent too long working on all of this, really. Anyone paying for 'talking martians' would have a few questions. Luckily, it's just me using up my own energy here, and spending ages learning about things is 90% of why I make games.


STILL, wouldn't it be nice if the speech synth could be used for more than a few martian dialog bleeps and bloops? Can I take this minor side feature and expand it into something more important, papering over the embarrassingly long dev time to make it all look intentional?


Well, it's hard to turn a vowel-only speech synth into a core feature in an already-weird game like this, but that'll hardly stop me from trying. And hey, I got an idea while implementing the speech bubble animations.


How is a single glyph enough? I don't know; I hadn't figured that part out yet. It should be possible to reduce the conversation surface enough to use emotes or other solitary symbols. I just knew that I didn't want to put straight text translations in the bubble, and that adding some ridiculous restriction here might be fruitful.


This number is an index into the BLAB-O-DEX, accessible from the Playdate's in-game system menu. Players can refer to the translations here after they've heard the phrase at least once. Unheard entries appear in the list as a blank dash.


Thanks for these updates, can't wait for the final game! I just wanted to add another vote for the idea of using the Galactic Alphabet for speech bubbles. It would look more... I dunno, appropriate, I guess, because the numbers look more like placeholders. And the blab-o-dex where you decipher those symbols would feel more organic in this world, IMO.


This was a really fascinating read, thank you for detailing your process! I'm a sound designer by trade; I edit voices every single day, but I wouldn't know where to even start when it comes to synthesizing the human voice. I'm no programmer, but I think you ultimately ended up in a really cool spot. It's amazing how much range you get even without consonants. Well done!


I think I love you. Many, many thanks for sharing this. I've always struggled with sound implementation, and this adds so many interesting conclusions! I admire your hunger for new knowledge, and in fact yours helped mine here.


Do you reckon replacing the numbers with wacky martian symbols might be cool? I like the idea of translating things, but seeing martians say things like "2!" "3." "1" is a bit weird, if not funny in its own way.


I tried this first, but found it really hard to recognize/remember [random-glyph] as "hello". Better if it's readable, even if unrelated, to make looking it up in the blab-o-dex easier. I may add a letter to separate conversations, so "H2" instead of "31". Still figuring it out.
