Streaming generated TTS wav file


Iman Saleh

Jul 8, 2010, 10:13:14 AM
to uni...@googlegroups.com
Hi,

I managed to integrate my TTS engine with UniMRCP. The TTS engine generates a .wav file with a sampling rate of 16,000 Hz and a resolution of 16 bits. I would like to know how I can stream this audio over the synthesis channel instead of writing the file directly to the hard disk.

Thanks.

Arsen Chaloyan

Jul 8, 2010, 12:49:32 PM
to uni...@googlegroups.com
Hi,

Your engine should have the corresponding streaming API to get
synthesized speech instead of writing it to a wave file. If so, you
should use that API to get/fill media frames from the MPF's read_frame
callback. See demo_synth_stream_read or flite_synth_stream_read
functions available in the source tree.


--
Arsen Chaloyan
The author of UniMRCP
http://www.unimrcp.org

Iman Saleh

Jul 19, 2010, 10:23:50 AM
to uni...@googlegroups.com
Hi,

I checked the demo_synth_stream_read method. What I see is that it reads frames from the audio file and stores them in a buffer. It is still not clear to me what the case will be on my side. Should the TTS engine add frames of the generated speech to frame->codec_frame.buffer? And what should I do with these frames? Should they be written to the generated PCM file (as in the demo)? And which function is responsible for that?

Thanks,
Iman Saleh
R&D Software Developer
RDI, www.rdi-eg.com

Arsen Chaloyan

Jul 20, 2010, 3:55:59 AM
to uni...@googlegroups.com
Hi Iman,

On Mon, Jul 19, 2010 at 7:23 PM, Iman Saleh <iman.sa...@gmail.com> wrote:
> Hi,
>

> I checked demo_synth_stream_read method. What I see is that it reads frames
> from the audio file and stores them in a buffer.

Yes, it doesn't perform actual synthesis; it just simulates the job by
reading frames from the audio input file.


>It is still not clear for me what will the case be on my side. Should the TTS engine add frames of the
> generated speech file to frame->codec_frame.buffer?

Yes, you should write the generated speech to the mpf_frame available
within the demo_synth_stream_read() function. Note that
demo_synth_stream_read() is a callback invoked from the MPF context,
while your engine may produce speech from its own context. Check how
this is handled in the flite plugin using mpf_buffer.

>And what should I do with these frames?

Typically nothing.

>Should they be written to the generated pcm file (as in the demo)?

No!

>And which function is responsible for that?

No such function.

Iman Saleh

Aug 1, 2010, 12:41:09 PM
to uni...@googlegroups.com
Hi Arsen,

I just want to make sure of a few things. I checked the flite plugin; the method flite_speak is the one responsible for synthesis. What I understand is that I should write a similar function that splits the text using some criteria, and then, in a loop, perform synthesis and write the result to synth_channel->audio_buffer each time.

My first question is: is it possible to change the type of synth_channel->audio_buffer to suit the output of my TTS engine?

Also, I still cannot follow what is going on in flite_synth_stream_read. I understand it receives an empty frame as input and writes the content of synth_channel->audio_buffer to it at a point in time, but I still don't know when exactly it is called, or how I can use the filled frame in streaming.

Thanks a lot.

Arsen Chaloyan

Aug 2, 2010, 8:23:43 AM
to uni...@googlegroups.com
Hi Iman,

On Sun, Aug 1, 2010 at 9:41 PM, Iman Saleh <iman.sa...@gmail.com> wrote:
> Hi Arsen,
>
> I just want to make sure of few things. I checked flite plugin, the method
> flite_speak is the one responsible for synthesis. What I understand is that
> I should write a similar function that splits text using some criteria. And
> then in a loop I should perform synthesis and write result to
> synth_channel->audio_buffer each time.

It depends on the capabilities of your engine. Some engines accept
SSML content; others can process only plain text.

>
> My questions are: is it possible to change the type of
> synth_channel->audio_buffer to suit the output of my TTS engine?

Your goal is to provide a synthesized media frame from the callback.
The callback is invoked every 10 msec, while engines usually produce
synthesized speech faster than real time. You may or may not
use mpf_audio_buffer. Again, it's up to you and the engine.

>
> Also I still cannot follow up what is going on in flite_synth_stream_read. I
> understand it receives an empty frame as input and writes to it the content
> of synth_channel->audio_buffer at a point of time, but I still don't know
> when is it called exactly? And how can I use the filled frame in streaming?

This is a typical callback approach. It's called from the MPF core.
The RTP streaming is handled inside the stack. You need to do nothing
else; just provide the synthesized frames!

Iman Saleh

Aug 9, 2010, 11:21:46 AM
to uni...@googlegroups.com
Hi Arsen,

Now I need to play the generated speech. I have read that the RTSP protocol defines a PLAY method that should be responsible for playing streamed data. How can I call it in UniMRCP? Or how can I allow the client to play the streamed media? I am using Voxeo as a client for a UniMRCP server.

Thank you.

Arsen Chaloyan

Aug 10, 2010, 4:33:39 AM
to uni...@googlegroups.com
Hi Iman,

On Mon, Aug 9, 2010 at 8:21 PM, Iman Saleh <iman.sa...@gmail.com> wrote:
Hi Arsen,

Now I need to play generated speech. I have read that RTSP protocol should implement a method play and that method should be responsible for playing streamed data.

MRCP version 1 uses only a few of the methods the RTSP protocol defines. The PLAY method shouldn't be used at all.
http://tools.ietf.org/html/rfc4463#section-3.2
 
How can I call it in unimrcp? or how can I allow client to play streamed media? I am using Voxeo as a client for a unimrcp server.

The only thing you may need to do is properly configure the Voxeo platform to use MRCP instead of its built-in engines. I have no Voxeo installation at the moment, but it was simple enough. You should provide the corresponding IP address and the resource name(s), or just the URI instead.

Iman Saleh

Aug 11, 2010, 10:07:44 AM
to uni...@googlegroups.com
Hi Arsen,

OK, I think the problem has something to do with the sampling rate. The generated speech has a sampling rate of 22050 Hz, and I can see that the supported sampling rates in UniMRCP are 8000, 16000, 32000 and 48000. I think I should declare the supported sampling rates using something like the following (similar to the flite plugin):

    mpf_codec_capabilities_add(
            &capabilities->codecs,
            MPF_SAMPLE_RATE_8000 | MPF_SAMPLE_RATE_16000,
            "LPCM");

Do I have to change the sampling rate of the generated speech, or can I add a new sampling rate? And why are the two values MPF_SAMPLE_RATE_8000 and MPF_SAMPLE_RATE_16000 ORed?

Arsen Chaloyan

Aug 12, 2010, 8:41:25 AM
to uni...@googlegroups.com
Hi Iman,

On Wed, Aug 11, 2010 at 7:07 PM, Iman Saleh <iman.sa...@gmail.com> wrote:
Hi Arsen,

OK, I think the problem has something to do with the sampling rate. The generated speech has a sampling rate of 22050 Hz, and I can see that the supported sampling rates in UniMRCP are 8000, 16000, 32000 and 48000.

Yes, 22050 is not in the list. Perhaps it wouldn't be that hard to add, if needed. Anyway, your engine should be capable of generating speech at different rates. You may try to alter this behavior.

 
I think I should determine the sampling rate using something as follows (similar to flite plugin):

    mpf_codec_capabilities_add(
            &capabilities->codecs,
            MPF_SAMPLE_RATE_8000 | MPF_SAMPLE_RATE_16000,
            "LPCM");

Yes, provide the list of sampling rates your engine supports.


Do I have to change the sampling rate of the generated speech or can I add a new sampling rate?

I'd suggest changing the sampling rate, as the majority of deployments support only 8 kHz, a few support 16 kHz, and I'm not sure about higher rates.
 
and why are the two values MPF_SAMPLE_RATE_8000, MPF_SAMPLE_RATE_16000 ORed?

The values are bit flags, so ORing them advertises both rates in a single capabilities mask; which rates to include is up to you.