Best way to handle alignment of multi-lingual file?


Willem van der Walt

Sep 12, 2018, 6:25:55 AM
to aeneas-forc...@googlegroups.com
Good day,
We have created a few EPUB 3 books with multi-level alignments done with
Aeneas, which works quite well.
For these we used the unparsed xhtml files as input, with the proper id
attributes added, and both human-narrated audio and pre-synthesized audio.
The only real issue at the moment is that Aeneas aligns such an input
xhtml file using the single language/synthesizer that one specifies for the
alignment. When the xhtml file contains more than one language, each
language can be marked with an attribute such as xml:lang="en".
To get around this, I plan on calling Aeneas several times, once for each
chunk of xhtml that is in a different language from the chunk before it.
I then plan on combining the SMIL output received from each alignment run
into one SMIL file once all the chunks are aligned, roughly as sketched below.
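A rough sketch of the plan (file names, ids and the chunk list are
placeholders only; each run would also have to be restricted to the portion
of the audio that matches its chunk, e.g. with is_audio_file_head_length /
is_audio_file_process_length, and the par ids may need to be made unique
before merging):

# Sketch only: one aeneas run per single-language chunk, then merge the
# resulting <par> elements into the <seq> of the first SMIL file.
import subprocess
from lxml import etree

SMIL_NS = "http://www.w3.org/ns/SMIL"

chunks = [
    ("chunk_01_zu.xhtml", "zul", "chunk_01.smil"),
    ("chunk_02_en.xhtml", "eng", "chunk_02.smil"),
]

for text, lang, smil in chunks:
    config = (
        "task_language={}|is_text_type=unparsed|"
        "is_text_unparsed_id_regex=f[0-9]+|is_text_unparsed_id_sort=numeric|"
        "os_task_file_format=smil|os_task_file_smil_audio_ref=audio.mp3|"
        "os_task_file_smil_page_ref=page.xhtml"
    ).format(lang)
    subprocess.check_call([
        "python", "-m", "aeneas.tools.execute_task",
        "audio.mp3", text, config, smil,
    ])

# merge: append every <par> from the later files to the <seq> of the first
merged = etree.parse(chunks[0][2])
seq = merged.find(".//{%s}seq" % SMIL_NS)
for _, _, smil in chunks[1:]:
    for par in etree.parse(smil).iterfind(".//{%s}par" % SMIL_NS):
        seq.append(par)
merged.write("merged.smil", xml_declaration=True, encoding="utf-8")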

Realizing that what I am about to suggest can only work on input formats
where language specification is possible, I thought I would ask about the
possibility of extending Aeneas to do the following in these cases.

Start the alignment using the specified language, as happens currently, but
then, provided that the TTS used for the alignment supports the other
languages in the document, change the TTS language/voice dynamically
whenever an xml:lang attribute occurs in the source xhtml.

Now something else:
As hinted at above, we pre-synthesize the audio when human narration is
unavailable and then align the synthesized audio afterwards using Aeneas.
Would it be possible to extend Aeneas to accept e.g. an xhtml file as
described above, but then both synthesize and align, producing audio output
as well as a SMIL file?
I thought that, since synthesizing the text for the alignment must be one
of the first steps in the current process, it might be possible to simply
keep that synthesized output and dump it into an audio file at the end.
I hope I am not confusing everyone with this second point.
TIA, Willem


Alberto Pettarin

Sep 12, 2018, 5:59:56 PM
to aeneas-forc...@googlegroups.com
Hi,

let's separate the two questions.

1. Support for multi-lingual text/audio: while the built-in CLI tools
only work with single-language inputs, if you use aeneas as a Python
library you can create an input TextFile object containing TextFragments
in different languages. Provided that all the languages used are actually
supported by the chosen TTS engine, aeneas "as a library" should work. I
did a couple of experiments in the past with mixed English and Italian,
so I am sure it works, actually better than the ASR-based tools I tried,
which usually have only mono-lingual models. I doubt that in the near
future I will consider augmenting the built-in aeneas CLI tools to
support this kind of complex parsing.
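A rough sketch of that library route (the fragment ids and texts here are
invented, and you should check the TextFragment constructor and the Task
properties against the aeneas version you have installed):

# Sketch only: one TextFile whose fragments carry different language codes,
# fed to a Task built in code instead of read from an input file.
from aeneas.executetask import ExecuteTask
from aeneas.task import Task
from aeneas.textfile import TextFile, TextFragment

text_file = TextFile()
# TextFragment(identifier, language, lines, filtered_lines); check your version
text_file.add_fragment(
    TextFragment(u"f001", u"eng", [u"Hello world"], [u"Hello world"])
)
text_file.add_fragment(
    TextFragment(u"f002", u"ita", [u"Ciao mondo"], [u"Ciao mondo"])
)

# task_language is only the default; each fragment's own language is what
# the TTS engine is asked to use when synthesizing that fragment
task = Task(config_string=u"task_language=eng|is_text_type=plain|os_task_file_format=json")
task.audio_file_path_absolute = u"/abs/path/to/audio.mp3"
task.sync_map_file_path_absolute = u"/abs/path/to/syncmap.json"
task.text_file = text_file

ExecuteTask(task).execute()
task.output_sync_map_file()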

2. Keeping the synthesized audio: I am a bit confused about this point.
Any decent TTS engine can emit the time markers of each
sentence/word/phoneme in the output audio file; once you have those,
there is no need to use aeneas to synchronize the generated audio with
the input text. For example, eSpeak generates them. If your TTS engine
really does not have any switch to save the markers, you can use the
Synthesizer class to perform the synthesis and get the time markers of
each text fragment, for example as sketched below.
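Something along these lines (again only a sketch; the exact return value of
Synthesizer.synthesize() should be double checked against the code of the
version you are using):

# Sketch only: synthesize a TextFile to a WAV file and keep the per-fragment
# time anchors, so no separate alignment pass is needed afterwards.
from aeneas.synthesizer import Synthesizer
from aeneas.textfile import TextFile, TextFragment

text_file = TextFile()
text_file.add_fragment(
    TextFragment(u"f001", u"eng", [u"First sentence."], [u"First sentence."])
)
text_file.add_fragment(
    TextFragment(u"f002", u"eng", [u"Second sentence."], [u"Second sentence."])
)

synt = Synthesizer()
# assumed return shape: (anchors, total_time, num_chars), where each anchor
# holds (begin_time, fragment_id, fragment_text); verify against your version
anchors, total_time, num_chars = synt.synthesize(text_file, u"/abs/path/out.wav")
for begin, fragment_id, fragment_text in anchors:
    print(begin, fragment_id, fragment_text)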

Best regards,

Alberto Pettarin

Willem van der Walt

Sep 14, 2018, 2:35:03 AM
to aeneas-forc...@googlegroups.com
Hi again.
Thanks for the suggestion regarding using the library.
I will look into that.
Regarding my second point in the message below, just some more context.
Our TTS does produce the offsets into the audio, as you stated, but they
are not easy to get at, and they are not available in our current
client/server implementation of the TTS.
I can write code to get these offsets using the TTS directly, i.e. not
through the client/server implementation, or I will have to extend the
client/server implementation to allow the client to request the offsets.
Using your Synthesizer class might be my quickest way to get the job done.
We are working with some of the publishers here in South Africa to augment
educational books with TTS audio, or with human-narrated audio if available.
We expect issues with things like columns and tables when trying to align
them with, say, a human narration.
The idea was that, since we will likely have to pull some tricks to present
the alignment of the text in the way it appears in the audio, we would be
able to reuse that same code in the cases where we do not have human
narration and have to synthesize e.g. a table in a certain way.
What is interesting is that Aeneas does amazingly well when aligning
e.g. an English phrase in a Zulu book while using a Zulu TTS for the
alignment.
It is not supposed to work, but it kind of does.
Again, thanks for the suggestions.
Kind regards, Willem

Alberto Pettarin

Sep 18, 2018, 4:10:12 PM
to aeneas-forc...@googlegroups.com
Hi,

"Using your synthesizer class might be my quickest way to get the job
done." => OK, understood. Of course in absolute terms is a bit
inefficient (because it accumulates the audio samples by concatenating
one text fragment at a time, and keeping track of the "current" total
time, rather than obtain the same information directly from the TTS engine)

"we would be able to utilize that same code in the case where we do not
have human narration and have to synthesize e.g. a table in a certain
way." => so, basically you are saying that you might have to use the TTS
to "cover up" for audio that was not pre-recorded by the voice talent.
OK, that might work of course.

"It is not suppose to work, but it kind of does." => I think it can be
explained by the fact that, if you have a large enough DTW stripe, the
computed alignment is dominated by the Zulu prefix and suffix around the
"spurious" English text. Also, I guess it depends on the similarity
between the Zulu and English phoneme sets and of the grapheme-to-phoneme
rules of the two languages. But I would bet the first effect (DTW
margin) is the main contributing factor in your case.
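(For reference, the width of that stripe is controlled by the dtw_margin
runtime configuration parameter; a minimal sketch of widening it when
running a task from Python, assuming the rconf argument of ExecuteTask:)

# Sketch only: widen the DTW stripe for a task executed from Python code.
from aeneas.executetask import ExecuteTask
from aeneas.runtimeconfiguration import RuntimeConfiguration
from aeneas.task import Task

task = Task(config_string=u"task_language=zul|is_text_type=plain|os_task_file_format=json")
task.audio_file_path_absolute = u"/abs/path/to/audio.mp3"
task.text_file_path_absolute = u"/abs/path/to/text.txt"
task.sync_map_file_path_absolute = u"/abs/path/to/syncmap.json"

# a larger margin (in seconds) gives a wider stripe: more tolerant, but slower
rconf = RuntimeConfiguration(u"dtw_margin=120")
ExecuteTask(task, rconf=rconf).execute()
task.output_sync_map_file()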

Best regards,

Alberto Pettarin