Does Aeneas work with non-perfect synced audio and text?


Zhang He

Sep 13, 2016, 12:46:11 AM
to aeneas-forced-alignment
Hi Alberto,
This is He ZHANG from China. I'm trying to develop a read-along app for English learning, and I have found aeneas extremely helpful. Thank you for your amazing work!

I have only one problem: if the contents of the text and audio files are only about 95% the same (e.g., some of the text is not spoken in the audio), aeneas still outputs a sync map, but the time marks are wrong.

Is aeneas designed to handle imperfect audio-text alignment?

If the answer is yes, can you show me how to configure aeneas to handle this situation?
If not, can you point me to a possible solution?

Thanks again!

Alberto Pettarin

Sep 13, 2016, 12:11:41 PM
to aeneas-forc...@googlegroups.com
On 09/13/2016 06:46 AM, Zhang He wrote:
> Hi Alberto,
> This is He ZHANG from China, I'm trying to develop an read-along app for
> English Learning purpose, and I found aeneas extremly helpful. Thank
> you for your amazing work!
>
> Only with one problem, If the content of text and audio files are only
> 95%~ the same, eg. some of the texts are not spoken in the audio, the
> Aeneas would still output a sync map, but the time mark is wrong.

Hi,

welcome to the aeneas mailing list!

> Is Aeneas designed to handle the non-perfect audio-text alignment?

No, aeneas is not designed to handle such a situation:

"Audio should match the text: large portions of spurious text or audio
might produce a wrong sync map"

(from
https://github.com/readbeyond/aeneas/#limitations-and-missing-features )

However, the exact answer actually depends on the structure of those
spurious portions, and on the granularity of the fragments (sentence- vs
word- level sync) you are using.

In general, if your spurious text comes as a large, contiguous chunk
(for example, you have a head before and/or a tail after the main,
"correct" text), then aeneas will trip over it. Fortunately, in that
case you can instruct aeneas to ignore X seconds from the start or Y
seconds from the end of the audio file for the purpose of computing the
alignment. To do so, specify the following parameters in your
configuration string:

is_audio_file_head_length=10

and/or

is_audio_file_tail_length=20

(skip 10 seconds from the beginning, 20 seconds from the end)
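
As a minimal sketch of how the two parameters above fit into a full configuration string: the keys is_audio_file_head_length and is_audio_file_tail_length are the ones named in this message, while the helper name build_task_config, the other keys' values, and the file names in the note below are only illustrative assumptions.

```python
def build_task_config(language="eng", text_type="plain", out_format="json",
                      head_seconds=None, tail_seconds=None):
    """Return an aeneas-style configuration string (key=value pairs joined by '|')."""
    pairs = [
        ("task_language", language),
        ("is_text_type", text_type),
        ("os_task_file_format", out_format),
    ]
    # Only add the head/tail keys when a value was actually given.
    if head_seconds is not None:
        pairs.append(("is_audio_file_head_length", head_seconds))
    if tail_seconds is not None:
        pairs.append(("is_audio_file_tail_length", tail_seconds))
    return "|".join("%s=%s" % (key, value) for key, value in pairs)


# Skip 10 seconds at the beginning and 20 seconds at the end.
config = build_task_config(head_seconds=10, tail_seconds=20)
print(config)
```

The resulting string would be passed as the third argument of the command-line tool, e.g. `python -m aeneas.tools.execute_task audio.mp3 text.txt "<config>" map.json` (hypothetical file names).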

For a live example, run:

$ python -m aeneas.tools.execute_task --example-head-tail

See also the documentation:

https://www.readbeyond.it/aeneas/docs/globalconstants.html#aeneas.globalconstants.PPN_TASK_IS_AUDIO_FILE_HEAD_LENGTH

https://www.readbeyond.it/aeneas/docs/globalconstants.html#aeneas.globalconstants.PPN_TASK_IS_AUDIO_FILE_TAIL_LENGTH

Clearly, this will require you to "manually" inspect each audio file, to
evaluate the length of the head/tail, which is inconvenient. However,
there are also options to specify a minimum and a maximum duration for
both the head and the tail, and aeneas will try to figure it out
automatically. Run:

$ python -m aeneas.tools.execute_task --example-sd

for an example.
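
A sketch of the corresponding configuration fragment, under the assumption that the minimum/maximum options are spelled is_audio_file_detect_head_min, is_audio_file_detect_head_max, is_audio_file_detect_tail_min and is_audio_file_detect_tail_max (please check the globalconstants documentation linked above before relying on these names). The helper detection_params and the numeric values are hypothetical.

```python
def detection_params(head_min=None, head_max=None, tail_min=None, tail_max=None):
    """Return a config fragment asking aeneas to detect head/tail lengths itself."""
    # Reject inverted ranges early; values are durations in seconds.
    for low, high, label in ((head_min, head_max, "head"),
                             (tail_min, tail_max, "tail")):
        if low is not None and high is not None and low > high:
            raise ValueError("%s: min (%s) exceeds max (%s)" % (label, low, high))
    named = [
        ("is_audio_file_detect_head_min", head_min),   # assumed key name
        ("is_audio_file_detect_head_max", head_max),   # assumed key name
        ("is_audio_file_detect_tail_min", tail_min),   # assumed key name
        ("is_audio_file_detect_tail_max", tail_max),   # assumed key name
    ]
    return "|".join("%s=%s" % (k, v) for k, v in named if v is not None)


# Let aeneas look for a head of 0 to 15 seconds and a tail of at most 30 seconds.
fragment = detection_params(head_min=0, head_max=15, tail_max=30)
print("task_language=eng|is_text_type=plain|os_task_file_format=json|" + fragment)
```

With a range instead of a fixed length, the same configuration string can be reused for a batch of files whose intros and outros differ in duration.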

If instead your spurious chunk is in the middle of the audio file,
aeneas will probably not be able to process it correctly.

On the other hand, if the spurious parts are scattered through the text
(e.g., sometimes the narrator skips or adds a word, or inverts a few
words, etc.), then aeneas should be able to deal with those.

Finally, let me note that aeneas is known to work well at sentence or
sub-sentence granularity, while it is not perfect at word granularity.
There is another thread in this mailing list about the latter issue.

> If the answer is yes, can you show me how to configure aeneas to handle
> the situation?
> If not, can you point to me what might be the solution?

Try experimenting with the different parameters mentioned above. If
aeneas does not work for you, you can try using another forced aligner.
You can find a list here:

https://github.com/pettarin/forced-alignment-tools

Best regards,

AP

Firat Özdemir

Sep 19, 2016, 6:07:27 PM
to aeneas-forc...@googlegroups.com
Hi He and all,

Just to add one thing for the situations where imperfections are in the middle:

You can increase the tolerance to imperfections by decreasing the value of the runtime configuration option "mfcc_shift" and increasing "dtw_margin". (It is better to do the former after the upcoming update, or to try multiples of 0.02; see aeneas issue #102.)
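
A minimal sketch of passing such runtime options on the command line. It assumes that the option called "mfcc_shift" above is spelled mfcc_window_shift in the runtime configuration and that the string is passed with the -r switch; the helper names, the values 0.020 and 120, and the file names are illustrative only.

```python
def build_runtime_config(window_shift_s, dtw_margin_s):
    """Return a runtime configuration string for the -r switch."""
    if window_shift_s <= 0 or dtw_margin_s <= 0:
        raise ValueError("both values must be positive")
    # mfcc_window_shift is an assumed spelling of the "mfcc_shift" option.
    return "mfcc_window_shift=%.3f|dtw_margin=%d" % (window_shift_s, dtw_margin_s)


def build_command(audio, text, task_config, output, runtime_config):
    """Return the argument list for one execute_task run."""
    return ["python", "-m", "aeneas.tools.execute_task",
            audio, text, task_config, output, "-r=" + runtime_config]


# A smaller window shift and a larger DTW margin than a typical default run.
runtime = build_runtime_config(0.020, 120)
command = build_command("audio.mp3", "text.txt",
                        "task_language=eng|is_text_type=plain|os_task_file_format=json",
                        "map.json", runtime)
print(" ".join(command))
```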

Another trick for improving performance is to use multilevel alignment even when you don't need the higher-level granularities. It lets you go to very high frame rates (i.e., very small mfcc_shift values), beyond what your machine's memory could handle in a single-level run. Those who work with word-level alignment in particular should definitely use the multilevel mode. As I understand it, some people who tried aeneas for word-level granularity in the past weren't satisfied and opted for other tools. Now that multilevel alignment is available, perhaps they should consider trying again. (Alberto, some people might also be reluctant to try aeneas simply because they see DTW as an inferior method compared to HMM. Advertising multilevel alignment as an improvement to the alignment algorithm, which it is, could break that bias.)
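
The memory argument can be illustrated with back-of-the-envelope arithmetic. The sketch below is not a description of aeneas internals: it simply assumes a banded DTW that stores one 8-byte cost per cell, with a band of 2 * margin / shift frames around the diagonal. The function names and the one-hour duration are hypothetical.

```python
def dtw_cells(duration_s, window_shift_s, margin_s=None):
    """Estimate the number of DTW cost cells for audio of the given duration."""
    frames = int(round(duration_s / window_shift_s))
    if margin_s is None:
        # Exact DTW: full frames x frames matrix.
        return frames * frames
    # Banded DTW: only a stripe of +/- margin_s around the diagonal is kept.
    band = min(frames, int(round(2 * margin_s / window_shift_s)))
    return frames * band


def gib(cells, bytes_per_cell=8):
    """Convert a cell count to GiB, assuming 8-byte floats."""
    return cells * bytes_per_cell / float(2 ** 30)


# One hour of audio, 60-second margin, progressively smaller window shifts.
for shift in (0.040, 0.020, 0.005):
    cells = dtw_cells(3600, shift, margin_s=60)
    print("shift=%.3f s -> roughly %.1f GiB" % (shift, gib(cells)))
```

Under these assumptions, halving the shift quadruples the memory, because both the number of frames and the band width double. Applying the fine shift only within short spans, as a multilevel run presumably does, keeps duration_s and hence the product small.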

***This is important and applies to every alignment tool, not only aeneas***:
There's also the possibility that aeneas gives you accurate results but your tests for checking the accuracy are not good enough. For example, if your audio files are encoded as VBR MP3, HTML audio elements and the vast majority of audio players will report wrong positions and take you to wrong positions when you seek to a certain time. As a general rule, do not use VBR MP3 for timed-text playback unless your audio player decodes the whole audio beforehand (as is the case with the Web Audio API). Most audio players, including the HTML5 audio element, seek in VBR MP3 files according to the seek tables at the beginning of the file, a calculation that involves two approximations. Therefore, if you check alignment performance with VBR MP3 files, it will mislead you: you will think the alignment is bad even if it is perfect. With word-level alignment of audio longer than a few minutes, it's a lost cause.

The safest way of testing alignment accuracy is to check with Audacity (using the .tsv output), or to convert your audio file to WAV and then check it with finetuneas (.json output). Unfortunately, seeking in CBR MP3 is occasionally problematic too. When I find time, I will hopefully write a post about best practices for accurate time seeking in timed-text playback. While developing the Oku Player, I spent most of my time finding workarounds for that.
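
The WAV conversion suggested above could be scripted along these lines. This is only a sketch that assumes ffmpeg is installed and on the PATH; the function names, the 44100 Hz sample rate, and the file names are illustrative choices, not something prescribed in this thread.

```python
import subprocess


def wav_conversion_command(src, dst, sample_rate=44100):
    """Build an ffmpeg command that decodes `src` to 16-bit PCM WAV at `dst`."""
    return ["ffmpeg", "-y", "-i", src,
            "-acodec", "pcm_s16le", "-ar", str(sample_rate), dst]


def convert_to_wav(src, dst):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.check_call(wav_conversion_command(src, dst))


# Print the command instead of running it, so the sketch has no side effects.
print(" ".join(wav_conversion_command("chapter1.mp3", "chapter1.wav")))
```

Checking the sync map against the uncompressed file removes the player's seek approximation from the test, so any remaining offset can be attributed to the alignment itself.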

F.O.






--
You received this message because you are subscribed to the Google Groups "aeneas-forced-alignment" group.
To unsubscribe from this group and stop receiving emails from it, send an email to aeneas-forced-alignment+unsubscr...@googlegroups.com.
To post to this group, send email to aeneas-forced-alignment@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/aeneas-forced-alignment/ddbc69dd-04d4-06ea-fea8-fa6ea7028cab%40readbeyond.it.

For more options, visit https://groups.google.com/d/optout.

Alberto Pettarin

Sep 20, 2016, 3:11:39 AM
to aeneas-forc...@googlegroups.com
Firat, thank you for sharing your suggestions, especially the note on
testing with VBR vs. CBR/raw audio in the browser --- which I forgot to
add, and which is especially relevant to people who want word-level
alignment.

=== === ===

On using the multilevel format.

It is true that, while aeneas tries to avoid the need for the user to
know "speech processing" techniques in order to get an output, having
that knowledge (or, at least, grasping what aeneas computes) certainly
helps if the user wants to delve into advanced applications or tweaking
parameters.

Debating whether DTW is better or worse than speech-recognition-based
aligners is something that I gladly leave to academia. (In theory it is
possible to compare scientifically the quality of the alignments
produced by aeneas and by other tools, but this is not at the top of my
list of priorities, nor do I have the resources for that. Also, my
impression is that the speech community is now all focused on ASRs, and
thus forced aligners are no longer a trendy topic.)

aeneas is free software, so I guess users should follow your suggestion:
try it, tweak it, and judge for themselves whether it solves their
problems. If yes, good; if not, there are other tools out there to try.

To improve the situation, I can add a note about using multilevel in the
docs for v1.6.0, and a straw-man explanation of how aeneas works in the
wiki/ directory.

AP

Firat Özdemir

Sep 20, 2016, 4:33:37 PM
to aeneas-forc...@googlegroups.com
On this mailing list, I remember at least one thread where poor word-level quality was attributed to DTW or described as an inherent limitation. As a long-time user, I now think aeneas/aeneasweb should be the first option to try for most scenarios, including word-level alignment. But if someone makes a decision after seeing that thread, they may rule out aeneas altogether and get stuck with the wrong tool. A pointer in the docs could at least prevent that.




Willem van der Walt

Sep 21, 2016, 2:39:07 AM
to aeneas-forc...@googlegroups.com
Good day,
I am glad to hear that word-level alignment works well using the
multi-level format.
It is a lot easier to integrate Aeneas into an automated process than
integrating the other forced alignment options.

For the record, I always work on the audio in WAV format.

I need to complete some urgent work first, but would really like to do
some more tests to try to get to the bottom of why I do not get the
accuracy you guys are getting.

If Aeneas works well enough for all our needs, I do not see why one has to
do the scientific comparisons Alberto spoke about. But if you want to do
them some day, let me know: my colleagues do this routinely, as they often
need to test the accuracy of alignment in their work.

Something which could be a good addition to Aeneas is the
ability to write out the sync map as a TextGrid file.
For now, I just wrote a script outside Aeneas to do it, as I needed the
TextGrid as input for a further step in the process.
There is a Python module called tgt for doing this.
I am referring to the Praat TextGrid format.
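
For illustration only, here is a rough sketch of such a conversion. It is not the script mentioned above and it does not use tgt: it relies only on the standard library, and it assumes the sync map is in the aeneas JSON layout, i.e. a "fragments" list whose items carry "begin", "end" and "lines". The function name and tier name are hypothetical.

```python
import json


def sync_map_to_textgrid(sync_map, tier_name="fragments"):
    """Render an aeneas-style JSON sync map as a one-tier Praat TextGrid string."""
    intervals = []
    cursor = 0.0
    for fragment in sync_map["fragments"]:
        begin, end = float(fragment["begin"]), float(fragment["end"])
        if begin > cursor:
            # Praat interval tiers must be gap-free: pad with an empty interval.
            intervals.append((cursor, begin, ""))
        intervals.append((begin, end, " ".join(fragment.get("lines", []))))
        cursor = end
    out = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        "xmin = 0",
        "xmax = %.3f" % cursor,
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        '        class = "IntervalTier"',
        '        name = "%s"' % tier_name,
        "        xmin = 0",
        "        xmax = %.3f" % cursor,
        "        intervals: size = %d" % len(intervals),
    ]
    for index, (begin, end, text) in enumerate(intervals, start=1):
        out.append("        intervals [%d]:" % index)
        out.append("            xmin = %.3f" % begin)
        out.append("            xmax = %.3f" % end)
        # Double quotes inside a label are escaped by doubling them.
        out.append('            text = "%s"' % text.replace('"', '""'))
    return "\n".join(out) + "\n"


example = json.loads(
    '{"fragments": [{"begin": "0.500", "end": "2.000", "lines": ["Hello world"]}]}'
)
textgrid = sync_map_to_textgrid(example)
print(textgrid)
```

Zero-length fragments (begin equal to end) are written as they are; they may need to be filtered out before opening the file in Praat.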
Kind regards, Willem

Alberto Pettarin

Sep 21, 2016, 4:41:58 AM
to aeneas-forc...@googlegroups.com
On 09/21/2016 08:38 AM, Willem van der Walt wrote:
> I need to complete some urgent work first, but would really like to do
> some more tests to try to get to the bottom of why I do not get the
> accuracy you guys are getting.

You might want to perform such tests after aeneas v1.6.0 is released
next week. In fact, v1.6.0 will fix bug #102, which affects the timings
if you set the window shift below 20 ms, which I believe you might want
to do.

> If Aeneas works well enough for all our needs, I do not see why one has
> to do the scientific comparisons Alberto spoke about, but if you want to
> do it some day,
> let me know as it is something being done routinely by my colleagues,
> as they need to test the accuracy of alignment often in their work.

Although aeneas is not the output of a research program, I still try to
substantiate any claim with data (e.g., the benchmark suite).

I guess having a comparison of aeneas and other free/open forced
aligners would be helpful for non-experts.

> Some thing which could be a good addition to Aeneas might be to add the
> ability to write out the sync map as a TextGrid file.
> I, for now just wrote a script outside Aeneas to do it as I needed the
> textgrid as input for a further step in the process.
> There exists a python module called tgt for doing this.
> I am refering to the Praat TextGrid format.

Thank you for your suggestion. I have added it as issue #111 and will
look into it for the next release:

https://github.com/readbeyond/aeneas/issues/111

Best regards,

Alberto Pettarin

Xavier Anguera

Sep 21, 2016, 5:31:01 AM
to aeneas-forc...@googlegroups.com
Alberto, all,

Once you select some audio and text, I will be happy to run my alignment system (based on speech recognition models) on the data, to compare the word-level alignment accuracy obtained with a different alignment approach.
We could even consider writing a scientific publication on the results, should they turn out to be interesting for the community.

yours,

Xavier Anguera


Firat Özdemir

Sep 21, 2016, 8:45:56 AM
to aeneas-forc...@googlegroups.com
A comparison of the performance of a TTS-based tool against an ASR-based tool would make sense if they used phoneme vocalization/phoneme recognition alone, i.e., with no language modeling etc. I would certainly be interested in that, because the demand for forced alignment is expected to be inversely proportional to the success of TTS/ASR in a given language/accent. In other words, with perfect TTS or perfect ASR there would not be much need for forced alignment anyway.

But then again my point was not benchmarking. Especially benchmarking with other open source tools doesn't sound like a strategically good idea because if their developers don't agree with your criteria/methodology it could alienate them. My original point was that based on my experience with aeneas and other open source tools in most cases potential users should try aeneas as the first option and push it hard before moving on. Usability can't be benchmarked, so this may not be substantiated scientifically.

Willem, if you send me some samples I can check the parameters for word level. Others can send too.

Xavier, is your system available only as SaaS or is it possible to buy a license for local use? 

F.O.

Xavier Anguera

Sep 21, 2016, 10:19:48 AM
to aeneas-forc...@googlegroups.com
Firat,
I am not sure you understood my message. I am not trying to sell my system. Instead, I was trying to:
- offer some clarity about the performance that can be achieved with each technology, so that future users of the tool know what to expect, regardless of which post they happen to read.
- answer the request of some people to have "speech people" more involved in this area. I consider myself knowledgeable enough in speech processing, and I very much like the alignment of text and audio and its use in electronic publishing.

In my experience, TTS-based systems should always get worse results than ASR-based systems because of the way they are designed. It is true that you could get better results with a better TTS system (Festival has some very good ones, but unfortunately not for many languages), but it will never beat an equivalent system based on ASR. Of course, I could be wrong; that is why I proposed comparing the new multilevel alignment that Alberto created with my system (and any others).
Note that language resources are needed for both kinds of system (TTS or ASR), as you need either to build a TTS for the language or to train an ASR decoder. In fact, I experimented in the past with some language-free alignment ideas that got me somewhere interesting, but I did not have time to pursue them further and so stopped working on them.

Final remark: I am currently focusing my efforts on a new project (related to language education and totally unrelated to Sinkronigo; check my LinkedIn if interested). As Sinkronigo is currently not profitable, I am considering what the right next steps are for my synchronization technology. I am open to suggestions.

yours,

X.







Firat Özdemir

Sep 21, 2016, 11:23:13 AM
to aeneas-forc...@googlegroups.com
Hi Xavier,

My first point was about a comparison between these tools: github.com/pettarin/forced-alignment-tools. I wouldn't call SaaS a 'tool'. But perhaps since it followed your email I should have specified it more clearly.

FWIW, my question about a license for local use was genuine too. I would like to have it but I would prefer to have the chance to tweak it. 

Sorry to hear that about Sinkronigo. The demand for 'read along' books doesn't seem to be where it should be. My belief is that it's partly because of the low availability of the books and that it's a matter of time. Readiance could hopefully help break the cycle. If it catches on I would happily include links to professional services too.

F.O.




Alberto Pettarin

Sep 21, 2016, 12:20:14 PM
to aeneas-forc...@googlegroups.com
Dear all,

I appreciate all your contributions.

Firat, I overlooked your second-from-last message. I have just started
revising the documentation in view of the v1.6.0 release, and I have
already added a note about word-level sync in the "command line
tutorial"; hopefully it will help new users. Thanks for the suggestion.

=== === ===

I mostly agree with Xavier on the fact that ASR-based aligners are
supposed to produce better results than TTS+DTW-based aligners,
especially those conditioning the models with the extra information
given by knowing in advance the sequence of words.

In fact, as a non speech scientist, I am quite surprised that the
TTS+DTW approach works so well, considering that the TTS language models
are rather "weak", compared to the amount of information encoded in ASR
models.

Firat, you have a point when you raise the question of usability --- and
I give the term "usability" the larger meaning of "ease of
installing/running/extending/embedding" --- indeed "usability" was one
of the issues with existing tools that led me to coding aeneas.

But "usability" is only one dimension.
Runtime speed, alignment "quality", and price are other possible dimensions.
Users might be willing to suffer a painful installation process or to
pay for using the service or to wait days for their output,
if the quality of the output is their main concern.

Hence my suggestion that measuring along one of those dimensions,
in isolation,
might provide helpful data points to inform other people's choices.

But maybe it was just an eruption of "measuritis" from my research-y
past. :)

Best regards,

AP

PS1: I am opening another thread about read-along, which might be
interesting per se.

PS2: I have no problems with commercial products and companies being
mentioned here, as long as they are relevant to the discussion.
Thankfully, this is a free (as in "free speech") group.

Alberto Pettarin

Oct 17, 2016, 8:20:39 AM
to aeneas-forc...@googlegroups.com
On 09/21/2016 08:38 AM, Willem van der Walt wrote:
> Some thing which could be a good addition to Aeneas might be to add the
> ability to write out the sync map as a TextGrid file.
> I, for now just wrote a script outside Aeneas to do it as I needed the
> textgrid as input for a further step in the process.
> There exists a python module called tgt for doing this.
> I am refering to the Praat TextGrid format.
> Kind regards, Willem
Hi,

Willem, would you mind sending me (maybe privately, if you prefer:
alb...@readbeyond.it ) the script you currently use?

I will probably need to adapt it, but of course I will credit you in the
aeneas acknowledgements.

For all the others: if you have requests about the output formats,
please express them now. I plan to work on that part of the code during
this week. There is a request about WebVTT already.

Thank you,

AP

Lexon Guo

Apr 21, 2017, 10:46:48 AM
to aeneas-forced-alignment
Are you Zhang He (张和) of 授渔英语? [translated from Chinese]

On Tuesday, September 13, 2016 at 12:46:11 PM UTC+8, Zhang He wrote:

Alberto Pettarin

Apr 21, 2017, 10:53:07 AM
to aeneas-forc...@googlegroups.com
Can you please post in English (if your message was directed to the
mailing list) or write directly to Zhang He (if you wanted to contact him
privately)?

Thank you,

AP




--
Alberto Pettarin

web: http://readbeyond.it/
web: http://www.albertopettarin.it/
twitter: http://twitter.com/acutebit/
skype: alberto_pettarin
mobile: +39 340 82 18 704