Question about closed captioning and silence

Cristian Gradisteanu

May 4, 2017, 3:37:29 PM

to aeneas-forced-alignment

Hey,

First of all... thanks for this excellent library; it looks more like science fiction than reality to me :)

I wanted to ask a question that is not related to the development of the library (as most questions here) but on how to use it.

I'm using it as a command-line utility on Mac OS and it works great so far. The only thing I still don't know how to achieve is to display each caption (I am using the WebVTT format) only while the actual words are spoken, especially at the beginning and the end of the sound file.

E.g., even if there is a 5-second delay at the beginning of the voice-over, the caption is displayed at second 0.

The same happens at the end of the file: if there is a 5-second delay, the last subtitle line is still displayed for those 5 seconds.

I tried playing with some options (there are too many :D) but couldn't figure out which one does this.

I think this also has to do with the pauses between sentences. If there is a 5-second pause between two paragraphs, the subtitle is still displayed until the next one is scheduled to be shown.

Hope that makes sense and thanks a lot!

Alberto Pettarin

May 4, 2017, 3:49:26 PM

to aeneas-forc...@googlegroups.com

On 05/04/2017 09:37 PM, Cristian Gradisteanu wrote:
> Hey,
>
> First of all... thanks for this excellent library; it looks more like
> science fiction than reality to me :)

Hi,

you are welcome.

> I wanted to ask a question that is not related to the development of the
> library (as most questions here) but on how to use it.
> I'm using it as a command-line utility on Mac OS and it works great so
> far. The only thing I still don't know how to achieve is to display each
> caption (I am using the WebVTT format) only while the actual words are
> spoken, especially at the beginning and the end of the sound file.
>
> E.g., even if there is a 5-second delay at the beginning of the
> voice-over, the caption is displayed at second 0.
>
> The same happens at the end of the file: if there is a 5-second delay,
> the last subtitle line is still displayed for those 5 seconds.

You can specify the length of the head and/or the tail of the audio file
--- that is, the number of seconds at the beginning and/or at the end
that should be ignored for alignment purposes.

Just add:

is_audio_file_head_length=5|is_audio_file_tail_length=6.789

to your task configuration string (in the example: ignore first 5
seconds and last 6.789 seconds of the audio file).

There is a built-in example that demos this:

$ python -m aeneas.tools.execute_task --example-head-tail
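
For readers who script their runs, here is a minimal sketch of how such a configuration string could be assembled. Only the two is_audio_file_* values come from the message above; the other three keys (task_language, is_text_type, os_task_file_format) and their values are assumed, typical settings for WebVTT captions, and build_config_string is a hypothetical helper, not part of aeneas.

```python
def build_config_string(params):
    """Join key=value pairs with '|', the separator used in task configuration strings."""
    return "|".join(f"{key}={value}" for key, value in params.items())


params = {
    "task_language": "eng",                # assumed: English narration
    "is_text_type": "subtitles",           # assumed: caption blocks separated by blank lines
    "os_task_file_format": "vtt",          # WebVTT output, as in the question
    "is_audio_file_head_length": "5",      # ignore the first 5 seconds
    "is_audio_file_tail_length": "6.789",  # ignore the last 6.789 seconds
}

config_string = build_config_string(params)
print(config_string)
```

The resulting string would then be passed as the third argument, e.g. python -m aeneas.tools.execute_task audio.mp3 captions.txt "CONFIG_STRING" captions.vtt (the file names are placeholders).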

> I think this also has to do with the pauses between sentences. If there
> is a 5-second pause between two paragraphs, the subtitle is still
> displayed until the next one is scheduled to be shown.

You can have aeneas remove non-speech intervals (longer than X
seconds) between two speech fragments. A typical case is a dramatic
pause between two sentences by the same speaker, or a silence between
one speaker's last word and another speaker's reply.

You can enable this removal by adding the following:

task_adjust_boundary_nonspeech_min=0.500|task_adjust_boundary_nonspeech_string=REMOVE

to your task configuration string (in the example: remove all non-speech
intervals longer than 0.500 seconds).

There is a built-in example that demos this:

$ python -m aeneas.tools.execute_task --example-remove-nonspeech

Also, you might want to run:

$ python -m aeneas.tools.execute_task --list-parameters

for a list of parameters with a brief description,

$ python -m aeneas.tools.execute_task --examples-all

for the full list of built-in examples,

and

$ python -m aeneas.tools.execute_task --help-rconf

for more advanced settings involving the runtime configuration options.

HTH,

Alberto Pettarin

Cristian Gradisteanu

May 4, 2017, 4:22:07 PM

to aeneas-forced-alignment

Thanks so much, this solved my problem!

I had just discovered some of these options, but your email came right on time, saving me a few good hours.

Thanks again for your work, this utility saves me quite a few good (boring) hours of syncing the captions!

Cheers!

Cristian Gradisteanu

May 19, 2017, 12:21:20 PM

to aeneas-forced-alignment

The library is indeed magical, it saves me a LOT of time while syncing the captions.

One last problem that I wasn't able to solve: is there a way to sync the first sentence of the caption with the first spoken phrase?

I know about "is_audio_file_head_length=2", but I have multiple videos in which the narration starts at different times, so the caption appears before the narrator starts talking (or later). Can this be done automatically, so that the first caption sentence is displayed when the narrator starts talking?

Other than that...this library is PERFECT for my needs.

Thanks again for your effort!

Alberto Pettarin

May 19, 2017, 12:46:36 PM

to aeneas-forc...@googlegroups.com

You are welcome.

If instead of

is_audio_file_head_length

you set:

is_audio_file_detect_head_max=10
is_audio_file_detect_head_min=0

aeneas will try to detect the (silent) head, which you "guarantee" to be
between 0 and 10 seconds long. There are two analogous parameters for the
tail: is_audio_file_detect_tail_max and is_audio_file_detect_tail_min.

Note that the longer the possible interval you specify, the more time
the heuristic will take to "detect" the head. [1]

Also note that this is just a heuristic, and that I am not really
satisfied with it. There are several ways it can be improved, but
time and other constraints :(

Actually, I am interested in feedback on this feature: if you try it,
let me know how it goes.

HTH,

AP

[1] Basically, it aligns the first k text fragments against many
prefixes of the real audio wave, starting at different positions within
the range you specified with the above two parameters, and it selects
the offset that yields the "best matching" as the head offset.
Afterward, it computes the alignment as if _you_ specified
is_audio_file_head_length=SELECTEDOFFSET.
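
The search described in [1] can be sketched roughly as follows. This is a toy illustration, not the aeneas implementation: detect_head, the cost callback and the 0.1 s step are all hypothetical, and the toy cost simply pretends that speech starts at 3.2 s instead of aligning real audio.

```python
def detect_head(cost_at, head_min, head_max, step=0.1):
    """Return the candidate offset in [head_min, head_max] with the lowest cost.

    cost_at(offset) stands in for "align the first k fragments against the
    audio starting at this offset and report how bad the match is".
    """
    count = int(round((head_max - head_min) / step)) + 1
    candidates = [round(head_min + i * step, 3) for i in range(count)]
    return min(candidates, key=cost_at)


def toy_cost(offset):
    # Hypothetical cost: the best match is at 3.2 s.
    return abs(offset - 3.2)


head = detect_head(toy_cost, head_min=0.0, head_max=10.0)
print(head)
```

The number of candidates grows with the width of the interval, which is consistent with the note above that a longer interval takes more time to search.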

> --
> You received this message because you are subscribed to the Google
> Groups "aeneas-forced-alignment" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to aeneas-forced-ali...@googlegroups.com
> <mailto:aeneas-forced-ali...@googlegroups.com>.
> To post to this group, send email to
> aeneas-forc...@googlegroups.com
> <mailto:aeneas-forc...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/aeneas-forced-alignment/d2678ac9-c108-4e2b-88a7-d7278924128d%40googlegroups.com
> <https://groups.google.com/d/msgid/aeneas-forced-alignment/d2678ac9-c108-4e2b-88a7-d7278924128d%40googlegroups.com?utm_medium=email&utm_source=footer>.
> For more options, visit https://groups.google.com/d/optout.

Cristian Gradisteanu

May 19, 2017, 1:59:58 PM

to aeneas-forced-alignment

Thanks!

I tried a couple of videos and it appears to work great for both the beginning and the end of the file! Please note that the videos also have background music, which starts at the beginning of the video, while the voice-over starts a few seconds later. I will try a few more videos and will let you know if anything changes, but I hope not.

Now, really the last question (it is not that important after all, I'm just curious): I am using task_adjust_boundary_nonspeech_min=0.01 to hide the captions between sentences, but even with this small value (0.01) the caption stays on screen until the next one is displayed. It is hidden only when there is a longer delay (maybe 1 second) in the VO. Is there a way to hide the caption sooner?

Again, this is just an optional thing that I would like to be able to do; it works perfectly even as it is now.

Thanks again!

Alberto Pettarin

May 19, 2017, 3:17:43 PM

to aeneas-forc...@googlegroups.com

On 05/19/2017 07:59 PM, Cristian Gradisteanu wrote:
> Thanks!
>
> I tried a couple of videos and it appears to work great for both the
> beginning and the end of the file! Please note that the videos also have
> background music, which starts at the beginning of the video, while the
> voice-over starts a few seconds later. I will try a few more videos and
> will let you know if anything changes, but I hope not.

Cool, thank you.

> Now, really the last question (it is not that important after all, I'm
> just curious): I am using task_adjust_boundary_nonspeech_min=0.01 to hide
> the captions between sentences, but even with this small value (0.01) the
> caption stays on screen until the next one is displayed. It is hidden
> only when there is a longer delay (maybe 1 second) in the VO. Is there a
> way to hide the caption sooner?

I am not sure I have understood your question correctly. Would you mind
giving an example (with text and dummy timings) of what you
perceive/would like to have as output, and what aeneas actually produces?

AP

Cristian Gradisteanu

May 19, 2017, 3:23:04 PM

to aeneas-forc...@googlegroups.com

Sorry for the misunderstanding.

I want the caption to disappear completely between sentences (when there is a small pause in the narration). I can do that with task_adjust_boundary_nonspeech_min=0.01, but when the pause between sentences is small, the caption is still kept on screen until the next sentence (caption) appears. It works well for longer pauses (longer than about 1 second) but not for shorter ones.

Hope that makes sense.

Thanks!

Alberto Pettarin

May 19, 2017, 3:55:17 PM

to aeneas-forc...@googlegroups.com

Thanks for the clarification.

So basically you are saying that a human perceives:

0.000 1.200 What have you been doing all these years?
1.200 2.000 GAP (SILENCE/NONSPEECH)
2.000 3.600 I've been going to bed early.

and you would like to get:

0.000 1.200 What have you been doing all these years?
2.000 3.600 I've been going to bed early.

while aeneas produces:

0.000 1.600 What have you been doing all these years?
1.600 3.600 I've been going to bed early.

or

0.000 2.000 What have you been doing all these years?
2.000 3.600 I've been going to bed early.

=== === ===

There might be several reasons why aeneas is not giving you what you
expect.

The current algorithm to create the "gap" is the following:

1. determine all nonspeech intervals (NSI) using the built-in voice
activity detector (VAD)
2. align without gaps
3. for each pair of consecutive fragments, check if the transition point
between the two fragments occurs inside a NSI (and it is the only
transition point in that NSI), where the length of the NSI is >=
SPECIFIEDLENGTH: if so, create the "gap"
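
The three steps above can be sketched in a few lines of Python. This is only an illustration of the rule as described in this message, not the aeneas source: remove_nonspeech is a hypothetical helper, the nonspeech list stands in for the VAD output of step 1, and treating the interval boundaries as inclusive is an assumption.

```python
def remove_nonspeech(fragments, nonspeech, min_length):
    """Create gaps between contiguous fragments.

    fragments: list of (begin, end, text), aligned without gaps (step 2).
    nonspeech: list of (begin, end) nonspeech intervals (step 1).
    """
    result = [list(fragment) for fragment in fragments]
    transitions = [fragment[1] for fragment in fragments[:-1]]
    for i, point in enumerate(transitions):
        for ns_begin, ns_end in nonspeech:
            if ns_end - ns_begin < min_length:
                continue  # interval too short to become a gap
            inside = [t for t in transitions if ns_begin <= t <= ns_end]
            if inside == [point]:  # the only transition point in this interval
                result[i][1] = ns_begin    # previous fragment ends where silence starts
                result[i + 1][0] = ns_end  # next fragment starts where silence ends
                break
    return [tuple(fragment) for fragment in result]


aligned = [
    (0.0, 1.6, "What have you been doing all these years?"),
    (1.6, 3.6, "I've been going to bed early."),
]
nonspeech = [(1.2, 2.0)]

with_gap = remove_nonspeech(aligned, nonspeech, min_length=0.5)
for begin, end, text in with_gap:
    print(f"{begin:.3f} {end:.3f} {text}")
```

With the 0.8 s nonspeech interval from the example and a minimum length of 0.5 s, the boundary at 1.600 is split into a gap from 1.200 to 2.000; with a minimum length of 1.0 s the fragments are left untouched.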

A first reason for the "error" might be that, with default parameters,
the VAD detects a nonspeech interval only if its length is >= 0.200 s.
So, even if you set task_adjust_boundary_nonspeech_min=0.01, it is
"shadowed" by the VAD setting. You can try lowering the minimum length
of nonspeech in the VAD ( -r="vad_min_nonspeech_length=0.040" ), but
note that it does not make sense to set it to a value smaller than the
MFCC shift (default: 0.040 s), and therefore the MFCC shift is the
ultimate lower bound to any gap length.
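
One way to read the paragraph above is that the shortest gap you can actually get is the largest of the three values involved. The helper below is a hypothetical sketch of that arithmetic, not an aeneas function; the defaults 0.200 and 0.040 are the ones quoted in this message.

```python
def effective_min_gap(task_min, vad_min_nonspeech=0.200, mfcc_shift=0.040):
    """Shortest nonspeech interval that can become a gap, under this reading."""
    return max(task_min, vad_min_nonspeech, mfcc_shift)


# Default VAD setting: the task-level 0.01 is shadowed by 0.200.
print(effective_min_gap(0.01))
# After lowering vad_min_nonspeech_length to 0.040, the MFCC shift is the floor.
print(effective_min_gap(0.01, vad_min_nonspeech=0.040))
```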

Another reason might be that the transition point is determined to be
outside a nonspeech interval, maybe at the MFCC frame just before or
after it. => You might try enabling the MFCC nonspeech masking and see
if it helps ( -r="mfcc_mask_nonspeech=True" ) .

The VAD might also fail to label the nonspeech interval correctly; you can try
increasing the vad_log_energy_threshold rconf parameter and see if it helps.

Finally, a bug is ALWAYS an option. ;)

HTH,

AP

Note: since you say that it works for longer pauses, I guess you are
also specifying "task_adjust_boundary_nonspeech_string=REMOVE" in the
task config string, but I just want to mention it for other users who
might not know.

Cristian Gradisteanu

May 19, 2017, 4:17:00 PM

to aeneas-forc...@googlegroups.com

Yes, I am also specifying "task_adjust_boundary_nonspeech_string=REMOVE".

I think the pauses are not long enough to trigger the caption disappearing after all.

Tried the other options that you mentioned but couldn't see any difference.

Anyway, this was something optional and not really important, it works GREAT the way it is right now.

Another factor to take into consideration might be that the videos have background music, which might interfere with the voice-over.

Alberto Pettarin

May 19, 2017, 4:34:42 PM

to aeneas-forc...@googlegroups.com

On 05/19/2017 10:16 PM, Cristian Gradisteanu wrote:
> Another factor to take into consideration might be that the videos have
> a music background which might interfere with the voice over.

Yes, that might be a problem, because the interval with music but not
speech might have enough energy so that the VAD labels it as "speech"
instead of "nonspeech".

To distinguish speech from music/noise, the simple energy-based VAD is
not enough; we need to look at other properties of the audio spectrum
--- in general, doing this in an unsupervised manner is an open problem
even in academia, and out of scope for aeneas (at least for now).

BTW, you can use the aeneas.tools.run_vad tool to see the
speech/nonspeech intervals computed by the current VAD:

$ python -m aeneas.tools.run_vad audio.mp3 both intervals.txt

0.000 0.440 nonspeech
0.440 0.680 speech
0.680 2.680 nonspeech
2.680 5.400 speech
...

$ python -m aeneas.tools.run_vad audio.mp3 both intervals.txt -r="vad_min_nonspeech_length=0.040"

0.000 0.440 nonspeech
0.440 0.680 speech
0.680 2.680 nonspeech
2.680 3.680 speech
3.680 3.720 nonspeech
...
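
If you want to inspect such a file programmatically, a small parser is enough. The sketch below assumes the intervals file has exactly the three whitespace-separated columns shown above (begin, end, label); parse_intervals and long_nonspeech are hypothetical helper names, not part of aeneas.

```python
def parse_intervals(text):
    """Parse 'begin end label' lines into (float, float, str) tuples."""
    intervals = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank lines and "..." elisions
        begin, end, label = parts
        intervals.append((float(begin), float(end), label))
    return intervals


def long_nonspeech(intervals, min_length):
    """Keep only the nonspeech intervals that last at least min_length seconds."""
    return [
        (begin, end)
        for begin, end, label in intervals
        if label == "nonspeech" and end - begin >= min_length
    ]


sample = """0.000 0.440 nonspeech
0.440 0.680 speech
0.680 2.680 nonspeech
2.680 3.680 speech
3.680 3.720 nonspeech
"""

candidates = long_nonspeech(parse_intervals(sample), min_length=0.5)
print(candidates)
```

Comparing the surviving intervals with the caption boundaries shows which pauses are long enough to become gaps at a given threshold.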

AP

Cristian Gradisteanu

May 19, 2017, 4:54:07 PM

to aeneas-forc...@googlegroups.com

This is a great way to debug!

Yes, I think the music interferes with the voice recognition algorithm.

I got this when running python -m aeneas.tools.run_vad hdr.mp4 both intervals2.txt -r="vad_min_nonspeech_length=0.040":

0.000 0.120 nonspeech
0.120 0.160 speech
0.160 0.520 nonspeech
0.520 0.560 speech
0.560 0.920 nonspeech
0.920 1.000 speech
1.000 3.600 nonspeech
3.600 3.680 speech
3.680 4.400 nonspeech
4.400 4.440 speech
4.440 4.800 nonspeech
4.800 4.840 speech
4.840 5.200 nonspeech
5.200 5.280 speech
5.280 6.840 nonspeech
6.840 8.240 speech
....

And in the hdr.vtt file, the voice only starts at 6.840. Before that there is only music:

WEBVTT

1
00:00:06.840 --> 00:00:19.640
HDR stands for High Dynamic Range.

Does it matter if I'm trying these against mp4 files instead of mp3?

Alberto Pettarin

May 19, 2017, 4:59:12 PM

to aeneas-forc...@googlegroups.com

On 05/19/2017 10:54 PM, Cristian Gradisteanu wrote:
> Does it matter if I'm trying these against mp4 files instead of mp3?

If the "audio" file is not a WAVE (RIFF) file, ffmpeg is called to
convert it to a mono 16 kHz WAVE file.

So, while starting from two different files/formats yields slightly
different audio data (and that might be a problem when absolute timing
precision is needed), it should not matter for this particular issue.
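
If you want to listen to roughly what the aligner works on, you can do a similar conversion by hand. The sketch below only builds an ffmpeg command line with standard options (-ac 1 for mono, -ar 16000 for 16 kHz); the exact arguments aeneas passes to ffmpeg may differ, and ffmpeg_to_mono_16k_wav and the file names are hypothetical.

```python
def ffmpeg_to_mono_16k_wav(input_path, output_path):
    """Build an ffmpeg command that converts any input to a mono 16 kHz WAVE file."""
    return [
        "ffmpeg",
        "-y",              # overwrite the output file if it exists
        "-i", input_path,  # input audio or video file
        "-ac", "1",        # one audio channel (mono)
        "-ar", "16000",    # 16 kHz sample rate
        "-f", "wav",       # WAVE container
        output_path,
    ]


command = ffmpeg_to_mono_16k_wav("hdr.mp4", "hdr.wav")
print(" ".join(command))
```

The list can be handed to subprocess.run(command, check=True) on a machine where ffmpeg is installed.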

AP
