Update on code speedup via Python C extensions

Alberto Pettarin

Aug 18, 2015, 5:42:16 AM

to aeneas-forc...@googlegroups.com

Dear all,

just to let you know that I have made great progress towards speeding
aeneas up.

In particular, I completed the porting of the DTW algorithm to C code,
and integrated it with the Python code, yielding an impressive code
speedup (30-60x faster).

Additionally, I started porting the code for computing the MFCCs as a
separate Python C extension, and the first tests on long audio files are
great as well (20-50x speedup).

As stated before, aeneas will fall back to the current pure Python code
in case the C extensions cannot be loaded (e.g., not installed/compiled
correctly, etc.).
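
A fallback of this kind is commonly written as a guarded import. The following is a minimal sketch only: the module name `cdtw_hypothetical` and the `compute` attribute are made up for illustration and are not the actual aeneas names.

```python
import importlib


def load_backend(c_module_name, python_fallback):
    """Return (function, used_c), preferring the C extension if importable.

    c_module_name is a hypothetical module name used for illustration;
    the module is assumed to expose a callable named `compute`.
    """
    try:
        module = importlib.import_module(c_module_name)
        return module.compute, True
    except (ImportError, AttributeError):
        # Extension missing, not compiled, or lacking the expected symbol:
        # fall back to the pure Python implementation.
        return python_fallback, False


def compute_pure_python(values):
    """Stand-in for a slow pure Python routine."""
    return sum(values)


# The hypothetical extension is not installed, so the fallback is selected.
compute, using_c = load_backend("cdtw_hypothetical", compute_pure_python)
```

The caller only ever sees `compute`, so the rest of the code does not need to know which implementation was loaded.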

Finally, I improved dtw.py, so that the computation of the accumulated
cost matrix will be done in-place, effectively halving the peak memory
consumption of the aligner.
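
To illustrate the in-place idea, here is a minimal pure Python sketch. It assumes the standard DTW recurrence with three predecessors (left, up, diagonal); the actual step pattern and data layout in dtw.py may differ. The local cost matrix is overwritten with the accumulated costs, so no second matrix of the same size is allocated.

```python
def accumulate_in_place(cost):
    """Overwrite a local cost matrix with the DTW accumulated cost matrix.

    cost is a non-empty list of equally long rows (list of lists of numbers).
    """
    n = len(cost)
    m = len(cost[0])
    # Cells in the first row and first column have a single predecessor.
    for j in range(1, m):
        cost[0][j] += cost[0][j - 1]
    for i in range(1, n):
        cost[i][0] += cost[i - 1][0]
    # Every other cell adds the cheapest of its three predecessors,
    # all of which already hold accumulated values at this point.
    for i in range(1, n):
        for j in range(1, m):
            cost[i][j] += min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost


matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
accumulated = accumulate_in_place(matrix)
# accumulated is the same object as matrix:
# [[1, 3, 6], [5, 6, 9], [12, 13, 15]]
```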

I should be able to package and test all these improvements in the next
two weeks or so.

As always, feel free to share any comments you might have.

Best regards,

Alberto Pettarin

Xavier Anguera

Aug 18, 2015, 6:36:11 AM

to aeneas-forc...@googlegroups.com

Great job Alberto.
Out of curiosity, what would now be the real-time factor of your system end-to-end?

Thanks.

X. Anguera

--
You received this message because you are subscribed to the Google Groups "aeneas-forced-alignment" group.
To unsubscribe from this group and stop receiving emails from it, send an email to aeneas-forced-ali...@googlegroups.com.
To post to this group, send email to aeneas-forc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/aeneas-forced-alignment/55D2FCF1.1060701%40readbeyond.it.
For more options, visit https://groups.google.com/d/optout.

Alberto Pettarin

Aug 18, 2015, 8:13:37 AM

to aeneas-forc...@googlegroups.com

On 08/18/2015 12:36 PM, Xavier Anguera wrote:
> Great job Alberto.
> Out of curiosity, what would now be the real-time factor of your system
> end-to-end?

I assume that the "real-time factor" (RTF) is the time to produce the
output divided by the duration of the input (RTF = o / i).

I have not done extensive benchmarking yet --- I plan to play with it
some time next week, especially to compare pure Python vs Python + C.

To give you a rough idea, here are the "real time" values returned by
Bash time [1] when running aeneas.tools.execute_task with the C
extensions enabled on three "typical" tasks [2]:

1. o = 3.811 i = 53.320 (~1m, 15 frags) => RTF = 0.071
2. o = 71.283 i = 941.110 (~15m, 283 frags) => RTF = 0.076
3. o = 362.307 i = 2305.180 (~38m, 747 frags) => RTF = 0.157
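
The three RTF values above can be reproduced from the definition RTF = o / i. This is a small sketch; the function and variable names are illustrative.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time needed to produce the output / duration of the input."""
    return processing_seconds / audio_seconds


# (o, i) pairs in seconds, copied from the three tasks listed above.
tasks = [(3.811, 53.320), (71.283, 941.110), (362.307, 2305.180)]
rtfs = [round(real_time_factor(o, i), 3) for o, i in tasks]
# rtfs == [0.071, 0.076, 0.157]
```

An RTF below 1 means the processing takes less time than the audio lasts.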

=== === ===

Footnotes:

[1] with the C extensions on, much of the time is now spent
a. converting the given audio file into WAVE format (the above input
audio files were in MP3 format), and
b. synthesizing the text [3].

The actual MFCC extraction + DTW requires way less time:

1. o' = 0.374 i = 53.320 => RTF' = 0.007
2. o' = 11.885 i = 941.110 => RTF' = 0.013
3. o' = 28.549 i = 2305.180 => RTF' = 0.012

(Of course one can say that not counting the synthesis step is like
cheating, since the MFCC+DTW approach relies on it. The latter table was
just meant to suggest that the bottleneck is now elsewhere.)

[2] the values above were taken on my crappy laptop (1.4 GHz, 4 GB RAM);
on a decent 4GB RAM VPS I usually run 50-100% faster.

[3] aeneas synthesizes the text by calling espeak via subprocess, which
introduces a lot of overhead. I have researched the possibility of
calling espeak via its C library (speak_lib.h) --- the same approach
e.g. pyttsx takes --- but then I would need to modify/wrap it to get
the synthesized wave into memory or a file instead of emitting it through
the sound device, and I am not sure this would be a good investment of my
time (see https://github.com/parente/pyttsx/issues/6#issuecomment-129486418 ).
But of course if someone wants to sponsor the effort, I can reconsider
this decision...
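
For reference, calling espeak through subprocess looks roughly like the sketch below. It uses the espeak command-line options -v (voice) and -w (write a WAVE file instead of playing the audio). The function names are illustrative, and this is not necessarily how aeneas builds its command.

```python
import shutil
import subprocess


def build_espeak_command(text, wav_path, voice="en"):
    """Build an espeak command line that writes speech to a WAVE file.

    -v selects the voice and -w writes the audio to a file instead of
    playing it through the sound device.
    """
    return ["espeak", "-v", voice, "-w", wav_path, text]


def synthesize(text, wav_path, voice="en"):
    """Run espeak once for the given text; return False if espeak is missing."""
    if shutil.which("espeak") is None:
        return False
    subprocess.run(build_espeak_command(text, wav_path, voice), check=True)
    return True
```

Each call to `synthesize` starts a new process, which is where the per-fragment overhead mentioned above comes from.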

AlPe

Xavier Anguera

Aug 18, 2015, 11:09:13 AM

to aeneas-forc...@googlegroups.com

Thanks Alberto,

yes, I was referring to exactly this metric. This is what we (speech people) normally use to define the speed of systems.

Your values are in line with the fastest you will get in a standard architecture (not going into GPU or FPGA).

Regarding the MFCC extraction, I would suggest you look into open source libraries available out there, some of which are very efficiently programmed and.

The TTS part will always be slow, unless you are able to find an HMM-based TTS system. When performing the research for [1] I found the quality of the TTS to be strongly correlated with the accuracy of the alignment. I used Festival as my TTS engine and had to rely on the highest-quality TTS in order to get accurate-enough word-level alignments. For sentence-level alignments you should be able to get by with a faster, lower-quality system.

yours,

X. Anguera

[1] "Automatic Synchronization of Electronic and Audio Books via TTS Alignment and Silence Filtering",
Xavier Anguera, Néstor Pérez, Andreu Urruela and Nuria Oliver, in Proc. Hot Topics in Multimedia within ICME 2011, Barcelona, Spain. http://www.xavieranguera.com/papers/icme2011b.pdf


Alberto Pettarin

Aug 18, 2015, 12:38:31 PM

to aeneas-forc...@googlegroups.com

On 08/18/2015 05:09 PM, Xavier Anguera wrote:
> Thanks Alberto,
>
> yes, I was referring to exactly this metric. This is what we (speech
> people) normally use to define the speed of systems.
> Your values are in line with the fastest you will get in a standard
> architecture (not going into GPU or FPGA).
> Regarding the MFCC extraction, I would suggest you look into open source
> libraries available out there, some of which are very efficiently
> programmed and.
> The TTS part will always be slow, unless you are able to find an
> HMM-based TTS system. When performing the research for [1] I found the
> quality of the TTS to be very correlated with the accuracy of the
> alignment. I used Festival as my TTS engine and had to rely on the
> highest quality TTS in order to get accurate-enough word-level
> alignments. For sentence level alignments you should be able to get by
> with a faster/less quality system.

Thank you for the useful comments and pointers.

=== === ===

MFCC lib: I ported the Python code (mfcc.py) into C, optimized it a bit
(precomputing sin tables for the FFT/RFFT, building the Mel filter bank
at once, etc.), and packaged it as a Python C extension to be built locally.

I chose this path for two main reasons:

1. to eliminate dependencies on other libraries/tools, hopefully
minimizing build/distribution issues, and

2. consistency when comparing the results from the pure Python code and
the Python+C code (on the same input).

Since the resulting C code seems quick enough, I would not overengineer
this approach. Should unforeseen needs for speed arise, third-party
libs (aubio, sptk, etc.) could be investigated.
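
Building the Mel filter bank "at once" can be read as computing the filter weights a single time and reusing them for every frame. The following is a minimal sketch under stated assumptions: it uses the common 2595 * log10(1 + f / 700) Mel formula and illustrative parameters (40 filters, 512-point FFT, 16 kHz sample rate). The formula, the parameter values, and the layout in the actual C code may differ.

```python
import math


def hz_to_mel(hz):
    """Convert a frequency in Hz to the Mel scale (common log10 formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def build_mel_filter_bank(num_filters, fft_size, sample_rate):
    """Return num_filters triangular filters over fft_size // 2 + 1 bins."""
    num_bins = fft_size // 2 + 1
    low_mel = hz_to_mel(0.0)
    high_mel = hz_to_mel(sample_rate / 2.0)
    # num_filters + 2 points equally spaced on the Mel scale, mapped to Hz.
    step = (high_mel - low_mel) / (num_filters + 1)
    edges_hz = [mel_to_hz(low_mel + k * step) for k in range(num_filters + 2)]
    bin_hz = sample_rate / float(fft_size)
    bank = []
    for k in range(1, num_filters + 1):
        left, center, right = edges_hz[k - 1], edges_hz[k], edges_hz[k + 1]
        row = []
        for b in range(num_bins):
            f = b * bin_hz
            if left < f <= center:
                row.append((f - left) / (center - left))    # rising slope
            elif center < f < right:
                row.append((right - f) / (right - center))  # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


# Computed once, then reused for every audio frame.
FILTER_BANK = build_mel_filter_bank(40, 512, 16000)
```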

=== === ===

TTS: my previous tests (espeak, festival, Loquendo/Nuance, IVONA) agree
with your observation about the role of the TTS. Since I have been
mostly interested in (sub)sentence-level alignment, and I prefer free
software options, espeak has been a sufficiently good choice so far for
my needs. But of course if one wants/needs to use another TTS, she just
has to write a wrapper similar to espeakwrapper.py.

=== === ===

Best regards,

AlPe

Crisman Cooley

Aug 18, 2015, 4:18:27 PM

to aeneas-forc...@googlegroups.com

Alberto & all,

Just a quick update (since I said I'd give one :) Neil O and I (mostly Neil :) worked on porting the system to a Win7 PC. We decided to try doing it w/o the emulator (forget your name for it) so it would run faster. We have libs installed now, no mean feat, but are still getting an error on check_dependencies.py saying it can't find eSpeak, tho espeak is clearly in the PATH and is working (I've talked to my family using that weird british cyborg accent haha). Also I tried running this:

python -m aeneas.tools.execute_task

in the aeneas folder, and I get this error:

"DLL load failed: [argument 1] is not a valid Win32 application."

So we are working on solutions. Open to any input from the PC whizzes out there. ...believe someone ported this to Win 8?

We'll be working on this again next week. Thanks for your C code speed up--that will be a life-saver once we get into full production, which is my goal for my co Tribd Audiobooks. Just joined the Audio Publishers Association. I will alert them to your work, as a proxy for donation since we're in startup mode.

Keep up the good work!

Cheers,

Crisman


