Need kaldi example for performance challenge

684 views
Skip to first unread message

Ivica Bogosavljevic

unread,
Jun 5, 2021, 9:06:47 AM6/5/21
to kaldi-help
Hi!

I (Ivica Bogosavljevic) and my colleague (Denis Bakhvalov) organize a performance challenge. Normally we take an open-source problem and a test, then we assign the problem to the contestants. They have a few weeks to make the program run faster. We can also agree to commit the solution to the master.

I used to work with kaldi and I used examples from librispeech. I remember there is a very slow linked list in lattice-decoder-faster.cc. Unfortunately, I don't have either a good example nor instructions on how to reproduce the problem with the slowness.

I would like to take kaldi as an example for the competition, but I need your help. Can you provide me with an example that uses librispeech model and has a performance bottleneck in lattice-decoder-faster.cc, so I can create an example and start the challenge.

Ivica

Daniel Povey

unread,
Jun 6, 2021, 9:40:30 AM6/6/21
to kaldi-help
That's nice...
To make it easy to run, someone would have to create an archive with the files needed, I don't have time right now.
l'm not sure that it would be that easy to optimize. (Although replacing the allocator might help, especially in multi-threaded decoding such as
nnet3-latgen-faster-parallel.cc).

Dan

--
Go to http://kaldi-asr.org/forums.html to find out how to join the kaldi-help group
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/4cecfa51-5514-48f7-8160-32bcc179ea1fn%40googlegroups.com.

Ivica Bogosavljevic

unread,
Jun 7, 2021, 3:26:12 AM6/7/21
to kaldi-help
lattice-decoder-faster.cc had a lot of inefficiencies and could have been made much faster. When traversing a linked list (Token* was a linked list, and so was the ForwardingList), each access to the element of the list will cost you around 200 cycles (this is the time you can do 200 simple instructions). I remember, for example, that the decoder was going through the tokens with zero forward links, there were many tokens like this and using a skipping over the empty tokens would definitely make things faster. 

The participants at the contest are really smart guys and you would probably be surprised on how much faster they can make kaldi run. 
 If you are interested, let me know soon. Otherwise, I will find another example to optimize.

Ivica

nshm...@gmail.com

unread,
Jun 7, 2021, 9:28:47 AM6/7/21
to kaldi-help
We would be extremely excited to see the great improvement this area.

What do you need exactly again? A bash script to run the decoder? A model? The audio files? Anything else?

Jan Yenda Trmal

unread,
Jun 7, 2021, 12:03:49 PM6/7/21
to kaldi-help
I can also offer some time on the Kaldi side to prepare some use-case if there is interest (depending on if Nickolay would/wouldn't like to do it) :)
y.

Ivica Bogosavljevic

unread,
Jun 7, 2021, 4:00:46 PM6/7/21
to kaldi-help
I know how to compile Kaldi, but I need a test script to reproduce the issue. Ideally, the script should take around 10s to execute, it should accurately reproduce the problem (e.g. avoid a short running script would spend a lot of time in the initialization phase which is not relevant).

The last time I was debugging the issue, a lot of time was spent here: https://github.com/kaldi-asr/kaldi/blob/master/src/decoder/lattice-faster-decoder.cc#L299 . That code is memory bound, i.e. it suffers from a large number of cache misses and the contestants will try various techniques to lower it. My task i s to write the Contest Assignment, guide the contestant, and at the end write a report about findings. If the solution is good enough, we can commit it to Kaldi. I prefer to work with lattice-faster-decoder, because I already know the code base and it will be easier for me to write a Contest Assignment.

In the previous contest, we optimized the Canny Image processing algorithm.

In that case, the speedup was about 20x, albeit the start version of the code was really unoptimized. I hope to get a speedup on the critical function in your code about 3x, but we will see.

Ivica
Message has been deleted

nshm...@gmail.com

unread,
Jun 7, 2021, 4:30:23 PM6/7/21
to kaldi-help
You can download the sample setup here:

https://alphacephei.com/test-other/easyperf/test-speed.zip

it contains everything included - model and test files. Just modify kaldi path in the script and run decode.sh. It runs about 40 seconds and models more or less realistic decoding.

If you want to learn something about speech recognition theory and possible algorithmic improvements to the decoding, let us know too.

Daniel Povey

unread,
Jun 11, 2021, 12:35:41 PM6/11/21
to kaldi-help
Thanks, guys!

On Tue, Jun 8, 2021 at 5:25 PM nshm...@gmail.com <nshm...@gmail.com> wrote:
You can download for example

https://alphacephei.com/test-other/easyperf/test-speed.zip

it contains everything included, just change the path to kaldi in decode.sh script and run it. The archive size is 282Mb. Script runs 40 seconds and models more or less realistic decoding.

If you want to learn a bit more about speech recognition theory, possible optimization paths and overall algorithm layout, let us know too.


On Monday, June 7, 2021 at 11:00:46 PM UTC+3 ibogosa...@gmail.com wrote:

Ivica Bogosavljevic

unread,
Jun 12, 2021, 9:05:18 AM6/12/21
to kaldi-help
Hi!

I tried the provided example and here are my findings:
* Most of the time is spent in sgemm_incopy and sgemm_kernel.  These routines belong to OpenBLAS and they should be optimized there.
* Another source of slowdowns is here: https://github.com/kaldi-asr/kaldi/blob/ca6d133262aa183b23f6daba48995bd7576fb572/src/decoder/lattice-faster-decoder.cc#L775, but again, the fix is not in Kaldi but in OpenFST. The iterator is iterating over a random access data structure, and each access means a slowdown. If fixed, it should be in OpenFST.

The only place where it makes sense to do any performance-related work in Kaldi is here:  https://github.com/kaldi-asr/kaldi/blob/ca6d133262aa183b23f6daba48995bd7576fb572/src/decoder/lattice-faster-decoder.cc#L324. This loop iterates over linked lists, which is very inefficient, but there are many places for improvement here. Additionally, a lot of time is spent in malloc and free (which is not unexpected since there are a huge amount of linked lists).

As a summary, on this 40 s example, a speedup of 5 s by modifying kaldi-src alone would mean a good result.

From my point of view, I would take this example and start the performance contest. We would limit the modifications to lattice-faster-decoder.cc, lattice-faster-decoder.h. We might also add some additional headers (e.g. a custom allocator implementation). We can decide later if we wish to merge the patches contestants made to the master depending on the performance and maintainability.

One more thing: Can you please provide me with a shorter example that is based on this one. 40 s is a too long time to wait for results.

Ivica

Ivica Bogosavljevic

unread,
Jun 12, 2021, 9:36:56 AM6/12/21
to kaldi-help
Another more thing: we need tests that verify our changes haven't broken anything. Do you have them?

Daniel Povey

unread,
Jun 13, 2021, 12:14:58 AM6/13/21
to kaldi-help, Hang Lyu, Yiming Wang
We could get someone to run some example scripts (e.g. mini_librispeech) and check the WER is identical, to verify correctness.
Or check that the decoded output does not change, in this example, is a good start.
Of course check for memory leaks too.
I think it's OK to just focus on that small amount of time spent in Kaldi's own code.
Incidentally, I believe the malloc/free time taken is more significant if using multiple threads, e.g. lattice-faster-decoder-parallel.
Be careful about making the time taken of the example shorter, because eventually you'll be limited by the time spent loading the graph from disk.
BTW, at some point we experimented with speeding it up by linking to a different malloc, e.g. tcmalloc or google's malloc (I forget its name).
We did see improvements, but it didn't make it into master, I think about concerns of whether it would break the build.  There may be
a pull request and/or kaldi-help discussions about that (sorry no time to check right now).

Hang Lyu or Yiming Wang (cc'd.. think it was Hang) may have done some work on either the malloc issue, or on trying to simplify the
decoder code to remove the HashList (which has a kind of ugly interface).  My recollection is that the malloc stuff helped but was not merged
(I guess out of concerns of breaking the build, or I was just too busy to deal with it), and the replacing-HashList stuff degraded performance so 
we did not merge.  But you guys could reconsider those things.  Hang may be able to point you towards some code and/or further information.


Dan


Hang Lyu

unread,
Jun 13, 2021, 2:23:16 PM6/13/21
to Daniel Povey, kaldi-help, Yiming Wang
Yes, I did some comparison experiments with different allocators (tcmalloc, tbbmalloc, openfst's pooling allocator) in my previous project. This link ``https://github.com/kaldi-asr/kaldi/issues/3271'' contains some results about using different allocators. 
Just now, I saw your previous emails quickly. 
As you can see (lattice-faster-decoder#L324), the frequent allocate and free operations waste a lot of time. In addition, if we use some basic data structure like unordered_map to replace the existing hash-list or to use it in lattice-simple-decoder directly, it will be slower. Using some more efficient allocators (e.g. tcmalloc) will overcome the performance degradation. There is a PR ``https://github.com/kaldi-asr/kaldi/pull/4564'' about installing the tcmalloc to Kaldi, maybe you can use it. Or to do the topological sort on tokenlist will be useful to avoid the loop, but it depends on the complexity of the lattice. It may not be very related to your challenge.
For the fst iterator (lattice-faster-decoder#L775) , in decoder context, using a sequential iterator is feasible, but it will take time. If I remember correctly, the ''ConstFstImpl / VectorFstImpl'' contains the real data in something like "states_regin/ arcs_regin", which is a pointer pointing to a "MappedFile" object. In the "MappedFile" class, the MemoryRegion is operated with "offset" indexes and so on.
From my understanding of your challenge, your guys will optimize the code from a deeper level. I think it will be helpful. Kaldi's community is big, I believe your effort will help a lot. Looking forward to learning from you.

Hang

Ivica Bogosavljevic

unread,
Jun 14, 2021, 3:47:44 AM6/14/21
to kaldi-help
Hi!

About using tcmalloc in kaldi: I tried it, it works, but this is not the route we are going to take for the contest. As far as I understand, kaldi is used as a library and in that case you should let the integrator pick the system allocator by themselves. 

Related to optimizing for memory allocation and memory access, I did a lot of research there:
I am going to mention those as part of the contest assignment.

The optimization way would probably consist of: custom allocator for Token* and ForwardLink* + some __builtin_prefetch to get data in advance + keeping data in a custom linked list that is in some sense vector backed. At the end, we'll see what's best, after all,  this is the contest. Last time people had really good ideas, I hope this will be the case this time.

I started writing the contest assignment. I hope to get it out in a few days.

Ivica

Daniel Povey

unread,
Jun 14, 2021, 11:38:51 AM6/14/21
to kaldi-help

Ivica Bogosavljevic

unread,
Jun 20, 2021, 6:41:23 AM6/20/21
to kaldi-help
A question: what is the output of decode.sh? We need some kind of output in order to verify that after our change the kaldi still functions properly?

Ivica Bogosavljevic

unread,
Jun 28, 2021, 8:32:22 AM6/28/21
to kaldi-help
I need help again. To run the test, I execute:

$KALDI_ROOT/src/online2bin/online2-wav-nnet3-latgen-faster \
      --word-symbol-table=graph/words.txt \
      --config=conf/model.conf \
      am/final.mdl graph/HCLG.fst ark:test.utt2spk scp:test.scp ark:/dev/null

The problem is that this command doesn't produce any output. And we need an output for the performance contest. One output is baseline output, the other output is the output on modified kaldi. By comparing the two outputs we can make sure that there were no functional regressions. How do we get an output?

nshm...@gmail.com

unread,
Jun 28, 2021, 10:53:36 AM6/28/21
to kaldi-help

You can run this command:

$KALDI_ROOT/src/online2bin/online2-wav-nnet3-latgen-faster \
      --word-symbol-table=graph/words.txt \
      --config=conf/model.conf \
      am/final.mdl graph/HCLG.fst ark:test.utt2spk scp:test.scp ark,t:output_lats.txt

It will create an output_lats.txt which you can compare with previous runs.

It also prints decoded text on stderr, you can see it if you run the command.

Ivica Bogosavljevic

unread,
Jul 21, 2021, 3:13:57 AM7/21/21
to kaldi-help
Hi!

Just to let you know: the contest is on its way:

Thank everyone for your help and support!

Ivica

Ivica Bogosavljevic

unread,
Sep 30, 2021, 11:13:46 AM9/30/21
to kaldi-help
FYI: The performance contest is completed, on Sunday October 2nd we will have a meeting where we talk about everything that was done and the impact on speed. Feel free to join in.

The invitation to the meeting is attached.

Ivica
EasyperfChallenge5-Summary.ics
Reply all
Reply to author
Forward
0 new messages