How can I make multithreaded decoder?


Jinserk Baik

unread,
Jul 21, 2017, 4:15:30 PM7/21/17
to kaldi-help
Hi,

I'm considering splitting a long wav file into small pieces and running them through a multithreaded decoder to speed up decoding. I'm looking at the online decoder structure in order to distinguish the static and dynamic parts of the decoding computation. I want to extract the static parts into a separate ThreadedDecoderConfig class and share its instance among ThreadedDecoder function objects inheriting from kaldi::MultiThreadable. However, most objects in the online decoder have deeply nested hierarchies, so it's not easy to figure out which parts are static and which are dynamic.

1. Are all the *Config and *Info classes static during the decoding process?
2. How about TransitionModel and nnet3::AmNnetSimple? Do they have any internal state change after decoding with SingleUtteranceNnet3Decoder?
3. OnlineTimingStats and OnlineIvectorExtractorAdaptationState seem to be shared but controlled by a mutex. Is that correct?

Jin

Daniel Povey

unread,
Jul 21, 2017, 4:54:45 PM7/21/17
to kaldi-help
You can tell what is modified by the use of 'const' in the code. The
code (at least nnet2 and nnet3) has been designed so as to make it
easy to make multi-threaded decoders.
However, normally if you want to decode multi-threaded you'd want to
decode separate streams of data, e.g. separate wav files. IMO it
doesn't really make sense to do multi-threaded decoding of a single
stream because it's hard to do the right thing with the adaptation
state (if there are ivectors) and for other reasons too. In an
application where you cared about latency it would normally make more
sense to transmit the data bit by bit and process it as it comes. The
decoding is much faster than real-time for good configurations (e.g.
chain models) so this should be easy.


Dan

Jinserk Baik

unread,
Jul 21, 2017, 5:25:39 PM7/21/17
to kaldi-help, dpo...@gmail.com
Thank you for your valuable advice, Dan! As you know, I'm focusing on sentence segmentation, and I've made a two-step approach: 1) segment a wav into utterance-sized pieces with VAD, and 2) decode each utterance and segment it once again using the alignment, splitting at the sil phone. The results are satisfactory, but the problem is speed. Processing a 15-minute wav file with two channels takes about 12 minutes. Of course that's sufficient for a real-time application, but IMO it could be even faster if I made each stage multithreaded. I'm using a 40-core system, but the decoder binary uses only a single core for the whole process.

Jin

Daniel Povey

unread,
Jul 21, 2017, 5:31:44 PM7/21/17
to Jinserk Baik, kaldi-help
If you treat them as separate speakers and don't attempt to share the
speaker-adaptation information, this approach could make sense.
Or you could extract a single ivector for the whole thing beforehand
and use the offline decoding code, similar to
nnet3-latgen-faster-parallel.

Jinserk Baik

unread,
Jul 21, 2017, 5:55:18 PM7/21/17
to kaldi-help, jinser...@gmail.com, dpo...@gmail.com
Oh, you already made a good reference implementation! It seems to be exactly what I wanted to do.
Thank you so much!

Jin

Kirill Katsnelson

unread,
Jul 23, 2017, 12:22:37 AM7/23/17
to kaldi-help
If you have a profiler handy, notice how much time is spent in AM computation. Again, from our practice (not very large LMs, on the order of a couple million arcs in the G), our nnet2-based production decoder spends 80% of time in matrix calculations anyway, and we are using Intel MKL (and Intel CPUs), supposedly the most optimized setup. nnet3 graph decoding is more efficient due to significantly simpler H topology, so the relative figure may be even higher in nnet3 (for the same AM network size/complexity). So if you have just one and only stream, offloading AM computation onto the GPU could provide the biggest performance boost. I believe Kaldi has this option off the shelf.

Another performance option is the decoding beam. Setting it too high may hurt not only performance but, surprisingly, the WER as well. From experience, smaller LMs prefer narrower beams.

There was an interesting Microsoft paper on the parallelization of WFST decoding: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/11/ParallelizingWFSTSpeechDecoders.ICASSP2016.pdf

 -kkm

Kirill Katsnelson

unread,
Jul 23, 2017, 2:29:14 AM7/23/17
to kaldi-help
Oh, and to answer your questions: (1) and (2), yes. We share a single copy of the AM among as many requests as the server is processing; here is a literal code excerpt:

class AcousticModel {
public:
  static std::shared_ptr<AcousticModel> Get(const string& language_id, const string& model_id);
 
  const kaldi::TransitionModel& transition_model() const { return *trans_model_; }
  const kaldi::ContextDependency& context_dependency() const { return *context_dependency_; }
  const kaldi::nnet2::AmNnet& nnet() const { return *nnet_; }
  const kaldi::OnlineNnet2FeaturePipelineInfo& feature_info() const { return *feature_info_; }

Note that all these references are const. nnet3 is no different.

As for the adaptation state (3), Dan has answered already. Yes, the mutex would guard against corrupting the internal data state, but heaping random updates from multiple utterances onto it makes no sense.


 -kkm


Jinserk Baik

unread,
Jul 23, 2017, 2:33:14 AM7/23/17
to kaldi-help
Thank you, Kirill! It would be great if we could speed up the AM computation easily by using a GPU, but as you know, GPUs are not a cheap resource, so most of our application servers don't have one. As I said, the server has a 40-core Intel E5 CPU, and most of the cores aren't utilized during decoding. Interestingly, even when I used OpenBLAS with multithreading support, the decoding time did not improve significantly: it reduced the execution time by a minute or less for the 15-minute wav file. That's why I'm thinking of parallelizing the decoding. I haven't tested MKL.

Jin




Kirill Katsnelson

unread,
Jul 23, 2017, 4:06:33 AM7/23/17
to kaldi-help
I did not have much success with MKL's multithreaded BLAS. On some machines performance even dropped inexplicably and significantly, and I did not pursue that with Intel any further; with our usage, where we target one active recognition per core on average, multithreading is hardly beneficial. But if you are after better realtime response rate, at least try it, but beware of possible scalability issues when hitting the server with many simultaneous requests. In the single-threaded case, MKL is certainly worth a try, but I do not expect it to magically change performance numbers.

Without understanding your application I cannot suggest more. Indeed, one obvious point where you can easily split processing between two cores is to compute the AM in one thread and pass frames to the decoder thread. But the split is uneven; with the observed 80/20 balance of CPU use between the AM and the decoder, this would give at most a 20% real-time improvement (YMMV if you use larger models). On the other hand, it is very simple to implement, and why not get a free 20% real-time response boost at times when the server is not running at capacity? Maybe I'll add that when I have free cycles myself. Another option is what you attempted: multithreaded matrix algebra libraries. All in all, in our application, sending speech to the server and into the decoder as it is received from the speaker proved totally adequate in a single-threaded implementation. When sizing servers, we went not for the most cores but for slightly higher GHz -- high-core-count chips have slower clocks, chiefly due to the high heat flux from the die.

Parallelizing WFST decoding looks, on a scale from helluva lot of work to nightmare, closer to nightmare. It is not only algorithmically complex (writing provably correct code is the easy part): minimizing locking (welcome to acquire/release memory semantics) is tricky, and optimizing for CPU cache performance is an art that I do not possess -- and not one I want to get into.

 -kkm

Jerry Pan

unread,
Jun 14, 2018, 5:17:52 AM6/14/18
to kaldi-help
Hi Dan,

It is not true for CUDA (with multi-threaded decoders). It is very easy to make it work with MKL or CBLAS: I have been running 30 decoders on a fairly complex nnet3 model, each under real-time, for days.
But it easily crashes on GPU (K20, Titan, etc.) within seconds. Sometimes it is a segmentation fault; sometimes another crash. All are related to CUDA functions, like the following. The CUDA driver functions are very lousy for multi-threaded decoding. Still trying to fix it...

Thanks!

LOG ([5.2.142~6-90600]:kaldi::CuDevice::IsComputeExclusive():cu-device.cc:264) CUDA setup operating under Compute Exclusive Process Mode.
LOG ([5.2.142~6-90600]:kaldi::CuDevice::SelectGpuIdMan():cu-device.cc:491) The active GPU is [0]: Tesla K20c    free:4647M, used:78M, total:4726M, free/total:0                                            .98335 version 3.5
LOG ([5.2.142~6-90600]:kaldi::nnet3::Nnet::RemoveSomeNodes():nnet-nnet.cc:926) Removed 2 orphan nodes.
LOG ([5.2.142~6-90600]:kaldi::nnet3::Nnet::RemoveOrphanComponents():nnet-nnet.cc:849) Removing 2 orphan components.
LOG ([5.2.142~6-90600]:kaldi::nnet3::ModelCollapser::Collapse():nnet-utils.cc:798) Added 1 components, removed 2
LOG ([5.2.142~6-90600]:kaldi::nnet3::CompileLooped():nnet-compile-looped.cc:337) Spent 2.86904 seconds in looped compilation.
ERROR ([5.2.142~6-90600]:kaldi::CuMemoryAllocator::Free():cu-allocator.cc:283) Attempt to free CUDA memory pointer that was not allocated: 0000004306AA0000


ASSERTION_FAILED ([5.2.142~6-90600]:kaldi::CuMemoryAllocator::MruCache::Lookup():cu-allocator.cc:317) : '!q.empty()'


WARNING ([5.2.142~6-90600]:kaldi::nnet3::NnetComputer::ExecuteCommand():nnet-compute.cc:341) Printing some background info since error was detected
LOG ([5.2.142~6-90600]:kaldi::nnet3::NnetComputer::ExecuteCommand():nnet-compute.cc:342) matrix m1(79, 29), m2(77, 29), m3(77, 29), m4(75, 928), m5(73, 464), m6(71, 464), m7(69, 464), m8(63, 464), m9(63, 512), m10(21, 1536), m11(21, 512), m12(1, 640), m13(1, 2560), m14(1, 1024), m15(1, 256), ... [several hundred more matrix dimensions elided] ..., m712(39, 29)
# The following show how matrices correspond to network-nodes and
# cindex-ids.  Format is: matrix = <node-id>.[value|deriv][ <list-of-cindex-ids> ]
# where a cindex-id is written as (n,t[,x]) but ranges of t values are compressed
# so we write (n, tfirst:tlast).
m1 == value: input[(0,-15:63)]
m2 == value: lda_input[(0,-15:61)]
m3 == value: lda[(0,-15:61)]
m4 == value: cnn1.conv[(0,-14:60)]
m5 == value: cnn2.conv[(0,-13:59)]
m6 == value: cnn3.conv[(0,-12:58)]

Daniel Povey

unread,
Jun 14, 2018, 2:24:40 PM6/14/18
to kaldi-help
That error you are getting can be easily fixed: if your program will be
multi-threaded, change your device initialization code to the following:

#if HAVE_CUDA==1
CuDevice::Instantiate().SelectGpuId(use_gpu);
CuDevice::Instantiate().AllowMultithreading();
#endif

(the second line is new).

But it's still probably not the best way to get fast multi-threaded
decoding with GPU. We are working on a version of nnet3-compute that
will take advantage of multiple GPUs, here:

https://github.com/kaldi-asr/kaldi/pull/2479

and when that's done, its output can be piped into (e.g.) a parallel
version of latgen-faster-mapped, or into a GPU-based decoder program
when that is done.

Dan

Jerry Pan

unread,
Jun 14, 2018, 9:19:13 PM6/14/18
to kaldi-help
Dan,
Thanks! Why did I not notice this function?