Hi Anthony,
Thanks for your thoughts. I clearly understand what you mean, and I admit the method you described should work efficiently enough for short continuous streams (typical utterances).
But thinking globally, and considering more than just the ASR case, this method isn't acceptable, as it doesn't allow exact reconstruction of the streams. I mean the gaps in RTP streams (discontinuous transmission): without timestamps, the receiver cannot tell how long a pause was. Also, that method doesn't produce real-time data at the output, while RTP (Real-time Transport Protocol) is intended for real-time transmission.
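To illustrate what I mean by exact reconstruction, here is a minimal sketch in Python (not UniMRCP code). It assumes 8 kHz audio, 160 samples (20 ms) per packet, already decoded linear PCM, and no 32-bit timestamp wraparound. The names Packet and reconstruct are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass

SAMPLE_RATE = 8000         # assumption: 8 kHz narrowband audio
SAMPLES_PER_PACKET = 160   # assumption: 20 ms of audio per packet
SILENCE = 0                # assumption: linear PCM, where 0 is silence


@dataclass
class Packet:              # hypothetical type, for illustration only
    timestamp: int         # RTP timestamp, counted in samples
    samples: list          # decoded PCM samples carried by the packet


def reconstruct(packets):
    """Rebuild the sample timeline, filling timestamp gaps with silence."""
    out = []
    expected = None        # timestamp we expect the next packet to carry
    for pkt in sorted(packets, key=lambda p: p.timestamp):
        if expected is not None:
            if pkt.timestamp < expected:
                continue   # duplicate or overlapping packet, skip it
            # A jump in the timestamp means the sender was silent (DTX)
            # or packets were lost; keep the timing by padding.
            out.extend([SILENCE] * (pkt.timestamp - expected))
        out.extend(pkt.samples)
        expected = pkt.timestamp + len(pkt.samples)
    return out


# Three packets; the sender skipped 20 ms between the second and third.
stream = [
    Packet(0, [1] * SAMPLES_PER_PACKET),
    Packet(160, [1] * SAMPLES_PER_PACKET),
    Packet(480, [1] * SAMPLES_PER_PACKET),
]
audio = reconstruct(stream)
```

Here the rebuilt stream is 640 samples long, with the 20 ms pause preserved. Simply concatenating the payloads would give 480 samples and lose the pause.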
One may ask why RTP is used for ASR at all, and whether it wouldn't be better to use HTTP instead. Please note that the method you described can be considered HTTP streaming, and this is the core difference between the two protocols (RTP and HTTP).
I'd answer that they probably chose RTP because MRCP is intended to be used in a VoIP environment, where you process real-time data all the time (SIP, H.323, PSTN, ...).
Also, suppose that one day you need to process video streams too (I know that's not tomorrow for ASR). How will you synchronize the audio and video streams without timestamps?
> (ie : an adaptavice jitter between 0 and "maximum delay")
I like this a lot, and the current implementation in UniMRCP is about 8 hours behind that.
I hope my concerns are clear and acceptable too.