It seems I may be having a similar problem with the ASR component (I use a separate server for TTS, not UniMRCP).
I am using GSR plugin version 1.14.0, the first version I have installed. Before writing this message I verified that the umc scenarios work correctly when reading input from the demo files, so the base setup is OK.
What happens is this: I have a fairly long audio message with barge-in enabled, and I cannot hear more than about the first two seconds of it, because a no-match is returned. If I speak immediately, the speech is processed and results are obtained.
I am using the builtin:speech/transcribe grammar, with most values taken from the GSR defaults. I tried to send a minimal set of headers in the RECOGNIZE request, using the same values set by the umc client (I also tried changing Recognition-Timeout from 10000 to 20000, but it made no difference at all).
This is the unimrcpserver log from the RECOGNIZE request up to the no-match (I am using MRCPv1 because I use a VoiceXML interpreter developed internally by our company, which still supports only that MRCP version):
2019-10-09 17:32:32:962373 [INFO] Receive RTSP Data 10.51.54.199:1554 <-> 10.51.54.199:41692 [568 bytes]
ANNOUNCE rtsp://10.51.54.199:1554/speechrecognizer RTSP/1.0
Content-Length: 284
CSeq: 65
Session: 0a2ac1c42a994d7a
Content-Type: application/mrcp
Supported: method.announce
User-Agent: VoiceXML-SIP/6.4.4-2176:2177 (Linux 3.10.0-862.3.2.el7.x86_64) VoiceXML/2.1 JRE/1.8.0_171
RECOGNIZE 3 MRCP/1.0
Content-Length: 55
Speech-Language: it-IT
No-Input-Timeout: 10000
Save-Waveform: true
Content-Type: text/uri-list
Confidence-Threshold: 50
Recognition-Timeout: 20000
Recognizer-Start-Timers: false
session:aNAVWcxPeGmMOHQQ...@external.com
2019-10-09 17:32:32:962492 [INFO] Process RECOGNIZE Request <0a2ac1c42a994d7a@speechrecog> [3]
2019-10-09 17:32:32:962583 [INFO] Init Speech Detector: frame-size=160, max-frame-count=350, output-frame-count=20, vad-mode=2, noinput-timeout=10000 ms, input-timeout=20000 ms, start-timeout=50 ms, complete-timeout=1000 ms, incomplete-timeout=3000 ms, leading-silence=300 ms, trailing-silence=300 ms, interim-results=1, start-of-input=external <0a2ac1c42a994d7a>
2019-10-09 17:32:32:962643 [INFO] Open Waveform File for Writing /opt/unimrcp/var/umsgsr-0a2ac1c42a994d7a-3.wav, sampling-rate [8000]
2019-10-09 17:32:32:962801 [INFO] Create gRPC Stream <0a2ac1c42a994d7a@gsr>
2019-10-09 17:32:32:963736 [INFO] Process RECOGNIZE Response <0a2ac1c42a994d7a@speechrecog> [3]
2019-10-09 17:32:32:963750 [INFO] State Transition IDLE -> RECOGNIZING <0a2ac1c42a994d7a@speechrecog>
2019-10-09 17:32:32:963785 [INFO] Send RTSP Data 10.51.54.199:1554 <-> 10.51.54.199:41692 [138 bytes]
RTSP/1.0 200 OK
CSeq: 65
Session: 0a2ac1c42a994d7a
Content-Type: application/mrcp
Content-Length: 30
MRCP/1.0 3 200 IN-PROGRESS
2019-10-09 17:32:33:092620 [INFO] Speech Detector State Transition NO-INPUT -> IN-PROGRESS [130 ms] <0a2ac1c42a994d7a>
2019-10-09 17:32:33:092649 [INFO] Start Input Timer [20000 ms] <0a2ac1c42a994d7a>
2019-10-09 17:32:36:572595 [INFO] Speech Detector State Transition IN-PROGRESS -> COMPLETE [3480 ms] <0a2ac1c42a994d7a>
2019-10-09 17:32:36:572623 [INFO] Detector Stats: leading-silence=20 ms, input=530 ms, trailing-silence=3000 ms <0a2ac1c42a994d7a>
2019-10-09 17:32:36:572869 [INFO] Input Complete [success] size=61280 bytes, dur=3830 ms <0a2ac1c42a994d7a@gsr>
2019-10-09 17:32:36:628037 [INFO] Received Response: status [0] type [unspecified] result-count [0] <0a2ac1c42a994d7a@gsr>
2019-10-09 17:32:36:628135 [INFO] Process START-OF-SPEECH Event <0a2ac1c42a994d7a@speechrecog> [3]
2019-10-09 17:32:36:628197 [INFO] Process RECOGNITION-COMPLETE Event <0a2ac1c42a994d7a@speechrecog> [3]
2019-10-09 17:32:36:628215 [INFO] State Transition RECOGNIZING -> RECOGNIZED <0a2ac1c42a994d7a@speechrecog>
2019-10-09 17:32:36:628238 [INFO] Send RTSP Data 10.51.54.199:1554 <-> 10.51.54.199:41692 [194 bytes]
ANNOUNCE rtsp://10.51.54.199:1554/speechrecognizer RTSP/1.0
CSeq: 65
Session: 0a2ac1c42a994d7a
Content-Type: application/mrcp
Content-Length: 42
START-OF-SPEECH 3 IN-PROGRESS MRCP/1.0
2019-10-09 17:32:36:628282 [INFO] Send RTSP Data 10.51.54.199:1554 <-> 10.51.54.199:41692 [329 bytes]
ANNOUNCE rtsp://10.51.54.199:1554/speechrecognizer RTSP/1.0
CSeq: 65
Session: 0a2ac1c42a994d7a
Content-Type: application/mrcp
Content-Length: 176
RECOGNITION-COMPLETE 3 COMPLETE MRCP/1.0
Completion-Cause: 001 no-match
Waveform-Url: <http://localhost/utterances/umsgsr-0a2ac1c42a994d7a-3.wav>;size=61280;duration=3830
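To make the timing values in the log easier to compare, here is a minimal Python sketch that pulls the millisecond fields out of the "Init Speech Detector" and "Detector Stats" lines. The helper name parse_ms_fields is my own and not part of UniMRCP. The log lines are copied from above with the timestamp and session id dropped.

```python
import re

# Lines copied from the unimrcpserver log above (timestamp and session id dropped).
INIT_LINE = (
    "Init Speech Detector: frame-size=160, max-frame-count=350, "
    "output-frame-count=20, vad-mode=2, noinput-timeout=10000 ms, "
    "input-timeout=20000 ms, start-timeout=50 ms, complete-timeout=1000 ms, "
    "incomplete-timeout=3000 ms, leading-silence=300 ms, "
    "trailing-silence=300 ms, interim-results=1, start-of-input=external"
)
STATS_LINE = (
    "Detector Stats: leading-silence=20 ms, input=530 ms, "
    "trailing-silence=3000 ms"
)


def parse_ms_fields(line):
    """Return {name: milliseconds} for every 'name=<int> ms' pair in a log line.

    Hypothetical helper for illustration only; fields without an 'ms' unit
    (frame-size, vad-mode, ...) are skipped on purpose.
    """
    return {name: int(value)
            for name, value in re.findall(r"([a-z-]+)=(\d+) ms", line)}


config = parse_ms_fields(INIT_LINE)
stats = parse_ms_fields(STATS_LINE)

print("configured:", config)
print("measured:  ", stats)
# Compare the measured trailing silence with the configured incomplete-timeout.
print("trailing-silence equals incomplete-timeout:",
      stats["trailing-silence"] == config["incomplete-timeout"])
```

In this log the measured trailing silence (3000 ms) equals the configured incomplete-timeout (3000 ms) after only 530 ms of detected input. That may mean the detector stopped on the incomplete-timeout, but I have not confirmed how the plugin applies these values.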
In case it helps you understand what is happening and how to prevent it: the indicated duration is 3480 ms, but if I (try to) start speaking as soon as the audio message begins, there are almost 2 seconds of silence at the beginning of the saved waveform (I cannot yet explain where that comes from).
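As a sanity check on the saved waveform, here is a small Python sketch of two calculations. It assumes the audio is 16-bit mono linear PCM at the 8000 Hz sampling rate shown in the log, and that the reported size counts only audio data (both are my assumptions). The function names and the amplitude threshold of 500 are hypothetical and chosen only for illustration.

```python
import struct


def pcm_duration_ms(size_bytes, sample_rate_hz=8000, bytes_per_sample=2, channels=1):
    """Duration of raw PCM audio in ms (assumes 16-bit mono unless overridden)."""
    return size_bytes * 1000 // (sample_rate_hz * bytes_per_sample * channels)


def leading_silence_ms(pcm_bytes, sample_rate_hz=8000, threshold=500):
    """Milliseconds before the first 16-bit little-endian sample whose
    magnitude exceeds `threshold` (an arbitrary illustrative value)."""
    count = len(pcm_bytes) // 2
    samples = struct.unpack("<%dh" % count, pcm_bytes[:count * 2])
    for index, sample in enumerate(samples):
        if abs(sample) > threshold:
            return index * 1000 // sample_rate_hz
    return count * 1000 // sample_rate_hz  # the whole buffer is below threshold


# size=61280 bytes is the value reported in the log and in Waveform-Url.
print(pcm_duration_ms(61280))  # 3830, matching dur=3830 ms in the log

# Synthetic example: 2 s of zeros followed by 1 s of a constant non-zero level.
synthetic = struct.pack("<16000h", *([0] * 16000)) + struct.pack("<8000h", *([1000] * 8000))
print(leading_silence_ms(synthetic))  # 2000
```

Under those assumptions the byte count agrees with the 3830 ms duration reported for the waveform. The second function could be run on the sample data of the saved file (for example, read with the standard wave module) to measure how long the initial silence really is.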