Here is a drawing I often use to explain basic VXML IVR systems -
https://imgur.com/a/uYQPIgk
Of course, this is not the only way to build a solution, but it is a very common approach.
I think your understanding is correct. MRCP is not a call-handling protocol. Calls arrive via SIP or a similar signaling protocol and are answered by a telephony platform, typically something like FreeSWITCH or Asterisk, or in many systems a VXML voice browser. MRCP is the protocol that platform then uses, acting as a client, to control media resources on a speech server. The media may be speech sent for recognition, recording, or speaker verification, or text sent for speech synthesis. The most common cases are recognition (ASR, STT -
https://www.rfc-editor.org/rfc/rfc6787.html#section-9) and synthesis (TTS -
https://www.rfc-editor.org/rfc/rfc6787.html#section-8).
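To make the "control" part a bit more concrete, here is a rough Python sketch of what an MRCPv2 request looks like on the wire, following the start-line layout in RFC 6787 (version, total message length in octets, method, request ID). The function name `build_mrcp_request`, the channel identifiers, the grammar, and the prompt text are all made up for illustration. In a real session the channel identifier is handed out by the MRCP server during the SIP/SDP setup, and the message travels over a separate TCP or TLS connection, none of which is shown here.

```python
def build_mrcp_request(method, request_id, channel_id, content_type=None, body=""):
    """Serialize one MRCPv2 request as bytes (illustrative sketch only)."""
    payload = body.encode("utf-8")
    headers = [f"Channel-Identifier:{channel_id}"]
    if payload:
        headers.append(f"Content-Type:{content_type}")
        headers.append(f"Content-Length:{len(payload)}")
    rest = ("\r\n".join(headers) + "\r\n\r\n").encode("ascii") + payload

    # The message-length field counts the whole message, including the
    # start-line that contains it, so iterate until the value is stable.
    length = 0
    while True:
        start_line = f"MRCP/2.0 {length} {method} {request_id}\r\n".encode("ascii")
        total = len(start_line) + len(rest)
        if total == length:
            return start_line + rest
        length = total


# Hypothetical yes/no grammar sent inline with the recognition request.
GRAMMAR = (
    '<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0" '
    'xml:lang="en-US" root="answer">'
    '<rule id="answer"><one-of><item>yes</item><item>no</item></one-of></rule>'
    "</grammar>"
)

if __name__ == "__main__":
    # Recognition: the client asks the recognizer resource to listen.
    recognize = build_mrcp_request(
        "RECOGNIZE", 1, "32AECB23433801@speechrecog",
        "application/srgs+xml", GRAMMAR,
    )
    # Synthesis: the client asks the synthesizer resource to speak text.
    speak = build_mrcp_request(
        "SPEAK", 2, "32AECB23433802@speechsynth",
        "text/plain", "Welcome to the demo line.",
    )
    print(recognize.decode("utf-8"))
    print(speak.decode("utf-8"))
```

Note that neither message carries audio. The audio itself flows over RTP, and MRCP only tells the recognizer or synthesizer what to do with it.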
VXML is one way people build these applications. Many VXML applications still use SRGS XML (.grxml) grammars to control recognition. Outbound prompts are often either synthesized by TTS or played from pre-recorded audio files.
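If it helps to see how the pieces fit together, below is a small Python sketch that embeds a minimal SRGS XML grammar and a minimal VXML field that references it, then parses both with the standard library. The file name `yesno.grxml`, the prompt wording, and the helper names `grammar_phrases` and `field_summary` are my own placeholders, not anything from a real application.

```python
import xml.etree.ElementTree as ET

SRGS_NS = {"g": "http://www.w3.org/2001/06/grammar"}
VXML_NS = {"v": "http://www.w3.org/2001/vxml"}

# Hypothetical contents of yesno.grxml: the caller may say "yes" or "no".
YESNO_GRXML = """<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" mode="voice" root="answer">
  <rule id="answer" scope="public">
    <one-of>
      <item>yes</item>
      <item>no</item>
    </one-of>
  </rule>
</grammar>"""

# Hypothetical VXML page: one field with a TTS prompt and the grammar above.
CONFIRM_VXML = """<vxml xmlns="http://www.w3.org/2001/vxml"
      version="2.1" xml:lang="en-US">
  <form id="confirm">
    <field name="answer">
      <grammar src="yesno.grxml" type="application/srgs+xml"/>
      <prompt>Would you like to hear your balance?</prompt>
      <filled>
        <prompt>You said <value expr="answer"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>"""


def grammar_phrases(grxml):
    """Return the literal phrases listed in the grammar's <item> elements."""
    root = ET.fromstring(grxml)
    return [item.text.strip() for item in root.findall(".//g:item", SRGS_NS)]


def field_summary(vxml):
    """Return the first field's name, grammar reference, and opening prompt."""
    root = ET.fromstring(vxml)
    field = root.find(".//v:field", VXML_NS)
    grammar = field.find("v:grammar", VXML_NS)
    prompt = field.find("v:prompt", VXML_NS)
    return {
        "field": field.get("name"),
        "grammar": grammar.get("src"),
        "prompt": prompt.text.strip(),
    }


if __name__ == "__main__":
    print(grammar_phrases(YESNO_GRXML))
    print(field_summary(CONFIRM_VXML))
```

In a typical deployment the voice browser would run the VXML page, hand the prompt text to a synthesizer and the grammar to a recognizer, and fill the `answer` field with whatever comes back. To play recorded audio instead of TTS, the prompt would wrap an `<audio src="..."/>` element rather than plain text.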