crosvm recently added support for vhost-user SHMEM_MAP as it is defined in the spec, but, since crosvm doesn't negotiate REPLY_ACK and set the need_reply bit in the request message, there is no reply message to tell the backend when the SHMEM_MAP op finished and it is safe to use the mapped memory, so we think it might have introduced a race condition. We temporarily introduced a hack to always send a reply (
https://crrev.com/c/7399018) and want to remove that hack ASAP.
Proposal:
- The crosvm vhost-user frontend will attempt to negotiate the REPLY_ACK protocol feature, but it won't set the need_reply bit for any of its request messages to the backend. It negotiates REPLY_ACK only so that the backend is allowed to use the need_reply bit for backend->frontend requests.
- The crosvm vhost-user backend will advertise REPLY_ACK support for all devices. If negotiated, then it will set the need_reply bit only for SHMEM_MAP and SHMEM_UNMAP messages.
So, when only crosvm is used for both frontend and backends, the behavior change is minimal. When interop'ing with other projects, like another VMM's vhost-user frontend, they are free to make heavier use of the need_reply bit. crosvm already had code to handle need_reply and the code seems correct to me, but there is potential for a breakage.
Why not set need_reply for all request messages? Most of the requests from the frontend are basically just setting fields in the backend. Waiting for an ACK in response to each field set op will slow things down (I haven't measured it) for essentially no advantage. The message types that seem to have ordering constraints, like VHOST_USER_GET_VRING_BASE (i.e. stop virtqueue op), already require reply messages.
Theoretically REPLY_ACK could enable communicating non-fatal errors between frontend/backend and allow for retries or fallback behavior, but we don't really support that in either frontend or backend yet.
The main CLs are: