-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256
I don't have a Google account, so am reporting this here on crosvm-dev.
Hope that's okay.
Recently, I'm frequently seeing error messages like the following when
using crosvm with vhost-user-gpu:
> [2023-03-13T13:59:01.796197257+00:00 ERROR devices::virtio::vhost::user::vmm::handler] failed to start vhost_user_virtio_gpu worker: backend request failure: failed to handle a vhost-user request: temporary socket error: Resource temporarily unavailable (os error 11)
Below I have annotated the relevant code to explain why this happens.
(from devices/src/virtio/vhost/user/vmm/handler/sys/
unix.rs)
> pub async fn run_backend_request_handler(
> handler: Option<BackendReqHandler>,
> ex: &Executor,
> ) -> Result<()> {
> let mut handler = match handler {
> Some(h) => h,
> None => std::future::pending().await,
> };
>
> let h = SafeDescriptor::try_from(&handler as &dyn AsRawDescriptor)
> .map(AsyncWrapper::new)
> .context("failed to get safe descriptor for handler")?;
> let handler_source = ex
> .async_from(h)
> .context("failed to create an async source")?;
Here the backend request socket is set to non-blocking. This applies to
both file descriptors for it — the one owned by handler_source/h, and
the one owned by handler.
>
> loop {
> handler_source
> .wait_readable()
> .await
> .context("failed to wait for the handler to become readable")?;
This waits until there's something to read from the socket, but the
problem is that there being something to read does not guarantee that a
full request is available to read.
> match handler.handle_request() {
This returns a SocketRetry error if there isn't a full message available
to read.
> Ok(_) => (),
> Err(VhostError::ClientExit) => {
> info!("vhost-user connection closed");
> // Exit as the client closed the connection.
> return Ok(());
> }
> Err(e) => {
> bail!("failed to handle a vhost-user request: {}", e);
> }
> };
> }
> }
There are a couple of ways this could be fixed:
• Don't set the socket to be non-blocking. It would still be possible
to use poll() or similar to yield execution between messages
(although the crosvm codebase lacks a method for it), so blocking
would only happen during the very short time it takes to receive a
message.
• async-ify handle_request(), and the functions it uses to read from
the socket. Reading from the socket could be done in a loop that
would repeat on EAGAIN, with execution being yielded in between loop
iterations.
It might be tempting to try to split up reading the request's header and
body, like was done in
https://crrev.com/5b456d31, but this would be
incorrect, because the backend request socket is a stream socket.
Stream sockets have no concept of packets, so even just reading the
header could result in a partial read. As a result, something like the
async read loop suggested above is needed to guarantee that messages are
read properly. I think the fix in the linked commit is probably wrong
for the same reason, because the vhost-user frontend request socket is
also a stream socket. It looks like the bug goes back to
https://crrev.com/3040cf02, when async vhost-user request handling was
first introduced.
-----BEGIN PGP SIGNATURE-----
iQIzBAEBCAAdFiEEH9wgcxqlHM/ARR3h+dvtSFmyccAFAmQPa/EACgkQ+dvtSFmy
ccBNNhAAkN7SvMzbUDXNJox3F8R9NNw2uecdRESILeQaD1rraeW6aJc6am9MhqQV
CYV/bW6mB7ddyOKEbGPXeJYoXDTbWFPctgh67DzcYyGjaDsBYY8l1dU5zYsJxLeT
/G13z7zMawst8AyaQtrsPO4wrHSi7C2irZow8Mf7q4y4yCX4kjppoWXjN/eC6EuH
hRojkVwAVbWBZrvYqrVN6/7kXPSG5VyeWYqKF7yi2O/nC0Q7WjVx3XDTPCsAF6BV
V4rx6Qez7kkx2JVHlQdn45i4yTbT7E+kznViS5t8CdOEH/rul83Bt5m7kYPwWkN2
6UcW2PfJXlUjJVRxJBYST43sQOrXMaxwCCHlvWgjhHnQl2+/LEiYc5bCgEuxQzXZ
K/+ZVdODJTa0oYSSQBPWz2RCXv17pAOVW+eFXxv01j4vNFgThYL6gQ35AW6STcn2
oJ+XJzHg1Kvr+KA9x4CvR7ladvmmzBnWb5lMxXdn8qQN3NsuYiWXMMZFH0jLBXGV
mYwxeuAOGCdnxH/wyTWPORghr4a3R26xMuiMTa7qHa7OVQZ3I/chdjZzjgu/3Hnq
ldWSZ7hMgEOIu9A2mqS/j25JnyIP+SPS06RSbLM6y17LbYmqq4RB1baaZ/yibWql
Ze+aaJILR/IoM8qIVbUht7D0iJukhqYMYE62JrUcX7phU6Q2uT4=
=p73j
-----END PGP SIGNATURE-----