Checkpointing/restore TCP: "endpoint still has waiters upon save"

Travis DePrato

Jul 26, 2022, 12:31:26 AM
to gvisor...@googlegroups.com
It seems that checkpointing containers with active TCP connections isn't supported (though the most recent information I found on that was this GitHub issue from 2018).

As far as I can tell, setting DisconnectOk: true works (in Network.CreateLinksAndRoutes) to allow just dropping any active connections, but this doesn't work if there are any active waiters for a socket (it yields "endpoint still has waiters upon save"):

    if e.waiterQueue != nil && !e.waiterQueue.IsEmpty() {
        panic("endpoint still has waiters upon save")
    }

I'm not entirely sure why this restriction exists. AFAICT (piecing together history here), previously it wasn't possible to checkpoint/restore a waiter.Queue, but that changed (see https://github.com/google/gvisor/commit/8682ce689e928ec32ec810a7eb038fb582c66093).

Is there a reason that this check exists? Commenting it out half works. If there are no active connections (just a socket listening for new connections), it works™. If there are active connections during the checkpoint, however, it stops working (the server doesn't accept new connections). The test for this is just a very simple Go HTTP server that sleeps for one second.

This is something I'd really like to figure out, so if anyone has any tips/pointers, I'd be very appreciative!

Travis DePrato
he/him/his

Kevin Krakauer

Jul 27, 2022, 2:55:28 PM
to Travis DePrato, gvisor...@googlegroups.com
You're correct that we don't support checkpointing with active TCP
connections. This is something we'd like to have, but just haven't had
the bandwidth to do.

I think we could remove the check since the e.waiterQueue won't blow
up during checkpointing.

If you need true checkpointing of active TCP connections, that gets a
lot more complicated. Not only do we have to save our state, something
outside of the sandbox has to perform careful redirection of the
connection to the newly-restored sandbox. That's probably a lot of
work, and has to be done by whoever is running gVisor rather than
gVisor itself.

Kevin

Travis DePrato

Jul 27, 2022, 6:45:59 PM
to Kevin Krakauer, gvisor...@googlegroups.com
> just haven't had the bandwidth to do.

😏

Thanks for the response! That all makes sense. I'm particularly interested in just the case where there's a listening socket and any active streams can simply be disconnected. As far as I understand, that shouldn't require any magic from the host as long as the host routes new connections correctly.

But that^ doesn't seem to be working today (or at the very least, it doesn't work if there are active TCP connections that have been accept'd but not yet close'd when the checkpoint occurs; it does seem to work if there are no in-flight TCP connections). If you (or anyone) have any tips on how to debug that, I'd be grateful.

Travis DePrato
he/him/his


Kevin Krakauer

Jul 28, 2022, 1:32:52 PM
to Travis DePrato, gvisor...@googlegroups.com
Unfortunately it's not obvious to me why. We don't intentionally restrict this, so we'd have to start from scratch in figuring it out.

If you're interested in debugging, I'd start by looking at the endpoint's acceptQueue and endpoint.Listen to narrow down the cause.

Kevin