Hi all,
We've run into a problem with a JupyterHub-on-Kubernetes setup (based on Z2JH) where user pods sometimes end up in an OutOfmemory state on startup. It seems to happen only when we run multiple hubs together with user placeholders. Has anyone else run into this issue?
Our setup is a single Kubernetes cluster. Nodes are sized to hold a single user pod at a time, and we rely on autoscaling to set the cluster size appropriately. We typically run several hubs on this cluster, and we've recently been experimenting with user placeholders to speed up server start times.
As best I can tell, the issue occurs as follows. Suppose we're running two hubs, alpha and beta. Alpha has two placeholders running, ph-1 and ph-2. Since each takes up a full node, they occupy node-a and node-b. Now a user on the beta hub, user-beta-1, starts their server. There is no spare capacity, so ph-1 gets evicted and user-beta-1 is assigned to node-a. This all works fine, even though ph-1 and user-beta-1 are in different namespaces. The cluster autoscaler notices the now-unschedulable pod (ph-1) and starts scaling up.

Before the new node is ready, another user from the beta hub, user-beta-2, starts their server. Again there is no spare capacity, so ph-2 is evicted. This time, however, both user-beta-2 and ph-1 (which has been waiting for space to open up) get assigned to node-b, which ph-2 just vacated. ph-1 starts up more quickly and reserves the node's resources, so when user-beta-2 starts, it finds insufficient resources. (In our setup, memory is the binding limit.) user-beta-2 enters an OutOfmemory state, where it sulks until I come around and delete the pod. (Even if we remove other pods from the node, it never recovers.)
One worry is that there's some mismatch in priorities between the different namespaces. But I don't think that's the (entire) issue -- placeholders from one namespace clearly do get evicted to make room for pods from another. I think it has more to do with how waiting pods are assigned to newly empty nodes. As I understand it, that assignment is done by the userScheduler, and there's one per hub namespace. Perhaps the two schedulers are making inconsistent decisions?
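To check whether the priorities and schedulers actually line up, I've been dumping the relevant pods with a small script using the Python kubernetes client. The label selector and component names (app=jupyterhub, component=singleuser-server / user-placeholder) are just what I believe Z2JH applies, so treat this as a sketch rather than anything definitive:

```python
from kubernetes import client, config

# Running from my laptop; inside the cluster this would be load_incluster_config().
config.load_kube_config()
v1 = client.CoreV1Api()

# Assumed Z2JH labels -- adjust if your chart release labels pods differently.
pods = v1.list_pod_for_all_namespaces(label_selector="app=jupyterhub")
for pod in pods.items:
    component = (pod.metadata.labels or {}).get("component", "")
    if component not in ("singleuser-server", "user-placeholder"):
        continue
    print(
        f"{pod.metadata.namespace}/{pod.metadata.name} ({component}): "
        f"node={pod.spec.node_name} "
        f"scheduler={pod.spec.scheduler_name} "
        f"priorityClass={pod.spec.priority_class_name} "
        f"priority={pod.spec.priority} "
        f"phase={pod.status.phase} reason={pod.status.reason}"
    )
```

If the priority numbers for placeholders and user pods differ between namespaces, that would support the priority-mismatch theory; if they match and only the scheduler names differ, the inconsistent-scheduler theory looks more likely.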
One solution would be to give each hub its own node pool, so that pods could only evict placeholders belonging to the same hub. I'd like to avoid that if possible: our hubs have different usage patterns, so it's nice to have one large pool of placeholders that can serve whichever hub is seeing the most use at any given time.
I wonder if it's possible to run a single userScheduler for all the hubs. Perhaps this would force it to consider user pods from all namespaces when making decisions. But I don't know how to go about doing this offhand.
Another solution would be to find a way to restart user pods that get into the OutOfmemory state. If I delete the pod by hand and then restart the server from the web interface (which itself requires a restart to notice that the user pod has gone away), it comes up just fine, even kicking out the placeholder that beat it before. Running a cron job along the lines of the sketch below every minute would be a fine stop-gap solution. But again, I'm a bit out of my depth here.
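For concreteness, here's roughly what I imagine that cron job doing, again with the Python kubernetes client. The component=singleuser-server label and the OutOfmemory status reason are my assumptions about how these pods are labelled and reported, so this is only a sketch:

```python
from kubernetes import client, config

# Would run as a Kubernetes CronJob, hence in-cluster config.
config.load_incluster_config()
v1 = client.CoreV1Api()

# Assumed Z2JH label for single-user server pods.
pods = v1.list_pod_for_all_namespaces(label_selector="component=singleuser-server")
for pod in pods.items:
    # Pods the kubelet rejected for lack of memory show up as Failed / OutOfmemory.
    if pod.status.phase == "Failed" and pod.status.reason == "OutOfmemory":
        print(f"deleting {pod.metadata.namespace}/{pod.metadata.name}")
        v1.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace)
```

The part I haven't figured out is getting the hub to notice and respawn the server automatically afterwards, which is why I'd still prefer a fix on the scheduling side.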
Any ideas or suggestions? I'll readily admit there's a 50/50 chance I've misdiagnosed at least part of the problem, so I'm happy to run any additional diagnostics that might clear things up.
Thanks,
Robert