Hi All,
Hopefully this is the correct place to ask this question - if not, please point me to the right place. Thank you very much!
Problem
We run an Agones Fleet with replicas: 8, where each GameServer handles 500+ concurrent game sessions. Our allocator routes games only to servers labeled canAcceptGames: true. We use allocationOverflow to mark old servers as canAcceptGames: false during deployments, so they stop accepting new games and drain their existing sessions.
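For context, our allocation requests select on that label, roughly like the sketch below (the fleet name my-fleet is a placeholder):

```yaml
apiVersion: allocation.agones.dev/v1
kind: GameServerAllocation
spec:
  selectors:
    - matchLabels:
        agones.dev/fleet: my-fleet   # placeholder fleet name
        canAcceptGames: "true"       # only route to servers still accepting games
```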
Fleet config:
spec:
  replicas: 8
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 99%
      maxUnavailable: 50%
  allocationOverflow:
    labels:
      canAcceptGames: "false"
Current behavior during deployment
1. kubectl apply with a new image
2. allocationOverflow immediately marks 4 old Allocated servers as canAcceptGames: false — they stop accepting new games
3. Those 4 servers begin draining their existing 500+ game sessions (takes up to 10 minutes)
4. No new servers spin up until old servers finish draining and call SDK.Shutdown()
5. Only when an old server dies does Agones create a replacement
The result: for the entire drain period (up to 10 minutes), only 4 servers are accepting new games instead of 8, so all traffic load concentrates on half the fleet. New servers don't appear until old ones fully terminate.
What we expected
With maxSurge: 99%, we expected Agones to immediately spin up new servers alongside the old ones. The total would temporarily exceed replicas (e.g., 12 servers: 4 old draining + 4 old active + 4 new), then converge back to 8 as old servers shut down.
What we want
1. When a deployment happens → immediately spin up new servers (e.g., 4) while old servers are still draining
2. Mark 50% of old servers as canAcceptGames: false via allocationOverflow (this part works)
3. As old draining servers finish and shut down, spin up remaining new servers until we reach replicas capacity
This way the fleet never drops below full capacity. New servers absorb load while old servers drain gracefully.
What we observe
The new GameServerSet is created but stays at desired: 0. Agones won't scale it up because it can't reduce the old GameServerSet — all old servers are Allocated and can't be removed until they call SDK.Shutdown().
NAME            SCHEDULING   DESIRED   CURRENT   ALLOCATED   READY
fleet-old-gss   Packed       7         8         8           0       # can't reduce: all Allocated
fleet-new-gss   Packed       0         0         0           0       # never scales up
It appears the rolling update algorithm requires old servers to be removed before creating new ones, even when maxSurge should allow headroom for additional servers.
Our workaround
We temporarily set replicas: 16 (2x) during deployment so Agones creates new servers to meet the higher count. After the new servers are Allocated and receiving traffic, we scale back to replicas: 8. The old servers drain and die without being replaced (the total is still above 8).
This works but adds CI/CD orchestration complexity that we'd prefer the rolling update to handle natively.
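In case it helps, here is a minimal sketch of that orchestration as a shell script. The fleet name my-fleet, the manifest path fleet.yaml, and the replica counts are placeholders, and a real pipeline would poll for the new servers to become Allocated between steps 2 and 3 rather than running them back to back:

```shell
#!/usr/bin/env bash
# Sketch of our CI/CD workaround. DRY_RUN defaults to 1 so this only
# prints the commands it would run; set DRY_RUN=0 in a real pipeline.
set -euo pipefail

FLEET="${FLEET:-my-fleet}"   # placeholder fleet name
NORMAL_REPLICAS=8
SURGE_REPLICAS=16

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

surge_deploy() {
  # 1. Double the fleet so Agones creates new servers to meet the higher count.
  run kubectl scale fleet "$FLEET" --replicas="$SURGE_REPLICAS"

  # 2. Apply the new image; allocationOverflow labels the old Allocated
  #    servers canAcceptGames=false and they begin draining.
  run kubectl apply -f fleet.yaml

  # 3. Once the new servers are Allocated and taking traffic, scale back
  #    down. Old servers drain and shut down without being replaced,
  #    because the total is still above NORMAL_REPLICAS.
  run kubectl scale fleet "$FLEET" --replicas="$NORMAL_REPLICAS"
}

surge_deploy
```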
Questions
1. Is this intended behavior? Should maxSurge create new servers even when old Allocated ones can't be removed yet?
2. Is there a way to tell Agones "create new servers first, then wait for old ones to drain" — rather than the current "reduce old first, then create new"?
3. Are there plans to support a surge-first rolling update strategy for long-lived Allocated GameServers?
Environment
- Agones version: 1.39.0
- Kubernetes: GKE
- All 8 GameServers are Allocated at deploy time, each handling 500+ active game sessions
- Drain time: up to 10 minutes per server
Thank you very much