I am looking into scenarios where one could encounter the "writer segworker group shared snapshot collision" error. The only possibility I have found (based on past investigations and my reading of the code) is the case where the QE session processes are terminated (which frees the shared snapshot slot keyed on gp_session_id) while the QD session processes remain, and then a new command in this session respawns new QE processes (which need a new shared snapshot slot for the same gp_session_id). The new process keeps hitting the collision error until the old process on the QE is gone.
This mostly happens when gp_vmem_idle_resource_timeout is reached: the QD cleans up the idle writer and reader gangs and closes its connections to the QEs, and each QE then quits asynchronously. If a QE has not quit before the QD starts a new command, the new QE process finds the same session id already present in the shared snapshot. A QE session can take time to quit because of ProcArrayLock contention.
gp_vmem_idle_resource_timeout is described as a way to release the resources held by idle QE sessions and thereby help concurrency.
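To make the race concrete, here is a toy model in C. It is only a sketch: the slot table, the function names and the tick-based retry loop are all made up for illustration and are not the actual shared snapshot code. It assumes the new writer simply retries the slot add a bounded number of times while the old writer is still registered under the same session id.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SLOTS 4     /* toy capacity, not a real limit */
#define FREE_SLOT (-1)

/* Hypothetical stand-in for the shared snapshot slots, keyed by session id. */
typedef struct { int session_id[MAX_SLOTS]; } SlotTable;

void slot_table_init(SlotTable *t)
{
    for (int i = 0; i < MAX_SLOTS; i++)
        t->session_id[i] = FREE_SLOT;
}

/* Returns false if the session id is still registered (collision) or the table is full. */
bool slot_add(SlotTable *t, int session_id)
{
    int free_idx = -1;
    for (int i = 0; i < MAX_SLOTS; i++) {
        if (t->session_id[i] == session_id)
            return false;               /* old writer has not released its slot yet */
        if (t->session_id[i] == FREE_SLOT && free_idx < 0)
            free_idx = i;
    }
    if (free_idx < 0)
        return false;
    t->session_id[free_idx] = session_id;
    return true;
}

void slot_release(SlotTable *t, int session_id)
{
    for (int i = 0; i < MAX_SLOTS; i++)
        if (t->session_id[i] == session_id)
            t->session_id[i] = FREE_SLOT;
}

/*
 * Models the respawned writer: it retries the add up to max_retries times.
 * The old writer is assumed to exit (and release its slot) at attempt
 * number old_exit_tick. Returns the attempt that succeeded, or -1 if the
 * retries ran out first, which corresponds to the collision error.
 */
int new_writer_attach(SlotTable *t, int session_id, int max_retries, int old_exit_tick)
{
    for (int attempt = 0; attempt < max_retries; attempt++) {
        if (attempt == old_exit_tick)
            slot_release(t, session_id);    /* old QE process finally quits */
        if (slot_add(t, session_id))
            return attempt;
    }
    return -1;
}
```

In this model the outcome depends only on whether the old writer exits before the retries run out, which is why raising the retry budget hides the problem without removing it.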
Solution 1:
I understand that releasing idle QE reader gangs helps release resources. But how much do we really save by also quitting the QE writer gang process? What if we avoid doing that?
If we quit only the reader gangs and avoid releasing the writer process as part of gp_vmem_idle_resource_timeout, we avoid these snapshot add collisions on new commands for the session. We already have conditional logic in DisconnectAndDestroyUnusedQEs() that cleans up only the reader gangs and leaves the writer gang alone if the session is using TempNamespace. I am thinking that, by default, we could quit only the reader gangs on idle session timeout and keep the writer for all sessions. QEs always have a much higher max_connections value than the QD, so the number of processes doesn't seem to be an issue.
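The policy I have in mind can be sketched roughly as below. This is a simplified illustration, not the real DisconnectAndDestroyUnusedQEs() code: the Gang struct, the GangKind enum and cleanup_idle_gangs() are hypothetical names, and the keep_writer flag stands in for what is today the TempNamespace condition and would become the default.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a gang for illustration only. */
typedef enum { GANG_WRITER, GANG_READER } GangKind;

typedef struct {
    GangKind kind;
    bool     alive;
} Gang;

/*
 * Idle-timeout cleanup: destroy every live gang, except that the writer
 * gang is skipped when keep_writer is true. Returns the number of gangs
 * destroyed.
 */
int cleanup_idle_gangs(Gang *gangs, size_t n, bool keep_writer)
{
    int destroyed = 0;

    for (size_t i = 0; i < n; i++) {
        if (!gangs[i].alive)
            continue;
        if (gangs[i].kind == GANG_WRITER && keep_writer)
            continue;   /* writer keeps its shared snapshot slot */
        gangs[i].alive = false;
        destroyed++;
    }
    return destroyed;
}
```

With keep_writer set, the writer process never leaves, so its shared snapshot slot is never freed and re-added and the collision window does not exist for that session.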
Solution 2:
What if we reset the session id for this case, similar to DisconnectAndDestroyAllGangs()? On a new command after idle session cleanup, a new session id would be assigned, which would also avoid the shared snapshot collision. I have not explored the downsides of this approach, but at first glance nothing jumps out. Please let me know if you see any concerns with it.
Note: So far our recommendation has been to increase the value of the GUC gp_snapshotadd_timeout to avoid this problem. I am trying to see if we can avoid the situation itself instead of having to tune a GUC to avoid the error.
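A toy illustration of why a fresh session id sidesteps the collision is below. Again this is only a sketch under assumptions: SegmentState, DispatcherState and both functions are invented for the example, and handing out ids from a simple counter is my simplification rather than how session ids are actually allocated.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SLOTS 4     /* toy capacity, not a real limit */
#define FREE_SLOT (-1)

/* Hypothetical segment-side slot table keyed by session id. */
typedef struct { int slots[MAX_SLOTS]; } SegmentState;

/* Hypothetical dispatcher-side state; next_session_id is a toy counter. */
typedef struct { int session_id; int next_session_id; } DispatcherState;

void segment_init(SegmentState *s)
{
    for (int i = 0; i < MAX_SLOTS; i++)
        s->slots[i] = FREE_SLOT;
}

/* Fails if a slot for this session id already exists or no slot is free. */
bool segment_add_slot(SegmentState *s, int session_id)
{
    int free_idx = -1;
    for (int i = 0; i < MAX_SLOTS; i++) {
        if (s->slots[i] == session_id)
            return false;               /* collision with the lingering writer */
        if (s->slots[i] == FREE_SLOT && free_idx < 0)
            free_idx = i;
    }
    if (free_idx < 0)
        return false;
    s->slots[free_idx] = session_id;
    return true;
}

/* Idle cleanup on the dispatcher; optionally switch to a fresh session id. */
void dispatcher_idle_cleanup(DispatcherState *qd, bool reset_session_id)
{
    if (reset_session_id)
        qd->session_id = qd->next_session_id++;
}
```

The old slot still lingers in the table until the old process exits, but the new writer no longer looks for the same key, so it does not have to wait for that exit.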
--