Once again, we are going to cancel the release this week. But we've made significant progress and have fixed just about all of the release-blocking bugs, so we expect to release as normal next week. As it turns out, we had several different problems that all came together in a harmoniously bad way.
- The /playIndex endpoint used for checking the health of the container was returning 200, even when the server was failing. This endpoint has now been enhanced to do a super lightweight database call to ensure the server and its connection to the database are healthy.
- We found a piece of code in the in-progress session timeout work that caused a thread starvation/deadlock condition because part of it was not fully asynchronous. This would cause the server to completely stop responding to any requests. This has been fixed.
- We found that a piece of code we use to profile database calls was creating many copies of identical objects that could never be released due to the way the internal Ebean profiler works. This would cause the server to run out of memory. This has been fixed.
- Out of an abundance of caution, we have rolled back some of the in-progress session timeout code for now and will reimplement this at a later date.
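For anyone curious what the health check change amounts to, here is a minimal sketch of the idea, not the actual CiviForm code. The class name, method name, and the use of 503 for the unhealthy case are all assumptions made for illustration; the probe stands in for a very cheap database call such as a `SELECT 1` query.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch of a health check that exercises the database; not the real controller. */
public class HealthCheckSketch {
  static final int OK = 200;
  static final int SERVICE_UNAVAILABLE = 503; // assumed status for the unhealthy case

  /**
   * Runs a lightweight database probe and maps the outcome to an HTTP status code.
   * The probe is expected to return true when the database answered.
   */
  static int healthStatus(Callable<Boolean> databaseProbe) {
    try {
      return Boolean.TRUE.equals(databaseProbe.call()) ? OK : SERVICE_UNAVAILABLE;
    } catch (Exception e) {
      // Any failure talking to the database means the server is not healthy.
      return SERVICE_UNAVAILABLE;
    }
  }

  public static void main(String[] args) {
    // Healthy case: the probe succeeds.
    System.out.println(healthStatus(() -> true));
    // Unhealthy case: the probe throws, as it would if the connection were down.
    System.out.println(healthStatus(() -> { throw new IllegalStateException("db down"); }));
  }
}
```

The point is that the endpoint only reports healthy when a real round trip to the database succeeds, so a server that is up but cannot reach its database no longer looks fine to the container health check.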
We're still tracking down some intermittent anomalies we are seeing in our nightly testing, but we think we have the major issues fixed.

Last week, we implemented a bandaid in the cloud-deploy-infra repo that changed the health check endpoint to /programs so that AWS could detect when a server became unhealthy and automatically restart it. This has now been changed back to /playIndex. If you set the CIVIFORM_CLOUD_DEPLOYMENT_VERSION variable to "latest" in order to apply this bandaid, make sure you set it back to "${CIVIFORM_VERSION}" for the next release.
If you have any questions or concerns, please don't hesitate to reach out to me or the team by email or in Slack, either in the #eng-general channel or in your specific government channel.