I've been having issues with the scalarizr agent pretty much since I started using Scalr, and it seems to have gotten worse with each new agent version.
The environment:
Scalr 5.11.22 Community Edition
Scalarizr stable 4.6.6 through 4.10.0
-or-
Scalarizr latest 4.9.3 through 4.11.10
CentOS 7.2 instances
AWS and OpenStack clouds (although it happens 10x more often on OpenStack than on AWS)
The issues:
1) A small percentage of systems (though the percentage has increased with later scalarizr agent versions) get stuck in the Pending state. When I investigate these systems, the scalarizr agent appears to have completed the upgrade task and then crashed.
We're currently hosting 4.6.6 in a custom repo because that agent has the lowest failure rate (~3% of all launches) - the 4.10.0 version increased the failure rate to 20-25%!
2) A smaller percentage of systems get stuck in the Initializing state, with 2-3 failed message deliveries shown in the Scalr Internal Messaging panel. Once I realize the systems are stuck, I can resend the messages and the systems come up normally. I'm not sure whether this type of failure is more frequent with the later agent versions, since the failure rate on the first issue was so unacceptably high.