Scalarizr agent issues


slop...@gmail.com

Oct 19, 2016, 2:29:17 PM
to scalr-discuss
I've been having issues with the scalarizr agent pretty much since I started using Scalr, and it seems to have gotten worse with each new agent version.

The environment:
Scalr 5.11.22 Community Edition
Scalarizr stable 4.6.6 through 4.10.0
-or-
Scalarizr latest 4.9.3 through 4.11.10
CentOS 7.2 instances
AWS and OpenStack clouds (although it happens 10x more on OpenStack than in AWS)

The issues:
1) A small percentage of systems (though the percentage has increased with later scalarizr agent versions) will get stuck in Pending state. Investigating these systems, the scalarizr agent appears to have completed the upgrade task and then crashed.
We're currently hosting 4.6.6 in a custom repo because that agent has the lowest rate of failure (~3% of all launches). The 4.10.0 version increased the failure rate to 20-25%!

2) A smaller percentage of systems will get stuck in Initializing state, with 2-3 failed message deliveries in the Scalr Internal Messaging panel. Once I realize the systems are stuck, I can resend the messages and the systems will come up normally. I'm not sure if the rate of this type of failure is higher with the later version of the agent, since the failure rate on the first issue was so unacceptably high.




Marc O'Brien

Oct 19, 2016, 3:02:14 PM
to scalr-discuss
Hello Slopshid,

In case you did not have the link, our Agent change log is available here. We have not had similar reports of high-occurrence intermittent "Pending" or "Initializing" state issues as you have described. A full copy of your agent logs from one of these instances using the latest agent version would be helpful to understand what is happening. Likewise, due to the intermittent nature of the issue you are describing, it would be useful to determine if there are any factors common to the failing instances, such as Cloud Platform as you noted, OS, Role, time of day, network or server load, etc.

Many thanks,
Wm. Marc O'Brien
Scalr Technical Support

James Smith

Oct 22, 2016, 8:57:43 AM
to scalr-discuss
Quite correct, Marc. I found a correlation between very high internal network congestion and both the stuck Pending and stuck Initializing systems.

For the stuck pending systems, if I ssh into the system and manually upgrade the agent then restart, it proceeds to the next step. This is an acceptable workaround for my purposes.

For the stuck initializing systems, once the messages have been marked Failed in the Scalr Internal Messaging panel I can hit Re-send message and the messages will succeed. Is there any way to increase the time between retries, or the number of retries, or both?
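To make the two knobs in question concrete, here is a minimal Python sketch of a generic retry loop with a configurable attempt count and a growing delay between attempts. It is not Scalr's msgSender code, and `send_with_retries`, `max_attempts`, `base_delay`, and `factor` are hypothetical names chosen for illustration only.

```python
import time


def send_with_retries(send, max_attempts=5, base_delay=1.0, factor=2.0,
                      sleep=time.sleep):
    """Call send() until it succeeds or max_attempts is exhausted.

    Hypothetical helper, not part of Scalr. Returns the number of attempts
    used; re-raises the last error if every attempt fails.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            send()
            return attempt
        except OSError:
            # Network-level failures (ConnectionError, timeouts) are OSError
            # subclasses; give up only after the final attempt.
            if attempt == max_attempts:
                raise
            sleep(delay)      # wait before the next attempt
            delay *= factor   # lengthen the wait each time
```

With `base_delay=1.0` and `factor=2.0`, the waits between attempts are 1 s, 2 s, 4 s, and so on, which gives a congested network more time to recover than a fixed short interval would.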

James Smith

Oct 25, 2016, 11:34:55 AM
to scalr-discuss
Anyone? Guidance on increasing the time between msgSender retries, or increasing the number of retries?

Marc O'Brien

Nov 1, 2016, 4:41:41 PM
to scalr-discuss
Hi James,

As you had noted, it sounds like there may be connectivity and/or congestion issues between your Scalr servers and Scalr managed instances that need to be checked and mitigated. For your first issue I am glad to hear you have a workaround, but this should not be necessary. A copy of your scalarizr agent logs (/var/log/scalarizr*.log) and your meta-data files would be useful to investigate that issue. Regarding the second issue with stuck initializing systems, we would likewise need to see a copy of your Scalr server logs (/opt/scalr-server/var/logs) to see what is happening under the hood.
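One possible way to gather those files into a single archive is sketched below in Python. The `bundle_logs` helper and the output path are hypothetical; only the log locations come from this thread. Files are stored under their base names, so two files with the same name in different directories would collide in this simple version.

```python
import glob
import os
import tarfile


def bundle_logs(patterns, out_path):
    """Add every regular file matching the glob patterns to a .tar.gz.

    Hypothetical helper for illustration. Returns the list of files added.
    """
    added = []
    with tarfile.open(out_path, "w:gz") as tar:
        for pattern in patterns:
            for path in sorted(glob.glob(pattern)):
                if os.path.isfile(path):
                    # Store each file under its base name inside the archive.
                    tar.add(path, arcname=os.path.basename(path))
                    added.append(path)
    return added


# Example call on an instance (output path is an assumption):
#   bundle_logs(["/var/log/scalarizr*.log"], "/tmp/scalarizr-logs.tar.gz")
# On the Scalr server, a pattern such as "/opt/scalr-server/var/logs/*"
# would pick up the files directly inside that directory.
```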


Many thanks,
Wm. Marc O'Brien
Scalr Technical Support
