This only seems to happen when adding a node to an existing cluster; the servers that come up initially do not have the issue. I am getting this error in the logs:
2023-06-24 23:35:38,301 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0348: Timeout after [300] seconds waiting for service container stability. Operation will roll back. Step that first updated the service container was 'add' at address '[
("core-service" => "management"),
("management-interface" => "http-interface")
]'
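For context, the [300] seconds appears to be the default service-container stability timeout. My understanding (from the WildFly docs, not verified in this environment) is that it can be raised with a system property as a stopgap:

```shell
# In standalone.conf (or on the server launch command line):
# raise the container-stability timeout from the 300s default to 600s.
JAVA_OPTS="$JAVA_OPTS -Djboss.as.management.blocking.timeout=600"
```

That would presumably only mask the problem, though; the real question is why the node never stabilizes.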
I compared the debug logs with a working environment and noticed that the failed server's log is missing this line:
DEBUG [org.jboss.weld.BootstrapTracker] (MSC service thread 1-3) START bootstrap > startInitialization in the log.
There is a data cache set up with this configuration:
<cache-container name="replicated_cache" marshaller="JBOSS" modules="org.wildfly.clustering.server" statistics-enabled="true">
    <transport lock-timeout="60000"/>
    <replicated-cache name="DataCache" statistics-enabled="true">
        <transaction locking="OPTIMISTIC" mode="FULL_XA"/>
        <state-transfer timeout="${env.DataCache_STATE_TRANSFER_TIMEOUT:600000}"/>
    </replicated-cache>
</cache-container>
I noticed that on the working server the line
DEBUG [org.infinispan.cache.impl.CacheImpl] (ServerService Thread Pool -- 91) Started cache DataCache on xx.xx.xx.xxx
is in the log, but it is absent from the failed server's log. I do see it for the other caches, e.g.
Started cache org.infinispan.CONFIG on xxx.xxx.xx.xx
Question - Could the cache not starting be what prevents startInitialization, which in turn causes the deployment to fail? Would something like setting the state-transfer timeout to 0 help?
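For reference, this is the change I was considering - a sketch only, since I'm assuming (based on my reading of the WildFly Infinispan subsystem docs) that a timeout of 0 makes the cache start without blocking on the initial state transfer:

```xml
<replicated-cache name="DataCache" statistics-enabled="true">
    <transaction locking="OPTIMISTIC" mode="FULL_XA"/>
    <!-- Assumption: timeout="0" lets the cache start immediately and
         receive state in the background, instead of blocking the
         joining node until the transfer completes -->
    <state-transfer timeout="0"/>
</replicated-cache>
```

If that assumption is wrong, I'd rather find the root cause of the stalled state transfer than hide it.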