Hello all,
I have some issues with smallrye-metrics and the messaging subsystem in WildFly 26.1.2.Final and 27.0.0.Final, and I hope you can help me. The metrics subsystem seems to scan all subsystems at startup for later metric collection. But if some resources are deleted afterwards, they are still queried every time someone calls the "/metrics" endpoint. Because the resource is gone, the model controller logs "ERROR" messages.
Easy setup to reproduce it:
- define a jms-queue in the messaging subsystem. If you do this via CLI, restart/reload the server
- when WildFly is up, call the metrics endpoint; you will see the metrics for the jms-queue
- remove the jms-queue via CLI
- call the metrics endpoint again and you will see errors for the deleted jms-queue in your log
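For reference, the reproduction steps above roughly as CLI commands (the queue name and JNDI entry are example values, and the metrics endpoint is assumed to be on the default management port 9990):

```shell
# Connect to a running standalone server
$JBOSS_HOME/bin/jboss-cli.sh --connect

# 1. Define a JMS queue (example name), then reload
/subsystem=messaging-activemq/server=default/jms-queue=demoQueue:add(entries=["java:/jms/demoQueue"])
:reload

# 2. Scrape the metrics endpoint; the queue's metrics are listed
#    curl http://localhost:9990/metrics

# 3. Remove the queue again
/subsystem=messaging-activemq/server=default/jms-queue=demoQueue:remove

# 4. Scrape /metrics again -> ERROR lines from the model controller in server.log
```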
Why is this an issue? I run WildFly in Kubernetes in a stateful set with embedded messaging and persistent volumes. To get messaging and HA, I use standalone-full-ha.xml as a basis and change the following things:
- add a JGroups stack for messaging
- messaging live-only ha policy with scale down
- use port 8180 exclusively for the messaging HTTP connector
- configure the given messaging cluster connection with some timeouts and switch off message load balancing
Some of these configuration options may not be necessary, but I added them because I thought they might solve my issues.
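In CLI form, the last two of these changes look roughly like the sketch below (the cluster-connection name and all timeout values are illustrative, and whether live-only accepts scale-down this way is my assumption; the JGroups stack and the dedicated port-8180 listener are omitted for brevity):

```shell
$JBOSS_HOME/bin/jboss-cli.sh --connect

# live-only HA policy with scale-down (assumed attribute shape)
/subsystem=messaging-activemq/server=default/ha-policy=live-only:add(scale-down=true)

# tune the existing cluster connection and switch off message load balancing
/subsystem=messaging-activemq/server=default/cluster-connection=my-cluster:write-attribute(name=call-timeout,value=60000)
/subsystem=messaging-activemq/server=default/cluster-connection=my-cluster:write-attribute(name=message-load-balancing-type,value=OFF)
```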
If I set the replica count to two, pod "0" starts up. Pod "1" does the same, but it additionally creates a runtime queue with "notif." in its name and a time-based UUID. I think this is triggered by the cluster connection, and that it is pod 0 which connects to it. If for some reason pod 0 has to restart, this runtime queue becomes invalid and is removed. When pod 0 comes back, it creates such a runtime queue itself. Pod 1 stays without one and connects to the new runtime queue on pod 0.
The WildFly instances work as expected, but every time metrics are collected on pod 1, errors are logged, because the metrics subsystem tries to read values from the old "notif." runtime queue, which is gone.
The second issue is that I can't get rid of it:
- Start the stateful set with two pods
- pod 0 has no "notif." queue
- pod 1 has such a runtime queue to which pod 0 is connected
- pod 0 has to restart
- the "notif." runtime queue on pod 1 becomes invalid and is deleted
- now pod 1 is the only one in this stateful set
- every time the metrics endpoint of pod 1 is called it logs errors because of the deleted runtime queue resource
- now wait until pod 0 is up again
- because pod 0 finds pod 1, pod 0 creates such a runtime-queue to which pod 1 connects
- pod 1 doesn't have such a runtime queue now, only pod 0
- all is working fine, but pod 1 continues to log errors
- restart pod 1 to get rid of the errors
- now pod 0 is the only one in the stateful set, but because pod 1 is gone, the runtime queue on pod 0 becomes invalid and is deleted
- now pod 0 logs errors
- pod 1 comes up again, finds pod 0
- pod 1 creates a runtime queue again with a new UUID and pod 0 connects to it
- pod 0 doesn't have such a queue anymore
- we are back at the beginning: only pod 1 has such a runtime queue
- now pod 1 is okay, but pod 0 continues to log errors
- if you restart pod 0 to get rid of the errors, the whole process starts again.
I can't reach the original situation where both pods are running and none is logging errors.
I hope it is only a matter of configuration. One possibility could be to delay the messaging cluster startup until the metrics subsystem has collected the resources. But I haven't found a configuration option for this.
Do you have a solution for such a setup without errors from metrics collection?
Thanks in advance for your time and help!
Regards
Bernd