WildFly messaging causes log errors due to metrics collection


Bernd Köcke

Dec 8, 2022, 12:21:48 PM
to WildFly
Hello all,

I have some issues with smallrye-metrics and the messaging subsystem in WildFly 26.1.2.Final and 27.0.0.Final, and I hope you can help me. The metrics subsystem seems to scan all subsystems at startup for later metric collection. But if some resources are deleted afterwards, they still get queried every time someone calls the "/metrics" endpoint. Because the resource is gone, there are "ERROR" logs from the model controller.

Easy setup to reproduce it (a CLI sketch follows this list):
  • define a jms-queue in messaging; if you did this via CLI, restart/reload the server
  • when WildFly is up, call the metrics endpoint and you will see the metrics for the jms-queue
  • remove the jms-queue via CLI
  • call the metrics endpoint again and you see errors for the deleted jms-queue in your log
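For illustration, a minimal sketch of these steps in CLI. The queue name is just a placeholder, the server name "default" is the one from the standard standalone-full profiles, and the metrics endpoint is assumed to be on the management interface (default port 9990):

/subsystem=messaging-activemq/server=default/jms-queue=TestQueue:add(entries=["java:/jms/queue/TestQueue"])
:reload
# call http://localhost:9990/metrics -> the metrics for TestQueue are listed
/subsystem=messaging-activemq/server=default/jms-queue=TestQueue:remove
# call http://localhost:9990/metrics again -> errors for the deleted queue appear in the server log
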
Why is this an issue? I use WildFly in Kubernetes in a stateful set with embedded messaging and persistent volumes. To get messaging and HA, I use standalone-full-ha.xml as a basis and change the following things:
  • add a JGroups stack for messaging
  • use the messaging live-only HA policy with scale-down
  • use port 8180 exclusively for the messaging HTTP connector
  • configure the given messaging cluster connection with some timeouts and switch off message load balancing
Some of these configuration options may not be necessary, but I added them because I thought they might solve my issues.
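For example, two of these tweaks look roughly like this in CLI. This is only a sketch: I assume the default server name "default" and the default cluster connection "my-cluster" from standalone-full-ha.xml, and the timeout value is just an example:

/subsystem=messaging-activemq/server=default/ha-policy=live-only:add(scale-down=true)
/subsystem=messaging-activemq/server=default/cluster-connection=my-cluster:write-attribute(name=message-load-balancing-type, value=OFF)
/subsystem=messaging-activemq/server=default/cluster-connection=my-cluster:write-attribute(name=call-timeout, value=30000)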

If I set the replica count to two, pod "0" starts up. Pod "1" does the same, but it creates a "runtime-queue" with "notif." in its name and a time-based UUID. I think this is triggered by the cluster-connection, and I think it is pod 0 which connects to it. If for some reason pod 0 has to restart, this runtime queue becomes invalid and is removed. When pod 0 comes back again, it creates a runtime queue of its own. Pod 1 stays without such a queue and connects to the new runtime-queue on pod 0.
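The runtime queues can be listed via CLI, for example (again assuming the server name "default"):

/subsystem=messaging-activemq/server=default:read-children-names(child-type=runtime-queue)

On pod 1 this shows the additional queue with the "notif." prefix; after pod 0 restarts, the entry disappears, but the metrics subsystem still tries to read it.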

The WildFly instances are working as expected, but every time the metrics are collected on pod 1, errors get logged, because the metrics subsystem tries to get values from the old "notif." runtime queue, which is gone.

The second issue is that I can't get rid of it:
  • Start the stateful set with two pods
  • pod 0 has no "notif." queue
  • pod 1 has such a runtime queue to which pod 0 is connected
  • pod 0 has to restart
  • the "notif." runtime queue on pod 1 gets invalid and is deleted
  • now pod 1 is the only one in this stateful set
  • every time the metrics endpoint of pod 1 is called, it logs errors because of the deleted runtime-queue resource
  • now wait until pod 0 is up again
  • because pod 0 finds pod 1, pod 0 creates such a runtime-queue to which pod 1 connects
  • pod 1 doesn't have such a runtime queue now, only pod 0
  • all is working fine, but pod 1 continues to log errors
  • restart pod 1 to get rid of the errors
  • now pod 0 is the only one in the stateful set, but because pod 1 is gone, the runtime-queue on pod 0 becomes invalid and is deleted
  • now pod 0 logs errors
  • pod 1 comes up again, finds pod 0
  • pod 1 creates a runtime queue again with a new UUID and pod 0 connects to it
  • pod 0 doesn't have such a queue anymore
  • we are back at the beginning, where only pod 1 has such a runtime queue
  • now pod 1 is okay, but pod 0 continues to log errors
  • if you restart pod 0 to get rid of the errors, the whole process starts again.
I can't reach the original situation where both pods are running and none is logging errors.

I hope it is only a matter of configuration. One possibility could be to delay the messaging cluster startup until the metrics subsystem has collected the resources. But I haven't found a configuration option for this.

Do you have a solution for such a setup without errors from metrics collection?

Thanks in advance for your time and help!

Regards
Bernd

Brian Stansberry

Dec 8, 2022, 6:38:27 PM
to WildFly
Hi Bernd,

I consider this to be a bug, so I filed https://issues.redhat.com/browse/WFLY-17381. It's more of a software design bug, as the software is doing what it was intended to do, but that's inadequate.

A (perhaps not great) workaround is to use the metrics subsystem 'exposed-subsystems' attribute[1] to list the subsystems for which you want metrics, not including messaging-activemq. That's not great if you want metrics from many subsystems. I filed a JIRA[2] to possibly improve that.
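For example, something roughly like this, where the subsystem names are just placeholders for whatever you actually want to expose:

/subsystem=metrics:write-attribute(name=exposed-subsystems, value=["undertow","datasources"])
:reload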


Best regards,
Brian

Bernd Köcke

Dec 9, 2022, 6:56:38 AM
to WildFly
Hi Brian,

Thanks a lot for your fast response and for filing the issues already.

Thanks for your hint about the configuration in the metrics subsystem. But the metrics for messaging queues are really useful, especially in a cloud environment, which is why I haven't switched them off. I think I don't need the values for runtime queues, though. I looked into the code of the metrics subsystem and found that it is not so easy to enhance the configuration of scanned subsystems. Maybe it could be done with an allow and deny list and paths inside subsystems. And the most challenging thing will be to implement it in a way that is easy to understand.

Thanks again, best regards,
Bernd

Bernd Köcke

Dec 16, 2022, 2:39:53 AM
to WildFly
Hi Brian,

We found a workaround for the error log lines. A colleague reminded me that I had used a filter-spec with a logger in the past to rewrite some log lines. I hadn't tried this because the error log line looks like JSON and I thought I would have to select something in that structure. But in the end it's only text. So I use this CLI command to add a special logger:

/subsystem=logging/logger=org.jboss.as.controller.management-operation:add(filter-spec="any(not(match(\"WFLYCTL0013\")),not(match(\"\\\"messaging-activemq\\\"\")),not(match(\"\\\"runtime-queue\\\"\")),not(match(\"not found\")))",level=INFO,use-parent-handlers=true)

And this works. The log line is discarded only if all match expressions match. This is not really a resolution, because each time the metrics are collected, several exceptions still happen in the background. But it works for the moment.
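In case it helps later, the logger can be inspected or removed again with the standard CLI operations:

/subsystem=logging/logger=org.jboss.as.controller.management-operation:read-resource
/subsystem=logging/logger=org.jboss.as.controller.management-operation:remove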

Best regards
Bernd

Brian Stansberry

Dec 19, 2022, 5:42:55 PM
to WildFly

Thanks, Bernd!  I added this as a workaround suggestion on https://issues.redhat.com/browse/WFLY-17381.

Best regards,
Brian