Hey All,
I have an odd issue and I was hoping someone could point me in the right direction. I'm typically good at researching and figuring things out (with Google's help, of course :) but I'm really lost on this one.
I'm using Payara (4.1.1.162 build 116) in an HA clustered environment and it's working fairly well. Recently, however, I have started to get complaints from employees using my application that jobs are getting stuck.
We use JMS queues to trigger work on the back end. The work we submit is incremental: one process (a Message Driven Bean) picks the message up and does its work, and the last thing it does is send a new message to a different queue, which starts a new process that works the data in a different way. Multiple processes chained like this form a workflow.
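To make the pattern concrete, here is a rough plain-Java sketch of the chain. It is not my real code: the stage names are made up, in-memory queues stand in for the JMS queues, and a simple loop stands in for each bean's message handler.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Plain-Java stand-in for the workflow; the real system uses JMS queues and MDBs.
class WorkflowSketch {
    // Hypothetical stage names; each stage listens on its own queue.
    static final String[] STAGES = {"validate", "transform", "publish"};

    // Pushes one batch through every stage and returns the log lines produced.
    static List<String> runBatch(String batchId) {
        List<String> log = new ArrayList<>();
        List<Queue<String>> queues = new ArrayList<>();
        for (int i = 0; i < STAGES.length; i++) {
            queues.add(new ArrayDeque<>());
        }

        // The initial submission kicks off the first stage.
        queues.get(0).add(batchId);
        log.add("SENT " + STAGES[0] + " " + batchId);

        for (int i = 0; i < STAGES.length; i++) {
            String msg = queues.get(i).poll();
            if (msg == null) {
                break; // nothing arrived: this is what a stuck batch looks like
            }
            log.add("RECEIVED " + STAGES[i] + " " + msg);
            // ... stage-specific work would happen here ...
            if (i + 1 < STAGES.length) {
                // Last thing a stage does: hand the batch to the next queue.
                queues.get(i + 1).add(msg);
                log.add("SENT " + STAGES[i + 1] + " " + msg);
            }
        }
        return log;
    }
}
```

In a healthy run every SENT line is followed by a RECEIVED line for the same stage and batch.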
We have 2 servers in a cluster and I'm using MS SQL Server for the backend message persistence. Each node is set up as a LOCAL broker. I have a ton of logging: I can see when each process gets a message (the start of onMessage), and I log when the process completes and after I call send for the next message. I can see the cluster working because the same batch ID bounces back and forth from server to server as it progresses through the workflow.
But very rarely I get reports that a batch is not advancing like it should, and when I view the logs I can see that at some point I get my "message sent" log entry but never see the "message received" log entry from the next process. We have ways of deleting a batch and starting over, and most of the time restarting a batch will cause it to run through the whole process without a hitch.
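This is roughly how the sent/received entries can be paired up across the merged logs from both nodes. It's only a sketch: the log format (`SENT queue=... batch=...` and `RECEIVED queue=... batch=...`) and the class name are assumptions for illustration, not my actual log layout.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Finds sends that were never matched by a receive in a set of log lines.
class StuckBatchFinder {
    // Assumed (hypothetical) log format: "... SENT queue=<name> batch=<id>"
    // and "... RECEIVED queue=<name> batch=<id>".
    private static final Pattern EVENT =
            Pattern.compile("\\b(SENT|RECEIVED) queue=(\\S+) batch=(\\S+)");

    // Returns "queue/batch" keys with more SENT than RECEIVED entries.
    static List<String> findUnreceived(List<String> lines) {
        Map<String, Integer> outstanding = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = EVENT.matcher(line);
            if (!m.find()) {
                continue; // unrelated log line
            }
            String key = m.group(2) + "/" + m.group(3);
            int delta = m.group(1).equals("SENT") ? 1 : -1;
            // Summing makes the result independent of line order, which
            // matters when logs from two nodes are simply concatenated.
            outstanding.merge(key, delta, Integer::sum);
        }
        List<String> stuck = new ArrayList<>();
        for (Map.Entry<String, Integer> e : outstanding.entrySet()) {
            if (e.getValue() > 0) {
                stuck.add(e.getKey());
            }
        }
        return stuck;
    }
}
```

Counting rather than using a plain set means a batch that was restarted and then completed does not hide the earlier send that was never received.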
I have spent days digging, logging, and debugging. I'm fairly confident the code and server configuration are good, because clearing and restarting a stuck batch works. My processes are not hanging midway, because I always see the "message sent" log entry right before the bean exits.
Where can I look? I would love to resolve this issue but I'm really at a loss and have no idea where to start. I'm not getting exceptions or errors in my logs, and stopping one instance so that only a single instance runs in the cluster resolves the issue. It's not a single Message Driven Bean that is the problem; I have seen batches get stuck at many points in the chain. I don't see any messages in the dead message queue (DMQ).
Has anyone had similar issues and might know where I could start poking around? Your help is greatly appreciated.