[bugs:#787] Event errors persisting after server restart
Status: open
Created: Tue Apr 05, 2016 03:12 PM UTC by Johan Forsberg
Last Updated: Tue Apr 05, 2016 03:12 PM UTC
Owner: nobody
We are experiencing issues with event subscriptions that only seem to manifest when subscribing to (change) events from large numbers of devices (100s) across many servers (10s). When the servers are restarted, there is a certain probability that the client's subscriptions for all devices on a given server end up somewhat corrupted.
The symptoms are that the "corrupt" subscriptions still appear to receive change events as usual, but they also receive errors like this every 10 seconds or so:
Error: tango://nb-johfor-0:10000/r3-312u5/wat/fsw-02/state
Tango error stack
Severity = ERROR
Error reason = API_EventTimeout
Desc : Event channel is not responding anymore, maybe the server or event system is down
Origin : EventConsumer::KeepAliveThread()
This is the same error that is correctly reported when the servers are really down; it just never goes away for some devices.
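As a rough illustration of the symptom described above, the following Python sketch flags subscriptions that keep getting API_EventTimeout errors interleaved with normal change events. The record layout (timestamp, attr_name, reason), the function name and the min_cycles threshold are all hypothetical and not part of Tango; it is only a heuristic for telling this pattern apart from a server that is really down.

```python
from collections import defaultdict

TIMEOUT_REASON = "API_EventTimeout"


def find_suspect_subscriptions(records, min_cycles=2):
    """Return attribute names whose timeouts keep alternating with data.

    `records` is an iterable of (timestamp, attr_name, reason) tuples
    (a hypothetical layout). `reason` is None for a normal change event
    and the Tango error reason string for an error event.
    """
    saw_timeout = defaultdict(bool)         # a timeout has been seen
    data_since_timeout = defaultdict(bool)  # data arrived after that timeout
    cycles = defaultdict(int)               # timeout -> data -> timeout count

    # Process in time order; sort on the timestamp only.
    for _timestamp, attr_name, reason in sorted(records, key=lambda r: r[0]):
        if reason is None:
            if saw_timeout[attr_name]:
                data_since_timeout[attr_name] = True
        elif reason == TIMEOUT_REASON:
            if data_since_timeout[attr_name]:
                # Data was flowing, yet the timeout error came back.
                cycles[attr_name] += 1
            saw_timeout[attr_name] = True
            data_since_timeout[attr_name] = False

    return sorted(name for name, n in cycles.items() if n >= min_cycles)
```

A server that is really down produces timeouts with no data in between, and a server that recovered cleanly produces data with no further timeouts, so neither should be flagged. A single restart can legitimately produce one timeout-data-timeout cycle, which is why the assumed default threshold is 2.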
I have reproduced this running locally on my machine, with some 1000 dummy python devices across 20 servers, with polling on the State attribute, and with a minimal client written in C++ that listens to the State attribute for all the devices. After killing the servers and starting them again, but keeping the client running, around 3-4 random servers (and the corresponding 100s of devices) exhibit the above problem.
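For reference, a client of the kind described above could look roughly like the sketch below. It is written in Python rather than C++ and is not the client used for the report. The class and function names are made up for illustration, and the proxy factory is passed in as a parameter so that the sketch does not itself depend on PyTango being installed.

```python
class StateEventLogger:
    """Callback object: PyTango calls push_event() for each event."""

    def __init__(self):
        self.records = []

    def push_event(self, event):
        # For error events, `err` is true and `errors` holds the error stack.
        reason = event.errors[0].reason if event.err else None
        self.records.append((event.attr_name, reason))


def subscribe_all(device_names, make_proxy, event_type, callback):
    """Subscribe `callback` to State events on every named device.

    `make_proxy` is expected to behave like tango.DeviceProxy and
    `event_type` like tango.EventType.CHANGE_EVENT (both assumptions here).
    Returns {device_name: (proxy, event_id)} so that the proxies stay alive.
    """
    subscriptions = {}
    for name in device_names:
        proxy = make_proxy(name)
        event_id = proxy.subscribe_event("State", event_type, callback)
        subscriptions[name] = (proxy, event_id)
    return subscriptions
```

With PyTango available, this would be called as subscribe_all(names, tango.DeviceProxy, tango.EventType.CHANGE_EVENT, logger), and the process would then be kept running while the servers are killed and restarted. Afterwards, logger.records shows which attributes still report API_EventTimeout.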
It is an Ubuntu machine, and I have tested with TANGO 8.1.2 (distribution packages) and 9.2.2 (built from tarball), both times with ZMQ 4.0.5. We also see it in our CentOS 7 production environment and with PyTango clients.
Some thoughts: It seems like this is somehow related to the ZMQ "keepalive" thread. Perhaps it is not being informed about the new subscription and therefore keeps reporting the error. The randomness in the behavior also suggests some race condition. I'm assuming this is a client issue; I have not really looked into whether other kinds of devices make a difference.
Hi Johan,
Is it possible for you to minimise your Python device server code and your C++ client code (while still reproducing the problem) and send them to us? We could use them to reproduce and (hopefully) fix the problem.
Thanks in advance
Emmanuel
Hello,
The bug fix is now committed in the repo. A patch file for Tango 9.2.2 will be available soon.
Cheers
Emmanuel
[bugs:#787] Event errors persisting after server restart
Status: open
Created: Tue Apr 05, 2016 03:12 PM UTC by Johan Forsberg
Last Updated: Thu Apr 07, 2016 02:18 PM UTC
Owner: nobody
Hi Emmanuel,
Great news! Will you also make a patch for Tango-8.1.2? I have tried to port the changes to the 8.1.2 source distribution (on top of patches 1-4). It seems to solve this problem but I can't tell if it breaks something else. Can you have a look at the attached patch and let us know if it is correct?
Thanks,
Andreas
[bugs:#787] Event errors persisting after server restart
Status: open
Created: Tue Apr 05, 2016 03:12 PM UTC by Johan Forsberg
Last Updated: Tue Apr 12, 2016 10:30 AM UTC
Owner: nobody
Hi Andreas,
I don't think we will make a patch for Tango 8. The patch attached to your post seems correct to me.
Note that it also solves bug 788, but I don't think you will consider that a problem!
Cheers
Emmanuel
[bugs:#787] Event errors persisting after server restart
Status: open
Created: Tue Apr 05, 2016 03:12 PM UTC by Johan Forsberg
Last Updated: Thu Apr 14, 2016 09:23 AM UTC
Owner: nobody
Ok, thanks for checking.
Cheers,
Andreas