[Tango-cs-bug-info] [tango-cs:bugs] #787 Event errors persisting after server restart

tango-cs...@lists.sourceforge.net

Apr 5, 2016, 11:13:38 AM
to Tango-cs...@lists.sf.net

[bugs:#787] Event errors persisting after server restart

Status: open
Created: Tue Apr 05, 2016 03:12 PM UTC by Johan Forsberg
Last Updated: Tue Apr 05, 2016 03:12 PM UTC
Owner: nobody

We are experiencing issues with event subscriptions that only seem to manifest when subscribing to (change) events from large numbers of devices (100s) across many servers (10s). When restarting the servers, there is a certain probability that the client's subscriptions for all devices on a given server are somewhat corrupted.

The symptoms are that the "corrupt" subscriptions still appear to receive change events as usual, but they also receive errors like this every 10 seconds or so:

Error: tango://nb-johfor-0:10000/r3-312u5/wat/fsw-02/state
Tango error stack
Severity = ERROR
Error reason = API_EventTimeout
Desc : Event channel is not responding anymore, maybe the server or event system is down
Origin : EventConsumer::KeepAliveThread()

This is the same error that is correctly reported when the servers are really down; it just never goes away for some devices.

I have reproduced this running locally on my machine, with some 1000 dummy python devices across 20 servers, with polling on the State attribute, and with a minimal client written in C++ that listens to the State attribute for all the devices. After killing the servers and starting them again, but keeping the client running, around 3-4 random servers (and the corresponding 100s of devices) exhibit the above problem.

It is an Ubuntu machine, and I have tested with TANGO 8.1.2 (distribution packages) and 9.2.2 (built from tarball), both times with ZMQ 4.0.5. We also see it in our CentOS 7 production environment and with PyTango clients.

Some thoughts: this seems to be related to the ZMQ "keepalive" thread somehow. Perhaps it is not being informed about the new subscription and therefore keeps reporting the error. The randomness of the behavior also suggests a race condition. I am assuming this is a client issue; I have not really looked into whether other kinds of devices make a difference.
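To make the hypothesis above concrete, here is a toy model in Python of how a keep-alive check could keep reporting a timeout after a restart if it is never told about the new subscription. This is only a sketch of the reporter's guess, not Tango's actual code: the class `ToyKeepAlive`, the channel names, and the idea that a restarted server shows up under a new channel name are all assumptions made for illustration.

```python
class ToyKeepAlive:
    """Toy model of a client-side keep-alive check (NOT Tango's implementation)."""

    def __init__(self, timeout=10.0):
        self.timeout = timeout
        self.last_heartbeat = {}  # channel name -> time of last heartbeat seen

    def register(self, channel, now):
        self.last_heartbeat[channel] = now

    def unregister(self, channel):
        self.last_heartbeat.pop(channel, None)

    def heartbeat(self, channel, now):
        # Heartbeats for channels the monitor does not know about are dropped.
        if channel in self.last_heartbeat:
            self.last_heartbeat[channel] = now

    def timed_out(self, now):
        # Channels whose last heartbeat is older than the timeout.
        return sorted(
            ch for ch, seen in self.last_heartbeat.items()
            if now - seen > self.timeout
        )


def simulate():
    monitor = ToyKeepAlive(timeout=10.0)
    monitor.register("server-a/run-1", now=0.0)
    monitor.register("server-b/run-1", now=0.0)

    # Both (hypothetical) servers restart and come back around t=20.
    # server-a: the stale entry is replaced, as intended.
    monitor.unregister("server-a/run-1")
    monitor.register("server-a/run-2", now=20.0)
    # server-b: the monitor is never informed (the hypothesised race),
    # so the stale "run-1" entry is left behind.

    reports = []
    for now in (25.0, 35.0, 45.0):
        monitor.heartbeat("server-a/run-2", now)
        monitor.heartbeat("server-b/run-2", now)  # dropped: unknown channel
        reports.append(monitor.timed_out(now))
    return reports


if __name__ == "__main__":
    for report in simulate():
        print(report)
```

In this model the stale entry for the second server is reported on every check and never recovers, even though the restarted server is sending heartbeats, which matches the observed symptom in spirit. Whether the real cause looks anything like this is not established by the thread.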


Sent from sourceforge.net because Tango-cs...@lists.sf.net is subscribed to https://sourceforge.net/p/tango-cs/bugs/

To unsubscribe from further messages, a project admin can change settings at https://sourceforge.net/p/tango-cs/admin/bugs/options. Or, if this is a mailing list, you can unsubscribe from the mailing list.

tango-cs...@lists.sourceforge.net

Apr 7, 2016, 10:19:53 AM
to Tango-cs...@lists.sf.net

Hi Johan,

Would it be possible for you to reduce your Python device server code and your C++ client code to a minimum that still reproduces the problem, and send them to us? We could use them to reproduce and (hopefully) fix the problem.

Thanks in advance

Emmanuel


tango-cs...@lists.sourceforge.net

Apr 7, 2016, 11:55:47 AM
to tango-cs...@lists.sourceforge.net
Hi Emmanuel,

Thanks for replying!

I am including a minimal python device server and C++ client (the latter
written by my colleague Andreas) with which I have been able to locally
reproduce the issue. I created ~20 servers with ~100 devices each (not
sure if the number of devices actually matters) locally on my machine.
Then I ran the client with a wildcard to match all the devices, e.g.

./subscribe test/*/*

and let it set up all connections. Then I stopped the servers, and
started them again. The problem always seems to happen to the devices in
a random handful of the servers and does not go away.
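As an illustration of what a pattern like test/*/* selects, here is a small Python sketch that matches three-field Tango device names field by field. How subscribe.cpp actually expands the wildcard is not shown in this thread; the function `match_devices` and the device names below are hypothetical.

```python
from fnmatch import fnmatchcase


def match_devices(pattern, device_names):
    """Return the device names whose '/'-separated fields match the pattern.

    Matching is done per field, so '*' never spans a '/' separator.
    """
    pattern_fields = pattern.split("/")
    matched = []
    for name in device_names:
        fields = name.split("/")
        if len(fields) == len(pattern_fields) and all(
            fnmatchcase(field, pat) for field, pat in zip(fields, pattern_fields)
        ):
            matched.append(name)
    return matched


if __name__ == "__main__":
    # Hypothetical device names, for illustration only.
    devices = ["test/dummy/1", "test/dummy/2", "sys/tg_test/1"]
    print(match_devices("test/*/*", devices))
```

Matching per field rather than on the whole string is a deliberate choice here, since a plain glob match would let `*` swallow the `/` separators.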

Let me know if you have any problems reproducing the issue.

Cheers,

/Johan



pydsexp.py
subscribe.cpp

tango-cs...@lists.sourceforge.net

Apr 12, 2016, 6:31:40 AM
to Tango-cs...@lists.sf.net

Hello,

The bug fix is now committed in the repo. A patch file for Tango 9.2.2 will be available soon.

Cheers

Emmanuel



tango-cs...@lists.sourceforge.net

Apr 14, 2016, 5:24:15 AM
to Tango-cs...@lists.sf.net

Hi Emmanuel,
Great news! Will you also make a patch for Tango-8.1.2? I have tried to port the changes to the 8.1.2 source distribution (on top of patches 1-4). It seems to solve this problem but I can't tell if it breaks something else. Can you have a look at the attached patch and let us know if it is correct?

Thanks,
Andreas

Attachments:



tango-cs...@lists.sourceforge.net

Apr 14, 2016, 6:45:50 AM
to Tango-cs...@lists.sf.net

Hi Andreas,

I don't think we will make a patch for Tango 8. The patch attached to your post looks correct to me.
Note that it also fixes bug 788, but I don't think you will consider that a problem!

Cheers

Emmanuel



tango-cs...@lists.sourceforge.net

Apr 14, 2016, 7:15:48 AM
to Tango-cs...@lists.sf.net

Ok, thanks for checking.

Cheers,
Andreas

