Yes, this is still an intermittent problem. The Pulseblaster, NI devices, etc. 'not responding' has been traced to missed messages between the device tab and its workers when the tab instructs a worker to do something. I have been focusing my efforts on understanding how labscript uses zmq and which messaging pattern it uses to pass messages between the device tab and its workers. Judging by the ZMQ socket types in use (maybe the labscript team can correct me, but it appears the tabs and workers use PUSH/PULL between each other), in a perfect world there shouldn't be any dropped messages. These socket types don't silently drop messages; they block. It is unlikely that the messages sent between the worker and the device tab are hitting the high-water mark (HWM), and even if they did, the socket's response is to hold the messages in a queue rather than discard them. The messages are sent over localhost, so there's no external network to lose them on. In reality, however, a worker misses a message from the device tab now and then, which leaves the parent waiting indefinitely, so something is causing these random losses.
I've crossed out most of my theories about the cause, which leaves these:
1) The messages are being dropped in the OS IO buffer.
2) Something about the use of encryption is causing the missed messages. This seems plausible because, assuming the `allow_insecure = True` setting really does turn off encryption, I have yet to see a dropped message in my limited experimentation with insecure communication.
3) Some other hiccup is causing the socket to 'not be ready'. I'm likely going to cross this one out as well, because my understanding is that even if a zmq socket isn't quite connected, any messages in the inbound buffer are held, not lost.
4) ZMQ will just occasionally lose messages, there's nothing you can do about it, and you have to code in some means of checking.
The current workaround is to put a short timeout on the parent device tab when it sends the 'do this job' message to the worker, and to re-send the message on a timeout exception. This has made the issue go away, but it doesn't fix the source of the dropped messages.
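For anyone who wants to try the same workaround, here is a minimal, transport-agnostic sketch of the timeout-and-resend idea. It is not labscript's code: `send_with_retry`, `demo`, and the flaky worker are hypothetical names, and plain `queue.Queue` objects stand in for the sockets so it runs without zmq. With pyzmq, `send` would presumably wrap the PUSH socket's `send` and `recv` would wrap a poll with a timeout followed by a `recv` on the PULL socket.

```python
import queue
import threading


def send_with_retry(send, recv, message, timeout=0.2, max_attempts=5):
    """Send `message`, wait up to `timeout` seconds for a reply, resend on timeout.

    `send(message)` and `recv(timeout)` are caller-supplied callables
    (hypothetical interface); `recv` must raise TimeoutError if nothing arrives.
    Returns (reply, number_of_attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        send(message)
        try:
            return recv(timeout), attempt
        except TimeoutError:
            continue  # no reply in time: assume the message was lost and resend
    raise TimeoutError(f"no reply after {max_attempts} attempts")


def demo():
    """Stand-in for a tab and a worker that loses the first message it is sent."""
    jobs, replies = queue.Queue(), queue.Queue()

    def flaky_worker():
        dropped = False
        while True:
            job = jobs.get()
            if job is None:      # sentinel: shut the worker down
                return
            if not dropped:      # simulate one lost message
                dropped = True
                continue
            replies.put(("done", job))

    def recv(timeout):
        try:
            return replies.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError("no reply from worker") from None

    worker = threading.Thread(target=flaky_worker)
    worker.start()
    try:
        return send_with_retry(jobs.put, recv, "do this job")
    finally:
        jobs.put(None)
        worker.join()


if __name__ == "__main__":
    print(demo())
```

One caveat with this approach: if the first message was only delayed rather than lost, the worker ends up running the job twice, so tagging each job with an ID that the worker can use to discard duplicates is probably worth considering.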
My email is mdo...@mail.csuchico.edu, and I'm more than happy to get any input or assistance in figuring this out.
Also, if anyone knows the network structure of labscript, that would be great. My current understanding is that a broker uses a PULL socket to receive messages that an event queue sends with a PUSH. The broker then uses an XPUB to publish the events to the device tabs, which listen with SUB. The tabs instruct the workers to do tasks with PUSH, and the workers listen with pollers on a PULL. The tabs then listen with pollers on a PULL for the workers' responses, which the workers send (confirmation and results) with a PUSH. I would be grateful to know whether this is correct.
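To make the tab/worker part of that picture concrete, here is a rough pyzmq sketch of what I think the exchange looks like: the tab PUSHes a job, the worker polls a PULL for it, then PUSHes a result back to a PULL that the tab is polling. This is my guess at the pattern, not labscript's actual code; the `inproc://` addresses, the `program_manual` job name, and the function names are all made up for illustration, and there is no encryption or broker here.

```python
import threading

import zmq


def worker(ctx, jobs_addr, results_addr):
    """Hypothetical worker: poll a PULL socket for one job, PUSH back a result."""
    jobs = ctx.socket(zmq.PULL)
    jobs.connect(jobs_addr)
    results = ctx.socket(zmq.PUSH)
    results.connect(results_addr)

    poller = zmq.Poller()
    poller.register(jobs, zmq.POLLIN)
    if jobs in dict(poller.poll(5000)):  # wait up to 5 s for a job
        job = jobs.recv_json()
        results.send_json({"job": job["name"], "status": "done"})

    jobs.close()
    results.close()


def run_once():
    """Hypothetical tab side: PUSH one job, then poll a PULL socket for the reply."""
    ctx = zmq.Context()
    to_worker = ctx.socket(zmq.PUSH)
    to_worker.bind("inproc://to_worker")
    from_worker = ctx.socket(zmq.PULL)
    from_worker.bind("inproc://from_worker")

    t = threading.Thread(
        target=worker,
        args=(ctx, "inproc://to_worker", "inproc://from_worker"),
    )
    t.start()

    # A PUSH socket with no connected peer blocks here until the worker connects.
    to_worker.send_json({"name": "program_manual"})

    poller = zmq.Poller()
    poller.register(from_worker, zmq.POLLIN)
    reply = None
    if from_worker in dict(poller.poll(5000)):  # wait up to 5 s for the result
        reply = from_worker.recv_json()

    t.join()
    to_worker.close()
    from_worker.close()
    ctx.term()
    return reply


if __name__ == "__main__":
    print(run_once())
```

If the real structure differs from this (for example REQ/REP instead of paired PUSH/PULL), that would change which failure modes are worth chasing.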