blacs devices become unresponsive for extended period

Michael D.

Jan 13, 2021, 2:43:11 PM
to the labscript suite
Hello,

I've been battling a strange issue where a BLACS device will enter the "device has not responded for xx seconds..minutes..hours..days" error state. This requires a manual restart of the tab, and sometimes the tab must be restarted multiple times to recover. It happens most often to the Pulseblaster, but I have also seen it happen to the NI device, and it affects multiple labs here across different setups.

I have tracked it down to the point where the tab object communicates with a worker to start a job, end a job, or get the results of said job. It happens intermittently and is difficult to reproduce. I've only been able to make the error more likely by stressing the computer's resources, but I've also seen it happen 10 minutes after a shot starts on a brand-new computer with plenty of resources to spare.

Through debugging I've seen the code stopping on these lines:

events = dict(self.poller.poll(timeout)) - in zprocess WriteQueue's put()

events = dict(self.in_poller.poll(timeout)) - in zprocess ReadQueue's get()

events = dict(self.out_poller.poll(timeout)) - in zprocess ReadQueue's put()

When this happens, the BLACS log will generally show the last update being:

DEBUG BLACS.pulseblaster_0.mainloop: Processing event _status_monitor
DEBUG BLACS.pulseblaster_0.mainloop: Instructing worker main_worker to do job check_status
DEBUG BLACS.pulseblaster_0.mainloop: Waiting for worker to acknowledge job request

and then hours of lines of INFO BLACS.AnalysisSubmission.mainloop: Processed signal: check/retry

This is not unique to the Pulseblaster; I think it just happens to it more often because it issues jobs the most frequently, due to its use of check_status to update the Pulseblaster state. The NI device has also, more rarely, entered this error state between shots while its workers are starting up or finishing. The issue occurs both while an experiment shot is running and while BLACS is idling.

I've mostly worked around this by modifying the tab_base_classes code to use a timeout value in its put/get calls. This lets BLACS catch the TimeoutError from zprocess and try again, circumventing the issue. However, there are still rare occasions where the issue happens anyway, almost as if the timeout value is being ignored. The only place that doesn't have a timeout (because I can't use one there) is the worker's mainloop, where it sits in a get() listening for a job event to come in from the parent.
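For reference, here's roughly what that workaround looks like. This is a minimal sketch rather than the actual tab_base_classes code: the send_job_with_retry helper and the queue names are my own illustration, assuming the modified put()/get() raise zprocess's TimeoutError when their timeout elapses.

# Hypothetical sketch of the retry-on-timeout workaround described above.
# `to_worker` / `from_worker` stand in for the zprocess WriteQueue / ReadQueue
# that the tab uses to talk to its worker; the names are illustrative only.
from zprocess import TimeoutError  # adjust the import if your zprocess version keeps it elsewhere

def send_job_with_retry(to_worker, from_worker, job, timeout=5, max_attempts=3):
    """Send a job to the worker, re-sending it if no acknowledgement arrives in time."""
    for attempt in range(max_attempts):
        to_worker.put(job, timeout=timeout)          # instruct the worker to do the job
        try:
            return from_worker.get(timeout=timeout)  # wait for the worker's acknowledgement
        except TimeoutError:
            continue                                  # assume the message was lost and resend
    raise RuntimeError('worker did not acknowledge job after %d attempts' % max_attempts)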

Is there a resource where I can find out exactly how those three lines listed above work? Is this poll function the one from Python's I/O library?

Michael D.

Jan 13, 2021, 3:28:33 PM
to the labscript suite
After sending this and digging further, I see that poll() is a method of zmq's Poller. Looking at the documentation, I don't see why it would wait there forever or ignore the timeout value.
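For anyone following along: pyzmq's Poller.poll() takes its timeout in milliseconds, returns a list of (socket, event) pairs, and only blocks forever if the timeout is None or negative; if the timeout elapses with nothing to read it just returns an empty list. A tiny standalone demonstration of that behaviour (plain pyzmq, not zprocess code, and the port number is arbitrary):

import zmq

ctx = zmq.Context.instance()
sock = ctx.socket(zmq.PULL)
sock.bind('tcp://127.0.0.1:5555')  # arbitrary port; nothing ever sends to it

poller = zmq.Poller()
poller.register(sock, zmq.POLLIN)

# With a finite timeout this returns after ~500 ms with no events, so the dict is empty:
events = dict(poller.poll(500))
print(events)  # -> {}

# With poller.poll(None) (or a negative timeout) the call would block here forever,
# which is the kind of indefinite wait described above when no message ever arrives.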

Zak V

Jan 14, 2021, 5:22:49 PM
to the labscript suite
Hi Michael,

I'm not very familiar with the internals of zprocess so this is a bit of a shot in the dark, but do you get this bug when using insecure communication? To check that, set `allow_insecure = True` in the `[security]` section of your labconfig. It may also be necessary to delete or comment out (with a semicolon) the `shared_secret` line, but I'm not sure. You may also need to restart zlock/zlog by killing those processes or restarting your computer.

A while ago I was having various intermittent interprocess communication issues with labscript, though different ones than what you're seeing. They were somewhat rare so they were hard to track down, but I believe they never occurred when using insecure communication.

Cheers,
Zak

Michael D.

Jan 14, 2021, 8:21:43 PM
to the labscript suite
Hi Zak, 

Thanks for the information. I'll definitely look into that and see if it helps. 

I've done further digging and experimenting and have narrowed down what's going on. What seems to be happening is that when the tab thread attempts to instruct the worker to do a job, the worker never receives the message. The mainloop in the tab object sends the worker a job via a put() call. This appears to run successfully, and the tab proceeds to wait for an acknowledgement from the worker with a get(). However, the worker never receives the command: even though the parent thinks the put() worked, the worker is still listening on the to_worker socket via the zmq poller in the get() of its own main while loop. So the worker sits on the events = dict(...) line while the parent tab waits for a response as well. Both the tab and the worker end up waiting on a get() that will never receive anything, hence the endless "hardware device has not responded for xx amount of time".

Since you can't use a timeout on the get() at the start of the worker's loop (it has to wait indefinitely for a job to come in), a hang in this specific situation looks as though my timeout values in the tab process are being ignored. If the hang happens in the tab instead, the timeout triggers and it either crashes after multiple failures or successfully resends the data. I'm not completely clear on why this miscommunication is happening, but whatever it is, it occurs intermittently when the tab tries to tell the worker to do something: the message is somehow getting lost along the way before the worker can hear it with its get().

dihm....@gmail.com

Jan 19, 2021, 10:38:29 AM
to the labscript suite
Michael,

I know I'm a bit late to the party here, and hopefully you have already figured things out. I just wanted to add that what you are seeing could be the result of a zombie zprocess thread. The thought is that you may have a rogue zprocess handler running around accepting queue commands, but one that isn't connected to your current BLACS instance. That could explain why zprocess thinks commands are being sent and received correctly while nothing is happening on the hardware side. Out of curiosity, what suite structure do you have? Is everything contained on one computer, or are the various labscript components spread over the network?

I also agree with Zak that testing zprocess in insecure mode is a good place to start. I would also recommend restarting all the zprocess threads (i.e. zprocess, zlock, zlog). The easiest way to do that is to close everything, then make sure all Python processes are closed via the task manager (assuming Windows, of course).

-David

dihm....@gmail.com

Apr 2, 2021, 2:27:44 PM
to the labscript suite
Michael,

I just remembered that at the JQI pow-wow I promised I would try to help sort this out, but I never followed up and don't have your direct e-mail. Is this still an issue for you? I am (maybe overly) confident we could figure it out over a video chat, if you're interested/available.

-David

Michael D.

Apr 5, 2021, 12:33:08 PM
to the labscript suite
Yes, this is still an intermittent problem. The Pulseblaster, NI devices, etc. 'not responding' has been pinned down to missed messages between the device tab and its workers when the tab instructs a worker to do something. I have been focusing my efforts on understanding the use of zmq and the network pattern labscript uses to pass messages between the device tab and its workers. Judging by the ZMQ socket types being used (the labscript team can correct me, but it appears the tabs and workers use PUSH/PULL between each other), in a perfect world there shouldn't be any dropped messages. These socket types don't silently drop messages; they block. It is unlikely that any messages sent between the worker and the device tab are hitting the high-water mark (HWM), and even if they were, the socket's response is to hold the messages in a queue rather than discard them. The messages are sent over localhost, so there's no external network to lose them to. In reality, however, the workers do miss a message from the device tab now and then, causing the parent to wait indefinitely, so something is causing these random losses.
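(As a quick sanity check of that "block, don't drop" behaviour, the snippet below, plain pyzmq on an arbitrary port and not labscript code, shows a PUSH socket with no connected peer refusing a non-blocking send with zmq.Again instead of silently discarding the message; a plain blocking send would simply wait.)

import zmq

ctx = zmq.Context.instance()
push = ctx.socket(zmq.PUSH)
push.bind('tcp://127.0.0.1:5556')  # arbitrary port, no PULL peer ever connects

try:
    push.send_pyobj('pretend job', flags=zmq.NOBLOCK)
except zmq.Again:
    # With no peer, PUSH cannot deliver: it refuses (or blocks) rather than dropping.
    print('send would block rather than drop the message')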

I've crossed out most of my thoughts on what's causing this, leaving it down to one of the following:

1) The messages are being dropped in the OS I/O buffer.

2) Something about the use of encryption is causing the missed messages. This is plausible because, assuming `allow_insecure = True` really does turn off encryption, I've yet to see a dropped message in my (limited) experimentation with insecure communication.

3) Some other hiccup is causing the socket to 'not be ready'. I'm likely to cross this one out as well, though, since my understanding is that even if a zmq socket isn't quite connected yet, any messages in the inbound buffer are held, not lost.

4) ZMQ will just occasionally lose messages, there's nothing you can do about it, and you have to code in some means of checking.

The current workaround is to put a short timeout on the parent device tab when it sends the 'do this job' message to the worker, and to re-send the message on a timeout exception. This has made the issue go away, but it doesn't really fix the source of the dropped messages.

My email is mdo...@mail.csuchico.edu and I'm more than happy to get any input or assistance in figuring this out.

Also, if anyone knows the network structure of labscript, that would be great. My current understanding is that a broker uses a PULL socket to receive messages from an event queue that sends with PUSH. The broker then uses an XPUB socket to send the events to the device tabs, which listen with SUB. The tabs instruct the workers to do tasks with PUSH, and the workers listen with pollers on a PULL. The tabs in turn listen for responses from the workers with pollers on a PULL, and the workers send confirmation and results with a PUSH. Whether this is correct or incorrect, I would be grateful to know.
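To make that concrete, here's a standalone sketch of the tab-to-worker leg as I understand it, with both ends in one process for brevity. This is plain pyzmq, not the actual labscript/zprocess code, and the socket types, port numbers, and message contents are only my illustration of the pattern described above.

import zmq

ctx = zmq.Context.instance()

# "Tab" side: PUSH jobs to the worker, PULL acknowledgements/results back.
to_worker = ctx.socket(zmq.PUSH)
to_worker.bind('tcp://127.0.0.1:6001')
from_worker = ctx.socket(zmq.PULL)
from_worker.bind('tcp://127.0.0.1:6002')

# "Worker" side (really a separate process): PULL jobs in, PUSH acknowledgements out.
job_in = ctx.socket(zmq.PULL)
job_in.connect('tcp://127.0.0.1:6001')
ack_out = ctx.socket(zmq.PUSH)
ack_out.connect('tcp://127.0.0.1:6002')

poller = zmq.Poller()
poller.register(job_in, zmq.POLLIN)

# The tab sends a job; the worker polls for it (timeout in ms), acknowledges it,
# and the tab receives the acknowledgement. In the worker's real mainloop the
# poll has no timeout, which is why a lost job message leaves it waiting forever.
to_worker.send_pyobj(('check_status', ()))
if dict(poller.poll(1000)):
    job = job_in.recv_pyobj()
    ack_out.send_pyobj('acknowledged: %s' % job[0])
print(from_worker.recv_pyobj())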