Hi Michael,
There is a tool we use to debug lockups here: https://github.com/acq4/acq4/blob/develop/acq4/pyqtgraph/debug.py#L1098
Just create an instance, and every 10 seconds it will print a stack trace from every running thread. This should make it possible to determine where the system is getting hung up.
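For example, something like this (the import path and the `interval` argument here are from memory, so check them against the linked debug.py before relying on it):

    # Rough usage sketch -- verify the class name and arguments against the
    # linked debug.py in your checkout.
    from acq4.pyqtgraph import debug as pgdebug

    # Constructing the tracer starts a background thread that prints a stack
    # trace for every running thread at the given interval (in seconds).
    tracer = pgdebug.ThreadTrace(interval=10.0)

    # ...now reproduce the lockup and watch the console output to see where
    # each thread is stuck.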
One possibility is that the task runner (I assume you are using the task runner?) is doing something inefficient when displaying the results from the task. However, I tested this with a mock camera and it seemed to have no trouble with a 30-second task (~6500 frames). Is it possible you are running out of memory (are you using a 32- or 64-bit Python)?
Luke
I tested this on my rig and got the same behavior you described. For a 30-second recording, ACQ4 hangs for several seconds while it processes the image data. For a 60-second recording, the entire machine hangs for several minutes because it goes into swap.
I found one place to reduce memory overhead: the conversion from a list of frames into a single array (whether this conversion needs to happen at all is a different question, but changing that would take more work). With these changes, the acquisition thread still takes several seconds to process the image data, but it no longer causes the other threads to lock up: https://github.com/acq4/acq4/pull/61. Let me know if that helps at all.
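The gist of the change is something like this (not the exact diff in the PR, just the idea; `frames_to_array` is only an illustrative helper, not a function from the codebase):

    import numpy as np

    def frames_to_array(frames):
        # `frames` is assumed to be a list of equally shaped frame arrays.
        # Preallocate the output once and copy frames into it in place,
        # dropping each source frame as it is consumed, instead of letting
        # np.concatenate / np.array hold the list and a full second copy of
        # the data in memory at the same time.
        first = frames[0]
        out = np.empty((len(frames),) + first.shape, dtype=first.dtype)
        for i in range(len(frames)):
            out[i] = frames[i]
            frames[i] = None   # release the reference so the frame can be freed
        return out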
Now this has unmasked another issue: HDF5 cannot handle chunks larger than 4 GB. Perhaps you can take a look at that (assuming you get the same error)?
The best place to fix this might be in CameraTask.storeResult(); note that the keyword arguments to dh.writeFile(...) are ultimately passed through to MetaArray.writeHDF5().
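So something along these lines at the end of CameraTask.storeResult() might do it (a sketch only; whether a `chunks` option is accepted and forwarded exactly like this is worth verifying, and if it isn't, that is where the option would need to be added):

    # Sketch only: pass an explicit HDF5 chunk shape through writeFile() so
    # that MetaArray.writeHDF5() does not end up with one chunk covering the
    # whole recording. One frame per chunk stays far below the 4 GB limit.
    dh.writeFile(data, k, info=info,
                 chunks=(1,) + tuple(data.shape[1:]))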
Luke
Unfortunately, the patch does not change anything on my side. The processing still interferes with the laser thread, which causes it to shut down. Is it possible to have the processing run in a separate thread?
Cheers,
Michael
On Thu, Jun 22, 2017 at 1:25 AM, Luke Campagnola <luke.ca...@gmail.com> wrote:
The task (including image processing and storage) is already running in a background thread. When it begins to process and store the image data, it actually seems to lock up all threads (in my tests, at least), and in some cases the entire machine. It could just be that the disk I/O is saturated, so any thread or process may become blocked waiting for disk access? On our system, we have a relatively fast SSD for data storage that is separate from the operating system disk (which includes the ACQ4 configuration files). Maybe we should focus instead on figuring out why the laser is switching off -- could you find the code that switches off the laser and insert a `traceback.print_stack()` so we can see who is calling it?

I am back working on this issue. I don't think there is a call to the laser routine. I think the problem arises in the laser thread (MaiTaiThread):
https://github.com/mgraupe/acq4/blob/maitaiLaser/acq4/devices/MaiTaiLaser/MaiTaiLaser.py (line 240 onwards)
It seems to me as if the laser switches off as a precaution when something goes wrong in the serial protocol communication. The stalling interrupts the thread so that serial writes and reads are not completed properly, which in turn makes the laser turn off. Is there a way to make that thread fail-safe? Can I build in a precaution there?
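Something along these lines is what I have in mind, just as a rough sketch (none of these names come from the real MaiTaiLaser.py; they only stand in for the serial query, the shutdown call, and the thread-stop check):

    import time

    def poll_laser(read_power, turn_laser_off, should_stop,
                   max_failures=5, period=0.2):
        # Tolerate a few consecutive serial failures before shutting down,
        # instead of turning the laser off on the first hiccup caused by a
        # stalled machine.
        failures = 0
        while not should_stop():
            try:
                read_power()
                failures = 0              # link is healthy again, reset counter
            except Exception:
                failures += 1
                if failures >= max_failures:
                    turn_laser_off()      # only shut down after repeated failures
                    return
            time.sleep(period)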
On Wed, Jul 12, 2017 at 7:36 AM, Michael Graupner <graupner...@gmail.com> wrote:
Is there a way to make that thread fail-safe? Can I build in a precaution there?

When I did the test on my machine, it seemed that the entire OS was blocked while the HDF5 file was being written. So I am not sure that this can be solved just by moving the laser access to a more privileged thread, or even to a separate process.
OK, next step: you said that there are no problems if you run the task without storing any data. Let's verify that it is indeed the camera image storage that is causing the system to lock up. In devices/Camera/Camera.py, CameraTask.storeResult(), comment out the last line of the function, `dh.writeFile(data, k, info=info)`, and then run your task.
If that really does prevent the lockup from occurring, then it may help to think about how and where you are storing data -- on our machines, we use a fast SSD for data storage that is separate from the system drive (which holds ACQ4 and its configuration files). Another possibility is that we need to write the image data out incrementally as it is being collected, rather than in a single write at the end.
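That second idea would look roughly like this with plain h5py (just a sketch of the concept, not the MetaArray code path ACQ4 actually uses for camera data):

    import h5py
    import numpy as np

    def open_frame_store(path, frame_shape, dtype):
        # Resizable, chunked dataset so frames can be appended as they arrive.
        f = h5py.File(path, 'w')
        dset = f.create_dataset(
            'frames',
            shape=(0,) + frame_shape,
            maxshape=(None,) + frame_shape,   # unlimited along the time axis
            chunks=(1,) + frame_shape,        # one frame per chunk, well under 4 GB
            dtype=dtype,
        )
        return f, dset

    def append_frame(dset, frame):
        # Grow the dataset by one frame and write it immediately.
        n = dset.shape[0]
        dset.resize(n + 1, axis=0)
        dset[n] = frame

    # Example: store 100 fake 512x512 uint16 frames incrementally.
    f, dset = open_frame_store('frames_test.h5', (512, 512), np.uint16)
    for i in range(100):
        append_frame(dset, np.zeros((512, 512), dtype=np.uint16))
    f.close()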
Luke