[SyneRBI/SIRF] occasional Bad file descriptor in cGadgetron (#641)


Kris Thielemans

April 27, 2020, 7:22:08 AM
to SyneRBI/SIRF, Subscribed

This job https://travis-ci.org/github/SyneRBI/SIRF-SuperBuild/jobs/679959452#L16334
from SyneRBI/SIRF-SuperBuild#377 (which is a DEVEL build) fails, while others are fine. The error is in the MR test

ERROR: test3.test_main
...
error: ??? "'write: Bad file descriptor' exception caught at line 545 of /Users/travis/build/SyneRBI/SIRF-SuperBuild/sources/SIRF/src/xGadgetron/cGadgetron/cgadgetron.cpp; the reconstruction engine output may provide more information"
-------------------- >> begin captured stdout << ---------------------
File: /Users/travis/build/SyneRBI/SIRF-SuperBuild/INSTALL/python/sirf/Gadgetron.py
Line: 1384
check_status found the following message sent from the engine:

I'll rerun the job, as I guess this won't happen again, but it is worrying nevertheless.

@evgueni-ovtchinnikov any ideas?



Evgueni Ovtchinnikov

April 27, 2020, 7:43:52 AM
to SyneRBI/SIRF, Subscribed

I have seen this error message in Travis logs many times, but the reported error never ever happened locally, so impossible to investigate, I am afraid.

Johannes Mayer

April 29, 2020, 11:02:22 AM
to SyneRBI/SIRF, Subscribed

I get this one too from time to time

Kris Thielemans

April 29, 2020, 1:28:06 PM
to SyneRBI/SIRF, Subscribed

Hmmm, this is going to be tough then. Any ideas for writing some debugging checks and doing a special test run with 1000 tests to see when it fails?

Richard Brown

June 9, 2020, 4:25:12 AM
to SyneRBI/SIRF, Subscribed

@evgueni-ovtchinnikov I haven't looked through the source code, but if this pertains to file writing, could you put it in a for loop, similar to what you already do when trying to connect to the Gadgetron server?

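#include <stdexcept>

bool success = false;
const unsigned num_attempts = 5;
for (unsigned i = 0; i < num_attempts; ++i) {
    try {
        // placeholder for the call that intermittently fails
        success = do_the_thing_that_causes_the_error();
    }
    catch (...) {
        // swallow the exception and try again
    }
    if (success) break;
}
if (!success)
    throw std::runtime_error("bad file descriptor");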

Evgueni Ovtchinnikov

June 9, 2020, 11:52:21 AM
to SyneRBI/SIRF, Subscribed

@johannesmayer: if you get this error when running your mrtest.cpp, then one possible culprit is your MRAcquisitionData::read, where you create ISMRMRD::Dataset and call its methods readHeader, getNumberOfAcquisitions and readAcquisition without Mutex locking/unlocking.

I have very little idea what Mutex does - something to do with multithreading - but I noticed Gadgetron was using it, so I just followed suit, see e.g. AcquisitionsFile::get_acquisition.

@rijobro: what you suggest looks like papering over the crack, I am afraid. I would try to investigate a bit more before resorting to your fallback.

Evgueni Ovtchinnikov

June 10, 2020, 7:34:35 AM
to SyneRBI/SIRF, Subscribed

added missing mutex locks/unlocks, HTH

Richard Brown

June 10, 2020, 7:46:22 AM
to SyneRBI/SIRF, Subscribed

> I have very little idea what Mutex does - something to do with multithreading - but I noticed Gadgetron was using it, so I just followed suit, see e.g. AcquisitionsFile::get_acquisition.

A mutex is used to stop multiple threads from accessing the same files/variables simultaneously, which would otherwise lead to data races, etc.

So it could well be that adding the missing mutex locks solves the problem. Thanks.
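
As a minimal sketch of that idea, assuming the standard library's `std::mutex` rather than the Mutex class used in Gadgetron/SIRF (whose interface is not shown in this thread): `SharedCounter` and `run_counter` are hypothetical names, and the counter merely stands in for a shared resource such as an open dataset file.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a resource shared between threads.
struct SharedCounter {
    std::mutex mtx;   // guards `value`
    long value = 0;

    void increment() {
        // lock_guard locks the mutex here and unlocks it when it goes out
        // of scope, even if an exception is thrown in between.
        std::lock_guard<std::mutex> lock(mtx);
        ++value;
    }
};

// Starts `num_threads` threads that each call increment() `per_thread`
// times, waits for all of them, and returns the final count.
long run_counter(unsigned num_threads, unsigned per_thread) {
    SharedCounter counter;
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t)
        threads.emplace_back([&counter, per_thread] {
            for (unsigned i = 0; i < per_thread; ++i)
                counter.increment();
        });
    for (auto& th : threads)
        th.join();
    return counter.value;
}
```

With the lock in place every increment is applied, so the total is always `num_threads * per_thread`. Without it, the unsynchronised `++value` would be a data race and the result would be undefined.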

Richard Brown

July 1, 2020, 2:32:34 PM
to SyneRBI/SIRF, Subscribed

Bug still persisting (PR from today): https://travis-ci.org/github/SyneRBI/SIRF/jobs/703951360#L28836
