1) I build statically linked Haskell programs.
2) For some reason, these fail on some of the machines with an error
from the system call to open a file. Many of the processes do
actually run (see the error-reporting sketch after this list).
3) As a result, Condor has all of my jobs finishing (well, 4 are left)
out of 1200, but maybe only 200 or so actually ran correctly. Now I
have a "swiss cheese" situation, where I have a subset of job numbers
that are complete, and a large set that need to run again. I don't
really know why some machines failed and some didn't; I don't get the
error on my local machine or local VM.
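One thing that would make the swiss-cheese cleanup easier is having the
binary report the failure itself and exit nonzero. A minimal sketch,
assuming the input is opened through System.IO ("input.dat" is a
stand-in for the real path):

import Control.Exception (IOException, try)
import System.Exit (ExitCode (ExitFailure), exitWith)
import System.IO

-- Hypothetical input path; substitute the real one.
inputPath :: FilePath
inputPath = "input.dat"

main :: IO ()
main = do
  result <- try (openFile inputPath ReadMode) :: IO (Either IOException Handle)
  case result of
    Left err -> do
      -- Say exactly what failed on stderr, which Condor captures per job.
      hPutStrLn stderr ("failed to open " ++ inputPath ++ ": " ++ show err)
      -- Exit nonzero so a job that never read its input doesn't
      -- count as "finished".
      exitWith (ExitFailure 1)
    Right h -> do
      contents <- hGetContents h
      putStrLn (take 80 contents)  -- real processing would go here
      hClose h

With a nonzero exit code the failure at least shows up in the Condor
job log, so the jobs that need rerunning can be picked out without
inspecting every output file by hand.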
This grid computing thing sounds nice, but it is a major pain in the ass.
I assumed all the machines would look basically identical (thus the
virtual machine), but if that's the case, why did only a subset have
an error reading the input files?
--
P. Oscar Boykin http://boykin.acis.ufl.edu
Assistant Professor, Department of Electrical and Computer Engineering
University of Florida
Thanks,
David
Actually, I think it is more obscure than that. It looks like an issue
with 64-bit machines running my 32-bit binary, with most of them not
having a particular file at a particular path.
I'll investigate further tomorrow and report.
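If that's the cause, a tiny probe job submitted across the pool could
confirm it before tomorrow. This is only a sketch under that
assumption: /lib/ld-linux.so.2 is a guess at the missing file (it's the
32-bit loader that even "mostly static" 32-bit binaries can need, and
64-bit-only hosts often lack it), and uname -m reports the host
architecture:

import Control.Monad (forM_)
import System.Directory (doesFileExist)
import System.Process (readProcess)

-- Guesses at what might be missing on the failing hosts; the 32-bit
-- loader is the usual suspect for a 32-bit binary on a 64-bit-only box.
candidates :: [FilePath]
candidates = ["/lib/ld-linux.so.2"]

main :: IO ()
main = do
  arch <- readProcess "uname" ["-m"] ""
  putStrLn ("host architecture: " ++ filter (/= '\n') arch)
  forM_ candidates $ \p -> do
    present <- doesFileExist p
    putStrLn (p ++ ": " ++ (if present then "present" else "MISSING"))

Running that with the same requirements as the real jobs would show
whether the failing machines are exactly the ones missing the file.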
I had some issues with broken links in the transfer and will move other related topics when the broken links get resolved. In the meantime, don't hesitate to post questions on the DAGMan topic.