Issue with Condor usage

P. Oscar Boykin

Jan 4, 2011, 3:29:37 PM1/4/11
to acisp2...@googlegroups.com
Does anyone have any suggestions for the following situation:

1) I build statically linked Haskell programs (build recipe sketched below).
2) For some reason, these fail on some of the machines with an error
on the system call to open a file. Many of the processes actually
run.
3) As a result, Condor has all of my jobs finishing (well, 4 are left)
out of 1200, but maybe only 200 or so actually ran correctly. Now I
have a "swiss cheese" situation, where I have a subset of job numbers
that are complete and a large set that needs to run again. I don't
really know why some machines failed and some didn't. I don't get an
error on my local machine or local VM.
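
To be concrete about (1), the build is essentially the standard
fully-static GHC recipe, something like this (the exact flags may
vary):

    # build a fully statically linked executable with GHC
    # (-static uses static Haskell libraries; -optl-static passes -static
    #  through to the linker)
    ghc -O2 --make Main.hs -o myjob -static -optl-static -optl-pthread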

This grid computing thing sounds nice, but it is a major pain in the ass.

I assumed all the machines would look basically identical (hence the
virtual machine), but if that's the case, why did only a subset have
an error reading the input files?
--
P. Oscar Boykin                            http://boykin.acis.ufl.edu
Assistant Professor, Department of Electrical and Computer Engineering
University of Florida

rjo...@gmail.com

Jan 4, 2011, 8:43:54 PM1/4/11
to acis.p2p.users
Are you using NFS or copying files? I wonder if the error is because
NFS is not being mounted properly.

David Isaac Wolinsky

Jan 4, 2011, 9:04:10 PM1/4/11
to acisp2...@googlegroups.com
What does your submit file look like?
Could you provide the error, log, and output files?

Thanks,
David

P. Oscar Boykin

Jan 4, 2011, 9:37:10 PM1/4/11
to acisp2...@googlegroups.com

Actually, I think it is more obscure than that. It looks like an issue with 64-bit machines running my 32-bit binary, with most of them not having a particular file at a particular path.

I'll investigate further tomorrow and report.
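
One thing I plan to check is which machines in the pool are 64-bit; a
constrained condor_status query along these lines should list them
(assuming the standard Arch attribute is advertised):

    # list execute machines advertising a 64-bit architecture
    condor_status -constraint 'Arch == "X86_64"' -format "%s\n" Machine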


Rolando Raqueno

Jan 5, 2011, 5:49:34 AM1/5/11
to acisp2...@googlegroups.com
Hello,

With regard to restarting failed jobs, we've found Condor's DAGMan facility to be an invaluable tool for managing this situation. When you run jobs under DAGMan's control, failed runs are noted in a "rescue" file, which flags completed vs. aborted jobs.

This rescue file can then be resubmitted to bypass the completed jobs and rerun the ones that have been flagged as failed.
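
As a rough sketch (file and node names are placeholders, and the
rescue file's exact name depends on your Condor version):

    # jobs.dag -- one DAG node per Condor job, each with its own submit file
    JOB   job001  job001.submit
    JOB   job002  job002.submit
    RETRY job001  2
    RETRY job002  2

    # submit the whole DAG:
    #   condor_submit_dag jobs.dag
    # if nodes fail, DAGMan writes a rescue DAG (jobs.dag.rescue or
    # jobs.dag.rescueNNN, depending on version); resubmitting it skips the
    # nodes already marked DONE and reruns only the failed ones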

The logging under DAGMan is more comprehensive than the basic logging of regular Condor submissions, which becomes indispensable when tracking down potential hardware incompatibilities.

We've found that if you have a node in your flock with a bad floating-point unit, it will rifle through all your jobs incorrectly, making it appear that all of them were processed properly. If this is the case, you may want to target only a subset of nodes for execution in order to isolate this type of problem, as in the sketch below.
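
For example, restricting a submit file to a couple of named machines
looks roughly like this (machine names are placeholders):

    # run only on these two execute machines
    requirements = (Machine == "node01.example.edu") || (Machine == "node02.example.edu")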

We have internal documentation on our wiki regarding the use of DAGMan and have moved at least one topic out into the open.

I had some issues with broken links in the transfer and will move other related topics when the broken links get resolved. In the meantime, don't hesitate to post questions on the DAGMan topic.

Hope that helps,

Rolando

P.S. You can also put requirements in Condor to target only the 32-bit machines and stay off any 64-bit architectures.
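
Something along these lines in the submit description file should do it
(Arch is "INTEL" for 32-bit x86 and "X86_64" for 64-bit in a stock pool):

    # match only 32-bit x86 execute machines
    requirements = (Arch == "INTEL")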

---

Rolando Raqueño, Ph.D.
Research Scientist
Digital Imaging and Remote Sensing Laboratory
Chester F. Carlson Center for Imaging Science
Rochester Institute of Technology
