[Condor-users] Problems with DAG


Alexander Dietz

Mar 28, 2008, 7:13:40 AM
to Condor-Users Mail List
Hi,

I am running into trouble while running a DAG containing sub-DAGs with Condor version 7.0.1. When the DAGs are created, one of the sub-DAGs is called "datafind.GRB060429_DATAFIND.dag". From this file another file called "datafind.GRB060429_DATAFIND.dag.condor.sub" is created by calling "condor_submit_dag" (this is stated in the latter file). This file (i.e. "datafind.GRB060429_DATAFIND.dag.condor.sub", which is attached) contains several arguments to the "condor_dagman" command, but for unknown reasons some of them, such as "AutoRescue", do not appear in the documentation of "condor_dagman".
Since this file is created by "condor_dagman" with obviously no arguments, how do these unneeded arguments get in? I cannot find them in the Condor configuration files or in the environment variables. Is there some other way that arguments could get in?

Thanks for any help,
  Alex

Alexander Dietz

Mar 28, 2008, 7:15:00 AM
to Condor-Users Mail List
Hi,

I am running into trouble while running a DAG containing sub-DAGs with Condor version 7.0.1. When the DAGs are created, one of the sub-DAGs is called "datafind.GRB060429_DATAFIND.dag". From this file another file called "datafind.GRB060429_DATAFIND.dag.condor.sub" is created by calling "condor_submit_dag" (this is stated in the latter file). This file (i.e. "datafind.GRB060429_DATAFIND.dag.condor.sub", which is attached) contains several arguments to the "condor_dagman" command, but for unknown reasons some of them do not appear in the documentation of "condor_dagman". One example is "AutoRescue", which causes the DAG to fail (see the output in the file "offsource.GRB060429_OFFSOURCE_CATEGORY_1.dag.dagman.out").
datafind.GRB060429_DATAFIND.dag.condor.sub
offsource.GRB060429_OFFSOURCE_CATEGORY_1.dag.dagman.out

R. Kent Wenger

Mar 28, 2008, 10:42:59 AM
to Condor-Users Mail List

Is it possible that you're running a 7.1.0 pre-release condor_submit_dag
binary? This is the only explanation I can think of. There are pretty
major changes in how rescue DAGs work in 7.1.0, and the -Autorescue flag
is part of that.

You can find the version of your condor_submit_dag binary by doing the
following:

     strings `which condor_submit_dag` | grep CondorVersion:

(assuming you're on some flavor of Unix or Linux; I'm not sure what the
equivalent is on Windows).

Does the datafind.GRB060429_DATAFIND.dag.condor.sub file work okay for
DAGMan itself, or does it cause problems? If it causes problems, you may
have a version mismatch between condor_submit_dag and condor_dagman.

(One general note here: the DAGMan version doesn't have to match the
version of the rest of the Condor installation, but the versions of
condor_submit_dag and condor_dagman should always match each other.)

Kent Wenger
Condor Team
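
The version check described above can be sketched as a small script (an editorial illustration, not part of the original reply). The function names are made up for this example. It uses `grep -a -o` in place of `strings | grep`, which assumes a grep that supports both options (GNU and BSD grep do), and it assumes both tools are on the PATH.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of Condor.

# Print the first embedded "$CondorVersion: ... $" string found in file $1.
condor_version_of() {
    grep -a -o '\$CondorVersion: [^$]*\$' "$1" | head -n 1
}

# Succeed only if both files carry the same, non-empty version string.
same_condor_version() {
    v1=$(condor_version_of "$1")
    v2=$(condor_version_of "$2")
    [ -n "$v1" ] && [ "$v1" = "$v2" ]
}

# Compare the installed tools, if both can be found on the PATH.
if command -v condor_submit_dag >/dev/null 2>&1 &&
   command -v condor_dagman >/dev/null 2>&1; then
    if same_condor_version "$(command -v condor_submit_dag)" \
                           "$(command -v condor_dagman)"; then
        echo "condor_submit_dag and condor_dagman versions match"
    else
        echo "version mismatch between condor_submit_dag and condor_dagman" >&2
    fi
fi
```

If the two strings differ, the mismatch described above is a likely cause of unexpected arguments in the generated .condor.sub file.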
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-use...@cs.wisc.edu with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/

Bryan S. Maher

Mar 28, 2008, 11:02:47 AM
to Condor-Users Mail List

Hi All:

 

I have a new Condor pool uniformly running v7.0.1 on Windows.   After a day or two the slot1 resources fail to show up when issuing a condor_status command.  Here is sample output:

 

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

sl...@IDEE.inf.ad. WINNT51    INTEL  Owner     Idle     0.030  1023  0+04:32:59
sl...@IDEE.inf.ad. WINNT51    INTEL  Owner     Idle     0.000  1023  0+04:33:00
sl...@LEVIATHAN.in WINNT51    INTEL  Owner     Idle     0.000  1534  0+04:35:05
sl...@CQ-00.inf.ad WINNT52    INTEL  Unclaimed Idle     0.000  1006  5+14:26:38
sl...@CQ-01.inf.ad WINNT52    INTEL  Unclaimed Idle     0.000  1006  0+02:25:07
sl...@CQ-02.inf.ad WINNT52    INTEL  Unclaimed Idle     0.000  1006  0+02:25:05
sl...@CQ-03.inf.ad WINNT52    INTEL  Unclaimed Idle     0.000  1006  0+02:25:05
sl...@CQ-04.inf.ad WINNT52    INTEL  Unclaimed Idle     0.000  1006  0+02:25:07

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

       INTEL/WINNT51     3     3       0         0       0          0        0
       INTEL/WINNT52     5     0       0         5       0          0        0

               Total     8     3       0         5       0          0        0

 

As you can see, even the totals fail to count the slot1 resources.  A condor_reconfig is sufficient to bring slot1 back to life.   The StartLog on an affected machine looks like:

 

3/17 12:03:18 ******************************************************
3/17 12:03:18 ** condor_startd.exe (CONDOR_STARTD) STARTING UP
3/17 12:03:18 ** C:\condor\bin\condor_startd.exe
3/17 12:03:18 ** $CondorVersion: 7.0.1 Feb 27 2008 BuildID: 76180 $
3/17 12:03:18 ** $CondorPlatform: INTEL-WINNT50 $
3/17 12:03:18 ** PID = 1880
3/17 12:03:18 ** Log last touched 3/17 11:01:32
3/17 12:03:18 ******************************************************
3/17 12:03:18 Using config source: C:\condor\condor_config
3/17 12:03:18 Using local config sources:
3/17 12:03:18    C:\condor\condor_config.local
3/17 12:03:18 DaemonCore: Command Socket at <x.x.x.x:1071>
3/17 12:03:18 MachAttributes::publish: failed to get Windows version information
3/17 12:03:24 slot1: New machine resource allocated
3/17 12:03:24 slot2: New machine resource allocated
3/17 12:03:29 About to run initial benchmarks.
3/17 12:03:33 Completed initial benchmarks.
.
.  slot2 continues to run benchmarks, slot1 never runs benchmarks …
.
3/17 12:03:33 slot2: State change: IS_OWNER is false
3/17 12:03:33 slot2: Changing state: Owner -> Unclaimed
3/17 12:03:33 slot1: State change: IS_OWNER is false
3/17 12:03:33 slot1: Changing state: Owner -> Unclaimed
3/17 16:03:33 State change: RunBenchmarks is TRUE
3/17 16:03:33 slot2: Changing activity: Idle -> Benchmarking
3/17 16:03:36 State change: benchmarks completed
3/17 16:03:36 slot2: Changing activity: Benchmarking -> Idle
3/17 20:03:36 State change: RunBenchmarks is TRUE
3/17 20:03:36 slot2: Changing activity: Idle -> Benchmarking
3/17 20:03:39 State change: benchmarks completed
.
.  reconfig sent, slot1 begins to run benchmarks in lieu of slot2
.  slot1 reappears in condor_status for a while …
.
3/22 21:50:06 Got SIGHUP.  Re-reading config files.
3/23 00:10:06 State change: RunBenchmarks is TRUE
3/23 00:10:06 slot1: Changing activity: Idle -> Benchmarking
3/23 00:10:10 State change: benchmarks completed
3/23 00:10:10 slot1: Changing activity: Benchmarking -> Idle
3/23 04:10:10 State change: RunBenchmarks is TRUE
3/23 04:10:10 slot1: Changing activity: Idle -> Benchmarking
3/23 04:10:14 State change: benchmarks completed
3/23 04:10:14 slot1: Changing activity: Benchmarking -> Idle
.
.  slot1 benchmarks continue but slot1 is no longer visible in condor_status …
.
3/28 04:12:18 slot1: Changing activity: Benchmarking -> Idle
3/28 08:12:19 State change: RunBenchmarks is TRUE
3/28 08:12:19 slot1: Changing activity: Idle -> Benchmarking
3/28 08:12:22 State change: benchmarks completed
3/28 08:12:22 slot1: Changing activity: Benchmarking -> Idle
<end>

 

Any ideas?

 

-Bryan
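
One way to spot the symptom described above across a pool is sketched below (an editorial illustration, not from the original post). The function name `hosts_missing_slot1` is hypothetical, and the sketch assumes slot names of the form `slotN@host`, as in the output above. It reads one slot name per line and prints each host that advertises some slot but no slot1.

```shell
#!/bin/sh
# Hypothetical helper; the name is illustrative, not part of Condor.

# Read slot names (one per line, e.g. "slot2@host.example") on stdin and
# print every host that advertises at least one slot but not slot1.
# Names without an "@" (single-slot machines) are ignored.
hosts_missing_slot1() {
    awk -F'@' '
        NF == 2 { seen[$2] = 1; if ($1 == "slot1") has1[$2] = 1 }
        END     { for (h in seen) if (!(h in has1)) print h }
    ' | sort
}

# Possible usage on a machine with the Condor tools installed:
#   condor_status -format "%s\n" Name | hosts_missing_slot1
```

Running this periodically would show when a slot1 drops out of the collector, which could then be lined up against the StartLog timestamps.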

 

 

Alexander Dietz

Mar 28, 2008, 11:50:28 AM
to Condor-Users Mail List
Thanks for the reply,

On 28/03/2008, R. Kent Wenger <wen...@cs.wisc.edu> wrote:
On Fri, 28 Mar 2008, Alexander Dietz wrote:

> I am running into trouble while running a DAG with DAGs with condor version
> 7.0.1. When the DAG's are created one of the sub-DAGs is called "
> datafind.GRB060429_DATAFIND.dag". From this file another file called "
> datafind.GRB060429_DATAFIND.dag.condor.sub" is created by calling
> "condor_submit_dag" (so it is written in the latter file). This file (i.e. "
> datafind.GRB060429_DATAFIND.dag.condor.sub" which is attached) contains
> several arguments (which are arguments to the "condor_dagman" command), but
> for unknown reasons there are arguments not contained in the documentation
> of "condor_dagman", like the argument "AutoRescue".
> Since this file is created by "condor_dagman" with obvious no arguments, how
> do these needless arguments get in? I cannot find them in the
> condor-configuration files nor in the environment variables anywhere. Is
> there some other way that arguments could get in?


Is it possible that you're running a 7.1.0 pre-release condor_submit_dag
binary?  This is the only explanation I can think of.  There are pretty
major changes in how rescue DAGs work in 7.1.0, and the -Autorescue flag
is part of that.

You can find the version of your condor_submit_dag binary by doing the
following:

     strings `which condor_submit_dag` |grep CondorVersion:

Yes, it does indeed seem that a 7.1.0 pre-release condor_submit_dag binary is being used!
Thanks very much, I will follow up with the sysadmin now...

Alex
 

carl langlois

Mar 28, 2008, 11:51:07 AM
to Condor-Users Mail List
Hi Bryan,

Do you have any core.something.WIN32 files in your log directory? I had a similar problem where some slots disappeared from the pool at some point, and I noticed a core file in the log directory, but I don't know why it happened.


Carl



Bryan S. Maher

Mar 28, 2008, 12:44:55 PM
to Condor-Users Mail List

Carl,

 

I just checked.  No CORE files of any sort on any of the affected machines.

 

-Bryan

Kewley, J (John)

Mar 28, 2008, 1:15:58 PM
to Condor-Users Mail List
I think this is the known problem with UDP updates for Windows machines in general.
A fair few sites have mentioned problems like this in the past, when whole
machines used to vanish. Now that there are more and more multi-slot machines,
it appears that some of the slots report OK and some don't.

If you check previous posts in this forum you'll see some suggestions from the
Condor team, but I think the only thing that I found to work was enabling
TCP rather than UDP for the ClassAd heartbeat.
 
Cheers
 
JK
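
For readers who want to try the TCP suggestion above, a configuration fragment might look like the following (an editorial sketch, not from the original message). The two setting names are taken from the Condor manual's section on TCP collector updates. The cache size of 1000 is purely illustrative, and where each line belongs in a given pool is an assumption to check against the manual for the installed version.

```
# On each machine that sends updates to the collector
# (for example in condor_config.local):
UPDATE_COLLECTOR_WITH_TCP = True

# On the central manager, so the condor_collector keeps a cache of
# open TCP sockets. 1000 is an illustrative value only.
COLLECTOR_SOCKET_CACHE_SIZE = 1000
```

The daemons have to re-read their configuration before the change takes effect; whether a condor_reconfig is enough or a restart is needed is worth confirming in the manual.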


From: condor-use...@cs.wisc.edu [mailto:condor-use...@cs.wisc.edu] On Behalf Of carl langlois
Sent: Friday, March 28, 2008 3:51 PM

To: Condor-Users Mail List
Subject: Re: [Condor-users] slot1 resources disappear after a few days.

Bryan S. Maher

Mar 28, 2008, 2:20:29 PM
to Condor-Users Mail List

John,

 

I was hoping that the UDP issues had been resolved by now; my previous Condor 6.6.x pool was using TCP updates because of this issue.   Slot2 never seems to be affected by this… do you still think UDP updates are to blame?  I suppose it doesn’t hurt to give it a try.

 

Thanks,

 

Bryan
