Is it possible that you're running a 7.1.0 pre-release condor_submit_dag
binary? This is the only explanation I can think of. There are pretty
major changes in how rescue DAGs work in 7.1.0, and the -Autorescue flag
is part of that.
You can find the version of your condor_submit_dag binary by doing the
following:
    strings `which condor_submit_dag` | grep CondorVersion:
(assuming you're on some flavor of Unix or Linux; I'm not sure what the
equivalent is on Windows).
Does the datafind.GRB060429_DATAFIND.dag.condor.sub file work okay for
DAGMan itself, or does it cause problems? If it causes problems, you may
have a version mismatch between condor_submit_dag and condor_dagman.
(One general note here: the DAGMan version doesn't have to match the
version of the rest of the Condor installation, but the versions of
condor_submit_dag and condor_dagman should always match each other.)
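
A minimal sketch of the same check in Python, for comparing the two binaries directly. It assumes each binary embeds a stamp of the form "CondorVersion: x.y.z ..." (the string the strings/grep command above looks for); the function names are hypothetical and not part of Condor.

```python
import re
from typing import Optional

# Assumed stamp format: "CondorVersion: 7.0.1 Feb 27 2008 ..."
VERSION_RE = re.compile(rb"CondorVersion: (\d+\.\d+\.\d+)")


def version_in_bytes(data: bytes) -> Optional[str]:
    """Return the x.y.z version from the first CondorVersion stamp, or None."""
    match = VERSION_RE.search(data)
    return match.group(1).decode("ascii") if match else None


def version_of_binary(path: str) -> Optional[str]:
    """Read a binary from disk and extract its embedded version stamp."""
    with open(path, "rb") as handle:
        return version_in_bytes(handle.read())


def versions_match(submit_dag_path: str, dagman_path: str) -> bool:
    """True only if both binaries carry a stamp and the versions are equal."""
    submit_version = version_of_binary(submit_dag_path)
    dagman_version = version_of_binary(dagman_path)
    return submit_version is not None and submit_version == dagman_version
```

Calling versions_match() with the paths to your condor_submit_dag and condor_dagman binaries returns False if the versions differ or if either stamp cannot be found.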
Kent Wenger
Condor Team
Hi All:
I have a new Condor pool uniformly running v7.0.1 on Windows. After a day or two, the slot1 resources no longer show up in the output of condor_status. Here is sample output:
Name                OpSys    Arch   State      Activity  LoadAv  Mem   ActvtyTime
sl...@IDEE.inf.ad.  WINNT51  INTEL  Owner      Idle      0.030   1023  0+04:32:59
sl...@IDEE.inf.ad.  WINNT51  INTEL  Owner      Idle      0.000   1023  0+04:33:00
sl...@LEVIATHAN.in  WINNT51  INTEL  Owner      Idle      0.000   1534  0+04:35:05
sl...@CQ-00.inf.ad  WINNT52  INTEL  Unclaimed  Idle      0.000   1006  5+14:26:38
sl...@CQ-01.inf.ad  WINNT52  INTEL  Unclaimed  Idle      0.000   1006  0+02:25:07
sl...@CQ-02.inf.ad  WINNT52  INTEL  Unclaimed  Idle      0.000   1006  0+02:25:05
sl...@CQ-03.inf.ad  WINNT52  INTEL  Unclaimed  Idle      0.000   1006  0+02:25:05
sl...@CQ-04.inf.ad  WINNT52  INTEL  Unclaimed  Idle      0.000   1006  0+02:25:07

               Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
INTEL/WINNT51      3      3        0          0        0           0         0
INTEL/WINNT52      5      0        0          5        0           0         0
        Total      8      3        0          5        0           0         0
As you can see, even the totals fail to count the slot1 resources. A condor_reconfig is sufficient to bring slot1 back to life. The StartLog on an affected machine looks like:
3/17 12:03:18 ******************************************************
3/17 12:03:18 ** condor_startd.exe (CONDOR_STARTD) STARTING UP
3/17 12:03:18 ** C:\condor\bin\condor_startd.exe
3/17 12:03:18 ** $CondorVersion: 7.0.1 Feb 27 2008 BuildID: 76180 $
3/17 12:03:18 ** $CondorPlatform: INTEL-WINNT50 $
3/17 12:03:18 ** PID = 1880
3/17 12:03:18 ** Log last touched 3/17 11:01:32
3/17 12:03:18 ******************************************************
3/17 12:03:18 Using config source: C:\condor\condor_config
3/17 12:03:18 Using local config sources:
3/17 12:03:18 C:\condor\condor_config.local
3/17 12:03:18 DaemonCore: Command Socket at <x.x.x.x:1071>
3/17 12:03:18 MachAttributes::publish: failed to get Windows version information
3/17 12:03:24 slot1: New machine resource allocated
3/17 12:03:24 slot2: New machine resource allocated
3/17 12:03:29 About to run initial benchmarks.
3/17 12:03:33 Completed initial benchmarks.
.
. slot2 continues to run benchmarks, slot1 never runs benchmarks …
.
3/17 12:03:33 slot2: State change: IS_OWNER is false
3/17 12:03:33 slot2: Changing state: Owner -> Unclaimed
3/17 12:03:33 slot1: State change: IS_OWNER is false
3/17 12:03:33 slot1: Changing state: Owner -> Unclaimed
3/17 16:03:33 State change: RunBenchmarks is TRUE
3/17 16:03:33 slot2: Changing activity: Idle -> Benchmarking
3/17 16:03:36 State change: benchmarks completed
3/17 16:03:36 slot2: Changing activity: Benchmarking -> Idle
3/17 20:03:36 State change: RunBenchmarks is TRUE
3/17 20:03:36 slot2: Changing activity: Idle -> Benchmarking
3/17 20:03:39 State change: benchmarks completed
.
. reconfig sent, slot1 begins to run benchmarks in lieu of slot2
. slot1 reappears in condor_status for a while …
.
3/22 21:50:06 Got SIGHUP. Re-reading config files.
3/23 00:10:06 State change: RunBenchmarks is TRUE
3/23 00:10:06 slot1: Changing activity: Idle -> Benchmarking
3/23 00:10:10 State change: benchmarks completed
3/23 00:10:10 slot1: Changing activity: Benchmarking -> Idle
3/23 04:10:10 State change: RunBenchmarks is TRUE
3/23 04:10:10 slot1: Changing activity: Idle -> Benchmarking
3/23 04:10:14 State change: benchmarks completed
3/23 04:10:14 slot1: Changing activity: Benchmarking -> Idle
.
. slot1 benchmarks continue but slot1 is no longer visible in condor_status …
.
3/28 04:12:18 slot1: Changing activity: Benchmarking -> Idle
3/28 08:12:19 State change: RunBenchmarks is TRUE
3/28 08:12:19 slot1: Changing activity: Idle -> Benchmarking
3/28 08:12:22 State change: benchmarks completed
3/28 08:12:22 slot1: Changing activity: Benchmarking -> Idle
<end>
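
To make the pattern in the log above easier to see across many machines, here is a minimal sketch that counts how often each slot enters the Benchmarking activity. It assumes the StartLog line format shown above; the function name is hypothetical, and the script only summarizes the log, it does not diagnose why slot1 drops out of condor_status.

```python
import re
from collections import Counter

# Matches StartLog lines such as
# "3/17 16:03:33 slot2: Changing activity: Idle -> Benchmarking"
ACTIVITY_RE = re.compile(
    r"^(\d+/\d+ \d+:\d+:\d+) (slot\d+): Changing activity: (\w+) -> (\w+)$"
)


def benchmark_runs_per_slot(log_lines):
    """Count transitions into the Benchmarking activity for each slot."""
    counts = Counter()
    for line in log_lines:
        match = ACTIVITY_RE.match(line.strip())
        if match and match.group(4) == "Benchmarking":
            counts[match.group(2)] += 1
    return dict(counts)


if __name__ == "__main__":
    # Path taken from the log banner above; adjust for your installation.
    with open(r"C:\condor\log\StartLog") as handle:
        print(benchmark_runs_per_slot(handle))
```

The log path in the example is an assumption; use whatever location your configuration sets for the startd log.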
Any ideas?
-Bryan
On Fri, 28 Mar 2008, Alexander Dietz wrote:
> I am running into trouble while running a DAG with DAGs with Condor version
> 7.0.1. When the DAGs are created, one of the sub-DAGs is called
> "datafind.GRB060429_DATAFIND.dag". From this file, another file called
> "datafind.GRB060429_DATAFIND.dag.condor.sub" is created by calling
> "condor_submit_dag" (so it is written in the latter file). This file (i.e.
> "datafind.GRB060429_DATAFIND.dag.condor.sub", which is attached) contains
> several arguments (which are arguments to the "condor_dagman" command), but
> for unknown reasons there are arguments not contained in the documentation
> of "condor_dagman", like the argument "AutoRescue".
> Since this file is created by "condor_dagman" with obviously no arguments, how
> do these needless arguments get in? I cannot find them in the
> Condor configuration files nor in the environment variables anywhere. Is
> there some other way that arguments could get in?
Is it possible that you're running a 7.1.0 pre-release condor_submit_dag
binary? This is the only explanation I can think of. There are pretty
major changes in how rescue DAGs work in 7.1.0, and the -Autorescue flag
is part of that.
You can find the version of your condor_submit_dag binary by doing the
following:
    strings `which condor_submit_dag` | grep CondorVersion:
Carl,
I just checked. No CORE files of any sort on any of the affected machines.
-Bryan
Sent: Friday, March 28, 2008 3:51 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] slot1 resources disappear after a few days.
John,
I was hoping that the UDP issues had been resolved by now; my previous Condor 6.6.x pool was using TCP updates because of this issue. Slot2 never seems to be affected by this… do you still think UDP updates are to blame? I suppose it doesn’t hurt to give it a try.
Thanks,
Bryan