[Condor-users] Can't find address of local schedd

Marcelo Chiapparini

unread,

Apr 13, 2009, 4:36:46 PM4/13/09

to Condor users

Hello,

I am running Condor 7.05 in a Rocks 5.1 cluster. Users were submitting
jobs normally, but today, suddenly, Condor doesn't accept any jobs.
Below is the message we obtain when trying to submit:

$ condor_submit hello.sub

ERROR: Can't find address of local schedd

and whit condor_q

$ condor_q
Error:

Extra Info: You probably saw this error because the condor_schedd is not
running on the machine you are trying to query. If the condor_schedd is not
running, the Condor system will not be able to find an address and port to
connect to and satisfy this request. Please make sure the Condor daemons are
running and try again.

Extra Info: If the condor_schedd is running on the machine you are trying to
query and you still see the error, the most likely cause is that you have
setup a personal Condor, you have not defined SCHEDD_NAME in your
condor_config file, and something is wrong with your SCHEDD_ADDRESS_FILE
setting. You must define either or both of those settings in your config
file, or you must use the -name option to condor_q. Please see the Condor
manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE.

Apparently Condor daemons are not running any more:

$ ps -ef | grep condor
500 21305 20730 0 17:25 pts/2 00:00:00 grep condor

I am new to Condor. So, I will thanks any help.

Thanks in advance

Marcelo
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-use...@cs.wisc.edu with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/

Robert Rati

unread,

Apr 13, 2009, 5:23:24 PM4/13/09

to Condor-Users Mail List

Looks like you need to restart condor. You can do this by running
"condor_master" as root.

Rob

Marcelo Chiapparini

unread,

Apr 13, 2009, 6:38:36 PM4/13/09

to Condor-Users Mail List

Hi Robert,

thank you for the help. After reinitializing the front-end by hand, I
continue with the problem:

$ condor_status
CEDAR:6001:Failed to connect to <xxx.xx.xxx.xx:xxxx>
Error: Couldn't contact the condor_collector on cluster-name.domain

Extra Info: the condor_collector is a process that runs on the central
manager of your Condor pool and collects the status of all the machines and
jobs in the Condor pool. The condor_collector might not be running, it might
be refusing to communicate with you, there might be a network problem, or
there may be some other problem. Check with your system administrator to fix
this problem.

If you are the system administrator, check that the condor_collector is
running on cluster-name.domain, check the HOSTALLOW configuration in your
condor_config, and check the MasterLog and CollectorLog files in your log
directory for possible clues as to why the condor_collector is not
responding. Also see the Troubleshooting section of the manual.

I am looking for the Masterlog files, but I can't find them. Where
they are suppose to be? The troubleshooting section of the manual
doesn't help.

The condor_master command doesn't help too:

# condor_master
# condor_status
CEDAR:6001:Failed to connect to ... <snip>

Thanks very much for the help

Marcelo

2009/4/13 Robert Rati <rr...@redhat.com>:

Jason Stowe

unread,

Apr 13, 2009, 11:29:51 PM4/13/09

to Condor-Users Mail List

Marcelo,
The errors you are getting could be caused by a few problems, so below
is a more detailed process to help you debug this:

> $ condor_status
> CEDAR:6001:Failed to connect to <xxx.xx.xxx.xx:xxxx>
> Error: Couldn't contact the condor_collector on cluster-name.domain
>
> Extra Info: the condor_collector is a process that runs on the central

...

> responding. Also see the Troubleshooting section of the manual.

This error indicates that the condor_status command couldn't
communicate with the collector. This most likely means:
(1) the collector (and the condor_master/other daemons) isn't running
on the central manager,
(2) the collector is running, but not on the server the command thinks
it is, or
(3) the collector is running where condor_status thinks it is, but
condor_status doesn't have permission to talk with it.

To rule out #1, on the central manager of the pool, after you run
condor_master on the head node for the cluster, what do you get when
you run:

$ ps -ef | grep condor

Does the condor_master/condor_collector show up here?
This should tell you the directory log files are located in:
$ condor_config_val -config -verbose LOG

To check for option #2, determine where the collector should be by running:
condor_config_val -verbose COLLECTOR_HOST
Does this match the machine you expect to be the central manager?

> I am looking for the Masterlog files, but I can't find them. Where
> they are suppose to be? The troubleshooting section of the manual
> doesn't help.

The master log is located:
condor_config_val MASTER_LOG

> The condor_master command doesn't help too:
>
> # condor_master

condor_master merely starts the condor_master daemon, which on the
central manager for the pool (see the COLLECTOR_HOST setting), should
start the collector and other daemons.
For situation #3, do you get permission denied errors in the logfiles?
Checking the HOSTALLOW_READ settings on the central manager will be
the next step:
http://www.cs.wisc.edu/condor/manual/v7.2/3_6Security.html#sec:Host-Security

For further help you can also set TOOL_DEBUG = D_FULLDEBUG and run
condor_status -debug.

Good luck, and I hope this helps.

Best,
Jason

--
===================================
Jason A. Stowe
main: 888.292.5320

http://www.cyclecloud.com
http://www.cyclecomputing.com

Cycle Computing, LLC
Leader in Condor Grid Solutions
Enterprise Condor Support and Management Tools

Come see us at Bio-IT World in Boston!

Carsten Aulbert

unread,

Apr 14, 2009, 5:38:23 AM4/14/09

to condor...@cs.wisc.edu

Hi Marcelo,

Marcelo Chiapparini schrieb:

> I am running Condor 7.05 in a Rocks 5.1 cluster. Users were submitting
> jobs normally, but today, suddenly, Condor doesn't accept any jobs.
> Below is the message we obtain when trying to submit:
>
> $ condor_submit hello.sub
>
> ERROR: Can't find address of local schedd
>

if even the condor_master is dead, you might want to look into the log
files (path will vary, is governed by your configuration) and then
restart the master (which will take care of the other daemons).

For me it would be /etc/init.d/condor start, but it might be just
condor_master for you...

HTH

Carsten

Marcelo Chiapparini

unread,

Apr 14, 2009, 9:31:10 AM4/14/09

to Condor-Users Mail List

Jason,

thank you for the help. Below are the results of your advices:

2009/4/14 Jason Stowe <jst...@cyclecomputing.com>:

> Marcelo,
> The errors you are getting could be caused by a few problems, so below
> is a more detailed process to help you debug this:
>> $ condor_status
>> CEDAR:6001:Failed to connect to <xxx.xx.xxx.xx:xxxx>
>> Error: Couldn't contact the condor_collector on cluster-name.domain
>>
>> Extra Info: the condor_collector is a process that runs on the central
> ...
>> responding. Also see the Troubleshooting section of the manual.
>
> This error indicates that the condor_status command couldn't
> communicate with the collector. This most likely means:
> (1) the collector (and the condor_master/other daemons) isn't running
> on the central manager,
> (2) the collector is running, but not on the server the command thinks
> it is, or
> (3) the collector is running where condor_status thinks it is, but
> condor_status doesn't have permission to talk with it.
>
> To rule out #1, on the central manager of the pool, after you run
> condor_master on the head node for the cluster, what do you get when
> you run:
> $ ps -ef | grep condor
> Does the condor_master/condor_collector show up here?

No. Deamons are not running on the central node:

# condor_master
# ps -ef | grep condor
root 25980 15002 0 09:41 pts/1 00:00:00 grep condor

> This should tell you the directory log files are located in:
> $ condor_config_val -config -verbose LOG

I found they! They are in /var/opt/condor/log. Thanks!

> To check for option #2, determine where the collector should be by running:
> condor_config_val -verbose COLLECTOR_HOST

# condor_config_val -verbose COLLECTOR_HOST
COLLECTOR_HOST: lacad-dft.fis.uerj.br

> Does this match the machine you expect to be the central manager?

Yes!

> For situation #3, do you get permission denied errors in the logfiles?
> Checking the HOSTALLOW_READ settings on the central manager will be
> the next step:
> http://www.cs.wisc.edu/condor/manual/v7.2/3_6Security.html#sec:Host-Security

# condor_config_val -verbose HOSTALLOW_READ
HOSTALLOW_READ: *
Defined in '/opt/condor/etc/condor_config', line 209.

Looking at the CollectorLog file, it is clear that something happened
at 14:42:01, because the write to this log was interrupted in the
middle of a sentence. See the last lines of the CollectorLog:

<snip>
4/13 14:40:22 NegotiatorAd : Inserting ** "< lacad-dft.fis.uerj.br >"
4/13 14:41:55 (Sending 84 ads in response to query)
4/13 14:41:55 Got QUERY_STARTD_PVT_ADS
4/13 14:41:55 (Sending 64 ads in response to query)
4/13 14:42:01 Got QUERY

and nothing more was written since this. This was yesterday, when
Condor stops to work.
Looking at the MasterLog file we find the same. Again, things were
interrupted abruptly at 14:42:14. (sorry for the long log, but I want
to give a good idea of what happened...)

<snip>
4/10 10:50:18 Preen pid is 10018
4/10 10:50:18 Child 10018 died, but not a daemon -- Ignored
4/11 10:50:18 Preen pid is 12156
4/11 10:50:18 Child 12156 died, but not a daemon -- Ignored
4/12 10:50:18 Preen pid is 10655
4/12 10:50:18 Child 10655 died, but not a daemon -- Ignored
4/13 10:50:18 Preen pid is 18824
4/13 10:50:18 Child 18824 died, but not a daemon -- Ignored
4/13 14:34:51 The SCHEDD (pid 4063) exited with status 4
4/13 14:34:51 Sending obituary for "/opt/condor/sbin/condor_schedd"
4/13 14:34:51 restarting /opt/condor/sbin/condor_schedd in 10 seconds
4/13 14:35:01 Started DaemonCore process
"/opt/condor/sbin/condor_schedd", pid and pgroup = 20203
4/13 14:35:01 The SCHEDD (pid 20203) exited with status 4
4/13 14:35:01 Sending obituary for "/opt/condor/sbin/condor_schedd"
4/13 14:35:01 restarting /opt/condor/sbin/condor_schedd in 11 seconds
4/13 14:35:12 Started DaemonCore process
"/opt/condor/sbin/condor_schedd", pid and pgroup = 20210
4/13 14:35:12 The SCHEDD (pid 20210) exited with status 44
4/13 14:35:12 Sending obituary for "/opt/condor/sbin/condor_schedd"
4/13 14:35:12 restarting /opt/condor/sbin/condor_schedd in 13 seconds
4/13 14:35:25 Started DaemonCore process
"/opt/condor/sbin/condor_schedd", pid and pgroup = 20214
4/13 14:35:25 The SCHEDD (pid 20214) exited with status 44
4/13 14:35:25 Sending obituary for "/opt/condor/sbin/condor_schedd"
4/13 14:35:25 restarting /opt/condor/sbin/condor_schedd in 17 seconds
4/13 14:35:42 Started DaemonCore process
"/opt/condor/sbin/condor_schedd", pid and pgroup = 20218
4/13 14:35:42 The SCHEDD (pid 20218) exited with status 44
4/13 14:35:42 restarting /opt/condor/sbin/condor_schedd in 25 seconds
4/13 14:36:07 Started DaemonCore process
"/opt/condor/sbin/condor_schedd", pid and pgroup = 20219
4/13 14:36:07 The SCHEDD (pid 20219) exited with status 44
4/13 14:36:07 restarting /opt/condor/sbin/condor_schedd in 41 seconds
4/13 14:36:48 Started DaemonCore process
"/opt/condor/sbin/condor_schedd", pid and pgroup = 20220
4/13 14:36:48 The SCHEDD (pid 20220) exited with status 44
4/13 14:36:48 restarting /opt/condor/sbin/condor_schedd in 73 seconds
4/13 14:38:01 Started DaemonCore process
"/opt/condor/sbin/condor_schedd", pid and pgroup = 20222
4/13 14:38:01 The SCHEDD (pid 20222) exited with status 44
4/13 14:38:01 restarting /opt/condor/sbin/condor_schedd in 137 seconds
4/13 14:40:18 Started DaemonCore process
"/opt/condor/sbin/condor_schedd", pid and pgroup = 20226
4/13 14:40:18 The SCHEDD (pid 20226) exited with status 44
4/13 14:40:18 restarting /opt/condor/sbin/condor_schedd in 265 seconds
4/13 14:42:01 The COLLECTOR (pid 3779) exited with status 44
4/13 14:42:01 Sending obituary for "/opt/condor/sbin/condor_collector"
4/13 14:42:01 restarting /opt/condor/sbin/condor_collector in 10 seconds
4/13 14:42:01 attempt to connect to <152.92.133.74:9618> failed:
Connection refused (connect errno = 111).
4/13 14:42:01 ERROR: SECMAN:2003:TCP connection to <152.92.133.74:9618> failed

4/13 14:42:01 Failed to start non-blocking update to <152.92.133.74:9618>.
4/13 14:42:11 Started DaemonCore process
"/opt/condor/sbin/condor_collector", pid and pgroup = 20233
4/13 14:42:14 attempt to connect to <152.92.133.74:9618> failed:
Connection refused (connect errno = 111).
4/13 14:42:14 ERROR: SECMAN:2003:TCP connection to <152.92.133.74:9618> failed

4/13 14:42:14 Failed to start non-blocking update to <152.92.133.74:9618>.
4/13 14:42:14 The COLLECTOR (pid 20233) exited with status 44
4/13 14:42:14 Sending obituary for "/opt/condor/sbin/condor_collector"
4/13 14:42:

Is this a physical problem with the hardware? I reboot physically the
cluster today, 4/14, but Condor refuses to run. Nothing was written to
the logs since yesterday 4/13 14:42:14.

Any help will be very welcome,

Regards

Marcelo

Robert Rati

unread,

Apr 14, 2009, 10:05:42 AM4/14/09

to Condor-Users Mail List

If the master died unexpectedly without being able to clean
/var/opt/condor/logup after itself, it's possible there is a lock file
hanging around that is preventing condor_master from starting again. In
your log directory, which appears to be /var/opt/condor/log for your
configuration, look for a file named InstanceLock. If it exists, try
removing it as root and then running condor_master again.

Rob

Robert Futrick

unread,

Apr 14, 2009, 10:16:11 AM4/14/09

to Condor-Users Mail List

Hello Marcelo,

Based on what you've written, it sounds like you're experiencing case #1 in Jason's email. Your daemons are configured to run on the correct server, but stopped running suddenly and now will not start again.

Considering you didn't make any other changes, and the sudden nature of the stop, you might be out of disk space. That's a common cause of daemons stopping logging mid-logline. Another option is that permissions or something else changed to prevent Condor from writing to that directory.

Try running "df" on the /var/opt/condor/log to make sure you have disk space. Being out of disk space is not the only reason Condor could have stopped working, but it is a good initial check.

Regards,
Rob

-- 

===================================
Rob Futrick
main: 888.292.5320

Cycle Computing, LLC
Leader in Condor Grid Solutions
Enterprise Condor Support and CycleServer Management Tools

http://www.cyclecomputing.com
http://www.cyclecloud.com

Marcelo Chiapparini

unread,

Apr 14, 2009, 10:34:11 AM4/14/09

to Condor-Users Mail List

Hi Rob,

Bingo! you was right:

# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 15872604 4889488 10163804 33% /
/dev/sda5 828959588 2753132 783418536 1% /state/partition1
/dev/sda2 3968124 3831872 0 100% /var
tmpfs 4087108 0 4087108 0% /dev/shm
tmpfs 1995656 4992 1990664 1% /var/lib/ganglia/rrds

/var is full!

Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda2 3968124 3831872 0 100% /var

Now I have to figure out what is the reason. To fix it and to prevent
it to happen again. The user is compiling his programs with
condor_compile and submitting them in the standard universe. May be
/var is full with his checkpoint images? If not, any help will be very
well come!

Regards

Marcelo

ps: I want to thanks all the support from people of this marvelous list!

2009/4/14 Robert Futrick <rfut...@cyclecomputing.com>:

--
Marcelo Chiapparini
http://sites.google.com/site/marcelochiapparini

Marcelo Chiapparini

unread,

Apr 14, 2009, 11:30:38 AM4/14/09

to Condor-Users Mail List

Hello,

I looked at /var/opt/condor/spool directory. Here is it content:

# ls -all
total 3508704
drwxr-xr-x 3 condor condor 4096 Apr 13 14:35 .
drwxr-xr-x 5 condor condor 4096 Dec 15 11:15 ..
-rw------- 1 condor condor 248004 Apr 13 14:41 Accountantnew.log
-rwxr-xr-x 1 condor condor 2077155 Apr 13 13:45 cluster15.ickpt.subproc0
-rwxr-xr-x 1 condor condor 2077155 Apr 13 08:50 cluster8.ickpt.subproc0
-rw-r--r-- 1 condor condor 277414943 Apr 13 11:43 cluster8.proc0.subproc0
-rw-r--r-- 1 condor condor 2322432 Apr 13 14:34 cluster8.proc0.subproc0.tmp
-rw-r--r-- 1 condor condor 277414943 Apr 13 12:09 cluster8.proc1.subproc0
-rw-r--r-- 1 condor condor 277419039 Apr 13 11:33 cluster8.proc2.subproc0
-rw-r--r-- 1 condor condor 277419039 Apr 13 12:02 cluster8.proc4.subproc0
-rw-r--r-- 1 condor condor 277414943 Apr 13 11:43 cluster8.proc5.subproc0
-rwxr-xr-x 1 condor condor 2077155 Apr 13 09:07 cluster9.ickpt.subproc0
-rw-r--r-- 1 condor condor 101482496 Apr 13 12:24 cluster9.proc0.subproc0.tmp
-rw-r--r-- 1 condor condor 277410847 Apr 13 11:48 cluster9.proc10.subproc0
-rw-r--r-- 1 condor condor 277414943 Apr 13 11:58 cluster9.proc14.subproc0
-rw-r--r-- 1 condor condor 277410847 Apr 13 11:48 cluster9.proc15.subproc0
-rw-r--r-- 1 condor condor 1024000 Apr 13 14:29 cluster9.proc15.subproc0.tmp
-rw-r--r-- 1 condor condor 43974656 Apr 13 12:24 cluster9.proc16.subproc0.tmp
-rw-r--r-- 1 condor condor 16863232 Apr 13 12:33 cluster9.proc17.subproc0.tmp
-rw-r--r-- 1 condor condor 277414943 Apr 13 11:48 cluster9.proc1.subproc0
-rw-r--r-- 1 condor condor 77766656 Apr 13 12:33 cluster9.proc2.subproc0.tmp
-rw-r--r-- 1 condor condor 277419039 Apr 13 11:58 cluster9.proc4.subproc0
-rw-r--r-- 1 condor condor 277414943 Apr 13 11:48 cluster9.proc6.subproc0
-rw-r--r-- 1 condor condor 9547776 Apr 13 12:33 cluster9.proc7.subproc0.tmp
-rw-r--r-- 1 condor condor 277419039 Apr 13 11:58 cluster9.proc9.subproc0
-rw-r--r-- 1 condor condor 218377 Apr 13 13:45 history
-rw------- 2 condor condor 262144 Apr 13 14:34 job_queue.log
-rw------- 2 condor condor 262144 Apr 13 14:34 job_queue.log.4
-rw------- 1 condor condor 0 Apr 13 14:35 job_queue.log.tmp
drwxrwxrwt 2 condor condor 4096 Dec 15 11:15 local_univ_execute

As can be seen, there are many files named clusterN.procM.subproc0
which are huge (277 MB). The content of the directory amounts 3.5 GB.
The size of the /var directory is 3.8 GB (the default Rocks
installation). So, /spool directory is consuming all the room in /var.
What is the content of clusterN.procM.subproc0 files? How can I
prevent these files to grow so much? It is safe to erase them?

Thanks in advance

Marcelo

2009/4/14 Marcelo Chiapparini <marcelo...@gmail.com>:

Marcelo Chiapparini

unread,

Apr 14, 2009, 1:52:22 PM4/14/09

to Condor-Users Mail List

Dear condorers

after cleaning up the /var/opt/condor/spool directory, I was be able
to start Condor with condor_master. Now things and up an running
again. Now I have to see why files
clusterN.procM.subproc0 became so big.