I am running Condor 7.05 in a Rocks 5.1 cluster. Users were submitting
jobs normally, but today Condor suddenly stopped accepting any jobs.
Below is the message we get when trying to submit:
$ condor_submit hello.sub
ERROR: Can't find address of local schedd
and with condor_q:
$ condor_q
Error:
Extra Info: You probably saw this error because the condor_schedd is not
running on the machine you are trying to query. If the condor_schedd is not
running, the Condor system will not be able to find an address and port to
connect to and satisfy this request. Please make sure the Condor daemons are
running and try again.
Extra Info: If the condor_schedd is running on the machine you are trying to
query and you still see the error, the most likely cause is that you have
setup a personal Condor, you have not defined SCHEDD_NAME in your
condor_config file, and something is wrong with your SCHEDD_ADDRESS_FILE
setting. You must define either or both of those settings in your config
file, or you must use the -name option to condor_q. Please see the Condor
manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE.
Apparently the Condor daemons are not running any more:
$ ps -ef | grep condor
500 21305 20730 0 17:25 pts/2 00:00:00 grep condor
I am new to Condor, so any help would be appreciated.
Thanks in advance
Marcelo
Rob,
thank you for the help. After reinitializing the front-end by hand, I
still have the problem:
$ condor_status
CEDAR:6001:Failed to connect to <xxx.xx.xxx.xx:xxxx>
Error: Couldn't contact the condor_collector on cluster-name.domain
Extra Info: the condor_collector is a process that runs on the central
manager of your Condor pool and collects the status of all the machines and
jobs in the Condor pool. The condor_collector might not be running, it might
be refusing to communicate with you, there might be a network problem, or
there may be some other problem. Check with your system administrator to fix
this problem.
If you are the system administrator, check that the condor_collector is
running on cluster-name.domain, check the HOSTALLOW configuration in your
condor_config, and check the MasterLog and CollectorLog files in your log
directory for possible clues as to why the condor_collector is not
responding. Also see the Troubleshooting section of the manual.
I am looking for the MasterLog file, but I can't find it. Where is it
supposed to be? The troubleshooting section of the manual doesn't help.
The condor_master command doesn't help either:
# condor_master
# condor_status
CEDAR:6001:Failed to connect to ... <snip>
Thanks very much for the help
Marcelo
2009/4/13 Robert Rati <rr...@redhat.com>:
This error indicates that the condor_status command couldn't
communicate with the collector. This most likely means:
(1) the collector (and the condor_master/other daemons) isn't running
on the central manager,
(2) the collector is running, but not on the server the command thinks
it is, or
(3) the collector is running where condor_status thinks it is, but
condor_status doesn't have permission to talk with it.
To rule out #1, on the central manager of the pool, after you run
condor_master on the head node for the cluster, what do you get when
you run:
$ ps -ef | grep condor
Does the condor_master/condor_collector show up here?
This should tell you the directory log files are located in:
$ condor_config_val -config -verbose LOG
To check for option #2, determine where the collector should be by running:
condor_config_val -verbose COLLECTOR_HOST
Does this match the machine you expect to be the central manager?
> I am looking for the MasterLog file, but I can't find it. Where is it
> supposed to be? The troubleshooting section of the manual doesn't help.
The location of the master log is given by:
condor_config_val MASTER_LOG
> The condor_master command doesn't help either:
>
> # condor_master
condor_master merely starts the condor_master daemon, which, on the
central manager for the pool (see the COLLECTOR_HOST setting), should
start the collector and other daemons.
For situation #3, do you get permission denied errors in the logfiles?
Checking the HOSTALLOW_READ settings on the central manager will be
the next step:
http://www.cs.wisc.edu/condor/manual/v7.2/3_6Security.html#sec:Host-Security
For further help you can also set TOOL_DEBUG = D_FULLDEBUG and run
condor_status -debug.
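For example (a sketch; which config file the tools read depends on your
install, so adjust this to wherever your local configuration lives), add
TOOL_DEBUG = D_FULLDEBUG
to that config file and then run:
$ condor_status -debug
The debug output should show which address the tool is trying and whether
the failure is a refused connection, a timeout, or a permission denial.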
Good luck, and I hope this helps.
Best,
Jason
--
===================================
Jason A. Stowe
main: 888.292.5320
http://www.cyclecloud.com
http://www.cyclecomputing.com
Cycle Computing, LLC
Leader in Condor Grid Solutions
Enterprise Condor Support and Management Tools
Come see us at Bio-IT World in Boston!
Marcelo Chiapparini wrote:
> I am running Condor 7.05 in a Rocks 5.1 cluster. Users were submitting
> jobs normally, but today Condor suddenly stopped accepting any jobs.
> Below is the message we get when trying to submit:
>
> $ condor_submit hello.sub
>
> ERROR: Can't find address of local schedd
>
If even the condor_master is dead, you might want to look into the log
files (the path varies and is governed by your configuration) and then
restart the master (which will take care of the other daemons).
For me it would be /etc/init.d/condor start, but it might be just
condor_master for you...
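A minimal sketch of that (paths and the init script name will differ on
your install): first locate the log directory and look at the master log,
$ condor_config_val LOG
$ tail -n 50 "$(condor_config_val MASTER_LOG)"
and then restart the master, which respawns the other daemons:
# /etc/init.d/condor start
or simply
# condor_master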
HTH
Carsten
Thank you for the help. Below are the results of your suggestions:
2009/4/14 Jason Stowe <jst...@cyclecomputing.com>:
> Marcelo,
> The errors you are getting could be caused by a few problems, so below
> is a more detailed process to help you debug this:
>> $ condor_status
>> CEDAR:6001:Failed to connect to <xxx.xx.xxx.xx:xxxx>
>> Error: Couldn't contact the condor_collector on cluster-name.domain
>>
>> Extra Info: the condor_collector is a process that runs on the central
> ...
>> responding. Also see the Troubleshooting section of the manual.
>
> This error indicates that the condor_status command couldn't
> communicate with the collector. This most likely means:
> (1) the collector (and the condor_master/other daemons) isn't running
> on the central manager,
> (2) the collector is running, but not on the server the command thinks
> it is, or
> (3) the collector is running where condor_status thinks it is, but
> condor_status doesn't have permission to talk with it.
>
> To rule out #1, on the central manager of the pool, after you run
> condor_master on the head node for the cluster, what do you get when
> you run:
> $ ps -ef | grep condor
> Does the condor_master/condor_collector show up here?
No. The daemons are not running on the central node:
# condor_master
# ps -ef | grep condor
root 25980 15002 0 09:41 pts/1 00:00:00 grep condor
> This should tell you the directory log files are located in:
> $ condor_config_val -config -verbose LOG
I found them! They are in /var/opt/condor/log. Thanks!
> To check for option #2, determine where the collector should be by running:
> condor_config_val -verbose COLLECTOR_HOST
# condor_config_val -verbose COLLECTOR_HOST
COLLECTOR_HOST: lacad-dft.fis.uerj.br
> Does this match the machine you expect to be the central manager?
Yes!
> For situation #3, do you get permission denied errors in the logfiles?
> Checking the HOSTALLOW_READ settings on the central manager will be
> the next step:
> http://www.cs.wisc.edu/condor/manual/v7.2/3_6Security.html#sec:Host-Security
# condor_config_val -verbose HOSTALLOW_READ
HOSTALLOW_READ: *
Defined in '/opt/condor/etc/condor_config', line 209.
Looking at the CollectorLog file, it is clear that something happened
at 14:42:01, because writing to this log stopped abruptly. See the
last lines of the CollectorLog:
<snip>
4/13 14:40:22 NegotiatorAd : Inserting ** "< lacad-dft.fis.uerj.br >"
4/13 14:41:55 (Sending 84 ads in response to query)
4/13 14:41:55 Got QUERY_STARTD_PVT_ADS
4/13 14:41:55 (Sending 64 ads in response to query)
4/13 14:42:01 Got QUERY
and nothing more was written after that. This was yesterday, when
Condor stopped working.
Looking at the MasterLog file I find the same thing. Again, things were
interrupted abruptly at 14:42:14 (sorry for the long log, but I want
to give a good idea of what happened...)
<snip>
4/10 10:50:18 Preen pid is 10018
4/10 10:50:18 Child 10018 died, but not a daemon -- Ignored
4/11 10:50:18 Preen pid is 12156
4/11 10:50:18 Child 12156 died, but not a daemon -- Ignored
4/12 10:50:18 Preen pid is 10655
4/12 10:50:18 Child 10655 died, but not a daemon -- Ignored
4/13 10:50:18 Preen pid is 18824
4/13 10:50:18 Child 18824 died, but not a daemon -- Ignored
4/13 14:34:51 The SCHEDD (pid 4063) exited with status 4
4/13 14:34:51 Sending obituary for "/opt/condor/sbin/condor_schedd"
4/13 14:34:51 restarting /opt/condor/sbin/condor_schedd in 10 seconds
4/13 14:35:01 Started DaemonCore process
"/opt/condor/sbin/condor_schedd", pid and pgroup = 20203
4/13 14:35:01 The SCHEDD (pid 20203) exited with status 4
4/13 14:35:01 Sending obituary for "/opt/condor/sbin/condor_schedd"
4/13 14:35:01 restarting /opt/condor/sbin/condor_schedd in 11 seconds
4/13 14:35:12 Started DaemonCore process
"/opt/condor/sbin/condor_schedd", pid and pgroup = 20210
4/13 14:35:12 The SCHEDD (pid 20210) exited with status 44
4/13 14:35:12 Sending obituary for "/opt/condor/sbin/condor_schedd"
4/13 14:35:12 restarting /opt/condor/sbin/condor_schedd in 13 seconds
4/13 14:35:25 Started DaemonCore process
"/opt/condor/sbin/condor_schedd", pid and pgroup = 20214
4/13 14:35:25 The SCHEDD (pid 20214) exited with status 44
4/13 14:35:25 Sending obituary for "/opt/condor/sbin/condor_schedd"
4/13 14:35:25 restarting /opt/condor/sbin/condor_schedd in 17 seconds
4/13 14:35:42 Started DaemonCore process
"/opt/condor/sbin/condor_schedd", pid and pgroup = 20218
4/13 14:35:42 The SCHEDD (pid 20218) exited with status 44
4/13 14:35:42 restarting /opt/condor/sbin/condor_schedd in 25 seconds
4/13 14:36:07 Started DaemonCore process
"/opt/condor/sbin/condor_schedd", pid and pgroup = 20219
4/13 14:36:07 The SCHEDD (pid 20219) exited with status 44
4/13 14:36:07 restarting /opt/condor/sbin/condor_schedd in 41 seconds
4/13 14:36:48 Started DaemonCore process
"/opt/condor/sbin/condor_schedd", pid and pgroup = 20220
4/13 14:36:48 The SCHEDD (pid 20220) exited with status 44
4/13 14:36:48 restarting /opt/condor/sbin/condor_schedd in 73 seconds
4/13 14:38:01 Started DaemonCore process
"/opt/condor/sbin/condor_schedd", pid and pgroup = 20222
4/13 14:38:01 The SCHEDD (pid 20222) exited with status 44
4/13 14:38:01 restarting /opt/condor/sbin/condor_schedd in 137 seconds
4/13 14:40:18 Started DaemonCore process
"/opt/condor/sbin/condor_schedd", pid and pgroup = 20226
4/13 14:40:18 The SCHEDD (pid 20226) exited with status 44
4/13 14:40:18 restarting /opt/condor/sbin/condor_schedd in 265 seconds
4/13 14:42:01 The COLLECTOR (pid 3779) exited with status 44
4/13 14:42:01 Sending obituary for "/opt/condor/sbin/condor_collector"
4/13 14:42:01 restarting /opt/condor/sbin/condor_collector in 10 seconds
4/13 14:42:01 attempt to connect to <152.92.133.74:9618> failed:
Connection refused (connect errno = 111).
4/13 14:42:01 ERROR: SECMAN:2003:TCP connection to <152.92.133.74:9618> failed
4/13 14:42:01 Failed to start non-blocking update to <152.92.133.74:9618>.
4/13 14:42:11 Started DaemonCore process
"/opt/condor/sbin/condor_collector", pid and pgroup = 20233
4/13 14:42:14 attempt to connect to <152.92.133.74:9618> failed:
Connection refused (connect errno = 111).
4/13 14:42:14 ERROR: SECMAN:2003:TCP connection to <152.92.133.74:9618> failed
4/13 14:42:14 Failed to start non-blocking update to <152.92.133.74:9618>.
4/13 14:42:14 The COLLECTOR (pid 20233) exited with status 44
4/13 14:42:14 Sending obituary for "/opt/condor/sbin/condor_collector"
4/13 14:42:
Is this a physical problem with the hardware? I physically rebooted the
cluster today, 4/14, but Condor refuses to run. Nothing has been written
to the logs since yesterday, 4/13 14:42:14.
Any help will be very welcome,
Regards
Marcelo
Rob
--
===================================
Rob Futrick
main: 888.292.5320
Cycle Computing, LLC
Leader in Condor Grid Solutions
Enterprise Condor Support and CycleServer Management Tools
http://www.cyclecomputing.com
http://www.cyclecloud.com
Bingo! You were right:
# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 15872604 4889488 10163804 33% /
/dev/sda5 828959588 2753132 783418536 1% /state/partition1
/dev/sda2 3968124 3831872 0 100% /var
tmpfs 4087108 0 4087108 0% /dev/shm
tmpfs 1995656 4992 1990664 1% /var/lib/ganglia/rrds
/var is full!
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda2 3968124 3831872 0 100% /var
Now I have to figure out the reason, fix it, and prevent it from
happening again. The user is compiling his programs with
condor_compile and submitting them in the standard universe. Maybe
/var is full of his checkpoint images? If not, any help will be very
welcome!
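To see what is eating the space, I will start with something like this
(just standard tools; the -x flag keeps du on the /var filesystem):
# du -xk --max-depth=2 /var | sort -n | tail -n 20
# du -sh /var/opt/condor/spool /var/opt/condor/log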
Regards
Marcelo
PS: I want to thank everyone on this marvelous list for all the support!
2009/4/14 Robert Futrick <rfut...@cyclecomputing.com>:
--
Marcelo Chiapparini
http://sites.google.com/site/marcelochiapparini
I looked at the /var/opt/condor/spool directory. Here is its content:
# ls -all
total 3508704
drwxr-xr-x 3 condor condor 4096 Apr 13 14:35 .
drwxr-xr-x 5 condor condor 4096 Dec 15 11:15 ..
-rw------- 1 condor condor 248004 Apr 13 14:41 Accountantnew.log
-rwxr-xr-x 1 condor condor 2077155 Apr 13 13:45 cluster15.ickpt.subproc0
-rwxr-xr-x 1 condor condor 2077155 Apr 13 08:50 cluster8.ickpt.subproc0
-rw-r--r-- 1 condor condor 277414943 Apr 13 11:43 cluster8.proc0.subproc0
-rw-r--r-- 1 condor condor 2322432 Apr 13 14:34 cluster8.proc0.subproc0.tmp
-rw-r--r-- 1 condor condor 277414943 Apr 13 12:09 cluster8.proc1.subproc0
-rw-r--r-- 1 condor condor 277419039 Apr 13 11:33 cluster8.proc2.subproc0
-rw-r--r-- 1 condor condor 277419039 Apr 13 12:02 cluster8.proc4.subproc0
-rw-r--r-- 1 condor condor 277414943 Apr 13 11:43 cluster8.proc5.subproc0
-rwxr-xr-x 1 condor condor 2077155 Apr 13 09:07 cluster9.ickpt.subproc0
-rw-r--r-- 1 condor condor 101482496 Apr 13 12:24 cluster9.proc0.subproc0.tmp
-rw-r--r-- 1 condor condor 277410847 Apr 13 11:48 cluster9.proc10.subproc0
-rw-r--r-- 1 condor condor 277414943 Apr 13 11:58 cluster9.proc14.subproc0
-rw-r--r-- 1 condor condor 277410847 Apr 13 11:48 cluster9.proc15.subproc0
-rw-r--r-- 1 condor condor 1024000 Apr 13 14:29 cluster9.proc15.subproc0.tmp
-rw-r--r-- 1 condor condor 43974656 Apr 13 12:24 cluster9.proc16.subproc0.tmp
-rw-r--r-- 1 condor condor 16863232 Apr 13 12:33 cluster9.proc17.subproc0.tmp
-rw-r--r-- 1 condor condor 277414943 Apr 13 11:48 cluster9.proc1.subproc0
-rw-r--r-- 1 condor condor 77766656 Apr 13 12:33 cluster9.proc2.subproc0.tmp
-rw-r--r-- 1 condor condor 277419039 Apr 13 11:58 cluster9.proc4.subproc0
-rw-r--r-- 1 condor condor 277414943 Apr 13 11:48 cluster9.proc6.subproc0
-rw-r--r-- 1 condor condor 9547776 Apr 13 12:33 cluster9.proc7.subproc0.tmp
-rw-r--r-- 1 condor condor 277419039 Apr 13 11:58 cluster9.proc9.subproc0
-rw-r--r-- 1 condor condor 218377 Apr 13 13:45 history
-rw------- 2 condor condor 262144 Apr 13 14:34 job_queue.log
-rw------- 2 condor condor 262144 Apr 13 14:34 job_queue.log.4
-rw------- 1 condor condor 0 Apr 13 14:35 job_queue.log.tmp
drwxrwxrwt 2 condor condor 4096 Dec 15 11:15 local_univ_execute
As can be seen, there are many files named clusterN.procM.subproc0
which are huge (277 MB each). The contents of the directory amount to 3.5 GB.
The size of the /var partition is 3.8 GB (the default Rocks
installation), so the spool directory is consuming all the room in /var.
What is the content of the clusterN.procM.subproc0 files? How can I
prevent these files from growing so large? Is it safe to erase them?
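One thing I am considering, in case these files are needed, is to point
the spool at the big /state/partition1 filesystem instead of /var (a
sketch; the local config file and the target directory below are only my
guesses for this Rocks layout). First check where the spool lives now,
# condor_config_val SPOOL
then set, in a local config file that is read after /opt/condor/etc/condor_config,
SPOOL = /state/partition1/condor/spool
create the directory owned by the condor user, and restart Condor:
# mkdir -p /state/partition1/condor/spool
# chown condor:condor /state/partition1/condor/spool
# condor_restart
Does that sound reasonable?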
Thanks in advance
Marcelo
2009/4/14 Marcelo Chiapparini <marcelo...@gmail.com>:
After cleaning up the /var/opt/condor/spool directory, I was able
to start Condor with condor_master. Now things are up and running
again. Next I have to see why the clusterN.procM.subproc0 files
became so big.
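To track that down, I will map the file names back to the jobs, since
they encode the job ids (cluster8.proc0.subproc0 is job 8.0), with
something like this (a sketch; ImageSize is the standard job attribute,
if I read the manual right):
# condor_history 8.0
# condor_q -long 8.0 | grep -i ImageSize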
Thanks a lot to everyone who helped!
Regards