Sharon Goliath wrote:
> Hi Michael,
>
> Do you have anything new on this problem?
>
> Thanks,
> Sharon
>
> Michael Paterson wrote:
>> Do you have the container.log (the Nimbus log) showing where it tries
>> and fails to shut down the idle machines (627, 628, 629, 630)? Or is
>> the log empty for those attempts?
>>
>> Sharon Goliath wrote:
>>> So, yesterday John updated CS to the latest version of dev (thanks
>>> for the URL).
>>>
>>> I submitted 100 test jobs, which ran successfully, and finished
>>> about 17:30 Tuesday. At approximately the same time, another user
>>> submitted 263 jobs, which are still running. Here's the output from
>>> cloud_status -m this morning:
>>>
>>> ID   HOSTNAME   VMTYPE      STATUS   CLUSTER
>>> 643  cadc-vm06  arbase_jjk  Running  iris.cadc.dao.nrc.ca
>>> 644  cadc-vm01  arbase_jjk  Running  iris.cadc.dao.nrc.ca
>>> 645  cadc-vm07  arbase_jjk  Running  iris.cadc.dao.nrc.ca
>>>
>>> Total VMs: 3. Total Clouds: 1
>>>
>>>
>>> Here's the output from condor_status this morning - cadc-vm02,
>>> cadc-vm03, cadc-vm04, cadc-vm05 are all VMs from the jobs I
>>> submitted, cadc-vm01, cadc-vm06, cadc-vm07 are VMs from the jobs the
>>> other user submitted:
>>>
>>> Name       OpSys  Arch    State      Activity  LoadAv  Mem   ActvtyTime
>>>
>>> cadc-vm02  LINUX  INTEL   Unclaimed  Idle      0.080   2048  0+03:25:54
>>> cadc-vm03  LINUX  INTEL   Unclaimed  Idle      0.000   2048  0+03:28:58
>>> cadc-vm04  LINUX  INTEL   Unclaimed  Idle      0.000   2048  0+03:30:48
>>> cadc-vm05  LINUX  INTEL   Unclaimed  Idle      0.000   2048  0+03:29:54
>>> cadc-vm01  LINUX  X86_64  Claimed    Busy      1.000   4096  0+03:48:36
>>> cadc-vm06  LINUX  X86_64  Claimed    Busy      1.000   4096  0+07:21:59
>>> cadc-vm07  LINUX  X86_64  Claimed    Busy      1.000   4096  0+05:53:54
>>>
>>>               Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
>>>
>>> INTEL/LINUX       4      0        0          4        0           0         0
>>> X86_64/LINUX      3      0        3          0        0           0         0
>>>
>>> Total             7      0        3          4        0           0         0
>>>
>>>
>>> Here's the relevant output from cloudscheduler.log this morning:
>>>
>>> 2010-10-27 08:21:37,817 - DEBUG - Scheduler - Name of Condor Machine to shutdown: cadc-vm02
>>> 2010-10-27 08:21:37,818 - DEBUG - Scheduler - Name of Condor Machine to shutdown: cadc-vm03
>>> 2010-10-27 08:21:37,818 - DEBUG - Scheduler - Name of Condor Machine to shutdown: cadc-vm04
>>> 2010-10-27 08:21:37,818 - DEBUG - Scheduler - Name of Condor Machine to shutdown: cadc-vm05
>>>
>>>
>>> Here's the output of virsh list on the VMM with the four VMs "to
>>> shutdown" - the VMs are still booted, and I can ssh to them:
>>>
>>> goliaths proc5-20 ~ [52] virsh list
>>>  Id  Name       State
>>> ----------------------------------
>>>   0  Domain-0   running
>>>  10  wrksp-627  idle
>>>  11  wrksp-628  idle
>>>  12  wrksp-630  idle
>>>  13  wrksp-629  idle
>>>  14  testsl5a   idle
>>>  17  wrksp-645  running
>>>
>>>
>>> Condor's still running on the VMs:
>>>
>>> [root@cadc-vm04 ~]# ps -ef | grep condor
>>> condor  1444     1  0 Oct26 ?  00:00:00 /usr/sbin/condor_master -pidfile /var/run/condor/master.pid
>>> condor  1456  1444  0 Oct26 ?  00:00:17 condor_startd -f
>>> condor  1457  1444  0 Oct26 ?  00:00:00 condor_schedd -f
>>> root    1484  1457  0 Oct26 ?  00:00:00 condor_procd -A /var/run/condor/procd_pipe.SCHEDD -R 10000000 -S 60 -C 500
>>> root    1499  1456  0 Oct26 ?  00:00:00 condor_procd -A /var/run/condor/procd_pipe.STARTD -R 10000000 -S 60 -C 500
>>>
>>>
>>> Is there anything else you'd like me to look at? Is the condor_off
>>> functionality in the latest dev release? I can't remember whether
>>> Michael is working on it now or had just recently added it.
>>>
>>> Thanks,
>>> Sharon
>>
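
The cross-check described in the quoted message (machines the scheduler
logged as "to shutdown" that condor_status still reports as Unclaimed/Idle)
can be automated. The following is only a minimal sketch, not part of Cloud
Scheduler. It assumes the log and condor_status text are read unwrapped
(straight from the file or command, not from a mail quote), that machine
rows have the eight columns shown above, and that the log message format
matches the excerpt. All function names are hypothetical.

```python
import re

# Matches the DEBUG message seen in the cloudscheduler.log excerpt.
# Assumption: the message format is exactly as quoted in this thread.
SHUTDOWN_RE = re.compile(r"Name of Condor Machine\s+to shutdown:\s*(\S+)")


def machines_flagged_for_shutdown(log_text):
    """Return the set of machine names the scheduler logged for shutdown."""
    return set(SHUTDOWN_RE.findall(log_text))


def idle_unclaimed_machines(condor_status_text):
    """Return machine names whose State is Unclaimed and Activity is Idle.

    Assumption: machine rows have eight whitespace-separated columns:
    Name OpSys Arch State Activity LoadAv Mem ActvtyTime.
    """
    idle = set()
    for line in condor_status_text.splitlines():
        fields = line.split()
        if len(fields) == 8 and fields[3] == "Unclaimed" and fields[4] == "Idle":
            idle.add(fields[0])
    return idle


def still_idle_after_shutdown_request(log_text, condor_status_text):
    """Machines flagged for shutdown that are still idle in the pool."""
    flagged = machines_flagged_for_shutdown(log_text)
    return sorted(flagged & idle_unclaimed_machines(condor_status_text))
```

Feeding it the log excerpt and the condor_status output from the quoted
message would list the machines that were flagged for shutdown yet are
still registered with Condor. It says nothing about why the shutdown did
not happen; the Nimbus container.log is still needed for that.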