Job disconnection

Fatemeh

Oct 21, 2009, 8:36:18 PM

to Archer User's Group

Hello,

My job seems to have been disconnected, and the file that is generated
by the program (result_0) does not seem to have been transferred
correctly. Here are the contents of the log file:

000 (432.000.000) 10/22 00:08:49 Job submitted from host: <5.190.31.154:9501>
...
022 (432.000.000) 10/22 00:11:57 Job disconnected, attempting to reconnect
Local schedd and job shadow died, schedd now running again
Trying to reconnect to sl...@C068006087.ipop <5.68.6.87:43771>
...
024 (432.000.000) 10/22 00:12:11 Job reconnection failed
Job not found at execution machine
Can not reconnect to sl...@C068006087.ipop, rescheduling job
...
001 (432.000.000) 10/22 00:20:05 Job executing on host: <5.68.6.87:43771>
...
006 (432.000.000) 10/22 00:20:14 Image size of job updated: 3008
...
007 (432.000.000) 10/22 00:20:40 Shadow exception!
Error from starter on sl...@C068006087.ipop: STARTER at 5.68.6.87 failed to send file(s) to <5.190.31.154:49372>: error reading from /opt/condor/var/execute/dir_25682/result_0: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <5.68.6.87:36267>
0 - Run Bytes Sent By Job
1139793 - Run Bytes Received By Job
...
012 (432.000.000) 10/22 00:20:41 Job was held.
Error from starter on sl...@C068006087.ipop: STARTER at 5.68.6.87 failed to send file(s) to <5.190.31.154:49372>: error reading from /opt/condor/var/execute/dir_25682/result_0: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <5.68.6.87:36267>
Code 13 Subcode 2
...

Does anyone know why this happened?

Thanks,

-Fatemeh

rjo...@gmail.com

Oct 21, 2009, 9:54:22 PM

to Archer User's Group

Not sure. A while ago we were seeing jobs disconnect due to
keepalive timeouts (this might be it, but I believe we have fixed
this).

Can you post your job submit file(s) here?

--rf

David Isaac Wolinsky

Oct 21, 2009, 11:29:41 PM

to archer-us...@googlegroups.com

As root:
- Edit /etc/condor/condor_config and append
"MAX_CONCURRENT_UPLOADS = 5" to the end, without the quotation marks
- Execute /opt/condor/sbin/condor_reconfig
As user:
- Submit jobs
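
The steps above could be scripted roughly as follows. This is only a sketch: the helper name append_setting is made up for illustration, and the paths in the usage comment are the ones given in the list. It appends the setting only when that exact line is not already in the file, so running it twice does not duplicate the entry.

```shell
#!/bin/sh
# Hypothetical helper (not part of Condor): append one setting line to a
# config file unless the identical line is already present.
append_setting() {
    config_file="$1"
    setting="$2"
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    if grep -qxF "$setting" "$config_file" 2>/dev/null; then
        echo "already present"
    else
        printf '%s\n' "$setting" >> "$config_file"
        echo "appended"
    fi
}

# Usage, as root, with the paths from the message above:
#   append_setting /etc/condor/condor_config "MAX_CONCURRENT_UPLOADS = 5"
#   /opt/condor/sbin/condor_reconfig
```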

Also I noticed you're submitting a lot of jobs every time. I suggest
you start with a few. That helps narrow down the problem.

Regards,
David

Fatemeh

Oct 22, 2009, 3:39:05 PM

to Archer User's Group

Here is the submit file for the first job:

Universe = vanilla
Executable = ./VMMBrain
Output = /mnt/ganfs/C190031154/VMMBrain/Apache/forward_feature_selection_fldr_k_750/run_outputs/run_0.cfg.out
Error = /mnt/ganfs/C190031154/VMMBrain/Apache/forward_feature_selection_fldr_k_750/run_outputs/run_0.cfg.err
Log = /mnt/ganfs/C190031154/VMMBrain/Apache/forward_feature_selection_fldr_k_750/run_outputs/run_0.cfg.log
Arguments = -slice 1 -option 13 -wrkld_name apache -wrkld_path /mnt/ganfs/C190031154/VMMBrain/Apache -train_fldr Normal -test_fldr All_Abnormal_Traces -train_file win_apache_server_I_Normal.txt -test_file validation1.txt -use_ED 1 -k_ED 750 -use_CS 0 -use_Canb 0 -use_Manh 0 -alg 1 -total_features 300 -method 3 -use_features 0 -weights 1 -results_fldr forward_feature_selection_fldr_k_750 -result_ID 0 -ignore_late_alarms 0 -skip 0,50,100,150
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_output_files = result_0
Queue

I also tried running one job, and I got the following error:
000 (889.000.000) 10/22 01:19:16 Job submitted from host: <5.190.31.154:9501>
...
001 (889.000.000) 10/22 01:19:54 Job executing on host: <5.209.76.10:41394>
...
007 (889.000.000) 10/22 01:19:57 Shadow exception!
Error from starter on sl...@C209076010.ipop: STARTER at 5.209.76.10 failed to send file(s) to <5.190.31.154:59934>: error reading from /opt/condor/var/execute/dir_27941/result_0: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <5.209.76.10:50777>
0 - Run Bytes Sent By Job
1139793 - Run Bytes Received By Job
...
012 (889.000.000) 10/22 01:19:58 Job was held.
Error from starter on sl...@C209076010.ipop: STARTER at 5.209.76.10 failed to send file(s) to <5.190.31.154:59934>: error reading from /opt/condor/var/execute/dir_27941/result_0: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <5.209.76.10:50777>
Code 13 Subcode 2
...

This time, the job didn't disconnect, but it still failed. Any ideas?

Thanks,

-Fatemeh

rjo...@gmail.com

Oct 22, 2009, 3:45:52 PM

to Archer User's Group

Can you try using /mnt/local/VMMBrain instead of /mnt/ganfs/C190../VMMBrain?
I'm not sure if /mnt/ganfs is writable even on your local host.

Also, can you double-check that your program is indeed creating the
result_0 file in the same directory it runs in?
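
One way to check this by hand, outside of Condor, is a small wrapper run from a scratch directory. This is a sketch under assumptions: run_and_check is a hypothetical helper, not something shipped with Condor or the grid appliance. It runs the given command and then reports whether the expected file exists in the current working directory, which is where the starter looks for transfer_output_files.

```shell
#!/bin/sh
# Hypothetical helper: run a command, then check that the expected output
# file was created in the current working directory.
run_and_check() {
    expected="$1"
    shift
    # Run the program as given. Its exit status is deliberately ignored,
    # because only the presence of the output file is being checked.
    "$@" || true
    if [ -f "./$expected" ]; then
        echo "ok: $expected"
    else
        echo "missing: $expected"
        return 1
    fi
}

# Usage, from an empty scratch directory:
#   run_and_check result_0 /path/to/VMMBrain <arguments from the submit file>
```

If this prints "missing: result_0", the program is writing its results somewhere else (or not at all), and the transfer at job exit fails with the same "No such file or directory" error shown in the log.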

--rf

Fatemeh

Oct 22, 2009, 9:10:13 PM

to Archer User's Group

You were correct, Mr. Figueiredo. When I downloaded another grid
appliance and uploaded my code to it, I forgot to change the code
again so that it produced the result files in the same directory as
the one it runs in. Now that I have fixed that, the jobs seem to be
running fine (at least for the past few hours). Hopefully, they will
complete correctly.

Thanks again for everyone's help!

-Fatemeh

rjo...@gmail.com

Oct 22, 2009, 9:13:13 PM

to Archer User's Group

Glad to hear it. Please keep us posted on whether the jobs finish correctly.
--rf

Fatemeh

Oct 23, 2009, 11:25:05 AM

to Archer User's Group

I left the jobs running last night and when I sat at my computer this
morning, there was a message saying I was low on virtual memory and
the screen was completely black. :(
I had to reboot the machine and restart the grid appliance. Looking at
some of the log files, a few jobs completed successfully but I will
rerun the experiments to see if they are all able to complete.

Question: Is it possible to delete the job history? In particular, I
would like to reset the cluster number assigned to new jobs.

And another thing: the grid appliance is responding extremely slowly to
my commands right now. Is there a particular reason why this is
happening?

Thanks,

-Fatemeh

Fatemeh

Oct 23, 2009, 12:44:37 PM

to Archer User's Group

Hello,

I am no longer getting responses to my condor commands. Here is what
happens when I run condor_q:

-- Failed to fetch ads from: <5.212.97.131:9501> : C212097131.ipop

David, do I need to assign a static address, like the last time you
suggested?
http://www.grid-appliance.org/wiki/index.php/Archer:StaticAddresses
If so, how can I "restart ipop"?

Thanks,

-Fatemeh

rjo...@gmail.com

Oct 23, 2009, 3:27:49 PM

to Archer User's Group

On Oct 23, 12:44 pm, Fatemeh <fatem...@gmail.com> wrote:
> Hello,
>
> I am no longer getting responses to my condor commands. Here is what
> happens when I run condor_q:
>
> -- Failed to fetch ads from: <5.212.97.131:9501> : C212097131.ipop
>
> David, do I need to assign a static address, like the last time you
> suggested?
> http://www.grid-appliance.org/wiki/index.php/Archer:StaticAddresses

Yes, I think it's a good idea if you'll be running an NFS server.

> If so, how can I "restart ipop"?

/etc/init.d/ipop.sh stop
/etc/init.d/ipop.sh start

--rf
>
> Thanks,
>
> -Fatemeh

Fatemeh

Oct 23, 2009, 5:18:14 PM

to Archer User's Group

Thanks, I followed the instructions on the link, but the ping was not
successful. When I ran condor_q again, it took about 20 seconds, but it
finally showed the results. Here are the new contents of the
/etc/ipop.vpn.config file:

#!/bin/bash
DEVICE="tapipop"
DIR="/opt/ipop"
DHCP=
USER=
GROUP=
STATIC=true
IP=5.212.97.131
NETMASK=255.0.0.0
USE_IPOP_HOSTNAME=true

Does this look right?

Thanks again,

-Fatemeh

Fatemeh

Oct 23, 2009, 6:01:23 PM

to Archer User's Group

Here is the result of running condor_status:

CEDAR:6001:Failed to connect to <5.1.1.251:9618>
Error: Couldn't contact the condor_collector on 5.1.1.251.

Extra Info: the condor_collector is a process that runs on the central
manager of your Condor pool and collects the status of all the machines and
jobs in the Condor pool. The condor_collector might not be running, it might
be refusing to communicate with you, there might be a network problem, or
there may be some other problem. Check with your system administrator to fix
this problem.

If you are the system administrator, check that the condor_collector is
running on 5.1.1.251, check the HOSTALLOW configuration in your
condor_config, and check the MasterLog and CollectorLog files in your log
directory for possible clues as to why the condor_collector is not
responding. Also see the Troubleshooting section of the manual.

-Fatemeh

rjo...@gmail.com

Oct 24, 2009, 9:19:32 AM

to Archer User's Group

On Oct 23, 11:25 am, Fatemeh <fatem...@gmail.com> wrote:
> I left the jobs running last night and when I sat at my computer this
> morning, there was a message saying I was low on virtual memory and
> the screen was completely black. :(

How much memory does your host machine have? You may want to consider
submitting jobs from an appliance installed in a remote VM (e.g. one
of the Archer nodes over there) if you're going to be running large
numbers of jobs that NFS-mount your data. You may also consider using
the NEU NFS server instead of your own VM; that will reduce the load
on your host.

> I had to reboot the machine and restart the grid appliance. Looking at
> some of the log files, a few jobs completed successfully but I will
> rerun the experiments to see if they are all able to complete.

It's possible that the ones that didn't complete are restarted
automatically by Condor.

>
> Question: Is it possible to delete the job history? In particular, I
> would like to reset the cluster number assigned to new jobs.

I'm not sure - Alain, any pointers here?

I don't know if you want to reset the job history, though. If your
cluster ID keeps increasing, you always have a unique ID for each job;
if you reset it, you will lose this uniqueness.

>
> And another thing, the grid appliance is responding extremely slow to
> my commands right now. Is there a particular reason why this is
> happening?
>

See my comments above: if you have lots of jobs accessing NFS back to
your host, and if your host is slow or has low memory, you might
consider running the appliance on a beefier host.

--rf

Alain Roy

Oct 25, 2009, 8:26:41 PM

to archer-us...@googlegroups.com

On Oct 24, 2009, at 8:19 AM, rjo...@gmail.com wrote:
>> Question: Is it possible to delete the job history? In particular, I
>> would like to reset the cluster number assigned to new jobs.
>
> I'm not sure - Alain, any pointers here?
>
> I don't know if you want to reset the job history though - if your
> cluster ID increases you always have a unique ID for each job, if you
> reset you will lose this uniqueness.

Yeah, I'm not sure it's advisable, but you can do it.

First, you need to turn off Condor. I'm not entirely sure how it's set
up in archer, but you can probably do:

/etc/init.d/condor stop

(or just kill condor, or whatever.)

Then you need to find your spool directory. If you don't know where it
is, run:

condor_config_val SPOOL

That will print out a directory name. Change into that directory. Now
delete everything in that directory, particularly the "history" for
all job history, and the "job_queue.log*" for the job queue and the
history of the cluster numbers.

Restart condor, and you're good to go. You are messing with stuff we
don't usually recommend you mess with, so be careful.
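
For reference, the steps above might look roughly like this as a script. It is a sketch only, with several assumptions: reset_job_history is a hypothetical helper, it removes just the two items named above ("history" and "job_queue.log*") rather than everything in the spool directory, and the "/etc/init.d/condor start" line in the usage comment is a guess at how Condor is restarted on the appliance.

```shell
#!/bin/sh
# Hypothetical helper: remove the job history file and the job queue logs
# from a Condor spool directory. Condor must already be stopped.
reset_job_history() {
    spool_dir="$1"
    if [ ! -d "$spool_dir" ]; then
        echo "no such directory: $spool_dir" >&2
        return 1
    fi
    # Narrower than "delete everything": only the files named in the message.
    rm -f "$spool_dir/history" "$spool_dir"/job_queue.log*
    echo "cleared job history in $spool_dir"
}

# Usage, as root:
#   /etc/init.d/condor stop
#   reset_job_history "$(condor_config_val SPOOL)"
#   /etc/init.d/condor start    # assumed counterpart of the stop command
```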

-alain
-----------------------------------------------------------------
Alain Roy r...@cs.wisc.edu
Condor Project http://www.cs.wisc.edu/condor

Fatemeh

Oct 26, 2009, 10:55:00 AM

to Archer User's Group

Thanks for all the help.

-Fatemeh

Fatemeh

Oct 26, 2009, 3:24:08 PM

to Archer User's Group

To answer Mr. Figueiredo's questions, the computer where I installed
the grid appliance has an Intel Xeon 3.06 GHz CPU with 2 GB of RAM.
Can someone provide some assistance on how I can access one of the
Northeastern Archer nodes? Would I have to download and install a grid
appliance on the machine? And how can I use the NEU NFS server?

Thanks,

-Fatemeh

rjo...@gmail.com

Oct 26, 2009, 3:34:30 PM

to Archer User's Group

Check with Perhaad Misry about how to access the file server over there.
If you use the Archer cluster file server, it may be enough to offload
your desktop appliance. Regardless of where you run the appliance, I
suggest you scale back to a smaller number of concurrent jobs
submitted, check how your appliance's load (CPU, memory, network)
behaves, and then ramp up the number of jobs.

--rf
