Problem submitting jobs for Simics/Gems

Dimitris Kaseridis

Oct 17, 2009, 1:08:47 PM
to archer-us...@googlegroups.com
Hi all,

I have been trying to submit jobs for Gems/Simics following the
directions on the wiki. The jobs are submitted correctly and condor_q
reports them as running (R), but they never finish.

For example, I created a simple Simics configuration that does
nothing (it just waits at the prompt) and submitted a job to simulate
1000 cycles. Executing it on my client VM takes about 20-30 seconds
altogether.

I even put the files in /mnt/local on my virtual machine and
ran them from another client, where they completed correctly.

I have all the files in /mnt/ganfs/C026090208/test

mod_opal_commands.py
mod_ruby_commands.py
mod_ruby_commands.pyc
opal.so
ruby.so
simics_condor_submit
simics_wrapper.sh
start_up_ruby.script

In theory, if someone executes simics_wrapper.sh in a VM with Simics, it
runs fine.

This is my simics_condor_submit
*****************************************************************************************************************
# Condor submit script for the Simics Archer tutorial example

# Simics always runs in vanilla mode
universe = vanilla

# what will execute remotely is a "wrapper" script - which prepares the
# Simics workspace and runs Simics itself
executable = simics_wrapper.sh

# Specify requirements for job - the job will run on a machine that:
# 1) has the Simics module installed (you need this for all Simics jobs in Archer), and
# 2) has a minimum of 1024MB RAM (you can change this according to your job)
Requirements = HasArcherSimics == TRUE && Memory >= 1024

# set up output, error and log files
log = simics.$(Cluster).$(Process).log
error = simics.$(Cluster).$(Process).err
output = simics.$(Cluster).$(Process).out

# specify files to transfer from/to remote machine
should_transfer_files = yes
when_to_transfer_output = on_exit
transfer_input_files = start_up_ruby.script, ruby.so, mod_opal_commands.py, mod_ruby_commands.py
transfer_output_files = screen_dump_1000.out

#error checking
on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)

#queue submits the job to the queue
Queue
*********************************************************************************************************************

and my simics_wrapper.sh

********************************************************************************************************************
griduser@C026090208:/mnt/local/test$ cat simics_wrapper.sh
#!/bin/sh
# This script sets up a Simics workspace for execution on a remote or local
# Archer grid appliance
# This script is based on the Archer simics tutorial; to run your own
# simulation, you will need to change it to manage your own files

# Create and change into the Simics workspace directory, also storing it in
# the tgt_wrk_spc variable
mkdir new-workspace
cd new-workspace
tgt_wrk_spc=`pwd`

# Go to the Simics installation directory and run workspace setup script
cd /opt/virtutech/simics-3.0.31/bin
./workspace-setup $tgt_wrk_spc

# Go back to workspace directory
cd $tgt_wrk_spc

# Setup directories for the TLB modules used in the tutorial:
mkdir x86-linux
mkdir x86-linux/lib
mkdir x86-linux/lib/python

# note that we are within the new-workspace subdirectory; files transferred
# by Condor are one level up (../). Copy those to the right place.
cp ../ruby.so x86-linux/lib
cp ../mod_ruby_commands.py x86-linux/lib/python
cp ../mod_opal_commands.py x86-linux/lib/python

./simics -c /mnt/ganfs/C123175188/abisko-8cpu-after-boot.config \
    -no-win -batch-mode -stall -x ../start_up_ruby.script \
    > ../screen_dump_1000.out
********************************************************************************************************

And the condor_q output:
********************************************************************************************************

-- Submitter: C026090208.ipop : <5.26.90.208:9501> : C026090208.ipop
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
17.0 griduser 10/16 08:35 0+15:30:30 R 0 26.9 simics_wrapper.sh
18.0 griduser 10/16 08:35 0+15:30:30 R 0 26.9 simics_wrapper.sh
19.0 griduser 10/16 08:35 0+15:30:30 R 0 26.9 simics_wrapper.sh
20.0 griduser 10/16 08:35 0+15:30:30 R 0 26.9 simics_wrapper.sh
21.0 griduser 10/16 08:35 0+15:30:30 R 0 26.9 simics_wrapper.sh
22.0 griduser 10/16 08:35 0+15:30:30 R 0 26.9 simics_wrapper.sh

6 jobs; 0 idle, 6 running, 0 held
********************************************************************************************************


Any suggestions are welcome.
Thanks,
Dimitris

UT-Austin

Girish Venkatasubramanian

Oct 17, 2009, 5:54:23 PM
to archer-us...@googlegroups.com
Hi Dimitris,
It seems strange that the script runs on your local VM but not in
Condor. Can you email the logs (error, output) of the Condor job? Also,
is "quit" the last command in your start_up_ruby.script?
Can you attach that script too?
Thanks.

Dimitris Kaseridis

Oct 17, 2009, 6:12:04 PM
to archer-us...@googlegroups.com
Thanks for the reply.

The err and output files of the Condor jobs are empty. Only the log
file has any content, something like:

**************************************
000 (017.000.000) 10/16 08:35:29 Job submitted from host: <5.26.90.208:9501>
...
001 (017.000.000) 10/16 08:38:00 Job executing on host: <5.62.61.34:49806>
...
006 (017.000.000) 10/16 08:38:11 Image size of job updated: 27408

********************************************
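
For reference, Condor writes a 005 (job terminated) event to this user log when a job finishes, so its absence here is consistent with the jobs staying in the R state. A minimal sketch for checking that across several logs, assuming the simics.$(Cluster).$(Process).log naming from the submit file; has_terminated is an illustrative helper, not a Condor tool:

```shell
#!/bin/sh
# Report whether a Condor user log contains a 005 (job terminated) event.
# has_terminated is a hypothetical helper name used only for this sketch.
has_terminated() {
    grep -q '^005 ' "$1"
}

# Check every log in the current directory that matches the naming
# pattern used by the submit file in this thread.
for log in simics.*.log; do
    [ -f "$log" ] || continue   # skip the unexpanded pattern if no logs exist
    if has_terminated "$log"; then
        echo "$log: terminated"
    else
        echo "$log: no termination event yet"
    fi
done
```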

This is my start_up_ruby.script

********************************************
instruction-fetch-mode instruction-fetch-trace
istc-disable
dstc-disable
cpu-switch-time 1
magic-break-enable

# Load modules
load-module ruby

ruby0.setparam g_NUM_PROCESSORS 8
ruby0.setparam g_NUM_CHIPS 1
ruby0.setparam g_PROCS_PER_CHIP 8
ruby0.setparam g_MEMORY_SIZE_BYTES 4294967296
ruby0.setparam g_NUM_L2_BANKS 8
ruby0.setparam NUMBER_OF_VIRTUAL_NETWORKS 10

ruby0.setparam L1_CACHE_NUM_SETS_BITS 9
ruby0.setparam L2_CACHE_NUM_SETS_BITS 15
ruby0.setparam L2_CACHE_ASSOC 9
ruby0.setparam L1_CACHE_ASSOC 2

ruby0.init

ruby0.periodic-stats-file filename = ../periodic_LRU.txt
ruby0.periodic-stats-interval 1000
ruby0.periodic-stats-interval 1000

c 10000
quit

************************************************

--
Dimitris

Girish Venkatasubramanian

Oct 17, 2009, 6:57:25 PM
to archer-us...@googlegroups.com
I see you have magic-break-enable. If I remember correctly, this will
break the execution every time there is a magic instruction, and you
will have to type continue (either manually or using a script) to
resume the execution - yes?

Could that be the problem? Since this is a dummy script, can you
disable magic breaks and try?
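
One way to try this without editing the original file is to write a copy of the script with the magic-break-enable line filtered out. This is only a sketch: strip_magic_break and the output file name start_up_ruby_nobreak.script are hypothetical, and it assumes the command sits alone on its own line as in the script posted above.

```shell
#!/bin/sh
# Print a Simics script with any line consisting solely of
# "magic-break-enable" removed. The original file is left untouched.
# strip_magic_break is an illustrative helper name.
strip_magic_break() {
    grep -v '^[[:space:]]*magic-break-enable[[:space:]]*$' "$1"
}

# Produce the filtered copy only if the script is present here.
if [ -f start_up_ruby.script ]; then
    strip_magic_break start_up_ruby.script > start_up_ruby_nobreak.script
fi
```

To use the copy, transfer_input_files in the submit file and the -x argument in the wrapper would both need to name the new file.
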
Thanks.

rjo...@gmail.com

Oct 17, 2009, 7:37:25 PM
to Archer User's Group
Dimitris,

Can you try running a simple script that simply does ls -l on your NFS
directory and exits?

This might give a clue as to whether the problem lies with Simics,
Condor, or NFS.
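
A minimal sketch of such a probe, assuming GNU coreutils timeout is available on the execute node. probe_dir and the 60-second limit are illustrative choices, not part of Archer or Condor; the default path is the one discussed in this thread.

```shell
#!/bin/sh
# List a directory, but give up after a deadline so that a hung
# NFS/autofs mount is reported instead of leaving the job running.
# probe_dir is a hypothetical helper; the limit defaults to 60 seconds.
probe_dir() {
    dir=$1
    limit=${2:-60}
    # GNU timeout exits with status 124 when the deadline expires.
    if timeout "$limit" ls -l "$dir"; then
        echo "OK: $dir listed"
    else
        status=$?
        if [ "$status" -eq 124 ]; then
            echo "TIMEOUT: $dir did not answer within ${limit}s"
        else
            echo "FAILED: ls exited with status $status"
        fi
    fi
}

# Record which execute node ran the probe, then test the directory.
uname -n
probe_dir "${1:-/mnt/ganfs/C001001254}" 60
```

Submitted as the job executable, this would put either a listing or a TIMEOUT line in the job's output file. One caveat: a process blocked in uninterruptible I/O on a hard NFS mount may ignore the signal, in which case the timeout itself may not fire.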

--rf

Dimitris Kaseridis

Oct 17, 2009, 8:55:59 PM
to archer-us...@googlegroups.com
OK, that's interesting....

I replaced the entire contents of simics_wrapper.sh with this:

*****************************
#!/bin/sh
# This script sets up a Simics workspace for execution on a remote or local
# Archer grid appliance
# This script is based on the Archer simics tutorial; to run your own
# simulation, you will need to change it to manage your own files

ls -la /mnt/ganfs/C001001254/

**************************


and I get the same behavior: the job is submitted correctly, starts
running on a machine, and never terminates.

I can correctly execute all of the examples in my home directory,
though; that was my basic validation that I can submit jobs.
Any ideas?

--
Dimitris

rjo...@gmail.com

Oct 17, 2009, 9:02:33 PM
to Archer User's Group
Dimitris,

Not sure, it might be taking a while for NFS to mount properly. Let
the jobs run for a while, let's see.

One suggestion for the time being, try to steer your jobs to only
select your UTA resources by adding:

&& Group == "UT-Austin"

in the Requirements expression of your condor submit file.
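
Applied to the Requirements line from the submit file posted earlier in this thread, the change would look roughly like this. Group is the machine attribute named above; that every execute node advertises it is an assumption.

```
# Requirements line from the submit file in this thread, with the
# suggested Group clause appended
Requirements = HasArcherSimics == TRUE && Memory >= 1024 && Group == "UT-Austin"
```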

--rf

Dimitris Kaseridis

Oct 17, 2009, 9:13:56 PM
to archer-us...@googlegroups.com
It seems like something is broken with the NFS.

If my script doesn't touch the NFS, it runs correctly, e.g. "uname -a"
or "hostname".
But even an ls on our NFS has been running for 20 minutes now and doesn't terminate.

I used the montepi example for the condor submit file and just added a .sh file.

I tried defining our Group as you mentioned and that also doesn't seem to work.

--
Dimitris

rjo...@gmail.com

Oct 17, 2009, 9:29:37 PM
to Archer User's Group
One thing I noticed: the wrapper script references
/mnt/ganfs/C123175188, which is not the same as your appliance's
hostname. Still, your example ls -la /mnt/ganfs/C001001254 should work.
I submitted a similar script from my own appliance and it worked OK.

--rf

Dimitris Kaseridis

Oct 17, 2009, 9:33:44 PM
to archer-us...@googlegroups.com
Yeah... I changed that today to a new client (faster host system).
The new one is

C026090208

But I still can't get this to complete:

ls -la /mnt/ganfs/C001001254

I modified the montepi condor_submit example file and just replaced its
executable with a test.sh containing the above ls.
That has been running for 30+ minutes now. If I do a "uname -a" instead
of the 'ls', it executes and returns the output as expected.

--
Dimitris

David Isaac Wolinsky

Oct 17, 2009, 9:33:57 PM
to archer-us...@googlegroups.com
The problem appears to be the startd sites having issues with the nfs
/ autofs stack. I've restarted all UT-Austin resources and will look
into it more deeply if the issue occurs again.

Specifically, I logged into a couple of nodes, did an ls
/mnt/ganfs/C001001254, and it hung. I then logged into the machine in
another session, and there was nothing obvious as to why this was
occurring. At that point, I restarted the machines.

Regards,
David

Dimitris Kaseridis

Oct 17, 2009, 10:15:23 PM
to archer-us...@googlegroups.com
Thanks, after David rebooted our KVMs, everything works as
expected; my dummy Gems simulations went through correctly.

I guess a reboot can always help ;-)

Thanks everybody for the fast response.....

--
Dimitris