Re: trouble running on stampede (fwd)


Ole Weidner

Jan 10, 2014, 3:21:02 PM
to bigjob...@googlegroups.com, Brian Radak
[moving this to the BigJob list — Brian, are you on that list?]

Hi All,

I also have some trouble running on stampede. My error looks different — not sure if this is related to Brian’s problem:

curl: (6) Couldn't resolve host 'python-distribute.org'
/home1/00988/tg802352/.bigjob/python/bin/python: can't open file 'distribute_setup.py': [Errno 2] No such file or directory
/bin/sh: aprun: command not found
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
70:05:eb:ce:90:05:a6:c2:6d:07:b7:2e:f3:9d:43:af.
Please contact your system administrator.
Add correct host key in /home1/00988/tg802352/.ssh/known_hosts to get rid of this message.
Offending key in /home1/00988/tg802352/.ssh/known_hosts:1
Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.
Traceback (most recent call last):
File "<string>", line 59, in <module>
File "/home1/00988/tg802352/.bigjob/python/src/bigjob/bigjob/bigjob_agent.py", line 161, in __init__
self.coordination = bigjob_coordination(server_connect_url=self.coordination_url)
File "/home1/00988/tg802352/.bigjob/python/src/bigjob/pilot/filemanagement/../../coordination/bigjob_coordination_redis.py", line 80, in __init__
raise Exception("Please start Redis server!")
Exception: Please start Redis server!
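
On the question of how to check Redis connectivity (and the "Please start Redis server!" exception above): one rough way is to send an inline PING to the coordination server from the same machine the agent runs on, since a compute node may not see the same network as a login node. The sketch below is only an illustration. `redis_ping` and `classify_reply` are made-up helper names, not part of BigJob, the host in the usage comment is a placeholder, and the probe assumes bash (for /dev/tcp) and coreutils `timeout` are available.

```shell
#!/bin/bash
# Hypothetical helpers (not part of BigJob): probe a Redis server with PING.

# Print the first reply line from HOST:PORT (port defaults to 6379).
# Fails if the connection cannot be opened or no reply arrives in 5 seconds.
redis_ping() {
    local host="$1" port="${2:-6379}"
    timeout 5 bash -c '
        exec 3<>"/dev/tcp/$1/$2" || exit 1   # open a TCP connection
        printf "PING\r\n" >&3                # Redis accepts inline commands
        IFS= read -r reply <&3 || exit 1     # first line of the reply
        printf "%s\n" "$reply" | tr -d "\r"  # strip the trailing CR
    ' _ "$host" "$port" 2>/dev/null
}

# Turn a reply line into a short human-readable verdict.
classify_reply() {
    case "$1" in
        +PONG*) printf '%s\n' "ok" ;;
        -*)     printf '%s\n' "reachable, but server returned an error: ${1#-}" ;;
        "")     printf '%s\n' "no reply" ;;
        *)      printf '%s\n' "unexpected reply: $1" ;;
    esac
}

# Usage (placeholder host, substitute your coordination server):
# classify_reply "$(redis_ping redis.example.org 6379)"
```

An error reply (a line starting with "-") still shows that the server is reachable; on a password-protected server that is the likely answer to an unauthenticated PING. "no reply" points at a network or DNS problem rather than at BigJob itself.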

- Ole


On Jan 10, 2014, at 14:52 , Brian Radak <rada...@umn.edu> wrote:

> I've re-attached all of the output and the slurm script. It is almost exactly as you had above.
>
> How do I check the Redis connectivity? I'm still using the gw86 server.
>
> I rechecked the AMBER tests on Stampede. Everything appears to run just fine.
>
> I also had to migrate from $WORK to $SCRATCH (quota is running out). I don't know why this would/could be a problem, but I'm re-running a test in $WORK to be sure (it's still waiting in the queue).
>
>
>
>
> On Fri, Jan 10, 2014 at 2:33 PM, Ashley Zebrowski <an...@rutgers.edu> wrote:
> Hello!
>
> If the script you are using is based on the one I sent out a while ago, it should look something like this:
>
> #!/bin/bash
> #SBATCH -J bigjob_throughput # Job name
> #SBATCH -o bj.o%j # Name of stdout output file(%j expands to jobId)
> #SBATCH -e bj.o%j # Name of stderr output file(%j expands to jobId)
> #SBATCH -p normal # Submit to the 'normal' or 'development' queue
> #SBATCH -N 1 # Total number of nodes requested (16 cores/node)
> #SBATCH -n 16 # Total number of mpi tasks requested
> #SBATCH -t 47:30:00 # Run time (hh:mm:ss) - 47.5 hours
> #SBATCH -A TG-MCB090174 # Allocation name to charge job against
>
> cd /home1/01414/ashleyz/experiments/bj-performance-experiments
> source /home1/01414/ashleyz/saga-python-env/bin/activate
> SAGA_VERBOSE=9 BIGJOB_VERBOSE=9 python benchmark.py configs/test.cfg &> /work/01414/ashleyz/benchmark-test.cfg.out
>
> There are multiple files to check. As I recall:
>
> The -o and -e files contain stdout/stderr *as output from your SLURM script*. For example, if the source command failed in my script above, I would likely find the stdout/stderr information in the -o and -e files.
>
> Now, the question is whether you have the &> redirection in there. I use it for my main executable because it joins stdout and stderr and puts them in one file specific to the experiments I am running, separate from any other commands in my SLURM script. Since stdout/stderr are redirected on this line, the -o and -e files will NOT contain them. I am not sure if you are using such a redirect.
>
> As a note:
> -o, --output=<filename pattern>
> Instruct SLURM to connect the batch script's standard output directly to the file name specified in the "filename pattern". By default both standard output and standard error are directed to a file of the name "slurm-%j.out", where the "%j" is replaced with the job allocation number. See the --input option for filename specification options.
> So, if you -are- using &> or similar to merge stderr/stdout to a single file but would rather have everything handled by the batch system, you should be able to omit the -e parameter and SLURM should merge stderr/stdout to the single -o file.
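
To illustrate the point about `&>`: a small sketch showing that once a command's output is redirected to its own file, nothing from that command reaches the script's own stdout/stderr, which is what SLURM captures in the -o/-e files. The function names `noisy` and `run_redirected` are made up for the illustration, and the redirect is written as `> file 2>&1`, which behaves like bash's `&> file` but also works in plain sh.

```shell
#!/bin/bash
# Stand-in for the main executable: one line on stdout, one on stderr.
noisy() {
    echo "to stdout"
    echo "to stderr" >&2
}

# Run it with both streams sent to a per-experiment log file.
# `> "$1" 2>&1` is equivalent to bash's `&> "$1"`.
run_redirected() {
    noisy > "$1" 2>&1
}

# Example: capture whatever still reaches the caller's stdout/stderr.
log="$(mktemp)"
leaked="$(run_redirected "$log" 2>&1)"
echo "reached the caller: '${leaked}'"
echo "lines in log file: $(grep -c '' "$log")"
rm -f "$log"
```

If the batch script contains only redirected commands like this, empty -o/-e files are expected and say nothing about whether the job ran; the per-experiment log is the place to look.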
>
> One more possible issue is to check the REDIS connectivity -- if a pilot is set up but cannot access the REDIS server, it may just hang.
>
> If you send me the submission script + the output, I would be happy to take a look. The output didn't get forwarded to me.
>
> Hope this helps!
> -Ashley Z
>
>
> On Fri, Jan 10, 2014 at 9:38 AM, Melissa Romanus <mmr...@scarletmail.rutgers.edu> wrote:
> I can't really be at a computer - the script to which you are referring, Brian, is Ashley's script that launches the job to a compute node. I have added her to this "private" list since she may have an additional idea. I may be able to look at it late tonight but if not then I will get to it over the weekend. I haven't run using Ashley's launch method though. I think Yaakoub might be of some assistance here?
>
> Melissa Romanus
> RADICAL
> The Cloud and Autonomic Computing Center
> Electrical and Computer Engineering Dept.
> Rutgers University
> Email: mel...@cac.rutgers.edu
>
> On Jan 10, 2014, at 9:03 AM, Brian Radak <rada...@umn.edu> wrote:
>
>> These are still the ASyncRE jobs where the pilot is submitted via slurm to a single compute node.
>>
>> As near as I can tell, no AMBER jobs are ever run (with output), as there are no output files in any of the expected places (not even stderr or stdout files).
>>
>> The module loading thing sounds promising. I wonder if I changed AMBER environments (ugh, stupid developer packages).
>>
>> Melissa: I re-attached bjrun0.out (the stdout from ASyncRE) as bjrun0.txt.
>>
>> Brian
>>
>>
>> On Fri, Jan 10, 2014 at 8:15 AM, Melissa Romanus <mmr...@scarletmail.rutgers.edu> wrote:
>> I have no insight beyond Yaakoub's. My initial thought when I first
>> read your email was that TACC/slurm threw out the job for some invalid
>> parameter, but it doesn't appear on BJ stdout. Brian, please rename
>> the .out file to .txt so I can view it on my phone. I'm doing the
>> quantitative bio boot camp so I haven't really had my laptop.
>>
>> Melissa Romanus
>> RADICAL
>> The Cloud and Autonomic Computing Center
>> Electrical and Computer Engineering Dept.
>> Rutgers University
>> Email: mel...@cac.rutgers.edu
>>
>> > On Jan 10, 2014, at 7:15 AM, Shantenu Jha <shante...@rutgers.edu> wrote:
>> >
>> > Brian,
>> >
>> > Off list.. Some exchanges with Yaakoub on this matter.
>> >
>> > Others might have insight..
>> >
>> > Shantenu
>> >
>> > ---------- Forwarded message ----------
>> > Date: Fri, 10 Jan 2014 07:12:13 -0500
>> > From: Yaakoub El Khamra <yelk...@gmail.com>
>> > To: Shantenu Jha <shante...@rutgers.edu>
>> > Subject: Re: trouble running on stampede
>> >
>> > <snip>
>> >
>> > Usually "cancelled" means he cancelled the job or we killed it. Ask him if any of his other jobs ran since
>> > this one failed or if he cancelled this one. Also, it seems that he might have passed something unknown to
>> > ibrun; does he have the right modules loaded?
>> >
>> > <snip>
>>
>>
>>
>> --
>> ================================ Current Address =======================
>> Brian Radak : BioMaPS Institute for Quantitative Biology
>> PhD candidate - York Research Group : Rutgers, The State University of New Jersey
>> University of Minnesota - Twin Cities : Center for Integrative Proteomics Room 308
>> Graduate Program in Chemical Physics : 174 Frelinghuysen Road,
>> Department of Chemistry : Piscataway, NJ 08854-8066
>> rada...@umn.edu : rad...@biomaps.rutgers.edu
>> ====================================================================
>> Sorry for the multiple e-mail addresses, just use the institute appropriate address.
>> <bjrun0.txt>
>
>
>
>
> <bjrun0.txt><bjrun0.slurm><stderr-bj-c395fdec-7953-11e3-88c8-848f69fdd8e1-agent.txt><stdout-bj-c395fdec-7953-11e3-88c8-848f69fdd8e1-agent.txt>


Ole Weidner

Jan 10, 2014, 3:26:06 PM
to bigjob...@googlegroups.com, Brian Radak
All,

Not sure if this is helpful or even advisable, but I was able to solve my problem by simply removing my $HOME/.bigjob directory.
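
A more cautious variant of the same fix is to move the directory aside rather than delete it, so the old state can still be inspected or restored. This is only a sketch: `set_aside` is a made-up helper name, and it assumes that BigJob recreates $HOME/.bigjob on the next run, which this thread does not confirm.

```shell
#!/bin/bash
# Hypothetical helper (not part of BigJob): move a directory aside with a
# timestamp suffix instead of deleting it, and print the backup path.
set_aside() {
    local dir="${1%/}" backup
    if [ ! -d "$dir" ]; then
        echo "nothing to do: $dir not found" >&2
        return 0
    fi
    backup="$dir.bak.$(date +%Y%m%d-%H%M%S)"
    mv -- "$dir" "$backup" && echo "$backup"
}

# Usage (left commented out so that pasting the block changes nothing):
# set_aside "$HOME/.bigjob"
```

Once things work again, the backup can be removed by hand.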

- Ole