trouble running on stampede


rad...@rci.rutgers.edu

Jan 9, 2014, 12:42:34 PM
to bigjob...@googlegroups.com
My BigJob scripts that worked fine on Stampede are now exiting with error
messages that are not entirely clear. There is one explicit yet cryptic message:

slurmd[c405-002]: *** JOB 2451071 CANCELLED AT 2014-01-09T11:31:15 ***

at the end of bj-stderr (attached, after setting logging to DEBUG). I'm
also getting a usage message for ibrun in bj-stdout (I believe I reported
that before, although I think everything still ran fine).

Help?
Brian


============================================ Current Address ============
Brian Radak : Rutgers University
PhD candidate - York Research Group : BioMaPS Institute
University of Minnesota - Twin Cities : CIPR 308
Graduate Program in Chemical Physics : 174 Frelinghuysen Road,
Department of Chemistry : Piscataway, NJ 08854
rada...@umn.edu : rad...@biomaps.rutgers.edu
=======================================================================
Sorry for the multiple e-mail addresses, just use the institute
appropriate address.
stderr-bj-c395fdec-7953-11e3-88c8-848f69fdd8e1-agent.txt
stdout-bj-c395fdec-7953-11e3-88c8-848f69fdd8e1-agent.txt
bjrun0.out

Melissa Romanus

Jan 11, 2014, 10:06:11 PM
to bigjob...@googlegroups.com, Andre Luckow, Yaakoub El Khamra
Brian,

I'm not sure what's going on here from just looking at the logs... but I was hoping Andre L might have some insight.

The pilot definitely starts running. It then appears to try to grab a subjob to execute, but the job is suddenly cancelled. I can't tell from the logs whether it found a subjob, but the pilot job shouldn't shut down even if there were no subjobs. Given how abrupt this looks in the logs, I think it is coming from TACC/SLURM in some way and not from BigJob. If that's the case, I'm surprised the scheduler isn't printing some feedback into at least one of these stdout or stderr files.

If AL doesn't have a different insight from just looking at the logs, then I think we might need to get Yaakoub a bit more proactively involved in this issue.

As a note, when the Pilot shuts down normally (via a pilotjob.cancel() call), you do get "slurmd[c405-002]: *** JOB 2451071 CANCELLED AT 2014-01-09T11:31:15 ***" at the bottom of bigjob-stderr-*-agent.txt. So that's not actually an error message; it just shows that the job was cancelled. The question is who cancels the job: BigJob (somehow? doubtful...) or TACC (why?). TACC/SLURM would usually throw out the job before it starts running if a parameter were missing, but that's not what's happening, as we can clearly see from the logs that the Pilot becomes active in the SLURM queue. Let's wait to hear from AL and then perhaps enlist YYE's help.

-Melissa



--
You received this message because you are subscribed to the Google Groups "bigjob-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bigjob-users...@googlegroups.com.
To post to this group, send email to bigjob...@googlegroups.com.
Visit this group at http://groups.google.com/group/bigjob-users.
For more options, visit https://groups.google.com/groups/opt_out.

Melissa Romanus

Jan 13, 2014, 9:56:35 AM
to bigjob...@googlegroups.com
Still at bootcamp today, guys. Not sure if Brian/AL have time to get to this in the meantime?

---------- Forwarded message ----------
From: Yaakoub El Khamra <yelk...@gmail.com>
Date: Sat, Jan 11, 2014 at 10:08 PM
Subject: Re: trouble running on stampede
To: Melissa Romanus <melissa...@rutgers.edu>
Cc: Andre Luckow <andre....@gmail.com>



Hey Melissa
Can you get debug to print the full ibrun command, please? I suspect a problem there if ibrun shows its help message. A common issue is that the user tries to exec:

ibrun $EXE......

and $EXE is never defined (perhaps module not loaded).
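
A minimal sketch of the check described above, written in Python (3.3 or later, for `shutil.which`) since that is the language BigJob is written in. It assumes the launch command is assembled from an executable name before being handed to the launcher. `resolve_executable` is a hypothetical helper for illustration, not part of BigJob or of TACC's tooling. It fails early when the name is empty or cannot be found on PATH, which is the case where the launcher would otherwise receive no program to run.

```python
import os
import shutil


def resolve_executable(name):
    """Return the absolute path of `name`, or raise if it cannot be run.

    Hypothetical helper: catches the "$EXE was never defined" case
    before the launcher is ever invoked.
    """
    if not name or not name.strip():
        raise ValueError("executable is empty; was the variable ever set?")
    path = shutil.which(name.strip())
    if path is None:
        raise FileNotFoundError(
            "executable %r not found on PATH; is the module loaded?" % name)
    return os.path.abspath(path)
```

Calling this once in the submitting script, before any compute unit is described, would turn a silent launcher usage message into an explicit Python exception.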



Regards
Yaakoub El Khamra

rad...@rci.rutgers.edu

Jan 13, 2014, 11:53:51 AM
to bigjob...@googlegroups.com
Apologies. I was without power most of the weekend.

I am still having this problem. I recently updated BigJob (easy_install -U
bigjob) and tried again; nothing different that I can see.

Should I try something drastic like deleting and reinstalling BigJob (as
Ole did a few days ago)? That seems like it should be fairly safe to do,
but it would require all of the testing to be done again.

Brian

rad...@rci.rutgers.edu

Jan 14, 2014, 9:37:44 AM
to bigjob...@googlegroups.com
Still looking for answers on this problem.

I can run the normal test case on Stampede with the same AMBER executable.
I can confirm (perhaps unhelpfully) that the "JOB #### CANCELLED" message
in bj-stderr is not an error message and appears for a job that runs as
expected. Anyone know why this goes to stderr if it's not an error? Is
this new or have I just never noticed this before?

Differences between the two jobs:
- one runs in my HOME directory, the other in SCRATCH
- I've only run the working job in the development queue, although scaling
down the production job to 1 node in the development queue doesn't make it
work

Other things that I can think of:
I have two allocations on Stampede and one (the default apparently) has a
negative SU balance. I confirmed this yesterday when I ran an interactive
job and was rejected unless I explicitly used the second account

I am badly in need of running some short jobs for timings for a grant
proposal due very soon.

Brian

Shantenu Jha

Jan 14, 2014, 9:52:19 AM
to bigjob...@googlegroups.com
Brian

Can you confirm that the job that is being cancelled is not using the
expired allocation by default?

Shantenu

Melissa Romanus

Jan 14, 2014, 9:58:34 AM
to bigjob...@googlegroups.com, Yaakoub El Khamra
YYE:

This is how BigJob builds the command for ibrun using MPI:

    elif self.LAUNCH_METHOD == "ibrun" and spmdvariation.lower() == "mpi":
        # Non MPI launch is handled via standard SSH
        command = envi + "mpirun_rsh   -np " + str(numberofprocesses) + " -hostfile " + machinefile + "  `build_env.pl` " + executable + " " + arguments

In Brian's case, his executable is AMBER, arguments are just generic AMBER arguments.

Then it just adds the change dir into working directory:

    if self.LAUNCH_METHOD == "aprun" or (self.LAUNCH_METHOD == "ibrun" and spmdvariation.lower() == "mpi"):
        command = "cd " + workingdirectory + "; " + command
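
To see the full command that results, the two fragments above can be collapsed into one standalone function. This is only a sketch under assumptions: `build_mpi_command` is a hypothetical name, whitespace inside the string literals is normalized to single spaces, and the example values in the usage lines (`sander.MPI`, the paths, the process count) are placeholders rather than Brian's actual settings.

```python
def build_mpi_command(envi, numberofprocesses, machinefile,
                      executable, arguments, workingdirectory):
    """Assemble the launch string the same way the fragments above do.

    Hypothetical standalone version for debugging; not BigJob's own API.
    """
    if not executable:
        # An empty executable leaves the launcher with nothing to run.
        raise ValueError("executable must not be empty")
    command = (envi + "mpirun_rsh -np " + str(numberofprocesses)
               + " -hostfile " + machinefile
               + " `build_env.pl` " + executable + " " + arguments)
    # The working directory change is prepended last, as in the original.
    return "cd " + workingdirectory + "; " + command


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    print(build_mpi_command("", 16, "machinefile", "sander.MPI",
                            "-O -i mdin", "/scratch/run"))
```

Printing or logging the returned string at DEBUG level is one way to get the full launch command into the agent output.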

We previously had an issue with multiple allocations, but it was confirmed fixed on Lonestar by J. Smith and me.

https://github.com/saga-project/BigJob/commit/bea569ee2b8ffa8cd28204d0f22b330167da4839

-Melissa

Melissa Romanus

Jan 14, 2014, 10:13:08 AM
to bigjob...@googlegroups.com
But I maintain that the logs look like the Pilot goes into the queue and goes from Waiting to Running before it is killed. If Andre disagrees, please let me know. Otherwise, I don't think we should operate under the assumption that the job itself is formulated incorrectly. For instance, BigJob shows the job submitted and a job ID obtained. Previously, Stampede has been very good about refusing to let a job into the queue in the first place if it is not formulated correctly, or if you don't have enough allocation for the resources you're asking for. I need to know from YYE whether it's possible for a job to become fully active and then be immediately killed based on some criteria.

01/09/2014 11:30:49 AM - bigjob - DEBUG - Trying to submit pilot job to: slurm+ssh://rad...@login1.stampede.tacc.utexas.edu
01/09/2014 11:30:54 AM - bigjob - DEBUG - Submission succeeded. Job ID: [slurm+ssh://rad...@login1.stampede.tacc.utexas.edu]-[2451071]
01/09/2014 11:30:54 AM - bigjob - DEBUG - Create PilotCompute for BigJob: bigjob:bj-c395fdec-7953-11e3-88c8-848f69fdd8e1:login1.stampede.tacc.utexas.edu
01/09/2014 11:31:15 AM - bigjob - DEBUG - ### ComputeDataService wait for completion of 0 CUs/ 0 DUs ###
01/09/2014 11:31:15 AM - bigjob - DEBUG - ### END WAIT ###
01/09/2014 11:31:15 AM - bigjob - DEBUG - Cancel Pilot Job
01/09/2014 11:31:15 AM - bigjob - DEBUG - Cancel Job Service
01/09/2014 11:31:15 AM - bigjob - DEBUG - Cancel Job Service done
01/09/2014 11:31:15 AM - bigjob - DEBUG - stop pilot job: bigjob:bj-c395fdec-7953-11e3-88c8-848f69fdd8e1:login1.stampede.tacc.utexas.edu
01/09/2014 11:31:15 AM - bigjob - DEBUG - update state of pilot job to: Done stopped: True
01/09/2014 11:31:15 AM - bigjob - DEBUG - delete pilot job: bigjob:bj-c395fdec-7953-11e3-88c8-848f69fdd8e1:login1.stampede.tacc.utexas.edu
01/09/2014 11:31:17 AM - bigjob - DEBUG - Cancel Pilot Job finished


Melissa Romanus

Jan 14, 2014, 10:15:24 AM
to bigjob...@googlegroups.com
Brian,

I can leave the Boot Camp and come up for the remainder of the morning. They are just teaching people about Chimera and I am already familiar with PyMol from Biochem.

I just need to attend the afternoon session, because we need it for our symposium on Friday.

I will ping Yaakoub and see if he's available. Then come up.

-Melissa

rad...@rci.rutgers.edu

Jan 14, 2014, 1:59:18 PM
to bigjob...@googlegroups.com
This has been solved. The problem was in the specific python script
calling BigJob, not in BigJob itself, nor Stampede. Apparently we have not
been careful in checking input parameters and some very silly settings
were accepted when attempting to run short timings.

Brian

Andre Merzky

Jan 14, 2014, 4:16:26 PM
to bigjob...@googlegroups.com
Since it took a while to find the cause of the problem, do you have
any suggestion on where to improve error messages etc?

Best, Andre.
Nothing is really difficult.

Melissa Romanus

Jan 14, 2014, 4:19:26 PM
to bigjob...@googlegroups.com
It is in our async code. It's a check on the science parameters, not something for BJ.

Melissa Romanus
RADICAL
The Cloud and Autonomic Computing Center
Electrical and Computer Engineering Dept.
Rutgers University
Email: mel...@cac.rutgers.edu

Mark Santcroos

Jan 15, 2014, 11:26:17 AM
to bigjob...@googlegroups.com
I guess what Andre meant is that it was apparently not obvious from looking at BJ that the problem was in application space, while ideally that should have been easy to find out.

Gr,

Mark

Andre Merzky

Jan 15, 2014, 11:29:14 AM
to bigjob...@googlegroups.com
+1

rad...@rci.rutgers.edu

Jan 15, 2014, 12:35:20 PM
to bigjob...@googlegroups.com
I'll try to detail what happened:

Our application contains a while loop that essentially submits jobs from
the pilot (to the pilot?) for a given period of time. When I shortened the
wall time passed to the pilot (for short timing runs) I neglected to
change some other parameters which caused the while loop to be skipped
outright (and thus science did not happen). No jobs were ever submitted
and thus no errors occurred or could be reported.
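
A minimal sketch of the kind of up-front check that would have caught this. The thread does not say which parameters were involved, so `walltime_minutes`, `cycle_minutes`, and `startup_minutes` are hypothetical names chosen for illustration, and `check_run_parameters` is not part of our application or of BigJob. The idea is simply to compute how many loop iterations the settings allow and to refuse to start if the answer is zero.

```python
def check_run_parameters(walltime_minutes, cycle_minutes, startup_minutes=0):
    """Return how many submission cycles fit in the pilot's wall time.

    Hypothetical parameters: raises instead of letting the submission
    loop be skipped silently.
    """
    if walltime_minutes <= 0 or cycle_minutes <= 0:
        raise ValueError("wall time and cycle length must be positive")
    usable = walltime_minutes - startup_minutes
    cycles = max(0, usable // cycle_minutes)
    if cycles == 0:
        raise ValueError(
            "no submission cycle fits: %s usable minutes, %s minutes per cycle"
            % (usable, cycle_minutes))
    return cycles
```

Calling it before the pilot is created means a shortened wall time with an unchanged cycle length stops the script with an explicit error instead of launching an idle pilot.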

If BigJob could do anything to help this situation (and I don't think it
really needs to) it could maybe raise a warning when a Pilot job cleans up
without ever submitting a subjob (compute unit?).
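
One way such a warning could look, as a sketch only. `PilotUsageTracker` is a hypothetical class and none of its methods are BigJob API; it only illustrates counting submissions and warning at cleanup when the count is still zero, using Python's standard `warnings` module.

```python
import warnings


class PilotUsageTracker(object):
    """Counts submissions for one pilot and warns if it was never used.

    Hypothetical helper, not part of BigJob.
    """

    def __init__(self, pilot_id):
        self.pilot_id = pilot_id
        self.submitted = 0

    def record_submission(self):
        # Call once for every subjob / compute unit handed to the pilot.
        self.submitted += 1

    def on_cancel(self):
        """Return True if the pilot was used; warn and return False if not."""
        if self.submitted == 0:
            warnings.warn(
                "pilot %s is being cancelled without ever receiving a subjob"
                % self.pilot_id, RuntimeWarning)
            return False
        return True
```

A warning rather than an exception keeps the cleanup path intact, which matters because cancelling an unused pilot is still a legitimate thing to do.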

Brian

Andre Merzky

Jan 15, 2014, 12:40:39 PM
to bigjob...@googlegroups.com
Ah, thanks for the explanation, that makes sense. Appreciated!

Best, Andre.