[slurm-dev] Output and Error Files

Yinka Adeosun

Aug 28, 2012, 10:18:04 AM
to slurm-dev

Hi,

 

I am having an issue with SLURM not creating the output (xxx.out) and error (xxx.err) files in the directory the job was run from. Also, sview reports that the job is running, but it is actually not running.

 

Yinka Adeosun

Unix Administrator

Vistronix, Inc

Contractor to US EPA Chesapeake Bay Program Office

410-295-1323

 

Andy Riebs

Aug 28, 2012, 10:47:04 AM
to slurm-dev
Yinka,

The following information would help considerably in identifying the nature of your problem:

1. What version of SLURM? (If you configured and built it, what options did you use?)
2. What OS? What hardware?
3. Are user directories shared across the cluster?
4. Do you use automount for the user directories?
5. Can you include a copy of your slurm.conf?
6. Do you know where the user output *is* going?
7. Do the slurmctld.log and slurmd.log log files report any errors?

Andy


-- 
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1-786-263-9743
My opinions are not necessarily those of HP

Yinka Adeosun

Aug 28, 2012, 1:47:04 PM
to slurm-dev

I started here less than a week ago, but I'll answer to the best of my knowledge.

 

1. What version of SLURM? (If you configured and built it, what options did you use?) -- 2.1.6-1
2. What OS? What hardware? -- RHEL5, Dell clusters (HPC)
3. Are user directories shared across the cluster? -- Yes
4. Do you use automount for the user directories? -- Yes
5. Can you include a copy of your slurm.conf? -- Yes, attached
6. Do you know where the user output *is* going? -- Created where the job is run
7. Do the slurmctld.log and slurmd.log log files report any errors? -- Yes, for example from the nodes: "…unable to register: unable to contact slurm controller (connect failure)"

Thanks,

Yinka

slurm.conf-20120828.docx

Andy Riebs

Aug 29, 2012, 8:54:05 AM
to slurm-dev
Hi Yinka,

I'm confused by your response to number 6. In different words, didn't you say that the base problem is that the files are not going to the directory whence the job was run?

Andy


Andy Riebs

Aug 29, 2012, 9:23:04 AM
to slurm-dev
On the sview problem, how long does sview report an incorrect state: a few seconds, a few minutes, until slurm is restarted?




Yinka Adeosun

Aug 29, 2012, 9:54:05 AM
to slurm-dev, andy....@hp.com

Andre,

 

Thanks for getting back to me. Yes, the output and error files are not being created where the jobs are run. In my response I meant that SLURM is supposed to create the error/out files in the directory the job was run from.

Yinka Adeosun

Aug 29, 2012, 10:23:05 AM
to slurm-dev, andy....@hp.com

A few seconds

Yinka Adeosun

Aug 29, 2012, 10:52:04 AM
to slurm-dev, andy....@hp.com

Andy,

By the way, the "…unable to register: unable to contact slurm controller (connect failure)" issue is resolved, but the error/out files are still not being created.

Yinka.

Andy Riebs

Aug 29, 2012, 11:21:09 AM
to slurm-dev, slurm-dev
I've never actually used sview, but I suspect that a delay of a few seconds is normal. Anyone else care to comment?




Aaron Knister

Aug 29, 2012, 11:50:06 AM
to slurm-dev
Hi Yinka,

I'm actually the one that built the environment you're working on :) I'll take a stab at your questions later today. 

-Aaron

P.S. bluefish *desperately* needs a SLURM upgrade. It's running 2.1.6 IIRC.
--
Aaron Knister
Systems Administrator
Division of Information Technology
University of Maryland, Baltimore County
aar...@umbc.edu

Aaron Knister

Aug 29, 2012, 12:19:06 PM
to slurm-dev
Hi Yinka,

I tried to send a couple of e-mails to the list this morning and they didn't seem to make it, so I apologize for any duplicate messages.

I set up the cluster you're working on and have periodically jumped in to help out over the years. Usually this problem results from communication failures between the head node and the compute nodes. Can you send along the output of sinfo? Can you also attempt to resolve each compute node's name from the head node (e.g., dig @localhost bluefish1)? Ensure you can resolve every compute node by name, then attempt to ping them.

Best,
Aaron
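
The name-resolution check suggested above could be scripted roughly as follows. This is only a sketch under stated assumptions: the function name `check_nodes` and the node names in `NODES` are hypothetical placeholders, and the lookup goes through Python's `socket.gethostbyname` rather than dig.

```python
import socket
import sys


def check_nodes(names, resolve=socket.gethostbyname):
    """Map each node name to its resolved IPv4 address, or None if lookup fails."""
    results = {}
    for name in names:
        try:
            results[name] = resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = None
    return results


# Hypothetical node names; replace with the compute nodes of your cluster.
NODES = ["bluefish1", "bluefish2"]

if __name__ == "__main__":
    names = sys.argv[1:] or NODES
    for name, address in check_nodes(names).items():
        print(f"{name}: {address if address else 'DOES NOT RESOLVE'}")
```

Note that `socket.gethostbyname` uses the system resolver, which may consult local sources such as /etc/hosts as well as DNS, whereas dig queries a DNS server directly. The two can therefore disagree, so a name that resolves here may still fail with dig.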

Aaron Knister

Aug 29, 2012, 12:48:04 PM
to slurm-dev, yade...@chesapeakebay.net
Sending this again but with Yinka on the CC. My original mail is nowhere to be found :(

Yinka Adeosun

Aug 29, 2012, 1:17:05 PM
to slurm-dev

Aaron,

 

Glad to hear from you. I've heard high praise of you from everyone.

 

sinfo output:

bart@prometheus ]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
active*      up   infinite      1  idle* prometheus1
active*      up   infinite      3   idle prometheus[2-4]
debug        up   infinite      1  idle* prometheus6
debug        up   infinite      1   idle prometheus5

 

dig from the head node does not resolve the compute nodes' names. Interestingly, the nodes are configured for DHCP.

 

Thanks,

Yinka.
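
In sinfo output, a "*" appended to a node state (as in idle*) marks a node that is currently not responding. A minimal sketch for pulling those rows out of the listing above follows. The function name `non_responding` is hypothetical, and the parser assumes the default six-column layout shown in this message.

```python
SAMPLE = """\
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
active*      up   infinite      1  idle* prometheus1
active*      up   infinite      3   idle prometheus[2-4]
debug        up   infinite      1  idle* prometheus6
debug        up   infinite      1   idle prometheus5
"""


def non_responding(sinfo_text):
    """Return (partition, nodelist) pairs whose STATE column ends in '*'."""
    lines = sinfo_text.strip().splitlines()
    header = lines[0].split()
    state_col = header.index("STATE")
    nodes_col = header.index("NODELIST")
    flagged = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip rows that do not match the assumed layout
        if fields[state_col].endswith("*"):
            # A '*' on the partition name only marks the default partition.
            flagged.append((fields[0].rstrip("*"), fields[nodes_col]))
    return flagged


if __name__ == "__main__":
    for partition, nodes in non_responding(SAMPLE):
        print(f"{partition}: {nodes} is not responding")
```

Applied to the listing above, this flags prometheus1 and prometheus6, which is consistent with a communication problem between the head node and those two nodes.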

Yinka Adeosun

Aug 29, 2012, 1:46:06 PM
to slurm-dev

Aaron,

 

Thanks. I feel much better knowing I am talking to the master. I can ping and ssh to the nodes from the head node, but their names do not resolve with dig. Also, FYI, prometheus = bluefish :)

 

Regards,

Yinka


Aaron Knister

Aug 29, 2012, 2:15:05 PM
to slurm-dev, slurm-dev
Aw shucks, thanks :)

Let's take this offline, since I don't think it's going to end up being a SLURM issue; it's probably an xCAT/named/network oddity. If it does turn out to be a SLURM issue, we can report it back to the list.

Best,
Aaron