Job status not reporting


Jeff Cleverley

Jul 10, 2014, 7:34:20 PM
to isilon-u...@googlegroups.com
Greetings,

For some reason, jobs such as SmartPools appear to be running, but they don't show up in either the web interface or the command line.

I've started jobs from both the command line and the GUI, and I can see job activity in the /var/log/isi_job_d.log file.

I've looked at all the services that may be associated with this, and everything seems to be enabled. I've disabled and re-enabled several of them, including isi_job_d, but it has no effect.

We had similar issues in the past with events and the celogs, but those look OK from what I can tell, and the processes are running also.

Any ideas before I go through the extended support debugging mode?

Thanks,

Jeff

--
Jeff Cleverley
Unix Systems Administrator
4380 Ziegler Road
Fort Collins, Colorado 80525
970-288-4611

Peter Serocka

Jul 11, 2014, 12:03:08 AM
to isilon-u...@googlegroups.com
Do the jobs show in the output of:

isi_gconfig -t job-status

Can you open the file

/ifs/.ifsvar/run/isi_job_d.lock  

(Just cat ...; the file is empty, but it shouldn't be locked all the time.)

Both should be checked on all nodes.



Peter Serocka
CAS-MPG Partner Institute for Computational Biology (PICB)
Shanghai Institutes for Biological Sciences (SIBS)
Chinese Academy of Sciences (CAS)
320 Yue Yang Rd, Shanghai 200031, China





Jeff Cleverley

Jul 11, 2014, 4:22:53 PM
to isilon-u...@googlegroups.com
Peter,

The locks appear to be OK, since none of the systems fail or give errors. The isi_gconfig command returns the same output on all nodes, but from what I can tell it is only showing 1 job. I've truncated the output a little because all of the history.* entries are pretty much the same.

Here is the output of isi job status plus part of the isi_gconfig output. The next job ID (next_jid = 9390) is much higher than the last one listed in the job status.

>>
isilon-4# isi job status
The job engine is running.

No running or queued jobs.

Recent finished jobs:
ID   Type           State     Time              
-------------------------------------------------
9269 SnapshotDelete Succeeded 2014-07-02T21:10:49
9270 SnapshotDelete Succeeded 2014-07-02T21:40:27
9271 SmartPools     Succeeded 2014-07-02T23:08:23
9273 SnapshotDelete Succeeded 2014-07-02T23:11:23
9272 FSAnalyze      Succeeded 2014-07-02T23:19:35
9274 SnapshotDelete Succeeded 2014-07-02T23:40:24
9275 SnapshotDelete Succeeded 2014-07-03T00:11:40
9276 SnapshotDelete Succeeded 2014-07-03T01:13:02
9277 SnapshotDelete Succeeded 2014-07-03T08:12:41
9278 SnapshotDelete Succeeded 2014-07-03T09:10:44
-------------------------------------------------
Total: 10                                       
>>
isilon-4# isi_gconfig -t job-status
[root] {version:1}                
next_jid (uint32) = 9390          
last_lin_count.last_lin_count (uint64) = 0
last_lin_count.last_lin_count_date (uint64) = 1405105594
last_group_id.devid (ifs_devid_t) = 13                 
last_group_id.serial (gmp_group_serial_t) = 735        
(empty dir jobs)                                       
(empty dir failed)                                     
history.0.type (int) = 9                               
history.0.last_success_start_time (time_t) = 1404957072
history.0.last_success_end_time (time_t) = 1404990366  
history.0.last_system_cancel_time (time_t) = 0         
history.0.last_user_cancel_time (time_t) = 0           
.....
coordinator.devid (ifs_devid_t) = 8                    
coordinator.state (char*) =                            
coordinator.num_outstanding_tasks (uint32) = 0         
coordinator.connected (bool) = true                    
coordinator.unconnected (char*) = {}                   
coordinator.degraded (bool) = false                    
coordinator.down_or_read_only (bool) = false           
coordinator.stats_ready (bool) = true                  
coordinator.initial_config_time (time_t) = 1358202764  
(empty dir managers)                                   
(empty dir restriping_devices)                         

I honestly don't know what to make of this output :-)

Thanks,

Jeff
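
The gap Jeff points out between next_jid and the last listed job can be quantified with a few lines of parsing. This is only a minimal sketch, assuming the two output formats look exactly like the excerpts above; the function names are made up for illustration, and the sample text is abbreviated from the post. A gap in IDs only shows that IDs were allocated, not what the jobs did.

```python
import re

def highest_listed_job_id(job_status_text):
    """Largest job ID among the table rows of `isi job status` output."""
    # A row looks like: "9278 SnapshotDelete Succeeded 2014-07-03T09:10:44"
    rows = re.finditer(
        r"^(\d+)\s+\S+\s+\S+\s+\d{4}-\d{2}-\d{2}T", job_status_text, re.M
    )
    ids = [int(m.group(1)) for m in rows]
    return max(ids) if ids else None

def next_jid(gconfig_text):
    """Value of the next_jid entry in `isi_gconfig -t job-status` output."""
    m = re.search(r"^next_jid \(uint32\) = (\d+)", gconfig_text, re.M)
    return int(m.group(1)) if m else None

# Abbreviated excerpts of the output quoted in the post above.
JOB_STATUS = """\
ID   Type           State     Time
-------------------------------------------------
9269 SnapshotDelete Succeeded 2014-07-02T21:10:49
9271 SmartPools     Succeeded 2014-07-02T23:08:23
9278 SnapshotDelete Succeeded 2014-07-03T09:10:44
-------------------------------------------------
"""
GCONFIG = """\
[root] {version:1}
next_jid (uint32) = 9390
"""

last_listed = highest_listed_job_id(JOB_STATUS)
upcoming = next_jid(GCONFIG)
# IDs strictly between the last listed job and the next one to be handed out.
unlisted = upcoming - last_listed - 1
print(f"last listed: {last_listed}, next_jid: {upcoming}, unlisted IDs: {unlisted}")
```

With the numbers from the post, the IDs 9279 through 9389 have been handed out without any of them appearing in the finished-jobs list.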

Peter Serocka

Jul 13, 2014, 9:34:55 AM
to isilon-u...@googlegroups.com
Jeff,

this is a bit strange:
> (empty dir jobs)
which indicates that isi_gconfig is not
aware of any running jobs.

You might check when the file
/ifs/.ifsvar/modules/jobengine/status.gc
was last updated, but we are already
going into "extended debugging mode" here ;-)

Frankly, it's better to have support check this.

There is a KB article (89145) about SnapshotDelete
job status getting stuck in 6.5.4, but it
explicitly warns not to mess around in
/ifs/.ifsvar/modules/jobengine/cp/… for other jobs.

Cheers, and best of luck,
— Peter
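
For anyone reading along, the `(empty dir jobs)` marker and the time_t fields in Jeff's isi_gconfig output can be pulled out mechanically. The following is a rough sketch, assuming the line format shown in the thread; the function name is hypothetical, treating a value of 0 as "never" is an assumption, and the timestamps are decoded as UTC because the cluster's time zone is not stated.

```python
import re
from datetime import datetime, timezone

def summarize_gconfig(text):
    """Return (empty_dirs, times) parsed from `isi_gconfig -t job-status` text."""
    # Lines such as "(empty dir jobs)" name subtrees with no entries.
    empty_dirs = re.findall(r"^\(empty dir (\w+)\)", text, re.M)
    times = {}
    for key, raw in re.findall(r"^(\S+) \(time_t\) = (\d+)\s*$", text, re.M):
        seconds = int(raw)
        # Assumption: 0 means the event never happened.
        times[key] = (
            datetime.fromtimestamp(seconds, tz=timezone.utc) if seconds else None
        )
    return empty_dirs, times

# Excerpt of the output quoted earlier in the thread.
SAMPLE = """\
(empty dir jobs)
(empty dir failed)
history.0.type (int) = 9
history.0.last_success_start_time (time_t) = 1404957072
history.0.last_success_end_time (time_t) = 1404990366
history.0.last_system_cancel_time (time_t) = 0
"""

empty_dirs, times = summarize_gconfig(SAMPLE)
print("empty:", empty_dirs)
for key, when in times.items():
    print(key, "->", when.isoformat() if when else "never")
```

Decoded this way, history.0.last_success_end_time is 2014-07-10 11:06:06 UTC, about a week after the newest entry in the finished-jobs list (2014-07-03), which fits the picture of jobs completing without being reported.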

Jeff Cleverley

Jul 13, 2014, 1:30:35 PM
to isilon-u...@googlegroups.com
Peter,

I'll get a case opened for it.  The status.gc file is pretty current:

-rw-r-----       1 root  wheel    4361 Jul 13 08:09 status.gc

Thanks for the suggestions.

Jeff
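
The "when was status.gc last updated" check can also be scripted rather than read off an ls listing. A minimal sketch, assuming Python is available where the check runs; the one-hour threshold and the function names are arbitrary choices for illustration, not anything the job engine defines.

```python
import os
import time

# Path taken from the thread.
STATUS_FILE = "/ifs/.ifsvar/modules/jobengine/status.gc"

def seconds_since_update(path, now=None):
    """Seconds elapsed since the file's modification time."""
    now = time.time() if now is None else now
    return now - os.stat(path).st_mtime

def is_stale(path, stale_after=3600, now=None):
    """True if the file has not been modified within `stale_after` seconds.

    The default of one hour is a hypothetical threshold.
    """
    return seconds_since_update(path, now=now) > stale_after

# Only meaningful on a cluster node; skipped where the file does not exist.
if os.path.exists(STATUS_FILE):
    age = seconds_since_update(STATUS_FILE)
    label = "stale" if is_stale(STATUS_FILE) else "recent"
    print(f"{STATUS_FILE}: updated {age:.0f}s ago ({label})")
```

A recent modification time, as in Jeff's listing, suggests the file is still being written, so the missing status is probably not explained by the file simply going stale.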
