Jobs get stuck in the TERMINATING stage and do not disappear from the VM after being killed.


Hanna ten Brink

Sep 24, 2018, 8:22:16 AM
to gc3pie
Dear GC3Pie Team,

I recently encountered two (related) problems with GC3Pie.
I have multiple jobs that run in parallel. The jobs run in the cloud, and their output is saved on my local computer.

Sometimes a job gets stuck in the TERMINATING stage and keeps saving its output to my local computer, resulting in many folders with the same files in them (problem 1). I have no idea why this happens; it seems to happen randomly.
If I then manually kill these jobs ("gselect -s SessionName --state TERMINATING | xargs gkill -s SessionName"), the jobs are killed and get the label 'failed', and the run stops saving output from these jobs to the local computer. However, these jobs are not removed from the cloud and still occupy some of the cores, so the progress of the session slows down a lot because it cannot make full use of the available resources (problem 2). Can someone give me advice on what to do?

Thanks a lot,
Hanna

Riccardo Murri

Sep 24, 2018, 10:32:57 AM
to gc3...@googlegroups.com
Hello Hanna,

> I recently encountered two (related) problems with GC3Pie.

Lucky you :-) I have encountered many more ;-)


> Sometimes a job gets stuck in the TERMINATING stage and keeps saving its
> output to my local computer, resulting in many folders with the same
> files in them (problem 1). I have no idea why this happens; it seems to
> happen randomly.

The only reason I can imagine is that the downloading is considered
"unsuccessful" for some reason, so it is attempted again during the next
cycle, and then again, and so on.
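
In other words, each update cycle does roughly the following (an
illustrative pseudo-Python sketch, *not* the actual gc3libs code; only
`TransportError` is a real gc3libs name):

    from gc3libs.exceptions import TransportError

    def progress_cycle(core, tasks, download_dir):
        for task in tasks:
            if task.execution.state == 'TERMINATING':
                try:
                    # try to download the task's output files
                    core.fetch_output(task, download_dir)
                    # only reached if the download succeeded:
                    task.execution.state = 'TERMINATED'
                except TransportError:
                    # download failed: the state stays TERMINATING,
                    # so the same download is attempted again on the
                    # next cycle -- over and over
                    pass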

It would help to see if the DEBUG level logs have something to say.
Can you collect the DEBUG logs from such a problem situation?

To get the DEBUG logs: look into file `$HOME/.gc3/debug.log` or run
your session-based script adding the `-vvvv` option and save the console
output. For instance:

    ./my-script.py -s session -vvvv 2>&1 | tee debug.log


> If I then manually kill these jobs ("gselect -s SessionName --state
> TERMINATING | xargs gkill -s SessionName"), the jobs are killed and get
> the label 'failed', and the run stops saving output from these jobs to
> the local computer. However, these jobs are not removed from the cloud
> and still occupy some of the cores, so the progress of the session
> slows down a lot because it cannot make full use of the available
> resources (problem 2).

Can you please post the output of `gcloud list --verbose` after killing
the problem jobs?

If no instance is running any job, it is safe to delete them all (e.g.,
via `gcloud terminate` or via the Science Cloud web interface) and then
restart your GC3Pie session-based script.

Ciao,
R

--
Riccardo Murri / Email: riccard...@gmail.com / Tel.: +41 77 458 98 32

Hanna ten Brink

Oct 1, 2018, 6:55:39 AM
to gc3pie
Hi Riccardo,

Sorry for the late reply; the error didn't occur for a while after my initial post.
'Luckily' it happened again this morning, so I can finally give some more information.

Each job consists of 7 calculations, each producing 3 output files, so 21 files in total per job. Somehow the calculation stops, and therefore only a few of the output files are produced. When GC3Pie downloads the output, it cannot find all the files in the folder and gives an error. It is not clear to me whether the calculation really stops, or whether GC3Pie somehow terminates it. The calculation runs without trouble on my local machine.

> It would help to see if the DEBUG level logs have something to say.
> Can you collect the DEBUG logs from such a problem situation?

The debug log is huge, so I have only printed below the part where one of the jobs gets into trouble:

[2018-10-01 10:21:05] gc3.gc3libs  DEBUG   : About to update state of application: MatlabApp.354662 (currently: RUNNING)
[2018-10-01 10:21:05] gc3.gc3libs  DEBUG   : SshTransport running `ps -p 1465 -o state=`... 
[2018-10-01 10:21:05] paramiko.transport DEBUG   : [chan 123] Max packet in: 32768 bytes
[2018-10-01 10:21:05] paramiko.transport DEBUG   : [chan 123] Max packet out: 32768 bytes
[2018-10-01 10:21:05] paramiko.transport DEBUG   : Secsh channel 123 opened.
[2018-10-01 10:21:05] paramiko.transport DEBUG   : [chan 123] Sesch channel 123 request ok
[2018-10-01 10:21:05] paramiko.transport DEBUG   : [chan 123] EOF received (123)
[2018-10-01 10:21:05] gc3.gc3libs  DEBUG   : Executed command 'ps -p 1465 -o state=' on host '172.23.86.21'; exit code: 1
[2018-10-01 10:21:05] gc3.gc3libs  DEBUG   : Process with PID 1465 not found, assuming task MatlabApp.354662 has finished running.
[2018-10-01 10:21:05] gc3.gc3libs  DEBUG   : Calling state-transition handler 'terminating' on MatlabApp.354662 ...
[2018-10-01 10:21:05] paramiko.transport DEBUG   : [chan 123] EOF sent (123)
[2018-10-01 10:21:05] gc3.gc3libs  DEBUG   : Updating job info file for pid 1465
[2018-10-01 10:21:05] paramiko.transport.sftp DEBUG   : [chan 0] open('/home/ubuntu/.gc3/shellcmd.d/1465', 'wb')
[2018-10-01 10:21:05] paramiko.transport.sftp DEBUG   : [chan 0] open('/home/ubuntu/.gc3/shellcmd.d/1465', 'wb') -> 00000000
[2018-10-01 10:21:05] paramiko.transport.sftp DEBUG   : [chan 0] close(00000000)
[2018-10-01 10:21:05] gc3.gc3libs  DEBUG   : Reading resource utilization from wrapper file `/home/ubuntu/gc3libs.Wcavth/.gc3pie_shellcmd/resource_usage.txt` for task MatlabApp.354662 ...
[2018-10-01 10:21:05] paramiko.transport.sftp DEBUG   : [chan 0] open('/home/ubuntu/gc3libs.Wcavth/.gc3pie_shellcmd/resource_usage.txt', 'r')
[2018-10-01 10:21:05] paramiko.transport.sftp DEBUG   : [chan 0] open('/home/ubuntu/gc3libs.Wcavth/.gc3pie_shellcmd/resource_usage.txt', 'r') -> 00000000
[2018-10-01 10:21:05] paramiko.transport.sftp DEBUG   : [chan 0] close(00000000)
 
and later on:

[2018-10-01 11:03:25] gc3.gc3libs  DEBUG   : Ignored error in fecthing output of task 'MatlabApp.354662': TransportError: Could not download '/home/ubuntu/gc3libs.Wcavth/timefile16734.txt' on host '172.23.86.21' to '/Volumes/HtB_Data/Results_Competition1C/run16731/timefile16734.txt': TransportError: Could not stat() file '/home/ubuntu/gc3libs.Wcavth/timefile16734.txt' on host '172.23.86.21': IOError: [Errno 2] No such file
[2018-10-01 11:03:25] gc3.gc3libs  DEBUG   : (Original traceback follows.)
Traceback (most recent call last):
  File "/Users/hanna/gc3pie/src/gc3libs/core.py", line 1874, in progress
    changed_only=self.retrieve_changed_only)
  File "/Users/hanna/gc3pie/src/gc3libs/core.py", line 606, in fetch_output
    app, download_dir, overwrite, changed_only, **extra_args)
  File "/Users/hanna/gc3pie/src/gc3libs/core.py", line 674, in __fetch_output_application
    raise ex
TransportError: Could not download '/home/ubuntu/gc3libs.Wcavth/timefile16734.txt' on host '172.23.86.21' to '/Volumes/HtB_Data/Results_Competition1C/run16731/timefile16734.txt': TransportError: Could not stat() file '/home/ubuntu/gc3libs.Wcavth/timefile16734.txt' on host '172.23.86.21': IOError: [Errno 2] No such file

This timefile16734.txt indeed doesn't exist (the calculation stopped before producing it), so I can understand this last error.


> Can you please post the output of `gcloud list --verbose` after killing
> the problem jobs?

There are no strange messages here, just the same messages as before killing the problem jobs.

> If no instance is running any job, it is safe to delete them all (e.g.,
> via `gcloud terminate` or via the Science Cloud web interface) and then
> restart your GC3Pie session-based script.

The problem is that the stuck jobs are not all on the same instance: some jobs run fine on a given instance while others are stuck there. If I terminate the instance, that also kills the successful runs. This is not a huge problem, but it is a bit annoying.

I hope this information helps,
Hanna




On Monday, September 24, 2018 at 16:32:57 UTC+2, Riccardo Murri wrote:

Riccardo Murri

Oct 5, 2018, 3:16:47 AM
to gc3...@googlegroups.com
Dear Hanna,

sorry for the delay in replying. Based on what you write, my guess as
to what's happening is the following:

1. Your jobs start correctly, possibly multiple jobs per instance.
2. Sometimes a job fills up the available disk space on a VM -- in
that case part of the output will *not* be generated.
3. For those jobs, GC3Pie errors out while retrieving the output: the
job never moves out of TERMINATING state, and attempts to download
its outbox are made again and again.

Can you please check if this is the case, i.e., if the instances where
the jobs are failing have (near) 100% disk utilization?
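
If it helps, here is a quick way to check all instances from your
workstation; a sketch, assuming passwordless SSH access as user
'ubuntu' (as in your DEBUG logs) and that you fill in the instance IPs
reported by `gcloud list --verbose`:

    import subprocess

    # list your instances' IPs here; this one is taken from your logs
    instances = ["172.23.86.21"]
    for ip in instances:
        print("=== {0} ===".format(ip))
        # show disk usage of the filesystems mounted on the VM
        subprocess.call(["ssh", "ubuntu@{0}".format(ip), "df", "-h"])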

Ciao,
R

Hanna ten Brink

Nov 30, 2018, 9:13:33 AM
to gc3pie
Dear Riccardo,

Sorry for my late answer; I didn't have any runs for a while.
It does indeed seem to be a memory issue: one of the jobs runs out of memory, and then the rest of the jobs on that instance terminate as well. I can cancel those jobs with 'gkill', but the jobs are not removed from the instance. And although GC3Pie sends new jobs to such an instance, it does not make full use of it (e.g. only 3 instead of 4 jobs are running). Apart from increasing the memory per core, is there something I can do to at least properly remove the terminated jobs?

Thanks


On Friday, October 5, 2018 at 09:16:47 UTC+2, Riccardo Murri wrote:

Riccardo Murri

Dec 6, 2018, 4:35:55 AM
to gc3...@googlegroups.com
Dear Hanna,

sorry for my late reply -- I was busy moving to a new apartment...

Coming to your GC3Pie issue:

> It does indeed seem to be a memory issue: one of the jobs runs out of
> memory, and then the rest of the jobs on that instance terminate as
> well. I can cancel those jobs with 'gkill', but the jobs are not
> removed from the instance. And although GC3Pie sends new jobs to such
> an instance, it does not make full use of it (e.g. only 3 instead of 4
> jobs are running). Apart from increasing the memory per core, is there
> something I can do to at least properly remove the terminated jobs?

If a job is marked as TERMINATED, it will be automatically removed on the next scheduling cycle. The reason a VM is not fully utilized probably has more to do with memory requirements: if you're running on UZH Science Cloud, every CPU theoretically has at most 4GB of memory, but the practical limit is a bit lower, as some memory is used by Linux and the OS's background processes. So if you start your jobs with, say, a requirement of 4000MB of memory each, that might work well when the VMs are freshly booted, but may fail later on when, e.g., only 3750MB are free for the last job...  If this is the case, you should see an explicit message to that effect in the DEBUG level log.
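
For example, if your script constructs the Application objects
directly, you could request a bit less than the theoretical per-core
maximum. A minimal sketch (the command line and the 3500MB figure are
just illustrative values to adapt to your jobs):

    from gc3libs import Application
    from gc3libs.quantity import MB

    app = Application(
        # hypothetical command line, standing in for your MatlabApp
        arguments=["matlab", "-nodisplay", "-r", "mycalc"],
        inputs=[], outputs=[],
        output_dir="output",
        requested_cores=1,
        # ask for a bit less than the theoretical 4000MB per core, so
        # the last job still fits after Linux has taken its share:
        requested_memory=3500 * MB,
    )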

It is also possible that your jobs create many processes and some of them are still running after the main application has been killed; you can check the output of the command `ps fauxww` to verify whether this is the case (post it here if you do not know how to interpret it). If so, you would see the VMs' available memory diminish over time, as more and more jobs are executed. Rebooting a VM (when no jobs are running) would solve it.

Ciao,
R

Hanna ten Brink

Dec 10, 2018, 4:08:35 AM
to gc3pie
Dear Riccardo, 

Thank you for your explanation; however, GC3Pie does not remove jobs that fail because of memory issues or because they ran longer than the requested wall time.

For example, I had two jobs on the same VM that ran too long and were terminated by GC3Pie. They stayed in the TERMINATING state, and their (unfinished) output was repeatedly sent to my computer. After I manually killed them, their state changed to 'FAILED' and GC3Pie stopped sending their output to the local machine. However, the folders with their output remain on the VM, and GC3Pie thinks that there are 4 jobs, instead of 2, running on this particular VM, even though two of them are clearly not running (verified with ps fauxww). Manually removing the folders does not help.

ginfo gives me the following information about one of these jobs:

history:

            - Submitting to 'sciencecloud' at Sat Dec  8 03:37:38 2018
            - Transition from state NEW to state SUBMITTED at Sat Dec  8 03:37:47 2018
            - Submitted to 'sciencecloud' at Sat Dec  8 03:37:47 2018
            - Transition from state SUBMITTED to state RUNNING at Sat Dec  8 03:39:45 2018
            - Transition from state RUNNING to state TERMINATING at Sat Dec  8 11:39:33 2018
            - Execution failed on resource: sciencecloud at Sat Dec  8 11:39:33 2018
            - Remote job terminated by signal 15 at Sat Dec  8 11:39:33 2018
            - Transition from state TERMINATING to state TERMINATED (returncode: 65295) at Mon Dec 10 08:31:05 2018  <- I cancelled the job this morning
            - Cancelled at Mon Dec 10 08:31:05 2018

On Thursday, December 6, 2018 at 10:35:55 UTC+1, Riccardo Murri wrote: