Hi Riccardo,
Sorry for the late reply; the error didn't occur for a while after my initial post.
'Luckily' it happened again this morning, so I can finally give some more information.
Each job consists of 7 calculations, each producing 3 output files, so 21 files in total per job. Somehow the calculation stops, and therefore only a few output files are produced. When GC3Pie downloads the output, it cannot find all the files in the folder and gives an error. It is not clear to me whether the calculation really stops or whether GC3Pie somehow terminates it. The calculation runs without trouble on my local machine.
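Since each job should produce a fixed set of 21 files, a quick sanity check on a download directory would at least flag incomplete jobs early. This is only a sketch with hypothetical file names, not my actual script:

```python
from pathlib import Path

# Hypothetical naming scheme: each of the 7 calculations is expected
# to produce 3 files, 21 in total per job.
EXPECTED_PER_CALC = ["result{n}.mat", "log{n}.txt", "timefile{n}.txt"]

def missing_outputs(job_dir, n_calcs=7):
    """Return the expected output files that are absent from job_dir."""
    job_dir = Path(job_dir)
    missing = []
    for n in range(1, n_calcs + 1):
        for template in EXPECTED_PER_CALC:
            name = template.format(n=n)
            if not (job_dir / name).exists():
                missing.append(name)
    return missing
```

Running something like this before post-processing would make it obvious which jobs died partway through.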
[It would help to see if the DEBUG level logs have something to say.
Can you collect the DEBUG logs from such a problem situation? ]
The debug log is huge, so I have only pasted below the part where one of the jobs gets into trouble:
[2018-10-01 10:21:05] gc3.gc3libs DEBUG : About to update state of application: MatlabApp.354662 (currently: RUNNING)
[2018-10-01 10:21:05] gc3.gc3libs DEBUG : SshTransport running `ps -p 1465 -o state=`...
[2018-10-01 10:21:05] paramiko.transport DEBUG : [chan 123] Max packet in: 32768 bytes
[2018-10-01 10:21:05] paramiko.transport DEBUG : [chan 123] Max packet out: 32768 bytes
[2018-10-01 10:21:05] paramiko.transport DEBUG : Secsh channel 123 opened.
[2018-10-01 10:21:05] paramiko.transport DEBUG : [chan 123] Sesch channel 123 request ok
[2018-10-01 10:21:05] paramiko.transport DEBUG : [chan 123] EOF received (123)
[2018-10-01 10:21:05] gc3.gc3libs DEBUG : Executed command 'ps -p 1465 -o state=' on host '172.23.86.21'; exit code: 1
[2018-10-01 10:21:05] gc3.gc3libs DEBUG : Process with PID 1465 not found, assuming task MatlabApp.354662 has finished running.
[2018-10-01 10:21:05] gc3.gc3libs DEBUG : Calling state-transition handler 'terminating' on MatlabApp.354662 ...
[2018-10-01 10:21:05] paramiko.transport DEBUG : [chan 123] EOF sent (123)
[2018-10-01 10:21:05] gc3.gc3libs DEBUG : Updating job info file for pid 1465
[2018-10-01 10:21:05] paramiko.transport.sftp DEBUG : [chan 0] open('/home/ubuntu/.gc3/shellcmd.d/1465', 'wb')
[2018-10-01 10:21:05] paramiko.transport.sftp DEBUG : [chan 0] open('/home/ubuntu/.gc3/shellcmd.d/1465', 'wb') -> 00000000
[2018-10-01 10:21:05] paramiko.transport.sftp DEBUG : [chan 0] close(00000000)
[2018-10-01 10:21:05] gc3.gc3libs DEBUG : Reading resource utilization from wrapper file `/home/ubuntu/gc3libs.Wcavth/.gc3pie_shellcmd/resource_usage.txt` for task MatlabApp.354662 ...
[2018-10-01 10:21:05] paramiko.transport.sftp DEBUG : [chan 0] open('/home/ubuntu/gc3libs.Wcavth/.gc3pie_shellcmd/resource_usage.txt', 'r')
[2018-10-01 10:21:05] paramiko.transport.sftp DEBUG : [chan 0] open('/home/ubuntu/gc3libs.Wcavth/.gc3pie_shellcmd/resource_usage.txt', 'r') -> 00000000
[2018-10-01 10:21:05] paramiko.transport.sftp DEBUG : [chan 0] close(00000000)
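My reading of the excerpt above (not authoritative): the `ps -p 1465 -o state=` command seems to be how the shellcmd backend checks whether the remote process is still alive, and an exit code of 1 means no such PID exists, which GC3Pie then takes to mean the task has finished. The same check can be reproduced locally with plain Python:

```python
import os
import subprocess

def process_is_running(pid):
    """Mimic the liveness check from the log: `ps -p PID -o state=`
    exits 0 if and only if a process with that PID exists."""
    result = subprocess.run(
        ["ps", "-p", str(pid), "-o", "state="],
        capture_output=True, text=True,
    )
    return result.returncode == 0

# The current interpreter is certainly running:
print(process_is_running(os.getpid()))   # True

# A child process that has already exited and been reaped is not:
child = subprocess.Popen(["true"])
child.wait()
print(process_is_running(child.pid))     # False
```

So if the MATLAB process dies for any reason, this check alone cannot tell a crash apart from a normal exit, which matches what I see: GC3Pie just assumes the task finished.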
And later on:
[2018-10-01 11:03:25] gc3.gc3libs DEBUG : Ignored error in fecthing output of task 'MatlabApp.354662': TransportError: Could not download '/home/ubuntu/gc3libs.Wcavth/timefile16734.txt' on host '172.23.86.21' to '/Volumes/HtB_Data/Results_Competition1C/run16731/timefile16734.txt': TransportError: Could not stat() file '/home/ubuntu/gc3libs.Wcavth/timefile16734.txt' on host '172.23.86.21': IOError: [Errno 2] No such file
[2018-10-01 11:03:25] gc3.gc3libs DEBUG : (Original traceback follows.)
Traceback (most recent call last):
File "/Users/hanna/gc3pie/src/gc3libs/core.py", line 1874, in progress
changed_only=self.retrieve_changed_only)
File "/Users/hanna/gc3pie/src/gc3libs/core.py", line 606, in fetch_output
app, download_dir, overwrite, changed_only, **extra_args)
File "/Users/hanna/gc3pie/src/gc3libs/core.py", line 674, in __fetch_output_application
raise ex
TransportError: Could not download '/home/ubuntu/gc3libs.Wcavth/timefile16734.txt' on host '172.23.86.21' to '/Volumes/HtB_Data/Results_Competition1C/run16731/timefile16734.txt': TransportError: Could not stat() file '/home/ubuntu/gc3libs.Wcavth/timefile16734.txt' on host '172.23.86.21': IOError: [Errno 2] No such file
This file timefile16734.txt indeed doesn't exist, so I can understand this last error.
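Since some outputs are legitimately absent when the calculation dies early, the behaviour I would want is: copy what exists, report what's missing. A plain-Python sketch of such a tolerant fetch, with local paths standing in for the SFTP transfer (this is an illustration, not GC3Pie's actual implementation, which as the DEBUG line shows already logs and ignores the error):

```python
import shutil
from pathlib import Path

def fetch_existing(remote_dir, local_dir, names):
    """Copy each named file that exists from remote_dir to local_dir;
    return the names of the files that were not found."""
    remote_dir, local_dir = Path(remote_dir), Path(local_dir)
    local_dir.mkdir(parents=True, exist_ok=True)
    skipped = []
    for name in names:
        src = remote_dir / name
        if src.exists():
            shutil.copy2(src, local_dir / name)
        else:
            skipped.append(name)
    return skipped
```

With something like this, a job with missing outputs would still deliver its partial results plus a clear list of what is absent, instead of surfacing a TransportError.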
[Can you please post the output of `gcloud list --verbose` after killing
the problem jobs? ]
There is no strange message here, just the same messages as before killing the problem jobs.
[If no instance is running any job, it is safe to delete them all (e.g.,
restart your GC3Pie session-based script). ]
The problem is that the failing jobs are not all on the same instance. Some jobs on a given instance run fine while others are stuck, so if I terminate the instance, I also kill the successful runs. This is not a huge problem, but it is a bit annoying.
I hope this information helps,
Hanna
On Monday, September 24, 2018 at 16:32:57 UTC+2, Riccardo Murri wrote: