Missing task state in gcylc


azure...@gmail.com

Feb 20, 2018, 10:52:24 PM
to cylc
Hi,

I upgraded to rose 2018.02.0 and cylc-7.6.0, and I am testing the new installation to see if it works well.
When I run the sample suite, the task runs and its job.status file records SUCCEEDED, but gcylc still shows the task as submitted.
I suspect a communication problem between the processes.
How can I solve this problem?

Matt Shin

Feb 21, 2018, 4:33:01 AM
to cylc
Hi,

You should try running the suite in --debug mode. Inspect both the suite log "~/cylc-run/SUITE/log/suite/log" and the "job.err" files to see what issues you have with communications.

Matt

Hilary Oliver

Feb 21, 2018, 6:56:58 AM
to cy...@googlegroups.com
Also, were you previously running 7.5.0 or an earlier version?

--

---
You received this message because you are subscribed to the Google Groups "cylc" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cylc+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

azure...@gmail.com

Feb 21, 2018, 7:03:35 PM
to cylc
My sample suite has worked very well with rose-2017.10.0 and cylc-7.5.0.



Hilary Oliver

Feb 21, 2018, 7:31:53 PM
to cy...@googlegroups.com
OK, please do as Matt suggests: run your suite in debug mode, and take a look at the suite log, and in particular the job.err file (it should report what the communication problem is).
Hilary


azure...@gmail.com

Feb 21, 2018, 7:50:52 PM
to cylc
I reran the sample suite with the --debug option.
The following is part of the log.
------------------------------------------------------------------------------------------------------------------
# job.status
CYLC_BATCH_SYS_NAME=pbs
CYLC_BATCH_SYS_JOB_ID=2355319.nuri
CYLC_BATCH_SYS_JOB_SUBMIT_TIME=2018-02-22T00:05:48Z
CYLC_JOB_PID=44672
CYLC_JOB_INIT_TIME=2018-02-22T00:06:34Z
CYLC_JOB_EXIT=SUCCEEDED
CYLC_JOB_EXIT_TIME=2018-02-22T00:07:14Z

# job.err
2018-02-22T00:07:15Z WARNING - Send message: try 1 of 7 failed: Cannot connect: http://ngate6.cm.cluster:43052/put_message?message=succeeded+at+2018-02-22T00%3A07%3A14Z&severity=NORMAL&task_id=glu_start_cold.20150615T0000Z: ('Connection aborted.', error(111, 'Connection refused'))
   retry in 5.0 seconds, timeout is 30.0
....
2018-02-22T00:07:46Z WARNING - Send message: try 7 of 7 failed: Cannot connect: http://ngate6.cm.cluster:43052/put_message?message=succeeded+at+2018-02-22T00%3A07%3A14Z&severity=NORMAL&task_id=glu_start_cold.20150615T0000Z: ('Connection aborted.', error(111, 'Connection refused'))
2018-02-22T00:07:46Z WARNING - MESSAGE SEND FAILED
------------------------------------------------------------------------------------------------------------------
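
For anyone following along, the URL in the failing job.err line can be unpacked to see exactly which host, port and endpoint the job tried to reach. This is a minimal Python 3 sketch; the helper name describe_message_url is made up for illustration and is not part of cylc.

```python
from urllib.parse import urlsplit, parse_qs


def describe_message_url(url):
    """Split a task-message URL into the pieces needed for a connectivity check."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)  # decodes '+' and %XX escapes
    return {
        "host": parts.hostname,
        "port": parts.port,
        "path": parts.path,
        "message": query.get("message", [""])[0],
        "task_id": query.get("task_id", [""])[0],
    }


# URL copied from the job.err excerpt above.
url = ("http://ngate6.cm.cluster:43052/put_message"
       "?message=succeeded+at+2018-02-22T00%3A07%3A14Z"
       "&severity=NORMAL&task_id=glu_start_cold.20150615T0000Z")
info = describe_message_url(url)
for key in ("host", "port", "path", "message", "task_id"):
    print("%-8s %s" % (key, info[key]))
```

The host and port are the values to test from inside a PBS job, since that is where the connection is being refused.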

For your reference:
1) My sample suite works very well with rose-2017.10.0 and cylc-7.5.0, as noted above.
2) I am using [communication] method = http in global.rc.



Hilary Oliver

Feb 22, 2018, 2:37:42 AM
to cy...@googlegroups.com
Are you able to try "method = https" (which is the default)? It is possible that plain http is not as well tested at this point.
Hilary



azure...@gmail.com

Feb 22, 2018, 4:11:14 AM
to cylc
My system does not support https, so I set method = http in global.rc.
If I use https (the default), the following error occurs.

----------------------------------------------------------------------------------

[INFO] kmt-aa535: will run on localhost
[FAIL] cylc run kmt-aa535  # return-code=1, stderr=
[FAIL] 2018-02-20T02:12:33Z INFO - Suite shutting down - ERROR: "No HTTPS support. Configure user's global.rc to use HTTP."
[FAIL] 2018-02-20T02:12:33Z INFO - DONE
----------------------------------------------------------------------------------



Matt Shin

Feb 22, 2018, 4:47:10 AM
to cylc
OK. Let's do this systematically. Start up a suite on hold with 7.5.0 and another suite on hold with 7.6.0 (cylc run --hold). Have a look at their contact files "~/cylc-run/SUITE/.service/contact". Do you notice any difference in the host name?

Use a simple suite that only runs a background job on the localhost. Does communication work?

If it does, try:
* Run a background job on a remote host (in your local network).
* Run a PBS job on the localhost, if possible. Try both shared and compute nodes, if relevant.
* Run a PBS job on a remote host (in your local network), if possible.
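
The contact-file comparison suggested above can be done by eye, but a small script makes differences harder to miss. The Python 3 sketch below assumes the contact file is a list of KEY=VALUE lines; the key names and values in the sample strings are illustrative assumptions, not output captured from a real suite, and the helper names are hypothetical.

```python
def parse_contact(text):
    """Parse KEY=VALUE lines into a dict, skipping blanks and comments."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        entries[key.strip()] = value.strip()
    return entries


def diff_contact(old, new):
    """Return {key: (old_value, new_value)} for keys whose values differ."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


# Hypothetical file contents, for illustration only.
contact_750 = "CYLC_SUITE_HOST=ngate6.cm.cluster\nCYLC_SUITE_PORT=43052\nCYLC_VERSION=7.5.0\n"
contact_760 = "CYLC_SUITE_HOST=ngate6.cm.cluster\nCYLC_SUITE_PORT=43052\nCYLC_VERSION=7.6.0\n"

changes = diff_contact(parse_contact(contact_750), parse_contact(contact_760))
for key, (before, after) in changes.items():
    print("%s: %s -> %s" % (key, before, after))
```

In practice the two strings would be read from the contact files of the 7.5.0 and 7.6.0 suites. Note that the port will normally differ between two running suites, so the interesting differences are in the host entry.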

azure...@gmail.com

Feb 22, 2018, 7:44:42 PM
to cylc
The host names in the contact files of 7.5.0 and 7.6.0 are the same.

7.6.0 on the local host:
1. The background task communicates. (It is displayed as 'succeeded' in gcylc.)
2. The PBS jobs on shared and compute nodes do not communicate.

For your reference, my system does not have a remote host.






Matt Shin

Feb 23, 2018, 5:41:52 AM
to cylc
Well. At least we know that communication is possible, but broken via batch jobs. It sounds obvious, but can you spot any other glaring differences between the contact files? It is also worth ensuring that you are definitely running matching versions of cylc in all your environments.

A 'Connection refused' is caused by either of these:
* No server program is listening on the port. (This should not be the case if the suite is running and contactable via localhost.)
* Connection is blocked by some firewall setting. (But why would it let a cylc-7.5.0 message go through?) Note: The URL has changed between 7.5.0 and 7.6.0 in this way:
  * http://ngate6.cm.cluster:43052/put_message?... (7.6.0)
  * http://ngate6.cm.cluster:43052/message/put?... (7.5.0)
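
One way to tell these cases apart is to attempt a plain TCP connection to the suite host and port from the same place the job runs. The following is a minimal Python 3 sketch using only the standard library; the function name probe and its return strings are made up for illustration.

```python
import errno
import socket


def probe(host, port, timeout=5.0):
    """Classify a TCP connection attempt to host:port."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except socket.gaierror:
        return "unresolvable"   # the host name could not be looked up
    except socket.timeout:
        return "timeout"        # typical of packets being silently dropped
    except ConnectionRefusedError:
        return "refused"        # nothing accepts connections at that address and port
    except OSError as exc:
        return "error: %s" % errno.errorcode.get(exc.errno, exc.errno)
    conn.close()
    return "open"
```

Calling probe("ngate6.cm.cluster", 43052) once from the suite host and once from inside a PBS job (with the port taken from the running suite's contact file) would show whether the two environments see the server differently.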

I am puzzled.

Hilary Oliver

Feb 23, 2018, 7:16:51 AM
to cy...@googlegroups.com
I'm puzzled too! 

One idea: sounds like the jobs are executed on the suite host, not a remote host, so a background job may execute in the same environment as the suite daemon, whereas the PBS job probably starts from a clean environment? (Well, I've forgotten now if the process pool divorces even background jobs from the original environment ...)


Shin, Matthew

Feb 23, 2018, 7:25:19 AM
to cy...@googlegroups.com
These days, background jobs launch using something like "nohup bash -c exec $0 ..." in its own process group, so it is effectively a double fork.

That reminds me, the other change between 7.5.0 and 7.6.0 is that we now have a "#!/bin/bash -l" header instead of sourcing "/etc/profile" and "~/.profile". So this may be the issue as well.


azure...@gmail.com

Mar 12, 2018, 4:36:27 AM
to cylc
I'm sorry.
I have been on a long business trip.

I consulted our system team about this problem.
He gave the following first answer.
---------------------------------------------------------
When the session was opened over HTTP from Requests (urllib3), the URL was not found.
Modify the lib/cylc/network/httpserver.py file as follows:

[Before]
Line 132: cherrypy.config["server.socket_host"] = get_host()

[After]
Line 132: cherrypy.config["server.socket_host"] = '0.0.0.0'
---------------------------------------------------------

After I modified the lib/cylc/network/httpserver.py file as above, the problem was solved.
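
A note on why this change can matter: a server bound to one specific address only accepts connections addressed to that address, while '0.0.0.0' is the IPv4 wildcard and accepts connections on every local interface. If the name returned by get_host() resolves to an address that jobs on the compute nodes do not use to reach the suite host, binding to it would produce exactly a 'Connection refused'. That is an assumption about this system, not something confirmed in the thread. The Python 3 sketch below only demonstrates the wildcard behaviour with plain sockets; the helper names are hypothetical and this is not cylc or cherrypy code.

```python
import socket


def listen_on(bind_host):
    """Open a TCP listener on bind_host with an OS-chosen free port."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((bind_host, 0))
    srv.listen(1)
    return srv


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return True
    except OSError:
        return False


srv = listen_on("0.0.0.0")              # wildcard: every local IPv4 interface
port = srv.getsockname()[1]
reachable_via_loopback = can_connect("127.0.0.1", port)
srv.close()
print(reachable_via_loopback)
```

Binding to the wildcard address also exposes the server on interfaces it was not previously listening on, which is worth keeping in mind on a shared system.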


He gave the following second answer.
---------------------------------------------------------
Do not modify the lib/cylc/network/httpserver.py source code but add the following options to the conf/global.rc file.

[suite host self-identification]
   method = address
---------------------------------------------------------
After I added the option to the conf/global.rc file as above, the problem was also solved.
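
For readers wondering what identifying the suite host by address can look like in practice, a common technique is to open a UDP socket towards some target and ask the operating system which local address it chose. The Python 3 sketch below shows that technique only; it is an assumption that cylc's 'address' method works along similar lines, and the function name local_ip_towards is made up.

```python
import socket


def local_ip_towards(target, port=80):
    """Return the local IPv4 address the OS would use to reach target."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.connect((target, port))   # UDP connect selects a route; no packet is sent
        return sock.getsockname()[0]
    finally:
        sock.close()


# Using the loopback address as the target keeps the example independent of the network.
loopback = local_ip_towards("127.0.0.1")
print(loopback)
```

With a target on the cluster network, the result would be the address of the interface facing that network, which is the kind of value a job can connect to without relying on name resolution.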

However, I hope you will take this into account in future development.

Thank you.

Hilary Oliver

Mar 12, 2018, 2:34:20 PM
to cy...@googlegroups.com
Th


Hilary Oliver

Mar 12, 2018, 2:40:25 PM
to cy...@googlegroups.com
(oops - ignore previous send error!)

OK, good to hear you've solved the problem.  It would be interesting to know why `get_host()` failed (that method is defined in `lib/cylc/hostuserutil.py`).  Does Python's `socket.getfqdn()` return an invalid host name in your environment??
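
A quick way to answer that question is to print what Python reports on the node in question. This is a minimal Python 3 sketch using only the standard socket module; the function name host_identity is made up for illustration.

```python
import socket


def host_identity():
    """Collect the names Python reports for this host and what they resolve to."""
    short = socket.gethostname()
    fqdn = socket.getfqdn()
    try:
        address = socket.gethostbyname(fqdn)
    except OSError as exc:   # includes socket.gaierror when the name does not resolve
        address = "unresolved (%s)" % exc
    return {"hostname": short, "fqdn": fqdn, "address": address}


report = host_identity()
for key in ("hostname", "fqdn", "address"):
    print("%-8s %s" % (key, report[key]))
```

Running it on the suite host and again inside a PBS job shows whether both environments agree on the name and on the address it maps to.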

Hilary

azure...@gmail.com

Mar 12, 2018, 8:52:58 PM
to cylc
Yes. 
The Python code returns a URL of the form "http://nid00412.cm.cluster:43012/....".
This type of URL cannot be used on my system.
So I modified the global.rc file to return the IP address instead.

Here is what the system team told me.
---------------------------------------------------------------
It seems to refer to /etc/hosts when looking up the hostname.

The /etc/hosts file lists IP, HOST, and Alias. In the case of computing nodes, HOST contains the Node ID.

Presumably the cylc web service (socket communication) cannot be reached with the Node ID (nid00xxxx).

In short, the cause is that the web service (e.g. http://nid00412.cm.cluster:43012/...) cannot be used with the Node ID in cylc's cherrypy.
---------------------------------------------------------------


