can't restart suite


Jonny Williams

Oct 10, 2018, 7:03:52 PM
to cylc
hi there

i can't seem to restart the suite illustrated by the attached screen grab of its gcylc window.

it was running fine until just now.

I've tried a warm restart...

rose suite-run -n u-aw889-90  -- --warm 18600401

... but this doesn't boot up the suite and the output of `cylc log u-aw889-90` gives...


w-clim01.maui.niwa.co.nz|Wed Oct 10|23:01:28|u-aw889> cylc log u-aw889-90
2018-10-10T22:52:11Z INFO - Suite starting: server=w-cylc01.maui.niwa.co.nz:43078 pid=12055
2018-10-10T22:52:11Z INFO - Cylc version: 7.7.2
2018-10-10T22:52:11Z INFO - Run mode: live
2018-10-10T22:52:11Z INFO - Initial point: 18500101T0000Z
2018-10-10T22:52:11Z INFO - Start point: 18600401T0000Z
2018-10-10T22:52:11Z INFO - Final point: 19500101T0000Z
2018-10-10T22:52:11Z INFO - Warm Start 18600401T0000Z
2018-10-10T22:52:13Z INFO - [coupled.18600401T0000Z] -submit-num=1, owner@host=login.maui.niwa.co.nz
2018-10-10T22:52:13Z INFO - Suite shutting down - ERROR: unable to open database file
2018-10-10T22:52:13Z INFO - DONE

Any ideas!?

@hilary, if you're reading this, this isn't the same suite that I'd messed around with that we discussed in person!

Thanks a lot :)

Jonny
Capture.PNG

Hilary Oliver

Oct 10, 2018, 7:10:46 PM
to cy...@googlegroups.com
Hmm, the salient point here is "ERROR: unable to open database file" (refers to the suite daemon's sqlite DB) - I've not seen that before.  I am around at the moment so I'll log in for a quick look...
Hilary
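
For readers who have not met this message before: the text "unable to open database file" comes from SQLite itself rather than from Cylc. The sketch below is only an illustration using Python's standard sqlite3 module, not Cylc's own code; the helper name `try_open` and the temporary paths are made up for the example. It reproduces the same message by pointing SQLite at a file whose parent directory does not exist.

```python
import os
import sqlite3
import tempfile


def try_open(db_path):
    """Return None if SQLite can open db_path, else the error message."""
    try:
        conn = sqlite3.connect(db_path)
    except sqlite3.OperationalError as exc:
        return str(exc)
    conn.close()
    return None


with tempfile.TemporaryDirectory() as tmp:
    # "missing" is never created, so SQLite has nowhere to put the file.
    missing_parent = os.path.join(tmp, "missing", "db")
    message = try_open(missing_parent)
    print(message)
```

A missing directory is only one way to trigger the message; it says nothing about what actually went wrong on the system in this thread.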

--

---
You received this message because you are subscribed to the Google Groups "cylc" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cylc+uns...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jonny Williams

Oct 10, 2018, 9:00:49 PM
to cylc
thanks :)

Jonny Williams

Oct 10, 2018, 11:17:59 PM
to cylc
sorted now, thanks hilary!

Matt Shin

Oct 11, 2018, 3:49:01 AM
to cylc
Worth adding a --debug to your start up to give us a bit more error + traceback.

Hilary Oliver

Oct 11, 2018, 6:04:25 AM
to cy...@googlegroups.com

Matt,

This was a strange one.  On the system involved I was not able to run cylc itself through a debugger (for ... reasons), and the traceback from `cylc run --debug` didn't help as the error message already identified a unique code block.  Even deleting the existing suite DBs did not fix the error (and the db file permissions and sizes were identical to those of another suite warm-started to a different run directory, which ran perfectly). Jonny did some run directory renaming, I think, that got the suite running - but why that worked is not clear to me.  As we've never seen this before, at this point - until proven wrong - I'm blaming some obscure filesystem glitch :-)

Hilary
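
When a debugger is not available, the checks described above (permissions on the db file, plus whether it can actually be opened for writing) could be scripted outside Cylc. The following is a minimal sketch under stated assumptions: `diagnose_db` is a hypothetical helper, not part of Cylc, and the path you pass in is whatever suite database you want to inspect. It also checks the parent directory, because in its default journal mode SQLite writes a journal file next to the database and so needs write access to the directory as well as to the file.

```python
import os
import sqlite3
import tempfile


def diagnose_db(db_path):
    """Collect basic facts about a SQLite file and try to take a write lock."""
    parent = os.path.dirname(os.path.abspath(db_path))
    report = {
        "parent_exists": os.path.isdir(parent),
        # The directory must be writable and searchable for journal files.
        "parent_writable": os.access(parent, os.W_OK | os.X_OK),
        "file_exists": os.path.isfile(db_path),
        "file_writable": os.access(db_path, os.W_OK),
        "error": None,
    }
    if not report["file_exists"]:
        # Do not call connect() here: it would create a new empty file.
        report["error"] = "database file does not exist"
        return report
    try:
        conn = sqlite3.connect(db_path)
        try:
            # BEGIN IMMEDIATE takes a write lock without changing any data.
            conn.execute("BEGIN IMMEDIATE")
            conn.rollback()
        finally:
            conn.close()
    except sqlite3.Error as exc:
        report["error"] = str(exc)
    return report


with tempfile.TemporaryDirectory() as tmp:
    # Build a throwaway database so the example is self-contained.
    good = os.path.join(tmp, "db")
    conn = sqlite3.connect(good)
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.commit()
    conn.close()
    good_report = diagnose_db(good)
    missing_report = diagnose_db(os.path.join(tmp, "no-such-dir", "db"))
    print(good_report)
    print(missing_report)
```

Comparing the report for a failing suite with the report for a working one would show whether the difference is visible at this level at all; in the case described here the permissions and sizes were identical, so it may well not be.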



Shin, Matthew

Oct 11, 2018, 10:02:23 AM
to cy...@googlegroups.com
In which case, definitely looks like a local obscure file system glitch. (At least not something we need to worry about - for now.)

---
Dr Matt Shin Expert Scientific Software Engineer
Met Office FitzRoy Road Exeter EX1 3PB United Kingdom
Tel: +44 (0)1392 884790
matthe...@metoffice.gov.uk http://www.metoffice.gov.uk/


Jonny Williams

Nov 4, 2018, 9:06:19 PM
to cylc
just to add some more info on this, i had the same issue last week and bizarrely renaming the cylc-run suite dir and then restarting with the new name via 'cylc restart ...' seemed to fix this after getting numerous file permission errors on the .service/db file.

just now an ensemble job of mine showed the 'stopped with running' attachment in gcylc. as you can see it says stopped with running in the bottom left.

however when i try to restart it using 'rose suite-run [suite] --restart' or 'cylc restart [suite]' the 'stopped with succeeded' attachment results.

the data for each of the ensemble members does seem to have been produced but i don't understand the 'stopped with running' behaviour.

we think that all this may be a result of obscure file permissions issues in our new file system but not sure yet.

thanks

jonny
stopped with succeeded.PNG
stopped with running.PNG

Hilary Oliver

Nov 4, 2018, 9:34:12 PM
to cy...@googlegroups.com
Hi Jonny,

The "stopped with xxx" status makes sense if you hit the database errors again (did you? You should see the db error in the suite log, just before the suite daemon shuts itself down).  If there were jobs running at the time the suite shut down, the suite log will report orphaned jobs, and the GUI will say "stopped with running" in the status bar. If you try a restart a bit later, the suite will poll the orphaned jobs, and if it finds that they finished successfully while the daemon was down, the GUI will then say "stopped with succeeded".  (But, from previous experience with your occasional db error, the suite will not be able to start up again properly because the db still can't be updated ... until you rename its parent directory, or whatever the fix was. It definitely seems to be some kind of FS problem.)

Hilary
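
To check whether the db error did recur before the shutdown, the suite log can be scanned for lines mentioning an error. The sketch below assumes only the line layout visible in the log excerpt earlier in this thread (timestamp, level, " - ", message); the regular expression and the name `find_errors` are illustrative, not anything provided by Cylc, and the sample text is copied from that excerpt.

```python
import re

# timestamp, level, " - ", then the free-text message
LOG_LINE = re.compile(r"^(?P<time>\S+)\s+(?P<level>[A-Z]+)\s+-\s+(?P<text>.*)$")


def find_errors(log_text):
    """Return (time, text) pairs for log lines that mention an error."""
    hits = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue
        # In the excerpt the error is logged at INFO level, so look at
        # the message text as well as the level field.
        if "ERROR" in match["level"] or "ERROR" in match["text"]:
            hits.append((match["time"], match["text"]))
    return hits


sample = """\
2018-10-10T22:52:11Z INFO - Warm Start 18600401T0000Z
2018-10-10T22:52:13Z INFO - [coupled.18600401T0000Z] -submit-num=1, owner@host=login.maui.niwa.co.nz
2018-10-10T22:52:13Z INFO - Suite shutting down - ERROR: unable to open database file
2018-10-10T22:52:13Z INFO - DONE
"""

errors = find_errors(sample)
for when, text in errors:
    print(when, text)
```

On a real suite you would pass in the text printed by `cylc log [suite]` instead of the sample string.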

