Enhancement suggestion: Count fractional failures.

bruce

Jul 3, 2019, 3:51:53 PM
to hfm-net
The "Failed" count currently keeps track of WUs which fail, and it does a good job of that. There are also projects which have fractional failures, such as:
18:35:10:WU02:FS02:0x22:Completed 4200000 out of 5000000 steps (84%)
18:38:03:WU02:FS02:0x22:Completed 4250000 out of 5000000 steps (85%)
18:40:55:WU02:FS02:0x22:Completed 4300000 out of 5000000 steps (86%)
18:43:52:WU02:FS02:0x22:Completed 4350000 out of 5000000 steps (87%)
18:46:44:WU02:FS02:0x22:Completed 4400000 out of 5000000 steps (88%)
18:49:36:WU02:FS02:0x22:Completed 4450000 out of 5000000 steps (89%)
18:50:55:WU02:FS02:0x22:Bad State detected... attempting to resume from last good checkpoint. Is your system overclocked?
18:50:55:WU02:FS02:0x22:Following exception occured: Particle coordinate is nan
18:53:48:WU02:FS02:0x22:Completed 4050000 out of 5000000 steps (81%)
18:56:40:WU02:FS02:0x22:Completed 4100000 out of 5000000 steps (82%)
18:59:32:WU02:FS02:0x22:Completed 4150000 out of 5000000 steps (83%)
19:02:24:WU02:FS02:0x22:Completed 4200000 out of 5000000 steps (84%)
19:05:16:WU02:FS02:0x22:Completed 4250000 out of 5000000 steps (85%)
19:08:08:WU02:FS02:0x22:Completed 4300000 out of 5000000 steps (86%)
19:11:00:WU02:FS02:0x22:Completed 4350000 out of 5000000 steps (87%)
19:13:51:WU02:FS02:0x22:Completed 4400000 out of 5000000 steps (88%)
19:16:43:WU02:FS02:0x22:Completed 4450000 out of 5000000 steps (89%)
19:19:35:WU02:FS02:0x22:Completed 4500000 out of 5000000 steps (90%)

You'll notice that the WU is well on its way to a successful completion even though it had to redo the processing between about step 4000001 and about step 4450000. A WU may be partially retried from the last good checkpoint and may ultimately succeed or fail. (The number of retries is intentionally limited.)

There are several ways you might handle this. I suggest you count each such retry as 0.1 of a failure for this WU, although that will require careful management of the fractions to avoid contaminating the actual failure count, which currently works well (i.e. 0.3 might turn into either 1.0 or 0.0 when the WU finishes). Managing and reporting a separate count might be better.
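
To make the separate-count option concrete, here is a rough sketch in Python. It is not HFM.NET code: the function name, the regular expression, and the choice to key the count by WU and slot are all my own assumptions, based only on the line layout in the log excerpt above. It simply counts "Bad State detected" lines per WU and leaves the existing failure count untouched.

```python
import re
from collections import Counter

# Assumed line layout, taken from the excerpt above:
#   HH:MM:SS:WUnn:FSnn:0xNN:message
LINE_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}):(WU\d+):(FS\d+):(0x[0-9a-fA-F]+):(.*)$"
)

# Marker text copied from the excerpt; other cores may word this differently.
BAD_STATE = "Bad State detected"


def count_partial_failures(lines):
    """Return a Counter mapping (WU, slot) to the number of bad-state retries.

    Hypothetical helper: the count is kept separate from any whole-WU
    failure count, so it cannot contaminate that number.
    """
    retries = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a WU/slot log line in the assumed layout
        _time, wu, slot, _core, message = match.groups()
        if message.startswith(BAD_STATE):
            retries[(wu, slot)] += 1
    return retries


sample = [
    "18:49:36:WU02:FS02:0x22:Completed 4450000 out of 5000000 steps (89%)",
    "18:50:55:WU02:FS02:0x22:Bad State detected... attempting to resume "
    "from last good checkpoint. Is your system overclocked?",
    "18:50:55:WU02:FS02:0x22:Following exception occured: "
    "Particle coordinate is nan",
    "18:53:48:WU02:FS02:0x22:Completed 4050000 out of 5000000 steps (81%)",
]

partial = count_partial_failures(sample)
print(partial[("WU02", "FS02")])
```

Because the retries live in their own counter, nothing has to be rounded up or down when the WU finishes; the whole-WU result and the number of retries can be reported side by side.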