Simple watchdog timer

Kai Backman

Feb 4, 2011, 5:14:47 PM

to golang-nuts

I recently implemented the client part of our lockserver and needed to
handle some tricky cache and lock expiration conditions. Nothing in
time felt like a good fit so I wrote a simple watchdog timer, code
below.

Here are the semantics:

- A freshly created watchdog starts in the idle state and is activated
by a call to Reset.
- Every call to Reset will eventually cause a send on the Timeouts
channel unless a subsequent Reset follows.
- When the watchdog times out it will only send a single bool on the
Timeouts channel
- Once the timeout has occurred but before it has been handled calling
Reset does not affect state.
- Once the timeout has been handled the watchdog returns to the idle state.

Comments and ideas welcome. As I posted I noticed a bug that sometimes
causes timeouts to occur too late, see if you can spot it too. If
there is interest I could clean this up for inclusion in time.

Kai

type Watchdog struct {
	resets   chan int64
	Timeouts chan bool
}

func NewWatchdog() *Watchdog {
	wd := &Watchdog{
		resets:   make(chan int64, 50),
		Timeouts: make(chan bool),
	}
	go wd.loop()
	return wd
}

func (wd *Watchdog) loop() {
	var t0, t1 int64
	var ok bool
idle:
	t0 = <-wd.resets
	t1, ok = <-wd.resets
	for ok {
		t0 = t1
		t1, ok = <-wd.resets
	}

loop:
	time.Sleep(t0)
	t1, ok = <-wd.resets
	if !ok {
		wd.Timeouts <- true
		goto idle
	}
	for ok {
		t0 = t1
		t1, ok = <-wd.resets
	}
	goto loop
}

func (wd *Watchdog) Reset(timeoutNS int64) {
	wd.resets <- timeoutNS
}

--
Kai Backman, programmer
http://tinkercad.com - The unprofessional solid CAD

Kai Backman

Feb 7, 2011, 9:14:11 AM

to golang-nuts

Here is the fixed version for the archives. It uses the new select
style for asynchronous receive and times out correctly with multiple
resets in flight.

Kai

type watchdog struct {
	resets   chan int64
	timeouts chan bool
}

func newWatchdog() *watchdog {
	wd := &watchdog{
		resets:   make(chan int64, 200),
		timeouts: make(chan bool),
	}
	go wd.loop()
	return wd
}

func (wd *watchdog) pump(t0 int64) (t1 int64) {
	t1 = t0
	for {
		select {
		case t := <-wd.resets:
			if t > t0 {
				t1 = t
			}
		default:
			return
		}
	}
	panic("unreachable")
}

func (wd *watchdog) loop() {
	var t0 int64
idle:
	t0 = <-wd.resets
	t0 = wd.pump(t0)

loop:
	time.Sleep(t0 - time.Nanoseconds())
	now := time.Nanoseconds()
	t0 = wd.pump(now)
	if t0 == now {
		wd.timeouts <- true
		goto idle
	}
	goto loop
}

func (wd *watchdog) reset(timeoutNS int64) {
	wd.resets <- timeoutNS + time.Nanoseconds()
}
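
For readers on current Go, the non-blocking drain that pump performs can be shown on its own. This is only a sketch, not the code from the post: drainMax is a hypothetical name, it takes the channel as a parameter instead of a struct field, and it compares each value against the running maximum.

```go
package main

import "fmt"

// drainMax is a hypothetical helper. It empties ch without blocking and
// returns the largest value seen, or cur if nothing larger was queued.
func drainMax(ch <-chan int64, cur int64) int64 {
	for {
		select {
		case v := <-ch:
			if v > cur {
				cur = v
			}
		default:
			// Nothing left to receive right now; do not block.
			return cur
		}
	}
}

func main() {
	resets := make(chan int64, 4)
	resets <- 3
	resets <- 9
	resets <- 5
	fmt.Println(drainMax(resets, 4)) // prints 9
}
```

The default case is what makes the receive non-blocking: select picks it whenever no value is ready on the channel.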

On Sat, Feb 5, 2011 at 12:14 AM, Kai Backman <kai.b...@gmail.com> wrote:
> I recently implemented the client part of our lockserver and needed to
> handle some tricky cache and lock expiration conditions. Nothing in
> time felt like a good fit so I wrote a simple watchdog timer, code
> below.

--

Gustavo Niemeyer

Feb 7, 2011, 9:24:42 AM

to Kai Backman, golang-nuts

Hi Kai,

> I recently implemented the client part of our lockserver and needed to
> handle some tricky cache and lock expiration conditions. Nothing in
> time felt like a good fit so I wrote a simple watchdog timer, code
> below.

Have you played with Timer:

http://golang.org/pkg/time/#Timer

It looks very similar in essence.

--
Gustavo Niemeyer
http://niemeyer.net
http://niemeyer.net/blog
http://niemeyer.net/twitter

Kai Backman

Feb 7, 2011, 3:48:31 PM

to Gustavo Niemeyer, golang-nuts

Hi Gustavo,

On Mon, Feb 7, 2011 at 4:24 PM, Gustavo Niemeyer <gus...@niemeyer.net> wrote:
> Have you played with Timer:
>
> http://golang.org/pkg/time/#Timer

I was aware of Timer. I'm curious, could you give me an example how
you would achieve identical semantics using Timer?

Kai

Gustavo Niemeyer

Feb 7, 2011, 4:22:37 PM

to Kai Backman, golang-nuts

> I was aware of Timer. I'm curious, could you give me an example how
> you would achieve identical semantics using Timer?

I can't because you haven't really described the problem you're trying
to solve, but the primitives you describe seem to be available in Timer.

If you can describe why Watchdog solves your problem when Timer
does not, maybe it'd be easier to figure what's the delta between
them and if there's a way to merge the two.

Kai Backman

Feb 8, 2011, 3:28:06 AM

to Gustavo Niemeyer, golang-nuts

On Monday, February 7, 2011, Gustavo Niemeyer <gus...@niemeyer.net> wrote:
> If you can describe why Watchdog solves your problem when Timer
> does not, maybe it'd be easier to figure what's the delta between
> them and if there's a way to merge the two.

A watchdog has two parts to the API: a reset function and a channel
for timeouts. It contains an internal timer that is decrementing while
the watchdog is active. Each call to reset sets the timer to a new
value and if the timer ever hits 0 we need to send a notification on
the channel. Once the timer has tripped but before the notification has
been acknowledged reset should not cause the notification to be lost.
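
The behaviour described above can be sketched in current Go roughly as follows. This is a minimal sketch under stated assumptions, not the implementation from this thread: the names Watchdog, NewWatchdog, Reset and Timeouts follow the earlier posts, time.Duration stands in for int64 nanoseconds, the most recent Reset is assumed to win, and Resets that arrive after the timer has tripped are assumed to be discarded until the notification is received.

```go
package main

import (
	"fmt"
	"time"
)

// Watchdog is a hypothetical modern rendering of the type from this thread.
type Watchdog struct {
	resets   chan time.Time // absolute deadlines requested via Reset
	Timeouts chan bool      // receives one value per expiry
}

func NewWatchdog() *Watchdog {
	wd := &Watchdog{
		resets:   make(chan time.Time),
		Timeouts: make(chan bool),
	}
	go wd.loop()
	return wd
}

// Reset arms an idle watchdog, or replaces the deadline of an active one.
func (wd *Watchdog) Reset(timeout time.Duration) {
	wd.resets <- time.Now().Add(timeout)
}

func (wd *Watchdog) loop() {
	for {
		// Idle: block until the first Reset arrives.
		deadline := <-wd.resets
		timer := time.NewTimer(time.Until(deadline))

		// Active: every Reset replaces the timer (assumption: latest wins).
		for expired := false; !expired; {
			select {
			case deadline = <-wd.resets:
				timer.Stop()
				timer = time.NewTimer(time.Until(deadline))
			case <-timer.C:
				expired = true
			}
		}

		// Tripped: deliver exactly one notification. Resets that arrive
		// before it is received are drained and ignored.
		for delivered := false; !delivered; {
			select {
			case wd.Timeouts <- true:
				delivered = true
			case <-wd.resets:
			}
		}
	}
}

// demo resets the watchdog twice and reports how long the timeout took.
func demo() time.Duration {
	wd := NewWatchdog()
	start := time.Now()
	wd.Reset(30 * time.Millisecond)
	time.Sleep(10 * time.Millisecond)
	wd.Reset(30 * time.Millisecond) // pushes the deadline to roughly 40ms
	<-wd.Timeouts
	return time.Since(start)
}

func main() {
	fmt.Println("timed out after", demo())
}
```

Keeping all state inside one goroutine means no mutex is needed; the three phases of the loop correspond to the idle, active and tripped states.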

There is some similarity to how watchdog timers work in embedded
systems. I'm using the watchdog to identify when a lockserv client
should disable its cache and enter session jeopardy and later when to
drop locks. Each successful KeepAlive from the master resets the
watchdog to the session lease time given by the master. There is a
more detailed explanation of the problem in the Chubby paper by Mike
Burrows, specifically in the section on client caching.

Gustavo Niemeyer

Feb 8, 2011, 7:18:53 AM

to Kai Backman, golang-nuts

> A watchdog has two parts to the API: a reset function and a channel
> for timeouts. It contains an internal timer that is decrementing while
> the watchdog is active. Each call to reset sets the timer to a new
> value and if the timer ever hits 0 we need to send a notification on
> the channel.

Up to here it's pretty much a matter of, on each Reset, stopping the
prior Timer and then creating a new Timer.

> Once the timer has tripped but before the notification has
> been acknowledged reset should not cause the notification to be lost.

This sounds like the actual distinction. Timer won't work like this. A call
to Reset will cancel the notification delivery if not yet sent.
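
A small sketch of that difference in current Go, assuming the plain stop-and-recreate approach. The resettable type is a hypothetical helper, not part of the time package. The first timer expires while nobody is listening, and the following Reset discards that expiry.

```go
package main

import (
	"fmt"
	"time"
)

// resettable is a hypothetical helper: Reset stops the prior Timer and
// creates a new one.
type resettable struct {
	timer *time.Timer
}

func (r *resettable) Reset(d time.Duration) {
	if r.timer != nil {
		r.timer.Stop()
	}
	r.timer = time.NewTimer(d)
}

// demo lets the first timer expire unobserved, resets, and returns the
// time from start until a notification is finally received.
func demo() time.Duration {
	start := time.Now()
	var r resettable
	r.Reset(10 * time.Millisecond)
	time.Sleep(30 * time.Millisecond) // first timer expires here, unread
	r.Reset(50 * time.Millisecond)    // replaces the timer; that expiry is lost
	<-r.timer.C
	return time.Since(start)
}

func main() {
	fmt.Println("first notification received after", demo())
}
```

The only notification the caller sees belongs to the second timer, at least 80ms after the start. A watchdog with the semantics requested in this thread would have delivered the first expiry instead.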

I get a fuzzy feeling out of the idea that the logic is synchronizing solely
on the clock. Time will pass between the calls to Nanoseconds() and the
use of the values, so it feels like false determinism. I don't know enough
about your context to tell if that's a problem or not, though.

roger peppe

Feb 8, 2011, 7:23:10 AM

to Kai Backman, Gustavo Niemeyer, golang-nuts

it's always interesting to see a different use case.
the main thing that makes this one a little bit more difficult
is the stipulation that notification events should not be sent
when a notification is being handled.

it seems to me that your implementation doesn't quite
do this because sends to the Timeouts channel only
block until the notification handler is ready to *start*
processing the notification, not until it has finished.

for instance, the following code will print
"notification received" twice, not once,
even though Reset is called while the first event
is being processed:

w := NewWatchdog()
go func() {
	for _ = range w.Timeouts {
		fmt.Println("notification received")
		time.Sleep(0.5e9)
	}
}()
w.Reset(0.2e9)
time.Sleep(0.3e9)
w.Reset(0.1e9)
time.Sleep(1e9)

to get around this, you could add a back channel to say
when processing is complete; or you could use
a function call instead of a channel.

func NewWatchdog(timeout func())
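
One way the function-call variant might look in current Go is sketched below. It is an illustration under assumptions, not code from this thread: only the NewWatchdog(timeout func()) signature comes from the suggestion above, time.Duration replaces int64 nanoseconds, and Resets that arrive while the handler is running are assumed to be discarded. The demo mirrors the scenario above with shorter durations.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// Watchdog here is a hypothetical callback-based variant.
type Watchdog struct {
	resets chan time.Time
}

// NewWatchdog starts a watchdog that calls timeout on every expiry.
func NewWatchdog(timeout func()) *Watchdog {
	wd := &Watchdog{resets: make(chan time.Time)}
	go wd.loop(timeout)
	return wd
}

func (wd *Watchdog) Reset(d time.Duration) {
	wd.resets <- time.Now().Add(d)
}

func (wd *Watchdog) loop(timeout func()) {
	for {
		// Idle until the first Reset, then wait for the deadline.
		timer := time.NewTimer(time.Until(<-wd.resets))
		for expired := false; !expired; {
			select {
			case deadline := <-wd.resets:
				timer.Stop()
				timer = time.NewTimer(time.Until(deadline))
			case <-timer.C:
				expired = true
			}
		}
		// Run the handler and discard Resets until it returns.
		done := make(chan struct{})
		go func() {
			defer close(done)
			timeout()
		}()
		for handling := true; handling; {
			select {
			case <-done:
				handling = false
			case <-wd.resets:
			}
		}
	}
}

// demo returns how many times the handler ran.
func demo() int32 {
	var calls int32
	wd := NewWatchdog(func() {
		atomic.AddInt32(&calls, 1)
		time.Sleep(200 * time.Millisecond) // slow handler
	})
	wd.Reset(20 * time.Millisecond)
	time.Sleep(60 * time.Millisecond)
	wd.Reset(10 * time.Millisecond) // arrives while the handler is running
	time.Sleep(400 * time.Millisecond)
	return atomic.LoadInt32(&calls)
}

func main() {
	fmt.Println("handler calls:", demo())
}
```

With these timings the second Reset lands while the handler is still sleeping, so the demo is expected to report a single handler call.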

i don't really know if Watchdog belongs in the time package.
it seems a little specialist to me.

here's a version that uses the existing time functionality.
it's a little shorter, but not as pretty as your version.

type Watchdog struct {
	mu       sync.Mutex
	handling bool
	deadline int64
	Timeouts chan bool
	timer    *time.Timer
}

func NewWatchdog() *Watchdog {
	return &Watchdog{Timeouts: make(chan bool)}
}

func (w *Watchdog) Reset(t int64) {
	t += time.Nanoseconds()
	w.mu.Lock()
	if t <= w.deadline {
		w.mu.Unlock() // release the lock before the early return
		return
	}
	if w.timer != nil {
		w.timer.Stop()
	}
	w.timer = time.AfterFunc(t-time.Nanoseconds(), func() {
		// If previous timeout is still being handled, then
		// ignore this timeout.
		w.mu.Lock()
		if w.handling {
			w.mu.Unlock()
			return
		}
		w.handling = true
		w.mu.Unlock()

		w.Timeouts <- true

		w.mu.Lock()
		w.handling = false
		w.mu.Unlock()
	})
	w.deadline = t
	w.mu.Unlock()
}

roger peppe

Feb 8, 2011, 7:29:14 AM

to Kai Backman, golang-nuts

nice code, BTW, although one has to wonder with any
instance of make(chan int64, 200): "why not 201?"

On 7 February 2011 14:14, Kai Backman <kai.b...@gmail.com> wrote:
> if t > t0 {

if t > t1?

> time.Sleep(t0 - time.Nanoseconds())

i find it interesting that any code dealing with time
in a non-trivial way ends up using absolute time
(using relative time being one source of bugs in
your original version).

that's true of the time package functions internally too.

every time one switches back and forth between
relative and absolute time, there's a risk of some slippage.

i wonder if it wouldn't be useful to have
absolute time versions of the functions in the
time package (time.SleepAbs, time.AfterAbs etc)
to help with this.
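
In current Go such helpers could be approximated in user code along these lines. The names sleepUntil and afterAbs are hypothetical stand-ins for the proposed time.SleepAbs and time.AfterAbs, and the sketch assumes time.Time deadlines instead of int64 nanoseconds.

```go
package main

import (
	"fmt"
	"time"
)

// sleepUntil is a hypothetical stand-in for the proposed time.SleepAbs.
// It blocks until the absolute deadline has passed.
func sleepUntil(deadline time.Time) {
	// A zero or negative duration makes Sleep return immediately.
	time.Sleep(time.Until(deadline))
}

// afterAbs is a hypothetical stand-in for the proposed time.AfterAbs.
func afterAbs(deadline time.Time) <-chan time.Time {
	return time.After(time.Until(deadline))
}

// demo reports whether both helpers returned no earlier than their deadline.
func demo() bool {
	deadline := time.Now().Add(20 * time.Millisecond)
	sleepUntil(deadline)
	sleptLongEnough := !time.Now().Before(deadline)

	deadline = time.Now().Add(20 * time.Millisecond)
	<-afterAbs(deadline)
	return sleptLongEnough && !time.Now().Before(deadline)
}

func main() {
	fmt.Println("woke up at or after both deadlines:", demo())
}
```

The conversion to a relative delay still happens, but only once and in one place, so callers can carry absolute deadlines throughout.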

Gustavo Niemeyer

Feb 8, 2011, 7:39:24 AM

to roger peppe, Kai Backman, golang-nuts

> every time one switches back and forth between
> relative and absolute time, there's a risk of some slippage.

s/some/more/. It is there either way.

> i wonder if it wouldn't be useful to have
> absolute time versions of the functions in the
> time package (time.SleepAbs, time.AfterAbs etc)
> to help with this.

Why? It's not precise no matter what.

roger peppe

Feb 8, 2011, 8:17:29 AM

to Gustavo Niemeyer, Kai Backman, golang-nuts

On 8 February 2011 12:39, Gustavo Niemeyer <gus...@niemeyer.net> wrote:
>> every time one switches back and forth between
>> relative and absolute time, there's a risk of some slippage.
>
> s/some/more/. It is there either way.

using absolute times means that we can get closer
to the theoretical optimum for the system.

on my system, each time you do:

dt := t - time.Nanoseconds()
t = time.Nanoseconds() + dt

t loses between 1.2 and 5 microseconds.

when Kai's code runs, that transition happens 2.5 times,
so that's *at least* an unnecessary 3 microseconds of inaccuracy
with every call.
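
A rough way to repeat this kind of measurement in current Go is sketched below. The helper names roundTrip and averageSlip are hypothetical, and the numbers depend entirely on the machine and clock resolution, so no particular result is implied.

```go
package main

import (
	"fmt"
	"time"
)

// roundTrip converts an absolute time to a relative delay and back again,
// and returns how far the result moved from the original value.
func roundTrip(t time.Time) time.Duration {
	dt := time.Until(t)      // absolute -> relative
	t2 := time.Now().Add(dt) // relative -> absolute
	return t2.Sub(t)
}

// averageSlip repeats the round trip n times and averages the movement.
func averageSlip(n int) time.Duration {
	t := time.Now().Add(time.Second)
	var total time.Duration
	for i := 0; i < n; i++ {
		total += roundTrip(t)
	}
	return total / time.Duration(n)
}

func main() {
	fmt.Println("average slip per round trip:", averageSlip(1000))
}
```

Each round trip moves t later by the time that elapses between the two clock reads, which is why the slip accumulates with every conversion.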

>> i wonder if it wouldn't be useful to have
>> absolute time versions of the functions in the
>> time package (time.SleepAbs, time.AfterAbs etc)
>> to help with this.
>
> Why? It's not precise no matter what.

mostly as a matter of convenience.

i'd prefer time.SleepAbs(t) to time.Sleep(t - time.Nanoseconds())

it's less error-prone if you can work in absolute times throughout.

Gustavo Niemeyer

Feb 8, 2011, 8:33:13 AM

to roger peppe, Kai Backman, golang-nuts

> when Kai's code runs, that transition happens 2.5 times,
> so that's *at least* an unnecessary 3 microseconds of inaccuracy
> with every call.

You wanna save 3us on a call to sleep. That's as close to early
optimization as I've seen.

roger peppe

Feb 8, 2011, 8:56:22 AM

to Gustavo Niemeyer, Kai Backman, golang-nuts

On 8 February 2011 13:33, Gustavo Niemeyer <gus...@niemeyer.net> wrote:
>> when Kai's code runs, that transition happens 2.5 times,
>> so that's *at least* an unnecessary 3 microseconds of inaccuracy
>> with every call.
>
> You wanna save 3us on a call to sleep. That's as close to early
> optimization as I've seen.

to look at it another way, a call to syscall.Sleep takes about 6us on my system.
in that context 3µs is not negligible.

nonetheless, i still think the strongest argument is the second one -
it's less error prone to work in absolute time throughout. it simplifies
the code.

Gustavo Niemeyer

Feb 8, 2011, 9:28:11 AM

to roger peppe, Kai Backman, golang-nuts

> to look at it another way, a call to syscall.Sleep takes about 6us on my system.
> in that context 3µs is not negligible.

If *sleeping* for 3us more when a goroutine wakes up is not
negligible, Go is not suitable.

> nonetheless, i still think the strongest argument is the second one -
> it's less error prone to work in absolute time throughout. it simplifies
> the code.

Sure, that's a different argument. Pretty much every sleep I see is a
delta, but you can argue about different use cases of course.

Kai Backman

Feb 8, 2011, 3:11:19 PM

to Gustavo Niemeyer, golang-nuts

On Tue, Feb 8, 2011 at 2:18 PM, Gustavo Niemeyer <gus...@niemeyer.net> wrote:
> I get a fuzzy feeling out of the idea that the logic is synchronizing solely
> on the clock. Time will pass between the calls to Nanoseconds() and the
> use of the values, so it feels like false determinism. I don't know enough
> about your context to tell if that's a problem or not, though.

In this particular context depending on the clock is acceptable. We do
not require different servers to have synchronized clocks, we only
require them to advance time at roughly the same rate. For the
jeopardy timeout both the server and client treat it conservatively,
clocks could be advancing with a speed difference of almost 5% and
things would still work out correctly.

On Tue, Feb 8, 2011 at 2:23 PM, roger peppe <rogp...@gmail.com> wrote:
> is the stipulation that notification events should not be sent
> when a notification is being handled.
> it seems to me that your implementation doesn't quite
> do this

This is correct. I also happen to know that new reset calls won't
happen while the notification is processed, which makes the
implementation slightly simpler but also more domain dependent.

> i don't really know if Watchdog belongs in the time package.
> it seems a little specialist to me.

I've come to the same conclusion as we are having this discussion.
There are clearly needs for more complicated timing constructs but as
we are dissecting this one example it is obvious how domain specific
it is. I'm assuming that other solutions will be equally domain
specific, but in a different way.

> if t > t1?

fixed, thanks.

> i find it interesting that any code dealing with time
> in a non-trivial way ends up using absolute time

I agree. I've learnt this several times over the years but still the
original delta time version was somehow temptingly "easier". And
buggy.

Thanks for the feedback both of you.
