How to disable DANGER?


FMoreira

Aug 29, 2014, 10:14:46 AM
to gd...@googlegroups.com
Hi,

I would like to have only 2 states: UP and DOWN.

Is it possible to bypass DANGER?

Thanks!

FM

Brandon Black

Aug 29, 2014, 2:59:26 PM
to gdnsd on behalf of FMoreira
On Fri, Aug 29, 2014 at 9:14 AM, FMoreira via gdnsd <gdnsd+noreply-APn2wQd6lEZBmUT6n9...@googlegroups.com> wrote:
I would like to have only 2 states: UP and DOWN.
Is it possible to bypass DANGER?

I'm not really sure what you mean (there could be several ways in which you mean that).  What result are you actually trying to achieve in your configuration?

The DANGER state is actually gone in the 2.x code (unreleased), but only the label is gone and the functionality remains basically the same.  DANGER in gdnsd 1.x is basically a special-case of UP where one or more isolated failures have happened, but not enough to cross the threshold into the DOWN state.  The 2.x code doesn't expose a separate DANGER state, but still goes through the same thresholding process.  The original 3 states UP/DANGER/DOWN probably would've been more clear if they had been renamed PERFECT/UP/DOWN instead :)

So, do you want a translation of the label, or do you want your resources to become DOWN on a single monitoring failure, or some other sort of scenario?

FMoreira

Aug 29, 2014, 8:05:51 PM
to gd...@googlegroups.com



I would like the resources to become DOWN/UP on a single monitoring failure/success.


Historically, gdnsd decides resource status by monitoring over a single route, which is fine for a LAN but not that great for a WAN: it is common for a resource to look perfect from the local monitoring point of view while actually being unreachable from some locations.

Using EXTMON you are empowered to build an application that uses heuristics / several rules to decide if a resource should be UP or DOWN.

Scenarios:

1. Using multiple external monitoring stations. Checking the resource's availability over several routes can give enough confidence to decide that a resource should be made active or inactive immediately.

2. Resource usage monitoring or server management may decide that new requests must be processed by another server (e.g. DoS, high load, high latency, packet loss, emergency maintenance, etc.). This can already be done elegantly using "http_status", by returning a status code that does or does not match "ok_codes" to signal UP or DOWN, but DANGER precludes immediate action.

3. LAN. Failover of master->slave database servers and controlled failback slave->master.

4. CDNs. Static objects. If you have redundancy, capacity, network proximity, etc., resolving to healthy resources instead of the "dangerous" ones.
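
As a rough illustration of scenario 1, here is a minimal Python sketch of how results from several monitoring stations could be folded into a single UP/DOWN answer. Everything in it is an assumption made for illustration: the station names, the `decide` and `exit_code_for` helpers, and the majority rule are hypothetical, and the exit code mapping assumes the usual convention that exit status 0 means success.

```python
def decide(results, min_up_fraction=0.5):
    """Fold per-station check results into one state.

    results maps a (hypothetical) station name to True/False.
    Returns "UP" only if strictly more than min_up_fraction of
    the stations report success.
    """
    if not results:
        # No data at all: fail safe. This is a policy choice of the
        # sketch, not something gdnsd prescribes.
        return "DOWN"
    up = sum(1 for ok in results.values() if ok)
    return "UP" if up / len(results) > min_up_fraction else "DOWN"


def exit_code_for(state):
    """Map the state to a process exit code (0 = success), as a
    wrapper script around this logic might do."""
    return 0 if state == "UP" else 1


# Hypothetical results gathered from three vantage points.
sample = {"us-east": True, "eu-west": False, "ap-south": True}
state = decide(sample)
print(state, exit_code_for(state))
```

A stricter policy (for example, DOWN as soon as any station fails) only needs a different rule inside `decide`; the point is that the decision lives outside gdnsd.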

Thank you for your attention and time. You have developed great software!

Best Regards,

FM

On Friday, August 29, 2014 3:59:26 PM UTC-3, blblack wrote:

Brandon Black

Aug 29, 2014, 8:33:04 PM
to gdnsd on behalf of FMoreira
On Fri, Aug 29, 2014 at 7:05 PM, FMoreira via gdnsd <gdnsd+noreply-APn2wQd6lEZBmUT6n9...@googlegroups.com> wrote:
I would like the resources to become DOWN/UP on a single monitoring failure/success.

Ok, gotcha...
 
Historically, gdnsd decides resource status by monitoring over a single route, which is fine for a LAN but not that great for a WAN: it is common for a resource to look perfect from the local monitoring point of view while actually being unreachable from some locations.

Yeah.  You can mitigate this somewhat in many common scenarios by at least approximately co-locating DNS servers with service servers.  e.g. run 3x datacenters on 3 continents and host an authdns server at each.  The rough idea there is that in the case of network splits, the DNS servers will have the same view of the world as the clients that can reach them and the local servers.  This obviously isn't ideal and breaks down if you have large numbers of datacenters in play.
 
Using EXTMON you are empowered to build an application that uses heuristics / several rules to decide if a resource should be UP or DOWN. [...]

I totally agree.  Beyond a certain point of complexity, it makes more sense to externalize the monitoring and just have gdnsd grabbing state results from elsewhere.  The built-in monitors for e.g. http are mostly for simple cases, and extmon is for the complex cases.  Unfortunately the software isn't quite there yet on the gdnsd end...

With the way the internals work for the latest stable release, version 1.11.4 (and earlier), using standard monitoring plugins like extmon, there isn't any way to do it perfectly.  When starting from the UP state, the state machine code always moves through the DANGER state before the DOWN state.  It is possible to make the DOWN->UP transition in a single step, though.  To get as close as possible, you want to define the service_type with the parameters up_thresh=1, ok_thresh=1, down_thresh=2.  With that configuration, it will take 2 monitor-fails to get from UP -> DANGER -> DOWN, and 1 monitor success to get from DOWN->UP or DANGER->UP.  In reviewing the code just now, I thought about what would happen if down_thresh=1, but even though that's legally allowed in the configuration, apparently it would not work correctly.  That's arguably a bug that should be fixed by restricting down_thresh to values of 2 or higher (anything else would change/break behavior others might be relying on).  It's also possible to write a custom plugin for 1.11.4 that acts directly and uses not-very-documented/stable interfaces, but I think that would be a waste of effort at this point in time.
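
To make the threshold behavior above concrete, here is a toy Python model of the transitions as described in this message. It is a simplified sketch, not gdnsd's actual state machine code: the `Monitor` class and its counters are hypothetical, and it does not attempt to reproduce the broken down_thresh=1 behavior.

```python
class Monitor:
    """Toy model of the UP/DANGER/DOWN thresholds described above."""

    def __init__(self, up_thresh=1, ok_thresh=1, down_thresh=2):
        self.up_thresh = up_thresh      # successes needed for DOWN -> UP
        self.ok_thresh = ok_thresh      # successes needed for DANGER -> UP
        self.down_thresh = down_thresh  # failures needed to reach DOWN
        self.state = "UP"
        self.fails = 0  # consecutive failures seen so far
        self.oks = 0    # consecutive successes seen so far

    def report(self, ok):
        """Feed one monitoring result and return the new state."""
        if ok:
            self.oks += 1
            self.fails = 0
        else:
            self.fails += 1
            self.oks = 0

        if self.state == "UP" and not ok:
            # Starting from UP, a failure always passes through DANGER.
            self.state = "DANGER"
        elif self.state == "DANGER":
            if self.fails >= self.down_thresh:
                self.state = "DOWN"
            elif self.oks >= self.ok_thresh:
                self.state = "UP"
        elif self.state == "DOWN" and self.oks >= self.up_thresh:
            self.state = "UP"
        return self.state


m = Monitor(up_thresh=1, ok_thresh=1, down_thresh=2)
trace = [m.report(ok) for ok in (False, False, True, False, True)]
print(trace)  # ['DANGER', 'DOWN', 'UP', 'DANGER', 'UP']
```

With these settings the model needs two failures to go UP -> DANGER -> DOWN and a single success to return to UP from either DOWN or DANGER, matching the description above.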

The 2.x code (still not released just yet...) has a lot of differences in these areas, though.  The new core monitoring code does explicitly allow for monitoring plugins to provide direct UP/DOWN state at any time, without going through the anti-flap state machine thresholds.  There's also a new plugin called "extfile" in 2.x that can pull results from a textfile generated by outside tools, and it was given a parameter "direct = true" to allow for immediate UP/DOWN, which would accomplish what you need here.  The 2.x version of extmon doesn't have that currently, but it would be easy to add "direct = true" to that as well, and I'll do that.

So, to recap:
1) Best you can do right now on 1.x: up_thresh=1,ok_thresh=1,down_thresh=2 - but it will take 2x monitor fails to go DOWN.
2) I need to do a 1.x bugfix to disallow configuring down_thresh=1, as that causes broken behavior (it would never switch to the DOWN state at all).
3) I need to add "direct = true" to the 2.x version of extmon, and hurry up and release 2.x.

Thanks,
-- Brandon

FMoreira

Aug 29, 2014, 9:40:42 PM
to gd...@googlegroups.com


Yeah.  You can mitigate this somewhat in many common scenarios by at least approximately co-locating DNS servers with service servers.  e.g. run 3x datacenters on 3 continents and host an authdns server at each.  The rough idea there is that in the case of network splits, the DNS servers will have the same view of the world as the clients that can reach them and the local servers.  This obviously isn't ideal and breaks down if you have large numbers of datacenters in play.

Currently I'm running gdnsd in 20 locations on 5 continents (America, Europe, Asia, Africa, Oceania). All 20 are authdns, but I have global, regional and local maps. I'm using the "same view of the world" concept, and it works fine with IPv4, but IPv6 is another animal, with lots of routing issues, and the best I can do is move the traffic to a "healthy" server if just one check fails.

 
The 2.x code (still not released just yet...) has a lot of differences in these areas, though.  The new core monitoring code does explicitly allow for monitoring plugins to provide direct UP/DOWN state at any time, without going through the anti-flap state machine thresholds.  There's also a new plugin called "extfile" in 2.x that can pull results from a textfile generated by outside tools, and it was given a parameter "direct = true" to allow for immediate UP/DOWN, which would accomplish what you need here.  The 2.x version of extmon doesn't have that currently, but it would be easy to add "direct = true" to that as well, and I'll do that.

Fantastic! 


1) Best you can do right now on 1.x: up_thresh=1,ok_thresh=1,down_thresh=2 - but it will take 2x monitor fails to go DOWN.
2) I need to do a 1.x bugfix to disallow configuring down_thresh=1, as that causes broken behavior (it would never switch to the DOWN state at all).

Trying to reduce the time spent in the DANGER state, I used a threshold of 1 for everything and ... the thing got stuck in DANGER. I saw it as flapping. :-)

Surely I can live with down_thresh=2  :-)

Thank you very much!

