I think that CAS does an API ping to check availability. At least that's
what I saw in the code when I took a peek a couple of months ago. We're
also on DUO1, and that ping was succeeding during the DUO1 outages. In fact,
it succeeded immediately. I even pointed this out to Duo, and they
directed me to the contingency plan document.
The issue during those outages was that all of the requests were queuing
up, and so the connection was alive, but nothing useful was happening.
I'm not sure which timeouts are involved, but they are quite long.
During the first outage, it was failing for us at the Duo widget display
screen. During the second outage, it was failing after credentials, but
before the widget screen. I haven't looked, but I'm guessing that CAS is
doing some sort of pre-auth or other API check after login and before
widget display, to decide whether Duo should be invoked. Given that the page
rendered, but the widget didn't, it looks like the failure from the
first outage (at least for us) was a timeout on the browser side, which
isn't something that CAS will be able to detect. Being able to time out
quickly on the CAS side would have helped with the second outage. I don't
think we're configured to fail open, but I also don't think that would
have helped.
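
To illustrate what I mean by timing out quickly on the server side, here
is a rough sketch in Python (not CAS code). It assumes the availability
check can be wrapped in a callable; probe_with_deadline, fast_probe and
queued_probe are names I made up for the example. The point is that a
check that is still queued when the deadline passes gets treated as a
failure, instead of waiting on a connection that is alive but idle.

```python
import threading
import time


def probe_with_deadline(probe, deadline_s):
    """Return True only if probe() returns a truthy value within deadline_s.

    A probe that is still running at the deadline counts as a failure,
    even though its underlying connection may technically be alive.
    """
    result = {}

    def run():
        try:
            result["ok"] = bool(probe())
        except Exception:
            # Any error from the probe is treated as "service unavailable".
            result["ok"] = False

    # Daemon thread so a hung probe cannot keep the process alive.
    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    worker.join(deadline_s)
    if worker.is_alive():
        return False  # deadline passed; the request is presumably queued
    return result.get("ok", False)


def fast_probe():
    # Hypothetical stand-in for a healthy API check.
    return True


def queued_probe():
    # Hypothetical stand-in for a request stuck in a queue during an outage.
    time.sleep(0.5)
    return True
```

This would only cover the server-side check. It does nothing for the
first outage, where the timeout was in the browser.
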
How are you deciding when to invoke Duo? We're doing an AD group check, so
our contingency was/is to change the group in the config file to a name
that doesn't match any existing group. Then the local check fails and it
doesn't even try Duo. That requires human intervention, but a quick touch
of the config file after the change causes it to reload immediately and
pick up the changes. We do something similar with Shibboleth IdP for
InCommon.
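
Roughly, the group-check contingency works like the sketch below. This
is Python for illustration only, not our actual CAS config. The
single-line file format, the ReloadingGroupConfig class and the
should_invoke_duo function are all assumptions I made up for the
example. The group name is re-read whenever the file's modification
time changes, which is why a touch is enough to trigger the reload.

```python
import os


class ReloadingGroupConfig:
    """Holds the group name that gates Duo; re-reads the file when it changes."""

    def __init__(self, path):
        self.path = path
        self._mtime_ns = None
        self._group = None

    def duo_group(self):
        # Compare modification times so a touch (or an edit) forces a reload.
        mtime_ns = os.stat(self.path).st_mtime_ns
        if mtime_ns != self._mtime_ns:
            with open(self.path, encoding="utf-8") as fh:
                # Assumed format: the file contains just the group name.
                self._group = fh.read().strip()
            self._mtime_ns = mtime_ns
        return self._group


def should_invoke_duo(user_groups, config):
    # Duo is attempted only if the user is in the configured group.
    # Pointing the config at a group nobody belongs to skips Duo for everyone.
    return config.duo_group() in set(user_groups)
```

With the file set to a real group, members of that group get Duo.
Change it to a name that matches nothing, and the check fails locally
for everyone, so Duo is never contacted.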