[Shib-Users] DNS Caching Duration?

Chris Peters

Jun 16, 2010, 3:23:34 PM

to shibbole...@internet2.edu

Shibboleth Experts,

We are attempting to upgrade from Shib 1.3 to Shib 2.1.5. Our approach is this: create a new Shib 2.x server that is in all ways the same as the Shib 1 server (has the same entityID, certs, responds to the same endpoints). Then, using DNS and a short TTL, we will update the CNAME that points to our Shib1 server so that it points to our Shib2 server. We expect sessions in progress to be disrupted, but all in all not a horrible amount of downtime. We'll pick a relatively low-impact timeframe and voila!

However, last night we attempted to cut over and we ran into a peculiar problem: the SPs seemed to pick up the DNS change for the SSO request, but when they switched to an AttributeQuery they went back to the old IP. Is there some sort of IP address caching happening within the SP software here? If so, what is the default duration of that caching? And how can we undermine it? Is this not a Shibboleth-related issue, maybe just something we need to work out within DNS?

Thanks for your input!

Chris

Jim Fox

Jun 16, 2010, 3:42:11 PM

to shibbole...@internet2.edu

You might automatically push attributes to all the SPs. That way you
never get those attribute-query callbacks and you don't have to worry
what the SP has cached. (well, almost never.)

Jim

On Wed, 16 Jun 2010, Chris Peters wrote:

> Date: Wed, 16 Jun 2010 12:23:34 -0700
> From: Chris Peters <cjpe...@uci.edu>
> To: "shibbole...@internet2.edu" <shibbole...@internet2.edu>
> Reply-To: "shibbole...@internet2.edu" <shibbole...@internet2.edu>
> Subject: [Shib-Users] DNS Caching Duration?

Scott Cantor

Jun 16, 2010, 3:53:07 PM

to shibbole...@internet2.edu

> We are attempting to upgrade from Shib 1.3 to Shib 2.1.5. Our methodology
> is thus, create a new Shib 2.x server that is in all ways the same as the
> Shib 1 server (has the same entityID, certs, responds to the same
> endpoints). Then using DNS and a short TTL we will update the CNAME
> pointing to our Shib1 server and redirect it to our Shib2 server. We expect
> to see sessions in progress disrupted, but all in all not a horrible amount
> of downtime. We'll pick a relatively low impact timeframe and voila!

And I would suggest you reconsider, and reverse that. Make your 1.3 IdP look
like a 2.x IdP, move it to the new box and then change DNS. Then you can
swap in 2.x for 1.x or go back, any time you want, and there's zero
disruption.

For one thing, it is very complex to make 2.x look like 1.x and very easy to
do the reverse.

> However, last night we attempted to cut over and we ran in to a peculiar
> problem: the SPs seemed to pick up the DNS change for the SSO request, but
> when they switched to an AttributeQuery they went back to the old IP.

SPs don't talk to the IdP to make a SSO request, that's the client. The
queries are direct communication, and they do heavy HTTP connection caching,
as well as inheriting whatever DNS processing libcurl does on the platform.
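
To illustrate the connection-caching point in general terms (this is not the SP's code, and it uses Python's standard library rather than libcurl): a client that keeps an HTTP connection open keeps talking to the address it originally connected to, so a later DNS change is not seen until the connection is closed and reopened. The sketch below starts a throwaway local server, which is purely an assumption for the demo, and shows two requests going out over the same socket.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class _KeepAliveHandler(BaseHTTPRequestHandler):
    # HTTP/1.1 plus an explicit Content-Length lets the connection stay open.
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Silence per-request logging for the demo.
        pass


def local_ports_for_two_requests():
    """Send two requests on one HTTPConnection; return the client port used for each."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), _KeepAliveHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
    try:
        ports = []
        for _ in range(2):
            conn.request("GET", "/")
            conn.getresponse().read()
            # The socket was opened (and the name resolved) once, at first use.
            ports.append(conn.sock.getsockname()[1])
        return ports
    finally:
        conn.close()
        server.shutdown()
        server.server_close()
```

If both ports are equal, the second request reused the first socket, and no new name lookup could have taken place in between.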

> Is there some sort of IP address caching happening within the SP software
> here?

Every DNS client caches resolutions.

> If so, what is the default duration on that caching? And how can we
> undermine it? Is this not a Shibboleth related issue, maybe just something
> we need to work out within DNS?

I think your approach is the problem here. If you have to even think about
the SPs, you're making the upgrade drastically more complex than it needs to
be.

But the simplest way around it is as Jim suggested, switch to push, at least
during the migration.

-- Scott

Rod Widdowson

Jun 17, 2010, 4:40:19 AM

to shibbole...@internet2.edu

Scott said:

> For one thing, it is very complex to make 2.x look like 1.x and very
> easy to do the reverse.

Last time I looked it was impossible (and I looked quite hard). In case it
helps, I did an upgrade some time ago and wrote the details up in a
step-by-step guide:
http://www.ukfederation.org.uk/content/Documents/RollingIdPUpgrade. However,
Scott's document https://spaces.internet2.edu/display/SHIB2/IdPUpgrades has
a better overview.

The key thing about doing it like this is that if things go wrong you can
fall back from Shib2 to Shib1. Most other methods that people espouse
require you to persuade the federation to revert the metadata and then
persuade SPs to upgrade. Some sites can bear this sort of disruption, some
sites cannot.

> However, last night we attempted to cut over and we ran in to a
> peculiar problem: the SPs seemed to pick up the DNS change for the SSO
> request, but when they switched to an AttributeQuery they went back to
> the old IP. Is there some sort of IP address caching happening within
> the SP software here? If so, what is the default duration on that
> caching? And how can we undermine it? Is this not a Shibboleth related
> issue, maybe just something we need to work out within DNS?

This can happen if there is a WAYF (not a DS) in the way. The SP doesn't
update its metadata, but sends the request to the WAYF, which has updated.
The WAYF sends the request to the end data point and thence to the SP. The
SP then approaches the IdP at the old end point (because it hasn't
upgraded). Thanks go to Andy Swiffen for pointing this one out (after he
was bitten by some SPs which took weeks to upgrade).

Rod

a...@ucop.edu

Jul 22, 2010, 2:16:15 PM

to shibbole...@internet2.edu

We have a similar problem. Only this time, this is not about the upgrade. We
have an IDP server of which we have clones - as backup. So, if we were to do
maintenance on one, we would switch the DNS entry to the backup box. Now, when
we tried this last week, we noticed the same problem that Chris had.

Some SPs failed - their services were not available. The error that they had
in the logs was:

2010-07-19 15:15:34 ERROR Shibboleth.AttributeResolver.Query [60]: exception
during SAML query to https://shibidp.ucop.edu:8443/shibboleth-idp/AA:
CURLSOAPTransport failed while contacting SOAP endpoint
(https://shibidp.ucop.edu:8443/shibboleth-idp/AA): couldn't connect to host
2010-07-19 15:15:34 ERROR Shibboleth.AttributeResolver.Query [60]: unable to
obtain a SAML response from attribute authority

So, because I was aware of what Chris went through, I asked the SP admins to
restart their service provider and it started to work.

---

In this case, there is no upgrade and there is no making 1.x look like 2.x
etc. There is a simple failure on the SP server to connect to the new IDP host
on port 8443 when we move the "shibidp.ucop.edu" alias from one server to
another. So, the Shib SP does load either the "libcurl" DNS setting or some
other DNS timeout parameter which caches the DNS entry for the IDP and which
can only be refreshed after a restart or by reducing a timeout value
somewhere. And no, there is no firewall problem here.

Not all SPs have this problem. Some of the SP servers have some timeout value
where after about 10-15 minutes, the SP server refreshes its cache or looks up
the DNS entry again and resolves correctly. Some even work instantly as well.
That suggests the problem is not in the SP code itself. It has to be something
external to the Shib service provider, but definitely something that the SP
uses to cache the DNS entry. Note that as soon as the SP daemon is restarted, it
starts to work fine, which is why it leads me to believe there has to be some
dependency here.

Does anyone know what the SP uses to resolve the DNS, and if it has a DNS
cache timeout, where it is set and how to change it on the server? We are
running this on SUSE Linux - any help is greatly appreciated - perhaps it's
libcurl - any idea where we would change that? Perhaps it's the "Apache" server
tied to the SP? Any timeout parameter you can think of there?

Scott Cantor

Jul 22, 2010, 2:32:10 PM

to shibbole...@internet2.edu

> Does anyone know what the SP uses to resolve the DNS, and if it has a DNS
> cache timeout, where it is set and how to change it on the server? We are
> running this on SUSE Linux - any help is greatly appreciated - perhaps its
> the libcurl

It is libcurl in every case except for a few exceptions when loading remote
XML files like metadata, which varies by version and in some cases is Xerces
NetAccessor code that is usually native socket code on the platform. I doubt
those cases would matter much here, and certainly not for any SOAP issues.

I doubt very much it exposes much control over DNS lookup but you're welcome
to ask. Even if it did, I can't imagine how it would help. How and when
would the SP know to use them?

Note that the SP will do client side failover of SOAP endpoints if you
supply more than one. So honestly, if you're going to rely on DNS for
failover, you should expose each endpoint at its own DNS address and let the
SP use all of them. If they fail fast, it would work better than trying to
mess with DNS.
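
As a rough illustration of the client-side failover idea described above (a sketch only, not the SP's actual logic; the hostnames idp1.example.org and idp2.example.org, the function name, and the 2-second timeout are all hypothetical): try each endpoint in order with a short connect timeout and use the first one that answers.

```python
import socket
from typing import Callable, Iterable, Optional, Tuple

Endpoint = Tuple[str, int]  # (hostname, port)


def first_reachable(
    endpoints: Iterable[Endpoint],
    connect: Callable[[Endpoint, float], object] = socket.create_connection,
    timeout: float = 2.0,
) -> Optional[Endpoint]:
    """Return the first endpoint that accepts a TCP connection, or None."""
    for endpoint in endpoints:
        try:
            conn = connect(endpoint, timeout)
        except OSError:
            # Refused, timed out, or unresolvable: fail fast and move on.
            continue
        close = getattr(conn, "close", None)
        if close is not None:
            close()
        return endpoint
    return None


if __name__ == "__main__":
    # Hypothetical hosts, one DNS name per IdP node.
    candidates = [("idp1.example.org", 8443), ("idp2.example.org", 8443)]
    print(first_reachable(candidates))
```

The short timeout is what makes this workable: if a dead node fails fast, the second endpoint is tried almost immediately and no DNS record has to change.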

-- Scott

Peter Schober

Jul 22, 2010, 3:44:30 PM

to shibbole...@internet2.edu

* a...@ucop.edu <a...@ucop.edu> [2010-07-22 20:16]:

> We have a similar problem. Only this time, this is not about the
> upgrade. We have an IDP server of which we have clones - as
> backup. So, if we were to do maintenance on one, we would switch the
> DNS entry to the backup box. Now, when we tried this last week, we
> noticed the same problem that Chris had.

Did you reduce the TTL on the DNS zone early enough in advance?
Even then some clients or resolver libraries might not care and still
cache the entry for a while.
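
A worked example of the timing in question, as a sketch with made-up numbers (the 24-hour original TTL, the timestamps, and the function names are all assumptions): resolvers that honour TTLs can hold the old long-TTL record for up to the old TTL after you lower it, so the lowering has to happen at least that long before the switch.

```python
from datetime import datetime, timedelta


def earliest_safe_cutover(ttl_lowered_at: datetime, old_ttl_seconds: int) -> datetime:
    """Time after which TTL-honouring caches can no longer hold the long-TTL record."""
    return ttl_lowered_at + timedelta(seconds=old_ttl_seconds)


def worst_case_stale_until(record_changed_at: datetime, new_ttl_seconds: int) -> datetime:
    """Latest time a TTL-honouring cache may still return the old address."""
    return record_changed_at + timedelta(seconds=new_ttl_seconds)


if __name__ == "__main__":
    lowered = datetime(2010, 7, 18, 12, 0)            # hypothetical: TTL lowered here
    print(earliest_safe_cutover(lowered, 86400))      # old TTL assumed to be 24 hours
    changed = datetime(2010, 7, 19, 15, 0)            # hypothetical: record switched here
    print(worst_case_stale_until(changed, 900))       # new TTL assumed to be 15 minutes
```

This arithmetic only covers caches that respect the TTL; anything that ignores it falls outside the estimate.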

Using DNS for failover/switchover is prone to cause these kinds of
problems (as can also be read from Scott's reply) and hence generally
isn't recommended. There's even a bit about this in the Shib wiki, in
the context of an IdP cluster (more than one active node):
https://spaces.internet2.edu/display/SHIB2/IdPClusterIntro

But if this is planned maintenance, as you said, you might as well
move the IP address of the active server over to the other machine
(possibly on a second or virtual interface) and avoid DNS changes
altogether. Does not change much in your procedure (so all user IdP
sessions will still be lost etc) but avoids one of the drawbacks of
such a method.
-peter

a...@ucop.edu

Jul 22, 2010, 5:36:37 PM

to shibbole...@internet2.edu

Peter and Scott,

For our zones, based on our TTL settings, we should be able to resolve to the
new IP within 15-20 minutes in the "worst" case. In fact, on the SP host, the SP
team was able to ping/telnet to the new host almost instantaneously. Yet, until I
finally asked them to restart the SP, the service was down.

I found a very old discussion on the cURL list which suggests that DNS changes
never get registered with cURL -
http://curl.haxx.se/mail/lib-2002-04/0029.html

We are using version curl-7.11.0-39.20. If the SP uses libcurl to find the
IDP, could it be that the SP only gets what libcurl knows at start up time and
is that why DNS changes are not encouraged? Anyway, the only resolution I have
for now (if I make a DNS change) is to restart the SP. And I get asked, why
does the SP restart make it work - which is why I posted it here.

--

I agree we could transfer the IP itself for any scheduled outage and avoid a
DNS change. We will test that and see. I am just surprised that some of the
SPs work and the others don’t with a DNS change and that probably depends on
each SP's stack - OS, versions of libraries etc.

Thanks for your suggestions and leads.
Abhinav@AIG, UCOP

Scott Cantor

Jul 22, 2010, 5:44:22 PM

to shibbole...@internet2.edu

> For our zones, based on our TTL settings, the "worst" case, 15-20 minutes we
> should be able to resolve to the new IP. In fact on the SP host, the SP team
> was able to almost instantaneously ping/telnet to the new host. Yet, until I
> finally asked them to restart the SP, the service was down.

The cache is not generally going to be cross-process, though, so what you can ping really makes no difference.
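
A toy model of that point, offered only as a sketch (the class, the hostname shibidp.example.org, the 192.0.2.x addresses, and the 900-second TTL are all hypothetical, and real resolver caches in libcurl or the OS behave differently in detail): each process holds its own cache, so a freshly started process sees the new address while a long-running daemon keeps the old one until its entry expires or it restarts.

```python
import time
from typing import Callable, Dict, Tuple


class ProcessLocalDnsCache:
    """A toy per-process resolver cache; entries live until their TTL expires."""

    def __init__(self, resolve: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._resolve = resolve
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # host -> (address, expiry)

    def lookup(self, host: str) -> str:
        now = self._clock()
        cached = self._entries.get(host)
        if cached is not None and now < cached[1]:
            return cached[0]  # still fresh: the real resolver is never asked
        address = self._resolve(host)
        self._entries[host] = (address, now + self._ttl)
        return address


def demo():
    host = "shibidp.example.org"
    records = {host: "192.0.2.10"}   # stands in for the DNS zone
    now = [0.0]                      # fake clock so the demo is deterministic

    def clock():
        return now[0]

    daemon = ProcessLocalDnsCache(records.__getitem__, 900, clock)
    before = daemon.lookup(host)     # long-running process caches the old address

    records[host] = "192.0.2.20"     # the alias is moved to the backup box
    now[0] = 60.0
    stale = daemon.lookup(host)      # daemon still answers from its own cache
    fresh = ProcessLocalDnsCache(records.__getitem__, 900, clock).lookup(host)

    now[0] = 901.0
    after_expiry = daemon.lookup(host)  # entry expired, so the daemon re-resolves
    return before, stale, fresh, after_expiry
```

In this model the `fresh` lookup plays the role of ping or telnet, and restarting the daemon is equivalent to constructing a new cache object.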

> I found a very old discussion on cURL from where it seems like DNS changes
> never get registered with cURL -
> http://curl.haxx.se/mail/lib-2002-04/0029.html

Could be, but that’s pretty old.

> We are using version curl-7.11.0-39.20.

Is that something that a Linux distribution is still shipping? That seems awfully old.

> If the SP uses libcurl to find the
> IDP, could it be that the SP only gets what libcurl knows at start up time
> and is that why DNS changes are not encouraged?

I really have no idea. The only people that know that kind of detail would be on the curl list, or possibly even people that write OS network stacks.

> I agree we could transfer the IP itself for any scheduled outage and avoid a
> DNS change. We will test that and see. I am just surprised that some of the
> SPs work and the others don’t with a DNS change and that probably depends on
> each SP's stack - OS, versions of libraries etc.

Certainly. That's exactly what I would expect.

-- Scott

Russ Allbery

Jul 22, 2010, 6:54:41 PM

to shibbole...@internet2.edu

Peter Schober <peter....@univie.ac.at> writes:
> * a...@ucop.edu <a...@ucop.edu> [2010-07-22 20:16]:

>> We have a similar problem. Only this time, this is not about the
>> upgrade. We have an IDP server of which we have clones - as
>> backup. So, if we were to do maintenance on one, we would switch the
>> DNS entry to the backup box. Now, when we tried this last week, we
>> noticed the same problem that Chris had.

> Did you reduce the TTL on the DNS zone early enough in advance? Even
> then some clients or resolver libraries might not care and still cache
> the entry for a while.

Some versions of nscd, for instance, are notorious for caching things for
up to an hour without regard for the TTL.

--
Russ Allbery (r...@stanford.edu) <http://www.eyrie.org/~eagle/>
