Its caching DNS forwards through a couple of private
chained DNS servers before hitting the public's DNS servers.
Some of the public's DNS servers are occasionally a bit slow to respond,
and some of those use short expiry intervals, so the MX and A data isn't
always cached locally.
Am I correct in that Postfix itself has no configurable DNS timeout
parameters?
Also, am I correct that Postfix invokes res_query(), which uses
BIND's resolver library, which uses RES_TIMEOUT, which is defined in
/usr/include/resolv.h (5 seconds). And therefore statically linked
Postfix binaries would need to be relinked with a new libresolv.a
library whose RES_TIMEOUT has been increased?
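Why a lookup taking "a bit over 5 seconds" hurts can be seen from a simplified model of the retry schedule. This sketch assumes classic BIND-derived stub resolver behaviour, which is not stated in this thread, so check it against your own libresolv source: each round tries every configured nameserver, the first round waits `retrans` seconds (RES_TIMEOUT, i.e. 5, unless overridden) per server, and later rounds double the wait and split it among the servers. The function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical model of a BIND-style stub resolver's retry schedule
 * (an assumption; verify against your own resolver library source). */

/* Seconds to wait for one nameserver in a given round (round 0 first). */
static int attempt_timeout(int retrans, int round, int nscount)
{
    int seconds = retrans << round;    /* wait doubles every round */
    if (round > 0)
        seconds /= nscount;            /* later rounds split it among servers */
    return seconds > 0 ? seconds : 1;  /* never wait less than 1 s */
}

/* Worst case: seconds spent before giving up if no server ever answers. */
static int worst_case_wait(int retrans, int retry, int nscount)
{
    assert(retrans > 0 && nscount > 0 && retry > 0 && retry < 16);
    int total = 0;
    for (int round = 0; round < retry; round++)
        total += nscount * attempt_timeout(retrans, round, nscount);
    return total;
}
```

In this model, with a single local caching server (nscount 1) and a 5 s `retrans`, an answer that takes 5.1 s misses the first attempt and the query is resent with a 10 s wait.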
--
Greg
Sounds like a helpful component would be a dns cache that could
be configured to set a rather high minimum TTL it would honor. I
don't know of one, but dnscache, from djbdns, ought to be easy to
patch. A quick grep for ttl shows a couple of places where ttls are
being clamped to an upper value; the pattern I settled on after a
little looking about was 'ttl > 604800', where djbdns clamps the TTL
to an upper value of 1 week. I expect if each of the two matching
instances were preceded by something like
if (ttl < 1800) ttl = 1800;
or thereabouts, that should cause the cache to refuse to honor ttls
shorter than a half hour. Season to match your postfix retry
interval and such a hackup should get deliveries to sites with flaky
DNS to work on the second try.
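The patch described above amounts to adding a floor next to djbdns' existing one-week ceiling. A minimal, self-contained C sketch of that combined clamp; `clamp_ttl()`, `TTL_FLOOR` and `TTL_CEILING` are hypothetical names, not djbdns identifiers, and 1800 s is only the example floor suggested above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants, not djbdns code. */
#define TTL_FLOOR   1800U    /* half an hour; tune to the Postfix retry interval */
#define TTL_CEILING 604800U  /* one week, the cap djbdns already applies */

/* Clamp a TTL received from upstream into [min_ttl, max_ttl]. */
static uint32_t clamp_ttl(uint32_t ttl, uint32_t min_ttl, uint32_t max_ttl)
{
    assert(min_ttl <= max_ttl);
    if (ttl < min_ttl)   /* the proposed addition: refuse very short TTLs */
        ttl = min_ttl;
    if (ttl > max_ttl)   /* mirrors the existing 'ttl > 604800' check */
        ttl = max_ttl;
    return ttl;
}
```

For example, `clamp_ttl(300, TTL_FLOOR, TTL_CEILING)` yields 1800, so a 5-minute MX record stays cached for half an hour.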
An alternative view is to say, if someone has DNS sufficiently
poorly configured that many or most initial queries time out, and
further sets their TTL to shorter than a typical MTA retry interval,
they are the sort of misanthrope who doesn't feel a need to be
reachable via email, so why should you worry on their behalf?
-Bennett
Because the most problematic case is skytel.com (Skytel emergency
paging via email)
Some of it is our fault.
When querying them directly, the worst case is about 4.9
seconds. But after traversing 2 chained DNS forwarders, it's a little
over 5 seconds on some occasions.
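The arithmetic behind that observation is that the forwarders' added delay pushes a 4.9 s answer past a 5 s per-attempt timeout. A minimal sketch with hypothetical helper names; the per-hop overheads used with it are assumptions, since the thread only reports about 4.9 s direct and "a little over 5 seconds" through two forwarders.

```c
#include <assert.h>
#include <stddef.h>

/* Response time seen by the Postfix host: the direct time plus whatever
 * each chained forwarder adds (hypothetical helper). */
static double chained_response_time(double direct, const double *hop_overhead,
                                    size_t hops)
{
    double total = direct;
    for (size_t i = 0; i < hops; i++)
        total += hop_overhead[i];  /* each forwarder adds its own delay */
    return total;
}

/* Nonzero if the answer arrives before the first attempt times out. */
static int beats_first_timeout(double response_time, int retrans)
{
    assert(retrans > 0);
    return response_time < (double)retrans;
}
```

With an assumed 0.1 s per forwarder, the chained time is about 5.1 s: under a 5 s timeout the first attempt fails, while under a 10 s timeout it would succeed in this simplified model.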
--
Greg
2003-02-03T14:26:48 Greg Hackney:
> >....if someone has DNS sufficiently
> > ...poorly configured that many or most initial queries timeout....
> > ...why should you worry on their behalf?
> > -Bennett
>
> Because the most problematic case is skytel.com (Skytel emergency
> paging via email)
Ahh, thanks for the additional info.
If I owned your problem myself, here's what _I_ would do about it.
I'd identify this domain that's critical to my operations, and whose
DNS is marginal, and would set up a script that (a) automatically
checks, say, once an hour whether the data has changed and, if so,
kicks off a rebuild of the (b) private mirror I kept, served by a
little localhost-bound tinydns instance, that's offering
authoritative MX and corresponding A data for skytel.com, which is
in turn (c) used by the dnscache instance local to the mail server
for its lookups. djbdns (like recent versions of BIND) allows a
caching nameserver to be configured to divert queries for certain
domains to specific servers, overriding the normal
delegation-from-root path.
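Step (a) of this plan is a change check. A minimal C sketch of just the comparison, assuming the MX answers have already been fetched (for example with res_query() and parsed) into a hypothetical `struct mx` with lowercase host names and no trailing dot; the fetching and the rebuild of the tinydns data are left out, and the host names in any usage are placeholders.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record for one MX answer; host assumed lowercase with no
 * trailing dot, so plain strcmp() is a valid comparison. */
struct mx {
    unsigned short pref;
    char host[256];
};

/* qsort() comparator: by preference, then by host name. */
static int mx_cmp(const void *a, const void *b)
{
    const struct mx *x = a;
    const struct mx *y = b;
    if (x->pref != y->pref)
        return x->pref < y->pref ? -1 : 1;
    return strcmp(x->host, y->host);
}

/* Nonzero if the fresh MX set differs from the snapshot the private
 * mirror was built from (order ignored); both arrays are sorted in place. */
static int mx_changed(struct mx *snapshot, size_t n_snap,
                      struct mx *fresh, size_t n_fresh)
{
    if (n_snap != n_fresh)
        return 1;
    if (n_snap == 0)
        return 0;
    assert(snapshot != NULL && fresh != NULL);
    qsort(snapshot, n_snap, sizeof *snapshot, mx_cmp);
    qsort(fresh, n_fresh, sizeof *fresh, mx_cmp);
    for (size_t i = 0; i < n_snap; i++)
        if (mx_cmp(&snapshot[i], &fresh[i]) != 0)
            return 1;
    return 0;
}
```

When `mx_changed()` returns nonzero, the hourly script would regenerate the mirror's data (step b); the dnscache instance (step c) keeps pointing at the mirror unchanged.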
-Bennett
Correct. Postfix invokes the client routines in resolver(3), which
hide all the gory details of querying a server.
> Also, am I correct that Postfix invokes res_query(), which uses
> BIND's resolver library, which uses RES_TIMEOUT, which is defined in
> /usr/include/resolv.h (5 seconds). And therefore statically linked
> Postfix binaries would need to be relinked with a new libresolv.a
> library whose RES_TIMEOUT has been increased?
Neither FreeBSD 4.7 nor 7.3 document RES_TIMEOUT.
Wietse
> > Postfix here is having occasional DNS timeout problems.
> >
> > Its caching DNS forwards through a couple of private
> > chained DNS servers before hitting the public's DNS servers.
> >
> > Some of the public's DNS servers are occasionally a bit slow to respond,
> > and some of those use short expiry intervals, so the MX and A data isn't
> > always cached locally.
...
> > And therefore statically linked
> > Postfix binaries would need to be relinked with a new libresolv.a
> > library whose RES_TIMEOUT has been increased?
>
> Neither FreeBSD 4.7 nor 7.3 document RES_TIMEOUT.
Furthermore, I think it's a bad idea for many sending sites to be making
adjustments to accommodate a receiving site that is configured in an
unreliable manner. Any mail-receiving site that sets its DNS TTL values
to less than a few hours is essentially declaring that it doesn't
want to receive mail from everybody. I don't think we should encourage
this attitude. I set my TTL values low sometimes (when I am in the
middle of making many changes), and then only as low as 1 hour. Then
they go back up to 12 or 24 hours.
Very low TTL values should only be set for load-balanced interactive
destinations such as web sites, where there is no queuing mechanism and
users don't mind an occasional timeout and a manual reload. It makes
absolutely no sense to set TTL values for MX records any lower than a
few hours, and perhaps one hour temporarily.
Rahul
Thanks for the info.
> Neither FreeBSD 4.7 nor 7.3 document RES_TIMEOUT.
I ran across this in the BIND 8 resolver.5 manual:
timeout:n
sets the amount of time the resolver will wait for a
response from a remote name server before retrying the
query via a different name server. Measured in seconds,
the default is RES_TIMEOUT (see <resolv.h>).
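Where the resolver library is BIND-derived (glibc's is, for example), the value that `timeout:n` sets is held at run time in the resolver state as `_res.retrans`, and `attempts:n` sets `_res.retry`. A minimal sketch of a program raising both itself after res_init(); `raise_resolver_timeout()` is a hypothetical name, this is not a Postfix setting, and as noted in the thread the resolver libraries on some systems differ.

```c
#include <sys/types.h>
#include <netinet/in.h>
#include <arpa/nameser.h>
#include <resolv.h>
#include <assert.h>

/* Hypothetical helper: initialise the resolver state (reading
 * /etc/resolv.conf, starting from the compiled-in RES_TIMEOUT), then
 * override the per-attempt timeout and the number of retry rounds. */
static int raise_resolver_timeout(int seconds, int attempts)
{
    assert(seconds > 0 && attempts > 0);
    if (res_init() != 0)
        return -1;
    _res.retrans = seconds;   /* what "timeout:n" would set */
    _res.retry = attempts;    /* what "attempts:n" would set */
    return 0;
}
```

Any res_query() call made afterwards in the same process would use the larger values.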
--
Greg
Obviously, the systems that I refer to do not ship with BIND 8
client libraries.
Wietse