Don't make your DNS TTLs too short

DNS is a complex, distributed database built on top of an unreliable network that spans the world.

Recursive Queries

The core of it all is the thirteen DNS root servers, a.root-servers.net through m.root-servers.net. That’s where every DNS lookup starts, and they will reply to almost every request with “I don’t know, ask that DNS server over there instead”.

So you ask the server it points you to, and so on. At each step you may not know the IP address of the server you’re trying to query, so you’ll have to start again with the root servers to find the IP address of that hostname, and so on.
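Here’s a minimal sketch of that iterative process in Python, using the dnspython library. It’s a simplification on purpose - it starts at a.root-servers.net, follows referrals only when the response includes glue addresses, and gives up otherwise, where a real resolver would also handle missing glue, TCP fallback, DNSSEC and much more:

```python
import dns.message
import dns.query
import dns.rdatatype

def iterate(qname, qtype, server="198.41.0.4"):  # a.root-servers.net
    for hops in range(1, 11):  # give up rather than loop forever
        query = dns.message.make_query(qname, qtype)
        response = dns.query.udp(query, server, timeout=2)
        if response.answer:
            return response.answer, hops
        # A referral: find a glue A record for one of the listed nameservers
        glue = [rdata for rrset in response.additional
                for rdata in rrset if rdata.rdtype == dns.rdatatype.A]
        if not glue:
            raise RuntimeError("no glue - the nameserver's own name would "
                               "need another lookup from the root")
        server = glue[0].address
    raise RuntimeError("too many referrals")

answers, hops = iterate("turscar.ie", "TXT")
print(f"got an answer after {hops} round trips: {answers}")
```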

Even an apparently trivial DNS query, such as “What’s the SPF record for turscar.ie?”, can take five back-to-back queries to authoritative servers. If it’s slightly more complex - say mail being sent through an ESP, where your return path may be a CNAME to a TXT record managed by your ESP - that’ll add another three or four queries. Each of these queries depends on the response to the one before it, so you have to do them one at a time.

Some of those DNS queries will go to complex, geographically distributed DNS clusters that’ll route queries to the nearest server, and you might get a response back in 40 milliseconds. Others will go to a single DNS server on the far side of the world and might take 200 milliseconds.

That time adds up, and even in the best case the time taken to query DNS for authentication records can dominate the time taken to receive the email.

And DNS runs over the Internet, which is not always a 100% reliable network. We don’t tend to notice that because everything we do has error correction and automatic retries built into it. DNS is a pretty old protocol, and doesn’t usually do anything too fancy. The simplest strategy it has is just to wait a couple of seconds for a reply, and if it doesn’t get one it’ll send the request again.
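You can see that wait-then-retry behaviour directly in resolver client libraries. A quick sketch with dnspython - the two- and six-second values here are illustrative, not anyone’s production settings:

```python
import dns.exception
import dns.resolver

resolver = dns.resolver.Resolver()
resolver.timeout = 2.0   # how long to wait for any single server to reply
resolver.lifetime = 6.0  # total budget across all retries and servers

try:
    answer = resolver.resolve("turscar.ie", "TXT")
except dns.exception.Timeout:
    # a lost packet or two and we've burned several seconds
    print("query timed out after retries")
```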

So if a packet gets lost, or the network flaps, or the authoritative DNS server stumbles, or any number of other rare failures happen, that could easily stretch a DNS request out to two or three seconds or longer.

Caching

All that DNS work would make our use of the Internet incredibly slow, so we don’t do it. Or, rather, we don’t do it all the time. We send all our DNS requests to a local “recursive resolver” which will do all that DNS resolution stuff for us. And it also caches the answers it’s received before.

Ask the same question again - “What’s the TXT record for turscar.ie?” - and it will give you the answer it already has. (It’ll also use those cached responses for intermediate steps in the DNS work, which helps a lot.)

But if it remembered answers forever you’d never be able to change your DNS, as there might still be some dusty recursive resolver out there that remembered the old value and would return it to queries. So every bit of information in the DNS is tagged with a “time to live”, its “TTL”. That’s the number of seconds that answer should be cached for.

Once a recursive resolver has an answer it starts ticking the TTL down. Once it reaches zero, the cached answer is deleted and any new query for it will have to do all the work to get the answer from the global DNS again.
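You can see the TTL on any answer you get back. Another small dnspython sketch - note that because this queries through whatever recursive resolver your machine uses, the TTL you see is the time remaining in its cache, not necessarily the full authoritative value:

```python
import time
import dns.resolver

answer = dns.resolver.resolve("turscar.ie", "TXT")
print(answer.rrset.ttl)                  # seconds this answer may still be cached
print(answer.expiration - time.time())   # the same, via the absolute expiry time
```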

TTL Tradeoffs

If you have a really short TTL, say thirty seconds, then you know that anyone using that record will have to reach all the way to your authoritative server for an answer after no more than thirty seconds.

That means that if you change your DNS you know it’ll be visible to the whole planet after no more than about thirty seconds.

But it also means that if your DNS is heavily used every recursive resolver on the planet will have to hit your authoritative server for the answer every thirty seconds. And every time each one of them does that there’s a risk of something going wrong, causing the DNS query to be delayed or to fail.

If there’s a one in a hundred chance of a query being particularly slow then a 30 second TTL means that each recursive resolver may see a slow query once or twice an hour. A 3600 second TTL would mean once or twice a week. An 86400 second TTL would mean three or four times a year.
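Those numbers are easy to sanity-check, assuming each recursive resolver refetches once per TTL expiry and a one-in-a-hundred chance that any given query is slow:

```python
DAY = 86400
p_slow = 0.01  # one query in a hundred is slow

for ttl in (30, 3600, 86400):
    slow_per_day = DAY / ttl * p_slow
    print(f"TTL {ttl:>5}s: about {slow_per_day:.2f} slow queries per day, "
          f"{slow_per_day * 365:.1f} per year")

# TTL    30s: ~28.8 slow queries/day - once or twice an hour
# TTL  3600s: ~0.24 slow queries/day - once or twice a week
# TTL 86400s: ~0.01 slow queries/day - three or four times a year
```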

I care about slow queries why?

Large scale inbound mailservers deal with a lot of mail at a time. A lot.

They handle a lot of mail in parallel, but there’s a limit on how many concurrent deliveries they can handle. If a single email takes a long time to handle - if, say, you’re waiting for DNS results - then that will increase the load on that mailserver. If it happens for many mails it’ll increase it a lot.

It’s a decent engineering decision to just decide that you’ll give up on any DNS request that takes too long to answer (at aboutmy.email, which tries to be more tolerant of that sort of thing than production servers, we have a hard limit of five seconds). Or just to give each email a fixed budget for how long it has to do all its authentication.
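As a sketch of that second approach - a fixed per-message budget shared across all the authentication lookups - here’s one way it might look in Python with dnspython. The five-second budget and the shape of the record list are illustrative, not how any particular mailserver does it:

```python
import time
import dns.exception
import dns.resolver

def authentication_lookups(records, budget=5.0):
    """Run a list of (name, rdtype) lookups inside one shared time budget."""
    deadline = time.monotonic() + budget
    resolver = dns.resolver.Resolver()
    results = {}
    for name, rdtype in records:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            results[name] = None  # out of time: treat as a temporary failure
            continue
        resolver.lifetime = remaining  # never run past the shared deadline
        try:
            results[name] = resolver.resolve(name, rdtype)
        except (dns.exception.Timeout, dns.resolver.NXDOMAIN,
                dns.resolver.NoAnswer):
            results[name] = None
    return results
```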

If email authentication is just being used to give better delivery to mail with a good reputation, then the occasional email not getting the chance to show off its reputation isn’t really going to matter.

If you switch to requiring that either SPF or DKIM authentication passes, rejecting the email otherwise, it’s still not a bad engineering decision. You might sometimes have an email where SPF times out, or where DKIM times out, but having both time out will be fairly rare. (The failures of SPF and DKIM won’t be entirely independent, but common failure modes are likely to affect one or the other rather than both a lot of the time.)
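To put rough numbers on it: if the SPF lookup and the DKIM lookup each timed out one time in a hundred, and the failures were completely independent, both timing out on the same message would happen about one time in ten thousand. Correlated failures - a flaky resolver, a congested link - will make it more common than that, but still far rarer than either timing out alone.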

And why is this post tagged “Microsoft”?

I’ve seen slightly sketchy authentication results at Microsoft in the past - it’s sometimes looked like transient DNS failures. My understanding is that they’ve made the “decent engineering decision” to limit the time to check authentication to a few seconds.

And Microsoft have tightened up their policy to require, in some cases, both SPF and DKIM authentication to pass. And folks have started to see a moderate level of apparently random rejections.

550 5.7.515 Access denied, sending domain DOMAIN.TLD doesn’t meet the required authentication level. The sender’s domain in the 5322.From address doesn’t meet the authentication requirements defined for the sender.

Overly short TTLs on your authentication records (TXT records for SPF and DKIM, CNAMEs pointing at your ESP) will make DNS timeouts more common, and so potentially be part of the reason for these rejections.
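If you want to check what you’re publishing, something like this sketch will show the TTLs. The names here are hypothetical placeholders - substitute your own SPF TXT record and DKIM selector - and remember that querying through a recursive resolver shows you the remaining cache time, so point it at your authoritative servers if you want the configured value:

```python
import dns.resolver

# Hypothetical names - replace with your own domain and DKIM selector.
records = [
    ("example.com", "TXT"),                          # SPF
    ("selector1._domainkey.example.com", "CNAME"),   # DKIM, via an ESP CNAME
]

for name, rdtype in records:
    answer = dns.resolver.resolve(name, rdtype)
    ttl = answer.rrset.ttl
    note = "  <- consider something longer" if ttl < 3600 else ""
    print(f"{name} {rdtype}: TTL {ttl}s{note}")
```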

There are other reasons too, both specific to Microsoft’s handling of inbound mail and general misconfiguration or slight violations of the SMTP RFCs.

But DNS timeout related authentication failures are going to be much more random and unreproducible than many of the other problems. They’ll make diagnosing the broader issue harder, so the more you can avoid DNS issues the clearer other causes of delivery problems might become.

What’s a good TTL?

TTLs only really apply when you’re modifying or deleting records, not when you’re adding new ones. So it’s when you’re changing things or migrating or doing maintenance that you need to care.

Some DNS records benefit from the agility given by short TTLs. DNS for email, especially DNS for email authentication, isn’t one of them.

It’s pretty rare that you’re going to need to modify your email authentication at a moment’s notice.

If you’re going to start sending from new IP addresses you can add those to your SPF record as part of your migration process. If you’re switching to a new ESP you’re probably going to have a period of overlap when you’re using both, so you can have CNAMEs and SPF includes for the new ESP in place well in advance.
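As a sketch, with made-up names: during the overlap period your SPF record and ESP CNAMEs might carry both old and new entries, all with day-long TTLs:

```
; Hypothetical zone snippet - example.com, _spf.old-esp.example and
; _spf.new-esp.example are placeholders, not real providers.
example.com.               86400 IN TXT   "v=spf1 ip4:192.0.2.0/24 include:_spf.old-esp.example include:_spf.new-esp.example ~all"
s1._domainkey.example.com. 86400 IN CNAME dkim.new-esp.example.
```

Once the cutover is done and the old ESP retired, you remove the stale entries at your leisure - nothing here needs to propagate within seconds.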

Unless you’ve got infrastructure that can benefit from it, and an engineering team that can support it, 86400 seconds - one day - is a perfectly good default TTL.

You’re not Google or AOL. You’re not going to be hacker-typing to move network infrastructure around in an emergency. And you don’t have the engineering team and monitoring infrastructure needed to ensure that your anycast DNS cluster is as close to perfect as is humanly possible.

So, 86400 seconds.

If you get pushback and can’t convince your network administrator that long TTLs are good for stable infrastructure, at least talk them out of doing anything shorter than 3600 seconds - one hour.
