The trouble with CNAMEs

When you query DNS for something, you ask your local recursive DNS resolver for all the answers of a certain type that it has for a hostname. If you’re going to a website, your browser asks your resolver for all records for “google.com” of type “A” (or “AAAA”, but that’s not important right now). The resolver will either return all the A records for google.com it has cached, or it will go through the complex process of looking up the results from the authoritative servers, cache them for as long as the TTL field of the reply says it should, then return them to you.

There are dozens of different types of records: AAAA for IPv6 addresses, MX for mailservers, TXT for arbitrary text (mostly used for various sorts of authentication, including SPF, DKIM and DMARC). And then there’s CNAME.

CNAME stands for “Canonical Name” and means “Go and ask this different question instead”. If you have a DNS record that looks like “www.example.com CNAME example.net” then any time you ask your DNS resolver for records of any type for www.example.com it will see that there’s a CNAME record and do a query of the same type for example.net instead. So queries for “www.example.com A” will return whatever the answer for “example.net A” is, queries for “www.example.com MX” will return the same thing as “example.net MX”.
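The “go and ask this different question instead” behaviour can be sketched with a toy lookup table. This is only an illustration, not how any real resolver is implemented: the `RECORDS` table, the `resolve` function and the addresses in it are made up for the example.

```python
# Toy record data: (name, type) -> list of values. All names and
# addresses here are illustrative, not real DNS data.
RECORDS = {
    ("www.example.com", "CNAME"): ["example.net"],
    ("example.net", "A"): ["192.0.2.10"],
    ("example.net", "MX"): ["mail.example.net"],
}


def resolve(name, rtype, records=RECORDS, max_hops=8):
    """Return the values for (name, rtype), following CNAMEs along the way."""
    for _ in range(max_hops):
        cname = records.get((name, "CNAME"))
        if cname and rtype != "CNAME":
            # A CNAME means "ask the same question about this name instead".
            name = cname[0]
            continue
        return records.get((name, rtype), [])
    raise RuntimeError("CNAME chain too long")


print(resolve("www.example.com", "A"))   # ['192.0.2.10']
print(resolve("www.example.com", "MX"))  # ['mail.example.net']
```

Whatever type you ask for at www.example.com, the answer comes from the same type at example.net.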

For a long time the main use you saw for CNAMEs was making “www.” hostnames work for webhosting, with “www.example.com CNAME example.com” records so that the www version of your website resolved to the same IP address as the non-www version.

One important thing about CNAMEs is that you should never have both CNAME records and any other sort of record for the same hostname. It breaks things, and now that we rely on DNS for more and more complex configuration and authentication it can break things in complex, inconsistent and hard to diagnose ways.

The concrete example of this today was diagnosing why SPF was failing, despite DNS apparently being set up correctly.

There were two return paths, email1.example.com and email2.example.com, both for use at the same ESP, one that uses CNAMEs to make user onboarding easy.

email1.example.com 3600 CNAME esp.com
email2.example.com 3600 CNAME esp.com
esp.com             300 TXT "v=spf1 exists:%{i}._spf.esp.com"

Identical DNS was configured for both hostnames. Doing a dig from the command line gave the correct SPF record for both hostnames. And yet email2 randomly failed SPF while email1 always passed, even though both were being sent from the same IP address. That … shouldn’t happen.

My first thought was that there was some misconfiguration at esp.com such that it wasn’t handling email2 properly. But the only macro in that SPF record is “%{i}”, the IP address. So the ESP doesn’t know anything other than the sending IP address when answering that query, and it can’t give different answers for different hostnames (%{d} is the SPF macro for the domain being checked, if you do need that).
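To make that concrete, here is a minimal sketch of expanding just the “%{i}” macro. The function name `expand_i_macro` and the sending IP are assumptions for illustration, and it only handles IPv4; a real SPF implementation supports many more macros.

```python
import ipaddress


def expand_i_macro(domain_spec, sender_ip):
    """Expand only the %{i} macro of an SPF domain-spec, for IPv4 senders."""
    ip = ipaddress.ip_address(sender_ip)
    if ip.version != 4:
        # IPv6 expands to dot-separated nibbles; left out of this sketch.
        raise ValueError("this sketch only handles IPv4")
    return domain_spec.replace("%{i}", str(ip))


spec = "%{i}._spf.esp.com"
# The return-path hostname is not an input here, so mail for email1 and
# email2 sent from the same IP triggers exactly the same lookup.
print(expand_i_macro(spec, "192.0.2.25"))  # 192.0.2.25._spf.esp.com
```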

After poking at the eight authoritative nameservers for the example.com zone, and being sidetracked by some other misconfigurations in their DNS, I found the answer. And, despite causing such weird symptoms, it was surprisingly simple.

Someone had added a google-site-verification TXT record for email2.example.com. That breaks the rule that you should never have a CNAME and any other DNS record for the same hostname. The failure works like this:

If I ask my DNS resolver for the SPF TXT record for email2 – “email2.example.com TXT” – and it doesn’t have it cached, then it will go ask one of the authoritative servers – ns04.example.com, say – for “email2.example.com TXT”. ns04 is being asked for a TXT record, and it has a matching TXT record, so it ignores the CNAME and returns the Google site validation record:

email2.example.com 300 TXT "google-site-verification=ZbTqQmfwO0C4..."

There’s no SPF TXT record in that response, so SPF fails. The resolver will hang on to that record for the next 300 seconds, and SPF will fail all that time.

But what if I query for something else, the MX record for email2.example.com – “email2.example.com MX”? Again, my resolver will go ask ns04 for the answer and it’ll get back something like this:

email2.example.com 3600 CNAME esp.com
esp.com            300  MX    mail.esp.com

The resolver will then cache that result, keeping the CNAME around for the next hour. So if I now ask for a TXT record again, “email2.example.com TXT”, my resolver will find the CNAME record in its cache and go “Alright, there’s a CNAME response so I should follow it to get the answer!”

email2.example.com 3600 CNAME esp.com
esp.com            300 TXT "v=spf1 exists:%{i}._spf.esp.com"

So now the answer I get has a validly formatted SPF TXT record in the response and so SPF passes for the message.

This means that, depending on the history of queries the recursive resolver at a mailbox provider has seen recently, it may have the (incorrect) TXT record cached, and return that, or it may have the (correct) CNAME record cached, and return that along with the (correct) set of TXT records. From the outside it looks like you get one or the other set of answers kind of at random (and just to make it more fun, different DNS resolvers may handle this in different ways).
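The whole failure can be reproduced in a small simulation. This is a sketch under loud assumptions: `ZONE`, `authoritative`, `Resolver` and `has_spf` are invented for the example, one table stands in for both the example.com and esp.com servers, and this toy resolver prefers a cached record of the requested type over a cached CNAME, which is only one of the ways real resolvers behave.

```python
# Simulated data: (name, type) -> (TTL, values), mirroring the records above.
ZONE = {
    ("email2.example.com", "CNAME"): (3600, ["esp.com"]),
    ("email2.example.com", "TXT"): (300, ["google-site-verification=..."]),
    ("esp.com", "TXT"): (300, ["v=spf1 exists:%{i}._spf.esp.com"]),
    ("esp.com", "MX"): (300, ["mail.esp.com"]),
}


def authoritative(name, rtype):
    """Answer like ns04 in the story: an exact type match wins over the CNAME."""
    if (name, rtype) in ZONE:
        return [(name, rtype) + ZONE[(name, rtype)]]
    if (name, "CNAME") in ZONE:
        return [(name, "CNAME") + ZONE[(name, "CNAME")]]
    return []


class Resolver:
    """Toy caching resolver; 'now' is a plain number of seconds."""

    def __init__(self):
        self.cache = {}  # (name, type) -> (expires_at, values)

    def _cached(self, name, rtype, now):
        entry = self.cache.get((name, rtype))
        return entry[1] if entry and entry[0] > now else None

    def lookup(self, name, rtype, now):
        for _ in range(8):
            values = self._cached(name, rtype, now)
            if values is not None:
                return values
            cname = self._cached(name, "CNAME", now)
            if cname is not None:
                name = cname[0]  # follow the cached CNAME
                continue
            answer = authoritative(name, rtype)
            if not answer:
                return []
            for rname, rt, ttl, vals in answer:
                self.cache[(rname, rt)] = (now + ttl, vals)
        raise RuntimeError("CNAME chain too long")


def has_spf(txt_values):
    return any(v == "v=spf1" or v.startswith("v=spf1 ") for v in txt_values)


r = Resolver()
print(has_spf(r.lookup("email2.example.com", "TXT", now=0)))    # False
r.lookup("email2.example.com", "MX", now=10)                    # caches the CNAME
print(has_spf(r.lookup("email2.example.com", "TXT", now=20)))   # False
print(has_spf(r.lookup("email2.example.com", "TXT", now=400)))  # True
```

The same hostname fails SPF at 0 and 20 seconds and passes at 400 seconds, once the stray TXT record has expired from the cache and the CNAME is still there. A fresh resolver that sees the MX query first would pass straight away.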

So the morals of this story are:

  • Avoid CNAMEs when you can
  • Never have CNAMEs on the same hostname as any other sort of DNS record (which does mean you can never put them at the root of a zone, as they’ll always clash with the SOA and NS records there)
  • If you have weird flaky maybe DNS related failures and a CNAME is involved, check for a clashing record

You can check for clashes like this, assuming you’re expecting to ask foo.example.com for a TXT record:

$ dig +short example.com ns
ns01.example.com.
ns02.example.com.

$ dig +noall +answer foo.example.com txt @ns01.example.com
foo.example.com.   3600   IN   CNAME   esp.com.

This is the response you hope to get – just a CNAME response, meaning there’s no conflicting TXT record. If instead you don’t get a CNAME response but do get a TXT record then that TXT record conflicts.
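If you’re checking a lot of hostnames, the same judgement can be applied to pasted dig output with a few lines of code. This is a hedged sketch: `answer_types` and `check_cname_clash` are hypothetical helpers that only parse text you hand them, they don’t run dig, and they assume you expected the hostname to be a CNAME.

```python
def answer_types(answer_text):
    """Return the set of record types found in dig-style answer lines."""
    types = set()
    for line in answer_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or line.startswith(";"):
            continue  # skip blank lines, comments and anything too short
        # Fields are: name, TTL, optional class (IN), type, data...
        rtype = fields[3] if fields[2].upper() == "IN" else fields[2]
        types.add(rtype.upper())
    return types


def check_cname_clash(answer_text, wanted="TXT"):
    """Classify an authoritative answer for a name that should be a CNAME."""
    types = answer_types(answer_text)
    if types == {"CNAME"}:
        return "ok: only a CNAME, no clashing record"
    if wanted in types:
        return "clash: the server answered with its own " + wanted + " record"
    return "inconclusive"


good = "foo.example.com. 3600 IN CNAME esp.com."
bad = 'foo.example.com. 300 IN TXT "google-site-verification=..."'
print(check_cname_clash(good))  # ok: only a CNAME, no clashing record
print(check_cname_clash(bad))   # clash: the server answered with its own TXT record
```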
