The trouble with CNAMEs
When you query DNS you ask your local recursive resolver for all the answers of a certain type that it has for a hostname. If you’re going to a website, your browser asks your resolver for all records for “google.com” of type “A” (or “AAAA”, but that’s not important right now). The resolver will either return all the A records for google.com it has cached, or go through the complex process of looking up the results from the authoritative servers, cache them for as long as the TTL field of the reply says it should, then return them to you.
There are dozens of different types of records: AAAA for IPv6 addresses, MX for mailservers, TXT for arbitrary text (mostly used for various sorts of authentication, including SPF, DKIM and DMARC). And then there’s CNAME.
CNAME stands for “Canonical Name” and means “Go and ask this different question instead”. If you have a DNS record that looks like “www.example.com CNAME example.net” then any time you ask your DNS resolver for records of any type for www.example.com it will see that there’s a CNAME record and do a query of the same type for example.net instead. So queries for “www.example.com A” will return whatever the answer for “example.net A” is, queries for “www.example.com MX” will return the same thing as “example.net MX”.
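As a rough illustration of that “ask this different question instead” behaviour, here is a minimal Python sketch of a lookup over a toy in-memory table. The table contents, the addresses and the `resolve` helper are all made up for this example; real resolvers do far more than this.

```python
# Toy in-memory "DNS": (name, type) -> list of record values.
# All names and addresses here are illustrative only.
RECORDS = {
    ("www.example.com", "CNAME"): ["example.net"],
    ("example.net", "A"): ["192.0.2.10"],
    ("example.net", "MX"): ["10 mail.example.net"],
}


def resolve(name, rtype, max_hops=8):
    """Return (chain, answers): the CNAMEs followed and the final records."""
    chain = []
    for _ in range(max_hops):
        cname = RECORDS.get((name, "CNAME"))
        if cname and rtype != "CNAME":
            # A CNAME means "ask the same question about this other name".
            chain.append((name, cname[0]))
            name = cname[0]
            continue
        return chain, RECORDS.get((name, rtype), [])
    raise RuntimeError("CNAME chain too long")


print(resolve("www.example.com", "A"))
print(resolve("www.example.com", "MX"))
```

Whatever type you ask for, the question gets redirected to example.net, which is why the A and MX answers both come from there.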
For a long time the main use you saw for CNAMEs was making “www.” hostnames work for webhosting, with “www.example.com CNAME example.com” records so that the www version of your website resolved to the same IP address as the non-www version.
One important thing about CNAMEs is that you should never have both CNAME records and any other sort of record for the same hostname. It breaks things, and now that we rely on DNS for more and more complex configuration and authentication it can break things in complex, inconsistent and hard to diagnose ways.
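If you can export your zone as a list of (name, type) pairs, the rule is easy to check mechanically. This is only a sketch: the `find_cname_clashes` helper and the sample records are hypothetical, and it treats every other record type as a clash, which is a simplification (DNSSEC's own records are allowed to sit next to a CNAME).

```python
from collections import defaultdict


def find_cname_clashes(records):
    """Return the names where a CNAME shares a hostname with another type.

    records is an iterable of (name, type) pairs.
    """
    types_by_name = defaultdict(set)
    for name, rtype in records:
        # Normalise case and any trailing dot so equal names compare equal.
        types_by_name[name.lower().rstrip(".")].add(rtype.upper())
    return sorted(
        name
        for name, types in types_by_name.items()
        if "CNAME" in types and len(types) > 1
    )


zone = [
    ("www.example.com", "CNAME"),
    ("www.example.com", "TXT"),   # clashes with the CNAME above
    ("mail.example.com", "CNAME"),
    ("example.com", "A"),
]
print(find_cname_clashes(zone))
```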
The concrete example of this today was diagnosing why SPF was failing, despite DNS apparently being set up correctly.
There were two return paths, email1.example.com and email2.example.com. Both were for use at the same ESP, one that uses CNAMEs to make user onboarding easy.
email1.example.com 3600 CNAME esp.com
email2.example.com 3600 CNAME esp.com
esp.com 300 TXT "v=spf1 exists:%{i}._spf.esp.com"
Identical DNS was configured for both hostnames. Doing a dig from the command line gave the correct SPF record for both hostnames. And yet email2 randomly failed SPF, while email1 always passed SPF, even though both were being sent from the same IP address. That … shouldn’t happen.
My first thought was that there was some misconfiguration at esp.com such that it wasn’t handling email2 properly. But the only macro in that SPF record is “%{i}”, the IP address. So the ESP doesn’t know anything other than the sending IP address when answering that query, and it can’t give different answers for different hostnames. (SPF does have macros for that, such as %{d} for the domain being checked and %{h} for the HELO hostname, if you do need that.)
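To make that concrete, here is a small sketch of how the name looked up for that exists: mechanism is built. The `exists_query_name` helper is hypothetical, it handles only the %{i} macro, and only for IPv4 (IPv6 addresses are expanded differently).

```python
import ipaddress


def exists_query_name(mechanism, sender_ip, mail_from_domain):
    """Expand %{i} in an SPF "exists:" mechanism (IPv4 only).

    mail_from_domain is accepted but never used: the record contains no
    macro that refers to a hostname, so it cannot affect the query.
    """
    ip = ipaddress.IPv4Address(sender_ip)  # raises ValueError on bad input
    target = mechanism.split(":", 1)[1]
    return target.replace("%{i}", str(ip))


mech = "exists:%{i}._spf.esp.com"
print(exists_query_name(mech, "192.0.2.25", "email1.example.com"))
print(exists_query_name(mech, "192.0.2.25", "email2.example.com"))
```

Both calls produce the same query name, so the ESP's nameservers see identical questions for email1 and email2.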
After poking at the eight authoritative nameservers for the example.com zone, and being sidetracked by some other misconfigurations in their DNS, I found the answer. And, despite causing such weird symptoms, it was surprisingly simple.
Someone had added a google-site-verification TXT record for email2.example.com. That breaks the rule that you should never have a CNAME and any other DNS record for the same hostname. The failure works like this:
If I ask my DNS resolver for the SPF TXT record for email2 – “email2.example.com TXT” – and it doesn’t have it cached, then it will go ask one of the authoritative servers – ns04.example.com, say – for “email2.example.com TXT”. ns04 is being asked for a TXT record, and it has a matching TXT record, so it ignores the CNAME and returns the Google site verification record:
email2.example.com 300 TXT "google-site-verification=ZbTqQmfwO0C4..."
There’s no SPF TXT record in that response, so SPF fails. The resolver will hang on to that record for the next 300 seconds, and SPF will fail all that time.
But what if I query for something else, the MX record for email2.example.com – “email2.example.com MX”? Again, my resolver will go ask ns04 for the answer and it’ll get back something like this:
email2.example.com 3600 CNAME esp.com
esp.com 300 MX mail.esp.com
The resolver will then cache that result, keeping the CNAME around for the next hour. So if I now ask for a TXT record again, “email2.example.com TXT”, my resolver will find the CNAME record in its cache and go “Alright, there’s a CNAME response so I should follow it to get the answer!”
email2.example.com 3600 CNAME esp.com
esp.com 300 TXT "v=spf1 exists:%{i}._spf.esp.com"
So now the answer I get has a validly formatted SPF TXT record in the response and so SPF passes for the message.
This means that depending on the history of queries the recursive resolver at a mailbox provider has seen recently it may have the (incorrect) TXT record cached, and return that, or it may have the (correct) CNAME record cached, and return that along with the (correct) set of TXT records. From the outside it looks like you get one or the other set of answers kind of at random. (And just to make it more fun, different DNS resolvers may handle this in different ways.)
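The query-order dependence is easier to see in a toy simulation. Everything below is an assumption made for illustration: the `Resolver` class, the simulated clock, and an authoritative server that prefers an exact type match over the CNAME, as ns04 did in this story. Real resolvers and servers vary.

```python
# Authoritative data for the broken setup: a CNAME *and* a TXT on one name.
# Values are (ttl_seconds, records), copied from the examples in this post.
AUTH = {
    ("email2.example.com", "CNAME"): (3600, ["esp.com"]),
    ("email2.example.com", "TXT"): (300, ["google-site-verification=ZbTqQmfwO0C4..."]),
    ("esp.com", "TXT"): (300, ["v=spf1 exists:%{i}._spf.esp.com"]),
    ("esp.com", "MX"): (300, ["mail.esp.com"]),
}


def ask_authoritative(name, rtype):
    """Mimic the server in the story: an exact type match beats the CNAME."""
    if (name, rtype) in AUTH:
        return rtype, AUTH[(name, rtype)]
    if (name, "CNAME") in AUTH:
        return "CNAME", AUTH[(name, "CNAME")]
    return rtype, (0, [])


class Resolver:
    def __init__(self):
        self.cache = {}  # (name, type) -> (expiry_time, records)
        self.now = 0     # simulated clock, in seconds

    def _cached(self, name, rtype):
        entry = self.cache.get((name, rtype))
        return entry[1] if entry and entry[0] > self.now else None

    def lookup(self, name, rtype):
        for _ in range(8):  # bound the number of CNAME hops
            cname = self._cached(name, "CNAME")
            if cname is not None and rtype != "CNAME":
                name = cname[0]  # a cached CNAME is always followed
                continue
            records = self._cached(name, rtype)
            if records is not None:
                return records
            got_type, (ttl, records) = ask_authoritative(name, rtype)
            self.cache[(name, got_type)] = (self.now + ttl, records)
            if got_type == rtype:
                return records
            # Otherwise we were handed a CNAME; loop round and follow it.
        raise RuntimeError("CNAME chain too long")


cold = Resolver()
print(cold.lookup("email2.example.com", "TXT"))    # TXT asked first

warm = Resolver()
warm.lookup("email2.example.com", "MX")            # caches the CNAME
print(warm.lookup("email2.example.com", "TXT"))    # TXT asked second
```

The first resolver hands back the site verification record, the second follows the cached CNAME and hands back the SPF record, even though both asked the same TXT question.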
So the morals of this story are:
- Avoid CNAMEs when you can
- Never have CNAMEs on the same hostname as any other sort of DNS record (which does mean you can never put them at the root of a zone, as they’ll always clash there)
- If you have weird, flaky, maybe-DNS-related failures and a CNAME is involved, check for a clashing record
You can check for clashes like this, assuming you’re expecting to ask foo.example.com for a TXT record:
$ dig +short example.com ns
ns01.example.com.
ns02.example.com.
$ dig +noall +answer foo.example.com txt @ns01.example.com
foo.example.com. 3600 IN CNAME esp.com.
This is the response you hope to get – just a CNAME response, meaning there’s no conflicting TXT record. If instead you don’t get a CNAME response but do get a TXT record then that TXT record conflicts.
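If you have a lot of hostnames or nameservers to check, that judgement can be scripted. The sketch below is one possible approach, not a finished tool: `classify_answer` and `query_nameserver` are hypothetical helpers, and it assumes dig is installed and that you feed it answer lines in the form printed by dig's +noall +answer options.

```python
import subprocess


def classify_answer(name, answer_text, rtype="TXT"):
    """Classify an authoritative answer for a name that should be a CNAME.

    Returns "clash" if a record of rtype is returned directly for name,
    "ok" if only a CNAME is returned for it, and "no-cname" otherwise.
    """
    want = name.lower().rstrip(".")
    has_cname = has_direct = False
    for line in answer_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0].startswith(";"):
            continue  # skip blank lines and dig comments
        if fields[0].lower().rstrip(".") != want:
            continue  # a record for some other name, e.g. the CNAME target
        # dig prints "name ttl IN type data"; tolerate a missing class too.
        rec_type = fields[3] if fields[2].upper() == "IN" else fields[2]
        if rec_type.upper() == "CNAME":
            has_cname = True
        elif rec_type.upper() == rtype.upper():
            has_direct = True
    if has_direct:
        return "clash"
    return "ok" if has_cname else "no-cname"


def query_nameserver(name, rtype, server):
    """Ask one authoritative server directly (needs dig on the PATH)."""
    cmd = ["dig", "+noall", "+answer", "+norecurse", name, rtype, "@" + server]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


good = "foo.example.com. 3600 IN CNAME esp.com.\n"
bad = 'foo.example.com. 300 IN TXT "google-site-verification=..."\n'
print(classify_answer("foo.example.com", good))
print(classify_answer("foo.example.com", bad))
```

To run it against live DNS you would call something like `classify_answer(name, query_nameserver(name, "TXT", ns))` for each nameserver that the first dig command lists.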