Message-ID Syntax
Our friends in the Email Geeks slack have recently started seeing sporadic rejections due to invalid Message-ID headers.
550-5.7.1 [10.11.12.13] Messages missing a valid Message-ID header are not
550-5.7.1 accepted. For more information, go to
550-5.7.1 https://support.google.com/mail/?p=RfcMessageNonCompliant and review
550 5.7.1 RFC 5322 specifications. a640c23a62f3a-af635861f6esi307480166b.114
Nothing has changed about how they’ve been generating Message-IDs, and some mail is being accepted, but some isn’t.
Skipping over the diagnosis steps, the problem was exactly what Gmail’s rejection message said it was - an invalid Message-ID header.
It looks like Google are gradually ratcheting up their enforcement of RFC compliance, enforcing it on a small fraction of email to encourage senders to comply with those requirements. I’d guess that they’ll continue to do that until they’re rejecting all mail that violates that particular RFC requirement.
What made the Message-ID invalid was that it contained colons. They were being used to separate different bits of encoded metadata:
Message-ID: <customer:some-id:some-other-id@something.isp-domain.example>
That’s a nice format, but unfortunately violates RFC 5322.
How do we know it violates RFC 5322? Well, that’s a long trek through how an RFC works.
If you don’t want to go through all that then the short version is that a Message-ID header
- starts with a <
- followed by a bunch of characters (the left hand side, or LHS) which
  - can be any of A-Z, a-z, 0-9, or any of !, #, $, %, &, ', *, +, -, /, =, ?, ^, _, `, {, |, }, ~ or .
  - but cannot start or end with a .
  - and cannot have more than one . in a row
- followed by an @
- followed by a bunch more characters (the right hand side, or RHS) which
  - have the same technical requirements as the LHS …
  - … but should almost always be a hostname in a domain you control
- ends with a >
If you follow those requirements your Message-ID will be compliant with RFC 5322, and also follow good practices for Message-ID generation.
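As a rough illustration, the checklist above can be turned into a few lines of Python. This is only a sketch of the syntax rules, not what Gmail or any other receiver actually runs, and the function names are made up for this example. It checks a bare value such as <lhs@rhs> and ignores the obsolete forms and domain literals.

```python
import string

# Characters allowed on either side of the "@", not counting the period.
ATEXT = set(string.ascii_letters + string.digits + "!#$%&'*+-/=?^_`{|}~")


def is_dot_separated_words(part: str) -> bool:
    """True if part is one or more non-empty runs of ATEXT joined by single periods."""
    # A leading, trailing or doubled period shows up as an empty word after the split.
    return all(word and set(word) <= ATEXT for word in part.split("."))


def looks_like_valid_message_id(value: str) -> bool:
    """Check a bare Message-ID value such as '<lhs@rhs>' against the short rules."""
    if not (value.startswith("<") and value.endswith(">")):
        return False
    lhs, at, rhs = value[1:-1].partition("@")
    if not at:
        return False  # no "@" at all
    return is_dot_separated_words(lhs) and is_dot_separated_words(rhs)
```

With this sketch the colon-separated ID from earlier is rejected, while the same ID with periods in place of the colons passes.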
The “hostname” that makes up the RHS doesn’t need to align with any other hostname in the email. It doesn’t need to accept email, it doesn’t even need to exist in DNS at all. But it should be syntactically a hostname - words separated by dots - and it should end in your domain, or any domain you control.
If you don’t want to know about the eldritch horror that is Augmented Backus-Naur Form you can safely stop reading now.
RFC Syntax
In the RFCs that specify how the internet is plumbed together any text-based protocol tends to be specified using “Augmented Backus-Naur Form”, ABNF, which itself is defined in RFC 5234. If you’re relying on RFCs to understand the details of a protocol you’ll have to pick up some ABNF sooner or later.
ABNF for Message-ID headers
If we look at section 3.6.4 of RFC 5322 we discover some snippets of ABNF that define the message-ids and the headers that use them. Each of these is a “rule”, consisting of a “rulename”, an equals sign and the elements that make up the rule.
message-id = "Message-ID:" msg-id CRLF
in-reply-to = "In-Reply-To:" 1*msg-id CRLF
references = "References:" 1*msg-id CRLF
msg-id = [CFWS] "<" id-left "@" id-right ">" [CFWS]
id-left = dot-atom-text / obs-id-left
id-right = dot-atom-text / no-fold-literal / obs-id-right
no-fold-literal = "[" *dtext "]"
Looking at the first of these seven rules:
message-id = "Message-ID:" msg-id CRLF
This says that the rule message-id consists of the literal string “Message-ID:”, followed by the element msg-id, followed by the element CRLF. The literal string is the header name followed by a colon, exactly what we see in email headers. The CRLF element is defined elsewhere, but it’s just the end of line, consisting of a carriage-return, line-feed pair.
In turn, the msg-id element is defined by a later rule:
msg-id = [CFWS] "<" id-left "@" id-right ">" [CFWS]
In ABNF anything in square brackets is optional. So [CFWS] means that you can have a CFWS element there, but it’s not required. CFWS is again defined elsewhere, but it’s just whitespace (spaces or tabs), specifically the sort of whitespace where mail software is allowed to break the header into multiple lines or to add (comments in parentheses).
So the msg-id consists of optional whitespace, a literal “<”, an id-left, a literal “@”, an id-right, a literal “>” and some more optional whitespace.
So, what’s an id-left? That’s defined in the next rule down:
id-left = dot-atom-text / obs-id-left
The / means that you must have one of the separated elements, but only one. So this means that an id-left is either a dot-atom-text element or an obs-id-left element. All the elements that begin with “obs-” are obsolete - so while mail software receiving mail should understand and accept them, no mail software should generate messages using them. So we’re going to say that our id-left is just a dot-atom-text.
Atoms and atext
“Atoms” are a special sort of word, used in a lot of internet protocols. Historically they started in email, so they’re defined in RFC 5322 too, in section 3.2.3.
Several productions in structured header field bodies are simply strings of certain basic characters. Such productions are called atoms.
Some of the structured header field bodies also allow the period character (".", ASCII value 46) within runs of atext. An additional “dot-atom” token is defined for those purposes.
There’s some more ABNF to define atoms and atom-text:
atext = ALPHA / DIGIT /   ; Printable US-ASCII
        "!" / "#" /       ; characters not including
        "$" / "%" /       ; specials. Used for atoms.
        "&" / "'" /
        "*" / "+" /
        "-" / "/" /
        "=" / "?" /
        "^" / "_" /
        "`" / "{" /
        "|" / "}" /
        "~"
atom = [CFWS] 1*atext [CFWS]
dot-atom-text = 1*atext *("." 1*atext)
dot-atom = [CFWS] dot-atom-text [CFWS]
This is easier to explain starting with the smallest element and working outwards, so I’m going to start with the atext rule.
Just like we saw in the id-left rule, this consists of a list of elements, separated with a /. We can choose one and only one of those to make an atext element. Most of our choices are literal strings, each one character long, such as "!" or "?", so an atext can be any of those. It can also be an ALPHA or DIGIT element. Those are basic elements, used by a lot of protocols, so they - along with CRLF - are defined in the ABNF RFC, RFC 5234. ALPHA can be any single character string, “A” to “Z” or “a” to “z”, while DIGIT is a single character string “0” to “9”.
So all this says that an atext element is a single character long string that’s either alphanumeric or one of the ASCII characters specified explicitly in the atext rule.
We can then use that definition to create a dot-atom-text element:
dot-atom-text = 1*atext *("." 1*atext)
We need to know a bit more ABNF to decipher this. “1*” means “one or more of the following token”. “*” means “zero or more of the following token”. And a series of tokens surrounded by parentheses is treated as a single token.
1*atext means one or more atext tokens concatenated together - so it’s a string that’s one or more characters long, consisting solely of the ASCII characters allowed for atext. “wtf?” or “Alice” or “hunter2” would all match 1*atext.
So ("." 1*atext) means a literal “.” followed by a string of atext characters.
And 1*atext *("." 1*atext) means a string of atext characters followed by zero or more literal “.”, atext string pairs. Or, in English, it’s one or more strings separated by periods. “Alice.Bob.Carol” or “gmail.com” or “10.11.hatstand?.13” are all dot-atom-texts.
That a dot-atom-text is defined as one or more words separated by periods is why it can’t begin or end with a period or have more than one period in a row.
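ABNF like this translates almost mechanically into a regular expression, which can make the rule easier to experiment with. The following is a sketch in Python. The names ATEXT, DOT_ATOM_TEXT and is_dot_atom_text are just labels chosen for this example, not anything defined by the RFC.

```python
import re

# atext: ALPHA / DIGIT / the punctuation characters listed in the atext rule.
ATEXT = r"[A-Za-z0-9!#$%&'*+\-/=?^_`{|}~]"

# dot-atom-text = 1*atext *("." 1*atext)
#   1*atext            -> ATEXT+
#   *("." 1*atext)     -> (?:\.ATEXT+)*
DOT_ATOM_TEXT = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*")


def is_dot_atom_text(s: str) -> bool:
    """True if the whole string matches the dot-atom-text rule."""
    return DOT_ATOM_TEXT.fullmatch(s) is not None
```

All three examples above match, while strings such as “.Alice”, “Alice.” or “Alice..Bob” do not, because every period has to be followed by at least one atext character and the string has to start with one.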
Back to Message-ID
So, now that we know what a dot-atom-text is, we can use it to build a syntactically correct message-id.
message-id = "Message-ID:" msg-id CRLF
msg-id = [CFWS] "<" id-left "@" id-right ">" [CFWS]
id-left = dot-atom-text / obs-id-left
id-right = dot-atom-text / no-fold-literal / obs-id-right
We’re going to ignore the obs-id-left and obs-id-right because they’re obsolete. And we’re going to ignore no-fold-literal because it’s not very interesting, nor commonly used.
So to be syntactically correct the contents of a Message-ID header must be
- Some optional whitespace. We almost always put a single space here, because it looks odd otherwise.
- A literal “<” character
- A dot-atom-text
- A literal “@” character
- A dot-atom-text
- A literal “>”
- Some more, even more optional, whitespace
- A carriage return, linefeed to end the line
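Putting that list together, here is a hedged sketch of a check for a complete header line that also pulls out the two halves. It makes some simplifying assumptions: the optional whitespace is treated as plain spaces and tabs (no comments, no folded lines), the obsolete forms and no-fold-literal are ignored as above, and the function name is invented for this example.

```python
import re
from typing import Optional, Tuple

_ATEXT = r"[A-Za-z0-9!#$%&'*+\-/=?^_`{|}~]"
_DOT_ATOM_TEXT = rf"{_ATEXT}+(?:\.{_ATEXT}+)*"

# "Message-ID:" [CFWS] "<" id-left "@" id-right ">" [CFWS] CRLF
# Quoted strings in ABNF are case-insensitive, hence IGNORECASE for the header name.
# ASCII keeps the case-insensitive matching from accepting non-ASCII letters.
_MESSAGE_ID_LINE = re.compile(
    rf"Message-ID:[ \t]*<(?P<left>{_DOT_ATOM_TEXT})@(?P<right>{_DOT_ATOM_TEXT})>[ \t]*\r\n",
    re.IGNORECASE | re.ASCII,
)


def split_message_id_header(line: str) -> Optional[Tuple[str, str]]:
    """Return (id-left, id-right) for a well-formed header line, else None."""
    match = _MESSAGE_ID_LINE.fullmatch(line)
    if match is None:
        return None
    return match.group("left"), match.group("right")
```

A line like "Message-ID: <abc.123@mail.example.com>" followed by a carriage return and linefeed comes back as the pair ("abc.123", "mail.example.com"), and the colon-laden ID from the start of this post comes back as None.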
In the short version at the top of this post we said that the RHS, after the “@”, should be a hostname. Where does that come from?
Uniqueness
Every message-id should be globally unique. Every email created should have a message-id that has never been used before, so it can be used by mail systems to identify the mail. (This isn’t theoretical - there are mail systems that use Message-ID to eliminate duplicates. For instance, if someone is sent an email both directly and via a mailing list, one of the duplicates may be removed.)
Even though message-ids are being generated by billions of pieces of software that don’t coordinate with each other it’s not that difficult to ensure they’re unique. There are countless ways to do this, but the algorithm recommended by RFC 5322 is to generate the two parts of the message-id in different ways such that when they’re combined they’re guaranteed to be unique.
For the RHS you pick a hostname in a domain you control. It’s often the hostname of the mailserver generating the Message-ID header, but it doesn’t have to be - it’s just a hostname you control. Because you control it, you can be sure that nobody else using this uniqueness algorithm will pick the same RHS.
Then you have your mailserver generate the LHS such that it never generates the same LHS twice. How it does that doesn’t matter, but it’s often done with timestamps and process identifiers. If you’re a bulk mailer you can use unique IDs you’re already generating for other purposes (e.g. VERP) instead. Doesn’t really matter.
Combining the unique-to-you RHS and the unique-on-your-system LHS gives a globally unique message-id.
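As a minimal sketch of that approach in Python: the LHS below is built from a timestamp and the process ID, as described above, plus a random component that is an extra assumption of this example to cover two IDs generated in the same microsecond. The function name and the domain used with it are placeholders.

```python
import os
import time
import uuid


def make_message_id(domain: str) -> str:
    """Build '<lhs@domain>' where lhs is timestamp.pid.random."""
    timestamp = time.time_ns() // 1000  # microseconds since the epoch
    # Digits, lowercase hex and single periods only, so the LHS is a valid dot-atom-text.
    lhs = f"{timestamp}.{os.getpid()}.{uuid.uuid4().hex}"
    return f"<{lhs}@{domain}>"


# "mail.example.com" stands in for a hostname in a domain you control.
print(make_message_id("mail.example.com"))
```

Python’s standard library also ships email.utils.make_msgid(), which takes an optional domain argument and produces an ID along similar lines, so in practice there is rarely a need to write your own.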
As long as your message-id generation guarantees uniqueness you don’t have to use this algorithm - but the vast majority of senders do it this way, and for bulk email we tend to want to do things the “normal” way.
That’s just a short trip into ABNF specifications and protocol RFCs, but it’s everything you need to understand how “RFC compliance” for a Message-ID header is defined, and where to start when looking at other RFC compliance requirements.