Message-ID Syntax

Our friends in the Email Geeks Slack have recently started seeing sporadic rejections due to invalid Message-ID headers.

550-5.7.1 [10.11.12.13] Messages missing a valid Message-ID header are not 
550-5.7.1 accepted. For more information, go to 
550-5.7.1 https://support.google.com/mail/?p=RfcMessageNonCompliant and review 
550 5.7.1 RFC 5322 specifications. a640c23a62f3a-af635861f6esi307480166b.114

Nothing has changed about how they’ve been generating Message-IDs, and some mail is being accepted, but some isn’t.

Skipping over the diagnosis steps, the problem was exactly what Gmail’s rejection message said it was - an invalid Message-ID header.

It looks like Google are gradually ratcheting up their enforcement of RFC compliance, enforcing it on a small fraction of email to encourage senders to comply with those requirements. I’d guess that they’ll continue to do that until they’re rejecting all mail that violates that particular RFC requirement.

What was invalid about the Message-ID was that it contained colons. They were being used to separate different bits of encoded metadata:

Message-ID: <customer:some-id:some-other-id@something.isp-domain.example>

That’s a nice format, but unfortunately violates RFC 5322.

How do we know it violates RFC 5322? Well, that’s a long trek through how an RFC works.

If you don’t want to go through all that then the short version is that a Message-ID header

  • starts with a <
  • followed by a bunch of characters (the left hand side, or LHS) which
    • can be any of A-Z, a-z, 0-9, or any of !, #, $, %, &, ', *, +, -, /, =, ?, ^, _, `, {, |, }, ~ or .
    • but cannot start or end with a .
    • and cannot have more than one . in a row
  • followed by an @
  • followed by a bunch more characters (the right hand side, or RHS) which
    • have the same technical requirements as the LHS …
    • … but should almost always be a hostname in a domain you control
  • ends with a >

If you follow those requirements your Message-ID will be compliant with RFC 5322, and also follow good practices for Message-ID generation.
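
As a rough illustration, the short version above can be turned into a single regular expression. This is a minimal sketch, not Gmail's actual check: the names _WORD, _MSG_ID and is_valid_msg_id are made up for this example, and it only looks at the <LHS@RHS> value, not the header name or any surrounding whitespace.

```python
import re

# One "word": a run of the characters allowed on either side of the "@".
# Periods are handled separately so they can only appear between words.
_WORD = r"[A-Za-z0-9!#$%&'*+\-/=?^_`{|}~]+"

# "<", words separated by single periods, "@", more words, ">"
_MSG_ID = re.compile(rf"<{_WORD}(?:\.{_WORD})*@{_WORD}(?:\.{_WORD})*>")


def is_valid_msg_id(value: str) -> bool:
    """Return True if value follows the short-version rules for <LHS@RHS>."""
    return _MSG_ID.fullmatch(value) is not None
```

With this sketch the colon-separated ID from earlier fails, while the same ID with periods in place of the colons passes. It can't tell you whether the RHS is a hostname in a domain you control, that part is still up to you.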

The “hostname” that makes up the RHS doesn’t need to align with any other hostname in the email. It doesn’t need to accept email, it doesn’t even need to exist in DNS at all. But it should be syntactically a hostname - words separated by dots - and it should end in your domain, or any domain you control.

If you don’t want to know about the eldritch horror that is Augmented Backus-Naur Form you can safely stop reading now.


RFC Syntax

In the RFCs that specify how the internet is plumbed together, any text-based protocol tends to be specified using “Augmented Backus-Naur Form”, ABNF, which itself is defined in RFC 5234. If you’re relying on RFCs to understand the details of a protocol you’ll have to pick up some ABNF sooner or later.

ABNF for Message-ID headers

If we look at section 3.6.4 of RFC 5322 we discover some snippets of ABNF that define the message-ids and the headers that use them. Each of these is a “rule”, consisting of a “rulename”, an equals sign and the elements that make up the rule.

message-id      =   "Message-ID:" msg-id CRLF

in-reply-to     =   "In-Reply-To:" 1*msg-id CRLF

references      =   "References:" 1*msg-id CRLF

msg-id          =   [CFWS] "<" id-left "@" id-right ">" [CFWS]

id-left         =   dot-atom-text / obs-id-left

id-right        =   dot-atom-text / no-fold-literal / obs-id-right

no-fold-literal =   "[" *dtext "]"

Looking at the first of these seven rules:

message-id      =   "Message-ID:" msg-id CRLF

This says that the rule message-id consists of the literal string “Message-ID:”, followed by the element msg-id followed by the element CRLF. The literal string is the header name followed by a colon, exactly what we see in email headers. The CRLF element is defined elsewhere, but it’s just the end of line, consisting of a carriage-return, line-feed pair.

In turn, the msg-id element is defined by a later rule:

msg-id          =   [CFWS] "<" id-left "@" id-right ">" [CFWS]

In ABNF anything in square brackets is optional. So [CFWS] means that you can have a CFWS element there, but it’s not required. CFWS is again defined elsewhere, but it’s just whitespace (spaces or tabs), specifically the sort of whitespace where mail software is allowed to break the header into multiple lines or to add (comments in parentheses).

So the msg-id consists of optional whitespace, a literal “<”, an id-left, a literal “@”, an id-right, a literal “>” and some more optional whitespace.

So, what’s an id-left? That’s defined in the next rule down:

id-left         =   dot-atom-text / obs-id-left

The / means that you must have one of the separated elements, but only one. So this means that an id-left is either a dot-atom-text element or an obs-id-left element. All the elements that begin with “obs-” are obsolete - so while mail software receiving mail should understand and accept them, no mail software should generate messages using them. So we’re going to say that our id-left is just a dot-atom-text.

Atoms and atext

“Atoms” are a special sort of word, used in a lot of internet protocols. Historically they started in email, so they’re defined in RFC 5322 too, in section 3.2.3.

Several productions in structured header field bodies are simply strings of certain basic characters. Such productions are called atoms.

Some of the structured header field bodies also allow the period character (".", ASCII value 46) within runs of atext. An additional “dot-atom” token is defined for those purposes.

There’s some more ABNF to define atoms and atom-text:

atext           =   ALPHA / DIGIT /    ; Printable US-ASCII
                    "!" / "#" /        ;  characters not including
                    "$" / "%" /        ;  specials.  Used for atoms.
                    "&" / "'" /
                    "*" / "+" /
                    "-" / "/" /
                    "=" / "?" /
                    "^" / "_" /
                    "`" / "{" /
                    "|" / "}" /
                    "~"

atom            =   [CFWS] 1*atext [CFWS]

dot-atom-text   =   1*atext *("." 1*atext)

dot-atom        =   [CFWS] dot-atom-text [CFWS]

This is easier to explain starting with the smallest element and working outwards, so I’m going to start with the atext rule.

Just like we saw in the id-left rule this consists of a list of elements, separated with a /. We can choose one and only one of those to make an atext element. Most of our choices are literal strings, each one character long, such as "!" or "?", so an atext can be any of those. It can also be an ALPHA or DIGIT element. Those are basic elements, used by a lot of protocols, so they - along with CRLF - are defined in the ABNF RFC, RFC 5234. ALPHA can be any single character string, “A” to “Z” or “a” to “z”, while DIGIT is a single character string “0” to “9”.

So all this says that an atext element is a single character long string that’s either alphanumeric or one of the ASCII characters specified explicitly in the atext rule.
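
To make that concrete, here is a small sketch that writes the atext rule out as a set of characters and uses it to find what's wrong with the left hand side of the rejected Message-ID. The names ATEXT and non_atext_chars are just for this example.

```python
import string

# atext: ALPHA / DIGIT / the punctuation listed in the atext rule.
ATEXT = frozenset(string.ascii_letters + string.digits + "!#$%&'*+-/=?^_`{|}~")


def non_atext_chars(text: str) -> list[str]:
    """Return the distinct characters in text that are not atext, in order of first appearance."""
    seen: list[str] = []
    for ch in text:
        if ch not in ATEXT and ch not in seen:
            seen.append(ch)
    return seen


print(non_atext_chars("customer:some-id:some-other-id"))  # [':']
```

The hyphens are fine, they're in the atext list. The colon isn't, and that's the whole problem.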

We can then use that definition to create a dot-atom-text element:

dot-atom-text   =   1*atext *("." 1*atext)

We need to know a bit more ABNF to decipher this. “1*” means “one or more of the following token”. “*” means “zero or more of the following token”. And a series of tokens surrounded by parentheses is treated as a single token.

1*atext means one or more atext tokens concatenated together - so it’s a string that’s one or more characters long, consisting solely of the ASCII characters allowed for atext. “wtf?” or “Alice” or “hunter2” would all match 1*atext.

So ("." 1*atext) means a literal “.” followed by a string of atext characters.

And 1*atext *("." 1*atext) means a string of atext characters followed by zero or more literal “.”, atext string pairs. Or, in English, it’s one or more strings separated by periods. “Alice.Bob.Carol” or “gmail.com” or “10.11.hatstand?.13” are all dot-atom-texts.

That a dot-atom-text is defined as one or more words separated by periods is why it can’t begin or end with a period or have more than one period in a row.
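
The “words separated by periods” reading translates almost directly into code: split on periods and insist that every piece is a non-empty run of atext. This is a sketch of that reasoning rather than a full RFC 5322 parser, and is_dot_atom_text is a name invented for the example.

```python
import string

# atext: ALPHA / DIGIT / the punctuation listed in the atext rule.
ATEXT = frozenset(string.ascii_letters + string.digits + "!#$%&'*+-/=?^_`{|}~")


def is_dot_atom_text(text: str) -> bool:
    """True if text is one or more non-empty atext words joined by single periods."""
    words = text.split(".")
    # A leading, trailing or doubled period shows up as an empty word.
    return all(word and all(ch in ATEXT for ch in word) for word in words)
```

A leading period, a trailing period or two periods in a row each produce an empty word after the split, which is why all three are rejected by the one check.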

Back to Message-ID

So, now we know what a dot-atom-text is we can use it to build a syntactically correct message-id.

message-id      =   "Message-ID:" msg-id CRLF

msg-id          =   [CFWS] "<" id-left "@" id-right ">" [CFWS]

id-left         =   dot-atom-text / obs-id-left

id-right        =   dot-atom-text / no-fold-literal / obs-id-right

We’re going to ignore the obs-id-left and obs-id-right because they’re obsolete. And we’re going to ignore no-fold-literal because it’s not very interesting, nor commonly used.

So to be syntactically correct the contents of a Message-ID header must be

  • Some optional whitespace. We almost always put a single space here, because it looks odd otherwise.
  • A literal “<” character
  • A dot-atom-text
  • A literal “@” character
  • A dot-atom-text
  • A literal “>”
  • Some more, even more optional, whitespace
  • A carriage return, linefeed to end the line

In the short version at the top of this post we said that the RHS, after the “@”, should be a hostname. Where does that come from?

Uniqueness

Every message-id should be globally unique. Every email created should have a message-id that has never been used before, so it can be used by mail systems to identify the mail. (This isn’t theoretical - there are mail systems that use Message-ID to eliminate duplicates, for instance if someone is sent an email both directly and via a mailing list one of the duplicates may be removed).

Even though message-ids are being generated by billions of pieces of software that don’t coordinate with each other it’s not that difficult to ensure they’re unique. There are countless ways to do this, but the algorithm recommended by RFC 5322 is to generate the two parts of the message-id in different ways such that when they’re combined they’re guaranteed to be unique.

For the RHS you pick a hostname in a domain you control. It’s often the hostname of the mailserver generating the Message-ID header, but it doesn’t have to be - it’s just a hostname you control. Because you control it, you can be sure that nobody else using this uniqueness algorithm will pick the same RHS.

Then you have your mailserver generate the LHS such that it never generates the same LHS twice. How it does that doesn’t matter, but it’s often done with timestamps and process identifiers. If you’re a bulk mailer you can use unique IDs you’re already generating for other purposes (e.g. VERP) instead. Doesn’t really matter.

Combining the unique-to-you RHS and the unique-on-your-system LHS gives a globally unique message-id.
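
Here is one way that might look in practice, as a sketch rather than a recommendation. The function name make_message_id and the hostname mail.example.com are placeholders, and adding a random UUID alongside the timestamp and process ID is my own belt-and-braces choice, not something the RFC asks for.

```python
import os
import time
import uuid


def make_message_id(domain: str) -> str:
    """Build <LHS@RHS> where RHS is a hostname you control and LHS never repeats."""
    # Timestamp and process ID, joined with periods so the LHS stays a
    # dot-atom-text. The random UUID covers two IDs minted in the same
    # nanosecond by the same process.
    lhs = f"{time.time_ns()}.{os.getpid()}.{uuid.uuid4().hex}"
    return f"<{lhs}@{domain}>"


print(make_message_id("mail.example.com"))
```

Every character it produces is a digit, a lowercase hex letter or a period, so the LHS is always a valid dot-atom-text. Python's standard library also has email.utils.make_msgid, which takes an optional domain argument and does much the same job.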

As long as your message-id generation guarantees uniqueness you don’t have to use this algorithm - but the vast majority of senders do it this way, and for bulk email we tend to want to do things the “normal” way.

That’s just a short trip into ABNF specifications and protocol RFCs, but it’s everything you need to understand how “RFC compliance” for a Message-ID header is defined, and where to start when looking at other RFC compliance requirements.
