If you use HTTP then the chances are good that you have to deal with HTTP headers. The syntax of HTTP headers has a long and tortured history, originating from the syntax of email headers. All too often I see headers that don't conform to the specifications. This makes everyone's job a little bit harder. The recent releases of the HTTP specifications have done a fair amount of clarification and consolidation to make it easier to get the syntax right.
Two stage parsing
HTTP header parsing is broken down into two phases. In phase one, the headers are extracted from the HTTP message into a set of name/value pairs. The name is a case-insensitive token that is defined in the HTTP specification and registered in the Message Header registry, and the value is a string whose syntax is defined specifically for that header name in the HTTP specification. The syntax of an HTTP header always looks like this:
header-field = field-name ":" OWS field-value OWS
OWS means "optional whitespace" and header fields are delimited using CRLF. Note that RFC 7230 no longer allows whitespace between the header name and the colon.
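As a rough sketch of phase one, the following Python splits a single header line into a name and a raw value without looking inside the value. The function name `split_header_line` is hypothetical, and lower-casing the name is just one way to handle case-insensitivity; a real parser would probably preserve the original casing as well.

```python
import re

# Characters allowed in a token (tchar in RFC 7230).
_TOKEN = re.compile(r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")


def split_header_line(line: str):
    """Split one header line (CRLF already removed) into (name, raw value)."""
    name, colon, value = line.partition(":")
    if not colon:
        raise ValueError("header line has no colon")
    # The name must be a token, so whitespace before the colon is rejected.
    if _TOKEN.fullmatch(name) is None:
        raise ValueError(f"invalid field name: {name!r}")
    # OWS is made up of spaces and horizontal tabs only.
    return name.lower(), value.strip(" \t")
```

The value is returned untouched apart from trimming the surrounding whitespace, which is all that phase one is meant to do.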
It is a header, that's all I need to know
The ability to generically parse HTTP headers into these name value pairs without concern for the syntax of the field value is critical for performance and extensibility. Not every component needs to look at every header, so the requirement for every intermediary in the path of the HTTP request to parse the header value contents would be wasteful. Also, new headers are created regularly, so needing to update HTTP libraries whenever new header definitions are added would be problematic.
Wrapped lines are no more
There are a couple of additional issues relating to parsing the header lines. In the past it was possible to wrap a header value onto the next line by prefixing the wrapped line with a whitespace character[1]. The most recent HTTP specifications deprecate this line folding and say that it should no longer be generated[2].
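Because older senders may still fold lines, a tolerant parser can unfold them before splitting names from values. Here is a minimal Python sketch, assuming the header block is already available as a string with CRLF line endings; `unfold` is a hypothetical helper name.

```python
import re

# A fold is a CRLF followed by at least one space or horizontal tab.
_FOLD = re.compile(r"\r\n[ \t]+")


def unfold(header_block: str):
    """Return the header lines with folded continuations joined by one space."""
    unfolded = _FOLD.sub(" ", header_block)
    # Drop the empty string left behind by the final CRLF.
    return [line for line in unfolded.split("\r\n") if line]
```

For example, `unfold("X-A: one\r\n two\r\nX-B: three\r\n")` gives `["X-A: one two", "X-B: three"]`.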
Multiple header instances
The second major issue is related to headers that contain lists of values separated by a comma. These headers can appear multiple times in a message. A header parsing routine can aggregate these multiple headers into a single list of values. The one exception to this rule is the Set-Cookie header. Set-Cookie headers are allowed to appear multiple times despite not being a comma separated list.
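One way to do that aggregation is sketched below in Python. The function `combine_headers` and its return shape are assumptions made for illustration: repeated headers are joined with a comma, while Set-Cookie values are kept apart in their own list.

```python
def combine_headers(pairs):
    """Merge repeated header names into a single comma-separated value.

    `pairs` is an iterable of (name, value) tuples in message order.
    Returns (combined, cookies), where `cookies` holds the Set-Cookie
    values individually because they cannot be joined safely.
    """
    combined = {}
    cookies = []
    for name, value in pairs:
        key = name.lower()  # header names are case-insensitive
        if key == "set-cookie":
            cookies.append(value)
        elif key in combined:
            combined[key] = combined[key] + ", " + value
        else:
            combined[key] = value
    return combined, cookies
```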
The building blocks
Splitting the header field names and values is the easy part. Unfortunately, that is where most HTTP frameworks that I've seen seem to give up, or only provide support for the most commonly used headers. They lay the burden of parsing the semantics out of headers on the application developer. This can be painful because every header field-value has its own syntax definition. Fortunately, the HTTP specification provides some basic building blocks for defining the syntax of HTTP headers.
Most headers, when simplified down to primitives, are a combination of token, quoted-string, comment, OWS (optional whitespace), RWS (required whitespace), and literals.
There are also several rules for defining lists of expressions.
Tokens are strings of characters that avoid certain delimiter characters. Quoted-strings are surrounded by double quotes and have a slightly different set of rules for what characters are allowed. A comment is a string surrounded by parentheses, with yet another set of valid characters.
For a quick review of which characters are allowed and which are not, I created a chart below[3].
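To make two of these primitives concrete, here is a small Python sketch with regular expressions for token and quoted-string, based on my reading of the character rules in RFC 7230. The helper names `is_token` and `unquote` are hypothetical. Comments are left out because they can nest, which a plain regular expression cannot express.

```python
import re

# token: one or more tchar.
TOKEN = re.compile(r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")

# quoted-string: a double quote, then any mix of plain characters
# (tab, space, 0x21, 0x23-0x5B, 0x5D-0x7E, 0x80-0xFF) and
# backslash-escaped characters, then a closing double quote.
QUOTED_STRING = re.compile(
    r'"((?:[\t \x21\x23-\x5B\x5D-\x7E\x80-\xFF]|\\[\t \x21-\x7E\x80-\xFF])*)"'
)


def is_token(text: str) -> bool:
    """True if the whole of `text` is a valid token."""
    return TOKEN.fullmatch(text) is not None


def unquote(text: str) -> str:
    """Return the content of a quoted-string with the escapes removed."""
    match = QUOTED_STRING.fullmatch(text)
    if match is None:
        raise ValueError(f"not a quoted-string: {text!r}")
    return re.sub(r"\\(.)", r"\1", match.group(1))
```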
Syntax Rules
As I mentioned, each header has its own syntax rules. For example, User-Agent is defined as:
User-Agent = product *( RWS ( product / comment ) )
Where,
product = token ["/" product-version]
product-version = token
and *(x) means "zero or more x", (x/y) means "either x or y" and [x] means "x is optional". This is a standardized syntax definition language called ABNF. Many of the header rules are summarized in appendices of the specifications in which they are defined.
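To show how such a rule turns into a parser, here is a hedged Python sketch that walks a User-Agent value and returns its products and comments. The names `parse_user_agent` and `_read_comment` and the tuple shapes are made up for this example, and the comment reader only tracks nesting and escapes; it does not validate every character.

```python
import re

_TOKEN = re.compile(r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+")
_RWS = re.compile(r"[ \t]+")


def _read_comment(text, pos):
    """Read a (possibly nested) comment that starts at text[pos] == '('."""
    depth = 0
    i = pos
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 1  # skip the escaped character
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return text[pos:i + 1], i + 1
        i += 1
    raise ValueError("unterminated comment")


def parse_user_agent(value):
    """Return a list of ('product', name, version) and ('comment', text)."""
    parts = []
    pos = 0
    while pos < len(value):
        if value[pos] == "(":
            if not parts:
                raise ValueError("User-Agent must start with a product")
            comment, pos = _read_comment(value, pos)
            parts.append(("comment", comment))
        else:
            match = _TOKEN.match(value, pos)
            if match is None:
                raise ValueError(f"expected a token at offset {pos}")
            name, pos = match.group(), match.end()
            version = None
            if value.startswith("/", pos):
                match = _TOKEN.match(value, pos + 1)
                if match is None:
                    raise ValueError(f"expected a version at offset {pos + 1}")
                version, pos = match.group(), match.end()
            parts.append(("product", name, version))
        if pos < len(value):
            # Elements must be separated by required whitespace (RWS).
            ws = _RWS.match(value, pos)
            if ws is None:
                raise ValueError(f"expected whitespace at offset {pos}")
            pos = ws.end()
    if not parts:
        raise ValueError("User-Agent needs at least one product")
    return parts
```

Each branch of the loop corresponds to one alternative in the ABNF rule, which is what makes these definitions pleasant to implement by hand.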
Lists of things
The most common way that HTTP headers allow you to specify lists of tokens is using the #(x) syntax[4], which means you can have zero or more x delimited with a comma. As you can see from the example above, you can also have whitespace delimited lists. Parameters, which are often lists within a single element of a comma delimited list, will be delimited by semi-colons.
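As an illustration, the Python sketch below parses a comma delimited list whose elements may carry semi-colon delimited parameters, such as an Accept header. Both function names are hypothetical. It assumes that delimiters inside quoted-strings must be ignored and it does not handle comments, so it is not suitable for every header.

```python
def split_outside_quotes(value, delimiter):
    """Split on `delimiter`, ignoring delimiters inside quoted-strings."""
    items, current = [], []
    in_quotes = escaped = False
    for ch in value:
        if escaped:
            current.append(ch)
            escaped = False
        elif in_quotes and ch == "\\":
            current.append(ch)
            escaped = True
        elif ch == '"':
            current.append(ch)
            in_quotes = not in_quotes
        elif ch == delimiter and not in_quotes:
            items.append("".join(current))
            current = []
        else:
            current.append(ch)
    items.append("".join(current))
    # Trim the optional whitespace around each item.
    return [item.strip(" \t") for item in items]


def parse_list(value):
    """Return (element, parameters) pairs from a comma delimited list."""
    result = []
    for element in split_outside_quotes(value, ","):
        if not element:
            continue  # tolerate empty list elements
        head, *params = split_outside_quotes(element, ";")
        parsed = {}
        for param in params:
            name, _, val = param.partition("=")
            parsed[name.strip(" \t").lower()] = val.strip(" \t")
        result.append((head, parsed))
    return result
```

Parameter values are returned as they appear in the header, so a quoted-string keeps its surrounding quotes.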
An implementation
The various HTTP frameworks that I have reviewed have varying support for parsing of header values. Considering that it is just a bunch of grammar rules for parsing and production of strings, it would seem useful to me if there was a library that allowed developers to ensure that the headers they are sending conform to the specification without needing to constantly refer to the specification rules.
I believe the key requirements of a .NET framework library for HTTP header parsing and generating are:
- support for all standard headers
- support for creating new headers that use the standard header primitives
- allow for parsing/generating individual headers
- generate warnings for invalid headers when parsing and do best guess parsing
- have no external dependencies other than the .NET framework
- make efficient use of memory and be fast enough that its usage is negligible on the overall processing time of the message.
I'm currently working on an OSS library to do this, with the hope of being able to use it within OWIN middleware. I'll blog about it when I have something to show.
[1] I suspect the reason header wrapping exists is that headers were originally defined to be part of email messages that were limited to an 80 character line length. It's really not relevant for headers in an HTTP message where there is no need to wrap.
[2] When specifications make a change to deprecate certain behaviour it is important to remember Postel's law. When parsing headers, we have to assume that old clients/servers are going to continue doing the line folding, so it is essential that we write components that accept folded headers but we should never do it when generating headers.
[3]
Allowed Characters
Hex | Chars | Token | Quoted String | Comment |
00-08 | | | | |
09 | <tab> | | y | y |
0A-1F | | | | |
20 | <Space> | | y | y |
21 | ! | y | y | y |
22 | " | | | y |
23-27 | #$%&' | y | y | y |
28-29 | () | | y | |
2A-2B | *+ | y | y | y |
2C | , | | y | y |
2D-2E | -. | y | y | y |
2F | / | | y | y |
30-39 | <digit> | y | y | y |
3A-40 | :;<=>?@ | | y | y |
41-5A | <ALPHA> | y | y | y |
5B | [ | | y | y |
5C | \ | | | |
5D | ] | | y | y |
5E-60 | ^_` | y | y | y |
61-7A | <alpha> | y | y | y |
7B | { | | y | y |
7C | <pipe> | y | y | y |
7D | } | | y | y |
7E | ~ | y | y | y |
[4] The #() list syntax is an extension to the standard ABNF rules that is defined within the HTTP specification.
Image credit: Raise hand https://flic.kr/p/2vgyWN
Image credit: Package https://flic.kr/p/8sgnwu