Jose,
You’re correct insofar as the various components in an HTTP request all come from well-defined sources (with the possible exception of determining the hostname of a request, which is a bit tricky). What isn’t so obvious, however, is how these may be combined by bad actors to create undesired request URIs. There are a number of attack vectors that exploit server URI parsing as a basis for further downstream exploits (see [1], [2], [3]).
My planned approach to managing this in Bandit is to build URIs roughly as follows:
1. Figure out the scheme used for the request - from the perspective of Bandit, this is either http or https depending on the underlying transport. Situations where this may be overridden by forwarding proxies (e.g. via `X-` headers) are explicitly outside the scope of Bandit; we’re only concerned with explicit HTTP semantics.
2. Determine the hostname & port used for the request by consulting a specific list of sources: the Host header, the `:authority` pseudo-header, and so on. Construct a URI from scheme, host & port & normalize it. Validate that the resulting path is “/” and that the query string is empty.
3. Determine the path & query string from the request by analyzing the request line / `:path` pseudo-header. Construct a URI from this & normalize it. Validate that the resulting scheme, host & port are empty.
4. Merge these two URIs together, resulting in one where all fields are known to come from the specific sources above (a rough sketch follows this list).
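To make that concrete, here’s a rough sketch in Elixir using the standard library’s `URI` module (whose `URI.new/1`, as of Elixir 1.13, already rejects strings that don’t match the RFC 3986 grammar). The module and function names here (`URIBuilder`, `authority_uri/2`, etc.) are purely illustrative and not Bandit’s actual API:

```elixir
defmodule URIBuilder do
  # Steps 1 & 2: scheme comes from the transport; host & port come from
  # the Host header / `:authority` pseudo-header. The result must be a
  # URI with an empty query and a "/" (or absent) path.
  def authority_uri(scheme, host_and_port) do
    case URI.new("#{scheme}://#{host_and_port}") do
      {:ok, %URI{path: path, query: nil} = uri} when path in [nil, "/"] -> {:ok, uri}
      _ -> {:error, :invalid_authority}
    end
  end

  # Step 3: path & query come from the request line / `:path`
  # pseudo-header. The result must have no scheme, host or port.
  def path_uri(path_and_query) do
    case URI.new(path_and_query) do
      {:ok, %URI{scheme: nil, host: nil, port: nil} = uri} -> {:ok, uri}
      _ -> {:error, :invalid_path}
    end
  end

  # Step 4: merge the two. URI.merge/2 resolves the second URI against
  # the first per RFC 3986's reference resolution rules.
  def request_uri(scheme, host_and_port, path_and_query) do
    with {:ok, base} <- authority_uri(scheme, host_and_port),
         {:ok, rel} <- path_uri(path_and_query) do
      {:ok, URI.merge(base, rel)}
    end
  end
end

# URIBuilder.request_uri("http", "example.com", "/foo?bar=1")
# #=> {:ok, %URI{scheme: "http", host: "example.com", path: "/foo", query: "bar=1", ...}}
# URIBuilder.request_uri("http", "example.com/smuggled", "/foo")
# #=> {:error, :invalid_authority}
```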
In truth I suspect that the full answer here is no doubt a lot longer and more nuanced than I’m able to appreciate. My (possibly naive) hope is to apply some well-defined heuristics to build & normalize a request as early as possible in the request lifecycle, so as to ensure that Plug users can rely on their request parameters at least being valid & sanitized at a protocol level.
In terms of specific validations, I would propose that each field be validated against the grammars defined in RFC 3986 [4]. Concerning normalization heuristics, a number are described in section 6 of the same RFC, though I can think of a few others that would likely be good to include. The specific normalization heuristics used should be called out in documentation.
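For illustration, here’s what a couple of the section 6 heuristics might look like; Elixir’s `URI` module doesn’t ship a normalizer, so this is hand-rolled and the module name is hypothetical:

```elixir
defmodule URINormalizer do
  def normalize(%URI{} = uri) do
    # Case normalization (6.2.2.1): scheme and host are case-insensitive,
    # so fold both to lowercase.
    scheme = uri.scheme && String.downcase(uri.scheme)
    host = uri.host && String.downcase(uri.host)

    # Scheme-based normalization (6.2.3): drop an explicit port that
    # matches the scheme's default, and give an absent path a "/" when
    # an authority is present.
    port = if scheme && uri.port == URI.default_port(scheme), do: nil, else: uri.port
    path = if host, do: uri.path || "/", else: uri.path

    # Percent-encoding normalization (6.2.2.2) and dot-segment removal
    # (6.2.2.3) are among the others worth including; omitted for brevity.
    %URI{uri | scheme: scheme, host: host, port: port, path: path}
  end
end
```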
The question of whether we would want to expose validation and normalization as discrete functions against a URI isn’t one I have a strong opinion on. My hunch is that expectations vary widely by use case, so it’s probably better to leave them separate.
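Concretely, leaving them separate might look something like this (names entirely hypothetical, building on the sketches above):

```elixir
defmodule RequestURI do
  # Validate against the RFC 3986 grammar by round-tripping through
  # URI.new/1, which rejects strings that don't match the grammar.
  @spec validate(URI.t()) :: {:ok, URI.t()} | {:error, String.t()}
  def validate(%URI{} = uri), do: URI.new(URI.to_string(uri))

  # Normalization remains a separate, opt-in step (per the URINormalizer
  # sketch above).
  @spec normalize(URI.t()) :: URI.t()
  def normalize(%URI{} = uri), do: URINormalizer.normalize(uri)
end
```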
m.