Are raw header values sanitized?


David Bertoni

Sep 5, 2023, 6:50:58 PM
to net-dev
Hi,

We're hitting a DCHECK failure in the Extensions layer when we store the status line from a response in a base::Value. The DCHECK indicates the status line is not a valid UTF-8 byte sequence.

This is the member function we're calling to get the value:


My guess is that this value is not sanitized to guarantee it's a valid UTF-8 byte sequence, but I was hoping someone here could verify my analysis. If it helps, this seems to be happening only on Windows clients.
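
The pattern that trips the DCHECK looks roughly like this (a sketch, assuming the status line comes straight from net::HttpResponseHeaders::GetStatusLine(); not the actual extensions code):

// Sketch only. base::Value's string handling DCHECKs base::IsStringUTF8(),
// so storing a raw status line containing non-UTF-8 bytes trips it.
#include "base/values.h"
#include "net/http/http_response_headers.h"

base::Value::Dict BuildDetails(const net::HttpResponseHeaders& headers) {
  base::Value::Dict details;
  // GetStatusLine() returns the raw bytes from the wire; nothing
  // guarantees they form valid UTF-8.
  details.Set("statusLine", headers.GetStatusLine());  // DCHECK fires here.
  return details;
}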

Regards,

Dave

David Benjamin

Sep 6, 2023, 1:51:17 PM
to David Bertoni, net-dev
The HTTP protocol is not based on UTF-8. Field values are just arbitrary byte sequences. See obs-text in https://www.rfc-editor.org/rfc/rfc9110#section-5.5

That is, the issue is not that it wasn't sanitized, but that there isn't anything to sanitize. The type of an HTTP header is simply "byte sequence", not "UTF-8 string". That's also why you'll see that the Fetch spec is careful to define all these fields as byte sequences. Likewise, all the JS APIs use the goofy ByteString IDL type, which is converted to JS strings basically as Latin-1. (Whether you believe this is Latin-1 or if it's stuffing the bytes into 0..255 UTF-16 code units in a JS string is a matter of how you interpret the data. It's the same operation either way.)
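
(For illustration, that ByteString-to-JS-string conversion is essentially the following; a hypothetical sketch, not Blink's actual implementation.)

// Each byte 0x00..0xFF becomes the UTF-16 code unit U+0000..U+00FF, so
// every possible byte sequence is representable in a JS string.
#include <string>

std::u16string ByteStringToJSString(const std::string& bytes) {
  std::u16string out;
  out.reserve(bytes.size());
  for (unsigned char c : bytes)
    out.push_back(static_cast<char16_t>(c));  // e.g. 0xFF -> U+00FF
  return out;
}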

So, the DCHECK indicates a bug in the extensions code. Whenever you represent values from HTTP, you must correctly handle the full space of possible values. What "correctly" means will depend on context, but if you're exposing them to JavaScript, aligning with Fetch would be most natural.


David Benjamin

Sep 7, 2023, 2:57:20 PM
to David Bertoni, net-dev
I noticed that code doesn't have many callers... is this the function that's hitting the DCHECK?
https://source.chromium.org/chromium/chromium/src/+/main:extensions/browser/api/web_request/web_request_event_details.cc;drc=eef4762779d05708d5dfc7d5fe4ea16288069a35;l=126

I'm guessing this is indeed meant to be exposed to JavaScript in the extension, since it's a base::Value::Dict? base::Value::Dict is a JSON dictionary, which carries Unicode strings, not byte strings, hence the DCHECK. So, yeah, you'll need to define what the right representation of an HTTP header is. I see that, for HTTP headers, you all actually do this odd thing:

This works, but keep in mind that it's different from what the web platform does. Matching the web platform seems a much better model (it's weird that Fetch and extensions don't see the same header representation), but it would be a behavior change for all non-ASCII values. If an HTTP status line contains the byte sequence 0xE2 0x98 0x83 (the UTF-8 encoding of U+2603), you all are currently representing it as the Unicode string U+2603, while Fetch represents it as the Unicode string U+00E2 U+0098 U+0083. This is goofy, but it means you can actually represent, say, 0xFF as U+00FF, whereas attempting a UTF-8 decode will fail to represent that at all.
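
To make the difference concrete, the web-platform-style mapping is the following (a sketch, not Chromium's actual code):

// Decode the raw bytes as Latin-1, then encode the resulting code points
// as UTF-8. Bytes >= 0x80 become two-byte UTF-8 sequences, so the mapping
// is total and injective.
#include <string>

std::string Latin1BytesToUTF8(const std::string& bytes) {
  std::string out;
  for (unsigned char c : bytes) {
    if (c < 0x80) {
      out.push_back(static_cast<char>(c));
    } else {
      out.push_back(static_cast<char>(0xC0 | (c >> 6)));
      out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
    }
  }
  return out;
}

// Latin1BytesToUTF8("\xE2\x98\x83") yields "\xC3\xA2\xC2\x98\xC2\x83"
// (U+00E2 U+0098 U+0083), while a UTF-8 validity check passes
// "\xE2\x98\x83" (U+2603) through unchanged.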

(HTTP header values and reason phrases both support obs-text. Fetch handles them both with the weird ByteString/Latin-1 thing.)

David Bertoni

Sep 7, 2023, 2:57:24 PM
to David Benjamin, net-dev
Hi David,

Thanks for your reply! A few more comments inline below.

On Wed, Sep 6, 2023 at 11:03 AM David Benjamin <davi...@chromium.org> wrote:
Yes, that's the call that's causing the problem. I suspected there was no attempt to coerce the headers into UTF-8, and in the absence of any knowledge of the encoding, anything is going to be a hack.

I do see this in the RFC:

"Historically, HTTP allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding."

So there's no attempt to assume that encoding?
 


> I'm guessing this is indeed meant to be exposed to JavaScript in the extension, since it's a base::Value::Dict? base::Value::Dict is a JSON dictionary, which carries Unicode strings, not byte strings, hence the DCHECK. So, yeah, you'll need to define what the right representation of an HTTP header is. I see that, for HTTP headers, you all actually do this odd thing:
>
> This works, but keep in mind that it's different from what the web platform does. Matching the web platform seems a much better model (it's weird that Fetch and extensions don't see the same header representation), but it would be a behavior change for all non-ASCII values. If an HTTP status line contains the byte sequence 0xE2 0x98 0x83 (the UTF-8 encoding of U+2603), you all are currently representing it as the Unicode string U+2603, while Fetch represents it as the Unicode string U+00E2 U+0098 U+0083. This is goofy, but it means you can actually represent, say, 0xFF as U+00FF, whereas attempting a UTF-8 decode will fail to represent that at all.
I also don't quite understand what you're saying here, and it doesn't seem to line up with what our code does. As long as a header value is structurally valid UTF-8, we just store it as-is in the dictionary. So why would a string containing 0xE2 0x98 0x83, which is structurally valid, be converted by our code?

Also, our code converts any string that's not structurally valid UTF-8 into a _list_ of integers, where every value in the list is the integer value of each byte in the string:


That list then becomes the value of the key in the dictionary. What you're describing seems like we're converting UTF-8 byte sequences to Unicode code points, then encoding them as UTF-32. Or am I misunderstanding you?
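
Roughly, the conversion I described above looks like this (a sketch, not the literal extensions code):

// Valid UTF-8 is stored as a string; anything else becomes a list of
// per-byte integer values.
#include "base/strings/string_util.h"
#include "base/values.h"

base::Value HeaderValueToValue(const std::string& raw) {
  if (base::IsStringUTF8(raw))
    return base::Value(raw);
  base::Value::List bytes;
  for (unsigned char c : raw)
    bytes.Append(static_cast<int>(c));  // e.g. 0xFF -> 255
  return base::Value(std::move(bytes));
}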

I do agree with you that consistency with the web platform is the right thing, but we'll need to figure out a plan moving forward, since aligning with Fetch will break compatibility. My preference would be to replace any non-ASCII bytes that aren't part of a valid UTF-8 byte sequence with a replacement character, probably '?'. That would preserve the ASCII portion of the status line.
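
Something along these lines (a sketch; it checks only structural validity and, for brevity, ignores overlong and surrogate forms):

#include <string>

// Copy well-formed UTF-8 sequences through; replace each byte of any
// ill-formed sequence with '?', preserving the ASCII portion.
std::string ReplaceInvalidUTF8(const std::string& in) {
  std::string out;
  size_t i = 0;
  while (i < in.size()) {
    const unsigned char c = in[i];
    const size_t len =
        c < 0x80 ? 1 :
        (c & 0xE0) == 0xC0 ? 2 :
        (c & 0xF0) == 0xE0 ? 3 :
        (c & 0xF8) == 0xF0 ? 4 : 0;
    bool ok = len != 0 && i + len <= in.size();
    for (size_t j = 1; ok && j < len; ++j)
      ok = (static_cast<unsigned char>(in[i + j]) & 0xC0) == 0x80;
    if (ok) {
      out.append(in, i, len);
      i += len;
    } else {
      out.push_back('?');
      ++i;
    }
  }
  return out;
}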

Thoughts?

Dave

David Benjamin

Sep 7, 2023, 2:57:27 PM
to David Bertoni, net-dev
On Thu, Sep 7, 2023 at 1:48 PM David Bertoni <dber...@chromium.org> wrote:
> Hi David,
>
> Thanks for your reply! A few more comments inline below.
>
> On Wed, Sep 6, 2023 at 11:03 AM David Benjamin <davi...@chromium.org> wrote:
> Yes, that's the call that's causing the problem. I suspected there was no attempt to coerce the headers into UTF-8, and in the absence of any knowledge of the encoding, anything is going to be a hack.
>
> I do see this in the RFC:
>
> "Historically, HTTP allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding."
>
> So there's no attempt to assume that encoding?

Not in //net, no. Indeed, if we did that, it would just force the Fetch semantics and break compatibility for you all in the same way as discussed below. The representation that Fetch picks is really the same thing as interpreting the bytes as ISO-8859-1. However, at the level of //net and almost every HTTP API I've ever seen, headers are just represented as byte strings and it is the caller's responsibility to decide what they want to do with byte strings.
 
>> I'm guessing this is indeed meant to be exposed to JavaScript in the extension, since it's a base::Value::Dict? base::Value::Dict is a JSON dictionary, which carries Unicode strings, not byte strings, hence the DCHECK. So, yeah, you'll need to define what the right representation of an HTTP header is. I see that, for HTTP headers, you all actually do this odd thing:
>>
>> This works, but keep in mind that it's different from what the web platform does. Matching the web platform seems a much better model (it's weird that Fetch and extensions don't see the same header representation), but it would be a behavior change for all non-ASCII values. If an HTTP status line contains the byte sequence 0xE2 0x98 0x83 (the UTF-8 encoding of U+2603), you all are currently representing it as the Unicode string U+2603, while Fetch represents it as the Unicode string U+00E2 U+0098 U+0083. This is goofy, but it means you can actually represent, say, 0xFF as U+00FF, whereas attempting a UTF-8 decode will fail to represent that at all.
>
> I also don't quite understand what you're saying here, and it doesn't seem to line up with what our code does. As long as a header value is structurally valid UTF-8, we just store it as-is in the dictionary. So why would a string containing 0xE2 0x98 0x83, which is structurally valid, be converted by our code?
>
> Also, our code converts any string that's not structurally valid UTF-8 into a _list_ of integers, where every value in the list is the integer value of each byte in the string:
>
> That list then becomes the value of the key in the dictionary. What you're describing seems like we're converting UTF-8 byte sequences to Unicode code points, then encoding them as UTF-32. Or am I misunderstanding you?

"Structurally valid" is meaningless in this context. The DCHECK is a type error in the extensions code, not a problem with the value. In both HTTP itself and //net, the type of a header value and reason string is byte string, not Unicode string. There is no guarantee or expectation that non-ASCII bytes are UTF-8. How to interpret those byte strings is the caller's responsibility, and the caller must be prepared to receive all possible byte strings. It's unfortunate that we have this sharp edge, and that they're just tossed into std::string, and that they look text-like, but sadly HTTP and //net come from a long history of systems being sloppy about what encoding is in use, so here we are.

"Decode as UTF-8 or return list of integers if not UTF-8" is a perfectly injective JSON representation of an HTTP header. It's just not the one that the web platform uses, and has a really weird discontinuity once your byte string doesn't happen to be UTF-8.

As for what I was describing, that wasn't UTF-32 encoding. I just described the process abstractly as a sequence of code points because JSON contains Unicode strings. We happen to represent those Unicode strings in C++ as UTF-8, but thinking of them as Unicode strings makes it clearer why some byte strings are unrepresentable if you try to decode them as UTF-8. In particular, that JSON data will immediately get sent to JavaScript, where Unicode strings are represented as UTF-16. So describing it as either UTF-8 or UTF-16 would be weird. (So, in that vein, yes, your code is ultimately converting 0xE2 0x98 0x83, because it needs to end up in a JavaScript string somehow. Your code turns it into a JavaScript string that represents U+2603, as it happens in UTF-16. The web-platform-consistent representation would instead be a different JavaScript string that represents U+00E2 U+0098 U+0083, again in UTF-16 as it happens.)

If you would prefer to phrase it in terms of UTF-8, sure: your current implementation is the identity function restricted to valid UTF-8 encodings, whereas the web-platform-consistent transformation is to decode as Latin-1/ISO-8859-1 and then encode as UTF-8. It's just a weird way to think of it, because the string will still get decoded as UTF-8 and then encoded as UTF-16 when it makes its way to the extension's JavaScript, and it's less clear why your identity function fundamentally needed to be restricted.
 
> I do agree with you that consistency with the web platform is the right thing, but we'll need to figure out a plan moving forward, since aligning with Fetch will break compatibility. My preference would be to replace any non-ASCII bytes that aren't part of a valid UTF-8 byte sequence with a replacement character, probably '?'. That would preserve the ASCII portion of the status line.

That is a perfectly well-defined function that can take all possible HTTP values. It's just not injective (i.e. multiple different values will be represented the same) and not what the web platform does. Whether that meets the needs of extensions, I'll defer to you all. It does seem likely to cause developer confusion though. Given all this is an edge case, I think you should switch to the web platform interpretation and skip the intermediate step for reason strings. And then, separately, you all can work on aligning the header values, which will probably be a bit more work.

As for which replacement character, U+FFFD is a bit more common than ?, though neither option will be injective or consistent with the web platform.

David 