Intent to Implement and Ship: Numeric character reference fallback for file upload characters not representable in accept-charset

Benjamin Wiley Sittler

Nov 7, 2017, 8:42:44 PM
to blink-dev
Contact emails
bsit...@chromium.org

Spec
None (this aspect of character encoding error handling is not yet specified in the HTML standard)

Summary
Change <input type="file"> filename encoding in multipart/form-data uploads in forms with non-Unicode accept-charset to use HTML-style numeric character references rather than '?' when a filename the user selects contains characters not representable in the target character encoding. This change would align our behavior with the existing behavior of Firefox and Edge. It is hoped that this behavior can eventually reach cross-browser consensus and standardization.

Motivation
This is a character encoding interoperability bugfix in a non-standardized part of the web platform. It will reduce accidental data loss in file uploads for web forms that still use character sets unable to represent the full Unicode character repertoire, in the case where the file the user chooses to upload has a name containing one or more characters that are not representable in the character encoding specified in the form's accept-charset or inherited from the web page. Once Chrome, Firefox and Edge all have interoperable fallback encoding it should be proposed for inclusion in the standard.

Firefox: Shipped (see results in https://crbug.com/661819)
Edge: Shipped (see results in https://crbug.com/661819#c4)
Safari: No public signals
Web developers: No signals

I believe this change poses no interoperability risk given that Edge and Firefox already do this and have for at least a year and quite likely much longer. I believe this change poses very little compatibility risk. To applications the only visible change should be that some form submissions that formerly uploaded files apparently having '?' in their names from Chrome will henceforth instead upload files apparently having '&#NNNNN;' in their names, exactly as if those uploads had been made using Firefox or Edge.
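Ongoing technical constraints
None.

Will this feature be supported on all six Blink platforms (Windows, Mac, Linux, Chrome OS, Android, and Android WebView)?
Yes.

OWP launch tracking bug
None. See https://crbug.com/661819 however.

Link to entry on the feature dashboard
https://www.chromestatus.com/features/5634575908732928

Requesting approval to ship?
Yes.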
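
As a rough illustration of the change described above (this is not Chrome's implementation), Python's built-in `replace` and `xmlcharrefreplace` error handlers happen to produce the two fallback styles under discussion. The sample filename and the choice of windows-1252 as the form encoding are assumptions made for the example.

```python
# Illustrative sketch only; the filename and form encoding are assumed.
filename = "snow\u2603man.txt"   # U+2603 SNOWMAN is not in windows-1252
form_encoding = "windows-1252"

# Old fallback style: each unrepresentable character becomes '?'.
old_style = filename.encode(form_encoding, errors="replace")

# Proposed fallback style: decimal numeric character references.
new_style = filename.encode(form_encoding, errors="xmlcharrefreplace")

print(old_style)  # b'snow?man.txt'
print(new_style)  # b'snow&#9731;man.txt'
```

With '?' the original character cannot be recovered at all, whereas the numeric character reference at least preserves the code point.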

Victor Costan

Nov 7, 2017, 8:52:31 PM
to Benjamin Wiley Sittler, blink-dev
Clarification for the Spec field:
I think that the filename field should be covered by RFC 7578 [1]. However, as the e-mail states, browser implementations deviate from the behaviors recommended in the RFC. I think that we should follow other implementations and get this decision documented.


    Victor


--
You received this message because you are subscribed to the Google Groups "blink-dev" group.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/blink-dev/CAPVbROrjo%3D%2BQ-3HSV26-gVKT6nBfvWucnzRowa%2B1MJizuD7Gxg%40mail.gmail.com.

Benjamin Wiley Sittler

Nov 7, 2017, 8:57:24 PM
to blink-dev, bsit...@chromium.org
The HTML standard specifically overrides the RFC in this case, by specifying that browsers will do fallback character encoding (in an unspecified manner!) for file names prior to following the RFC-defined steps for form data encoding. From html.spec.whatwg.org/multipage/form-control-infrastructure.html#multipart-form-data :

"File names included in the generated multipart/form-data resource (as part of file fields) must use the character encoding selected above, though the precise name may be approximated if necessary (e.g. newlines could be removed from file names, quotes could be changed to "%22", and characters not expressible in the selected character encoding could be replaced by other characters)."
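
The approximations the quoted text allows could be sketched roughly as follows. The helper name `approximate_filename`, the form field name "file", and the sample file name are all hypothetical. The spec text says only that such replacements "could" be made, so this is one possible reading rather than a specified algorithm.

```python
def approximate_filename(name: str, encoding: str) -> bytes:
    """Hypothetical helper: approximate a file name for a multipart/form-data part."""
    # Remove newlines, one of the approximations the quoted text mentions.
    name = name.replace("\r", "").replace("\n", "")
    # Change double quotes to "%22" so the quoted filename parameter stays intact.
    name = name.replace('"', "%22")
    # Replace characters the selected encoding cannot express; here with
    # numeric character references, the fallback proposed in this thread.
    return name.encode(encoding, errors="xmlcharrefreplace")


approximated = approximate_filename('my "caf\u00e9"\n\u2603.txt', "iso-8859-1")
header = (
    b'Content-Disposition: form-data; name="file"; filename="'
    + approximated
    + b'"'
)
print(header)
```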

Victor Costan

Nov 7, 2017, 9:27:33 PM
to Benjamin Wiley Sittler, blink-dev
Awesome! It seems like this is the content for the Spec field, then.


Benjamin Wiley Sittler

Nov 7, 2017, 9:33:48 PM
to Victor Costan, Benjamin Wiley Sittler, blink-dev
Chromestatus only allows URLs in the Spec field. The behavior change proposed here is in a part not described by the HTML spec, so from that point of view this change has no relevant spec, and Chrome's behavior appears to conform equally well with or without this change.

Rick Byers

Nov 8, 2017, 12:28:55 AM
to bsit...@chromium.org, Victor Costan, blink-dev
I think it's debatable to what extent this is web-exposed (and so requires an intent). I.e., there's no way for JS to observe what filename was actually used (except indirectly, e.g., by asking the user to upload it), right? To what extent might it make sense to try to test this behavior with web-platform-tests?

Regardless, LGTM1

Anne van Kesteren

Nov 8, 2017, 3:07:44 AM
to Rick Byers, bsit...@chromium.org, Victor Costan, blink-dev
On Wed, Nov 8, 2017 at 6:28 AM, Rick Byers <rby...@chromium.org> wrote:
> I think it's debatable the extent to which this is web exposed (and so
> requires an intent). I.e. there's no way for JS to observe what filename
> was actually used (except indirectly by, eg., asking the user to upload it),
> right? To what extent might it make sense to try to test this behavior with
> web-platform-tests?

Exposed to the server is web-exposed. But also, this is exposed
directly to JavaScript through <input>.files. And some oddities get
exposed through the FormData API and Request and Response objects
(their formData() method).
https://www.w3.org/Bugs/Public/show_bug.cgi?id=16909 has some
information on other issues with multipart/form-data and there's been
various other things that popped up over the years. At some point
someone is going to have to write lots of tests and figure out a
sensible parser and serializer that meet all the various compatibility
requirements.

Using character references by the way seems sensible and is the "html"
error handling of the Encoding Standard.
https://encoding.spec.whatwg.org/#concept-encoding-process has details
(though it might be a little tricky to trace all the callers of that
algorithm). It's the only kind of error handling other than "fatal"
that makes sense for legacy encodings (most of which cannot represent
U+FFFD). (We never defined the "?" way as acceptable and Firefox
recently changed.)


--
https://annevankesteren.nl/
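
A minimal sketch of the difference between the "fatal" and "html" error modes mentioned above, written as a per-code-point loop. The function name `encode_html_mode` is hypothetical, and encoding one character at a time is a simplification that only suits stateless encodings such as windows-1252 (the encoding chosen here is an assumption).

```python
def encode_html_mode(text: str, encoding: str) -> bytes:
    """Hypothetical sketch: encode text, falling back to '&#N;' on errors."""
    out = bytearray()
    for ch in text:
        try:
            # A "fatal"-style attempt: raises if ch is unrepresentable.
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            # "html"-style fallback: decimal numeric character reference.
            out += b"&#%d;" % ord(ch)
    return bytes(out)


# U+FFFD itself cannot be represented in this legacy encoding, so it is
# not usable as a replacement character there.
try:
    "\ufffd".encode("windows-1252")
    fffd_representable = True
except UnicodeEncodeError:
    fffd_representable = False

print(fffd_representable)                                   # False
print(encode_html_mode("caf\u00e9 \u2603", "windows-1252"))  # b'caf\xe9 &#9731;'
```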

Rick Byers

Nov 8, 2017, 11:21:07 AM
to Anne van Kesteren, bsit...@chromium.org, Victor Costan, blink-dev
Ah, I see - thank you. OK, this definitely requires an intent then, and so the question becomes whether we should also try to make some incremental improvement to the spec and web-platform-tests state here as part of shipping this change in Chrome. Obviously we wouldn't want to block this improvement on a ton of spec/test work, but maybe there's some small thing we can do to start to pave the road for improving this space? Thoughts?

Benjamin Wiley Sittler

Nov 8, 2017, 12:21:44 PM
to Rick Byers, Anne van Kesteren, Benjamin Wiley Sittler, Victor Costan, blink-dev
It is my understanding (perhaps incorrect?) that the FormData API does not expose acceptCharset/"form charset" and always uses an encoding capable of representing the full Unicode repertoire, and so never encounters this fallback representation. Likewise it is my understanding (again, perhaps incorrect?) that the <input>.files reflection only exposes the filenames in their original form, prior to acceptCharset/"form charset" encoding and replacement.

Given that, I'm not convinced it's all that web-visible - or at least, it is only visible when all these conditions are met:
  1. form uses acceptCharset/"form charset" not capable of encoding the complete Unicode repertoire
  2. user selects file(s) with names outside the repertoire representable in the acceptCharset/"form charset"
  3. script uses <input>.files reflection to read the name
  4. server accepting the form submission uses the fallback-encoded name
  5. data from (3) and (4) are somehow compared or used with the expectation of equality

By the way, it appears that other aspects of upload file naming are also not specified and vary across browsers - for instance, the naming used in the upload filename is constructed differently in Firefox (numeric disambiguator added) and Chrome (no numeric disambiguator added).

All that said, I can still see the value in changing the relevant part of HTML to refer to the Encoding Standard.
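
The mismatch that conditions (3) to (5) above depend on can be sketched as follows. The form charset and file names are assumptions, and the round trip through Python's codecs only stands in for what a browser and server would do.

```python
import html

form_encoding = "iso-8859-1"       # assumed form charset
client_name = "report\u2603.txt"   # original name, as script would read it

# What the receiving side would see after fallback encoding on the wire.
server_name = client_name.encode(form_encoding, "xmlcharrefreplace").decode(form_encoding)

print(client_name == server_name)                  # False
print(html.unescape(server_name) == client_name)   # True

# Unescaping is not a safe inverse in general: a file whose name literally
# contains the text "&#9731;" yields the same bytes on the wire.
literal_name = "report&#9731;.txt"
literal_on_wire = literal_name.encode(form_encoding, "xmlcharrefreplace").decode(form_encoding)
print(literal_on_wire == server_name)              # True
```

So code that compares the two names directly would see a difference, and code that tries to undo the fallback cannot tell an escaped character from a literal "&#...;" sequence in the original name.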

Anne van Kesteren

Nov 8, 2017, 12:49:19 PM
to Benjamin Wiley Sittler, Rick Byers, Victor Costan, blink-dev
On Wed, Nov 8, 2017 at 6:21 PM, Benjamin Wiley Sittler
<bsit...@chromium.org> wrote:
> It is my understanding (perhaps incorrect?) that the FormData API does not
> expose acceptCharset/"form charset" and always uses an encoding capable of
> representing the full Unicode repertoire, and so never encounters this
> fallback representation.

That's correct, we indeed made the algorithm we have default to UTF-8
and only <form> passes in an override.


> Likewise it is my understanding (again, perhaps
> incorrect?) that the <input>.files reflection only exposes the filenames in
> their original form, prior to acceptCharset/"form charset" encoding and
> replacement.

That's also true. I was thinking of setting <input>.files, but FileList
cannot be constructed, I think, and the only other source would be drag
& drop, which also involves the user. The moment we make FileList
constructable, though, this would be directly observable as a side
effect. It seems unlikely folks would think of that, so it might be
better to start figuring something out here anyway.


> 4. server accepting the form submission uses the fallback-encoded name
> 5. data from (3) and (4) are somehow compared or used with the expectation
> of equality

Can also be a service worker.


> By the way, it appears that other aspects of upload file naming are also not
> specified and vary across browsers - for instance, the naming used in the
> upload filename is constructed differently in Firefox (numeric disambiguator
> added) and Chrome (no numeric disambiguator added).

Yeah, see also that bug I linked to for various other problems with
the format. It has never really become a priority for me and it's kind
of a nice project for someone new to standards. Perhaps I should
promote it more somehow.


--
https://annevankesteren.nl/

Benjamin Wiley Sittler

Nov 8, 2017, 1:32:28 PM
to Anne van Kesteren, Benjamin Wiley Sittler, Rick Byers, Victor Costan, blink-dev
Is there any way to even test this in a web platform test? I have tried writing an automated test for it in Chromium but it seems even there (where it is possible to script a local file upload via drag-n-drop thanks to non-web extended APIs exposed by the test runner) the allowed file name encoding is OS- and locale-dependent, and I have no reason to believe checking in a filename containing the needed non-ASCII characters would even work at checkout. Also I have not found any way to determine the OS locale-specified filename character encoding from inside the test. Does WPT have a way to do this?



Anne van Kesteren

Nov 8, 2017, 2:00:23 PM
to Benjamin Wiley Sittler, James Graham, Geoffrey Sneddon, Rick Byers, Victor Costan, blink-dev
On Wed, Nov 8, 2017 at 7:32 PM, Benjamin Wiley Sittler
<bsit...@chromium.org> wrote:
> Is there any way to even test this in a web platform test? I have tried
> writing an automated test for it in Chromium but it seems even there (where
> it is possible to script a local file upload via drag-n-drop thanks to
> non-web extended APIs exposed by the test runner) the allowed file name
> encoding is OS- and locale-dependent, and I have no reason to believe
> checking in a filename containing the needed non-ASCII characters would even
> work at checkout. Also I have not found any way to determine the OS
> locale-specified filename character encoding from inside the test. Does WPT
> have a way to do this?

Only manual tests, which are not great for this, but we'd at least
have something to convert later on. I looked around for an issue on
automating <input type=file> input and all I found was
https://github.com/w3c/web-platform-tests/issues/5613. Maybe James or
Geoffrey know a more canonical issue for it.


--
https://annevankesteren.nl/

Geoffrey Sneddon

Nov 8, 2017, 2:15:17 PM
to Anne van Kesteren, Benjamin Wiley Sittler, James Graham, Rick Byers, Victor Costan, blink-dev

Benjamin Wiley Sittler

Nov 8, 2017, 2:33:53 PM
to Geoffrey Sneddon, Anne van Kesteren, Benjamin Wiley Sittler, James Graham, Rick Byers, Victor Costan, blink-dev
Is it reasonable to have an OS (file system) encoding/locale requirement in a manual test? The test would only be possible when the used file system encoding allows representation of characters outside the repertoire of the form encoding.

Also, this would all be much easier (and completely automatic) to test if there were a way to construct a FileList from a sequence of Blobs and filenames. Is there some standards and/or security rationale for not allowing this?


Anne van Kesteren

Nov 9, 2017, 1:21:29 AM
to Benjamin Wiley Sittler, Geoffrey Sneddon, James Graham, Rick Byers, Victor Costan, blink-dev
On Wed, Nov 8, 2017 at 8:33 PM, Benjamin Wiley Sittler
<bsit...@chromium.org> wrote:
> Is it reasonable to have an OS (file system) encoding/locale requirement in
> a manual test? The test would only be possible when the used file system
> encoding allows representation of characters outside the repertoire of the
> form encoding.

Most file systems are Unicode these days, so I'd think so.


> Also, this would all be much easier (and completely automatic) to test if
> there were a way to construct a FileList from a sequence of Blobs and
> filenames. Is there some standards and/or security rationale for not
> allowing this?

No, just hasn't happened. For assigning to <input>.files this was
proposed for a while, but nobody has been motivated enough to push it
through and make everyone adopt it. So instead you can only assign a
FileList which you can get from another <input> or drag & drop
operation...


--
https://annevankesteren.nl/

Ojan Vafai

Nov 10, 2017, 6:29:44 PM
to Anne van Kesteren, Benjamin Wiley Sittler, Geoffrey Sneddon, James Graham, Rick Byers, Victor Costan, blink-dev
LGTM2. Agree this is very low compatibility risk and it matches Firefox/Edge.


Benjamin Wiley Sittler

Nov 10, 2017, 9:07:14 PM
to Ojan Vafai, Anne van Kesteren, Benjamin Wiley Sittler, Geoffrey Sneddon, James Graham, Rick Byers, Victor Costan, blink-dev
I have filed https://github.com/w3c/html/issues/1077 requesting that this behavior be standardized in HTML. Please take a look and comment there (or here) if you have any questions, comments or concerns about the proposal.

Yoav Weiss

Nov 11, 2017, 6:50:43 AM
to bsit...@chromium.org, Ojan Vafai, Anne van Kesteren, Geoffrey Sneddon, James Graham, Rick Byers, Victor Costan, blink-dev
LGTM3


terenc...@digital.cabinet-office.gov.uk

Nov 14, 2017, 12:35:22 PM
to blink-dev, oj...@chromium.org, ann...@annevk.nl, bsit...@chromium.org, geof...@gmail.com, jgr...@hoppipolla.co.uk, rby...@chromium.org, pwn...@chromium.org
We're happy to accept this into HTML 5.3.  I've added a few clarifying questions on the GitHub Issue.  I'm happy if you want to create a PR to fix the documentation.

Terence (Editor on HTML 5.3)

PhistucK

Nov 14, 2017, 12:38:47 PM
to bsit...@chromium.org, Ojan Vafai, Anne van Kesteren, Geoffrey Sneddon, James Graham, Rick Byers, Victor Costan, blink-dev
Oops, I think it should be filed at the WHATWG HTML repository rather than the W3C HTML repository (that is the general practice in Chrome, as far as I can see) -


PhistucK


Benjamin Wiley Sittler

Nov 14, 2017, 1:04:38 PM
to PhistucK, Benjamin Wiley Sittler, Ojan Vafai, Anne van Kesteren, Geoffrey Sneddon, James Graham, Rick Byers, Victor Costan, blink-dev
Correct, I refiled it at https://github.com/whatwg/html/issues/3223 and closed the original issue when the error was reported to me, but neglected to reply-all when sending the updated link. Since then a W3C HTML editor reopened the original issue and I replied once to a question there before realizing it was in the wrong repository (thanks again to someone else for noticing that issue and telling me!); I have since deleted my reply but a substantially similar clarification is now present in the WHATWG issue.