UTF-8 filename in Content-Disposition of multipart form


Jisoo Park

Mar 26, 2014, 3:12:01 AM
to spray...@googlegroups.com
Hi,

Although there have been similar discussions, I'd like to talk about unicode handling again.

When I upload a file with a UTF-8 name to a Spray server, `BodyPart#filename` returns a US-ASCII encoded string taken from its `Content-Disposition` header. Since Spray has changed its default encoding to UTF-8, supporting Unicode filenames seems natural from my point of view.

What do you think?

Regards,
Jisoo

Johannes Rudolph

Mar 26, 2014, 5:13:52 AM
to spray...@googlegroups.com
Hi Jisoo,

On Wed, Mar 26, 2014 at 8:12 AM, Jisoo Park <xxx...@gmail.com> wrote:
> When I upload a file which has a UTF-8 name to a Spray server,
> `BodyPart#filename` returns a US-ASCII encoded string from its
> `Content-Disposition` header. As Spray has changed the default encoding to
> UTF-8, supporting unicode filenames seems natural in my point of view.

The question is what "natural" means exactly :) As always with
encodings you need a meta-protocol level that describes how a string
is encoded into a sequence of octets (and the other way round). How
are servers encoding those values? By which rules?

It seems that for the `Content-Disposition` header (and other MIME-related
headers) there's RFC 2047 [1], which adds support for other encodings for the
parameter values. Is the `filename` parameter you encounter encoded like
this?

Johannes

[1] http://tools.ietf.org/html/rfc2047
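
For readers unfamiliar with the RFC 2047 "encoded-word" form asked about above, here is a minimal sketch of what such a value looks like and how it decodes. It uses Python's standard `email.header` module purely as an illustration; it is not what Spray or any browser does, and the helper names `encode_word` and `decode_words` are made up for this example.

```python
import base64
from email.header import decode_header, make_header


def encode_word(text: str) -> str:
    """Build an RFC 2047 encoded-word (UTF-8 charset, Base64 'B' encoding)."""
    payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return f"=?UTF-8?B?{payload}?="


def decode_words(value: str) -> str:
    """Decode any encoded-words in a header value back to a Unicode string."""
    return str(make_header(decode_header(value)))


original = "한글.txt"                # hypothetical non-ASCII filename
wire_value = encode_word(original)   # pure ASCII, safe inside a header
decoded = decode_words(wire_value)   # back to the original Unicode string
```

The point of the format is that the value carries its own charset label, so the receiver does not have to guess how the octets were produced.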




--
Johannes

-----------------------------------------------
Johannes Rudolph
http://virtual-void.net

Jisoo Park

Mar 26, 2014, 10:28:02 PM
to spray...@googlegroups.com, johannes...@googlemail.com
Hi Johannes,

Thanks for your detailed response. 

What I found 'natural' came from the previous discussion [1] about changing the default encoding. It's not exactly the same case, but:

1) Many browsers and libraries tend to set the filename field without a proper encoding parameter, regardless of the RFC.

2) UTF-8 is a superset of ISO-8859-1

These observations are the same as in [1], which is why I would prefer UTF-8 in this case as well. Please let me know if there's something I'm missing.

Regards,
Jisoo


Johannes Rudolph

Mar 27, 2014, 4:55:57 AM
to Jisoo Park, spray...@googlegroups.com
Hi Jisoo,

On Thu, Mar 27, 2014 at 3:28 AM, Jisoo Park <xxx...@gmail.com> wrote:
> Hi Johannes,
>
> Thanks for your detailed response.
>
> What I found 'natural' came from the previous discussion [1] about changing
> the default encoding. It's not a 100% same case, however:
>
> 1) Many browsers and libraries tend to set the filename field without proper
> encoding parameter, regardless of the RFC.

Indeed, that's what I also noticed when running a quick test. The
browser I tested with seemed to try an ISO-8859-1 encoding and replace
all characters outside of that encoding with '?'. That is probably a
consequence of what I've described here:

https://github.com/spray/spray/issues/526
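
The lossy substitution described above can be reproduced in a few lines. This is only a sketch of the effect, assuming the browser encodes the name as ISO-8859-1 with '?' as the replacement character; the filename is a made-up example and no browser code is involved.

```python
name = "한글é.txt"  # hypothetical filename: two Hangul characters plus a Latin-1 character

# Encode as ISO-8859-1, substituting '?' for anything the charset cannot represent.
as_sent = name.encode("iso-8859-1", errors="replace")

# What a server decoding the header as ISO-8859-1 would then see.
received = as_sent.decode("iso-8859-1")
```

Here `received` is `"??é.txt"`: the Latin-1 character survives, but the two Hangul characters are gone before the request ever reaches the server, so no server-side decoding choice can bring them back.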

So, if you want to support the full range of UTF8 characters *the HTML
page containing the form* must be encoded using UTF8. Have you tried
that? (I quickly tried but it doesn't decode correctly.)

> 2) UTF-8 is a superset of ISO-8859-1

Superset in which regard? UTF-8 may support a superset of ISO-8859-1's
characters, but it's not a superset in the sense that every valid
ISO-8859-1 encoding is also a valid UTF-8 encoding (ISO-8859-1 uses all
8 bits for single characters, while in UTF-8 the highest-order bit
always has a special meaning).
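
A small sketch of that distinction, using only Python's built-in codecs (the sample strings are arbitrary): the two encodings agree on ASCII, but a byte with the high bit set that is a complete character in ISO-8859-1 is not, on its own, valid UTF-8.

```python
# ASCII-only text produces identical bytes in both encodings.
ascii_same = "abc".encode("iso-8859-1") == "abc".encode("utf-8")

# 'é' is a single byte (0xE9) in ISO-8859-1 ...
latin1_bytes = "é".encode("iso-8859-1")

# ... but two bytes (0xC3 0xA9) in UTF-8.
utf8_bytes = "é".encode("utf-8")

# The lone ISO-8859-1 byte is not a valid UTF-8 sequence.
try:
    latin1_bytes.decode("utf-8")
    valid_as_utf8 = True
except UnicodeDecodeError:
    valid_as_utf8 = False
```

So a decoder cannot simply be switched from ISO-8859-1 to UTF-8 without breaking clients that really do send ISO-8859-1 octets.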

Johannes Rudolph

Mar 27, 2014, 5:33:22 AM
to Jisoo Park, spray...@googlegroups.com
On Thu, Mar 27, 2014 at 9:55 AM, Johannes Rudolph
<johannes...@googlemail.com> wrote:
> On Thu, Mar 27, 2014 at 3:28 AM, Jisoo Park <xxx...@gmail.com> wrote:
> So, if you want to support the full range of UTF8 characters *the HTML
> page containing the form* must be encoded using UTF8. Have you tried
> that? (I quickly tried but it doesn't decode correctly.)

Quick follow-up: the problem is that we currently use the mimepull
library for multipart parsing, and it has ISO-8859-1 encoding
hardcoded [1]. Could you file an issue on spray, Jisoo? Sooner rather
than later we will want to write our own multipart parser, and then we
can improve the situation.

[1] Line 72 in https://java.net/projects/mimepull/sources/svn/content/trunk/src/main/java/org/jvnet/mimepull/MIMEParser.java?rev=209
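
Because ISO-8859-1 maps every byte to exactly one character, a string that was decoded with it still holds the original octets, so an application can in principle undo the decoding itself. The following is only a byte-level sketch of that idea under the assumption that the client really sent raw UTF-8 octets; `redecode_as_utf8` is a hypothetical helper, not part of Spray or mimepull.

```python
def redecode_as_utf8(latin1_decoded: str) -> str:
    """Recover UTF-8 text from a string that was decoded as ISO-8859-1.

    Falls back to the input unchanged when the underlying bytes are not
    valid UTF-8 (for example, when the client really sent ISO-8859-1).
    """
    try:
        raw = latin1_decoded.encode("iso-8859-1")  # get the original octets back
        return raw.decode("utf-8")
    except UnicodeError:
        return latin1_decoded


sent = "한글.txt"  # hypothetical filename sent as raw UTF-8 octets
seen_by_app = sent.encode("utf-8").decode("iso-8859-1")  # what a Latin-1 parser yields
recovered = redecode_as_utf8(seen_by_app)
```

This is a heuristic: it guesses the client's encoding rather than knowing it, which is exactly the missing meta-protocol problem raised earlier in the thread.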

Jisoo Park

Mar 28, 2014, 2:10:12 AM
to spray...@googlegroups.com, Jisoo Park, johannes...@googlemail.com
Indeed, I used 'superset' loosely. Thanks for the correction :)
I just opened a GitHub issue: https://github.com/spray/spray/issues/840
Thanks for the quick and kind help, as usual!