decode tool doesn't decode filename of uploaded files

Nicolas Grilly

May 1, 2007, 1:56:18 PM
to cherryp...@googlegroups.com
Hello again,

It seems the decode tool doesn't decode the filename of uploaded files.

Is this the intended behavior? If not, how can we change it to correctly
decode the filename, and thus convert it to unicode?

I guess we should add the decoding code in the function decode_params
in file lib/encoding.py.

Thanks for your advice about this issue,

-- Nicolas Grilly

Robert Brewer

May 1, 2007, 2:19:56 PM
to cherryp...@googlegroups.com
Nicolas Grilly wrote:
> It seems the decode tool doesn't decode the filename of
> uploaded files.
>
> Is this the intended behavior? If not, how can we change it to correctly
> decode the filename, and thus convert it to unicode?
>
> I guess we should add the decoding code in the function decode_params
> in file lib/encoding.py.

I wouldn't mind decoding the filename, as long as the charset is either
1) unambiguously declared in the payload, or 2) explicitly declared by
the developer.

AFAIK, declaring charset in the payload is already defined for
multipart/form-data. http://www.ietf.org/rfc/rfc2388.txt says:

The original local file name may be supplied as well, either as a
"filename" parameter either of the "content-disposition: form-data"
header or, in the case of multiple files, in a "content-disposition:
file" header of the subpart. The sending application MAY supply a
file name; if the file name of the sender's operating system is not
in US-ASCII, the file name might be approximated, or encoded using
the method of RFC 2231.

and http://www.ietf.org/rfc/rfc2231.txt says:

Specifically, an asterisk at the end of a parameter name acts as an
indicator that character set and language information may appear at
the beginning of the parameter value. A single quote is used to
separate the character set, language, and actual value information in
the parameter value string, and an percent sign is used to flag
octets encoded in hexadecimal. For example:

Content-Type: application/x-stuff;
title*=us-ascii'en-us'This%20is%20%2A%2A%2Afun%2A%2A%2A

...so implementing that should be straightforward. However, I'd be
surprised if your user-agent is doing this. Is it?
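
For reference, the RFC 2231 format quoted above can be decoded in a few lines. This is only a minimal sketch in present-day Python 3 (which postdates this thread); `decode_rfc2231_param` is a hypothetical helper name, not a CherryPy function, and it assumes a single value that is not split across continuation parameters.

```python
from urllib.parse import unquote


def decode_rfc2231_param(raw):
    """Decode one extended value of the form charset'language'encoded-text.

    Hypothetical helper for illustration only; not part of CherryPy.
    Returns a (text, language) tuple.
    """
    # Only the first two single quotes act as delimiters.
    charset, language, encoded = raw.split("'", 2)
    # Percent-decode the octets, then interpret them in the declared charset
    # (falling back to US-ASCII when the charset field is empty).
    text = unquote(encoded, encoding=charset or "us-ascii", errors="strict")
    return text, language


title, lang = decode_rfc2231_param(
    "us-ascii'en-us'This%20is%20%2A%2A%2Afun%2A%2A%2A")
print(title)  # This is ***fun***
```

The same helper would handle a filename such as `iso-8859-1''L%27%E9t%E9%20est%20beau.pdf`, provided the user-agent actually sent it in that form.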

In the absence of explicit declaration in the payload, the only hope is
for the developer to use or override the default of US-ASCII. If you
just use the default, there's little point in adding this to CP, since
unicode(val) uses the default encoding for Python, which tends to be
ASCII anyway:

>>> unicode('\xF3')
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf3 in position 0: ordinal not in range(128)
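
In Python 3 terms, the same failure and the effect of an explicitly declared charset look roughly like the sketch below. ISO-8859-1 is used here only as an example of a developer-supplied encoding; it is an assumption, not something the payload declares.

```python
raw = b"\xf3"  # a single byte: valid ISO-8859-1, invalid US-ASCII

# With the ASCII default, decoding fails outright.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# With a charset declared by the developer, the byte decodes cleanly.
decoded = raw.decode("iso-8859-1")

print(ascii_ok)        # False
print(decoded == "ó")  # True
```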

Robert Brewer
System Architect
Amor Ministries
fuma...@amor.org

Nicolas Grilly

May 1, 2007, 7:34:46 PM
to cherryp...@googlegroups.com
Robert Brewer wrote:
> AFAIK, declaring charset in the payload is already defined for
> multipart/form-data. http://www.ietf.org/rfc/rfc2388.txt says:
> ...

> ...so implementing that should be straightforward. However, I'd be
> surprised if your user-agent is doing this. Is it?

You're right: my user-agents don't respect the RFCs (I've checked with
Firefox 2.0 and Internet Explorer 7.0).

I've looked at the data sent by the user-agents. As expected, the
filename is given in the Content-Disposition header, but is not
encoded according to the RFCs. Here is a sample:

-----------------------------7d71f41c450754
Content-Disposition: form-data; name="your_file"; filename="L'été est beau.pdf"
Content-Type: application/pdf

I did some tests and observed the filename encoding depends on the
encoding declared in the HTML page. For example, if the page is
encoded in ISO-8859-1, the filename is encoded in ISO-8859-1 too, and
if the page is encoded in UTF-8, the filename is encoded in UTF-8 too.
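
The practical consequence can be illustrated with a short Python 3 sketch: the same filename produces different bytes depending on the page encoding, so the server has to know which one was used. The variable names are illustrative only.

```python
filename = "L'été est beau.pdf"

# Bytes a browser would send from pages declared in each encoding.
from_latin1_page = filename.encode("iso-8859-1")
from_utf8_page = filename.encode("utf-8")

# Decoding with the page's own encoding round-trips correctly.
assert from_latin1_page.decode("iso-8859-1") == filename
assert from_utf8_page.decode("utf-8") == filename

# Decoding UTF-8 bytes as ISO-8859-1 silently yields mojibake.
mojibake = from_utf8_page.decode("iso-8859-1")
print(mojibake)  # L'Ã©tÃ© est beau.pdf

# Decoding ISO-8859-1 bytes as UTF-8 fails instead.
try:
    from_latin1_page.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
print(utf8_failed)  # True
```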

Can you confirm this behavior with your own user-agents? If most
user-agents behave like IE 7.0 and Firefox 2.0, we can change the
decode tool to decode the filename, using the encoding explicitly
specified by the developer when initializing the decode tool.

-- Nicolas Grilly

Robert Brewer

May 1, 2007, 7:51:43 PM
to cherryp...@googlegroups.com
Nicolas Grilly wrote:
> Robert Brewer wrote:
> > AFAIK, declaring charset in the payload is already defined for
> > multipart/form-data. http://www.ietf.org/rfc/rfc2388.txt says:
> > ...
> > ...so implementing that should be straightforward. However, I'd be
> > surprised if your user-agent is doing this. Is it?
>
> You're right: my user-agents don't respect the RFCs (I've checked with
> Firefox 2.0 and Internet Explorer 7.0).
>
> I've looked at the data sent by the user-agents. As expected, the
> filename is given in the Content-Disposition header, but is not
> encoded according to the RFCs. Here is a sample:
>
> -----------------------------7d71f41c450754
> Content-Disposition: form-data; name="your_file";
> filename="L'été est beau.pdf"
> Content-Type: application/pdf
>
> I did some tests and observed the filename encoding depends on the
> encoding declared in the HTML page. For example, if the page is
> encoded in ISO-8859-1, the filename is encoded in ISO-8859-1 too, and
> if the page is encoded in UTF-8, the filename is encoded in UTF-8 too.

That's been my experience in the past.

> Can you confirm this behavior with your own user-agents? If most
> user-agents behave like IE 7.0 and Firefox 2.0, we can change the
> decode tool to decode the filename, using the encoding explicitly
> specified by the developer when initializing the decode tool.

Right. I'd add a separate decode_multipart_headers function (with its own try/finally and fallback to ISO-8859-1) so that a decoding error there doesn't stop correct decoding of the params.
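
A rough sketch of that fallback idea, written in Python 3 for illustration. The helper name `decode_filename` and its signature are hypothetical, and the sketch uses try/except to implement the fallback, which is an assumption about how such a function might be written rather than existing CherryPy code.

```python
def decode_filename(raw, encoding, fallback="iso-8859-1"):
    """Decode an uploaded file's name, falling back if the bytes don't fit.

    Hypothetical helper: tries the developer-declared encoding first and
    only then falls back, so a bad filename never aborts the request.
    """
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this cannot fail.
        return raw.decode(fallback)


# Filename bytes as sent from a UTF-8 page and from an ISO-8859-1 page.
utf8_name = "L'été est beau.pdf".encode("utf-8")
latin1_name = "L'été est beau.pdf".encode("iso-8859-1")

print(decode_filename(utf8_name, "utf-8") == "L'été est beau.pdf")    # True
print(decode_filename(latin1_name, "utf-8") == "L'été est beau.pdf")  # True
```

Keeping this separate from the parameter decoding means a failure on the filename would not affect how the ordinary params are decoded.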

Nicolas Grilly

May 1, 2007, 9:37:57 PM
to cherryp...@googlegroups.com
> Right. I'd add a separate decode_multipart_headers function
> (with its own try/finally and fallback to ISO-8859-1) so that a
> decoding error there doesn't stop correct decoding of the
> params.

Why not simply change decode_params this way:

import cgi
import copy

import cherrypy


def decode_params(encoding):
    decoded_params = {}
    for key, value in cherrypy.request.params.items():
        if isinstance(value, cgi.FieldStorage):
            # value is an uploaded file: decode only its filename, on a
            # copy, replacing undecodable bytes instead of raising
            value = copy.copy(value)
            value.filename = value.filename.decode(encoding, 'replace')
            decoded_params[key] = value
        elif isinstance(value, list):
            # value is a list: decode each element
            decoded_params[key] = [v.decode(encoding) for v in value]
        elif isinstance(value, unicode):
            # value is already unicode: keep it unchanged
            decoded_params[key] = value
        else:
            # value is a regular string: decode it
            decoded_params[key] = value.decode(encoding)

    # Decode all or nothing, so we can try again on error.
    cherrypy.request.params = decoded_params


That solves the issue by just adding/changing three lines. The
'replace' parameter guarantees decode won't raise an error if the
filename is undecodable.
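
To make the effect of 'replace' concrete, here is a small Python 3 sketch. The `Upload` class is a hypothetical stand-in for the uploaded-file object (it is not cgi.FieldStorage), and `decode_upload` is an illustrative name.

```python
import copy


class Upload:
    """Hypothetical stand-in for an uploaded file with a bytes filename."""

    def __init__(self, filename):
        self.filename = filename


def decode_upload(upload, encoding):
    # Work on a shallow copy so the original object keeps its raw filename.
    decoded = copy.copy(upload)
    # 'replace' substitutes U+FFFD for each undecodable byte instead of raising.
    decoded.filename = upload.filename.decode(encoding, "replace")
    return decoded


original = Upload(b"L'\xe9t\xe9 est beau.pdf")  # ISO-8859-1 bytes
wrong = decode_upload(original, "utf-8")        # deliberately wrong charset
right = decode_upload(original, "iso-8859-1")   # matching charset

print(wrong.filename == "L'\ufffdt\ufffd est beau.pdf")  # True
print(right.filename == "L'été est beau.pdf")            # True
```

With the wrong charset no exception is raised, but the accented characters are lost, so the encoding passed to the tool still needs to match the page.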

After some googling, I noticed Paste uses the same technique in its
UnicodeMultiDict class. Follow this link and look at line 245:
http://trac.pythonpaste.org/pythonpaste/browser/Paste/trunk/paste/util/multidict.py

-- Nicolas
