Get multipart file upload in UTF-8 encoding?

Andrew Burleson

Apr 4, 2017, 4:31:35 AM
to Roda
I ran into an interesting issue. I'm not sure if this is something where Roda has better options available, or if this is just Rack behavior I'm stuck with...

I have a form that uploads multipart data. The form is set up for UTF-8 correctly, e.g.:

<form method="POST" action="/foo" enctype="multipart/form-data" accept-charset="UTF-8">

But no matter what I do, when I receive a file it comes in encoded in ASCII-8BIT.

In my route I access the file from the params like so:

file = params[:upload_file][:tempfile]


Then I do some processing on the file, but it fails if the file contains any special characters. It turns out the reason is that the upload process has changed the encoding.

I'm trying to figure out where in the call stack I could insert something to make that file stay UTF-8. Any suggestions?

Thanks,
Andrew

Jeremy Evans

Apr 4, 2017, 10:51:15 AM
to Roda
I'm not sure if this is part of the Rack spec. POST data is uploaded in the request body, and the request body must be ASCII-8BIT encoded according to the Rack spec.

It's interesting.  At least in my testing, most params are UTF-8 encoded, but most environment variables are ASCII-8BIT encoded.  For uploaded files, the :filename and :name are UTF-8 encoded, but the :type and :head are ASCII-8BIT encoded.  :tempfile is a Tempfile object, and the external encoding is ASCII-8BIT.

I think having the external encoding of ASCII-8BIT for uploaded files makes sense, since someone could upload a file in any encoding (accept-charset isn't related to the encoding of uploaded files). You shouldn't assume uploaded files are in UTF-8, but if you want to, you can probably modify the external encoding of the Tempfile.
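As a rough illustration of that last point, here is a minimal sketch of two ways to get UTF-8 tagged strings out of a binary Tempfile. The tempfile below is only a stand-in built by hand for the example (it is not what Rack actually creates), and the sketch assumes you already know the upload contains UTF-8 text:

```ruby
require 'tempfile'

# Hand-built stand-in for params[:upload_file][:tempfile]: a binary-mode
# Tempfile that happens to contain UTF-8 bytes.
tempfile = Tempfile.new('upload')
tempfile.binmode
tempfile.write("caf\u00e9 \u2603\n")
tempfile.rewind

# Reading a binary-mode file yields a string tagged ASCII-8BIT (BINARY).
raw = tempfile.read

# Option 1: relabel a copy of the bytes as UTF-8. No bytes are changed.
text = raw.dup.force_encoding(Encoding::UTF_8)

# Option 2: change the external encoding of the Tempfile itself, so that
# later reads return UTF-8 tagged strings.
tempfile.rewind
tempfile.set_encoding(Encoding::UTF_8)
line = tempfile.gets

tempfile.close
tempfile.unlink
```

Neither option validates anything, so checking `valid_encoding?` on the result is still worthwhile before treating the content as text.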

Thanks,
Jeremy

Andrew Burleson

Apr 4, 2017, 9:46:16 PM
to ruby...@googlegroups.com
Do you think it’s safe to *always* coerce files uploaded from ASCII-8BIT to UTF-8?

The behavior I was getting, which is kind of frustrating, is a bunch of unicode characters lost in translation (as I assume there’s no encoding for the characters in ASCII-8BIT).
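One possible cause of this kind of loss, shown here only as a sketch since the thread does not say which call was used, is transcoding the binary string with `encode` rather than relabeling it with `force_encoding`. The variable names are made up for the example:

```ruby
# The UTF-8 bytes for "é" (C3 A9), tagged as ASCII-8BIT like an uploaded file.
utf8_bytes = "\u00e9".b

# Transcoding: bytes above 0x7F have no defined mapping from ASCII-8BIT,
# so a plain encode raises.
error = nil
begin
  utf8_bytes.encode(Encoding::UTF_8)
rescue Encoding::UndefinedConversionError => e
  error = e
end

# Transcoding with replacement options silently swaps each such byte
# for U+FFFD, which destroys the original character.
lossy = utf8_bytes.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)

# Relabeling: the bytes stay exactly as uploaded, only the tag changes.
relabeled = utf8_bytes.dup.force_encoding(Encoding::UTF_8)
```

In this sketch only `relabeled` still equals the original "é". The bytes in an ASCII-8BIT string are intact until something transcodes them.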

So I need to come up with a mechanism for uploading the files without losing the UTF-8 encoding.

Andrew


Jeremy Evans

Apr 4, 2017, 10:45:40 PM
to Roda
On Tuesday, April 4, 2017 at 6:46:16 PM UTC-7, Andrew Burleson wrote:
> Do you think it’s safe to *always* coerce files uploaded from ASCII-8BIT to UTF-8?

No, definitely not in the general case. It may make sense in your specific case, though.

> The behavior I was getting, which is kind of frustrating, is a bunch of unicode characters lost in translation (as I assume there’s no encoding for the characters in ASCII-8BIT).
>
> So I need to come up with a mechanism for uploading the files without losing the UTF-8 encoding.

One way to do this is to check if the UTF-8 encoding is valid, and if so, then use UTF-8 encoding:

  unless string.force_encoding('UTF-8').valid_encoding?
    string.force_encoding('ASCII-8BIT')
  end
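
A non-mutating variant of the same check could look like the sketch below. The helper name `utf8_or_binary` is hypothetical (it is not part of Roda or Rack), and the sample strings are made up for the example:

```ruby
# Hypothetical helper: returns a UTF-8 tagged copy when the bytes form
# valid UTF-8, otherwise a copy tagged ASCII-8BIT. The argument itself
# is never modified.
def utf8_or_binary(bytes)
  candidate = bytes.dup.force_encoding(Encoding::UTF_8)
  if candidate.valid_encoding?
    candidate
  else
    bytes.dup.force_encoding(Encoding::BINARY)
  end
end

valid_upload   = "na\u00efve".b  # valid UTF-8 bytes, tagged as binary
invalid_upload = "\xFF\xFE".b    # bytes that are not valid UTF-8

as_text   = utf8_or_binary(valid_upload)
as_binary = utf8_or_binary(invalid_upload)
```

Here `as_text` comes back tagged UTF-8 and `as_binary` stays ASCII-8BIT. Note that passing the check only means the bytes are well-formed UTF-8, not that the file was actually written in UTF-8.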

Thanks,
Jeremy