API file upload


Zsolt Ero

Apr 17, 2018, 8:25:21 AM
to pylons-...@googlegroups.com
Hi,

I'm trying to implement an API to a website which didn't have an API
yet. Its purpose will be to allow file uploads from 3rd party native
apps.

I'd like to implement the API like Dropbox v2 API, just as a good
reference for API design.

Its upload endpoint has the following specs:
https://www.dropbox.com/developers/documentation/http/documentation#files-upload

And the following cURL example:

curl -X POST https://content.dropboxapi.com/2/files/upload \
--header "Authorization: Bearer " \
--header "Dropbox-API-Arg: {\"path\": \"/Homework/math/Matrices.txt\",\"mode\": \"add\",\"autorename\": true,\"mute\": false}" \
--header "Content-Type: application/octet-stream" \
--data-binary @local_file.txt
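
On the server side, an endpoint designed this way has to decode the JSON carried in that header before touching the body. A minimal sketch, assuming the arguments arrive in a header named like the one in the curl example; the helper name `parse_api_arg` is hypothetical, and `headers` can be any mapping with a `.get()` method:

```python
import json


def parse_api_arg(headers, name="Dropbox-API-Arg"):
    """Decode the JSON object carried in a request header into a dict.

    `name` defaults to the header used in the curl example above; your own
    API would presumably pick its own header name (assumption).
    """
    raw = headers.get(name)
    if raw is None:
        raise ValueError("missing %s header" % name)
    try:
        args = json.loads(raw)
    except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
        raise ValueError("%s is not valid JSON: %s" % (name, exc))
    if not isinstance(args, dict):
        raise ValueError("%s must contain a JSON object" % name)
    return args


# The same payload as in the curl example, held in a plain dict.
example_headers = {
    "Dropbox-API-Arg": '{"path": "/Homework/math/Matrices.txt", '
                       '"mode": "add", "autorename": true, "mute": false}'
}
args = parse_api_arg(example_headers)
```

A plain dict is case-sensitive, so this sketch expects the exact header name when used with one.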

Now my problem is that I've implemented most parts, but if the request
has --header "Content-Type: application/octet-stream" then WebOb
doesn't allow using request.POST. It says:

Not an HTML form submission (Content-Type: application/octet-stream)

If I remove that header, I can use request.POST.keys()[0] to read the
contents of the file as a string.

My question is:
1. What am I doing wrong that the Content-Type is not supported?
2. Is there any downside of having an up-to-100 MB file as a string?
Wouldn't the HTML multipart-form-data's file solution use less memory?
Can I make WebOb handle this kind of uploads like it does multipart
ones?

Zsolt

Zsolt Ero

Apr 17, 2018, 9:11:03 AM
to pylons-...@googlegroups.com
I've realised the following:
1. If I don't specify Content-Type, curl defaults to x-www-form-urlencoded
2. What I thought is the binary file's contents as a string is
actually not working reliably. On an XML upload of a single file I get
thousands of items and request.POST.items() looks like:
['<?xml version', 'amp', 'lang', 'amp', 'lang', 'amp', 'lang', 'amp', 'lang']
3. The binary string I can use is actually request.body. Still, are
there any potential problems with handling this as a string and not as a
file object?
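
The odd keys in point 2 are what you get when non-form data is parsed as if it were form-encoded: the parser splits on `&` and `=`, so XML entities and attributes turn into bogus fields. A small illustration using the standard library's `parse_qsl`; this only demonstrates the effect, WebOb's own parsing may differ in detail, and the XML snippet is made up:

```python
from urllib.parse import parse_qsl

# A made-up XML body, treated as if it were x-www-form-urlencoded data.
xml_body = '<?xml version="1.0"?><title>Tom &amp; Jerry</title>'

# Everything before the first "=" becomes a field name, and each "&"
# starts a new field, so the document is chopped into meaningless pairs.
pairs = parse_qsl(xml_body, keep_blank_values=True)
for key, value in pairs:
    print(repr(key), repr(value))
```

The first key comes out as `<?xml version`, matching the first item in the list above.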

Zsolt

Zsolt Ero

Apr 17, 2018, 9:30:58 AM
to pylons-...@googlegroups.com
OK, I'm getting there, although I'm still a bit confused. In WebOb
docs I found request.body_file, request.body_file_raw,
request.body_file_seekable.

In multipart's request.POST, I'm doing:

file_obj.seek(0, 2)
file_size = file_obj.tell()
file_obj.seek(0)

Should I be using seekable or raw for this?

Zsolt

Jonathan Vanasco

Apr 18, 2018, 4:08:50 PM
to pylons-discuss
I was confused on this years ago. The problem is in naming...

`curl` is concerned with the HTTP method `POST`.  Let's call that `HTTP POST`.  If you look at the `curl` documentation, you'll see the differences between the various `--data-XXXX` options: they `HTTP POST` data in different formats and encodings.

WebOb's `request.POST` isn't the same as `HTTP POST` though; it's a convenience attribute that pre-processes structured form data submitted via `HTTP POST` (see https://docs.pylonsproject.org/projects/webob/en/stable/api/request.html#webob.request.BaseRequest.POST)

If you look at the source of webob's POST https://github.com/Pylons/webob/blob/master/src/webob/request.py#L749-L797 you'll see that it's trying to parse the raw data such as `self.body_file_raw` and `self.body_file` into form values.

Since your `HTTP POST` data isn't a form, you want to avoid all that processing and just operate on the raw data.  IIRC `body_file_raw` is better for uploads, because it avoids reading/consuming the body (which may be a stream instead of a static file).  I default to `body_file_raw` out of habit.

Bert JW Regeer

Apr 18, 2018, 4:54:41 PM
to pylons-...@googlegroups.com
Using request.body_file is actually better than using body_file_raw. The raw reads directly from the underlying wsgi.input, which means if you are using wsgiref or some other server that doesn't correctly input terminate wsgi.input you may potentially deadlock reading forever because the wsgi.input may be directly connected to the remote socket, and if you read past CONTENT_LENGTH things go awry.

Using body_file it will wrap it into a LimitedLengthFile correctly and will do the right thing.
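
To illustrate the idea of capping reads at CONTENT_LENGTH, here is a minimal sketch. The class `LimitedReader` is a hypothetical stand-in written for this example; it is not WebOb's LimitedLengthFile, just the same principle in a few lines:

```python
import io


class LimitedReader:
    """Hypothetical wrapper that never reads more than `max_len` bytes."""

    def __init__(self, raw, max_len):
        self.raw = raw
        self.remaining = max_len

    def read(self, size=-1):
        # Clamp the request to what is left of the declared length, so the
        # underlying stream (possibly a live socket) is never over-read.
        if size is None or size < 0 or size > self.remaining:
            size = self.remaining
        if size == 0:
            return b""
        data = self.raw.read(size)
        self.remaining -= len(data)
        return data


# Six bytes belong to this request; whatever follows must stay untouched.
raw = io.BytesIO(b"abcdef" + b"NEXT REQUEST")
limited = LimitedReader(raw, 6)
first = limited.read(4)
second = limited.read()   # asks for "everything", gets only the remainder
third = limited.read()    # already at the limit, so this is empty
```

An unbounded `read()` on the raw stream would have returned the trailing bytes too (or blocked, on a real socket); the wrapper stops at the declared length.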

Also, if you use something like pyramid_retry then it will have already made the wsgi.input seekable, using body_file will return the underlying file object anyway, so there is no reason to use body_file_raw.

request.body consumes the whole body and returns it as a byte string; avoiding that is a good idea if it is a large upload.

To make this message shorter:

Tl;dr: Just use request.body_file, and have WebOb protect you from the crappy WSGI spec/WSGI servers as much as possible
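
For a large upload, that usually means copying from the file-like object in chunks instead of holding the whole body in memory. A minimal sketch, assuming only that the source has a `read(size)` method; the helper name `save_stream` and the chunk size are made up for this example, and `io.BytesIO` stands in for the request body:

```python
import io
import os
import tempfile


def save_stream(src, dest_path, chunk_size=64 * 1024):
    """Copy `src` to `dest_path` chunk by chunk; return the bytes written."""
    written = 0
    with open(dest_path, "wb") as dest:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # an empty read means end of input
                break
            dest.write(chunk)
            written += len(chunk)
    return written


# Demo with an in-memory payload standing in for the upload.
payload = b"x" * 200000
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "upload.bin")
    size = save_stream(io.BytesIO(payload), target)
    saved_size = os.path.getsize(target)
```

In a view this would presumably be called as `save_stream(request.body_file, some_path)`, so memory use stays around one chunk regardless of the upload size.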


--
You received this message because you are subscribed to the Google Groups "pylons-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pylons-discus...@googlegroups.com.
To post to this group, send email to pylons-...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pylons-discuss/3d8848d6-87cc-4fab-9e46-eede7457e878%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Bert JW Regeer

Apr 18, 2018, 5:07:03 PM
to pylons-...@googlegroups.com
request.body_file will protect you against bad WSGI servers/clients not sending appropriate headers. It'll return you something that you can read from, even if it is 0 bytes. It will also not let you read past the end of the input.

request.body_file_seekable will return the same thing as request.body_file but in a way that you can seek it, this will cause WebOb to make a copy of the whole body first (if the underlying body_file_raw isn't seekable). This likely already happened anyway if you are using pyramid_retry. Feel free to use this as well.
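
The cost of "make a copy of the whole body first" can be pictured with a sketch like the following. This is not WebOb's implementation; `to_seekable` and `OneWayStream` are hypothetical names, and the in-memory threshold is an arbitrary choice:

```python
import io
import shutil
import tempfile


def to_seekable(stream, max_in_memory=1024 * 1024, chunk_size=64 * 1024):
    """Return `stream` itself if it can seek, else a seekable copy of it."""
    if hasattr(stream, "seekable") and stream.seekable():
        return stream
    # Small bodies stay in memory; larger ones spill over to a temp file.
    spool = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    shutil.copyfileobj(stream, spool, chunk_size)
    spool.seek(0)
    return spool


class OneWayStream:
    """Stand-in for a socket-backed input that can only be read forward."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def read(self, size=-1):
        return self._buf.read(size)


copy = to_seekable(OneWayStream(b"uploaded bytes"))
copy.seek(0, 2)          # jump to the end, which the original could not do
copy_size = copy.tell()
copy.seek(0)
content = copy.read()
```

The whole body is read once up front, which is the trade-off described above: seeking becomes possible, but only after the full upload has been copied.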

On Apr 17, 2018, at 07:30, Zsolt Ero <zsol...@gmail.com> wrote:

OK, I'm getting there, although I'm still a bit confused. In WebOb
docs I found request.body_file, request.body_file_raw,
request.body_file_seekable.

In multipart's request.POST, I'm doing:

file_obj.seek(0, 2)
file_size = file_obj.tell()
file_obj.seek(0)

Should I be using seekable or raw for this?

You are not guaranteed to be able to seek the thing returned from request.body_file_raw. It'll likely be fine because of pyramid_retry, but if you don't have that, then it depends on your WSGI server.

Use request.body_file_seekable to be safe.

That being said, you don't need to .tell() to get the length of the upload; that's what request.content_length is for. The content_length matches what is available to be read from the body_file if and only if the WSGI server you are using does not support `wsgi.input_terminated`. That flag would allow streaming a file using chunked encoding, in which case the underlying body_file_raw does properly EOF, but it likely won't support seeking.

Tl;dr: use request.body_file or request.body_file_seekable if you need to seek.
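
If you do end up measuring a seekable body by hand, it helps to restore the read position afterwards. A minimal sketch, with `io.BytesIO` standing in for a seekable body and `stream_size` being a hypothetical helper name:

```python
import io
import os


def stream_size(f):
    """Return the total size of a seekable stream, keeping its position."""
    pos = f.tell()
    f.seek(0, os.SEEK_END)
    size = f.tell()
    f.seek(pos)  # put the cursor back where the caller left it
    return size


body = io.BytesIO(b"hello world")
body.read(5)                 # pretend part of the body was already read
size = stream_size(body)
rest = body.read()           # continues from the saved position
```

This only works on something seekable, which is the reason for reaching for request.body_file_seekable in that case.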


Jonathan Vanasco

Apr 19, 2018, 12:38:41 PM
to pylons-discuss

On Wednesday, April 18, 2018 at 4:54:41 PM UTC-4, Bert JW Regeer wrote:
Using request.body_file is actually better than using body_file_raw. The raw reads directly from the underlying wsgi.input, which means if you are using wsgiref or some other server that doesn't correctly input terminate wsgi.input you may potentially deadlock reading forever because the wsgi.input may be directly connected to the remote socket, and if you read past CONTENT_LENGTH things go awry.

Wow. I did not know that. Thanks!
