Serious problems with the way Django handles file uploads


jp

Sep 13, 2006, 5:56:50 PM
to Django developers
Hello everyone.

I started using Django about a week ago. I have a particular app which
more or less accepts CSV files, Excel files, DBase files, and whatever
else, takes the uploaded file, converts it to a tab-delimited format,
and then does statistical work over it.

Originally this application was written in PHP by someone else. I took
it over and rewrote it in Rails. Rails didn't have good
libraries (I had to call out to Python programs anyway), and it was
slow, so I rewrote it in Python, in the Pylons web framework. As I
said, about a week ago I started using Django, and I more or less
directly converted this app from Pylons to Django.

Pretty much all the web app itself does is display a form to upload the
file, accept the upload, and then pass it off to an outside Python
library which does the conversion to tab-delimited format. Once that is
done, the framework (in this case a Django view function) takes control
again and does the statistical work over the tab-delimited file. The code
for the Pylons and Django versions is more or less identical. There are
obviously small changes here and there to fit each framework, but the
controller or view code has changed very little.

Monday I got the entire app converted over to Django. I uploaded my
first file, a DBase file, and immediately noticed that it was taking
*forever*. After 24 seconds, I finally got my stats page. This same
process takes 3 seconds in Pylons. Something definitely wasn't right
here - my Django app was actually slower than both my Rails and the PHP
app.

So I started doing some tests to isolate the problem. I brought the
issue up in #django on freenode IRC, and someone immediately suggested
that the problem might be the dev server. Knowing that the dev server very
well could be at fault, I took a little script and got my Django app
running on the exact same Paste WSGI server that my Pylons app was
running on. Again, it took 24 seconds for it to run this 6 MB file.

Continuing this testing this morning, I realized that a very good
way to find out exactly where the bottleneck was would be to cut out the
uploading process altogether: if the process finished very quickly on
a file that was already on the local filesystem, then the
problem lay within Django's actual upload process. Sure enough,
when I had the process run on an already-uploaded file, it
took 3 seconds. So uploading the file was taking 21 out of 24 seconds.
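
A minimal sketch of this kind of isolation test, timing each phase
separately on a file that is already on disk. The functions
convert_to_tab and run_stats are hypothetical stand-ins for the outside
conversion library and the statistics step; they are not the real code.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(label):
    """Record the wall-clock time spent inside the with-block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


def convert_to_tab(path):
    # Hypothetical stand-in for the outside conversion library.
    time.sleep(0.01)
    return path + ".tab"


def run_stats(tab_path):
    # Hypothetical stand-in for the statistical work.
    time.sleep(0.01)
    return {"source": tab_path}


# No upload involved: the input is assumed to be on the local filesystem.
with timed("convert"):
    tab_path = convert_to_tab("already_uploaded.dbf")
with timed("stats"):
    stats = run_stats(tab_path)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.3f}s")
```

If these phases add up to only a small part of the total request time,
the remainder is being spent before the view's own work starts.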

Again I brought this up in the IRC chat. Someone told me that nobody was
going to take me seriously because I wasn't running Django the
'preferred' way, i.e., on mod_python/Apache. I really didn't think this
was the limiting factor, but I installed Apache and mod_python and got
it all set up anyway. Again, it took 24 seconds for this file to upload
and process. That was 8x as long as my Pylons app running on its dinky
little WSGI Python server.

At this point I was able to narrow down the issue:

* it had to do with Django's upload process
* it was an equal problem on any server, whether Django's dev server,
the Paste server, or Apache

I ran some profiling in order to narrow the problem even further. The
first link below is a profile of the view that displays the form. This view
actually doesn't do much; as I said, it pretty much just displays the
form. When the form is submitted via a POST request, it is sent to the
second view (the second link). This is where the upload takes place,
the processing happens, and the stats are finally displayed.

http://paste.e-scribe.com/1564/
http://paste.e-scribe.com/1565/
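
For anyone wanting to collect a similar profile, here is a minimal
sketch using the standard library's cProfile and pstats modules. The
function handle_upload is a hypothetical stand-in for the view's work;
this is not necessarily how the profiles linked above were produced.

```python
import cProfile
import io
import pstats


def handle_upload(data):
    # Hypothetical stand-in for the work done by the view.
    lines = data.splitlines()
    return len(lines)


profiler = cProfile.Profile()
profiler.enable()
result = handle_upload(b"a\tb\n" * 100_000)
profiler.disable()

# Sort by cumulative time so the most expensive call chains come first.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```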

Someone suggested that an already pending patch would fix the problem.
Ticket 1484, which has been superseded by Ticket 2070
(http://code.djangoproject.com/ticket/2070), has to do with streaming
uploads. This afternoon I applied the most recent patch in Ticket 2070.
Surprisingly, although the patch applied and ran, it had no effect
on the upload issue. Still the same 24 seconds.

I also discovered some other strange stuff. The 6 MB file which I had
been uploading was a DBase file. I uploaded a 7 MB Excel file, and it
took 17 seconds. I uploaded a 1 MB Excel file and it took 2 seconds. I
tried to upload a 13 MB CSV file and it was at 70+ seconds and still
not finished.

There doesn't seem to be any common pattern in all this. The
file type really shouldn't make any difference because, as I said
earlier, both my Pylons app and Django app were using the same outside
library in the same way to convert the file.
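
As a rough check, the timings reported above can be turned into seconds
per megabyte. This only uses the figures already quoted; the totals
include processing time as well as upload time, and the 13 MB CSV figure
is a lower bound because that upload had not finished.

```python
# (file type, size in MB, total seconds) as reported above.
uploads = [
    ("DBase", 6, 24),
    ("Excel", 7, 17),
    ("Excel", 1, 2),
    ("CSV", 13, 70),  # lower bound: still running at 70+ seconds
]

rates = [(kind, size_mb, seconds / size_mb) for kind, size_mb, seconds in uploads]

for kind, size_mb, rate in rates:
    print(f"{kind:5s} {size_mb:2d} MB: {rate:.2f} s/MB")
```

The per-megabyte cost is not constant across these files, which is
consistent with there being no simple size-only pattern.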

So I'm a bit stuck here. I'd love to use Django, but I cannot have it
running 8x slower than another Python framework. We do a lot of file
processing here. Hopefully with all this data someone will be able to
come up with some kind of idea as to what the problem might be and what
solution can be applied.

Thanks,
jp

James Bennett

Sep 15, 2006, 12:23:13 AM
to django-d...@googlegroups.com
On 9/13/06, jp <no.na...@gmail.com> wrote:
> http://paste.e-scribe.com/1564/
> http://paste.e-scribe.com/1565/

It's hard to infer exactly what's going on without knowing more about
the actual code you're using; for example, that first set of profiler
output is spending over 40% of its time in django.core.mail and
related Python email modules, yet AFAIK there's nothing in FileField
or the automatic manipulator for a model with a FileField which should
get into that code.

Could you provide some more detail about what your code is doing,
especially anything it's trying to do with respect to sending email?

--
"May the forces of evil become confused on the way to your house."
-- George Carlin

jp

Sep 15, 2006, 12:41:44 AM
to Django developers
It isn't using FileField; in fact, it isn't touching the DB or using any
kind of manipulator at all. The form consists of a simple <input
type="file" /> box which allows the user to upload a file. The file is
then submitted in a POST request to the second view.

The second view then calls to an outside library which converts the
file to a tab-delimited file, then the view continues on and does
statistical work. None of my code has anything to do with the mail or
email stuff.

It is worth noting that parse_file_upload in django.http uses those
email and mail libraries a lot. For what, I don't know.

But like I said, I'm not calling to the DB or doing anything else.
Really all Django itself is doing is uploading the file and then
displaying a template.

Don Arbow

Sep 15, 2006, 1:03:31 AM
to django-d...@googlegroups.com
On Sep 14, 2006, at 9:41 PM, jp wrote:
>
> It is worth noting that parse_file_upload in django.http uses those
> email and mail libraries a lot. For what I don't know.

Django uses the email libraries to parse the uploaded content as HTTP
uses MIME encoding for file uploads.

Don

James Bennett

Sep 15, 2006, 1:06:02 AM
to django-d...@googlegroups.com
On 9/14/06, jp <no.na...@gmail.com> wrote:
> It is worth noting that parse_file_upload in django.http uses those
> email and mail libraries a lot. For what I don't know.

Yeah, I just realized that; I didn't poke deeply enough. I'm pretty
certain that's because file uploads (typically) come in as
"multipart/form-data", which means the easiest way to parse them is
with email-handling libraries (since they speak MIME natively).
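
To illustrate the idea, here is a minimal sketch of parsing a
"multipart/form-data" body with Python's standard email package. This is
not Django's parse_file_upload code; the boundary, field name, and file
contents are made up for the example.

```python
from email.parser import BytesParser
from email.policy import default

# A made-up multipart/form-data request body with a single file field.
body = (
    b"--xyz\r\n"
    b'Content-Disposition: form-data; name="upload"; filename="data.csv"\r\n'
    b"Content-Type: text/csv\r\n"
    b"\r\n"
    b"a,b\r\n1,2\r\n"
    b"--xyz--\r\n"
)

# The email parser expects headers first, so prepend the request's
# Content-Type header (which carries the boundary) to the raw body.
raw = (
    b"Content-Type: multipart/form-data; boundary=xyz\r\n"
    b"MIME-Version: 1.0\r\n"
    b"\r\n"
) + body

message = BytesParser(policy=default).parsebytes(raw)

parts = {}
for part in message.iter_parts():
    field_name = part.get_param("name", header="content-disposition")
    parts[field_name] = (part.get_filename(), part.get_payload(decode=True))

print(parts["upload"][0], len(parts["upload"][1]))
```

Note that this sketch holds the whole body in memory and parses it in
one pass, so its cost grows with the size of the upload.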

jp

Sep 15, 2006, 10:00:11 PM
to Django developers
Turns out when I was trying to apply ticket 2070, I was forgetting to
actually enable the upload middleware.

I've enabled it now, but I'm getting errors. This patch *should* fix
the problem if I can get it working.

Cheng Zhang

Sep 15, 2006, 11:26:28 PM
to django-d...@googlegroups.com
I have to admit that reading such an informative issue report and
account of Django experience is always a pleasure.
It's also interesting to learn about your path from PHP -> Rails ->
Pylons -> Django. Maybe in another email thread, I would like to know
the reason behind your moving beyond Pylons to Django.

Back to your problem, I'd suggest that you strip out all unrelated
code and make a single-purpose upload Django app. If you can
reproduce the bug in such an app, I guess you can easily post the code
to ask the core developers or experienced people on the mailing list. Just
describing the behavior, even in great detail, probably still won't cut
it. Just my 2c. ;-)


-Cheng Zhang
http://www.ifaxian.com
1st Django powered site in Chinese ;-)
http://www.aiyo.cn
Our 2nd Django powered site in Chinese

jp

Sep 20, 2006, 6:20:29 PM
to Django developers
I found a solution to the problem!

Today I thought about possibly subverting Django's way of parsing file
uploads and using something else to do so, while still using Django
itself for everything else.

After a question in IRC about it, someone mentioned that request.POST
and request.FILES are evaluated in a 'lazy' fashion; the form isn't
actually parsed until either of those is explicitly referenced in your
code.
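
A minimal sketch of what 'lazy' means here, using a hypothetical
LazyRequest class rather than Django's real request object: the body is
only parsed the first time POST is touched, so a view that never
references it never pays for the parsing.

```python
from functools import cached_property
from urllib.parse import parse_qs


class LazyRequest:
    """Hypothetical request object; not Django's actual implementation."""

    def __init__(self, raw_body):
        self.raw_body = raw_body
        self.parse_count = 0  # how many times the body has been parsed

    @cached_property
    def POST(self):
        # Runs only on first access; the result is cached afterwards.
        self.parse_count += 1
        return parse_qs(self.raw_body)


request = LazyRequest("name=jp&format=dbf")
before = request.parse_count      # nothing parsed yet
file_format = request.POST["format"]
request.POST                      # second access reuses the cached result
after = request.parse_count

print(before, file_format, after)
```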

With this in mind, I figured I could use the code from Pylons that does
form parsing. This part of Pylons is actually done using a portion of
Paste, paste.request.parse_formvars.

http://pythonpaste.org/module-paste.request.html#parse_formvars

Using this, I was able to simply say:

'from paste.request import parse_formvars'

And then in the view itself:

input = parse_formvars(request.environ)

To be more consistent with Django, I believe I could have done something
like:

request.POST = parse_formvars(request.environ)

Basically this just bypassed Django's handling of the form and used
Paste instead. I'm happy to say that was literally the only change I
had to make in my code. Of course, instead of
request.FILES['upload'].filename I now use input['upload'].filename,
but that did it all for me.

Even better news is...this version is actually *faster* than the Pylons
version. Only by a second or less, but it is faster.

Everything works fine. I'd like to suggest a few things:

1. Ticket 2070 needs to be looked at. Currently it incorporates both an
AJAX-y upload progress indicator AND a fix to the upload system. These
two items should not be bundled; one is drastically more important than
the other.
2. I honestly can't say whether Ticket 2070 would've fixed my
problem; it might. But it may be a good idea to include
paste.request.parse_formvars as a 'light' and efficient way of parsing
file uploads. It is licensed under an MIT license, so there would be no
problem including it, and it could be very important if Ticket 2070
doesn't fix the problem I was having.

Hope to hear back from someone, and thanks to those who helped me out
along the way.

SmileyChris

Sep 21, 2006, 12:36:23 AM
to Django developers
I can't get 2070 to work either. I found
http://code.djangoproject.com/ticket/2613 which fixed up my uploading
problems (at least on the dev server) - have you tried that one?

jp

Sep 21, 2006, 1:39:22 AM
to Django developers
Actually I did apply that patch and it worked. Problem is, even though
it kept the dev server from simply stopping when I went to upload large
files, it did not speed up the upload process at all.
