[Django] #23517: Collect static files in parallel

Django

Sep 18, 2014, 10:48:25 AM
to django-...@googlegroups.com
#23517: Collect static files in parallel
-------------------------------------+--------------------
Reporter: thenewguy | Owner: nobody
Type: Uncategorized | Status: new
Component: contrib.staticfiles | Version: 1.7
Severity: Normal | Keywords:
Triage Stage: Unreviewed | Has patch: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+--------------------
It would really speed up collectstatic on remote storages to copy files in
parallel.

It shouldn't be too complicated to refactor the command to work with
multiprocessing.

I am submitting the ticket as a reminder to myself when I have a free
moment. Would this be accepted into Django?

--
Ticket URL: <https://code.djangoproject.com/ticket/23517>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

Django

Sep 28, 2014, 5:13:37 AM
to django-...@googlegroups.com
#23517: Collect static files in parallel
-------------------------------------+-------------------------------------
Reporter: thenewguy | Owner: nobody
Type: Uncategorized | Status: closed
Component: contrib.staticfiles | Version: 1.7
Severity: Normal | Resolution: needsinfo
Keywords: | Triage Stage:
Has patch: 0 | Unreviewed
Needs tests: 0 | Needs documentation: 0
Easy pickings: 0 | Patch needs improvement: 0
| UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by aaugustin):

* status: new => closed
* needs_docs: => 0
* resolution: => needsinfo
* needs_tests: => 0
* needs_better_patch: => 0


Comment:

I'm afraid we'll be reluctant to hardcode concurrent behavior in Django if
there's another solution.

You should be able to implement parallel upload in the storage backend
with:

- a `save` method that enqueues the operation for processing by a thread
pool and returns immediately,
- a `post_process` method that waits until the thread pool has completed
all uploads.

Can you try that approach, and if it doesn't work, reopen this ticket?

Thanks!
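
The two bullet points above can be sketched roughly as follows. This is a minimal illustration in plain Python, not Django code: `SlowStorage`, `ParallelSaveMixin`, and `max_workers` are hypothetical names, and the real `Storage.save()` may return a different name than the one passed in, which this sketch ignores.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class SlowStorage:
    """Hypothetical stand-in for a remote storage backend (not a Django class)."""

    def __init__(self):
        self.saved = {}

    def save(self, name, content):
        time.sleep(0.01)  # simulate the network latency of one upload
        self.saved[name] = content
        return name

    def post_process(self, paths, **options):
        # Mirror the generator shape of the post_process hook.
        for name in paths:
            yield name, name, True


class ParallelSaveMixin:
    """Queue each save on a thread pool; wait for all of them in post_process."""

    max_workers = 8  # assumed value, tune for the backend

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._pool = ThreadPoolExecutor(max_workers=self.max_workers)
        self._pending = []

    def save(self, name, content):
        # Read now so the worker never touches a file the caller has closed.
        data = content.read() if hasattr(content, "read") else content
        self._pending.append(self._pool.submit(super().save, name, data))
        return name  # returns immediately; the upload continues in the pool

    def post_process(self, paths, **options):
        yield from super().post_process(paths, **options)
        for future in self._pending:
            future.result()  # blocks until done and re-raises upload errors
        self._pending.clear()


class ParallelSlowStorage(ParallelSaveMixin, SlowStorage):
    pass
```

With this shape, an upload error surfaces when `post_process()` is drained rather than at the `save()` call that caused it.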

--
Ticket URL: <https://code.djangoproject.com/ticket/23517#comment:1>

Django

Nov 8, 2014, 5:44:42 PM
to django-...@googlegroups.com
#23517: Collect static files in parallel
-------------------------------------+-------------------------------------
Reporter: thenewguy | Owner: nobody
Type: Uncategorized | Status: closed
Component: contrib.staticfiles | Version: 1.7
Severity: Normal | Resolution: needsinfo
Keywords: | Triage Stage:
Has patch: 0 | Unreviewed
Needs tests: 0 | Needs documentation: 0
Easy pickings: 0 | Patch needs improvement: 0
| UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by thenewguy):

Just wanted to post back on this. I was able to write a quick 20 line
proof of concept for this using the threading module. The speedup was
pretty significant so I figured I would reopen this again. I could be
wrong, but I imagine something like this would be beneficial to the
general django userbase. Granted, I don't know if others get as restless
as I do while waiting on static files to upload.

I've quickly tested collectstatic with 957 static files. All files are
post processed in some fashion (at least being hashed by
ManifestFilesMixin) and also a gzipped file is created if the saved file
benefits from gzip compression. The storage backend stored the files on
AWS S3. The AWS S3 console reported 3254 files deleted when I cleaned up
after each test, so 3254 files were created during collectstatic in each
case.

The following times are generated by the command line and should not be
interpreted as quality benchmarks... but they are good enough to show the
significance.

{{{
set startTime=%time%
python manage.py collectstatic --noinput
echo Start Time: %startTime%
echo Finish Time: %time%
}}}

Times (keep in mind staticfiles collectstatic does not output the count
for gzipped files, so there are roughly 957*2 more files than it reports)
{{{
957 static files copied, 957 post-processed.
async using 100 threads (ParallelUploadStaticS3Storage)
Start Time: 16:43:57.01
Finish Time: 16:49:30.31
Duration: 5.55500 minutes

sync using regular s3 storage (StaticS3Storage)
Start Time: 16:19:24.21
Finish Time: 16:41:46.78
Duration: 22.3761667 minutes
}}}
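
As a quick check of the durations above, the arithmetic can be reproduced from the four timestamps. This is only a sketch; `minutes_between` is a hypothetical helper name, and it assumes both runs started and finished on the same day.

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"  # matches the %time% output, e.g. 16:43:57.01


def minutes_between(start, finish):
    """Return the elapsed time between two same-day timestamps, in minutes."""
    delta = datetime.strptime(finish, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 60


parallel = minutes_between("16:43:57.01", "16:49:30.31")
sequential = minutes_between("16:19:24.21", "16:41:46.78")
speedup = sequential / parallel

print("parallel: %.3f min, sequential: %.3f min, speedup: %.1fx"
      % (parallel, sequential, speedup))
```

The threaded run comes out roughly four times faster than the sequential one.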


This storage is derived from ManifestFilesMixin and a subclass of
S3BotoStorage (django-storages) that creates gzipped copies and checks for
file changes to keep reliable modification dates before saving:
{{{
import threading

from django.core.files.base import ContentFile

# StaticS3Storage is the poster's own S3BotoStorage subclass (not shown here).


class ParallelUploadStaticS3Storage(StaticS3Storage):
    """
    THIS STORAGE ASSUMES THAT UPLOADS ONLY OCCUR
    FROM CALLS TO THE COLLECTSTATIC MANAGEMENT
    COMMAND. SAVING TO THIS STORAGE DIRECTLY IS
    NOT RECOMMENDED BECAUSE THE UPLOAD THREADS
    ARE NOT JOINED UNTIL POST_PROCESS IS CALLED.
    """

    # Class-level list: upload threads are shared by all instances.
    active_uploads = []
    thread_count = 100

    def remove_completed_uploads(self):
        # Iterate in reverse so deleting by index does not shift later items.
        for i, thread in reversed(list(enumerate(self.active_uploads))):
            if not thread.is_alive():
                del self.active_uploads[i]

    def _save_content(self, key, content, **kwargs):
        # Busy-wait until fewer than thread_count uploads are in flight.
        while len(self.active_uploads) >= self.thread_count:
            self.remove_completed_uploads()

        # copy the file to memory for the moment to get around file closed
        # errors -- BAD HACK FIXME FIX
        content = ContentFile(content.read(), name=content.name)

        f = super(ParallelUploadStaticS3Storage, self)._save_content
        thread = threading.Thread(target=f, args=(key, content),
                                  kwargs=kwargs)

        self.active_uploads.append(thread)
        thread.start()

    def post_process(self, *args, **kwargs):
        # perform post processing
        for post_processed in super(ParallelUploadStaticS3Storage,
                                    self).post_process(*args, **kwargs):
            yield post_processed

        # wait for the remaining uploads to finish
        print("Post processing completed. Now waiting for the remaining "
              "uploads to finish.")
        for thread in self.active_uploads:
            thread.join()
}}}

--
Ticket URL: <https://code.djangoproject.com/ticket/23517#comment:2>

Django

Nov 8, 2014, 5:45:51 PM
to django-...@googlegroups.com
#23517: Collect static files in parallel
-------------------------------------+-------------------------------------
Reporter: thenewguy | Owner: nobody
Type: Uncategorized | Status: new
Component: contrib.staticfiles | Version: 1.7
Severity: Normal | Resolution:
Keywords: | Triage Stage:
Has patch: 0 | Unreviewed
Needs tests: 0 | Needs documentation: 0
Easy pickings: 0 | Patch needs improvement: 0
| UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by thenewguy):

* status: closed => new
* resolution: needsinfo =>


--
Ticket URL: <https://code.djangoproject.com/ticket/23517#comment:3>

Django

Nov 8, 2014, 5:50:35 PM
to django-...@googlegroups.com
#23517: Collect static files in parallel
-------------------------------------+-------------------------------------
Reporter: thenewguy | Owner: nobody
Type: Uncategorized | Status: new
Component: contrib.staticfiles | Version: 1.7
Severity: Normal | Resolution:
Keywords: | Triage Stage:
Has patch: 0 | Unreviewed
Needs tests: 0 | Needs documentation: 0
Easy pickings: 0 | Patch needs improvement: 0
| UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by thenewguy):

* cc: wgordonw1@… (added)


--
Ticket URL: <https://code.djangoproject.com/ticket/23517#comment:4>

Django

Nov 21, 2014, 3:51:01 PM
to django-...@googlegroups.com
#23517: Collect static files in parallel
-------------------------------------+-------------------------------------
Reporter: thenewguy | Owner: nobody
Type: Uncategorized | Status: closed
Component: contrib.staticfiles | Version: 1.7
Severity: Normal | Resolution: wontfix
Keywords: | Triage Stage:
Has patch: 0 | Unreviewed
Needs tests: 0 | Needs documentation: 0
Easy pickings: 0 | Patch needs improvement: 0
| UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by timgraham):

* status: new => closed
* resolution: => wontfix


Comment:

I think Aymeric was trying to say that if Django has sufficient hooks
so that users can implement this on their own, then that's enough.
Maybe `StaticS3Storage` would like to include this in their code, but it's
not obvious to me that we should include this in Django itself.

--
Ticket URL: <https://code.djangoproject.com/ticket/23517#comment:5>

Django

Oct 9, 2025, 7:23:34 AM
to django-...@googlegroups.com
#23517: Collect static files in parallel
-------------------------------------+-------------------------------------
Reporter: thenewguy | Owner: nobody
Type: Uncategorized | Status: new
Component: contrib.staticfiles | Version: 1.7
Severity: Normal | Resolution:
Keywords: | Triage Stage:
| Unreviewed
Has patch: 1 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by Carles Barrobés i Meix):

* cc: Carles Barrobés i Meix (added)
* has_patch: 0 => 1
* resolution: wontfix =>
* status: closed => new

Comment:

Discussed and worked on this during the Django on the Med event:
https://djangomed.eu/

One conclusion was that although the current hooks make it possible to
implement this in the storage backend, doing so is a non-trivial endeavour
and has to be repeated in every backend, whereas it can be solved
relatively simply within the collectstatic command in a way that supports
any existing and future backend.

This PR shows one implementation based on a threadpool
https://github.com/django/django/pull/19935
--
Ticket URL: <https://code.djangoproject.com/ticket/23517#comment:6>

Django

Oct 13, 2025, 10:06:27 AM
to django-...@googlegroups.com
#23517: Collect static files in parallel
-------------------------------------+-------------------------------------
Reporter: thenewguy | Owner: Carles
Type: | Barrobés i Meix
Cleanup/optimization | Status: assigned
Component: contrib.staticfiles | Version: dev
Severity: Normal | Resolution:
Keywords: | Triage Stage:
| Someday/Maybe
Has patch: 1 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by Jacob Walls):

* owner: nobody => Carles Barrobés i Meix
* stage: Unreviewed => Someday/Maybe
* status: new => assigned
* type: Uncategorized => Cleanup/optimization
* version: 1.7 => dev

Comment:

This could be worth adding so that any backend can benefit, but there is a
concern on the [https://github.com/django/django/pull/19935 PR] about race
conditions. Carles responds:

> I had a go at adding workers to django-manifeststaticfiles-enhanced and
> ended up with an approach of finding the files first and then processing
> them with workers.
> It required more code changes, and I had to make more use of locks to
> control access to the shared state, and I had to figure out how to handle
> thread-safety issues with umask when making directories. Have a look and
> see if it has any value.

If we can advance the proof of concept so that it does not have race
conditions and doesn't come with a high complexity cost, I think we can
move this out of ''Maybe'' status.
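
The "find first, then process with workers" idea quoted above might look something like the sketch below. It is not the code from the PR or from django-manifeststaticfiles-enhanced; `collect`, `found`, `copy_one`, and `workers` are hypothetical names, and the umask issue is not addressed here.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def collect(found, copy_one, workers=4):
    """Two-phase collection: deduplicate first, then copy with workers.

    `found` is an iterable of (destination, source) pairs in finder order;
    `copy_one(destination, source)` performs a single copy.
    """
    # Phase 1 (single-threaded): the first source found for a destination
    # wins, so no two workers ever write to the same destination.
    unique = {}
    for destination, source in found:
        unique.setdefault(destination, source)

    # Phase 2 (concurrent): shared state is only touched under the lock.
    copied = []
    lock = threading.Lock()

    def worker(item):
        destination, source = item
        copy_one(destination, source)
        with lock:
            copied.append(destination)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(worker, unique.items()))  # list() surfaces worker errors
    return sorted(copied)
```

Resolving duplicates before any worker starts removes one class of race (two threads writing the same destination) without extra locking in the copy step itself.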
--
Ticket URL: <https://code.djangoproject.com/ticket/23517#comment:7>

Django

Oct 13, 2025, 2:42:20 PM
to django-...@googlegroups.com
#23517: Collect static files in parallel
-------------------------------------+-------------------------------------
Reporter: thenewguy | Owner: Carles
| Barrobés i Meix
Type: New feature | Status: closed
Component: contrib.staticfiles | Version: dev
Severity: Normal | Resolution: wontfix
Keywords: | Triage Stage:
| Unreviewed
Has patch: 1 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by Natalia Bidart):

* resolution: => wontfix
* stage: Someday/Maybe => Unreviewed
* status: assigned => closed
* type: Cleanup/optimization => New feature

Comment:

I think this change should go through the New Feature process. While it
improves performance, it would also change how `collectstatic` would
behave, from sequential to concurrent execution. This can affect execution
order, error timing, and resource usage, and may introduce thread-safety
or race-condition issues for third-party or custom storage backends (of
which there are plenty).

Since it revisits a prior design decision (Django intentionally left
concurrency to backends) and could require new configuration options
(worker count, failure handling, opt-in), to me it is more than a simple
optimization. While I really appreciate the discussions during the Django
on the Med sprint around this ticket, I think it should be treated as a
new feature, so could you please open an issue for this on the
[https://github.com/django/new-features/issues new feature tracker]? If
accepted, we could reopen this same ticket.
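
To make the configuration questions above concrete, here is a small sketch of what opt-in concurrency with a worker count and a failure policy could look like. Nothing here is an existing `collectstatic` option: `run_tasks`, `workers`, and `fail_fast` are hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_tasks(tasks, workers=1, fail_fast=True):
    """Run zero-argument callables and return (results, errors).

    workers=1 keeps the sequential behaviour, so concurrency is opt-in.
    With fail_fast the first error is raised; otherwise errors are collected.
    """
    results, errors = [], []

    def record(call):
        try:
            results.append(call())
        except Exception as exc:
            if fail_fast:
                raise
            errors.append(exc)

    if workers <= 1:
        for task in tasks:
            record(task)  # stops at the first failing task when fail_fast
        return results, errors

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(task) for task in tasks]
        # Results arrive in completion order, not submission order.
        for future in as_completed(futures):
            record(future.result)  # result() re-raises the task's exception
    return results, errors
```

Note the difference in error timing: in the concurrent branch, a raised error only propagates after the pool has shut down, by which point tasks that were already submitted have still run.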
--
Ticket URL: <https://code.djangoproject.com/ticket/23517#comment:8>