Collectstatic and determine which files need to be updated by comparing their checksum

Daniel Blasco

Apr 12, 2016, 12:07:51 PM
to Django developers (Contributions to Django itself)
Hi,

I posted this in django-users, but I think it fits better here.


I'm using django-storages to upload my static files to Amazon S3 and I'm serving my application from Heroku.

In local development, when I run collectstatic a second time right after the first, no files are uploaded to S3, because collectstatic compares modified_time to determine whether the local files are newer than the ones in S3. That's fine so far.

The problem appears when I deploy to Heroku. Collectstatic runs on the Heroku server and every file is uploaded to S3 on every deployment, even the ones that have not changed. This is because Heroku creates a full copy of the source code during the deployment, so all the files get a new modified_time. In my case, it takes almost 10 minutes to upload ~1000 files for each deployment.
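
The effect can be reproduced locally without Heroku or S3. The following is only a minimal sketch using the Python standard library: the file names are made up, and shutil.copy stands in for whatever Heroku does when it copies the source tree. It shows that a fresh copy ends up with a newer modification time even though its MD5 is identical.

```python
import hashlib
import os
import shutil
import tempfile


def md5_of(path):
    """Return the hex MD5 digest of the file at `path`."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "app.css")
    with open(original, "wb") as f:
        f.write(b"body { margin: 0 }")

    # Pretend the original was last modified an hour ago.
    past = os.path.getmtime(original) - 3600
    os.utime(original, (past, past))

    # shutil.copy copies the content but not the timestamps,
    # so the copy gets the current time as its modification time.
    fresh_copy = os.path.join(tmp, "checkout.css")
    shutil.copy(original, fresh_copy)

    copy_is_newer = os.path.getmtime(fresh_copy) > os.path.getmtime(original)
    same_content = md5_of(fresh_copy) == md5_of(original)
```

A comparison based on modified_time treats the copy as changed, while a comparison based on the checksum does not.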

Also, imagine a situation where the modified_times do not change and I want to upload older versions of the static files. I won't be able to, because collectstatic skips any file whose modified_time is older than the one already in the storage.

I think a more accurate way to check whether a file needs to be replaced would be to compare checksums/hashes, and to offer this feature for all Storage subclasses. To preserve backwards compatibility, the collectstatic command would first determine whether the storage subclass implements checksum generation, and otherwise fall back to the modified_time comparison.


What do you think? Does this make sense?

bliy...@rentlytics.com

Apr 14, 2016, 1:16:39 PM
to Django developers (Contributions to Django itself)
This makes a lot of sense to me.

Tim Graham

Apr 14, 2016, 8:34:19 PM
to Django developers (Contributions to Django itself)
A proposal to use checksums was closed as wontfix in https://code.djangoproject.com/ticket/19021.

Daniel Blasco

Apr 15, 2016, 8:17:48 AM
to Django developers (Contributions to Django itself)
Thanks Tim for the info.

This is the discussion mentioned in the ticket (from 2012): https://groups.google.com/d/topic/django-developers/vtMVq8jwnf8/discussion

The solutions that ptone suggests in the ticket don't really work for Heroku. Also, syncing static files from a local machine is not a good solution when, for example, using CI. And there is still the case of uploading old files, for example during a rollback.

In the end, the main problem is that collectstatic uses two different backends. One is provided by each of the static file finders (settings.STATICFILES_FINDERS) and the other is the one defined in settings.STATICFILES_STORAGE. As there is no standard hash method, the Storage superclass can't force all its subclasses to implement one.

Maybe a solution would be to shift the responsibility for detecting a file change from collectstatic to the STATICFILES_STORAGE? That way the Storage subclasses get to decide how they want to check whether a file has changed. They can use any technique they like and stay consistent.

A rough and simplified example:

# django/core/files/storage.py
class Storage(object):
    def has_changed(self, source_storage, source_path, path):
        raise NotImplementedError()


class FileSystemStorage(Storage):
    def has_changed(self, source_storage, source_path, path):
        return source_storage.modified_time(source_path) > self.modified_time(path)


# django/contrib/staticfiles/management/commands/collectstatic.py
class Command(BaseCommand):
    def delete_file(self, path, prefixed_path, source_storage):
        if self.storage.has_changed(source_storage, path, prefixed_path):
            self.storage.delete(prefixed_path)


And then, anyone could do this in their own project (or even in django-storages):

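# my_app/storages/custom_s3_storage.py
import hashlib

from storages.backends.s3boto import S3BotoStorage


class MyStorage(S3BotoStorage):
    def has_changed(self, source_storage, source_path, path):
        try:
            # Use the source storage's own MD5 if it provides one.
            local_md5 = source_storage.get_md5(source_path)
        except (NotImplementedError, AttributeError):
            # Otherwise hash the file contents ourselves.
            with source_storage.open(source_path) as source_file:
                local_md5 = hashlib.md5(source_file.read()).hexdigest()

        return self.get_md5(path) != local_md5

    def get_md5(self, path):
        # Simplified: assumes the key exists and that boto exposes its MD5 here.
        return self.bucket.get_key(path).md5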



It keeps backward compatibility and lets any Storage subclass use whatever comparison method it likes.
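
To make the idea easier to try without Django or S3, here is a minimal, self-contained sketch. Everything in it is hypothetical: InMemoryStorage, HashingStorage and collect are stand-ins invented for illustration, not Django or django-storages APIs. The default has_changed keeps the modified_time comparison, and a subclass overrides it with a hash comparison.

```python
import hashlib


class InMemoryStorage:
    """Hypothetical stand-in for a Storage; not Django's API."""

    def __init__(self, files=None):
        # Maps path -> (content bytes, modification time as a number).
        self.files = dict(files or {})

    def exists(self, path):
        return path in self.files

    def modified_time(self, path):
        return self.files[path][1]

    def read(self, path):
        return self.files[path][0]

    def has_changed(self, source_storage, source_path, path):
        # Default: keep the current behaviour and compare modification times.
        return source_storage.modified_time(source_path) > self.modified_time(path)


class HashingStorage(InMemoryStorage):
    """Overrides the check to compare content hashes instead."""

    def has_changed(self, source_storage, source_path, path):
        local_md5 = hashlib.md5(source_storage.read(source_path)).hexdigest()
        remote_md5 = hashlib.md5(self.read(path)).hexdigest()
        return local_md5 != remote_md5


def collect(source, target, paths):
    """Copy only the files that the target storage reports as changed."""
    copied = []
    for path in paths:
        if not target.exists(path) or target.has_changed(source, path, path):
            target.files[path] = source.files[path]
            copied.append(path)
    return copied


# a.css is a rollback: different content, older modification time.
# b.css is a fresh checkout: same content, newer modification time.
source = InMemoryStorage({"a.css": (b"old", 50), "b.css": (b"same", 300)})
by_mtime = InMemoryStorage({"a.css": (b"new", 100), "b.css": (b"same", 100)})
by_hash = HashingStorage({"a.css": (b"new", 100), "b.css": (b"same", 100)})

copied_by_mtime = collect(source, by_mtime, ["a.css", "b.css"])
copied_by_hash = collect(source, by_hash, ["a.css", "b.css"])
```

With the modified_time check, only the unchanged b.css is copied and the rolled-back a.css is skipped. With the hash check it is the other way around, which is the behaviour I'm after.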