Feature request: collectstatic shouldn't recopy files that already exist in destination


Dan Loewenherz

Sep 27, 2012, 12:51:52 PM
to django-d...@googlegroups.com
Hey all!

This is a feature request / proposal (one which I'm willing to build out, given that I've already developed a solution for my own uploader).

I run a consulting business that helps small startups build initial MVPs. When the time ultimately comes to deciding how to store static assets, my preference (as is that of many others) is to use Amazon S3, with Amazon CloudFront acting as a CDN to the appropriate buckets. For the purposes of this ticket, s/S3/your object storage of choice/g.

Now, Django has an awesome mechanism for making sure static assets are up on S3. With the appropriate static storage backend, running ./manage.py collectstatic just searches through your static media folders and copies the files.

The problem I've run into is that collectstatic copies all files, regardless of whether they already exist on the destination. Uploading 5-10MB of files is pretty wasteful (and time consuming) when none of the files have changed and no new files have been added.

As I wrote in the trac ticket (https://code.djangoproject.com/ticket/19021), my current solution was to write a management command that does essentially the same thing collectstatic does, but with a few differences. Here's a rundown (copied straight from the trac ticket).

I currently solve this problem by creating a file containing metadata of all the static media at the root of the destination. This file is a JSON object that contains file paths as keys and checksums as values. When an upload is started, the uploader checks to see if the file path exists as a key in the dictionary. If it does, it checks to see if the checksum has changed. If it hasn't changed, the uploader skips the file. At the end of the upload, the checksum file is updated on the destination.
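
In sketch form the approach looks something like this (a simplified illustration with made-up names, not the actual command):

import hashlib
import json

from django.core.files.base import ContentFile

MANIFEST = 'staticmanifest.json'  # illustrative name

def checksum(storage, path):
    md5 = hashlib.md5()
    f = storage.open(path)
    try:
        for chunk in f.chunks():
            md5.update(chunk)
    finally:
        f.close()
    return md5.hexdigest()

def upload_changed(source, destination, paths):
    # Fetch the manifest once instead of stat'ing every remote file.
    try:
        manifest = json.load(destination.open(MANIFEST))
    except Exception:
        manifest = {}  # first run: no manifest on the destination yet
    for path in paths:
        digest = checksum(source, path)
        if manifest.get(path) == digest:
            continue  # checksum unchanged: skip the upload
        # assumes an overwriting backend (e.g. S3BotoStorage)
        destination.save(path, source.open(path))
        manifest[path] = digest
    # A single write at the end updates the manifest on the destination.
    if destination.exists(MANIFEST):
        destination.delete(MANIFEST)
    destination.save(MANIFEST, ContentFile(json.dumps(manifest)))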

I'll contribute the patch. I know there is not a lot of time before the feature freeze, but I'm willing to make this happen if there's interest.

If we don't want to change behavior, perhaps adding a flag such as --skip-unchanged-files to the collectstatic command is the way to go?

All the best,
Dan

Babatunde Akinyanmi

Sep 27, 2012, 2:42:39 PM
to django-d...@googlegroups.com
I like this feature and have recently been thinking of implementing something similar myself.
+1 for the feature request.

--
Sent from my mobile device

Anders Steinlein

Sep 27, 2012, 3:45:21 PM
to django-d...@googlegroups.com
On Thu, Sep 27, 2012 at 6:51 PM, Dan Loewenherz <dloew...@gmail.com> wrote:
Hey all!

This is a feature request / proposal (one which I'm willing to build out, given that I've already developed a solution for my own uploader).

[...]

I'll contribute the patch. I know there is not a lot of time before the feature freeze, but I'm willing to make this happen if there's interest.

Yes, please. I've been wanting this myself. IMHO this should be the default for collectstatic, with a flag to *force* copying all files.

--
Anders Steinlein
Eliksir AS
http://e5r.no

E-post: and...@e5r.no
Mobil: +47 926 13 069
Twitter: @asteinlein

mjl Martin J. Laubach

Sep 27, 2012, 4:06:08 PM
to django-d...@googlegroups.com
Good idea, but shouldn't it be a per-storage thing? Perhaps this could be done with a couple of callbacks in the collectstatic run:
  • Before collectstatic starts, so the storage backend can pick up its inventory from the remote
  • One called for each file that would be copied, and that can veto a copy operation
  • At the end of the collectstatic run, to update the inventory
That would make things customisable even for strange setups like storing files in a database or whatnot?
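
Something like this, perhaps (the hook names are invented, just to make the shape concrete):

class Storage(object):
    # ... existing storage API ...

    def collect_started(self):
        """Called once before anything is copied, e.g. to fetch the
        remote inventory in a single round trip."""

    def should_copy(self, path, source_storage):
        """Called for each file; returning False vetoes the copy."""
        return True

    def collect_finished(self):
        """Called once at the end, e.g. to push the updated inventory."""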

   Cheers,

        mjl

Karen Tracey

Sep 27, 2012, 6:12:05 PM
to django-d...@googlegroups.com
On Thu, Sep 27, 2012 at 12:51 PM, Dan Loewenherz <dloew...@gmail.com> wrote:
The problem I've run into is that collectstatic copies all files, regardless of whether they already exist on the destination.

No, as noted in the ticket, which has been closed needsinfo, staticfiles already only copies modified files. And I don't think that's a recently-added feature, so I suspect the behavior you are seeing has more to do with the S3 side than the staticfiles code. This post:

http://stackoverflow.com/questions/6618013/django-staticfiles-and-amazon-s3-how-to-detect-modified-files

looks enlightening, though even that is not all that new. It sounds like the problem here is with efficiently getting the last-modified time for the files out in S3, not with staticfiles.

Karen

Dan Loewenherz

Sep 27, 2012, 6:47:17 PM
to django-d...@googlegroups.com
Just updated the ticket.

As I commented, the heuristic for checking if a file has been modified lies in line 282 of collectstatic.py:

if prefixed_path in self.copied_files:
    return self.log("Skipping '%s' (already copied earlier)" % path)


This seems off, since a path may stay the same but a file's contents may change.

Also, the existing functionality doesn't work when multiple people are running collectstatic (or the same person is running it on multiple computers).

Dan


Carl Meyer

Sep 27, 2012, 7:13:47 PM
to django-d...@googlegroups.com
Hi Dan,

On 09/27/2012 04:47 PM, Dan Loewenherz wrote:
> Just updated the ticket.
>
> As I commented, the heuristic for checking if a file has been modified
> lies in line 282 of collectstatic.py:
>
> if prefixed_path in self.copied_files:
>     return self.log("Skipping '%s' (already copied earlier)" % path)
>
> https://github.com/django/django/blob/master/django/contrib/staticfiles/management/commands/collectstatic.py#L282
>
> This seems off, since a path may stay the same but a file's contents may
> change.

That's not checking whether the file has been modified, that's checking
whether the same source file was previously copied in the same
collectstatic run (due to overlapping/duplicate file sources of some kind).

The check for modification date is up on line 234 in the delete_file
method, which is called by both link_file and copy_file.
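
Condensed, the logic there is roughly this (a paraphrase of the method, not a verbatim quote of the source):

# inside delete_file(), before copy_file/link_file do their work:
if self.storage.exists(prefixed_path):
    try:
        target_mtime = self.storage.modified_time(prefixed_path)
        source_mtime = source_storage.modified_time(path)
    except (OSError, NotImplementedError):
        pass  # a backend can't answer; fall through and re-copy
    else:
        if not self.force and target_mtime >= source_mtime:
            return False  # destination is up to date: skip the file
    self.storage.delete(prefixed_path)  # stale copy: delete, then re-copy
return True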

Carl

Dan Loewenherz

Sep 27, 2012, 11:38:49 PM
to django-d...@googlegroups.com
Thanks, I missed that.

I still see an issue here. In any sort of source control, when a user updates their repo, local files that were changed remotely show up as modified at the time the repo is cloned or updated, not at the time the last author actually saved them. You then have the same scenario I pointed to earlier: when multiple people work on a project, they will re-upload the same files multiple times.

Don't get me wrong here--I'm happy to see there is some check to avoid collectstatic'ing the same file, but I think it's warranted to push back on using last modified as the heuristic. A checksum would work much better and solves this problem in a far more foolproof way. With all that said, I don't think this logic belongs in a "delete_file" method. IMO it'd be better to separate the logic of "does this file already exist, and if so, skip it" from "delete this file if it exists on the target" (delete_file).

@Karen--thanks for digging up that SO post. It actually is super relevant here. As mentioned, when uploading files to S3, it's quite time consuming to perform a network round trip to pick up file metadata (such as last modified time) for each file you're uploading. This was the initial reason I chose to store this data in a single location that I'd only need to grab once.

Dan

James Rivett-Carnac

Sep 28, 2012, 1:24:21 AM
to django-d...@googlegroups.com
+1
I like this implementation.

Stephen Burrows

Sep 30, 2012, 5:28:46 PM
to django-d...@googlegroups.com
I use S3 as well, and I have seen cases where files get copied that I know don't need to be. That being said, it was never so slow that it was an issue for me. Is there clear evidence that this is something which can't be handled by the S3 backend due to an inadequate API on the django side?

Dan Loewenherz

Sep 30, 2012, 5:41:12 PM
to django-d...@googlegroups.com
I think it's an API limitation. Many backends don't support last modified times, and even if they all did, it's incorrect to assume that last modified time is an accurate heuristic for whether a file has already been uploaded or not. It might be a better idea to let the backends decide when a file has been changed (instead of just calling the backend's last modified method).

Dan

On Sun, Sep 30, 2012 at 2:28 PM, Stephen Burrows <stephen....@gmail.com> wrote:
I use S3 as well, and I have seen cases where files get copied that I know don't need to be. That being said, it was never so slow that it was an issue for me. Is there clear evidence that this is something which can't be handled by the S3 backend due to an inadequate API on the django side?


Jannis Leidel

Oct 1, 2012, 3:43:03 AM
to django-d...@googlegroups.com
Dan,

I added prefetching of the metadata to the Boto-based storage backend in django-storages a while ago to reduce the time spent querying S3.

See https://bitbucket.org/david/django-storages/src/1574890d87be/storages/backends/s3boto.py#cl-42

You can either set the setting AWS_PRELOAD_METADATA to True or subclass S3BotoStorage and pass preload_metadata=True into the constructor.
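
That is, either of these (assuming the constructor accepts preload_metadata, per the linked source):

# settings.py
AWS_PRELOAD_METADATA = True

# or, subclassing:
from storages.backends.s3boto import S3BotoStorage

class PreloadedS3BotoStorage(S3BotoStorage):
    def __init__(self, *args, **kwargs):
        kwargs['preload_metadata'] = True
        super(PreloadedS3BotoStorage, self).__init__(*args, **kwargs)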

Hope that helps,

Jannis

Jannis Leidel

Oct 1, 2012, 3:47:24 AM
to django-d...@googlegroups.com

On 30.09.2012, at 23:41, Dan Loewenherz <d...@dlo.me> wrote:

> Many backends don't support last modified times, and even if they all did, it's incorrect to assume that last modified time is an accurate heuristic for whether a file has already been uploaded or not.

Well but it's an accurate way to decide whether a file has been changed on the filesystem, and that's what collectstatic cares about. The storage backend *is* the API to extend that when needed, so feel free to use it.

> It might be a better idea to let the backends decide when a file has been changed (instead of just calling the backend's last modified method).

I don't understand, you can easily implement exactly that in the last_modified method if you'd like.

Jannis

Dan Loewenherz

Oct 6, 2012, 1:48:56 PM
to django-d...@googlegroups.com
Hey Jannis,

On Mon, Oct 1, 2012 at 12:47 AM, Jannis Leidel <lei...@gmail.com> wrote:

On 30.09.2012, at 23:41, Dan Loewenherz <d...@dlo.me> wrote:

> Many backends don't support last modified times, and even if they all did, it's incorrect to assume that last modified time is an accurate heuristic for whether a file has already been uploaded or not.

Well but it's an accurate way to decide whether a file has been changed on the filesystem, and that's what collectstatic cares about. The storage backend *is* the API to extend that when needed, so feel free to use it.

It's accurate *only* in certain situations. And on a distributed development team, I've run into a lot of issues with developers re-uploading files that have already been uploaded because they just recently updated their repo.

A checksum is the only truly accurate method to determine whether a file has changed.

Additionally, you didn't address the point I quoted. Storage backends don't just reflect filesystems--they could reflect files stored in a database, S3, etc. And some of these systems don't support last modified times.

> It might be a better idea to let the backends decide when a file has been changed (instead of just calling the backend's last modified method).

I don't understand, you can easily implement exactly that in the last_modified method if you'd like.

This is a bit confusing...why call it last_modified when that doesn't necessarily reflect what it's doing? It would be more flexible to create two methods:

def modification_identifier(self):

def has_changed(self):

Then, any backend could implement these however it likes, and collectstatic would have no excuse for uploading the same file more than once. Overloading last_modified to also do things like calculate md5s seems a bit hacky to me, and confusing for any developer maintaining a custom storage backend that doesn't support last modified.
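
For illustration, a checksum-based backend could then look like this (sketch only; I've given the methods a name argument here, which a real API would need):

import hashlib

from django.core.files.storage import Storage

class MyStorage(Storage):
    def modification_identifier(self, name):
        # Any opaque token works: an mtime, an md5, an ETag...
        md5 = hashlib.md5()
        f = self.open(name)
        try:
            for chunk in f.chunks():
                md5.update(chunk)
        finally:
            f.close()
        return md5.hexdigest()

    def has_changed(self, name, identifier):
        return self.modification_identifier(name) != identifier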

Dan

Dan Loewenherz

Oct 7, 2012, 11:58:26 PM
to django-d...@googlegroups.com
This issue just got me again tonight, so I'll try to push on it once more. It seems most people don't care that this is broken right now, which is a bummer; in that case I'll just continue using my working solution.

Dan

ptone

Oct 8, 2012, 1:38:07 AM
to django-d...@googlegroups.com, d...@dlo.me
So after scanning this thread and the ticket again - it is still unclear whether there can be a completely universal solution.

While it would be nice if the storage API had a checksum(name) or md5(name) method - not all custom storage backends are going to support a single checksum standard.  S3 doesn't explicitly support MD5 (apparently it unofficially does through ETags).  Without a universal checksum - you can't use it to compare files across arbitrary backends.

I do agree that hacking the modified_time return value is a little ugly - the API is clearly documented as "returns a datetime..." - so returning an MD5 checksum there is, well, hacky.

If you are passionate about moving this forward, here is what I'd suggest.

Implement, document, and test .md5(name) as a standard method on storage backends - like modified_time this would raise NotImplementedError if not available - this could easily be its own ticket. md5 is probably the closest you'll get to a checksum standard.

Once you have an md5 method defined for backends - you could support a --md5 option to collectstatic that would use that as the target/source comparison.
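
A sketch of how the command could use it (the option and attribute names here are placeholders, not real Django API):

# inside the "should we copy this file?" decision in collectstatic,
# when --md5 was passed:
if self.use_md5 and self.storage.exists(prefixed_path):
    try:
        unchanged = (self.storage.md5(prefixed_path) ==
                     source_storage.md5(path))
    except NotImplementedError:
        unchanged = False  # either side lacks md5(); copy as usual
    if unchanged and not self.force:
        self.log("Skipping '%s' (md5 unchanged)" % path)
        return False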

Another workaround is to just use collectstatic locally - and rsync --checksum to your remote if it supports rsync.

-Preston

Jannis Leidel

Oct 8, 2012, 6:33:12 AM
to django-d...@googlegroups.com

On 06.10.2012, at 19:48, Dan Loewenherz <d...@dlo.me> wrote:

> Hey Jannis,
>
> On Mon, Oct 1, 2012 at 12:47 AM, Jannis Leidel <lei...@gmail.com> wrote:
>
> On 30.09.2012, at 23:41, Dan Loewenherz <d...@dlo.me> wrote:
>
> > Many backends don't support last modified times, and even if they all did, it's incorrect to assume that last modified time is an accurate heuristic for whether a file has already been uploaded or not.
>
> Well but it's an accurate way to decide whether a file has been changed on the filesystem, and that's what collectstatic cares about. The storage backend *is* the API to extend that when needed, so feel free to use it.
>
> It's accurate *only* in certain situations. And on a distributed development team, I've run into a lot of issues with developers re-uploading files that have already been uploaded because they just recently updated their repo.
>
> A checksum is the only truly accurate method to determine whether a file has changed.
>
> Additionally, you didn't address the point I quoted. Storage backends don't just reflect filesystems--they could reflect files stored in a database, S3, etc. And some of these systems don't support last modified times.

Then, frankly, this is a problem of the storage backends, not Django's. The S3BotoStorage backend *does* have a modified_time method:

https://bitbucket.org/david/django-storages/src/1574890d87be/storages/backends/s3boto.py#cl-298

What storage backend do you use that doesn't have a modified_time method?

> This is a bit confusing...why call it last_modified when that doesn't necessarily reflect what it's doing? It would be more flexible to create two methods:

It's called modified_time, not last_modified.

> def modification_identifier(self):
>
> def has_changed(self):
>
> Then, any backend could implement these however they might like, and collectstatic would have no excuse in uploading the same file more than once. Overloading last_modified to also do things like calculate md5's seems a bit hacky to me, and confusing for any developer maintaining a custom storage backend that doesn't support last modified.

I disagree, modified_time is perfectly capable of handling your use case.

Jannis

Jannis Leidel

Oct 8, 2012, 6:50:23 AM
to django-d...@googlegroups.com

On 08.10.2012, at 07:38, ptone <pre...@ptone.com> wrote:

> So after scanning this thread and the ticket again - it is still unclear whether there can be a completely universal solution.
>
> While it would be nice if the storage API had a checksum(name) or md5(name) method - not all custom storage backends are going to support a single checksum standard. S3 doesn't explicitly support MD5 (apparently it unofficially does through ETags). Without a universal checksum - you can't use it to compare files across arbitrary backends.

You're able to ask S3 for the date of last modification, so I don't see why a comparison by hashing the file content is needed additionally. Django would have to download the full file to do that on its side, and I'm not aware of an API for getting a hash from Cloud Files, S3, etc.

> I do agree that hacking the modified_time return value is a little ugly - the API is clearly documented as "returns a datetime..." - so returning an MD5 checksum there is, well, hacky.

I beg to differ, returning a datetime object makes absolute sense for comparing it to another datetime object. What I meant before is that the modified_time method can be written however the user wants as long as it returns a datetime object, even a date that is known to be older than the file on disk.

> If you are passionate about moving this forward, here is what I'd suggest.
>
> Implement, document, and test .md5(name) as a standard method on storage backends - like modified_time this would raise NotImplementedError if not available - this could easily be its own ticket. md5 is probably the closest you'll get to a checksum standard.

-1

Jannis


Dan Loewenherz

Oct 8, 2012, 11:48:53 AM
to django-d...@googlegroups.com
On Mon, Oct 8, 2012 at 3:33 AM, Jannis Leidel <lei...@gmail.com> wrote:
Then, frankly, this is a problem of the storage backends, not Django's. The S3BotoStorage backend *does* have a modified_time method:

  https://bitbucket.org/david/django-storages/src/1574890d87be/storages/backends/s3boto.py#cl-298

What storage backend do you use that doesn't have a modified_time method?

I don't think you're seeing the problem I'm having. I'm working with a distributed team using git. This means when we check out files, the local modified time is the time at which I checked the files out, not the time at which the files were actually last modified.

As a result, it's a questionable metric for figuring out if a file is the same or not, since every team member's local machine thinks they were all just created! We end up re-uploading the file every time.

> This is a bit confusing...why call it last_modified when that doesn't necessarily reflect what it's doing? It would be more flexible to create two methods:

It's called modified_time, not last_modified.

Sorry, typo.
 

> def modification_identifier(self):
>
> def has_changed(self):
>
> Then, any backend could implement these however it likes, and collectstatic would have no excuse for uploading the same file more than once. Overloading last_modified to also do things like calculate md5s seems a bit hacky to me, and confusing for any developer maintaining a custom storage backend that doesn't support last modified.

I disagree, modified_time is perfectly capable of handling your use case.

No, it doesn't address my needs, as I described above.

Dan

Dan Loewenherz

Oct 8, 2012, 11:56:08 AM
to django-d...@googlegroups.com
On Mon, Oct 8, 2012 at 3:50 AM, Jannis Leidel <lei...@gmail.com> wrote:
You're able to ask S3 for the date of last modification, so I don't see why a comparison by hashing the file content is needed additionally. Django would have to download the full file to do that on its side, and I'm not aware of an API for getting a hash from Cloud Files, S3, etc.

S3 stores the MD5 info in an ETag header.

Regarding Cloudfiles, this is what Rackspace has to say:

You can ensure end-to-end data integrity by including an MD5 checksum of your object's data in the ETag header. You are not required to include the ETag header, but it is recommended to ensure that the storage system successfully stored your object's content.
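
e.g. with boto (bucket and key names here are placeholders; caveat: multipart uploads get a non-MD5 ETag, so this only holds for simple PUTs):

from boto.s3.connection import S3Connection

# Credentials come from the environment or boto config.
key = S3Connection().get_bucket('my-bucket').get_key('static/js/app.js')
if key is not None:
    print key.etag.strip('"')  # the hex MD5 digest of the stored object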

Dan

ptone

Oct 8, 2012, 1:06:43 PM
to django-d...@googlegroups.com, d...@dlo.me


On Monday, October 8, 2012 8:49:58 AM UTC-7, Dan Loewenherz wrote:

On Mon, Oct 8, 2012 at 3:33 AM, Jannis Leidel <lei...@gmail.com> wrote:
Then, frankly, this is a problem of the storage backends, not Django's. The S3BotoStorage backend *does* have a modified_time method:

  https://bitbucket.org/david/django-storages/src/1574890d87be/storages/backends/s3boto.py#cl-298

What storage backend do you use that doesn't have a modified_time method?

I don't think you're seeing the problem I'm having. I'm working with a distributed team using git. This means when we check out files, the local modified time is the time at which I checked the files out, not the time at which the files were actually last modified.

As a result, it's a questionable metric for figuring out if a file is the same or not, since every team member's local machine thinks they were all just created! We end up re-uploading the file every time.

While git may be common, and your problem not unique - this is still a condition of your dev environment rendering modification dates invalid. There might be other situations where this is the case (I've run into scripts that muck with modification dates based on camera/jpeg metadata).
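
(That particular condition at least has a workaround outside Django: reset each tracked file's mtime to its last commit time after checkout. An untested sketch:)

import os
import subprocess

def reset_mtimes_from_git(repo_root):
    # Give every tracked file its last commit time as mtime, so that
    # modified_time comparisons survive a fresh clone or pull.
    for rel_path in subprocess.check_output(
            ['git', 'ls-files'], cwd=repo_root).splitlines():
        stamp = subprocess.check_output(
            ['git', 'log', '-1', '--format=%ct', '--', rel_path],
            cwd=repo_root).strip()
        if stamp:
            mtime = int(stamp)
            os.utime(os.path.join(repo_root, rel_path), (mtime, mtime))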

So after some further discussion on IRC - it was determined that md5, while somewhat common, is far from a standard, and is likely not to be available as a remote call for network-based storage backends. And so the final resolution is to wontfix the ticket.

In the end - this lack of a universal fingerprint is just a limitation of our storage tools.

-Preston

Alex Ogier

Oct 8, 2012, 1:23:49 PM
to django-d...@googlegroups.com
On Mon, Oct 8, 2012 at 1:06 PM, ptone <pre...@ptone.com> wrote:
While git may be common, and your problem not unique - this is still a condition of your dev environment rendering modification dates invalid. There might be other situations where this is the case (I've run into scripts that muck with modification dates based on camera/jpeg metadata).

So after some further discussion on IRC - it was determined that md5, while somewhat common, is far from a standard, and is likely not to be available as a remote call for network-based storage backends. And so the final resolution is to wontfix the ticket.

In the end - this lack of a universal fingerprint is just a limitation of our storage tools.

-Preston

Is there a reason this fingerprint must be universal? If you're dealing with a backend like S3, where network latency and expensive writes are a problem but MD5 is a built-in remote call (available on any GET), why not just do an MD5 sum in the _save() method? Basically, buffer the File object you receive, take an MD5 in Python, and then decide whether to upload or not. In the common case of reading from local disk and writing to S3, this is a big win, and it doesn't require cooperation from any other backends or standardizing on MD5 as a fingerprint method.
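
A rough, untested sketch of that idea (it leans on the bucket attribute and the private _clean_name/_normalize_name helpers from the linked s3boto source, and assumes a non-multipart ETag):

import hashlib

from storages.backends.s3boto import S3BotoStorage

class SkipUnchangedS3BotoStorage(S3BotoStorage):
    def _save(self, name, content):
        cleaned = self._clean_name(name)
        # Buffer the incoming content and hash it locally...
        content.seek(0)
        local_md5 = hashlib.md5(content.read()).hexdigest()
        content.seek(0)
        # ...then compare against the ETag S3 already holds for the key.
        key = self.bucket.get_key(self._normalize_name(cleaned))
        if key is not None and key.etag.strip('"') == local_md5:
            return cleaned  # identical content already on S3: skip it
        return super(SkipUnchangedS3BotoStorage, self)._save(name, content)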

Best,
Alex Ogier 

Jeremy Dunck

Oct 8, 2012, 11:14:48 PM
to django-d...@googlegroups.com, django-d...@googlegroups.com
Would it be reasonable to have a backend-specific hook to determine a fingerprint, where that could be mtime or md5 or whathaveyou as long as equality (or maybe ordering) works?


--
You received this message because you are subscribed to the Google Groups "Django developers" group.

Stephen Burrows

Oct 9, 2012, 1:28:42 AM
to django-d...@googlegroups.com
I'm a little confused by the track the discussion took recently... my impression was that the solution would *not* be to change from last_modified to a checksum, or to add an additional checksum method. Instead, storage backends could support a has_changed method, which could be overridden to provide last_modified checking *or* checksum comparisons - or any other method of checking whether a file has changed. This seems like a useful, easy-to-explain, very generic way to handle checking whether a file has changed. And since what staticfiles actually cares about is whether the file has changed, it seems like it would make more sense to use a method that checks whether the file has changed, rather than just checking the last modification date.

Would I be correct in thinking that the main argument against this is API bloat, or is there something else that I'm not seeing?

Florian Apolloner

Oct 9, 2012, 4:11:17 AM
to django-d...@googlegroups.com
Hi Jeremy,


On Tuesday, October 9, 2012 5:15:04 AM UTC+2, jdunck wrote:
Would it be reasonable to have a backend-specific hook to determine a fingerprint, where that could be mtime or md5 or whathaveyou as long as equality (or maybe ordering) works?

Given our discussion in IRC yesterday the answer seems to be no. E.g. consider this: collectstatic actually uses two backends, a "source" storage backend and an "upload" storage backend. Now if the storage backends are allowed to return any fingerprint, how would you make sure that both of these backends actually use the same fingerprint? So essentially, to have an S3 backend which (let's just abuse modified_time for now) returns an md5 hash from modified_time, you'd have to also change the other backend you want to use (usually just the local filesystem backend) to return md5 instead…
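
Schematically, the comparison spans both backends, so whatever the two modified_time implementations return has to be mutually comparable:

# schematic, not the actual collectstatic code:
source_stamp = source_storage.modified_time(path)         # local FS: a datetime
target_stamp = self.storage.modified_time(prefixed_path)  # e.g. S3: an md5 string?
if target_stamp >= source_stamp:  # only meaningful if both backends
    skip_the_file()               # return the same kind of value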

Cheers,
Florian

Florian Apolloner

Oct 9, 2012, 4:20:39 AM
to django-d...@googlegroups.com
Hi Stephen,


On Tuesday, October 9, 2012 7:28:43 AM UTC+2, Stephen Burrows wrote:
I'm a little confused by the track the discussion took recently... my impression was that the solution would *not* be to change from last_modified to a checksum, or to add an additional checksum method. Instead, storage backends could support a has_changed method, which could be overridden to provide last_modified checking *or* checksum comparisons - or any other method of checking whether a file has changed. This seems like a useful, easy-to-explain, very generic way to handle checking whether a file has changed.

This would be one way to do it (we've also discussed that in IRC), but we couldn't figure out any implementation that would actually be usable. To make this work you'd need a function signature like "def has_changed(file_name, other_thing)", where other_thing would be some hash/modified_time/whatever provided by the source storage. Now we are back to the situation I described in my answer to Jeremy…

 
And since what staticfiles actually cares about is whether the file has changed, it seems like it would make more sense to use a method that checks whether the file has changed, rather than just checking the last modification date.

Well, staticfiles doesn't care, it's only collectstatic which cares to some extent. So in my opinion the cleanest way (which means without coupling the two needed storage backends together too strongly) is to provide your own collectstatic command where you can do what you want with the checks (if you have problems implementing that, we'd happily move some code to extra methods to make reusing collectstatic easier).
 
Would I be correct in thinking that the main argument against this is API bloat, or is there something else that I'm not seeing?

I'd rather say: We don't see __any__ nice way to actually support something which is generic enough to be useful.

Cheers,
Florian

Stephen Burrows

Oct 10, 2012, 6:41:37 PM
to django-d...@googlegroups.com
Thanks, Florian. That clarifies things. I didn't realize collectstatic was using two backends, but it makes sense. And thanks for putting up with me not being up to speed on the IRC discussion.