Old Blob / attachments folder cleanup


AMcBain

Feb 7, 2011, 6:17:00 AM
to play-framework
I may have overlooked a solution to the problem below, but I am posting
it in hopes of getting a complete resolution, as there does not seem to
be one on this group so far or in the documentation on the Play!
website.


Preface:

With Blobs in Play! 1.1.x, every time a user uploads a new file a new
destination file name is used. New upload files will not overwrite old
upload files. FileAttachments from Play! 1.0.x have more deterministic
destination file names, allowing new uploads to overwrite old uploads.
The naming system used by 1.1.x's Blobs will leave files on the disk
that are no longer associated with a database entry. Blobs associated
with entities which are deleted do not result in the deletion of the
files on the disk associated with those Blobs.† Blob offers a method,
getFile, for returning the file on the disk associated with that
Blob (if there is one). However, a message on this group indicates
this method will become less visible (private) in a future release.
Lastly, an official Play! developer stated FileAttachments were
deprecated in favor of Blobs because of potential synchronization
issues between the database and the file system, and code in Play!
relating to handling of FileAttachment indicates support will
disappear in a future release.

† I do not know what happens with files associated with
FileAttachments. I honestly hadn't bothered to look before. I don't
believe the online documentation (website or JavaDoc) says.


Problem:

The main issue is disk cleanup. Old files hanging around will add up
to quite a bit of disk space over time, especially on very busy sites
whose core features are built around uploads. This issue applies
to applications where "conversion" from an Upload to a Blob is handled
automatically for the application (saving of an auto-bound Model
instance from request parameters).


Resolutions:

The current suggestion to deal with this issue, given by an official
Play! developer via a message to this group, was to create a periodic
Job to clean up old files. Inside such a Job, the code can get all
model entries with associated Blobs and delete all files on disk that
are not associated with a Blob. *I cannot see this scaling.* Also,
given we know the getFile method will go away, we no longer have a
valid way of determining names of files on disk that are considered
"live."
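
To make the periodic-Job idea concrete, here is a minimal sketch of the sweep itself in plain Java, not Play!'s own API. It assumes the caller has already collected the set of live file names (for example by calling getFile().getName() on each entity's Blob) and knows the attachments directory. The class name OrphanSweeper, the sweep signature, and the minimum-age guard are assumptions made for illustration; the guard is there so that a freshly uploaded file whose transaction has not committed yet is not mistaken for an orphan.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Play!.
class OrphanSweeper {

    /**
     * Deletes every regular file in attachmentsDir whose name is not in
     * liveNames and whose last-modified time is at least minAgeMillis old.
     * Returns the names of the files that were deleted.
     */
    static List<String> sweep(Path attachmentsDir, Set<String> liveNames,
                              long minAgeMillis) throws IOException {
        long cutoff = System.currentTimeMillis() - minAgeMillis;
        List<String> deleted = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(attachmentsDir)) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                // Skip directories and anything still referenced by a Blob.
                if (!Files.isRegularFile(file) || liveNames.contains(name)) {
                    continue;
                }
                // Skip files too recent to be safely judged orphaned.
                if (Files.getLastModifiedTime(file).toMillis() > cutoff) {
                    continue;
                }
                Files.delete(file);
                deleted.add(name);
            }
        }
        return deleted;
    }
}
```

In a Play! 1.x application this logic would presumably sit inside a Job's doJob(). Each run costs one pass over every Blob-owning row plus one directory listing, which is exactly the scaling concern raised above.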

Another "solution" is to have Blob data always stored in the database,
where it can more easily be updated. However, there is currently no
standard way of doing this (leaving it up to each developer to invent
their own way) and I know even the idea of storing files in a database
makes some developers cringe. Also, there are some databases where
storing potentially-large amounts of binary data could result in poor
database performance.

Finally, a custom Blob implementation could be written which returns to a
more deterministic naming convention like the one used by
FileAttachment. Similar to the previous resolution, this would
currently also require everyone needing such a solution to write their
own. Like FileAttachment, this approach may have similar
synchronization issues.

(The above list is by no means exhaustive, just the ones most likely
to come to mind first.)


Final thoughts:

There is currently no sane way to ensure proper disk clean up of old
file uploads. FileAttachment may seem like a nice place to retreat,
but as noted in the preface it has potential synchronization issues
and support will eventually disappear.

Quite frankly, I find it strange Play! does not offer any built-in way
to manage this issue as it will affect many applications. I can
understand it would not be an easy task given how Blob works. For
example, even if Blob offered a deletion method as part of its API,
there would still need to be a way to clean up old, no-longer-referenced
files. Lastly, I do not have any acceptable solutions to
this problem and I am unfortunately not currently in a position to be
able to share an implementation of one if I did.

I would very much appreciate it if I could get a final word on this.

grandfatha

Feb 7, 2011, 7:31:37 AM
to play-framework
Right now, I use a "pre-delete" callback on the entity that owns the
Blob to delete the file. I guess this won't work anymore, once the
underlying getFile() method gets removed.

I actually like the Django-approach. It decouples the problem by
introducing a pluggable "FileStorage" API that can be overridden by the
user. It defaults to a FileSystemStorage class that pretty much works
like the Play blob.

http://docs.djangoproject.com/en/dev/topics/files/

One can also implement a custom storage mechanism:

http://docs.djangoproject.com/en/dev/howto/custom-file-storage/

Guillaume Bort

Feb 7, 2011, 8:08:48 AM
to play-fr...@googlegroups.com
You can also write your own storage with Play. Just provide your own
implementation of BinaryField.


--
Guillaume Bort, http://guillaume.bort.fr

For anything work-related, use g...@zenexity.fr; for everything else,
write guillau...@gmail.com

Guillaume Bort

Feb 7, 2011, 8:12:04 AM
to play-fr...@googlegroups.com
The current implementation (with UID generated file name) is the only
way to make it work properly in all cases.

Because if you replace (or delete) the old binary file on the
filesystem, and if eventually the corresponding value is not updated
in the database (because of a transaction rollback) your system is
incoherent (file deleted while the row is still there in the database,
or wrong version of the binary file). It was the problem with the old
FileAttachment.
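
As an illustration of the rollback argument, here is a minimal plain-Java sketch of "never overwrite, always write under a fresh unique name". It is not Play!'s actual Blob code; the class name UniqueNameStore and the store method are hypothetical. Because the previous file is left untouched, a rolled-back transaction still finds the file its database row points to, and the only cost is a possible orphan.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

// Hypothetical helper, not part of Play!.
class UniqueNameStore {

    /**
     * Writes the upload to a new, randomly named file in dir and returns
     * that name. Existing files are never replaced or removed here.
     */
    static String store(Path dir, InputStream data) throws IOException {
        String name = UUID.randomUUID().toString();
        // Without REPLACE_EXISTING, Files.copy fails if the target exists,
        // so an older upload can never be clobbered by accident.
        Files.copy(data, dir.resolve(name));
        return name;
    }
}
```

The returned name is what would be saved in the database row. If the surrounding transaction rolls back, the row keeps its old name and the old file is still on disk; the newly written file simply becomes an orphan to be cleaned up later.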

Art McBain

Feb 9, 2011, 4:04:52 AM
to play-framework
Okay, I probably can't say this often enough now that I have realized my
mistake (and how terrible Google's interface is). Since I used the Groups
website instead of e-mail (I disabled e-mail updates because I
figured they would be a source of "spam"), I clicked "reply to author"
thinking it would just link my post to the one above it instead of
putting it at the end of the thread. I am very sorry that I may have
spammed grandfatha and Guillaume Bort; my aim was not to be a
spammer. On to what I wanted to say publicly:

@grandfatha
This still doesn't solve the issue completely. When a user uploads a
new file and updates a Blob field on an entity, a new file is created
on disk and you can no longer get a File object for the old one. So
deleting the files when the parent entity of a Blob is deleted won't
remove all files.

Art McBain

Feb 9, 2011, 4:11:45 AM
to play-framework
Again sorry about the spam. :(

@Guillaume Bort
Okay, I understand that. I'm okay with it. The problem is that this
process orphans files on disk when Blobs are updated with a new file
upload. I know it was previously suggested we just do cleanup with a
periodic Job, but even if I were to track the locations of old files
associated with a Blob for later deletion with said Job, the fact that
the getFile() method on Blob is going away makes this impossible (as
there is no way to get the name of the file so it can be tracked). It
would be nice to know if there is a suggested way to handle this that
(ideally) doesn't require an extra database table to track old files
or rely on an API method that will go away.

Art McBain

Feb 9, 2011, 11:38:59 PM
to play-framework
Okay, whee! So after deciding I should file a ticket for this, I went
to look up posts here on the group to give proper support for a more
concise "here's what needs fixed" description. I found the original
posting that I now notice I obviously misread:

http://groups.google.com/group/play-framework/browse_thread/thread/588a5b1c89b64cd0/85450c4d9a5f10cd#85450c4d9a5f10cd

According to this post, the getFile() method is *not* going away, and
was in fact exposed because it would be useful for cleaning up
orphans. So, knowing this, my special table tracking proposal from
above seems to stand as the best way to manage these orphans. Below is
how I believe this could be reasonably implemented.


**What is needed**
* A periodic Job subclass for cleaning up orphans.
* A table for tracking file names of orphans.

**Process**
When an entity is going to be saved or deleted, get a copy of the
entity from the database. Store the file names of the Blobs in that
entity to the tracking table. This must be done in the same
transaction to ensure you do not later clean up files still in use.

The Job then comes along and reads from the tracking table and deletes
the files on disk listed by the table and clears the tracking table.
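
The process above can be sketched roughly as follows in plain Java. This is only an illustration under stated assumptions: OrphanTracker and its methods are hypothetical names, and the in-memory list stands in for the tracking table, which in a real application would be rows written in the same transaction as the entity save or delete. The sketch also assumes a name is recorded only when the Blob's file actually changed or was removed, so that saving an entity with an unchanged Blob does not queue a live file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper; the list is a stand-in for a database table.
class OrphanTracker {

    private final List<String> pending = new ArrayList<>();

    /**
     * Call before saving or deleting an entity. oldName is the file name
     * read from the database copy; newName is the name about to be saved,
     * or null when the entity is being deleted.
     */
    void recordIfReplaced(String oldName, String newName) {
        if (oldName != null && !oldName.equals(newName)) {
            pending.add(oldName);
        }
    }

    /**
     * Body of the periodic job: deletes every tracked file in dir and
     * clears the tracking entries. Returns how many files were removed.
     */
    int sweep(Path dir) throws IOException {
        int removed = 0;
        for (Iterator<String> it = pending.iterator(); it.hasNext();) {
            String name = it.next();
            if (Files.deleteIfExists(dir.resolve(name))) {
                removed++;
            }
            it.remove(); // entry handled, drop it from the "table"
        }
        return removed;
    }
}
```

With a real table, a rollback of the entity save would also roll back the tracking row, so the job would never see a file name that is still referenced.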

**Upsides**
Unlike deleting files associated with Blobs when the parent entity is
deleted, you ensure files orphaned by updating a Blob (instead of
deleting the parent entity) are removed. This also avoids the case
where you delete the Blob-associated file, the transaction then rolls
back, and the row is left pointing at a file that no longer exists.

**Downsides**
This requires an extra database call every time you want to save an
entity with Blobs.



I hope this helps anyone who finds this in creating a useful clean-up
mechanism.