I have written a large application for law firms to manage their files
and communications. They can upload files (PDFs, docs, etc.) and
generate them using the system. I have also developed a fully
integrated communication component supporting online email and fax, so
every email attachment that they send or receive gets stored as well.
So basically I have a system with tons of files (about 2GB for this
one law firm). Currently I store these files on disk, and keep a
reference to them in the database. This has been fine for this one
client, but they now wish to take this application to market and build
it out under a SaaS model, so some transitioning is going to happen.
I am wondering how you guys (and gals) manage files in your
applications? Our target is to build this system out to support 200
law firms, at which point we would rebuild the system with SaaS in
mind from the start. 400GB of file storage does not seem practical in
this case.
So how do you folks handle things?
Do you typically ZIP uploaded files?
Do you use OS directory compression on file storage folders? That
would be an easy fix requiring no code adjustment.
Do you implement archiving in your application? If so, how do you go
about this?
Do you store files in the DB? Is this really practical when storage
might be 100GB or more? I would think that DB backups would be hell.
Any input would be appreciated.
-Brian
My last project was for a SaaS company that dealt with graphics files
(from 100 KB to 150 MB each!). Storage when I left was around 4 TB of
files: 2 TB on-line and 2 TB near-line.
On a very basic level, we stored pointers in the DB. There was an
abstraction around this called "the asset system".
It had the concept of "assets" (files), which were stored in "asset
stores" (folders).
Asset stores had both a UNC path to the folder and a public URI, if
applicable. Assets contained pointers to the file name (generating
these file names took a few attempts to get right due to concurrency
issues, e.g. 20 files uploaded at the same time).
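
One way to avoid file-name collisions under concurrent uploads is to
combine a random name with an atomic "create only if absent" open.
This is a minimal sketch, not the asset system's actual code: the
function name allocate_file_name, the max_attempts parameter and the
use of UUIDs are all assumptions made for illustration.

```python
import os
import tempfile
import uuid


def allocate_file_name(store_dir: str, extension: str, max_attempts: int = 5) -> str:
    """Reserve a unique, empty file in store_dir and return its full path.

    Hypothetical helper: the original post does not say how names were generated.
    """
    for _ in range(max_attempts):
        candidate = os.path.join(store_dir, uuid.uuid4().hex + extension)
        try:
            # O_CREAT | O_EXCL makes creation atomic: the call fails if the
            # name already exists, so two uploads cannot claim the same file.
            fd = os.open(candidate, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue  # very unlikely with random names; try another one
        os.close(fd)
        return candidate
    raise RuntimeError("could not allocate a unique file name")


# Usage: reserve 20 names in a throwaway directory, as if 20 uploads arrived.
with tempfile.TemporaryDirectory() as demo_dir:
    reserved = [allocate_file_name(demo_dir, ".pdf") for _ in range(20)]
    print(len(set(reserved)), "distinct names reserved")
```

The reserved path would then be written to and recorded in the asset's
database row.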
Stores were configured with a maximum size, allocation priority, etc.
They were organised into logical management units called "asset
farms", (per client, per file type, etc).
Adding more storage was simply a case of adding a new Asset Store and
allocating it to an Asset Farm. This Asset Store could reside on any
server, SAN, NAS, etc., because to the software it was really just a
UNC path to a folder.
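
The store/farm arrangement described above can be sketched roughly as
follows. The class and field names (AssetStore, AssetFarm, max_bytes,
priority, allocate) and the rule "lowest priority number fills first"
are assumptions for illustration; the post only says stores had a
maximum size and an allocation priority.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AssetStore:
    unc_path: str                     # e.g. r"\\nas1\assets01" (hypothetical path)
    max_bytes: int                    # configured maximum size of the store
    priority: int                     # assumption: lower number is filled first
    public_uri: Optional[str] = None  # only set if the store is web-visible
    used_bytes: int = 0

    def has_room_for(self, size: int) -> bool:
        return self.used_bytes + size <= self.max_bytes


@dataclass
class AssetFarm:
    name: str                         # e.g. one farm per client or file type
    stores: List[AssetStore] = field(default_factory=list)

    def allocate(self, size: int) -> AssetStore:
        """Pick the highest-priority store that still has room for the file."""
        for store in sorted(self.stores, key=lambda s: s.priority):
            if store.has_room_for(size):
                store.used_bytes += size
                return store
        raise RuntimeError(f"farm {self.name!r} is full; add another store")


# Usage: growing capacity is just appending another store to the farm.
farm = AssetFarm("client-a", [AssetStore(r"\\nas1\assets01", 100, priority=1)])
farm.stores.append(AssetStore(r"\\nas2\assets02", 1000, priority=2))
print(farm.allocate(80).unc_path)   # fits in the first store
print(farm.allocate(80).unc_path)   # first store is too full, falls through
```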
We didn't worry about ZIP/compressing uploaded files, as storage is
cheap these days.
We implemented archiving to near-line, purging after N years, etc.
These sorts of issues become a lot easier to deal with when each
file has a corresponding database record with metadata.
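
To illustrate why per-file metadata makes archiving and purging easy,
here is a minimal sketch that decides what to move or delete purely
from database-style records. The record fields (created, tier) and the
day-based thresholds are assumptions; the post does not describe the
actual retention rules.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


@dataclass
class AssetRecord:
    asset_id: int
    created: datetime
    tier: str = "online"   # assumption: either "online" or "nearline"


def plan_retention(
    records: Iterable[AssetRecord],
    now: datetime,
    archive_after_days: int,
    purge_after_days: int,
) -> Tuple[List[int], List[int]]:
    """Return (ids to move to near-line, ids to purge) based on age alone."""
    to_archive: List[int] = []
    to_purge: List[int] = []
    for record in records:
        age = now - record.created
        if age >= timedelta(days=purge_after_days):
            to_purge.append(record.asset_id)
        elif age >= timedelta(days=archive_after_days) and record.tier == "online":
            to_archive.append(record.asset_id)
    return to_archive, to_purge


# Usage: archive after one year, purge after seven (illustrative numbers).
now = datetime(2024, 1, 1)
records = [
    AssetRecord(1, datetime(2023, 11, 1)),               # recent, left alone
    AssetRecord(2, datetime(2022, 6, 1)),                # old, still on-line
    AssetRecord(3, datetime(2022, 6, 1), "nearline"),    # old, already archived
    AssetRecord(4, datetime(2015, 1, 1), "nearline"),    # past retention
]
print(plan_retention(records, now, archive_after_days=365, purge_after_days=7 * 365))
```

In a real system the loop would be a query against the asset table,
with the file moves and deletes driven by the returned ids.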
The biggest tip I can give is to come up with a nice API for dealing
with your "storage system abstraction", the rest becomes really easy.
I wouldn't store images in the DB as you are really just wasting DB
resources (which are expensive) for a problem that can be solved at
the app level.
Dave