BinaryField in database

617 views
Skip to first unread message

Todd O'Bryan

unread,
Aug 5, 2006, 3:54:13 PM8/5/06
to django-d...@googlegroups.com
It seems that requests for some way to store binary data in the
database is a perennial request. I've seen comments (I think Adrian
said it "opens a can of mutated worms."), but never a real
discussion of what the problems would be.

There's a recent ticket, #2417, that adds support for "small-ish"
bits of binary data using a BinaryField. It uses the VARBINARY type
in MySQL, which is limited to 64k bytes of data. It also subclasses
CharField, so it will show up in the Admin, which may or may not be a
good idea.

The main problems I see with dealing with binary types in the
database involve making sure that you never have too much stuff in
memory at any time, either while you're loading the file into memory
or while you're outputting it to the HttpResponse. I solved this in a
Java webapp by breaking the file into chunks, storing each chunk
separately in the database, and uploading and downloading only a
chunk at a time.

Here's what I'm suggesting:

class DatabaseFile(models.Model):
name = models.CharField(maxlength=80)
content_type = models.TextField()
last_modified = models.DateTimeField()
size = models.IntegerField()
owner = models.ForeignKey(User, blank=True)

class DatabaseFileChunk(models.Model):
file = models.ForeignKey(DatabaseFile)
number = models.IntegerField()
content = models.BinaryField() #implemented with bytea in
Postgres and BLOB in MySQL

When a file is uploaded and the developer wants it to go into the
database, only 64kb of data (or so, but this seems reasonable) are
read into memory at a time and stuck into a DatabaseFileChunk,
numbered consecutively from 0 to however many are needed.

A DatabaseFile would be a file-like object, iterable, and would be
output to an HttpResponse a chunk at a time, so never more than about
64k of server memory is used to serve the file from the database.

Yes, this will be slower than having Apache serve the file directly,
but it has the huge advantage that the file is served as the result
of a view. That means you can do all kinds of interesting permission
checking, url mapping, and general futzing around internal to Django,
without having to interact with whichever web server you're using.

Using fairly big BLOBs locally (images of about 750kb), the file was
served almost instantaneously using Django's development server, so I
think the performance hit would likely be acceptable to people who
really want the ability to save files in the database rather than on
the filesystem. (And I, for one, desperately need that flexibility.
I'm going to have about 1900 users, and I don't want to have to route
files to folders, set Apache permissions, etc., when Django has such
a nice API for handling relations.)

I'm going to try to code this up this afternoon, but please let me
know if anyone sees huge problems with it.

Thanks,
Todd

Malcolm Tredinnick

unread,
Aug 5, 2006, 8:12:45 PM8/5/06
to django-d...@googlegroups.com
On Sat, 2006-08-05 at 15:54 -0400, Todd O'Bryan wrote:
> It seems that requests for some way to store binary data in the
> database is a perennial request. I've seen comments (I think Adrian
> said it "opens a can of mutated worms."), but never a real
> discussion of what the problems would be.
>
> There's a recent ticket, #2417, that adds support for "small-ish"
> bits of binary data using a BinaryField. It uses the VARBINARY type
> in MySQL, which is limited to 64k bytes of data. It also subclasses
> CharField, so it will show up in the Admin, which may or may not be a
> good idea.

It sounds like a bad idea to me. Attempting to display random binary
data as a string doesn't sound that useful and although the security
issues are solvable, they aren't necessarily easy.

[...]


> Yes, this will be slower than having Apache serve the file directly,
> but it has the huge advantage that the file is served as the result
> of a view. That means you can do all kinds of interesting permission
> checking, url mapping, and general futzing around internal to Django,
> without having to interact with whichever web server you're using.

I'm not a big fan of this idea, so I'm not going to address the details
(although it does make me think that we need a documented way of
extending fields so that people can do this without needing to dive into
the core and so that this stuff can be shipped independently of Django).
However, this paragraph needs correcting. You can *already* serve files
through views and take advantage of Django's permissions system. The
HTTP interfaces can take iterables as their source for output, so a
well-designed wsgi server should have no real memory problems with a
well-designed iterator. Or if somebody wants to store their static files
on another system entirely, you can do that with redirects (which could
be internal) and views and custom header (with encrypted token) as well.

The only thing we can't do internally at the moment is do direct
filehandle passing or anything like that.

Regards,
Malcolm

Todd O'Bryan

unread,
Aug 5, 2006, 10:07:07 PM8/5/06
to django-d...@googlegroups.com

Right. It is all possible, but it involves reading in a file from the
filesystem and re-outputting it through Django. Even though that's
the supposed reason that you should use the real filesystem--because
Django's ill-suited to serving files and it's better to use Apache
for that kind of stuff. (At least that's what the documentation says.)

And I agree with the sentiment, but when there are use cases that
require fine grained control over access to binary data, I find
myself wishing that it were at least possible to use the wonderful
database API that's built into Django to make it easier. I gave the
case of controlling permissions on user files that are uploaded. The
responses ranged from "Apache is not a CMS" to "store the file
somewhere inaccessible and then move it somewhere accessible when you
need to serve it." Isn't *Django* supposed to be a CMS? :-)

Another guy said that he needed transactional control on his
binaries, and asked for a link to where the issues of why Django
doesn't support binary database fields were discussed:

http://groups.google.com/group/django-developers/browse_thread/thread/
4ed4ca102062f8be/3c2193a8774c1455?lnk=gst&q=blob&rnum=2#3c2193a8774c1455

As of now, his question is unanswered.

So, if you have to run files through the system anyway, what are the
reasons people don't want BLOB/bytea fields in the database?

Todd

Malcolm Tredinnick

unread,
Aug 5, 2006, 10:44:59 PM8/5/06
to django-d...@googlegroups.com

I only brought this up so that you could focus you discussion on the new
bits you are developing (a new storage method, not new access controls).
The drawbacks and advantages of files through views are largely the
same, whether the files are in the database or on the filesystem,
although obviously there are some subtleties and differences in each
case, such as binary field size limits.

That is all I was trying to say. Keeping focused on the main point of
your addition will make it easier for everybody to follow the
details. :-)

[...]


> Isn't *Django* supposed to be a CMS? :-)

No. You could build a CMS with Django (e.g. Ellington), though.

[...]


> So, if you have to run files through the system anyway, what are the
> reasons people don't want BLOB/bytea fields in the database?

So far there hasn't been a really strong proposal to include one. The
size limitation with MySQL, for example, is very real. The need to
escape the data on the way in is an issue -- we need to ensure all the
backends do this properly. For example, although PostgreSQL can store
nuls in its bytea fields, you still need to octal-escape them on input.
That is something the forthcoming tests for ticket #2417 will need to
include.

Including a binary field type is not in any way ruled out -- in fact, as
you can see from #2417, it's considered promising. It just hasn't been
included yet. I would like to see somebody put some effort into
completing the necessary parts of #2417 so that we can commit it and
then people like you can start building filestorage methods or whatever
they like on top of it. It feels a little cart-before-the-horse at the
moment, since you're talking about an app that uses the binary field
before we even have a binary field to use.

Best wishes,
Malcolm

Jason F. McBrayer

unread,
Aug 7, 2006, 1:04:14 PM8/7/06
to django-d...@googlegroups.com
"Todd O'Bryan" <toddo...@mac.com> writes:

> Yes, this will be slower than having Apache serve the file directly,
> but it has the huge advantage that the file is served as the result
> of a view. That means you can do all kinds of interesting permission
> checking, url mapping, and general futzing around internal to Django,
> without having to interact with whichever web server you're using.

You can already do all those sorts of things by writing views that
read files from the filesystem and return them wrapped in a
HttpResponse. I've done this myself in a recent application. The
files just need to be stored outside of Apache's DocumentRoot (or
whatever the equivalent on lighttpd or your server of choice is).
That requirement interacts inconveniently with the admin app's
handling of FileFields, but there are at least two ways to work around
that. So you don't actually need to store files in the database in
order to do this.

--
+-----------------------------------------------------------+
| Jason F. McBrayer jmc...@carcosa.net |
| A flower falls, even though we love it; and a weed grows, |
| even though we do not love it. -- Dogen |

Reply all
Reply to author
Forward
0 new messages