Pulling data from DB into memory

Ivan Sagalaev

unread,

Feb 18, 2009, 3:20:39 AM2/18/09

to django-d...@googlegroups.com

Hello!

This Malcolm's reply in a nearby thread:

> Unless you're
> using really huge querysets, the memory usage is not going to kill you.
> Pulling back the huge number of results already uses a bunch of memory
> and that's a property of the db wrapper.

... has reminded me that this behavior was bothering me for some time.
In fact pulling everything into a local memory on .execute('select') is
not an inherent property of a db wrapper. MySQLdb and psycopg2 can do
server-side cursors and I'm pretty sure Oracle's driver is as able.

There even was a discussion about this feature [1] where Malcolm has
raised some practical questions. I believe this feature is very useful
in some cases where one wants to iterate over a queryset and count
something or dump its content into another storage. So is there an
interest in implementing this behavior to be controllable from queryset
interface?

[1]:
http://groups.google.com/group/django-developers/browse_frm/thread/f7b78b399c478cf

Malcolm Tredinnick

unread,

Feb 18, 2009, 5:36:36 PM2/18/09

to django-d...@googlegroups.com

On Wed, 2009-02-18 at 11:20 +0300, Ivan Sagalaev wrote:
> Hello!
>
> This Malcolm's reply in a nearby thread:
>
> > Unless you're
> > using really huge querysets, the memory usage is not going to kill you.
> > Pulling back the huge number of results already uses a bunch of memory
> > and that's a property of the db wrapper.
>
> ... has reminded me that this behavior was bothering me for some time.
> In fact pulling everything into a local memory on .execute('select') is
> not an inherent property of a db wrapper. MySQLdb and psycopg2 can do
> server-side cursors and I'm pretty sure Oracle's driver is as able.
>
> There even was a discussion about this feature [1] where Malcolm has
> raised some practical questions.

So, once again, it was Jeremy stirring the pot and it's you keeping the
issue alive a couple of years later. Clearly I have to create a whole
new category of troublemaker in my little book. The current severity
levels don't go high enough. :-)

I guess you should be thanked for having a good memory, though. But
don't think that's getting you back in the good books.

> I believe this feature is very useful
> in some cases where one wants to iterate over a queryset and count
> something or dump its content into another storage. So is there an
> interest in implementing this behavior to be controllable from queryset
> interface?
>
> [1]:
> http://groups.google.com/group/django-developers/browse_frm/thread/f7b78b399c478cf

I think the comments and requests for information Younger Me made in
that thread are still valid. These days I would also throw in checking
how concurrent cursor behaviour on the same connection works. We already
know that SQLite cannot handle creating a new result set / cursor when
an existing cursor is only partially read and I wonder if there are
similar issues like that (again, it's related to lifecycle and resource
management of the server-side data).

As I wrote that previous paragraph, I also remembered [2], mentioning
unbuffered cursors in MySQLdb (at least in the future, as of a year ago)
and the need for one cursor at a time in that case. I haven't
investigated that any deeper than re-reading the post just now, but it's
something I would want to understand before making decisions about what
an API looks like here.

I suspect it is probably possible to make this work. How much it
complicates Django internals and userspace usage is something to
understand first, though.

[2] http://mysql-python.blogspot.com/2008/03/i-am-not-dead.html

Regards,
Malcolm

Ivan Sagalaev

unread,

Feb 18, 2009, 6:47:28 PM2/18/09

to django-d...@googlegroups.com

Malcolm Tredinnick wrote:
> So, once again, it was Jeremy stirring the pot and it's you keeping the
> issue alive a couple of years later.

Well, the issue wasn't resolved. What could I do? :-)

> I think the comments and requests for information Younger Me made in
> that thread are still valid. These days I would also throw in checking
> how concurrent cursor behaviour on the same connection works. We already
> know that SQLite cannot handle creating a new result set / cursor when
> an existing cursor is only partially read and I wonder if there are
> similar issues like that (again, it's related to lifecycle and resource
> management of the server-side data).

I think we can solve the whole issue of different concurrency levels
within one connection among backends just by having more than one
connection. I.e. make a new connection for each queryset requested with
a server-side flag.

API may be as simple as:

MyModel.objects.all().server_side()

which sould trigger a flag in a queryset instance that is then passed
onto backends cursor() method. This will require changing all backends
to hold a list of opened connections but I thinks it's not too
complicated. Luckily there's no public API in Django for accessing
connection objects directly. There's db._connection but it has '_'
before it for a reason :-)

Another option is to reuse an old proposal of Simon Willison on .using()
method for querysets that switches it between several configured
databases. Then we can invent a new setting system for multiple DBs like:

DATABASE = {
'default': ...,
'huge_archive': {
'name': ...,
'user_server_side_cursor': True,
}
}

And then a queryset can `.using('huge_archive')` to use server-side
cursors. But I like this thing less because it's more complex and it
confuses into one method slightly different things.

> As I wrote that previous paragraph, I also remembered [2], mentioning
> unbuffered cursors in MySQLdb (at least in the future, as of a year ago)
> and the need for one cursor at a time in that case.

I believe Andy was talking about MySQLdb > 1.2 but we're still on 1.2 so
it's not something I'd consider here.

Reply all

Reply to author

Forward