queryset caching note in docs


Marco Paolini

Oct 28, 2011, 6:23:46 AM
to django-d...@googlegroups.com
Hi all,

I'd like to add a small note of warning about queryset caching in the docs,
in topics/db/queries.txt "Caching and QuerySets" section,
something like:

Keep in mind that when you loop through a queryset and alter data that
might be returned by that same queryset, the instances you get might not
be "fresh".

You might want to refresh the instance before calling some table-wide data-altering
method on it::

>>> for obj in Entry.objects.all():
...     obj = Entry.objects.get(pk=obj.pk)
...     obj.alter_some_other_entries()

Should I open a ticket, or is this too specific a use case?

Cheers,

Marco

Tom Evans

Oct 28, 2011, 6:58:30 AM
to django-d...@googlegroups.com
On Fri, Oct 28, 2011 at 11:23 AM, Marco Paolini <markop...@gmail.com> wrote:
> Hi all,
>
> I'd like to add a small note of warning about queryset caching in dos,
> in topics/db/queries.txt "Caching and QuerySets" section,
> something like:
>
> Keep in mind when looping through a queryset and altering data that
> might be returned by the same queryset, that instances might not
> be "fresh".
>
> You might want to refresh the instance before calling some table-wide
> data-altering
> method on it::
>
>    >>> for obj in Entry.objects.all():
>    >>>   obj = Entry.objects.get(pk=obj.pk)
>    >>>   obj.alter_some_other_entries()
>

Isn't that self evident? If you generate a list of models, and then
change the underlying table, why would they be fresh? It's also
nothing to do with queryset caching, or even querysets. You may as
well use this example:

obj1 = Entry.objects.get(id=i)
obj2 = Entry.objects.get(id=j)
obj1.alter_other_objects_in_table()
# oh noes, obj2 is not what is in the database

I'd go with something that explains that model instances are created
from the queryset as soon as you evaluate it, not as you iterate
through it, but I'm pretty sure that is already there.

Cheers

Tom

Marco Paolini

Oct 28, 2011, 8:05:55 AM
to django-d...@googlegroups.com
On 28/10/2011 12:58, Tom Evans wrote:
> On Fri, Oct 28, 2011 at 11:23 AM, Marco Paolini<markop...@gmail.com> wrote:
>> Hi all,
>>
>> I'd like to add a small note of warning about queryset caching in dos,
>> in topics/db/queries.txt "Caching and QuerySets" section,
>> something like:
>>
>> Keep in mind when looping through a queryset and altering data that
>> might be returned by the same queryset, that instances might not
>> be "fresh".
>>
>> You might want to refresh the instance before calling some table-wide
>> data-altering
>> method on it::
>>
>> >>> for obj in Entry.objects.all():
>> >>> obj = Entry.objects.get(pk=obj.pk)
>> >>> obj.alter_some_other_entries()
>>
>
> Isn't that self evident? If you generate a list of models, and then
> change the underlying table, why would they be fresh? It's also
> nothing to do with queryset caching, or even querysets. You may as
> well use this example:
"for obj in qs" does not simply generate a list of instances, it iterates through
qs and, at db API level it issues a fetchmany trying to hit database only when qs cache is empty.

Of course the qs cache is a list of instances.

>
> obj1 = Entry.objects.get(id=i)
> obj2 = Entry.objects.get(id=j)
> obj1.alter_other_objects_in_table()
> # oh noes, obj2 is not what is in the database
>
> I'd go with something that explains that model instances are created
> from the queryset as soon as you evaluate it, not as you iterate
> through it, but I'm pretty sure that is already there.

it's a bit more complex: there are basically two phases:
1) row fetching from db using cursor.fetchmany()
2) model instance creation in queryset

both are delayed as much as possible (queryset lazyness)

phase two (model instance creation) depends on how you evaluate the queryset:
- testing for truth will only create enough instances to fill the cache
- calling .exists() will try to fetch a single object and leave the cache alone
- iterating will create instances in batches
 - .iterator() will create the instances one at a time (but on the same db cursor, using cursor.fetchmany())
- many other options...

phase one can be executed in chunks, depending on the db backend (sqlite: no chunks, postgres: chunks)
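
To make those evaluation styles concrete, here is a minimal sketch (Entry is the example model from the docs, the app path is illustrative, and the comments describe the behaviour as outlined above; details may vary between Django versions):

from myblog.models import Entry  # illustrative app path

qs = Entry.objects.all()

if qs.exists():                  # issues a cheap LIMIT 1 query, leaves the qs cache alone
    pass

if qs:                           # bool() starts evaluating the qs and filling its cache
    pass

for obj in qs:                   # iterating creates instances in batches, cached on the qs
    pass

for obj in qs.iterator():        # bypasses the queryset cache entirely
    pass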


Queryset "laziness" MIGHT lead to a WRONG assumption that objects in a for loop are fresh.

The note only wanted to stress the fact that you'll get "stale" objects if you do table-wide altering operations inside the loop

I agree it's not surprising and it's not going to make your day happier, but it might be interesting to someone.

On the other hand, your objections make sense: it should be quite clear already that in the for loop you're fetching objects
using a cursor and altering data inside the loop will not change objects returned by queryset.

cheers,

Marco

Tom Evans

Oct 28, 2011, 9:55:58 AM
to django-d...@googlegroups.com
On Fri, Oct 28, 2011 at 1:05 PM, Marco Paolini <markop...@gmail.com> wrote:
> it's a bit more complex: there are basically two phases:
>  1) row fetching from db using cursor.fetchmany()
>  2) model instance creation in queryset
>
> both are delayed as much as possible (queryset lazyness)
>
> phase two (model instance creation) depends on how you evaluate the
> queryset:
>  - testing for truth will only create enough instances to fill the cache
>  - calling .exists() will try to fetch a single object and leave the cache
> alone
>  - iterating will create instances in batches
>  - .iterator() will create the instances one at the time (but on the same db
> cursos using cursor.fetchmany()
>  - many other options...
>

You are probably more informed about the internals than me, but isn't
it irrelevant when the instances are created? They are created from the
same query, even if they are not instantiated until you iterate to them.
Any changes to the database that happen after that query is executed
will not be reflected in the instances created from that query.

I.e. the critical point is when the query is executed, not when the data
is fetched from the db cursor.

Cheers

Tom

Marco Paolini

Nov 1, 2011, 8:41:37 PM
to django-d...@googlegroups.com
Yep, you're right

while digging into django and sqlite3 backend internals I discovered:

sqlite cursors from the same connection are NOT isolated:

>>> import sqlite3
>>> conn = sqlite3.connect('example.db')  # db path is illustrative
>>> curs1 = conn.cursor()
>>> curs2 = conn.cursor()
>>> curs1.execute('select * from entry')
>>> curs1.fetchmany(100)
[(1, 'hey'), (2, 'hey',)...
>>> curs2.execute('update entry set name =\'\'')
>>> curs1.fetchmany(100)
[(100, ''), (101, '',)...

to overcome this issue django fetches all cursor rows into memory in a single list while iterating
a resultset (see django/db/models/sql/compiler.py, SQLCompiler.execute_sql)

so if you do this:

for obj in Entry.objects.all():
    pass

django does this:
- creates a cursor
- then calls fetchmany(100) until ALL rows are fetched
- creates a list containing ALL fetched rows
- passes this list to queryset instance for lazy model instance creation

I didn't know that. (maybe we should document it somewhere...)

Now that I do, I also know it's time to move to postgresql...
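
For reference, a condensed sketch of that behaviour (loosely modelled on the Django 1.3-era SQLCompiler.execute_sql; the names and the chunk size are simplified and illustrative, not the real code):

def execute_sql_sketch(cursor, connection, chunk_size=100):
    def chunks():
        # phase one: pull rows from the DB-API cursor in chunks
        while True:
            rows = cursor.fetchmany(chunk_size)
            if not rows:
                break
            yield rows

    result = chunks()
    if not connection.features.can_use_chunked_reads:
        # sqlite3: cursors on the same connection are not isolated, so
        # drain everything into a list up front; the queryset still
        # creates model instances lazily from this list (phase two).
        return list(result)
    return result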

cheers,

Marco

Luke Plant

Nov 2, 2011, 4:43:24 AM
to django-d...@googlegroups.com
On 02/11/11 00:41, Marco Paolini wrote:

> so if you do this:
>
> for obj in Entry.objects.all():
> pass
>
> django does this:
> - creates a cursor
> - then calls fetchmany(100) until ALL rows are fetched
> - creates a list containing ALL fetched rows
> - passes this list to queryset instance for lazy model instance creation
>
> I didn't know that. (maybe we should document it somewhere...)
>
> Now that I do, I also know it's time to move to postgresql...

And you will then find that the behaviour of the psycopg2 adapter means
that you get very similar behaviour - all rows are fetched as soon as
you start iterating - even if you do .iterator().

Luke

--
"The number you have dialled is imaginary. Please rotate your
telephone by 90 degrees and try again."

Luke Plant || http://lukeplant.me.uk/

Marco Paolini

Nov 2, 2011, 4:48:43 AM
to django-d...@googlegroups.com
On 02/11/2011 09:43, Luke Plant wrote:
> On 02/11/11 00:41, Marco Paolini wrote:
>
>> so if you do this:
>>
>> for obj in Entry.objects.all():
>> pass
>>
>> django does this:
>> - creates a cursor
>> - then calls fetchmany(100) until ALL rows are fetched
>> - creates a list containing ALL fetched rows
>> - passes this list to queryset instance for lazy model instance creation
>>
>> I didn't know that. (maybe we should document it somewhere...)
>>
>> Now that I do, I also know it's time to move to postgresql...
>
> And you will then find that the behaviour of the psycopg2 adapter means
> that you get very similar behaviour - all rows are fetched as soon as
> you start iterating - even if you do .iterator().
thanks for pointing that out; do you see this as an issue to be fixed?

If there is some interest, I might give it a try.

Maybe it's not fixable; at least I can investigate a bit

Cheers,

Marco

Luke Plant

Nov 2, 2011, 5:10:16 AM
to django-d...@googlegroups.com
On 02/11/11 08:48, Marco Paolini wrote:

> thanks for pointing that to me, do you see this as an issue to be fixed?
>
> If there is some interest, I might give it a try.
>
> Maybe it's not fixable, at least I can investigate a bit

Apparently, the protocol between the Postgres client and server only
does partial sends when using named cursors, which Django doesn't use.
Using named cursors with psycopg2 is certainly possible, but probably
not trivial. That's as much as I know.
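
For what it's worth, a minimal psycopg2 sketch (outside Django) of a server-side "named" cursor; the connection parameters, table name and itersize are illustrative:

import psycopg2

conn = psycopg2.connect("dbname=test")          # illustrative DSN
try:
    curs = conn.cursor(name="entry_stream")     # a named cursor is server-side
    curs.itersize = 100                         # rows pulled per network round trip
    curs.execute("SELECT id, headline FROM blog_entry")
    for row in curs:
        pass                                    # rows arrive in chunks, not all at once
    conn.commit()                               # named cursors only live inside a transaction
finally:
    conn.close()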

Source:

http://thebuild.com/blog/2011/07/26/unbreaking-your-django-application/

The author of that page did say that he was working on a patch.

Marco Paolini

Nov 2, 2011, 6:47:14 AM
to django-d...@googlegroups.com
On 02/11/2011 10:10, Luke Plant wrote:
> On 02/11/11 08:48, Marco Paolini wrote:
>
>> thanks for pointing that to me, do you see this as an issue to be fixed?
>>
>> If there is some interest, I might give it a try.
>>
>> Maybe it's not fixable, at least I can investigate a bit
>
> Apparently, the protocol between the Postgres client and server only
> does partial sends when using named cursors, which Django doesn't use.
> Using named cursors with psycopg2 in certainly possible, but probably
> not trivial. That's as much as I know.
>
> Source:
>
> http://thebuild.com/blog/2011/07/26/unbreaking-your-django-application/
>
> The author of that page did say that he was working on a patch.
ok, I'll try to contact the author

talking about sqlite3, it looks like the only way to isolate two cursors
is to use two different connections.

Let's imagine there is a way to implement this (I'm not sure at this point)

We could have an option somewhere that tells django to use or not use
chunked cursor read from db.

if that option is true, sqlite should open one connection per cursor
and psycopg2 should use named cursors

This will cause some overhead for small querysets but will save some memory
for huge ones

cheers,

Marco

Anssi Kääriäinen

Nov 2, 2011, 7:05:01 AM
to django-d...@googlegroups.com, Marco Paolini
On 11/02/2011 12:47 PM, Marco Paolini wrote:
> if that option is true, sqlite shoud open one connection per cursor
> and psycopg2 should use named cursors
>
The sqlite behavior leads to some problems with transaction management -
different connections, different transactions (or is there some sort of
"shared transaction" in sqlite?). I would just fetch all the data in one
go when using sqlite. I wouldn't worry about performance problems when
using sqlite, it is meant mostly for testing when using Django.

> This will cause some overhead for small querysets but will save some memory
> for huge ones

For PostgreSQL this would be a nice feature. Any idea what MySQL and Oracle
do currently?

- Anssi

Marco Paolini

Nov 2, 2011, 7:28:45 AM
to Anssi Kääriäinen, django-d...@googlegroups.com
mysql can do chunked row fetching from the server, but only one row at a time:

curs = connection.cursor(MySQLdb.cursors.SSCursor)  # server-side cursor, i.e. mysql_use_result
curs.fetchmany(100)  # returns 100 rows, but pulls them from the server one by one

Marco

Marco Paolini

Nov 2, 2011, 7:36:08 AM
to Anssi Kääriäinen, django-d...@googlegroups.com
On 02/11/2011 12:05, Anssi Kääriäinen wrote:
maybe we could implement something like:

for obj in qs.all().chunked(100):
    pass

.chunked() will automatically issue LIMITed SELECTs

that should work with all backends

this could be a no-op for sqlite, where cursors are not isolated
and LIMITed SELECTs perform badly anyway

Marco

Marco Paolini

Nov 2, 2011, 8:21:43 AM
to django-d...@googlegroups.com
On 02/11/2011 12:05, Anssi Kääriäinen wrote:
> On 11/02/2011 12:47 PM, Marco Paolini wrote:
>> if that option is true, sqlite shoud open one connection per cursor
>> and psycopg2 should use named cursors
>>
> The sqlite behavior leads to some problems with transaction management -
> different connections, different transactions (or is there some sort of
> "shared
> transaction" in sqlite?). I would just fetch all the data in one go when
Yes there is a "shared cache mode" that makes all connections share the
same transaction

but this API is not exposed to python by sqlite3

Ian Kelly

Nov 2, 2011, 9:36:20 AM
to django-d...@googlegroups.com
On Wed, Nov 2, 2011 at 5:05 AM, Anssi Kääriäinen
<anssi.ka...@thl.fi> wrote:
> For PostgreSQL this would be a nice feature. Any idea what MySQL and Oracle
> do currently?

If I'm following the thread correctly, the oracle backend already does
chunked reads. The default chunk size is 100 rows, IIRC.

Anssi Kääriäinen

Nov 2, 2011, 10:18:27 AM
to Marco Paolini, django-d...@googlegroups.com
On 11/02/2011 01:36 PM, Marco Paolini wrote:
> maybe we could implement something like:
>
> for obj in qs.all().chunked(100):
> pass
>
> .chunked() will automatically issue LIMITed SELECTs
>
> that should work with all backends
I don't think that will be a performance improvement - this will get rid
of the memory overhead in Django, but would lead to a lot of overhead in
the DB. Assuming you are fetching 10000 objects from the DB, you would
issue these commands to the DB (I hope I got the idea correctly):

SELECT ID, ...
ORDER BY order
LIMIT 100 OFFSET 0

SELECT ID, ...
ORDER BY order
LIMIT 100 OFFSET 100
...
SELECT ID, ...
ORDER BY order
LIMIT 100 OFFSET 9900


For each query the DB will need to do:
- query parse & plan
- If the order is not indexed, a top N sort.
- Fetch the items (even in the indexed case you will need to traverse the
index for the OFFSET, which is not free at all).

So, for the last fetch the DB would need to traverse the first 9900 items
in the index (or worse, do a top 10000 sort) and then return the 100
items wanted. This is going to be expensive in the DB. The trade-off of
saving some memory on the Django side at the expense of doing a lot more
work on the DB side is not a good one. DB resources are in general much
harder to scale than Django resources.

You are going to do in total 10000 + 9900 + 9800 + ... + 100 index
traversals in the DB, which adds up to somewhere around 0.5 million items
traversed in the index. In addition, you will do 100 parse + plan stages.
You really don't want to do that. In addition, if there are concurrent
updates to the items, you might miss some objects and see some objects
as duplicates.

- Anssi

Marco Paolini

Nov 2, 2011, 10:25:27 AM
to django-d...@googlegroups.com
On 02/11/2011 14:36, Ian Kelly wrote:
On Wed, Nov 2, 2011 at 5:05 AM, Anssi Kääriäinen

> <anssi.ka...@thl.fi> wrote:
>> For PostgreSQL this would be a nice feature. Any idea what MySQL and Oracle
>> do currently?
>
> If I'm following the thread correctly, the oracle backend already does
> chunked reads. The default chunk size is 100 rows, IIRC.
>
yes, in Oracle it looks like rows are ALWAYS [1] fetched from the server in chunks when using fetchmany()

[1] http://cx-oracle.sourceforge.net/html/cursor.html#Cursor.arraysize

summarizing (real) chunked fetch capability by backend:

backend   | currently | supported | how to implement
sqlite    | N         | Y (1)     | one connection per cursor + shared cache mode
postgres  | N         | Y (2)     | named cursors
mysql     | N         | Y (3)     | custom cursor class
oracle    | Y         | Y         | default behavior

(1) in sqlite cursors are not isolated, so we can't effectively use chunked fetchmany
(2) postgres supports chunks when using named cursors
(3) mysql supports only single-row chunks

Marco Paolini

Nov 2, 2011, 10:42:02 AM
to django-d...@googlegroups.com
Yes, that's right,

qs.chunked() could be implemented, roughly, as a generator along these lines:

def chunked(qs, chunk_size=100):
    i = 0
    while True:
        chunk = list(qs[i:i + chunk_size])
        if not chunk:
            break
        for obj in chunk:
            yield obj
        i += chunk_size

we should find another way to avoid this memory-hungriness issue for huge querysets

or at least we should document the issue ;)

currently oracle is the only backend that DOES chunked row fetch, all others, for different reasons,
load all rows in memory

Marco

Tom Evans

Nov 2, 2011, 12:12:05 PM
to django-d...@googlegroups.com
On Wed, Nov 2, 2011 at 11:28 AM, Marco Paolini <markop...@gmail.com> wrote:
> mysql can do chunked row fetching from server, but only one row at a time
>
> curs = connection.cursor(CursorUseResultMixIn)
> curs.fetchmany(100) # fetches 100 rows, one by one
>
> Marco
>

The downsides to mysql_use_result over mysql_store_result are that the
mysql thread is locked and unavailable to do anything until the query
is completed and mysql_free_result has been called.

If you do a database query that quickly returns a lot of rows from the
database, and each row returned from the database requires long
processing in django, and you use mysql_use_result, then other mysql
threads are unable to update any table being used, whereas if you do
the same thing with mysql_store_result, the tables are unlocked as
soon as the client has retrieved all the data from the server.

In other words, you trade off memory usage against scalability. If
some part of the ORM was reworked to use mysql_use_result, then we
would need to add appropriate docs to explain the dangers of this
approach.
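
To make the trade-off concrete, here is a minimal MySQLdb sketch (outside Django) of the mysql_use_result style via SSCursor; the connection parameters and the query are illustrative:

import MySQLdb
import MySQLdb.cursors

conn = MySQLdb.connect(db="test", cursorclass=MySQLdb.cursors.SSCursor)
curs = conn.cursor()                    # SSCursor wraps mysql_use_result
curs.execute("SELECT id, name FROM entry")
try:
    for row in curs:                    # rows stream from the server one by one
        pass                            # slow per-row work here keeps the table tied up server-side
finally:
    curs.close()                        # frees the result so the server can move on
    conn.close()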

Cheers

Tom

Marco Paolini

Nov 2, 2011, 12:22:39 PM
to django-d...@googlegroups.com
On 02/11/2011 17:12, Tom Evans wrote:
> On Wed, Nov 2, 2011 at 11:28 AM, Marco Paolini<markop...@gmail.com> wrote:
>> mysql can do chunked row fetching from server, but only one row at a time
>>
>> curs = connection.cursor(CursorUseResultMixIn)
>> curs.fetchmany(100) # fetches 100 rows, one by one
>>
>> Marco
>>
>
> The downsides to mysql_use_result over mysql_store_result are that the
> mysql thread is locked and unavailable to do anything until the query
> is completed and mysql_free_result has been called.
>
> If you do a database query that quickly returns a lot of rows from the
> database, and each row returned from the database requires long
> processing in django, and you use mysql_use_result, then other mysql
> threads are unable to update any table being used, where as if you do
> the same thing with mysql_store_result, the tables are unlocked as
> soon as the client has retrieved all the data from the server.
>
other connections in other transactions are locked too?

> In other words, you trade off memory usage against scalability. If
> some part of the ORM was reworked to use mysql_use_result, then we
> would need to add appropriate docs to explain the dangers of this
> approach.
yes indeed.

Scalability is also affected by Python thread memory consumption;
it all depends on how big the queryset being fetched is and how often
that queryset is iterated.

if you fetch a huge queryset in one chunk and Python eats up, say, 1G
of heap, that's not going to scale well either.

caveats should be clearly documented for both approaches, I think

Marco

Tom Evans

Nov 2, 2011, 12:33:56 PM
to django-d...@googlegroups.com
On Wed, Nov 2, 2011 at 4:22 PM, Marco Paolini <markop...@gmail.com> wrote:
> On 02/11/2011 17:12, Tom Evans wrote:
>> If you do a database query that quickly returns a lot of rows from the
>> database, and each row returned from the database requires long
>> processing in django, and you use mysql_use_result, then other mysql
>> threads are unable to update any table being used, where as if you do
>> the same thing with mysql_store_result, the tables are unlocked as
>> soon as the client has retrieved all the data from the server.
>>
> other connections in other transactions are locked too?

Yes. The exact wording from the C API:

"""
On the other hand, you shouldn't use mysql_use_result() if you are
doing a lot of processing for each row on the client side, or if the
output is sent to a screen on which the user may type a ^S (stop
scroll). This ties up the server and prevent other threads from
updating any tables from which the data is being fetched.
"""

mysql treats the table as in use until the result is freed.

If this behaviour was in place, then you wouldn't have even raised the
original query - the approach you were using was to iterate through a
result set and modify the table you are fetching from. With
mysql_use_result, you would have deadlocked that table in the mysql
server as soon as you tried to update it without first completing the
first query.

Cheers

Tom

Marco Paolini

Nov 2, 2011, 1:11:41 PM
to django-d...@googlegroups.com
Yeah, the discussion has drifted a bit from its starting point

so, summarizing again:
- mysql supports chunked fetch but will lock the table while fetching is in progress (likely causing deadlocks)
- postgresql does not seem to suffer this issue and chunked fetch seems doable (not trivial) using named cursor
- oracle does chunked fetch already (someone confirm this, please)
- sqlite3 COULD do chunked fetch by using one connection per cursor (otherwise cursors will not be isolated)

Marco

Ryan McIntosh

Nov 2, 2011, 2:19:21 PM
to django-d...@googlegroups.com
I think the discussion actually went a bit sideways. Is there value in a model method to return an iterator which pulls results from a temporary table that gets filled from a model query? This puts the onus on the django-user to use the correct method.

Model.foo().bar().buffered() or .from_tmp()

peace,

Ryan


Kääriäinen Anssi

Nov 2, 2011, 7:06:18 PM
to django-d...@googlegroups.com
"""
so, summarizing again:
- mysql supports chunked fetch but will lock the table while fetching is in progress (likely causing deadlocks)
- postgresql does not seem to suffer this issue and chunked fetch seems doable (not trivial) using named cursor
- oracle does chunked fetch already (someone confirm this, please)
- sqlite3 COULD do chunked fetch by using one connection per cursor (otherwise cursors will not be isolated)
"""

I did a little testing. It seems you can get the behavior you want if you just do this in PostgreSQL:
for obj in Model.objects.all().iterator(): # Note the extra .iterator()
    # handle object here.

What is happening? Django correctly uses cursor.fetchmany(chunk_size) in models/sql/compiler.py. The chunk_size is hardcoded to 100. The problem is in db/models/query.py and its __iter__ method. __iter__ keeps filling self._result_cache, and that is where the memory is consumed. Changing that is not wise, as in many cases you do want to keep the results around. The .iterator() call will skip __iter__ and directly access the underlying generator.

You can also do objects.all()[0:100000].iterator() and objects are correctly fetched without caching.
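
To make the difference concrete, a minimal sketch (the _result_cache attribute name is taken from the internals of the time and is illustrative, not public API):

from django.contrib.auth.models import User

qs = User.objects.all()
for obj in qs:                              # QuerySet.__iter__ fills qs._result_cache as it goes
    pass                                    # afterwards, every instance is still referenced by qs

for obj in User.objects.all().iterator():   # bypasses the result cache
    pass                                    # instances can be garbage collected as the loop advances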

Here is a printout from my tests. The memory report is the total process memory use:

Code:
i = 0
for obj in User.objects.all()[0:100000]:
    i += 1
    if i % 1000 == 0:
        print memory()

25780.0kB
26304.0kB
26836.0kB
27380.0kB
27932.0kB
28468.0kB
29036.0kB
29580.0kB
29836.0kB
30388.0kB

And then:
i = 0
for obj in User.objects.all()[0:100000].iterator():
    i += 1
    if i % 1000 == 0:
        print memory()

25216.0kB
25216.0kB
25216.0kB
25216.0kB
25216.0kB
25216.0kB
25216.0kB
25216.0kB
25216.0kB
25216.0kB
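
(The memory() helper used above isn't shown in the thread; a rough Linux-only equivalent, purely illustrative, would be:)

def memory():
    # report the resident set size of the current process
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return line.split(':', 1)[1].strip()
    return 'unknown'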


This would be worth documenting, with maybe a better named method wrapping the .iterator(). I have no ideas for a better name, though.

I would sure like a verification of this test; I am tired and this seems like too easy a fix. Or am I missing the problem?

- Anssi

Javier Guerra Giraldez

Nov 2, 2011, 10:14:39 PM
to django-d...@googlegroups.com
On Wed, Nov 2, 2011 at 11:33 AM, Tom Evans <teva...@googlemail.com> wrote:
>> other connections in other transactions are locked too?
>
> Yes. The exact wording from the C API:
>
> """
> On the other hand, you shouldn't use mysql_use_result() if you are
> doing a lot of processing for each row on the client side, or if the
> output is sent to a screen on which the user may type a ^S (stop
> scroll). This ties up the server and prevent other threads from
> updating any tables from which the data is being fetched.
> """

this seems to be the case with MyISAM tables; on the InnoDB engine
docs, it says that SELECT statements don't set any lock, since it
reads from a snapshot of the table.

on MyISAM, there are (clumsy) workarounds by forcing the use of
scratch tables, explicitly copying to temporary tables, or buffering
the output.

"""
SELECT ... FROM is a consistent read, reading a snapshot of the
database and setting no locks unless the transaction isolation level
is set to SERIALIZABLE. For SERIALIZABLE level, the search sets shared
next-key locks on the index records it encounters.
"""
(http://dev.mysql.com/doc/refman/5.0/en/innodb-locks-set.html)


--
Javier

Tom Evans

Nov 3, 2011, 8:40:56 AM
to django-d...@googlegroups.com
On Thu, Nov 3, 2011 at 2:14 AM, Javier Guerra Giraldez
<jav...@guerrag.com> wrote:
> this seems to be the case with MyISAM tables; on the InnoDB engine
> docs, it says that SELECT statements don't set any lock, since it
> reads from a snapshot of the table.
>
> on MyISAM, there are (clumsy) workarounds by forcing the use of
> scratch tables, explicitly copying to temporary tables, or buffering
> the output.
>
> """
> SELECT ... FROM is a consistent read, reading a snapshot of the
> database and setting no locks unless the transaction isolation level
> is set to SERIALIZABLE. For SERIALIZABLE level, the search sets shared
> next-key locks on the index records it encounters.
> """
> (http://dev.mysql.com/doc/refman/5.0/en/innodb-locks-set.html)
>
>

Still, in your use case you wanted to update the table whilst reading
a query row by row from the server. Updating the table would mean
committing the current transaction, which would not be possible as the
mysql server still hasn't finished sending you the results. You would
get a "Commands out of sync" error.

Cheers

Tom

Anssi Kääriäinen

Nov 3, 2011, 4:38:23 PM
to Django developers


On Nov 3, 1:06 am, I wrote:
> I did a little testing. It seems you can get the behavior you want if you just do this in PostgreSQL:
> for obj in Model.objects.all().iterator(): # Note the extra .iterator()
>     # handle object here.
<SNIP>
> I would sure like a verification to this test, I am tired and this seems like too easy of an fix. Or am I missing the problem?
Quoting myself:
"""
I did a little testing. It seems you can get the behavior you want if
you just do this in PostgreSQL:
for obj in Model.objects.all().iterator(): # Note the
extra .iterator()
# handle object here.

<SNIP>

I would sure like a verification to this test, I am tired and this
seems like too easy of an fix. Or am I missing the problem?
"""

I did some more testing, and I think that PostgreSQL will do the right
thing if you use the .iterator() trick, even without named cursors. I
tested combinations of changing User.objects.all().iterator() <->
User.objects.all() and in compiler.py changing fetchmany(FETCH_SIZE) <-
> fetchall().

The testcase is 20000 objects with a largish (1000 char) wide text
field fetched in a loop:
for obj in User.objects.all(): # / all().iterator()
    print memory()

The results for .all() without iterator() and standard .fetchmany():
68600.0kB -> 147300.0kB

The results for .all() with iterator() and standard .fetchmany():
64960.0kB -> 64960.0kB
Here the first print memory() is called after first fetch, so that is
why there is no memory difference at all.

The results for .all() without iterator() and .fetchall():
143176.0kB -> 149232.0kB
Here the strings take almost all of the space, and filling
the ._results_cache with objects will take the last 6000kB.

Last, the results for .all() with iterator() and .fetchall():
142912.0kB -> 142912.0kB

Now, calling the .iterator() directly is not safe on SQLite3. If you
do updates to objects not seen by the iterator yet, you will see those
changes. On MySQL, all the results are fetched into Python in one go,
and the only saving is from not populating the _results_cache. I guess
Oracle will just work with the .iterator() approach. Without
the .iterator() call, it will consume the memory, as ._results_cache
will be populated.

I would like to introduce a new method (wrapping .iterator() and doing
sanity checking) for this. Allowing this to work on PostgreSQL would
be a big plus in my book, and I think Oracle would benefit from this,
too. SQLite3 should fetch all the data in one go (to get rid of update
effects on the results) and MySQL could use this, too, although it
would mean there is no large benefit for this. MySQL and SQLite3 would
benefit from not needing to create the objects, which could in some
cases be a bigger win than avoiding the cursor's row cache.

There would need to be sanity checks for at least prefetch_related (as
that will first need to fetch all the results, and then do the related
fetching). I don't know of other sanity checks needed.

This would of course need documentation detailing the caveats. In
addition, it would be good to document that, if at all possible, you should
avoid the pattern:

for obj in Model.objects.all():
    obj.field = changed_value
    obj.save()

That is going to be really expensive if the result set is large,
with .iterator() or not. Basically you are going to do N selects and N
updates. Even if you use force_update, you will still be doing N
updates, which is expensive. Another matter is that Django could be a
lot wiser here, doing automatically the update if the object is
fetched from the DB, and not doing an update at all if there are no
changes to field values.
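
As an aside, when the new value does not depend on per-object Python logic, the existing QuerySet.update() already collapses this into a single UPDATE statement (note that it bypasses model save() methods and signals):

Model.objects.all().update(field=changed_value)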

If a new method for this is not introduced, the other choice is to
document the .iterator() trick directly, and let the user beware of
the possible problems for SQLite3 / prefetch_related. Then there is
still the possibility of keeping this as an undocumented trick.

- Anssi

Marco Paolini

Nov 3, 2011, 5:09:05 PM
to django-d...@googlegroups.com
> Now, calling the .iterator() directly is not safe on SQLite3. If you
> do updates to objects not seen by the iterator yet, you will see those
> changes. On MySQL, all the results are fetched into Python in one go,
> and the only saving is from not populating the _results_cache. I guess
> Oracle will just work with the .iterator() approach. Without
> the .iterator() call, it will consume the memory, as ._results_cache
> will be populated.
it *is* safe to call .iterator with sqlite because our ORM fetches all
rows into memory (see below and django/django/db/models/sql/compiler.py)

>
> I would like to introduce a new method (wrapping .iterator() and doing
> sanity checking) for this. Allowing this to work on PostgreSQL would
> be a big plus in my book, and I think Oracle would benefit from this,
> too. SQLite3 should fetch all the data in one go (to get rid of update
> effects on the results) and MySQL could use this, too, although it
> would mean there is no large benefit for this. MySQL and SQLite3 would
> benefit from not needing to create the objects, which could in some
> cases be a bigger win than avoiding the cursor's row cache.
>
> There would need to be sanity checks for at least prefetch_related (as
> that will first need to fetch all the results, and then do the related
> fetching). I don't know of other sanity checks needed.
>
> This would of course need documentation detailing the cave-eats. In
> addition, it would be good to document that if at all possible, try to
> avoid the pattern:
> for obj in Model.objects.all():
> obj.field = changed_value
> obj.save()
>
> That is going to be really expensive if the result set is large,
> with .iterator() or not. Basically you are going to do N selects and N
> updates. Even if you use force_update, you will still be doing N
> updates, which is expensive. Another matter is that Django could be a
> lot wiser here, doing automatically the update if the object is
> fetched from the DB, and not doing an update at all if there are no
> changes to field values.

I think this is the developer's job, not the framework's

>
> If a new method for this is not introduced, the other choice is to
> document the .iterator() trick directly, and let the user beware of
> the possible problems for SQLite3 / prefetch_related. Then there is
> still the possibility of keeping this as an undocumented trick.

Sorry but I really think you are on the wrong track, maybe because the thread is
bending further and further away from its starting point...

.iterator() disables queryset *model instance* caching, it's documented and works all right,
even in sqlite.

the problem lies one layer below, in the sqlite3, psycopg2 and MySQLdb adapters

when our ORM calls fetchmany(100), both psycopg2 and MySQLdb fetch *all* rows from the db server in one go
at the very first fetchmany(100) call.
All subsequent fetchmany(100) calls will *not* hit the db server, only the local backend cache.

sqlite3, instead, only fetches 100 rows (actually it fetches 101, like the dalmatians, and keeps one for the next call),
but our ORM knows that two cursors are not isolated, so it keeps calling fetchmany(100), storing all
rows in a single list

net effect is: the very first time a queryset hits the database,
rows are all loaded into memory by our DB backends

oracle works as expected: fetchmany(100) only fetches 100 rows from the db server (well,
actually maybe some more, depending on driver settings)

even if we use .iterator()

Marco

Anssi Kääriäinen

Nov 3, 2011, 7:08:53 PM
to Django developers


On Nov 3, 11:09 pm, Marco Paolini <markopaol...@gmail.com> wrote:
> > Now, calling the .iterator() directly is not safe on SQLite3. If you
> > do updates to objects not seen by the iterator yet, you will see those
> > changes. On MySQL, all the results are fetched into Python in one go,
> > and the only saving is from not populating the _results_cache. I guess
> > Oracle will just work with the .iterator() approach. Without
> > the .iterator() call, it will consume the memory, as ._results_cache
> > will be populated.
>
> it *is* safe to call .iterator with sqlite because our ORM fetches all
> rows into memory (see below and django/django/db/models/sql/compiler.py)

Doh, a little miss in my reading of the code, in the end of
compiler.py execute_sql is this little snippet:
"""
if not self.connection.features.can_use_chunked_reads:
    # If we are using non-chunked reads, we return the same data
    # structure as normally, but ensure it is all read into memory
    # before going any further.
    return list(result)
"""

That might explain something... :)

The test data, which shows a big memory saving when using fetchmany()
as opposed to fetchall() in compiler.py for PostgreSQL, is probably
because even if the data is fetched in one go from the server, it is
transformed into Python objects in batches by fetchmany(). I am not sure
of this. This confused me, sorry for that.

So, what you said before is correct:
- mysql supports chunked fetch but will lock the table while fetching is in progress (likely causing deadlocks)
- postgresql does not seem to suffer this issue and chunked fetch seems doable (not trivial) using named cursors
- oracle does chunked fetch already (someone confirm this, please)
- sqlite3 COULD do chunked fetch by using one connection per cursor (otherwise cursors will not be isolated)

The SQLite3 shared cache mode seems to suffer from the same problem
than mysql:

"""
At any one time, a single table may have any number of active read-
locks or a single active write lock. To read data a table, a
connection must first obtain a read-lock. To write to a table, a
connection must obtain a write-lock on that table. If a required table
lock cannot be obtained, the query fails and SQLITE_LOCKED is returned
to the caller.
"""

So, if you have a read open on a table in one cursor (which is
a separate connection), and you try to modify the table from another
cursor while that read is still open, it seems you will deadlock.

You could use the read uncommitted isolation level, but then you will
have the same problem you have now when using fetchmany() - updates
might be seen in the other cursor's objects. The SQLite transaction
isolation implementation is not MVCC. It is likely it simply does not
support multiple row versions, and this means you can't update a table
and expect to still get the old version from the server in another
cursor.

- Anssi

Marco Paolini

Nov 3, 2011, 7:20:36 PM
to django-d...@googlegroups.com
...

> The SQLite3 shared cache mode seems to suffer from the same problem
> than mysql:
>
> """
> At any one time, a single table may have any number of active read-
> locks or a single active write lock. To read data a table, a
> connection must first obtain a read-lock. To write to a table, a
> connection must obtain a write-lock on that table. If a required table
> lock cannot be obtained, the query fails and SQLITE_LOCKED is returned
> to the caller.
> """
>
> So, if you have an open connection to a table in one cursor (which is
> a separate connection), and you try to modify the table while that
> connection is open by another cursor, it seems you will deadlock.
>
> You could use the read uncommitted isolation level, but then you will
> have the same problem you have now when using fetchmany() - updates
> might be seen in the other cursor's objects. The SQLite transaction
> isolation implementation is not MVCC. It is likely it simply does not
> support multiple row versions, and this means you can't update a table
> and expect to still get the old version from the server in another
> cursor.
yeah, you're right about the locking issues we'd hit if using sqlite3 in
shared cache mode as described above.

so eventually a .chunked() queryset implementation for sqlite3
has to live with the fact that cursors are not isolated, and
if you update a table you can get "freshly" updated objects in other querysets

time to write some patches, now!

Marco

Anssi Kääriäinen

Nov 3, 2011, 8:50:17 PM
to Django developers
On Nov 4, 1:20 am, Marco Paolini <markopaol...@gmail.com> wrote:
> time to write some patches, now!

Here is a proof of concept for one way to achieve chunked reads when
using PyscoPG2. This lacks tests and documentation. I think the
approach is sane, though. It allows different database backends to be
able to decide how to do the chunked reads, and there is a qs method
to use when you want extreme caching avoidance - even if it means non-
safe behavior is sqlite3 / using named cursors in postgresql.

For SQLite, results for User.objects.all()[0:10000]
.chunked(): 25852.0kB
.iterator(): 65508.0kB
Modifications will be seen by the iterator.

Postgresql:
.chunked(): 26716.0kB
.iterator(): 46652.0kB

MySQL should not have any changes between chunked <-> iterator,
neither should Oracle.

I would write tests for this, but it seems a bit hard - how to test if
the backend fetched the objects in one go or not? You could test
memory usage, but that seems brittle, or you could test internals of
the backend, but that is ugly and brittle. Ideas?

The patch is at:
https://github.com/akaariai/django/commit/8990e20df50ce110fe6ddbbdfed7a98987bb5835

- Anssi

Marco Paolini

Nov 3, 2011, 9:29:24 PM
to django-d...@googlegroups.com
nice patch!

where/when do we close() cursors, or do we rely on the cursor's __del__() implementation?

postgres named cursors can't be used in autocommit mode [1]

psycopg fetches 2000 rows at a time

[1] http://initd.org/psycopg/docs/usage.html#server-side-cursors

Marco

Marco Paolini

Nov 3, 2011, 9:38:14 PM
to django-d...@googlegroups.com
On 04/11/2011 01:50, Anssi Kääriäinen wrote:
> On Nov 4, 1:20 am, Marco Paolini<markopaol...@gmail.com> wrote:
>> time to write some patches, now!
>
> Here is a proof of concept for one way to achieve chunked reads when
> using PyscoPG2. This lacks tests and documentation. I think the
> approach is sane, though. It allows different database backends to be
> able to decide how to do the chunked reads, and there is a qs method
> to use when you want extreme caching avoidance - even if it means non-
> safe behavior is sqlite3 / using named cursors in postgresql.
>
> For SQLite, results for User.objects.all()[0:10000]
> .chunked(): 25852.0kB
> .iterator(): 65508.0kB
> Modifications will be seen by the iterator.
>
> Postgresql:
> .chunked(): 26716.0kB
> .iterator(): 46652.0kB
what if you use .chunked().iterator() ?

Marco

Anssi Kääriäinen

Nov 3, 2011, 9:45:28 PM
to Django developers
On Nov 4, 3:38 am, Marco Paolini <markopaol...@gmail.com> wrote:
> what if you use .chunked().iterator() ?

You can't. .chunked() returns a generator. Note that the memory usage
is total memory usage for the process, not for the query. The memory
usage for the query is probably just a small part of the total memory
usage.

The .chunked will basically alter the qs.query to use a chunked_cursor
and avoid the fetching of all results when using sqlite3. Then it
calls .iterator().

- Anssi

Anssi Kääriäinen

Nov 3, 2011, 9:56:49 PM
to Django developers
On Nov 4, 3:29 am, Marco Paolini <markopaol...@gmail.com> wrote:
> where/when do we close() cursors, or we rely on cursor __del__() implementation?
I guess we should rely on it going away when it happens to go away
(that is, __del__ way).

>
> postgres named cursors can't be used in autocommit mode [1]

I don't know if there is any other way than documentation + checking
this in postgresql backend's chunked_cursor method. I think there is
actually a more severe limitation: the cursor is not usable outside
the transaction. Or you will need to create it as "with hold" cursor,
which means it can live even outside transactions, but then closing it
is really important. This according to psycopg2 documentation.

I really think this is documentation / sanity checking
stuff. .chunked() is there because it can have weird requirements /
backend specific behavior. Making it behave exactly the same way on
every backend, without transaction etc requirements will make this
feature really, really hard to implement. There is a tradeoff here
between what can be achieved easily, and per backend requirements.

>
> psycopg fetches 2000 rows at the time

That is luckily alterable; we can make it fetch 100 rows at a time if
we want. I would think that 2000 rows is a good balance between
memory usage and network round trips. Maybe chunked could have a
chunk_size kwarg, but that can wait for later.

- Anssi

Anssi Kääriäinen

Nov 3, 2011, 10:05:19 PM
to Django developers
On Nov 4, 3:38 am, Marco Paolini <markopaol...@gmail.com> wrote:
> > Postgresql:
> > .chunked(): 26716.0kB
> > .iterator(): 46652.0kB
>
> what if you use .chunked().iterator() ?

Quick test shows that the actual memory used by the queryset is around
1.2Mb. Using smaller fetch size than the default 2000 would result in
less memory used, but even this isn't bad at all. The test is:

i = 0
print memory()
for obj in User.objects.all().order_by('pk')[0:10000].chunked():
    i += 1
    if i % 1000 == 0:
        print memory()

Each instance has a 1000 char wide text field.

Before the loop the usage is:
25584.0kB
In loop the memory use stays at:
26716.0kB

These are process total usages, and there is likely a lot of error in
the numbers. Anyways, the memory usage is most likely somewhere around
1-3MB.

Marco Paolini

Nov 3, 2011, 10:51:43 PM
to django-d...@googlegroups.com

tested your patch for memory usage, looks good

sqlite on a 1000 record queryset

76353536.0 .iterator()

74825728.0 .chunked()

sqlite on a 10000 record qs:

99713024.0 .iterator()

79847424.0 .chunked()

~20% memory usage decrease

Marco

Asif Saifuddin

Mar 18, 2015, 2:43:03 PM
to django-d...@googlegroups.com
assigned myself to the ticket https://code.djangoproject.com/ticket/16614

will give akaariai's patch a try

