I'd like to add a small note of warning about queryset caching in the docs,
in topics/db/queries.txt "Caching and QuerySets" section,
something like:
Keep in mind, when looping through a queryset and altering data that
might be returned by that same queryset, that the instances you get
might not be "fresh".
You might want to refresh the instance before calling some table-wide data-altering
method on it::
>>> for obj in Entry.objects.all():
...     obj = Entry.objects.get(pk=obj.pk)
...     obj.alter_some_other_entries()
Should I open a ticket or maybe it's a too specific use-case?
Cheers,
Marco
Isn't that self evident? If you generate a list of models, and then
change the underlying table, why would they be fresh? It's also
nothing to do with queryset caching, or even querysets. You may as
well use this example:
obj1 = Entry.objects.get(id=i)
obj2 = Entry.objects.get(id=j)
obj1.alter_other_objects_in_table()
# oh noes, obj2 is not what is in the database
I'd go with something that explains that model instances are created
from the queryset as soon as you evaluate it, not as you iterate
through it, but I'm pretty sure that is already there.
Cheers
Tom
Of course the qs cache is a list of instances.
>
> obj1 = Entry.objects.get(id=i)
> obj2 = Entry.objects.get(id=j)
> obj1.alter_other_objects_in_table()
> # oh noes, obj2 is not what is in the database
>
> I'd go with something that explains that model instances are created
> from the queryset as soon as you evaluate it, not as you iterate
> through it, but I'm pretty sure that is already there.
it's a bit more complex: there are basically two phases:
1) row fetching from db using cursor.fetchmany()
2) model instance creation in queryset
both are delayed as much as possible (queryset laziness)
phase two (model instance creation) depends on how you evaluate the queryset:
- testing for truth will only create enough instances to fill the cache
- calling .exists() will try to fetch a single object and leave the cache alone
- iterating will create instances in batches
- .iterator() will create the instances one at a time (but on the same DB cursor, using cursor.fetchmany())
- many other options...
phase one can be executed in chunks, depending on the DB backend (SQLite: no chunks, PostgreSQL: chunks)
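a quick illustration of those evaluation modes (a sketch; Entry stands in for any model):

qs = Entry.objects.all()                 # no query yet (laziness)

if qs:                                   # truth test: starts filling the result cache
    pass
Entry.objects.filter(pk=1).exists()      # cheap LIMIT 1 query, cache untouched
for obj in qs:                           # iterating creates instances in batches
    pass
for obj in qs.iterator():                # one instance at a time, queryset cache bypassed
    pass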
Queryset "laziness" MIGHT lead to a WRONG assumption that objects in a for loop are fresh.
The note only wanted to stress the fact that you'll get "stale" objects if you do table-wide altering operations inside the loop
I agree it's not surprising and it's not going to make your day happier, but it might be interesting to someone.
On the other hand, your objections make sense: it should be quite clear already that in the for loop you're
fetching objects using a cursor, and altering data inside the loop will not change the objects returned by the queryset.
cheers,
Marco
You are probably more informed about the internals than me, but isn't
it irrelevant when the instances are created? They are created from
the same query, even if they are not instantiated until you iterate to
them. Any changes to the database that happen after that query is
executed will not be reflected in the instances created from it.
I.e. the critical point is when the query is executed, not when the
data is fetched from the db cursor.
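For example (a sketch; Entry and its name field are illustrative):

entries = list(Entry.objects.all())          # the query executes here
Entry.objects.all().update(name='changed')   # the table changes afterwards
print entries[0].name                        # still the old value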
Cheers
Tom
while digging into Django and sqlite3 backend internals I discovered
that SQLite cursors from the same connection are NOT isolated:
>>> import sqlite3
>>> conn = sqlite3.connect('test.db')  # assumes an existing 'entry' table
>>> curs1 = conn.cursor()
>>> curs2 = conn.cursor()
>>> curs1.execute('select * from entry')
>>> curs1.fetchmany(100)
[(1, 'hey'), (2, 'hey'), ...]
>>> curs2.execute("update entry set name = ''")
>>> curs1.fetchmany(100)
[(100, ''), (101, ''), ...]
to overcome this issue Django fetches all cursor rows into memory, in a single list, while iterating
a resultset (see SQLCompiler.execute_sql in django/db/models/sql/compiler.py)
so if you do this:

for obj in Entry.objects.all():
    pass

django does this:
- creates a cursor
- then calls fetchmany(100) until ALL rows are fetched
- creates a list containing ALL fetched rows
- passes this list to the queryset instance for lazy model instance creation
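in rough pseudocode (the function name is mine, not the actual source):

def fetch_all_rows(cursor, chunk_size=100):
    rows = []
    while True:
        chunk = cursor.fetchmany(chunk_size)
        if not chunk:
            break
        rows.extend(chunk)
    return rows   # the entire result set, in memory, as one list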
I didn't know that. (maybe we should document it somewhere...)
Now that I do, I also know it's time to move to postgresql...
cheers,
Marco
> so if you do this:
>
> for obj in Entry.objects.all():
>     pass
>
> django does this:
> - creates a cursor
> - then calls fetchmany(100) until ALL rows are fetched
> - creates a list containing ALL fetched rows
> - passes this list to queryset instance for lazy model instance creation
>
> I didn't know that. (maybe we should document it somewhere...)
>
> Now that I do, I also know it's time to move to postgresql...
And you will then find that the behaviour of the psycopg2 adapter means
that you get very similar behaviour - all rows are fetched as soon as
you start iterating - even if you do .iterator().
Luke
--
"The number you have dialled is imaginary. Please rotate your
telephone by 90 degrees and try again."
Luke Plant || http://lukeplant.me.uk/
thanks for pointing that to me, do you see this as an issue to be fixed?
If there is some interest, I might give it a try.
Maybe it's not fixable; at least I can investigate a bit.
Cheers,
Marco
> thanks for pointing that to me, do you see this as an issue to be fixed?
>
> If there is some interest, I might give it a try.
>
> Maybe it's not fixable, at least I can investigate a bit
Apparently, the protocol between the Postgres client and server only
does partial sends when using named cursors, which Django doesn't use.
Using named cursors with psycopg2 is certainly possible, but probably
not trivial. That's as much as I know.
Source:
http://thebuild.com/blog/2011/07/26/unbreaking-your-django-application/
The author of that page did say that he was working on a patch.
talking about sqlite3, it looks like the only way to isolate two cursors
is to use two different connections.
Let's imagine there is a way to implement this (I'm not sure at this point)
We could have an option somewhere that tells django to use or not use
chunked cursor read from db.
if that option is true, SQLite should open one connection per cursor
and psycopg2 should use named cursors
This will cause some overhead for small querysets but will save some memory
for huge ones
cheers,
Marco
> This will cause some overhead for small querysets but will save some memory
> for huge ones
For PostgreSQL this would be a nice feature. Any idea what MySQL and Oracle
do currently?
- Anssi
MySQLdb can do unbuffered (mysql_use_result) fetches via a custom cursor class:

curs = connection.cursor(CursorUseResultMixIn)
curs.fetchmany(100)  # fetches 100 rows, one by one, from the server
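a fuller sketch using SSCursor, the ready-made cursor class built on that
mixin (connection parameters are illustrative):

import MySQLdb
import MySQLdb.cursors

# SSCursor uses mysql_use_result(): rows stream from the server
# instead of being buffered client-side by the driver
conn = MySQLdb.connect(db='test', cursorclass=MySQLdb.cursors.SSCursor)
curs = conn.cursor()
curs.execute("SELECT id, name FROM entry")
while True:
    rows = curs.fetchmany(100)   # pulls the next 100 rows off the stream
    if not rows:
        break
curs.close()                     # frees the server-side result
conn.close()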
Marco
for obj in qs.all().chunked(100):
    pass
.chunked() will automatically issue LIMITed SELECTs
that should work with all backends
this could be a no-op for sqlite, where cursors are not isolated
and LIMITed SELECTs have bad, bad performance
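(slicing already compiles to LIMITed SELECTs, so .chunked() mostly needs
to loop over slices; output abbreviated:)

>>> print Entry.objects.all()[100:200].query
SELECT ... FROM entry LIMIT 100 OFFSET 100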
Marco
but this API is not exposed to Python by the sqlite3 module
If I'm following the thread correctly, the oracle backend already does
chunked reads. The default chunk size is 100 rows, IIRC.
SELECT ID, ...
ORDER BY order
LIMIT 100 OFFSET 0

SELECT ID, ...
ORDER BY order
LIMIT 100 OFFSET 100

...

SELECT ID, ...
ORDER BY order
LIMIT 100 OFFSET 9900
For each query the DB will need to do:
- query parse & plan
- If the order is not indexed, a top N sort.
- Fetch the items (even in indexed case you will need to travel the
index for the OFFSET which is not free at all).
So, for the last fetch the DB would need to travel the first 9900 items
in the index (or worse, do a top 10000 sort) and then return the 100
items wanted. This is going to be expensive in the DB. The trade-off of
saving some memory Django side at the expense of doing a lot more work
at the DB side is not a good one. DB resources are in general much
harder to scale than Django resources.
You are going to do in total 10000 + 9900 + 9800 + ... + 100 index
travels in the DB, which equals to somewhere around 0.5 million items
traveled in the index. In addition, you will do 100 parse + plan stages.
You really don't want to do that. In addition if there are concurrent
updates to the items, it might be you will miss some objects and see
some objects as duplicates.
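(checking the arithmetic:

>>> sum(range(100, 10001, 100))  # 100 + 200 + ... + 10000
505000

i.e. roughly half a million index entries traveled in total.)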
- Anssi
[1] http://cx-oracle.sourceforge.net/html/cursor.html#Cursor.arraysize
summarizing (real) chunked-fetch capability by backend:

           currently | supported | how to implement
sqlite        N      |   Y (1)   | using one connection per cursor + shared cache mode
postgres      N      |   Y (2)   | using named cursors
mysql         N      |   Y (3)   | using custom cursor class
oracle        Y      |   Y       | default behavior

(1) in sqlite cursors are not isolated, so we can't effectively use chunked fetchmany
(2) postgres supports chunks when using named cursors
(3) mysql supports only single-row chunks
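for reference, raw named-cursor usage in psycopg2 looks roughly like this
(a sketch; table and column names are made up, and this isn't wired into the ORM):

import psycopg2

conn = psycopg2.connect('dbname=test')    # connection params are illustrative
curs = conn.cursor(name='entry_cursor')   # the name makes it a server-side cursor
curs.itersize = 100                       # rows fetched per round-trip while iterating
curs.execute('SELECT id, name FROM entry')
for row in curs:                          # streams in batches of itersize
    pass
curs.close()
conn.close()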
qs.chunked() could be implemented along these lines (the queryset needs
a stable ordering for the slices to be consistent):

def chunked(qs, chunk_size=100):
    i = 0
    while True:
        chunk = list(qs[i:i + chunk_size])  # each slice issues a LIMITed SELECT
        if not chunk:
            break
        for obj in chunk:
            yield obj
        i += chunk_size
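usage would then be, e.g.:

for obj in chunked(Entry.objects.order_by('pk')):
    pass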
we should find another way of avoiding this memory-hungriness issue for huge querysets,
or at least we should document the issue ;)
currently oracle is the only backend that DOES chunked row fetch; all the others,
for different reasons, load all rows into memory
Marco
The downsides to mysql_use_result over mysql_store_result are that the
mysql thread is locked and unavailable to do anything until the query
is completed and mysql_free_result has been called.
If you do a database query that quickly returns a lot of rows from the
database, and each row returned from the database requires long
processing in django, and you use mysql_use_result, then other mysql
threads are unable to update any table being used, whereas if you do
the same thing with mysql_store_result, the tables are unlocked as
soon as the client has retrieved all the data from the server.
In other words, you trade off memory usage against scalability. If
some part of the ORM was reworked to use mysql_use_result, then we
would need to add appropriate docs to explain the dangers of this
approach.
Cheers
Tom
Scalability is also affected by Python thread memory consumption;
it all depends on how big the queryset being fetched is and how often
that queryset is iterated.
If you fetch a huge queryset in one chunk and Python eats up, say, 1G
of heap, that's not going to scale well either.
caveats should be clearly documented for both approaches, I think
Marco
Yes. The exact wording from the C API:
"""
On the other hand, you shouldn't use mysql_use_result() if you are
doing a lot of processing for each row on the client side, or if the
output is sent to a screen on which the user may type a ^S (stop
scroll). This ties up the server and prevents other threads from
updating any tables from which the data is being fetched.
"""
mysql treats the table as in use until the result is freed.
If this behaviour was in place, then you wouldn't have even raised the
original query - the approach you were using was to iterate through a
result set and modify the table you are fetching from. With
mysql_use_result, you would have deadlocked that table in the mysql
server as soon as you tried to update it without first completing the
first query.
Cheers
Tom
so, summarizing again:
- mysql supports chunked fetch but will lock the table while fetching is in progress (likely causing deadlocks)
- postgresql does not seem to suffer this issue, and chunked fetch seems doable (though not trivial) using named cursors
- oracle does chunked fetch already (someone confirm this, please)
- sqlite3 COULD do chunked fetch by using one connection per cursor (otherwise cursors are not isolated)
Marco
Model.foo().bar().buffered() or .from_tmp()
peace,
Ryan
I did a little testing. It seems you can get the behavior you want if you just do this in PostgreSQL:
for obj in Model.objects.all().iterator(): # Note the extra .iterator()
# handle object here.
What is happening? Django correctly uses cursor.fetchmany(chunk_size) in
models/sql/compiler.py; the chunk_size is hardcoded to 100. The problem is in
db/models/query.py, in its __iter__ method: __iter__ keeps self._result_cache,
and that is where the memory is consumed. Changing that is not wise, as in many
cases you do want to keep the results around. The .iterator() call skips
__iter__ and directly accesses the underlying generator.
You can also do objects.all()[0:100000].iterator() and objects are correctly fetched without caching.
Here is a printout from my tests. The memory report is the total process memory use:
Code:
i = 0
for obj in User.objects.all()[0:100000]:
    i += 1
    if i % 1000 == 0:
        print memory()
25780.0kB
26304.0kB
26836.0kB
27380.0kB
27932.0kB
28468.0kB
29036.0kB
29580.0kB
29836.0kB
30388.0kB
And then:
i = 0
for obj in User.objects.all()[0:100000].iterator():
    i += 1
    if i % 1000 == 0:
        print memory()
25216.0kB
25216.0kB
25216.0kB
25216.0kB
25216.0kB
25216.0kB
25216.0kB
25216.0kB
25216.0kB
25216.0kB
This would be worth documenting, maybe with a better-named method wrapping
.iterator(). I have no ideas for a better name, though.
I would sure like a verification of this test; I am tired, and this seems like
too easy a fix. Or am I missing the problem?
- Anssi
this seems to be the case with MyISAM tables; the InnoDB engine docs
say that SELECT statements don't set any locks, since InnoDB reads
from a snapshot of the table.
on MyISAM there are (clumsy) workarounds: forcing the use of scratch
tables, explicitly copying to temporary tables, or buffering the
output.
"""
SELECT ... FROM is a consistent read, reading a snapshot of the
database and setting no locks unless the transaction isolation level
is set to SERIALIZABLE. For SERIALIZABLE level, the search sets shared
next-key locks on the index records it encounters.
"""
(http://dev.mysql.com/doc/refman/5.0/en/innodb-locks-set.html)
--
Javier
Still, in your use case you wanted to update the table whilst reading
a query row by row from the server. Updating the table would mean
committing the current transaction, which would not be possible as the
mysql server still hasn't finished sending you the results. You would
get a "Commands out of sync" error.
Cheers
Tom
>
> I would like to introduce a new method (wrapping .iterator() and doing
> sanity checking) for this. Allowing this to work on PostgreSQL would
> be a big plus in my book, and I think Oracle would benefit from this,
> too. SQLite3 should fetch all the data in one go (to get rid of update
> effects on the results) and MySQL could use this, too, although it
> would mean there is no large benefit for this. MySQL and SQLite3 would
> benefit from not needing to create the objects, which could in some
> cases be a bigger win than avoiding the cursor's row cache.
>
> There would need to be sanity checks for at least prefetch_related (as
> that will first need to fetch all the results, and then do the related
> fetching). I don't know of other sanity checks needed.
>
> This would of course need documentation detailing the caveats. In
> addition, it would be good to document that if at all possible, try to
> avoid the pattern:
> for obj in Model.objects.all():
>     obj.field = changed_value
>     obj.save()
>
> That is going to be really expensive if the result set is large,
> with .iterator() or not. Basically you are going to do N selects and N
> updates. Even if you use force_update, you will still be doing N
> updates, which is expensive. Another matter is that Django could be a
> lot wiser here, doing automatically the update if the object is
> fetched from the DB, and not doing an update at all if there are no
> changes to field values.
I think this is the developer's job, not the framework's.
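(for the record, that particular loop can already be collapsed into a single
query with the existing bulk update:

Model.objects.all().update(field=changed_value)

one UPDATE statement instead of N selects + N updates, though it bypasses
save() and signals.)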
>
> If a new method for this is not introduced, the other choice is to
> document the .iterator() trick directly, and let the user beware of
> the possible problems for SQLite3 / prefetch_related. Then there is
> still the possibility of keeping this as an undocumented trick.
Sorry, but I really think you are on the wrong track, maybe because the thread is
bending further and further away from its starting point...
.iterator() disables queryset *model instance* caching, it's documented and works all right,
even in sqlite.
the problem lies one layer below, in the sqlite3, psycopg2 and MySQLdb adapters:
when our ORM calls fetchmany(100), both psycopg2 and MySQLdb fetch *all* rows from the
DB server in one go at the very first fetchmany(100) call.
All subsequent fetchmany(100) calls will *not* hit the DB server, only the local driver cache.
sqlite3, instead, only fetches 100 rows (actually it fetches 101, like the dalmatians,
and keeps one for the next call),
but our ORM knows that two cursors are not isolated, and it keeps calling fetchmany(100),
storing all rows in a single list.
net effect: the very first time a queryset hits the database,
all rows are loaded into memory by our DB backends.
oracle works as expected: fetchmany(100) only fetches 100 rows from the DB server (well,
actually maybe some more, depending on driver settings),
even if we use .iterator()
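(for comparison, raw cx_Oracle controls this with the cursor's arraysize,
see [1] upthread; the connection string and table are illustrative:)

import cx_Oracle

conn = cx_Oracle.connect('user/password@dsn')
curs = conn.cursor()
curs.arraysize = 100                       # rows per round-trip to the server
curs.execute('SELECT id, name FROM entry')
rows = curs.fetchmany(100)                 # one round-trip, 100 rows
curs.close()
conn.close()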
Marco
so eventually a .chunked() queryset implementation for sqlite3
has to live with the fact that cursors are not isolated:
if you update a table, you can get "freshly" updated objects in other querysets
time to write some patches, now!
Marco
where/when do we close() cursors, or do we rely on the cursor's __del__() implementation?
postgres named cursors can't be used in autocommit mode [1]
psycopg2 fetches 2000 rows at a time
[1] http://initd.org/psycopg/docs/usage.html#server-side-cursors
Marco
tested your patch for memory usage, looks good

sqlite on a 1000-record queryset:
76353536.0  .iterator()
74825728.0  .chunked()

sqlite on a 10000-record queryset:
99713024.0  .iterator()
79847424.0  .chunked()

~20% memory usage decrease
Marco