New issue 162 by andreas.kloeckner: Upon querying a view, couchdb-python
reads the entire result, ...
http://code.google.com/p/couchdb-python/issues/detail?id=162
...then parses it, then returns it to the user. All the while, both raw and
parsed representation are kept in memory. How about some love for people
whose data is too big to fit in main memory? Even once?
:(
It's a great idea to support this. However, it's not a straightforward
issue. A streaming JSON parser is required in order to deliver an iterable
stream of rows without holding the whole response in memory. Solving this
probably requires YAJL in combination with a Python binding
like ijson[1]. A parser like ijson has a very different API from simplejson
or the standard library parser, so the code paths would have to diverge and
larger pieces of CouchDB-Python be rewritten to handle it. I also suspect that
making it a strict requirement would be untenable. The best approach would
perhaps be to use the http and client modules of CouchDB-Python directly,
subclassing Resource, Server, Database, etc. I don't see an immediately
straightforward way to just graft a streaming parser into the system
without lots of new code.
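
For illustration only, here is a lighter-weight alternative to a full streaming parser. It is not ijson or YAJL, and it is not how CouchDB-Python works today. It is a minimal sketch that assumes the server writes each view row on its own line, which is a formatting habit and not something the JSON format guarantees. The function name iter_view_rows and the sample response are made up for this example.

```python
import json


def iter_view_rows(lines):
    """Yield view rows one at a time from an iterable of response lines.

    Assumption (not guaranteed by JSON): the response is laid out as
        {"total_rows":N,"offset":M,"rows":[
        {...row...},
        {...row...}
        ]}
    with exactly one row object per line.
    """
    for line in lines:
        line = line.strip()
        if line.endswith(","):
            line = line[:-1]  # drop the separator between rows
        # The header line ends with "[" and the footer starts with "]",
        # so neither passes this check.
        if not (line.startswith("{") and line.endswith("}")):
            continue
        obj = json.loads(line)
        if "rows" in obj:
            # The whole response arrived on a single line (e.g. empty view).
            for row in obj["rows"]:
                yield row
        else:
            yield obj


# Hypothetical sample response, for demonstration only.
SAMPLE_RESPONSE = (
    '{"total_rows":3,"offset":0,"rows":[\r\n'
    '{"id":"a","key":"a","value":1},\r\n'
    '{"id":"b","key":"b","value":2},\r\n'
    '{"id":"c","key":"c","value":3}\r\n'
    ']}\n'
)

rows = list(iter_view_rows(SAMPLE_RESPONSE.splitlines()))
```

Only one line is decoded at a time, so memory use is bounded by the largest row, provided the HTTP layer hands over the body line by line instead of reading it whole. If the one-row-per-line assumption does not hold, a real incremental parser is still needed.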
I wrote some code to support iterating over rows.
https://github.com/openlibrary/openlibrary/blob/master/openlibrary/core/couch.py
@Matt had implemented iterative views a long time ago
http://code.google.com/r/mattgoodall-couchdb-python-iterview/source/browse/couchdb/client.py#829
After some time in production I can say that they work perfectly. The
only thing left is to use them by default for db iteration and ViewField,
with some constant batch size: a small value such as 100 produces too many
requests, while 10K is close to optimal and easy on both memory and request
count, but it should be tweakable.
Why not bring this feature into the main codebase?
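
To make the batching idea concrete, here is a rough sketch of iterating a view in fixed-size batches. It is not Matt's actual implementation. The names iter_batched and fetch_page are hypothetical, and fetch_page stands in for whatever issues the real view request. With CouchDB the start position would presumably map to the startkey and startkey_docid query parameters, and the batch size to limit. The trick is to ask for one row more than the batch size and use that extra row as the start of the next request.

```python
def iter_batched(fetch_page, batch=10000):
    """Yield all rows of a view, fetching at most batch + 1 rows per request.

    fetch_page(start, limit) is a hypothetical callable: start is None for
    the first request, otherwise a (key, docid) pair to resume from; it
    returns a list of row dicts with "key" and "id" entries.
    """
    if batch <= 0:
        raise ValueError("batch must be positive")
    start = None
    while True:
        rows = fetch_page(start, batch + 1)
        for row in rows[:batch]:
            yield row
        if len(rows) <= batch:
            break  # no extra row came back, so this was the last batch
        # The extra row is not yielded now; it opens the next batch.
        start = (rows[batch]["key"], rows[batch]["id"])


# In-memory stand-in for a view, for demonstration only.
ROWS = [{"id": "doc%03d" % i, "key": i // 2, "value": None} for i in range(25)]
request_limits = []


def fake_fetch_page(start, limit):
    request_limits.append(limit)
    if start is None:
        begin = 0
    else:
        begin = next(i for i, r in enumerate(ROWS)
                     if (r["key"], r["id"]) == start)
    return ROWS[begin:begin + limit]


result = list(iter_batched(fake_fetch_page, batch=10))
```

Here 25 rows with a batch of 10 take three requests. The batch size trades request count against memory, which is why it should stay a parameter and not be hard-coded.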