update() and new _bulk_docs - bugs and API comments

Matt Goodall

unread,

Mar 18, 2009, 6:38:30 AM3/18/09

to couchdb...@googlegroups.com

Hi,

I was looking through the recent change to match the new _bulk_docs
behaviour and immediately noticed it does not expect any of the docs
to fail to update - there is no 'rev' key in a failing document's
result dict.

The use of a generator to process the _bulk_docs response actually
masks the above problem.

>>> doc_id = db.create({})
>>> doc = db.get(doc_id)
>>> db.update([doc])
<generator object at 0x8a15dac>
>>> db.update([doc])
<generator object at 0x8a15fcc>

>>> doc_id = db.create({})
>>> doc = db.get(doc_id)
>>> db.update([doc])
<generator object at 0x8a15e4c>
>>> list(db.update([doc]))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/matt/src/ext/couchdb-python/couchdb/client.py", line
544, in _update
doc.update({'_id': result['id'], '_rev': result['rev']})
KeyError: 'rev'

However, I think the generator gives update() a "surprising" API, and
has been doing so for some time. Before the _bulk_docs change the
updated documents' revs were not updated; now it raises an exception
(because of the bug ;-)).

Regardless of the above I don't think Database.update() should really
try to process the _bulk_docs response much anymore. I think it has to
be left to the application now. In other words, update() should simply
return a nicer form of the _bulk_docs response and leave the rest to
the application.

Perhaps the simplest version of my proposal would be:

def update(self, documents):
docs = []
for doc in documents:
if isinstance(doc, dict):
docs.append(doc)
elif hasattr(doc, 'items'):
docs.append(dict(doc.items()))
else:
raise TypeError('expected dict, got %s' % type(doc))
resp, data = self.resource.post('_bulk_docs', content={'docs': docs})
return data

A slightly better version, that makes the possibility/likelihood of
failure more explicit by yielding (status, result) tuples:

def update(self, documents):
docs = []
for doc in documents:
if isinstance(doc, dict):
docs.append(doc)
elif hasattr(doc, 'items'):
docs.append(dict(doc.items()))
else:
raise TypeError('expected dict, got %s' % type(doc))
resp, data = self.resource.post('_bulk_docs', content={'docs': docs})
def _gen_result():
for result in data:
yield 'error' not in result, result
return _gen_result()

And finally, a version that yields (status, id, rev or
ResourceConflict) tuples. The ResourceConflict makes for improved
consistency with Database.__setitem__:

def update(self, documents):
docs = []
for doc in documents:
if isinstance(doc, dict):
docs.append(doc)
elif hasattr(doc, 'items'):
docs.append(dict(doc.items()))
else:
raise TypeError('expected dict, got %s' % type(doc))
resp, data = self.resource.post('_bulk_docs', content={'docs': docs})
def _gen_result():
for result in data:
if result.get('error') == 'conflict':
yield False, result['id'],
ResourceConflict('conflict', result['reason'])
else:
yield True, result['id'], result['rev']
return _gen_result()

Note that update does not even update the revs of successfully updated
documents anymore. I think that's important given the new semantics of
_bulk_docs but it also makes the create() and update() API a little
more consistent at the same time. create() does not update the rev,
update() does. I always found that a little weird ;-).

Now, I know that this completely changes the update API ... but I
think we can blame CouchDB for that ;-).

Hope that all makes sense. cmlenz, if you want me to implement one of
the above (I favour the last version) with tests and send you a patch
then please ask.

- Matt

Matt Goodall

unread,

Mar 26, 2009, 1:54:12 PM3/26/09

to couchdb...@googlegroups.com

Ping! Any thoughts on this issue. couchdb-python is currently broken
against CouchDB trunk (aka 0.9).

- Matt

2009/3/18 Matt Goodall <matt.g...@gmail.com>:

Christopher Lenz

unread,

Mar 27, 2009, 4:54:54 AM3/27/09

to couchdb...@googlegroups.com

On 26.03.2009, at 18:54, Matt Goodall wrote:
> Ping! Any thoughts on this issue. couchdb-python is currently broken
> against CouchDB trunk (aka 0.9).

Sorry for the delay here. I've checked in some changes to the update()
implementation (plus tests) that I still had lying around in my local
workspace but forgot to commit.

<http://code.google.com/p/couchdb-python/source/detail?r=140>

Aside from enabling the use of the "all_or_nothing" flag, it basically
keeps the previous semantics, while fixing the handling of conflict
errors. Instead of yielding the patched Document instance, it'll just
yield the error dict.

Beyond that, I think I agree that the generator approach isn't really
appropriate for this method. I'm unsure on how to improve the handling
of errors, putting exception instances into the returned list seems
very awkward, but of course throwing an exception isn't an option.
Maybe allow the caller to pass in a error_fun that gets called for all
failed updates??

Cheers,
--
Christopher Lenz
cmlenz at gmx.de
http://www.cmlenz.net/

Matt Goodall

unread,

Mar 27, 2009, 7:21:21 AM3/27/09

to couchdb...@googlegroups.com

2009/3/27 Christopher Lenz <cml...@gmx.de>:

>
> On 26.03.2009, at 18:54, Matt Goodall wrote:
>> Ping! Any thoughts on this issue. couchdb-python is currently broken
>> against CouchDB trunk (aka 0.9).
>
> Sorry for the delay here. I've checked in some changes to the update()
> implementation (plus tests) that I still had lying around in my local
> workspace but forgot to commit.
>
> <http://code.google.com/p/couchdb-python/source/detail?r=140>

Thanks, I'll use that and see how I get on. I have a couple of
immediate thoughts, see below.

> Aside from enabling the use of the "all_or_nothing" flag, it basically
> keeps the previous semantics, while fixing the handling of conflict
> errors. Instead of yielding the patched Document instance, it'll just
> yield the error dict.

So, now we get a dict to represent an error and a dict (or sometimes a
Document, I think?) to represent a success. That's going to make
things quite awkward to detect errors.

>>> doc1 = {'num': 1}
>>> doc2 = {'num': 2, 'error': 'conflict'}
>>> doc1 = list(db.update([doc1]))[0]
>>> doc1
{'_rev': '1-3143113288', 'num': 1, '_id': '7a1605c6f02e3b4fde55d8e4aa4c3aed'}
>>> db.update([doc1])
<generator object at 0xb78fa8cc>
>>> doc1
{'_rev': '1-3143113288', 'num': 1, '_id': '7a1605c6f02e3b4fde55d8e4aa4c3aed'}
>>> list(db.update([doc1]))
[{'reason': 'Document update conflict.', 'id':
'7a1605c6f02e3b4fde55d8e4aa4c3aed', 'error': 'conflict'}]
>>> results = list(db.update([doc1, doc2]))
>>> results
[{'reason': 'Document update conflict.', 'id':
'7a1605c6f02e3b4fde55d8e4aa4c3aed', 'error': 'conflict'}, {'_id':
'a5277d46781bad945fbef855a5a80168', '_rev': '1-78329959', 'num': 2,
'error': 'conflict'}]
>>> [type(result) for result in results]
[<type 'dict'>, <type 'dict'>]
>>>

Most of that's just setting up the db to cause a conflict on doc1 at
the same time as another doc, doc2, is created that looks remarkably
like an error. The last three prompts and their output highlight the
interesting bits:

* update() returns two results, as expected.
* both results are of type dict, so can't type check to detect conflicts.
* both results contain an 'error' key, the logical thing to test for,
so can't use that to detect conflicts.

So, that means the correct test for a conflict result would be "'_id'
not in result" which, although not complicated, is not at all
intuitive imho.

> Beyond that, I think I agree that the generator approach isn't really
> appropriate for this method.

Yep, I think it needs ripping out to avoid the unexpected difference
of iterating vs not iterating the results.

Removing it should not affect anyone who already iterates the results.
However, it may affect those who do not iterate the results *and*
reuse the docs they passed to update(). I suspect that's a very small
number of users, if any.

With the bulk_docs change it's basically a requirement to iterate the
results now - either inside update() or by the caller - so people
affected by the bulk_docs change are going to have to update their
code anyway.

> I'm unsure on how to improve the handling
> of errors, putting exception instances into the returned list seems
> very awkward, but of course throwing an exception isn't an option.

I agree that returning something like a (status, exception/doc) tuple
for each result is not ideal but bulk_docs has, to put it crudely,
become an efficient HTTP transport so it does make some sense.

Twisted's DeferredList uses the (status, exception/response) approach
to collect the results from a list of async operations. Again, it's
not nice but it's about the best API I can think of for what the
developer is trying to do.

> Maybe allow the caller to pass in a error_fun that gets called for all
> failed updates??

Not entirely clear what you're suggesting here as it could be one of:

1. error_fun is called once per conflict result
2. error_fun is called one time with the list of conflict results.
3. error_fun is called one time with the whole result list - success
and failures.

I don't think handling failures in isolation is generally useful
which, for me, immediately rules out 1 and 2. If something goes wrong
as part of a bulk_docs operation then there's a very good chance
you'll just ignore the failures and undo the successes to regain some
level of data consistency.

So, that leaves option 3 which sounds a little too much like an error
callback for my liking. (Not that there's anything wrong with
callbacks in any way, they're just not that useful for handling the
results of blocking calls.)

Another approach entirely might be to scan the list for sign of a
conflict and return either None or raise an exception containing a
reference to the list of results (success and failures). At least you
get something approaching normal looking Python error handling that
way. However, I'm not even sure about this idea as failing to handle
the exception leaves your database inconsistent, although I guess
that's true of any code that makes non-transactional changes to
persistent storage.

Not easy, this! My gut feeling says that problems with maintaining
consistency in a CouchDB has now been pushed so far back into the
application that only the application can really handle it anyway ...
so you might as well make the update() API a syntactically nice way of
making the HTTP call without doing anything clever.