Does CouchDB check autogenerated document IDs?

A.J. Brown

Jan 2, 2009, 10:34:37 AM
to us...@couchdb.apache.org
Hi all,

I'm writing a proof of concept for our organization to use CouchDB for our content. Part of the task will be generating document IDs from legacy IDs (autonumber integers). This would be done by simply taking an md5() hash of the legacy ID.

Since our user-provided document IDs are 32-character hex strings, the same format CouchDB generates, will CouchDB make sure a newly generated ID doesn't already exist before using it for a new document? I know the chances of generating the exact same ID are slim to none, but the question still remains.

--
A.J. Brown
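
A minimal sketch of the ID scheme described above, assuming the legacy ID is hashed in its decimal string form (which is what PHP's md5() would see; the original post does not say which language is used). The function name `legacy_id_to_doc_id` is made up for illustration.

```python
import hashlib


def legacy_id_to_doc_id(legacy_id: int) -> str:
    """Return a 32-character hex document ID derived from a legacy integer ID."""
    # Assumption: the integer is hashed as its decimal string, e.g. 42 -> "42".
    return hashlib.md5(str(legacy_id).encode("ascii")).hexdigest()


doc_id = legacy_id_to_doc_id(42)
print(doc_id, len(doc_id))
```

Because the mapping is deterministic, importing the same legacy row twice always targets the same document ID, and the result has the same 32-character hex shape as an ID that CouchDB generates itself.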

Jason Davies

Jan 2, 2009, 10:55:15 AM
to us...@couchdb.apache.org

According to the documentation on the Wiki, POSTing to _uuids
retrieves a list of *unused* document IDs, so one would assume that it
does check to ensure the ID doesn't exist.

However, from reading the code it looks like *no* checks are actually
made (either when POSTing to _uuids or when POSTing to save a document
with an autogenerated ID). Given that the UUID is generated from a secure
random source, the chances of a collision are so slim that there is no
point worrying about it (see http://en.wikipedia.org/wiki/Universally_Unique_Identifier#Random_UUID_probability_of_duplicates).
If a collision does occur, then I think all that would happen is that
you would get a document update conflict error. Perhaps someone more
familiar with the internals can confirm this?

Jason
--
Jason Davies

www.jasondavies.com
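
To put a number on "so slim", here is a rough sketch using the standard birthday approximation p ≈ 1 - exp(-n(n-1)/(2N)). It assumes IDs are drawn uniformly from N = 2^128 values, matching a 32-character hex string; a version 4 UUID has only 122 random bits, so pass `bits=122` for that case. The function name is hypothetical.

```python
import math


def collision_probability(n_ids: int, bits: int = 128) -> float:
    """Approximate chance that at least two of n_ids random IDs coincide."""
    space = 2 ** bits
    # Birthday approximation: p = 1 - exp(-n(n-1) / (2N)).
    # expm1 keeps precision when the exponent is extremely small.
    return -math.expm1(-n_ids * (n_ids - 1) / (2 * space))


for n in (10**6, 10**9, 10**12):
    print(f"{n:>17,d} ids -> p ~ {collision_probability(n):.3e}")
```

Even at a trillion documents the estimate stays many orders of magnitude below anything worth guarding against with an extra lookup.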

Paul Davis

Jan 2, 2009, 10:55:21 AM
to us...@couchdb.apache.org
A.J.,

Depends on what you mean:

When saving, if a document happened to get an auto-generated DocID
that already existed it'd throw a 412 precondition failed error. This
would be the same error you'd get if you retrieved a doc, deleted its
_rev member and tried saving.

On the other hand, if you request a set of IDs from the _uuids
endpoint, they aren't checked against the current DB to enforce
uniqueness (which would be a race condition anyway).

Also remember that if you ever get an autogenerated ID collision then
something is most definitely broken. For an example of a similar
situation, see the Ubuntu OpenSSL brouhaha with public key collisions.

HTH,
Paul Davis

Paul Davis

Jan 2, 2009, 11:11:01 AM
to us...@couchdb.apache.org
On Fri, Jan 2, 2009 at 10:55 AM, Jason Davies <ja...@jasondavies.com> wrote:
>
> On 2 Jan 2009, at 15:34, A.J. Brown wrote:
>
>> I'm writing a proof of concept for our organization to use couchdb for our
>> content. Part of the task will be generating document Ids from legacy Ids
>> (autonumber integers). This would be done by simply doing an md5() hash
>> on the legacy id.
>>
>> Since we're storing the same 32char hex string as our user-provided
>> document ids, When CouchDB generates an Id for a new document, will it make
>> sure that ID doesn't exist first? I know the chances of generating the same
>> exact ID are slim to none, but the question still remains.
>
> According to the documentation on the Wiki, POSTing to _uuids retrieves a
> list of *unused* document IDs, so one would assume that it does check to
> ensure the ID doesn't exist.
>

Hmm. Unused does kind of give the connotation of things being checked.
Perhaps a special caveat should be presented.

Also, as a random aside, why does _uuids require a POST?

Jason Davies

Jan 2, 2009, 11:16:59 AM
to us...@couchdb.apache.org

On 2 Jan 2009, at 16:11, Paul Davis wrote:

> Also, as a random aside, why does _uuids require a POST?


I guess it's to comply with REST as it returns a different output each
time it is called.

Adam Kocoloski

Jan 2, 2009, 11:17:53 AM
to us...@couchdb.apache.org
On Jan 2, 2009, at 11:11 AM, Paul Davis wrote:

> Also, as a random aside, why does _uuids require a POST?

Well, for one thing you definitely don't want that request cached :-)

Paul Davis

Jan 2, 2009, 11:19:10 AM
to us...@couchdb.apache.org

Touché

A.J. Brown

Jan 2, 2009, 11:32:24 AM
to us...@couchdb.apache.org
Hmmm, while I knew the chances of a collision were slim, after reading the Wikipedia page on UUIDs and the probability of a collision, I don't even think the question is worth asking :)


--

Damien Katz

Jan 2, 2009, 12:19:53 PM
to us...@couchdb.apache.org

On Jan 2, 2009, at 11:32 AM, A.J. Brown wrote:

> Hmmm, while I knew the chances of a collision were slim, after
> reading the wikipedia page on UUIDs and the probability of a
> collision, I don't even think the question is worth asking :)
>

Heh, it's worth asking (it means you are doing your job), but the
answer for all practical purposes is "it doesn't matter". The
documentation as stated is incorrect (for one thing, the UUID
generation request is independent of a database), and performing such
a check would be needlessly expensive, given the extremely low
likelihood of collision. And of course if there is a collision, it's
automatically detected and no data is lost.

-Damien

Jan Lehnardt

Jan 2, 2009, 1:13:59 PM
to us...@couchdb.apache.org

On 2 Jan 2009, at 18:19, Damien Katz wrote:

> The documentation as stated is incorrect

Fixed.

Cheers
Jan
--

Matt Goodall

Jan 2, 2009, 7:10:28 PM
to us...@couchdb.apache.org
2009/1/2 Jason Davies <ja...@jasondavies.com>:

>
> On 2 Jan 2009, at 16:11, Paul Davis wrote:
>
>> Also, as a random aside, why does _uuids require a POST?
>
>
> I guess it's to comply with REST as it returns a different output each time
> it is called.


I don't see a problem with GET'ing a set of UUIDs. It doesn't change
the server's state in any way and doing another GET continues to have
no effect.

It's a time-dependent resource, that's all.

- Matt

Chris Anderson

Jan 2, 2009, 7:52:29 PM
to us...@couchdb.apache.org
On Fri, Jan 2, 2009 at 4:10 PM, Matt Goodall <matt.g...@gmail.com> wrote:
> 2009/1/2 Jason Davies <ja...@jasondavies.com>:
>>
>> On 2 Jan 2009, at 16:11, Paul Davis wrote:
>>
>>> Also, as a random aside, why does _uuids require a POST?
>>

In the normal case you would POST a document to a collection when you
want the server to choose the final URL. However, intermediaries have
a habit of retrying POSTs randomly, so when you POST an id-less Couch
document, occasionally duplicate documents are created. We work around
this by recommending PUT as the document creation method. Of course
clients can specify any document id they'd like to, but for
lightweight clients CouchDB provides the _uuids service.

The POST is pragmatic for cache-control reasons, but also RESTy,
because it exposes the service that CouchDB uses internally for
directing document POSTs to new ids. By using the _uuids service,
clients can become the part of CouchDB that would direct documents to
URLs in a collection.

--
Chris Anderson
http://jchris.mfdz.com
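
The difference between a replayed POST and a replayed PUT can be sketched with a toy in-memory model. This is not CouchDB code: `ToyStore` and its return values are invented for illustration, and a real server would report the rejected write as an HTTP error (a 412, per Paul Davis earlier in the thread).

```python
import uuid


class ToyStore:
    """In-memory stand-in for a document collection (not CouchDB itself)."""

    def __init__(self):
        self.docs = {}

    def post(self, doc):
        # The server picks the ID, so each delivery of the request adds a doc.
        doc_id = uuid.uuid4().hex
        self.docs[doc_id] = doc
        return doc_id

    def put(self, doc_id, doc):
        # Creating a doc under an existing ID is rejected, never duplicated.
        if doc_id in self.docs:
            return "conflict"
        self.docs[doc_id] = doc
        return "created"


doc = {"title": "hello"}

posted = ToyStore()
posted.post(doc)
posted.post(doc)  # an intermediary replays the POST
print(len(posted.docs))  # 2 documents

put_store = ToyStore()
client_id = uuid.uuid4().hex  # stands in for an ID fetched from _uuids
print(put_store.put(client_id, doc))  # created
print(put_store.put(client_id, doc))  # conflict (the replayed PUT)
print(len(put_store.docs))  # 1 document
```

Choosing the ID before sending is what makes the retry harmless: the second delivery names the same resource, so the store can recognise it.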

Noah Slater

Jan 3, 2009, 8:12:18 AM
to us...@couchdb.apache.org

I don't agree and I think it should change to GET.

* You hint that it is to mirror the process required for the creation of
documents. This is not how we should be designing the interface. UUID
creation is totally disjoint and should be considered separately.

* You mention cache-control, but nothing about GET semantics implies
cacheability so unless there is some major flaw with common UA
implementations I don't see this as a valid argument.

I would suggest you open a bug for this issue Matt.

--
Noah Slater, http://tumbolia.org/nslater

Antony Blakey

Jan 3, 2009, 8:28:59 AM
to us...@couchdb.apache.org

GET is meant to be idempotent - http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html

Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

Hi, I'd like to do $THING. I know that $SOLUTION_A and $SOLUTION_B
will do it very easily and for a very reasonable price, but I don't
want to use $SOLUTION_A or $SOLUTION_B because $VAGUE_REASON and
$CONTRADICTORY_REASON. Instead, I'd like your under-informed ideas on
how to achieve my $POORLY_CONCEIVED_AMBITIONS using Linux, duct tape,
an iPod, and hours and hours of my precious time.
-- Slashdot response to an enquiry


Noah Slater

Jan 3, 2009, 8:41:08 AM
to us...@couchdb.apache.org
On Sat, Jan 03, 2009 at 11:58:59PM +1030, Antony Blakey wrote:
> GET is meant to be idempotent -
> http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html.

So? How does the idempotency of GET affect the UUID service?

Jason Davies

Jan 3, 2009, 8:43:42 AM
to us...@couchdb.apache.org

After researching this in more depth it turns out I was indeed
mistaken in thinking *any* responses to GET requests can potentially
be cached. Quoting from RFC 2616 [1]:

"The response to a GET request is cacheable if and only if it meets
the requirements for HTTP caching described in section 13."

And section 13 [2] goes on to say that operations are transparent by
default, and transparency can be relaxed:

- only by an explicit protocol-level request when relaxed by
client or origin server

- only with an explicit warning to the end user when relaxed by
cache or client

So as you say, unless common implementations have flaws, I think GET
responses would not be cached unless we explicitly say so, assuming we
handle the HEAD + If-Modified-Since etc. requests correctly (which is
what a conforming cache proxy should do).

[1]: http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html#sec9.3
[2]: http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html

Antony Blakey

Jan 3, 2009, 8:43:45 AM
to us...@couchdb.apache.org

On 03/01/2009, at 11:58 PM, Antony Blakey wrote:

> GET is meant to be idempotent - http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html

Although the nature of UUIDs means that allocation doesn't change the
server state, so that's irrelevant.

Any situation where caching of GET requests is an issue is going to
have very serious problems getting stale documents where a rev isn't
supplied as a query parameter.

So, I agree with Noah.

Antony Blakey


-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

Nothing is really work unless you would rather be doing something else.
-- J. M. Barre


Jason Davies

Jan 3, 2009, 8:45:31 AM
to us...@couchdb.apache.org
On 3 Jan 2009, at 13:41, Noah Slater wrote:
> On Sat, Jan 03, 2009 at 11:58:59PM +1030, Antony Blakey wrote:
>> GET is meant to be idempotent -
>> http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html.
>
> So? How does the idempotency of GET affect the UUID service?

Quoting from the RFC: A sequence that never has side effects is
idempotent, by definition (provided that no concurrent operations are
being executed on the same set of resources).

Hence the UUID service is idempotent, as it has no side effects.

Noah Slater

Jan 3, 2009, 8:47:58 AM
to us...@couchdb.apache.org

My point exactly, glad we're all in agreement! Heh heh.

A.J. Brown

Jan 3, 2009, 8:49:57 AM
to us...@couchdb.apache.org
W3 also goes on to say that it's ok if a change occurs, as long as the
user did not request it and isn't held responsible:

"The important distinction here is that the user did not request the
side-effects, so therefore cannot be held accountable for them."

It's the equivalent of saying "show me the state of document #3", and then
making the same request an hour later, if you ask me.

Antony Blakey

Jan 3, 2009, 8:50:43 AM
to us...@couchdb.apache.org

On 04/01/2009, at 12:11 AM, Noah Slater wrote:

> On Sat, Jan 03, 2009 at 11:58:59PM +1030, Antony Blakey wrote:
>> GET is meant to be idempotent -
>> http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html.
>
> So? How does the idempotency of GET affect the UUID service?

In an inverse way to how I first thought. In an ideal world GET would
carry a payload so that POST vs. GET could be decided on the basis of
idempotency, rather than by the limitation on the size of query
parameters.

Antony Blakey


-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

Did you hear about the Buddhist who refused Novocain during a root
canal?
His goal: transcend dental medication.


Noah Slater

Jan 3, 2009, 8:54:18 AM
to us...@couchdb.apache.org
On Sun, Jan 04, 2009 at 12:20:43AM +1030, Antony Blakey wrote:
>
> On 04/01/2009, at 12:11 AM, Noah Slater wrote:
>
>> On Sat, Jan 03, 2009 at 11:58:59PM +1030, Antony Blakey wrote:
>>> GET is meant to be idempotent -
>>> http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html.
>>
>> So? How does the idempotency of GET affect the UUID service?
>
> In an inverse way to how I first thought. In an ideal world GET would
> carry a payload so that POST vs. GET could be decided on the basis of
> idempotency, rather than by the limitation on the size of query
> parameters.

You lost me with this. Are we still talking about CouchDB here or are we on an
(interesting but unrelated) tangent? What has the limit of query strings to do
with the decision to use POST vs. GET? Are you talking about the situation where
you want to pass in data to a GET request that would be too long (in practice,
the RFC doesn't specify a limit) for a query string? My rule of thumb would be
that if you find yourself in that situation, you're doing something that could
probably be simplified.

Antony Blakey

Jan 3, 2009, 9:17:06 AM
to us...@couchdb.apache.org

On 04/01/2009, at 12:24 AM, Noah Slater wrote:

> You lost me with this. Are we still talking about CouchDB here or are
> we on an (interesting but unrelated) tangent?

A tangent, but the issue does impact Couch ...

> What has the limit of query strings to do with the decision to use
> POST vs. GET? Are you talking about the situation where you want to
> pass in data to a GET request that would be too long (in practice,
> the RFC doesn't specify a limit) for a query string? My rule of thumb
> would be that if you find yourself in that situation, you're doing
> something that could probably be simplified.

The multikey get in Couch should be a GET, but it can't be unless you
want your API to be limited by the (practical) limitation on URL length.

From an API perspective, I think POST and GET mix up idempotency with
the ability to have a payload or not, which in practical terms results
in people using POST when they should use GET, because those two
issues, whilst theoretically orthogonal, are not implemented that way.

Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

Borrow money from pessimists - they don't expect it back.
-- Steven Wright
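
The URL-length concern can be made concrete with a short sketch that encodes the same key list both ways. The database, design document and view names are hypothetical, and the `?keys=` query-string form is shown only as the GET shape being argued for, not as something the CouchDB of this thread is known to accept.

```python
import json
from urllib.parse import quote

# Hypothetical database, design document and view names.
VIEW_PATH = "/mydb/_design/posts/_view/by_id"


def multikey_as_get(keys):
    """Encode the key list into the query string of a GET URL."""
    return VIEW_PATH + "?keys=" + quote(json.dumps(keys), safe="")


def multikey_as_post(keys):
    """Return (path, body) for a multikey POST; the keys travel in the body."""
    return VIEW_PATH, json.dumps({"keys": keys})


keys = [f"{i:032x}" for i in range(500)]  # 500 made-up 32-char hex IDs
get_url = multikey_as_get(keys)
path, body = multikey_as_post(keys)
print(len(get_url), len(path), len(body))
```

With a few hundred 32-character IDs the GET URL runs to tens of thousands of characters, far beyond what many clients and proxies accept in practice, while the POST path stays short whatever the number of keys.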


Jason Davies

Jan 3, 2009, 9:41:56 AM
to us...@couchdb.apache.org

On 3 Jan 2009, at 14:17, Antony Blakey wrote:
> The multikey get in Couch should be a GET, but it can't be unless
> you want your API to be limited by the (practical) limitation on URL
> length.
>
> From an API perspective, I think POST and GET mix up idempotency
> with the ability to have a payload or not, which in practical terms
> results in people using POST when they should use GET, because those
> two issues, whilst theoretically orthogonal, are not implemented
> that way.

Continuing the tangent...

Someone should invent an extension to HTTP whereby a client may issue
multiple GET requests at once at the beginning of a single TCP
connection. These resources may take time to generate, but are
amenable to parallelisation of some kind thus making it advantageous
to do this. This is a bit like KeepAlive, except that you can request
e.g. multiple CouchDB keys right at the beginning for maximum
performance, rather than serially.

This is where I hope someone will pipe up and say this already
exists :-)

Matt Goodall

Jan 3, 2009, 10:12:22 AM
to us...@couchdb.apache.org
2009/1/3 Noah Slater <nsl...@apache.org>:

Done, and thanks for the interesting discussion too :).

- Matt

Jason Davies

Jan 3, 2009, 10:13:07 AM
to us...@couchdb.apache.org


Aha, http://en.wikipedia.org/wiki/HTTP_pipelining is what I am
thinking of.

Although it kind of works as a potential alternative to the multikey
POST, it won't return rows in any particular order, whereas the
current multikey POST returns rows in the same order as the keys
specified. There will also be slightly more overhead in doing
pipelining, as each request would have to be handled separately on the
server-side. In any case, most Web browsers either don't support
pipelining or they have it turned off by default, so writing AJAX apps
using this would be a no-no.

As CouchDB is HTTP/1.1-compliant, it should support pipelining out of
the box.
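
For readers who want to see what pipelining looks like on the wire, here is a rough sketch that writes several GET requests on one connection before reading anything back. It uses a raw socket because, as noted later in the thread, Python's httplib does not pipeline. The host, port and document paths are placeholders, and the responses are returned unparsed. For what it's worth, RFC 2616 (section 8.1.2.2) requires servers to answer pipelined requests in the order they were received, so responses can be matched to requests by position.

```python
import socket


def build_pipelined_gets(host, paths):
    """Return the bytes of several HTTP/1.1 GET requests back to back."""
    requests = []
    for i, path in enumerate(paths):
        lines = [f"GET {path} HTTP/1.1", f"Host: {host}"]
        if i == len(paths) - 1:
            # Ask the server to close after the final response so a
            # simple read-until-EOF loop terminates.
            lines.append("Connection: close")
        requests.append("\r\n".join(lines) + "\r\n\r\n")
    return "".join(requests).encode("ascii")


def send_pipelined(host, port, paths):
    """Write all requests at once, then read every response until EOF."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(build_pipelined_gets(host, paths))
        chunks = []
        while True:
            chunk = sock.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)  # raw, unparsed responses


# Placeholder host and document paths; nothing is sent here.
payload = build_pipelined_gets("localhost:5984", ["/mydb/doc1", "/mydb/doc2"])
print(payload.decode("ascii"))
```

Calling `send_pipelined("localhost", 5984, [...])` against a running server would return the concatenated responses; splitting them apart requires honouring each response's Content-Length or chunked encoding, which this sketch leaves out.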

Noah Slater

Jan 3, 2009, 10:14:15 AM
to us...@couchdb.apache.org
On Sat, Jan 03, 2009 at 02:41:56PM +0000, Jason Davies wrote:
> Someone should invent an extension to HTTP...

Danger, Will Robinson!

> whereby a client may issue multiple GET requests at once at the beginning of a
> single TCP connection.

...


> This is where I hope someone will pipe up and say this already exists :-)

Why not just thread your client code? :)

Jason Davies

Jan 3, 2009, 10:20:34 AM
to us...@couchdb.apache.org
On 3 Jan 2009, at 15:14, Noah Slater wrote:
>
>> whereby a client may issue multiple GET requests at once at the
>> beginning of a single TCP connection.
> ...
>> This is where I hope someone will pipe up and say this already
>> exists :-)
>
> Why not just thread your client code? :)

That doesn't really solve the issue, which is that making separate
HTTP requests is expensive for people on dialup :-)

Adam Kocoloski

Jan 3, 2009, 11:25:18 AM
to us...@couchdb.apache.org
Hi Jason, Couch does support pipelining, and takes advantage of that
support in the replication code. Unfortunately, you'll find that not
all HTTP clients can pipeline; Python's httplib is one that can't.

Best,

Adam

Sent from my iPhone
