I'm writing a proof of concept for our organization to use CouchDB for our content. Part of the task will be generating document IDs from legacy IDs (autonumbered integers); this would be done by simply taking an md5() hash of the legacy ID.
Since we'd be storing the same kind of 32-character hex string as our user-provided document IDs: when CouchDB generates an ID for a new document, will it make sure that ID doesn't already exist? I know the chances of generating the exact same ID are slim to none, but the question still stands.
--
A.J. Brown
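For reference, the legacy-ID mapping described above can be sketched in Python. Hashing the decimal string form of the integer is an assumption; any stable serialization works as long as every client uses the same one:

```python
import hashlib

def legacy_doc_id(legacy_id):
    """Map a legacy autonumber id to a 32-char hex CouchDB document id."""
    # Hash the decimal string form of the id; the exact encoding is a
    # convention, but it must be the same everywhere or ids won't line up.
    return hashlib.md5(str(legacy_id).encode("ascii")).hexdigest()
```

The mapping is deterministic, so re-importing the same legacy record always lands on the same document ID.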
According to the documentation on the Wiki, POSTing to _uuids
retrieves a list of *unused* document IDs, so one would assume that it
does check to ensure the ID doesn't exist.
However, from reading the code it looks like *no* checks are actually
made (either when POSTing to _uuids or when POSTing to save a document
with autogenerated ID), but given that the UUID is securely randomly
generated the chances of a collision are so slim there is no point
worrying about it (see http://en.wikipedia.org/wiki/Universally_Unique_Identifier#Random_UUID_probability_of_duplicates)
. If a collision does occur, then I think all that would happen is
you would get a document update conflict error, perhaps someone more
familiar with the internals can confirm this?
Jason
--
Jason Davies
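Jason's point about collision odds can be made concrete with the standard birthday approximation, assuming version-4 UUIDs with 122 random bits (the Wikipedia page linked above walks through the derivation):

```python
def collision_probability(n, random_bits=122):
    """Birthday-bound estimate of at least one collision among n random UUIDs."""
    # For n much smaller than 2**random_bits, the probability of any
    # collision is approximately (n choose 2) / 2**random_bits.
    return n * (n - 1) / (2 * 2 ** random_bits)

# Even with a billion documents, the odds are on the order of 1e-19:
p = collision_probability(10 ** 9)
```

That is many orders of magnitude below, say, the chance of a disk corrupting the same data.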
Depends on what you mean:
When saving, if a document happened to get an auto-generated DocID
that already existed, it'd throw a 412 Precondition Failed error. This
would be the same error you'd get if you retrieved a doc, deleted its
_rev member and tried saving.
On the other hand, if you request a set of IDs from the _uuids
endpoint, they aren't checked against the current DB to enforce
uniqueness (which would be a race condition anyway).
Also remember that if you ever get an autogenerated ID collision then
something is most definitely broken. For an example of a similar
situation, see the Ubuntu OpenSSL brouhaha with public key collisions.
HTH,
Paul Davis
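A toy in-memory sketch of the conflict check Paul describes (a deliberate simplification: real CouchDB tracks full revision trees and signals the problem over HTTP with a 412/conflict response rather than a Python exception):

```python
class ConflictError(Exception):
    """Stands in for CouchDB's 'precondition failed' / update-conflict response."""

class ToyCouch:
    """Minimal model: a save must either target a doc id that doesn't exist
    yet, or carry the doc's current rev; otherwise the write is rejected,
    never silently overwritten."""
    def __init__(self):
        self._docs = {}  # doc_id -> (rev, body)

    def save(self, doc_id, body, rev=None):
        current = self._docs.get(doc_id)
        if current is not None and rev != current[0]:
            # An auto-generated id colliding with an existing doc lands here:
            # the write fails loudly instead of clobbering data.
            raise ConflictError(doc_id)
        new_rev = (current[0] + 1) if current else 1
        self._docs[doc_id] = (new_rev, body)
        return new_rev
```

Saving a fresh ID succeeds with rev 1; saving the same ID again without the current rev raises ConflictError, so even an astronomically unlikely collision loses no data.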
Hmm. Unused does kind of give the connotation of things being checked.
Perhaps a special caveat should be presented.
Also, as a random aside, why does _uuids require a POST?
> Also, as a random aside, why does _uuids require a POST?
I guess it's to comply with REST as it returns a different output each
time it is called.
> Also, as a random aside, why does _uuids require a POST?
Well, for one thing you definitely don't want that request cached :-)
Touché
--
> Hmmm, while I knew the chances of a collision were slim, after
> reading the wikipedia page on UUIDs and the probability of a
> collision, I don't even think the question is worth asking :)
>
Heh, it's worth asking (it means you are doing your job), but the
answer for all practical purposes is "it doesn't matter". The
documentation as stated is incorrect (for one thing, the UUID
generation request is independent of a database), and performing such
a check would be needlessly expensive, given the extremely low
likelihood of collision. And of course if there is a collision, it's
automatically detected and no data is lost.
-Damien
> The documentation as stated is incorrect
Fixed.
Cheers
Jan
--
I don't see a problem with GET'ing a set of UUIDs. It doesn't change
the server's state in any way and doing another GET continues to have
no effect.
It's a time-dependent resource, that's all.
- Matt
In the normal case you would POST a document to a collection when you
want the server to choose the final URL. However, intermediaries have
a habit of retrying POSTs randomly, so when you POST an id-less Couch
document, occasionally duplicate documents are created. We work around
this by recommending PUT as the document creation method. Of course
clients can specify any document id they'd like to, but for
lightweight clients CouchDB provides the _uuids service.
The POST is pragmatic for cache-control reasons, but also RESTy,
because it exposes the service that CouchDB uses internally for
directing document POSTs to new ids. By using the _uuids service,
clients can become the part of CouchDB that would direct documents to
URLs in a collection.
--
Chris Anderson
http://jchris.mfdz.com
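The flow Chris recommends can be sketched with Python's standard library. The localhost URL and database name are assumptions, and create_doc expects a CouchDB instance to actually be running:

```python
import json
import urllib.request

COUCH = "http://localhost:5984"  # assumed local CouchDB; adjust as needed

def put_request(db, doc_id, doc):
    """Build the PUT that creates a document at its final, client-chosen URL."""
    return urllib.request.Request(
        "%s/%s/%s" % (COUCH, db, doc_id),
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

def create_doc(db, doc):
    """POST to _uuids for a fresh id, then PUT the document there.
    A retried PUT is harmless: the second attempt just gets a conflict."""
    uuid_req = urllib.request.Request(COUCH + "/_uuids", data=b"", method="POST")
    with urllib.request.urlopen(uuid_req) as resp:
        doc_id = json.loads(resp.read())["uuids"][0]
    with urllib.request.urlopen(put_request(db, doc_id, doc)) as resp:
        return json.loads(resp.read())
```

Because the client picks the URL before sending any document data, an intermediary that retries the PUT can at worst produce a conflict error, never a duplicate document.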
I don't agree and I think it should change to GET.
* You hint that it is to mirror the process required for the creation of
documents. This is not how we should be designing the interface. UUID
creation is totally disjoint and should be considered separately.
* You mention cache-control, but nothing about GET semantics implies
cacheability so unless there is some major flaw with common UA
implementations I don't see this as a valid argument.
I would suggest you open a bug for this issue Matt.
--
Noah Slater, http://tumbolia.org/nslater
GET is meant to be idempotent - http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html
Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787
Hi, I'd like to do $THING. I know that $SOLUTION_A and $SOLUTION_B
will do it very easily and for a very reasonable price, but I don't
want to use $SOLUTION_A or $SOLUTION_B because $VAGUE_REASON and
$CONTRADICTORY_REASON. Instead, I'd like your under-informed ideas on
how to achieve my $POORLY_CONCEIVED_AMBITIONS using Linux, duct tape,
an iPod, and hours and hours of my precious time.
-- Slashdot response to an enquiry
So? How does the idempotency of GET affect the UUID service?
After researching this in more depth it turns out I was indeed
mistaken in thinking *any* responses to GET requests can potentially
be cached. Quoting from RFC 2616 [1]:
"The response to a GET request is cacheable if and only if it meets
the requirements for HTTP caching described in section 13."
And section 13 [2] goes on to say that operations are transparent by
default, and transparency can be relaxed:
  - only by an explicit protocol-level request when relaxed by client or origin server
  - only with an explicit warning to the end user when relaxed by cache or client
So as you say, unless common implementations have flaws, I think GET
responses would not be cached unless we explicitly say so, assuming we
handle the HEAD + If-Modified-Since etc. requests correctly (which is
what a conforming cache proxy should do).
[1]: http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html#sec9.3
[2]: http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html
> GET is meant to be idempotent - http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html
The nature of UUIDs means that allocation doesn't change the
server state, so that's irrelevant.
Any situation where caching of GET requests is an issue is going to
have very serious problems getting stale documents where a rev isn't
supplied as a query parameter.
So, I agree with Noah.
Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787
Nothing is really work unless you would rather be doing something else.
-- J. M. Barrie
Quoting from the RFC: A sequence that never has side effects is
idempotent, by definition (provided that no concurrent operations are
being executed on the same set of resources).
Hence the UUID service is idempotent, as it has no side effects.
My point exactly, glad we're all in agreement! Heh heh.
"The important distinction here is that the user did not request the
side-effects, so therefore cannot be held accountable for them."
It's the equivalent of saying "show me the state of document #3", and
then making the same request an hour later, if you ask me.
> On Sat, Jan 03, 2009 at 11:58:59PM +1030, Antony Blakey wrote:
>> GET is meant to be idempotent -
>> http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html.
>
> So? How does the idempotency of GET affect the UUID service?
In an inverse way to how I first thought. In an ideal world GET would
carry a payload so that POST vs. GET could be decided on the basis of
idempotency, rather than by the limitation on the size of query
parameters.
Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787
Did you hear about the Buddhist who refused Novocain during a root
canal?
His goal: transcend dental medication.
You lost me with this. Are we still talking about CouchDB here or are we on an
(interesting but unrelated) tangent? What has the limit of query strings to do
with the decision to use POST vs. GET? Are you talking about the situation where
you want to pass in data to a GET request that would be too long (in practice,
the RFC doesn't specify a limit) for a query string? My rule of thumb would be
that if you find yourself in that situation, you're doing something that could
probably be simplified.
> You lost me with this. Are we still talking about CouchDB here or are
> we on an (interesting but unrelated) tangent?
A tangent, but the issue does impact Couch ...
> What has the limit of query strings to do with the decision to use
> POST vs. GET? Are you talking about the situation where you want to
> pass in data to a GET request that would be too long (in practice,
> the RFC doesn't specify a limit) for a query string? My rule of thumb
> would be that if you find yourself in that situation, you're doing
> something that could probably be simplified.
The multikey GET in Couch should be a GET, but it can't be unless you
want your API to be limited by the (practical) limitation on URL length.
From an API perspective, I think POST and GET mix up idempotency with
the ability to have a payload or not, which in practical terms results
in people using POST when they should use GET, because those two
issues, whilst theoretically orthogonal, are not implemented that way.
Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787
Borrow money from pessimists - they don't expect it back.
-- Steven Wright
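The multikey fetch Antony refers to can be sketched as follows (the localhost URL is an assumption; the same {"keys": [...]} body works against view URLs as well as _all_docs):

```python
import json
import urllib.request

COUCH = "http://localhost:5984"  # assumed local CouchDB; adjust as needed

def multikey_request(db, keys):
    """Build the multikey fetch as a POST to _all_docs: the keys travel in
    the request body, so the call isn't capped by practical URL-length
    limits the way a GET with keys in the query string would be."""
    return urllib.request.Request(
        "%s/%s/_all_docs?include_docs=true" % (COUCH, db),
        data=json.dumps({"keys": keys}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

This is exactly the theoretically-GET-shaped, payload-carrying request the message above is describing: idempotent in effect, but forced into a POST by the lack of a GET body.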
Continuing the tangent...
Someone should invent an extension to HTTP whereby a client may issue
multiple GET requests at once at the beginning of a single TCP
connection. These resources may take time to generate, but are
amenable to parallelisation of some kind thus making it advantageous
to do this. This is a bit like KeepAlive, except that you can request
e.g. multiple CouchDB keys right at the beginning for maximum
performance, rather than serially.
This is where I hope someone will pipe up and say this already
exists :-)
Done, and thanks for the interesting discussion too :).
- Matt
Aha, http://en.wikipedia.org/wiki/HTTP_pipelining is what I am
thinking of.
Although it kind of works as a potential alternative to the multikey
POST, it won't return rows in any particular order, whereas the
current multikey POST returns rows in the same order as the keys
specified. There will also be slightly more overhead in doing
pipelining, as each request would have to be handled separately on the
server-side. In any case, most Web browsers either don't support
pipelining or they have it turned off by default, so writing AJAX apps
using this would be a no-no.
As CouchDB is HTTP/1.1-compliant, it should support pipelining out-of-
the-box.
Danger, Will Robinson!
> whereby a client may issue multiple GET requests at once at the beginning of a
> single TCP connection.
...
> This is where I hope someone will pipe up and say this already exists :-)
Why not just thread your client code? :)
That doesn't really solve the issue, which is that making separate
HTTP requests is expensive for people on dialup :-)
Adam
Sent from my iPhone