When should include_docs be used?


Tito Ciuro

Jun 16, 2014, 12:35:32 AM
to couc...@googlegroups.com
Hi,

I've been using CouchDB for a while and now I'm evaluating Couchbase. I'm wondering what the best way is to decide when to emit data vs. null. I typically avoid emitting the whole document if it's too "large" (i.e. 1 MB or so) because the index would grow way too much. In this case, I tend to emit null and then collect the documents via include_docs. However, if the data set is small (or all I need is a subset of the document), then I emit this subset, as it's faster and puts less strain on the storage system. There is also the potential for a race condition. As per CouchDB's documentation (http://wiki.apache.org/couchdb/HTTP%5Fview%5FAPI):

The include_docs option will include the associated document. However, the user should keep in mind that there is a race condition when using this option. It is possible that between reading the view data and fetching the corresponding document that the document has changed. If you want to alleviate such concerns you should emit an object with a _rev attribute as in emit(key, {"_rev": doc._rev}). This alleviates the race condition but leaves the possibility that the returned document has been deleted (in which case, it includes the "_deleted": true attribute). Note: include_docs will cause a single document lookup per returned view result row. This adds significant strain on the storage system if you are under high load or return a lot of rows per request. If you are concerned about this, you can emit the full doc in each row; this will increase view index time and space requirements, but will make view reads optimally fast.
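To make the trade-off concrete, here is a minimal sketch of the three emit styles mentioned so far: emitting null, emitting only `_rev`, and emitting a subset of fields. The `emit` function below is a local stand-in so the snippet runs on its own (in a real view the server provides it), and the `type`, `title`, and `body` fields are made up for illustration.

```javascript
// Stand-in for the view engine: collects whatever the map functions emit.
// In CouchDB/Couchbase, emit() is supplied by the server, not by user code.
const rows = [];
function emit(key, value) {
  rows.push({ key, value });
}

// Style A: emit null and rely on include_docs (small index, one lookup per row).
function mapNull(doc) {
  emit(doc.type, null);
}

// Style B: emit only the revision, so a reader can tell if the document changed.
function mapRev(doc) {
  emit(doc.type, { _rev: doc._rev });
}

// Style C: emit just the subset of fields the query needs.
function mapSubset(doc) {
  emit(doc.type, { title: doc.title });
}

// Hypothetical sample document.
const doc = { _id: "post-1", _rev: "1-abc", type: "post", title: "Hello", body: "long text" };
[mapNull, mapRev, mapSubset].forEach((map) => map(doc));

console.log(rows);
```

Only style C stores document content in the index; styles A and B keep the index small at the cost of a separate document fetch.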

Since Couchbase utilizes memcache, storing and retrieving data is a whole different game: while in general a CouchDB document should not be split into several related documents (it's not an RDBMS!), it seems to be perfectly fine in Couchbase. Because get/set/multiget are cheap operations, it's perfectly feasible to "break" a document into smaller pieces and retrieve them piecemeal. It seems this would be great for memcache because it would allow caching the documents that are used the most. On the other hand, keeping a document "monolithic" not only makes the index larger, but also makes it less efficient to cache (it's an all-or-nothing proposition).

So it seems that a valid approach in Couchbase would be to:

1) break "large" documents into smaller, more manageable ones. Retrieve them via get/multiget (cheap op) and let memcache cache them as efficiently as possible.
2) emit small data subsets as needed, as opposed to the entire document where possible.
3) for those queries where the entire document needs to be retrieved... what then?:

    3.1) should we emit null and include_docs=true?
    3.2) should we emit the entire document instead?
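As a rough sketch of point 1 above, a large document could be stored as several smaller key-value entries and read back piecemeal. Everything here is an assumption for illustration: the `Map` stands in for the key-value store, `setMulti`/`getMulti` are hypothetical helpers rather than SDK calls, and the `id::field` key scheme is just one possible convention.

```javascript
// In-memory stand-in for the key-value store (not a Couchbase client).
const store = new Map();

// Split a document into one entry per top-level field, keyed as "<id>::<field>".
function splitDocument(id, doc) {
  const parts = {};
  for (const [field, value] of Object.entries(doc)) {
    parts[`${id}::${field}`] = value;
  }
  return parts;
}

// Hypothetical bulk helpers mirroring set/multiget semantics.
function setMulti(parts) {
  for (const [key, value] of Object.entries(parts)) store.set(key, value);
}
function getMulti(keys) {
  return keys.map((key) => store.get(key));
}

// Hypothetical "large" document broken into three pieces.
setMulti(
  splitDocument("article::42", {
    meta: { title: "Intro" },
    body: "long text",
    comments: ["first", "second"],
  })
);

// Fetch only the piece that is needed, leaving the rest untouched.
const [meta] = getMulti(["article::42::meta"]);
console.log(meta);
```

With this layout, a frequently read piece such as `meta` can stay cached independently of the rarely read ones.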

It's clear that always emitting null in CouchDB puts a lot of pressure on the storage system. But what about Couchbase? Are there any best practices to be followed?

Thanks,

-- Tito

Volker Mische

Jun 16, 2014, 3:53:08 AM
to couc...@googlegroups.com
Hi Tito,
The Couchbase implementation of include_docs is different. If you use
an SDK, it queries the view to get all the IDs and then fetches the
full docs via a memcache GET. In the upcoming version of Couchbase (3.0)
the original include_docs of the views will go away completely, and it
will only be supported through the SDKs (don't worry, the API won't
change when you use the SDKs).
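The two-step behaviour described above can be sketched roughly as follows. This is not the Couchbase SDK API: `queryView`, `getMulti`, and `queryWithDocs` are hypothetical names, and the `Map` stands in for the key-value store. The third row simulates a document that disappeared between indexing and fetching.

```javascript
// In-memory stand-in for the key-value store (not a Couchbase client).
const store = new Map([
  ["user::1", { name: "Ann" }],
  ["user::2", { name: "Bob" }],
]);

// Hypothetical view query: each row carries the document ID, the emitted key,
// and a null value (the map function emitted null).
function queryView() {
  return [
    { id: "user::1", key: "Ann", value: null },
    { id: "user::2", key: "Bob", value: null },
    { id: "user::3", key: "Cy", value: null }, // no longer in the store
  ];
}

// Hypothetical bulk GET: returns a Map of id -> document for the IDs found.
function getMulti(ids) {
  const found = new Map();
  for (const id of ids) {
    if (store.has(id)) found.set(id, store.get(id));
  }
  return found;
}

// Step 1: get the IDs from the view. Step 2: fetch the full documents by ID.
function queryWithDocs() {
  const rows = queryView();
  const docs = getMulti(rows.map((row) => row.id));
  return rows.map((row) => ({ ...row, doc: docs.get(row.id) ?? null }));
}

const result = queryWithDocs();
console.log(result);
```

Because the documents are fetched after the view is read, a caller should be prepared for a row whose document is missing or newer than the index entry.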

> Since Couchbase utilizes memcache, storing and retrieving data is a
> whole different game: while in general a CouchDB document should not be
> split and related into other documents (it's not a RDBMS!), it seems to
> be perfectly fine in Couchbase. Because get/set/multiget are cheap
> operations, it's perfectly feasible to "break" a document into smaller
> pieces and retrieve them piecemeal. It seems this would be great for
> memcache because it'd allow to cache the documents that are used the
> most. On the other hand, keeping a document "monolithic" not only makes
> the index larger, but it makes it less efficient to cache (it's an all
> or nothing proposition.)
>
> So it seems that a valid approach in Couchbase would be to:
>
> 1) break "large" documents into smaller, more manageable ones. Retrieve
> them via get/multiget (cheap op) and let memcache cache them as
> efficiently as possible.
> 2) emit small data subsets as needed, as opposed to the entire document
> where possible.
> 3) for those queries where the entire document needs to be retrieved...
> what then?:
>
> 3.1) should we emit null and include_docs=true?
> 3.2) should we emit the entire document instead?

You would emit null and let the SDK do the rest.

> It's clear that always emitting null in CouchDB puts a lot of pressure
> on the storage system. But what about Couchbase? Are there any best
> practices to be followed?

Do you mean "emitting the full document ..."?

Cheers,
Volker

Tito Ciuro

Jun 16, 2014, 11:29:06 AM
to couc...@googlegroups.com
Hi Volker,

> Do you mean "emitting the full document ..."?

You already answered my question: by relying on the SDK, the full documents will be retrieved by using the IDs obtained through the specified view.

Thanks for the help!

Regards,

-- Tito
