Mass updates

Charles S. Koppelman-Milstein

May 8, 2013, 11:24:37 PM
to us...@couchdb.apache.org
I am trying to understand whether Couch is the way to go to meet some of
my organization's needs. It seems pretty terrific.
The main concern I have is maintaining a consistent state across code
releases. Presumably, our data model will change over the course of
time, and when it does, we need to make the several million old
documents conform to the new model.

Although I would love to pipe a view through an update handler and call
it a day, I don't believe that option exists. The two ways I
understand to do this are:

1. Query all documents, update each doc client-side, and PUT those
changes in the _bulk_docs API (presumably this should be done in batches)
2. Query the ids for all docs, and one at a time, PUT them through an
update handler

Are these options reasonably performant? If we have to do a mass-update
once a deployment, it's not terrible if it's not lightning-speed, but it
shouldn't take terribly long. Also, I have read that update handlers
have indexes built against them. If this is a fire-once option, is that
worthwhile?

Which option is better? Is there an even better way?

Thanks,
Charles

James Marca

May 9, 2013, 1:18:12 AM
to us...@couchdb.apache.org
On Wed, May 08, 2013 at 11:24:37PM -0400, Charles S. Koppelman-Milstein wrote:
> I am trying to understand whether Couch is the way to go to meet some of
> my organization's needs. It seems pretty terrific.
> The main concern I have is maintaining a consistent state across code
> releases. Presumably, our data model will change over the course of
> time, and when it does, we need to make the several million old
> documents conform to the new model.
>
> Although I would love to pipe a view through an update handler and call
> it a day, I don't believe that option exists. The two ways I
> understand to do this are:
>
> 1. Query all documents, update each doc client-side, and PUT those
> changes in the _bulk_docs API (presumably this should be done in batches)
> 2. Query the ids for all docs, and one at a time, PUT them through an
> update handler

I don't see much difference between those two options. What I would do
is something like the following, in node or perl or java or whatever
you like. (I have node.js code that does something very similar, so I
am cutting and pasting small pieces below.)

(I apologize if this rambles and ends up less helpful than nothing.)

Pick a batch size; start with 100 or so and ramp up. Depending on how
big your documents are, asking for too many at once could be a RAM issue.

var batchsize = 100
var querysize = batchsize + 1 // (I borrowed this trick from an old posting by jchris, I think)
var query = {limit: querysize
            ,include_docs: true
            }
var state = get_state_from_couchdb() // use couchdb to store progress
if (state) query.startkey = state // resume from the last saved id, if any

function get_docs (query, callback){ // generic boilerplate to send
                                     // a GET request to couchdb
}
get_docs(query, function(err, resp){
    if (err) throw err
    var rows = resp.rows
    if (rows.length === querysize) {
        // pop the plus-1 row just to get its id: the next batch starts there
        var next_row = rows.pop()
        // save it to couchdb for some other process to use
        save_state_to_couchdb(next_row.id)
    }
    process_rows(rows)
})

Then as you said, when you are done you can use _bulk_docs to put the
new docs back into the old database, or probably better, write them to
a new database, so that you can keep your old database pristine in
case you break something along the way and want to start over.
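
A minimal sketch of the write-back step described above, in JavaScript for Node. Everything named here is an assumption for illustration: the `migrateDoc` change (renaming `name` to `title`), the `schema_version` field, the batch helper, and the use of the global `fetch` that ships with Node 18 or later. When the target is a fresh database the old `_rev` is dropped, since CouchDB would normally reject a revision it has never seen as a conflict.

```javascript
// Hypothetical migration: rename `name` to `title` and stamp a schema version.
function migrateDoc(doc) {
  const { name, ...rest } = doc;
  return { ...rest, title: name, schema_version: 2 };
}

// Split an array into batches of at most `size` items.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Build the JSON body for POST /{db}/_bulk_docs.
// intoNewDb: true  -> strip _rev so the docs are created fresh in the new db
// intoNewDb: false -> keep _rev so the docs are updated in place
function bulkDocsBody(docs, { intoNewDb }) {
  return {
    docs: docs.map((doc) => {
      const migrated = migrateDoc(doc);
      if (intoNewDb) delete migrated._rev;
      return migrated;
    }),
  };
}

// Send one batch. Not called here; baseUrl and db are placeholders.
async function postBatch(baseUrl, db, body) {
  const res = await fetch(`${baseUrl}/${db}/_bulk_docs`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  return res.json(); // one result per doc; check each entry for an error
}
```

A caller would do something like `for (const batch of chunk(docs, 100)) await postBatch(url, "newdb", bulkDocsBody(batch, { intoNewDb: true }))`, and inspect the per-document results, because `_bulk_docs` can succeed for some docs and fail for others in the same request.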

This is slow. No way around that. Even if you could use some sort of
view to update your db in place, that would be dangerous, and it would
still be slow.

If the processing time is high, you can speed things up by running one
or two threads using the "couchdb as state machine" trick, but the doc
updating itself will probably be super quick and the limiter will be
disk I/O, so one thread is safest. I'd still use the state-machine
trick so you can stop and restart without pulling your hair out.

And keep in mind that once you update each doc, then all of your views
will need to get rebuilt against the entire db...there is no way for
the view to know that your change was trivial, etc. Another reason to
keep the old db in place until your new version's views have all been
rebuilt.

Another option is to ignore the bulk update, and just store a version
tag in the documents. If the document is version 1.1, and it should
be 4.3, then you know you have to update that document before you do
anything crazy, but it may be that you don't need to do anything...it
is application specific. If I'm doing traffic counts and version 1.1
of a doc has fields a,b and c, and 1.2 has a new field 'd', I can't go
back and collect 'd' from the older counts, so I don't bother changing
the old docs. Instead, if a view needs that 'd' field, then I make
sure the version check for 1.2 passes inside of that view.
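
A rough sketch of the version check inside a view, assuming each document carries a `version` string such as "1.1" or "1.2" and that the new field is literally called `d` (both names are taken from the example above, the rest is made up). In CouchDB the `emit` function is supplied by the view server; here a stand-in is defined so the snippet can be run on its own.

```javascript
// Stand-in for the emit() that CouchDB provides to map functions.
const emittedRows = [];
function emit(key, value) {
  emittedRows.push({ key: key, value: value });
}

// Map function as it might be stored in a design document.
// Only documents at version 1.2 or later are indexed, because
// older documents never recorded field `d`.
function countsWithD(doc) {
  if (typeof doc.version !== "string") return;
  var parts = doc.version.split(".").map(Number);
  var atLeast12 = parts[0] > 1 || (parts[0] === 1 && parts[1] >= 2);
  if (atLeast12 && doc.d !== undefined) {
    emit(doc._id, doc.d);
  }
}
```

Old documents stay untouched in the database; the view simply skips them.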

Hope that helps with your decision.

James Marca

Paul Davis

May 9, 2013, 1:41:22 AM
to us...@couchdb.apache.org
On Wed, May 8, 2013 at 10:24 PM, Charles S. Koppelman-Milstein
<cko...@alumni.gwu.edu> wrote:
> I am trying to understand whether Couch is the way to go to meet some of
> my organization's needs. It seems pretty terrific.
> The main concern I have is maintaining a consistent state across code
> releases. Presumably, our data model will change over the course of
> time, and when it does, we need to make the several million old
> documents conform to the new model.
>
> Although I would love to pipe a view through an update handler and call
> it a day, I don't believe that option exists. The two ways I
> understand to do this are:
>
> 1. Query all documents, update each doc client-side, and PUT those
> changes in the _bulk_docs API (presumably this should be done in batches)
> 2. Query the ids for all docs, and one at a time, PUT them through an
> update handler
>

You are correct that there's no server-side way to do a migration like
the one you're asking for.

The general pattern for these things is to write a view that only
includes the documents that need to be changed, and then write
something that goes through and processes each doc in the view into
the desired form (which removes it from the view). This way you can
easily know when you're done working. It's also definitely possible to
write something that stores state and/or just brute-forces a db scan
each time you run the migration.
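
One way this pattern might look, as a sketch only. The design document name, the `schema_version` field and the target version 2 are assumptions, and the two I/O functions are injected so the loop itself stays independent of any HTTP client: `fetchPending(limit)` would query the view with `include_docs=true`, and `saveBatch(docs)` would post to `_bulk_docs`.

```javascript
// Hypothetical design document: the view lists only docs still on the old schema.
const designDoc = {
  _id: "_design/migration",
  views: {
    pending: {
      map: "function (doc) { if ((doc.schema_version || 1) < 2) emit(doc._id, null); }",
    },
  },
};

// Keep pulling batches from the "pending" view until it comes back empty.
// `migrate` must produce docs that no longer match the view, otherwise
// this loop would never finish.
async function drainView(fetchPending, migrate, saveBatch, batchSize = 100) {
  let total = 0;
  for (;;) {
    const docs = await fetchPending(batchSize);
    if (docs.length === 0) return total; // empty view: migration is done
    await saveBatch(docs.map(migrate));
    total += docs.length;
  }
}
```

Because finished documents drop out of the view, the process can be stopped and restarted at any point without separate bookkeeping.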

Performance-wise, your first suggestion would probably be the most
performant. Depending on document sizes and latencies, it may be
possible to get better numbers using an update handler, but I doubt it
unless you have huge docs and a slow, high-latency connection.

> Are these options reasonably performant? If we have to do a mass-update
> once a deployment, it's not terrible if it's not lightning-speed, but it
> shouldn't take terribly long. Also, I have read that update handlers
> have indexes built against them. If this is a fire-once option, is that
> worthwhile?
>

I'm not sure what you mean by update handlers having indexes built
against them. That doesn't match anything that currently exists in
CouchDB.

> Which option is better? Is there an even better way?
>

There's nothing better than the general ideas you listed.

> Thanks,
> Charles

Andrey Kuprianov

May 9, 2013, 7:16:42 AM
to us...@couchdb.apache.org
Rebuilding the views mentioned by James is hell! And the more docs and
views you have, the longer your views will take to catch up with the
updates. We don't have the best of servers, but ours (dedicated) took
several hours to rebuild our views (and there aren't many of them) after
we inserted ~150k documents (we use full text search with Lucene as well,
so it also contributed to the overall server slowdown).

So my suggestion is:

1. Once you want to migrate your stuff, make a copy of your db.
2. Do the migration on the copy.
3. Allow the views to rebuild (you need to query one view in each design
document once to trigger the views to start catching up with the
updates). You'd probably ask whether it is possible to limit Couch's
resource usage while views are rebuilding, but I don't have an answer to
that question. Maybe someone else can help here...
4. Switch the database pointer from one DB to the other.
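
For step 3, a small sketch of how the trigger URLs could be built. It assumes the design documents have already been fetched (for example from `_all_docs` with a `_design/` key range) and picks the first view of each; `limit=0` is used so the request returns no rows while still making CouchDB bring the index up to date. The base URL and database name are placeholders.

```javascript
// Return one view URL per design document; requesting each URL once
// should be enough to start the index build for that design document.
function warmupUrls(baseUrl, db, designDocs) {
  return designDocs
    .filter((ddoc) => ddoc.views && Object.keys(ddoc.views).length > 0)
    .map((ddoc) => {
      const name = ddoc._id.replace(/^_design\//, "");
      const view = Object.keys(ddoc.views)[0];
      return (
        `${baseUrl}/${db}/_design/${encodeURIComponent(name)}` +
        `/_view/${encodeURIComponent(view)}?limit=0`
      );
    });
}
```

Each URL can then be requested with any HTTP client, one design document at a time if the server is short on resources.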

Robert Newson

May 9, 2013, 7:18:54 AM
to us...@couchdb.apache.org

Andrey Kuprianov

May 9, 2013, 8:16:42 AM
to us...@couchdb.apache.org
Regarding cpu usage limiting. I've just tried cpulimit and it works great.

http://superuser.com/questions/442970/limit-a-processes-cpu-usage-methods

Lance Carlson

May 9, 2013, 8:31:38 AM
to us...@couchdb.apache.org
This is a very common use case. I've been banging my head against a
wall with it a bit too. I think my ideal setup would be to stream all
of my relevant docs into Redis (key is the ID of the document, value
is some JSON blob). A million docs should only use 150MB or so if they
are average-sized docs. Then grab the updated data source, update the
docs that need updating, attach a _deleted flag to the docs that
aren't in the new data set anymore, and create new keys for new docs.
(I always try to come up with an ID naming convention for my docs. If
your new docs don't require IDs and you just want couch to generate
them, it might be a good idea to just make a Redis key that is
prefixed with "new" and a UUID.) Then run another batch script that
collects some number of documents at a time and bulk saves them back
into couch.
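
The reconciliation step could be sketched like this. A plain `Map` stands in for Redis here, and the function name and argument shapes are invented for the example. For simplicity it rewrites every doc that is still present instead of checking which ones actually changed.

```javascript
// existing: Map of _id -> current CouchDB doc (including _rev)
// incoming: Map of _id -> fresh data from the new data source (no _rev)
// Returns an array suitable for the "docs" field of a _bulk_docs request.
function planBulkSave(existing, incoming) {
  const docs = [];
  for (const [id, doc] of existing) {
    if (incoming.has(id)) {
      // Still present: overwrite the body, keep the current revision.
      docs.push({ ...incoming.get(id), _id: id, _rev: doc._rev });
    } else {
      // Gone from the new data set: mark as deleted.
      docs.push({ _id: id, _rev: doc._rev, _deleted: true });
    }
  }
  for (const [id, data] of incoming) {
    // Brand new: no _rev, CouchDB creates the first revision.
    if (!existing.has(id)) docs.push({ ...data, _id: id });
  }
  return docs;
}
```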

Perhaps your use case doesn't require bulk deleting like my case, but
when the proposal to start creating new databases came up, I figured
I'd include my alternative method since I've gone down that path
before too and it can be a pain in the arse to have to track what
database is the most up to date.

Sent from my iPhone

svilen

May 9, 2013, 8:52:52 AM
to us...@couchdb.apache.org
> >>>> The main concern I have is maintaining a consistent state across
> >>>> code releases. Presumably, our data model will change over the
> >>>> course of time, and when it does, we need to make the several
> >>>> million old documents conform to the new model.

a question: do u need to keep the old variant/state too?
(think bi-temporal stuff)

that one, and if the once-off conversion process is going to take
loooong time, it might be better to bite the bullet and allow for
multiple (e.g. 2) versions of the "schema" to coexist - in both
server/views and client code. Thus once u already have some doc in the
new variant, use it. Else, fallback to runtime-conversion.
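
A sketch of what the runtime-conversion fallback might look like in client code, assuming a numeric `schema_version` field and one small upgrade function per version step. The concrete changes (adding `d`, renaming `name` to `title`) are invented for the example.

```javascript
const CURRENT_VERSION = 3;

// One upgrader per step: upgraders[n] turns a version-n doc into version n+1.
const upgraders = {
  1: (doc) => ({ ...doc, d: null, schema_version: 2 }),
  2: (doc) => {
    const { name, ...rest } = doc;
    return { ...rest, title: name, schema_version: 3 };
  },
};

// Return the doc in the current schema, converting on the fly if needed.
// Docs without a schema_version are treated as version 1.
function upgradeOnRead(doc) {
  let out = doc;
  let version = out.schema_version || 1;
  while (version < CURRENT_VERSION) {
    const step = upgraders[version];
    if (!step) throw new Error("no upgrader for schema version " + version);
    out = step(out);
    version = out.schema_version;
  }
  return out;
}
```

The application can write the upgraded doc back whenever it saves, so the database converges on the new variant over time without a one-off conversion run.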

i know it isn't easy to organise but otherwise u're fighting reality.

svilen

Wendall Cada

May 9, 2013, 4:07:50 PM
to us...@couchdb.apache.org
This may sound ideal, but in my experience, this can lead to crazy,
buggy, eventually un-maintainable code. Additionally, this approach ends
up quickly in a situation where you're not supporting just two schemas,
but multiple others as well, as you can never guarantee that the schema
was updated across the entire database.

This has been discussed at my workplace quite extensively recently.
Ideally, there would be some robust extension to couchdb to handle
schema changes for large online databases in a sane way.

The real issue here isn't updating the schema in the db docs, it's
updating the schema and the requisite view indexes. If the schema
changes break your ddoc code, there isn't any live db solution available
that can fix this issue. This part of any database maintenance is
really, really hard work. It's what DBAs do for a living, and is just
difficult in different ways depending on the database.

Wendall

Jim Klo

May 9, 2013, 4:28:36 PM
to <user@couchdb.apache.org>, us...@couchdb.apache.org
Could you not use a VDU function that fixes the structure in a target DB and then replicate from old to new DB?

- JK

Sent from my iPhone

svilen

May 9, 2013, 4:32:15 PM
to us...@couchdb.apache.org
yeah i know. i've been through it, needing temporally-versioned
code stored in a db, apart from the data for that time-period.. i know
what it is. Near to impossible - still, it's the only correct one.

> Ideally, there would be some robust extension to couchdb to handle
> schema changes for large online databases in a sane way.
btw i read that something like virtual views is going to be
made.. one that would help multi-pass queries (which now are made just
via artificial temporary dbs). Maybe these multi-version schemas are
kind of that..

Lance Carlson

May 10, 2013, 12:53:47 PM
to us...@couchdb.apache.org
So, since I'm dealing with this problem now.. I open sourced a small script
that helps at least get couch to Redis, and a brain dump of my ideas is
included in the README. Feedback appreciated!

https://github.com/lancecarlson/couchout.go

Noah Slater

May 10, 2013, 1:02:03 PM
to us...@couchdb.apache.org
Looks great Lance!
--
NS

Benoit Chesneau

May 11, 2013, 1:02:08 AM
to us...@couchdb.apache.org
On May 9, 2013 1:17 PM, "Andrey Kuprianov" <andrey.k...@gmail.com>
wrote:
>
> Rebuilding the views mentioned by James is hell! And the more docs and
> views you have, the longer your views will have to catch up with the
> updates. We don't have the best of the servers, but ours (dedicated) took
> several hours to rebuild our views (not too many as well) after we
> inserted ~150k documents (we use full text search with Lucene as well,
> so it also contributed to the overall server slowdown).
>
> So my suggestion is:
>
> 1. Once you want to migrate your stuff, make a copy of your db.
> 2. Do migration on the copy
> 3. Allow for views to rebuild (you need to query each design document's
> single view once to trigger for views to start catching up with the
> updates). You'd probably ask, if it was possible to limit resource usage
> of Couch, when views are rebuilding, but i don't have answer to that
> question. Maybe someone else can help here...
> 4. Switch database pointer from one DB to another.
>
>

You don't need to wait until all the docs are there to trigger the view
update. Just trigger it more often, so that view calculation happens on
a smaller set.

You can even make it parallel by using different ddocs.

Benoit Chesneau

May 11, 2013, 1:10:10 AM
to us...@couchdb.apache.org
Since the content of the docs will change, what you really need is to
update the view of them inside your application. So I would handle the
model changes at the view level (and trigger updates async). You could
also handle schema n-1 if needed.

- benoit
On May 9, 2013 5:25 AM, "Charles S. Koppelman-Milstein" <

Andrey Kuprianov

May 11, 2013, 2:27:59 AM
to us...@couchdb.apache.org, us...@couchdb.apache.org
We do that, and we have a cron job to touch a view every 5 min. It's just
that at that particular time we had to insert those 150k in one go (we
were migrating from mysql).

Sent from my iPhone

Lance Carlson

May 13, 2013, 2:24:12 AM
to us...@couchdb.apache.org
Made a lot of updates to my couchout project. It now includes a couchin
project as well. Might create another project for updating, but it's pretty
easy for someone to script a node js script (or any language for that
matter) that connects to redis, decodes and encodes base64.

Lance Carlson

May 13, 2013, 2:24:50 AM
to us...@couchdb.apache.org

James Marca

May 15, 2013, 2:17:54 AM
to us...@couchdb.apache.org
On Mon, May 13, 2013 at 02:24:50AM -0400, Lance Carlson wrote:
> Oops, urls:
>
> https://github.com/lancecarlson/couchin.go
> https://github.com/lancecarlson/couchout.go
>
> Feedback appreciated!
>

I don't understand the use case here, so I'd appreciate an example.
If you can define a view or use all_docs to pull docs from couch and
into redis, why use redis at all? Why not just use couch directly,
load docs into ram, and process them?

I feel like I'm missing something obvious.

Also, I've never stressed Redis much. What happens when you bump up
against ram limits?

James

Lance Carlson

May 15, 2013, 2:26:09 AM
to us...@couchdb.apache.org
I use Redis to stick docs into RAM. Once they're in RAM, I like to use node
to parse the docs in the way I want them, then purge the dataset. Couchout
pulls them into RAM using Redis, couchin bulk_saves back into couchdb from
Redis. I tried to make the couchout/in tools language agnostic.

Anyway, you can certainly use whatever language you want and load all of
the docs into memory. Typically though if you're dealing with a non
statically compiled language, you're going to run into situations where
Redis would be more efficient.

Lance Carlson

May 15, 2013, 2:38:21 AM
to us...@couchdb.apache.org
There are other benefits to having your dataset in Redis rather than RAM
BTW.. for one, it's easier to run multiple processes against your dataset
to split up the work of manipulating the data.

Anyway, I've not really stressed the upper bounds of Redis RAM limits. Our
largest datasets only use a total of 150MB on Redis, so it hasn't been a
big deal yet.