bigcouch crash on bulk insert

ooi

Oct 11, 2012, 3:13:12 PM

to bigcou...@googlegroups.com

I have an application using bigcouch. When it first starts, it tries to initialize a clean DB with ~400 docs (~500 KB) in a single bulk update. This works fine for clusters with N=1, but with N=3 (should be odd, right?) it occasionally succeeds and more often fails, sometimes crashing bigcouch and sometimes just printing lots of nasty errors. The other settings are Q=256, R=2, W=2.

I'm running on clusters of 3 CentOS VMs, each with 3 cores and 6 GB RAM, on bigcouch 1.1.1. While the insert runs, the CPU is pegged at 100% utilization, and running truss on the beam process shows it spending about 98% of its system time in "futex" calls. (This could be a red herring, though: most of the CPU time is user, not system.)

It seems like this should be a pretty trivial bulk update for this much hardware, but I get the same results on our own VMware and on Amazon's EC2, and the same whether the insert comes from our application or is reproduced with curl. Smaller reads and writes to the DB work fine. On the rare occasions when the bulk insert succeeds, the documents are fine and the application proceeds normally (although in the future there will be more bulk operations throughout). The IDs are not sequential (as recommended for performance), but retrying with sequential IDs did not change anything.

Any suggestions? Here are a few details if they help:

http://pastehtml.com/view/celvd4fm8.txt --- JSON to insert if you want to reproduce
http://pastehtml.com/view/celvr44c0.txt --- logfile entries with crash info

Thanks for any help!

Michael Miller

Oct 11, 2012, 4:05:04 PM

to bigcou...@googlegroups.com

Q=256 is likely the culprit. That sounds like overkill for 3 nodes. I'd drop that to q=6 for 3 nodes.
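
For anyone wanting to try the lower Q: a minimal sketch of creating a database with an explicit shard count. It assumes bigcouch accepts `q` and `n` as query parameters on the database-creation PUT (as the CouchDB 2.x-style clustered API does); the host, port, and database name `mydb` are hypothetical, and the helper `create_db_request` is just for illustration. The request is built but not sent.

```python
import urllib.request


def create_db_request(base_url, db_name, q, n=3):
    """Build (but do not send) a PUT that creates a database with q shards.

    Assumption: the server honours ?q= and ?n= at creation time.
    """
    url = f"{base_url}/{db_name}?q={q}&n={n}"
    return urllib.request.Request(url, method="PUT")


# Hypothetical host and database name.
req = create_db_request("http://localhost:5984", "mydb", q=6)
print(req.get_method(), req.full_url)

# To actually create the database, send it:
#   urllib.request.urlopen(req)
```

As far as I know the shard count is chosen when the database is created, so an existing Q=256 database would have to be recreated (or resharded) rather than reconfigured in place.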

Mike Miller

ooi

Oct 11, 2012, 4:54:40 PM

to bigcou...@googlegroups.com

Hmmm.... I didn't suspect Q but figured I'd give it a try anyway -- and the bulk update succeeded just fine with Q=8. And the CPU utilization didn't even register.

Any idea why Q would cause so much trouble? All the configuration recommendations say to overshard. I actually don't know if Q=256 will be enough in the long run: the DB is expected to handle growth on the order of a TB daily, with an operational lifespan of 25 years. That works out to about 9 PB in total, or about 35 TB per shard with Q=256.

But with Q=8 each shard would have to store over 1 PB. This is years out, so probably our future selves won't bat an eye at multi-PB volumes... But I still have the issue of handling requests. With Q=8 and N=3 there are only Q x N = 24 shard copies, so I can have at most 24 bigcouch nodes. This means we really have to have 24 big servers instead of 100s or 1000s of commodity PCs.
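
The arithmetic behind those numbers, as a quick sketch. It assumes "on the order of TB daily" means exactly 1 TB/day, uses decimal units (1 PB = 1000 TB), and ignores replication overhead when computing per-shard size, the same way the figures above do. The function name `shard_plan` is just for illustration.

```python
def shard_plan(q, n=3, tb_per_day=1.0, years=25):
    """Rough sizing for a database with q shards and n copies of each shard."""
    total_tb = tb_per_day * 365 * years
    return {
        "total_pb": total_tb / 1000,   # total data over the lifespan
        "tb_per_shard": total_tb / q,  # data held by one shard
        "shard_copies": q * n,         # upper bound on nodes that can hold data
    }


for q in (8, 256):
    plan = shard_plan(q)
    print(
        f"Q={q}: {plan['total_pb']:.1f} PB total, "
        f"{plan['tb_per_shard']:.1f} TB per shard, "
        f"at most {plan['shard_copies']} nodes"
    )
```

The node limit is just Q x N: each node needs at least one shard copy to do any storage work, so a small Q caps how far the cluster can spread out.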

Thanks, Mike. I'll repost this as a new thread about shards instead of making this one shift gears too much...

Michael Miller

Oct 11, 2012, 5:02:35 PM

to bigcou...@googlegroups.com

You'll need to reshard as you grow. That's manual in bigcouch (automatic in Cloudant). Hop in #cloudant on IRC as you near that PByte mark ;)

-M