Hi,
I've spent days using trial and error to try to figure out why I am
getting very high CPU load on only a single node in my cluster. I'm
hoping someone has an idea of what is going on, because I'm stuck.
Here's my configuration:
1. 2-node cluster:
   1. Each node is located in a different AWS availability zone
   2. Each node is a t2.medium instance (2 CPU cores, 4 GB of memory)
2. A haproxy server load balances traffic to the nodes using round robin
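The relevant part of my haproxy configuration is roughly the following
(the server names and addresses here are just placeholders, not my real
ones):

```
backend couchdb_cluster
    balance roundrobin
    server couch1 10.0.1.10:5984 check
    server couch2 10.0.2.10:5984 check
```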
The problem:
1. After users make changes via PouchDB, a backend runs a number of
routines that use views to calculate notifications (a rough sketch of these
view requests is included after this list). The issue is that on a single
node, the couchjs processes stack up and then start to consume nearly all
the available CPU. That node then becomes the "workhorse" that always does
*all* the heavy-duty couchjs processing until I restart it.
2. It is important to note that both nodes run couchjs processes, but
only one node's couchjs processes are consuming 100% CPU.
3. I've even resorted to setting `os_process_limit = 10`, and this just
results in each couchjs process taking over 10% of the CPU! In other
words, the couchjs processes eat up all the CPU no matter how many of them
there are!
4. The CPU usage eventually clears after all the processing is done, but
as soon as there is more to process, the workhorse node gets bogged down
again.
5. If I restart the workhorse node, the other node then becomes the
workhorse node. This is the only way to get the couchjs processes to "move"
to another node.
6. The problem is that this design is not scalable, as only one node can
be the workhorse at any given time. Moreover, it causes the workhorse
instance to run out of t2 CPU credits. Shouldn't the couchjs processes be
spread out over all my nodes? From what I can tell, if I add more nodes
I'm still going to have the issue where only one of them gets bogged down.
Is it possible that the problem is that I have 2 nodes and really need at
least 3? (I know a 2-node cluster is not very typical.)
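For context, the backend routines essentially just query views over HTTP,
something like the sketch below (the database, design document, and view
names are made up for illustration); the first of these queries after a
batch of PouchDB writes is what kicks off the couchjs indexing:

```
import requests  # Python sketch; the names below are placeholders

COUCH = "http://my-haproxy:5984"   # all traffic goes through haproxy
AUTH = ("admin", "password")

# After users sync changes via PouchDB, the backend queries views like
# this one to work out which notifications to send. The first query after
# new writes forces the view index to update, which is when the couchjs
# processes pile up.
resp = requests.get(
    COUCH + "/app_db/_design/notifications/_view/pending_by_user",
    params={"include_docs": "true"},
    auth=AUTH,
)
resp.raise_for_status()
for row in resp.json()["rows"]:
    pass  # the notification calculation happens here
```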
Things I've checked:
1. Ensured that the load balancing is working, i.e., haproxy is indeed
distributing requests to both nodes
2. I've tried setting `os_process_limit = 10` and `os_process_soft_limit
= 5` to see if I could force more conservative use of couchjs processes,
but the couchjs processes still consume all the available CPU.
3. I've tried simulating the issue locally with VMs and I cannot
reproduce any such load. My guess is that this is because the nodes are
located on the same box, so the hop distance between nodes is very small
and this somehow keeps the CPU usage to a minimum.
4. I've tried isolating the issue with short code snippets that
intentionally spawn a lot of couchjs processes (a rough version of one is
included after this list); the processes are spawned but don't consume
100% CPU.
5. I've tried rolling back from CouchDB 2.1.1 to CouchDB 2.0, and this
doesn't seem to change anything.
6. The only error entries in my CouchDB logs are like the following and
I don't believe they are related to my issue:

```
[error] 2017-12-04T18:13:38.728970Z cou...@172.31.83.32 <0.13974.79> 4b0b21c664
rexi_server: from: cou...@172.31.83.32(<0.20638.79>) mfa: fabric_rpc:open_shard/2
throw:{forbidden,<<"You are not allowed to access this db.">>}
[{couch_db,open,2,[{file,"src/couch_db.erl"},{line,185}]},{fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,267}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
```
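Regarding point 4 above, the snippets I used to try to force lots of
couchjs processes were roughly like this (again, the database and view
names are placeholders); the processes do get spawned, but they never pin
the CPU the way the real workload does:

```
import threading

import requests  # Python sketch; the names below are placeholders

COUCH = "http://my-haproxy:5984"
AUTH = ("admin", "password")
VIEWS = [
    "/app_db/_design/notifications/_view/pending_by_user",
    "/app_db/_design/notifications/_view/pending_by_date",
]

def hammer(path, count=50):
    # Repeatedly query a view so CouchDB has to keep couchjs busy.
    for _ in range(count):
        requests.get(COUCH + path, auth=AUTH).raise_for_status()

# Several concurrent workers per view to try to force many couchjs processes.
threads = [threading.Thread(target=hammer, args=(v,)) for v in VIEWS * 5]
for t in threads:
    t.start()
for t in threads:
    t.join()
```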
Does CouchDB have some built-in logic that spawns a number of couchjs
processes on a "primary" node? Will future view processing then always be
routed to this "primary" node?
Is there a way to better distribute these heavy-duty couchjs processes? Is
it possible to limit their CPU consumption? (I'm hesitant to start down the
path of using something like cpulimit, as I think there is a root problem
that needs to be addressed.)
I'm running out of ideas and hope that someone has some notion of what is
causing this bizarre load, or can tell me whether this is a bug in CouchDB.
Thank you for any help you can provide!
Geoff