100% CPU on only a single node because of couchjs processes


Geoffrey Cox

unread,
Dec 4, 2017, 1:46:41 PM
to user
Hi,

I've spent days using trial and error to try and figure out why I am
getting a very high CPU load on only a single node in my cluster. I'm
hoping someone has an idea of what is going on as I'm getting stuck.

Here's my configuration:

1. 2-node cluster:
   1. Each node is located in a different AWS availability zone
   2. Each node is a t2.medium instance (2 CPU cores, 4 GB memory)
2. An haproxy server load balances traffic to the nodes using round robin

The problem:

1. After users make changes via PouchDB, a backend runs a number of
routines that use views to calculate notifications. The issue is that on a
single node, the couchjs processes stack up and then start to consume
nearly all the available CPU. This server then becomes the "workhorse" that
always does *all* the heavy duty couchjs processing until I restart this
node.
2. It is important to note that both nodes have couchjs processes, but
only one node's couchjs processes are the ones pegging the CPU at 100%
3. I've even resorted to setting `os_process_limit = 10`, and that just
results in each of those couchjs processes taking over 10% each! In other
words, the couchjs processes eat up all the available CPU no matter how
many couchjs processes there are (see the config sketch after this list)
4. The CPU usage will eventually clear after all the processing is done,
but then as soon as there is more to process the workhorse node will get
bogged down again.
5. If I restart the workhorse node, the other node then becomes the
workhorse node. This is the only way to get the couchjs processes to "move"
to another node.
6. The problem is that this design is not scalable as only one node can
be the workhorse node at any given time. Moreover this causes specific
instances to run out of CPU credits. Shouldn't the couchjs processes be
spread out over all my nodes? From what I can tell, if I add more nodes I'm
still going to have the issue where only one of the nodes is getting bogged
down. Is it possible that the problem is that I have 2 nodes and really I
need at least 3 nodes? (I know a 2-node cluster is not very typical)
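
The limits mentioned in point 3 live in the [query_server_config] section and can be inspected or changed over HTTP. A rough sketch, assuming CouchDB 2.x's per-node config endpoint (node name, credentials and values are placeholders; older installs may need the node-local port 5986 or an edit to local.ini plus a restart):

# show the current couchjs limits on one node
$ curl -s -u admin:secret \
    http://127.0.0.1:5984/_node/couchdb@172.31.83.32/_config/query_server_config

# change a limit on the fly
$ curl -s -u admin:secret -X PUT \
    http://127.0.0.1:5984/_node/couchdb@172.31.83.32/_config/query_server_config/os_process_limit \
    -d '"10"'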


Things I've checked:

1. Ensured that the load balancing is working, i.e. haproxy is indeed
distributing traffic accordingly
2. I've tried setting `os_process_limit = 10` and `os_process_soft_limit
= 5` to see if I could force a more conservative usage of couchjs
processes, but instead the couchjs processes just consume all the CPU load.
3. I've tried simulating the issue locally with VMs and I cannot
duplicate any such load. My guess is that this is because the nodes are
located on the same box so hop distance between nodes is very small and
this somehow keeps the CPU usage to a minimum
4. I've tried isolating the issue by creating short code snippets that
intentionally try to spawn a lot of couchjs processes and they are spawned
but don't consume 100% CPU
5. I've tried rolling back from CouchDB 2.1.1 to CouchDB 2.0 and this
doesn't seem to change anything
6. The only error entries in my CouchDB logs are like the following and
I don't believe they are related to my issue:

[error] 2017-12-04T18:13:38.728970Z cou...@172.31.83.32 <0.13974.79>
4b0b21c664 rexi_server: from: cou...@172.31.83.32(<0.20638.79>) mfa:
fabric_rpc:open_shard/2 throw:{forbidden,<<"You are not allowed to access
this db.">>}
[{couch_db,open,2,[{file,"src/couch_db.erl"},{line,185}]},{fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,267}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]

Does CouchDB have some logic built in that spawns a number of couchjs
processes on a "primary" node? Will future view processing then always be
routed to this "primary" node?

Is there a way to better distribute these heavy duty couchjs processes? Is
it possible to limit their CPU consumption? (I'm hesitant to start down the
path of using something like cpulimit as I think there is a root problem
that needs to be addressed)

I'm running out of ideas and hope that someone has some notion of what is
causing this bizarre load or if there is a bug in CouchDB.

Thank you for any help you can provide!

Geoff

Sinan Gabel

unread,
Dec 4, 2017, 4:08:47 PM
to us...@couchdb.apache.org
Hi,

I am also experiencing 100% CPU usage. I'm not sure why; it happens suddenly
and continues until CouchDB is restarted.
My setup is a single node (n=3, q=8) running CouchDB 2.1.0-6c4def6 on Ubuntu
16.04 with 2 vCPUs and 4.5 GB of memory.

Jan Lehnardt

unread,
Dec 5, 2017, 4:11:50 AM
to us...@couchdb.apache.org
Heya Geoff,

a CouchDB cluster is designed to run in the same data center, with local-area networking latencies. A cluster across AWS Availability Zones won't work, as you are seeing. If you want CouchDB in both AZs, use regular replication and keep each cluster local to its AZ.

Best
Jan
--
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/
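
For what it's worth, the replication-based layout Jan describes is just a pair of continuous replications between two independent clusters. A minimal sketch using the standard _replicator database; hosts, database name and credentials are placeholders, and a mirror-image document on the other cluster gives you bidirectional sync:

$ curl -s -u admin:secret -X POST http://cluster-a.example.com:5984/_replicator \
    -H 'Content-Type: application/json' \
    -d '{"source": "http://admin:secret@cluster-a.example.com:5984/mydb",
         "target": "http://admin:secret@cluster-b.example.com:5984/mydb",
         "continuous": true}'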

Jan Lehnardt

unread,
Dec 5, 2017, 4:12:13 AM
to us...@couchdb.apache.org

> On 4. Dec 2017, at 22:08, Sinan Gabel <sinan...@gmail.com> wrote:
>
> Hi,
>
> I am also experiencing 100% CPU usage, not sure why, it happens suddenly
> and continues until couchdb is restarted.
> CouchDB version being used is also single-node (n:3, q:8) and v.
> 2.1.0-6c4def6 on Ubuntu 16.04 2 vCPU's and 4.5 GB memory.

We need a lot more info about your setup, configuration, log files etc. to comment.

Thanks!
Jan
--

Robert Samuel Newson

unread,
Dec 5, 2017, 7:43:29 AM
to user
Sorry to contradict you, but Cloudant deploys clusters across Amazon AZs as standard. It's fast enough. It's cross-region that you need to avoid.

B.

Geoffrey Cox

unread,
Dec 5, 2017, 9:36:28 AM
to us...@couchdb.apache.org
Thanks for the responses, any other thoughts?

FYI: I’m trying to work on a very focused test case that I can share with
the Dev team, but it is taking a little while to narrow down the exact
cause.
On Tue, Dec 5, 2017 at 4:43 AM Robert Samuel Newson <rne...@apache.org>
wrote:

Jan Lehnardt

unread,
Dec 5, 2017, 10:03:43 AM
to us...@couchdb.apache.org
Oops, sorry, I got AZs and regions mixed up!

Best
Jan
--

Adam Kocoloski

unread,
Dec 5, 2017, 11:55:20 AM
to us...@couchdb.apache.org
Hi Geoff, a couple of additional questions:

1) Are you making these view requests with stale=ok or stale=update_after?
2) What are you using for N and Q in the [cluster] configuration settings?
3) Did you take advantage of the (barely-documented) "zones" attribute when defining cluster members?
4) Do you have any other JS code besides the view definitions?

Regarding #1, the cluster will actually select shards differently depending on the use of those query parameters. When your request stipulates that you're OK with stale results, the cluster *will* select a "primary" copy in order to improve the consistency of repeated requests to the same view. The algorithm for choosing those primary copies is somewhat subtle, hence my question #3.
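
For readers following along, the difference is just a query parameter on the view request; database, ddoc and view names below are placeholders:

# normal request: the index is brought up to date before the response
$ curl -s 'http://127.0.0.1:5984/mydb/_design/notifications/_view/by_user'

# stale request: answered from whatever is already indexed, and the variant
# that gets pinned to a "primary" shard copy as described above
$ curl -s 'http://127.0.0.1:5984/mydb/_design/notifications/_view/by_user?stale=ok'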

If you’re not using stale requests I have a much harder time explaining why the 100% CPU issue would migrate from node to node like that.

Adam

Geoffrey Cox

unread,
Dec 5, 2017, 3:13:48 PM
to us...@couchdb.apache.org
Hey Adam,

Attached is my local.ini and the design doc with the view JS.

Please see my responses below:

Thanks for the help!

On Tue, Dec 5, 2017 at 8:55 AM Adam Kocoloski <koco...@apache.org> wrote:
Hi Geoff, a couple of additional questions:

1) Are you making these view requests with stale=ok or stale=update_after?
GC: I am not using the stale parameter 
2) What are you using for N and Q in the [cluster] configuration settings?
GC: As per the attached local.ini, I specified n=2 and am using the default q=8.
3) Did you take advantage of the (barely-documented) “zones" attribute when defining cluster members?
GC: As per the attached local.ini, I have *not* specified this option. 
4) Do you have any other JS code besides the view definitions?
GC: If by JS code you mean JS code "in" CouchDB, then my only JS code is a set of very simple views like those in the attached views.json. (I know I really need to break out the views so that there is one view per ddoc, but I haven't gotten around to refactoring this, and I don't believe it is causing the CPU usage)
views.json

Geoffrey Cox

unread,
Dec 5, 2017, 5:49:10 PM
to us...@couchdb.apache.org
Hi Adam, a quick follow-up: is it possible that writes can also be directed
to a "primary" node, like the `stale` option for a view? I was originally
thinking that the issue is with reading data via a view, but now I'm
thinking it may be related to writing data and those writes somehow
triggering these persistent and heavyweight couchjs processes. It's tough
to say as I'd imagine that you don't have couchjs load unless you have
frequent writing and reading. I'm still trying to isolate the issue and it
is difficult as the problem only seems to happen in a production env and
only with *all* the production code, figures ;)
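
One way to see which node is actually doing the view builds while this happens is to watch _active_tasks; indexer jobs show up there, and in 2.x each entry should carry a "node" field. Hostnames below are placeholders:

# ask any node for the task list and look for "type": "indexer"
$ curl -s http://127.0.0.1:5984/_active_tasks

# or hit each node directly, bypassing haproxy, to compare them
$ curl -s http://node1.internal:5984/_active_tasks
$ curl -s http://node2.internal:5984/_active_tasks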

Jan Lehnardt

unread,
Dec 6, 2017, 4:31:45 AM
to us...@couchdb.apache.org

> On 5. Dec 2017, at 21:13, Geoffrey Cox <redg...@gmail.com> wrote:
>
> Hey Adam,
>
> Attached is my local.ini and the design doc with the view JS.
>
> Please see my responses below:
>
> Thanks for the help!
>
> On Tue, Dec 5, 2017 at 8:55 AM Adam Kocoloski <koco...@apache.org> wrote:
> Hi Geoff, a couple of additional questions:
>
> 1) Are you making these view requests with stale=ok or stale=update_after?
> GC: I am not using the stale parameter
> 2) What are you using for N and Q in the [cluster] configuration settings?
> GC: As per the attached local.ini, I specified n=2 and am using the default q=8.
> 3) Did you take advantage of the (barely-documented) “zones" attribute when defining cluster members?
> GC: As per the attached local.ini, I have *not* specified this option.
> 4) Do you have any other JS code besides the view definitions?
> GC: When you refer to JS code, I think you mean in terms of JS code "in" CouchDB and if that is the case then my only JS code is very simple views like those in the attached view.json. (I know that I really need to break out the views so that there is one view per doc, but I haven't quite gotten around to refactoring this and I don't believe this is causing the CPU usage)

Quick comment on one vs. multiple views per ddoc: this is a performance trade-off, and neither choice is always correct. But generally, I would recommend grouping all the views an app needs into a single ddoc.

For each ddoc, all docs in a database have to be serialised and shipped to couchjs, and the results are shipped back; that's the bulk of the work in view indexing. Evaluating a single map/reduce function is comparatively minuscule, so grouping views in a single ddoc makes that more efficient.
> <views.json>
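
As a concrete illustration of the grouping Jan recommends, two simple map functions can live in one design doc, so every document is shipped to couchjs once per ddoc rather than once per view. Names and credentials here are made up:

$ curl -s -u admin:secret -X PUT http://127.0.0.1:5984/mydb/_design/app \
    -H 'Content-Type: application/json' \
    -d '{
          "views": {
            "by_user":   {"map": "function (doc) { if (doc.userId) { emit(doc.userId, null); } }"},
            "by_status": {"map": "function (doc) { if (doc.status) { emit(doc.status, null); } }"}
          }
        }'
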

Geoffrey Cox

unread,
Dec 6, 2017, 9:56:25 AM
to us...@couchdb.apache.org
Interesting, I read somewhere that having a view per ddoc is more
efficient. Thanks for clarifying!

Geoffrey Cox

unread,
Dec 9, 2017, 8:01:15 PM
to us...@couchdb.apache.org
Well, I'm back at this. Here is the latest info; I think it may be
related to writes and the _global_changes database:

1. I run my production env and one of my nodes becomes the "workhorse"
node with 100% CPU
2. I stop all my production code from generating any more CouchDB
requests and eventually the workhorse node goes back to 0% CPU
3. I can then issue writes on a single database (really any database and
ANY node--not just the workhorse node) and the workhorse node will kick
back up to 100% CPU. If I stop the writes, the workhorse node will return
to 0% CPU.
4. And now the punch line: if I delete the _global_changes database, the
CPU drops down to 0% even if I am issuing writes! Pure cray cray

Any thoughts?

(Sorry, still working on a reproducible env for everyone)
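
For anyone wanting to reproduce the test in step 4: the deletion is a single request (credentials are placeholders, and the database can be recreated later with a PUT). Alternatively, the feature can be left in place but told to stop recording writes; I believe the relevant setting is update_db in the [global_changes] section:

# remove the _global_changes database entirely
$ curl -s -u admin:secret -X DELETE http://127.0.0.1:5984/_global_changes

# or keep the database but stop feeding it (per-node config endpoint)
$ curl -s -u admin:secret -X PUT \
    http://127.0.0.1:5984/_node/couchdb@172.31.83.32/_config/global_changes/update_db \
    -d '"false"'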

Sinan Gabel

unread,
Dec 10, 2017, 9:28:55 AM
to us...@couchdb.apache.org
Hi,

I do not have a solution but am also experiencing the 100% CPU usage. Here's a .png screenshot of the processes running (in my case):

Sinan Gabel

unread,
Dec 10, 2017, 10:50:49 AM
to us...@couchdb.apache.org
PS: I have just now upgraded to the latest clustered CouchDB version; I will
see if that solves the 100% CPU usage problem.

Sinan Gabel

unread,
Dec 10, 2017, 12:07:23 PM
to us...@couchdb.apache.org
Same problem with the latest clustered CouchDB version. What is the process
"./fs-manager" doing?

Googling it returned something about a Python package called "fs-manager"!?

For now I have killed the couchdb fs-manager process, and CouchDB still
seems to be working fine.

Robert Samuel Newson

unread,
Dec 10, 2017, 12:11:33 PM
to user
fs-manager is not part of CouchDB. You should check that you've not been hacked. See https://justi.cz/security/2017/11/14/couchdb-rce-npm.html.

I've seen another case where a user's couchdb installation was compromised and a bitcoin mining tool installed. This would (obviously) use all your cpu, and I think fs-manager was part of that.

B.

Geoffrey Cox

unread,
Dec 10, 2017, 12:14:36 PM
to us...@couchdb.apache.org
Hi Sinan, I don't believe we are encountering the same problem, as my
CouchDB instances run fine with almost no CPU usage until I introduce some
activity. Even then the load itself would be fine; the issue is that with a
2-node cluster, the majority of the activity is focused on just one of the
nodes.

Perhaps in your situation you just need to add some more CPU or memory
capacity?

Sinan Gabel

unread,
Dec 10, 2017, 1:03:05 PM
to us...@couchdb.apache.org
@Robert Thanks a lot for the advice, I will check it.

On 10 December 2017 at 18:11, Robert Samuel Newson <rne...@apache.org>
wrote:

Sinan Gabel

unread,
Dec 10, 2017, 1:06:05 PM
to us...@couchdb.apache.org
@Geoffrey Thanks Geoffrey, I will start by checking out the advice of
@Robert and see what that brings. I understand now that ./fs-manager has
nothing to do with couchdb but it is running as user: couchdb so hacking
could be the issue in my case.

Sinan Gabel

unread,
Dec 11, 2017, 9:04:28 AM
to us...@couchdb.apache.org
I have now checked it, and it is as you perceived, @Robert: a crypto mining
program.

It is a program called "fs-manager", with the config.json given below, that
is restarted every 3 hours or so (8:41, 11:41, etc.) unless it is already
running.

I updated CouchDB to the latest version and changed all admin passwords, but
that has not solved it, so it is hidden somewhere on the server.

Does anyone have a clue how to remove this hack completely?



$ sudo find / -name "fs-manager"

=>

/var/tmp/.X1M-Unix/fs-manager


$ ls -al

=>


$ less config.json

=>

{ "algo": "cryptonight", "av": 0, "background": true, "colors": false,
"cpu-affinity": null, "cpu-priority": null, "donate-level": 2, "log-file":
"xmrig.log", "max-cpu-usage": 85, "print-time": 60, "retries": 2,
"retry-pause": 3, "safe": false, "syslog": false, "threads": null, "pools":
[ { "url": "pool-proxy.com:8080", "user": "user", "pass": "x", "keepalive":
true, "nicehash": false } ]}




On 10 December 2017 at 18:11, Robert Samuel Newson <rne...@apache.org>
wrote:

Sinan Gabel

unread,
Dec 11, 2017, 11:49:45 AM
to us...@couchdb.apache.org
I just found this; removing it may solve the problem. It is a script that starts the crypto mining program as user couchdb.

[inline image: screenshot of the crontab entry]

On 11 December 2017 at 15:04, Sinan Gabel <sinan...@gmail.com> wrote:



Geoffrey Cox

unread,
Dec 11, 2017, 7:48:36 PM
to us...@couchdb.apache.org
I've finally managed to narrow this down to a reproducible set of scripts
and have created a GH issue: https://github.com/apache/couchdb/issues/1063

The summary is that I believe there is a resource leak in CouchDB when you
abort continuous listening on the _global_changes database. Fortunately,
there appears to be a workaround (if your use case supports it) where you
can use feed=longpoll instead of feed=continuous.

I hope this saves someone else some headache!
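
For anyone wanting to compare the two modes, the difference is just the feed parameter on the _changes request; host and credentials are placeholders:

# continuous: one long-lived connection that the server keeps pushing to
$ curl -s -u admin:secret \
    'http://127.0.0.1:5984/_global_changes/_changes?feed=continuous&since=now'

# longpoll: the connection closes after each batch of results, which appears
# to avoid the leak described above; re-issue it with the last returned seq
$ curl -s -u admin:secret \
    'http://127.0.0.1:5984/_global_changes/_changes?feed=longpoll&since=now'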

Sinan Gabel

unread,
Dec 21, 2017, 12:27:57 PM
to us...@couchdb.apache.org
PS: Just want to say that removing the cron job as described resolves both the crypto mining problem and the 100% CPU problem. (I have had responses from other users experiencing the exact same problem.)

$ sudo -i
$ su couchdb
$ crontab -e

Then remove the offending line in the crontab. If you use the nano editor, the commands are "ctrl-k" to remove lines, "ctrl-o" to write the file, and "ctrl-x" to exit the crontab editing. That should resolve the problem; otherwise just write to me.
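
If you prefer not to open an editor, the same check and cleanup can be done non-interactively with standard crontab flags:

# list the couchdb user's cron entries, then wipe them if they are all the miner's
$ sudo crontab -l -u couchdb
$ sudo crontab -r -u couchdb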

Víctor Torre

unread,
Dec 26, 2017, 3:18:31 AM
to us...@couchdb.apache.org
Hi Sinan,

I recommend you destroy the server that has been hacked. If someone put a
mining program on it, they may have persisted other malicious code as well
(rootkit, etc.).

Definitely: if you detect that a database node is compromised, you must
destroy it and restore from a backup.
--

*Victor Torre*
DevOps
Madrid, Spain

victor...@cabify.com



Sinan Gabel

unread,
Dec 26, 2017, 4:41:41 AM
to us...@couchdb.apache.org
@Victor Thanks Victor, sounds like the right thing to do. Will do.

Karl Helmer

unread,
Jan 5, 2018, 3:00:27 PM
to us...@couchdb.apache.org
Hi Everyone,

I'm having an issue trying to install 1.7.1 on a CentOS 6.9 server.
I've done this successfully on other servers running the same OS. I've
installed the recommended packages and get "Time to relax" after
running configure. The make and make install step doesn't throw any
errors either. But when I start CouchDB using:
$sudo -u couchdb /usr/local/bin/couchdb
it returns

=CRASH REPORT==== 5-Jan-2018::13:02:05 ===
  crasher:
    initial call: application_master:init/4
    pid: <0.55.0>
    registered_name: []
    exception exit: {{app_would_not_start,ssl},
                     {couch_app,start,
                      [normal,
                       ["/usr/local/etc/couchdb/default.ini",
                        "/usr/local/etc/couchdb/local.ini"]]}}
      in function application_master:init/4 (application_master.erl, line 134)
    ancestors: [<0.54.0>]
    messages: [{'EXIT',<0.56.0>,normal}]
    links: [<0.54.0>,<0.31.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 610
    stack_size: 27
    reductions: 132
  neighbours:

=INFO REPORT==== 5-Jan-2018::13:02:05 ===
    application: couch
    exited: {{app_would_not_start,ssl},
             {couch_app,start,
              [normal,
               ["/usr/local/etc/couchdb/default.ini",
                "/usr/local/etc/couchdb/local.ini"]]}}
    type: temporary

Is this a problem related to the SSL connection? I've checked the user:group
of the files mentioned and they're all couchdb:couchdb, and I have changed
the ownership of the directories as per the instructions. I'm not sure what
this is telling me, and any help would be appreciated.

thanks,
Karl
--
Karl Helmer, PhD
Athinoula A Martinos Center for Biomedical Imaging
Massachusetts General Hospital
149 - 13th St Room 2301
Charlestown, MA 02129
(p) 617.726.8636
(f) 617.726.7422
hel...@nmr.mgh.harvard.edu
http://www.martinos.org/user/6787





Dave Cottlehuber

unread,
Jan 6, 2018, 11:27:52 AM
to us...@couchdb.apache.org
On Fri, 5 Jan 2018, at 21:00, Karl Helmer wrote:
> Hi Everyone,
>
> I'm having an issue trying to install 1.7.1 on a CentOS 6.9 server.
> I've done this successfully on other servers running the same OS. I've
> installed the recommended packages and get "Time to relax" after
> running configure. The make and make install step doesn't throw any

> CRASH REPORT==== 5-Jan-2018::13:02:05 ===
> crasher:
> initial call: application_master:init/4
> pid: <0.55.0>
> registered_name: []
> exception exit: {{app_would_not_start,ssl},

Hi Karl,

This suggests asn1 or crypto modules are perhaps not installed.

$ erl
Erlang/OTP 20 [erts-9.2] [source] [64-bit] [smp:32:32] [ds:32:32:10] [async-threads:10] [hipe] [kernel-poll:true] [dtrace]

Eshell V9.2 (abort with ^G)
1> application:ensure_all_started(ssl).
{ok,[crypto,asn1,public_key,ssl]}
2>

What do you get here?

What steps, specifically, did you use to install? How is Erlang being installed?

A+
Dave

Karl Helmer

unread,
Jan 6, 2018, 4:47:18 PM
to us...@couchdb.apache.org
Thanks Dave, you're right: it's my Erlang install. It looks as though there
was an initial, possibly partial, install of Erlang R14B-04.3 that is
colliding with my attempt to put on 19.3.6-1 (which I have running
successfully on other servers). I've tried to start over using yum remove,
but it says that it can't find erlang:

$sudo yum remove 'erlang-*'

Loaded plugins: fastestmirror, security
Setting up Remove Process
No Match for argument: erlang-*
Loading mirror speeds from cached hostfile
* base: mirror.sigmanet.com
* epel: mirror.steadfast.net
* extras: centos.s.uw.edu
* updates: mirrors.lga7.us.voxel.net
Package(s) erlang-* available, but not installed.
No Packages marked for removal

But somehow it seems to think that 19.3.6 is already installed:

$sudo yum localinstall esl-erlang_19.3.6-1~centos~6_amd64.rpm
Loaded plugins: fastestmirror, security
Setting up Local Package Process
Examining esl-erlang_19.3.6-1~centos~6_amd64.rpm: esl-erlang-19.3.6-1.x86_64
esl-erlang_19.3.6-1~centos~6_amd64.rpm: does not update installed package.
Nothing to do

Why this is odd is that when I do:
$erl --version
-bash: /usr/bin/erl: No such file or directory

I expected this, since I had then done a manual removal of erlang by hand
(yum remove said it couldn't find anything), deleting:

/usr/local/lib/erlang (for 19.x)
/usr/lib/erlang
/usr/lib64/erlang
/usr/share/java/erlang
and the links in /usr/bin/:
epmd, erl, erlc, escript, run_erl, run_test, to-erl


I also removed the various erlang rpm’s in the directory:
/var/cache/yum/x86_64/6/epel/packages

The only erlang files left are the ones in the couchdb installation and
/usr/share/augeas/lenses/dist/erlang.aug
/usr/share/autoconf/autoconf/erlang.m4
/usr/share/mime/text/x-erlang.xml
/usr/share/vim/vim74/syntax/erlang.vim
/usr/share/vim/vim74/ftplugin/erlang.vim
/usr/share/vim/vim74/compiler/erlang.vim
/usr/share/vim/vim74/indent/erlang.vim

Same problem persists - yum still thinks that R19 is installed. Should I
delete any of the above files? Any other ideas are welcome.

thanks,
Karl
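
One possible explanation for the mismatch: the package that was localinstalled is named esl-erlang, not erlang-anything, so `yum remove 'erlang-*'` would never match it even though yum still considers it installed. A quick check along those lines, using standard rpm/yum commands:

# see which Erlang packages the RPM database still thinks are installed
$ rpm -qa | grep -i erlang

# if esl-erlang is listed, remove it under its real name, then reinstall cleanly
$ sudo yum remove esl-erlang
$ sudo yum localinstall esl-erlang_19.3.6-1~centos~6_amd64.rpm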