Troubleshooting connection issues


Travis Beauvais

Sep 8, 2015, 4:49:53 PM
to mgo-users
Lately we've been seeing a lot of connection-related issues and are having a hard time figuring out exactly where the problem is. Is it our mgo configuration/usage? Is it our actual mongo machines? Is it our mongod configuration? Anything else?

The problem is that we will suddenly start getting all of these errors: "tcp timeout", "connection closed explicitly", "connection reset by peer", "broken pipe" and "EOF" all seem to happen. "connection reset by peer" seems to be the most common. "closed explicitly" seems to only happen under the heaviest load. The others seem almost random. To troubleshoot, I set up a new mongo machine and wrote a script that just continuously does what our app logic does, to see where it breaks.

That script is at https://gist.github.com/MrTravisB/649c6df37f2db01d9342. Essentially, it takes an account id and a user id, then attempts to find or create a document for that combination. There is a unique index on those fields, so we do a find; if not found, do an insert; if that hits a duplicate-key error, do a find again.

Note that we copy the session each time. In our production app, the session is copied and closed for each find and/or insert operation. In this script the session is copied and closed for each find/insert/find combination.

Our mgo connection settings are pretty simple.

sess, _ := mgo.DialWithTimeout(fmt.Sprintf("localhost:27017?maxPoolSize=%d", 4096*3), 60*time.Second)
sess.SetMode(mgo.Eventual, true)
sess.SetSafe(&mgo.Safe{
       W: 1,
})

We only recently changed to eventual consistency in hopes that it would ease some of the pressure on mongo. I'm not entirely convinced it did anything, though.

All machines run MongoDB 3.0.5 on Amazon Linux. The mongod config is:

systemLog:
    destination: file
    path: /tmp/mongod.log
    logAppend: true
storage:
    dbPath: /mongo
    journal:
        enabled: true
processManagement:
    fork: true
    pidFilePath: /var/run/mongodb/mongod.pid
net:
    port: 27017
replication:
    replSetName: users00
setParameter:
    notablescan: true



ulimit settings on the mongo machines:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 491420
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 491420
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited



ulimit settings on the client machines:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 483041
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65536
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 483041
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited


Is there anything you would recommend looking into? Any config (either in mongod or mgo) that I'm missing or getting wrong? Am I handling connections properly in code?

Gustavo Niemeyer

Sep 8, 2015, 5:29:35 PM
to mgo-...@googlegroups.com
Hi Travis,

What error are you actually getting most of the time, and how often? All of the errors you report are actual issues reported by the socket itself, and none of them is a big deal if only seen once in a while. Also, have you looked at the server logs to get an idea of what the server reports in these situations?

Then, a few side notes about that code:

- Please never ignore errors without a great reason to do so (see DialWithTimeout). As a rule of thumb, if an error is being ignored, there should be a comment explaining why.

- There are concurrency bugs in that code. A variable cannot be read/written by multiple goroutines without synchronization (see stop).

- Why Eventual? It doesn't look like the application is organized to use this mode correctly. Start with Strong, and work your way from there once you better understand the consequences of that choice.

- The "err != nil &&" part is unnecessary when checking err == mgo.ErrNotFound or mgo.IsDup(err); these already imply a non-nil error.

Please let me know how that goes.




Travis Beauvais

Sep 8, 2015, 6:24:33 PM
to mgo-users
At the time we were having problems the mongod logs weren't really showing anything out of the ordinary. The most common errors by far are "connection reset by peer" and "timeout".


On Tuesday, September 8, 2015 at 2:29:35 PM UTC-7, Gustavo Niemeyer wrote:
Hi Travis,

What error are you actually getting most of the time, and how often? All of the errors you report are actual issues reported by the socket itself, and none of them is a big deal if only seen once in a while. Also, have you looked at the server logs to get an idea of what the server reports in these situations?

Then, a few side notes about that code:

- Please never ignore errors without a great reason to do so (see DialWithTimeout). As a rule of thumb, if an error is being ignored, there should be a comment explaining why.

This is a test script. It wasn't meant to be perfect, just to replicate the production environment as closely as possible.

 
- There are concurrency bugs in that code. A variable cannot be read/written by multiple goroutines without synchronization (see stop).

"stop" lives outside of the goroutines. No global variables are read/written in goroutines except for copying the mgo session.
 
- Why Eventual? It doesn't look like the application is organized to use this mode correctly. Start with Strong, and work your way from there once you better understand the consequences of that choice.

In this script it doesn't even matter, since I'm testing against a single instance. In production we moved to it in hopes that spreading reads to secondaries would reduce load on the primary and thus stop the flow of errors. It didn't do that at all. We will likely switch back to the default.
 
- The "err != nil &&" part is unnecessary when checking err == mgo.ErrNotFound or mgo.IsDup(err); these already imply a non-nil error.

Yeah I know.

Travis Beauvais

Sep 10, 2015, 10:53:57 PM
to mgo-users
So I went at this again, this time completely replicating my production environment. The production code has a single goroutine that pulls from a queue and pushes onto a channel. There are then n worker goroutines processing items from that channel.

When I turn n up high (more workers), to around 5k, after a minute or so I get a bunch of connection errors, then they stop and work resumes. After a few more minutes I get another bunch of connection errors, and work resumes again.

I started printing mgo stats just to see what was going on. This is after the first round of connection errors.

MGO clusters 4
MGO master conns 2383
MGO socks refs 0
MGO socks alive 1312
MGO socks in use 0

After the second round of connection errors

MGO clusters 4
MGO master conns 4485
MGO socks refs 0
MGO socks alive 2549
MGO socks in use 0

And now it has pretty much settled here

MGO clusters 4
MGO master conns 5499
MGO socks refs 0
MGO socks alive 3298
MGO socks in use 0

Keep in mind this is with 5k workers, so there shouldn't be much more than 5k connections (give or take a few). If I turn the number of workers down to around 2k, it all works fine: no connection errors. The number of master conns vs socks alive is always in sync; only when I start getting the errors do these two numbers drift apart.

Gustavo Niemeyer

Sep 11, 2015, 6:20:39 AM
to mgo-...@googlegroups.com

Hi Travis,

On Sep 10, 2015 11:53 PM, "Travis Beauvais" <tbea...@gmail.com> wrote:
>
> When I turn n up high (more workers), to around 5k, after a minute or so I get a bunch of connection errors, then they stop and work resumes. After a few more minutes I get another bunch of connection errors, and work resumes again.

And one minute is your connection timeout, isn't it? It might just be overloading the server enough that it is unable to handle everyone's requests in the established timeframe.

How are the errors presented? Which method returns them, and after doing what?

What exactly is in the server logs and in the client logs when this happens?

> Keep in mind this is with 5k workers, so there shouldn't be much more than 5k connections (give or take a few). If I turn the number of workers down to around 2k, it all works fine: no connection errors. The number of master conns vs socks alive is always in sync; only when I start getting the errors do these two numbers drift apart.

It's natural for them not to be in sync. It just means that more connections were established to the server over time than are open right now, which is exactly what should happen when there's an issue with an existing connection and a new one is opened in its place.

gustavo @ http://niemeyer.net
