memory leaks in sync_gateway


Brett Harrison

Apr 1, 2015, 1:30:34 PM
to mobile-c...@googlegroups.com
I am seeing memory leaks in sync_gateway (1.0.4).

I am not a Go programmer but I was hoping to gather some data that would be useful in finding these leaks. 
I turned on profiling and took heap data snapshots about 12 hours apart.  I hope I did this correctly.
Is this data useful in finding the leaks?  Is there anything else I could do to help?


Snapshot 1 pprof:

Entering interactive mode (type "help" for commands)
(pprof) top
170.47MB of 173.04MB total (98.52%)
Dropped 253 nodes (cum <= 0.87MB)
Showing top 10 nodes out of 39 (cum >= 1MB)
      flat  flat%   sum%        cum   cum%
   68.63MB 39.66% 39.66%    70.63MB 40.82%  strconv.fmtF
   64.63MB 37.35% 77.01%    64.63MB 37.35%  github.com/couchbase/sync_gateway/rest.(*handler).handlePersonaPOST
   12.69MB  7.34% 84.35%    18.20MB 10.52%  goexit0
   10.50MB  6.07% 90.42%    10.50MB  6.07%  badfutexwakeup
    3.50MB  2.02% 92.44%    21.70MB 12.54%  github.com/couchbase/sync_gateway/base.ParseLogFlags
    3.01MB  1.74% 94.18%     3.01MB  1.74%  strconv.fmtE
    2.80MB  1.62% 95.80%     2.80MB  1.62%  net/http.initNPNRequest.ServeHTTP
    2.71MB  1.56% 97.36%     2.71MB  1.56%  github.com/gorilla/mux.(*Route).URLHost
       1MB  0.58% 97.94%        1MB  0.58%  github.com/couchbaselabs/walrus.(*lolrus).GetRaw
       1MB  0.58% 98.52%        1MB  0.58%  fmt.(*ss).scanOne

Snapshot 2 pprof:

(pprof) top
394.93MB of 428.50MB total (92.16%)
Dropped 239 nodes (cum <= 2.14MB)
Showing top 10 nodes out of 54 (cum >= 6.50MB)
      flat  flat%   sum%        cum   cum%
   69.63MB 16.25% 16.25%    84.64MB 19.75%  strconv.fmtF
   68.67MB 16.03% 32.28%    68.67MB 16.03%  net/http.initNPNRequest.ServeHTTP
   64.63MB 15.08% 47.36%    64.63MB 15.08%  github.com/couchbase/sync_gateway/rest.(*handler).handlePersonaPOST
   45.20MB 10.55% 57.91%   164.58MB 38.41%  goexit0
   39.70MB  9.27% 67.17%    39.70MB  9.27%  github.com/gorilla/mux.(*Route).URLHost
   37.01MB  8.64% 75.81%   201.59MB 47.04%  github.com/couchbase/sync_gateway/base.ParseLogFlags
   35.56MB  8.30% 84.11%    35.56MB  8.30%  badfutexwakeup
   19.50MB  4.55% 88.66%    19.50MB  4.55%  type..eq.runtime.mspan
    8.52MB  1.99% 90.65%     8.52MB  1.99%  github.com/couchbase/sync_gateway/rest.(*handler).userAgentIs
    6.50MB  1.52% 92.16%     6.50MB  1.52%  github.com/couchbaselabs/walrus.(*lolrus).GetRaw

(pprof) list net/http.initNPNRequest.ServeHTTP
Total: 428.50MB
ROUTINE ======================== net/http.initNPNRequest.ServeHTTP in /usr/local/go/src/net/http/server.go
   68.67MB    68.67MB (flat, cum) 16.03% of Total
         .          .   2024:    h serverHandler
         .          .   2025:}
         .          .   2026:
         .          .   2027:func (h initNPNRequest) ServeHTTP(rw ResponseWriter, req *Request) {
         .          .   2028:    if req.TLS == nil {
   68.67MB    68.67MB   2029:        req.TLS = &tls.ConnectionState{}
         .          .   2030:        *req.TLS = h.c.ConnectionState()
         .          .   2031:    }
         .          .   2032:    if req.Body == nil {
         .          .   2033:        req.Body = eofReader
         .          .   2034:    }

(pprof) list goexit0
Total: 428.50MB
ROUTINE ======================== goexit0 in /usr/local/go/src/runtime/proc.c
   45.20MB   164.58MB (flat, cum) 38.41% of Total
         .          .   1688:    runtime·mcall(&fn);
         .          .   1689:}
         .          .   1690:
         .          .   1691:// runtime·goexit continuation on g0.
         .          .   1692:static void
         .          .   1693:goexit0(G *gp)
         .          .   1694:{
         .     2.50MB   1695:    runtime·casgstatus(gp, Grunning, Gdead);
         .          .   1696:    gp->m = nil;
         .          .   1697:    gp->lockedm = nil;
         .          .   1698:    g->m->lockedg = nil;
         .          .   1699:    gp->paniconfault = 0;
         .          .   1700:    gp->defer = nil; // should be true already but just in case.
         .          .   1701:    gp->panic = nil; // non-nil for Goexit during panic. points at stack-allocated data.
         .          .   1702:    gp->writebuf.array = nil;
   45.20MB   162.08MB   1703:    gp->writebuf.len = 0;
         .          .   1704:    gp->writebuf.cap = 0;
         .          .   1705:    gp->waitreason.str = nil;
         .          .   1706:    gp->waitreason.len = 0;
         .          .   1707:    gp->param = nil;
         .          .   1708:
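To make the growth easier to see, the two `top` listings can be diffed per function. A rough sketch (the flat-MB values below are pasted from the snapshots above; with real data you would extract the first column and the function name from each listing):

```shell
# Per-function flat allocation growth between the two heap snapshots.
# Values are copied from the pprof `top` output in this thread.
cat > snap1.txt <<'EOF'
68.63 strconv.fmtF
2.80 net/http.initNPNRequest.ServeHTTP
2.71 github.com/gorilla/mux.(*Route).URLHost
EOF
cat > snap2.txt <<'EOF'
69.63 strconv.fmtF
68.67 net/http.initNPNRequest.ServeHTTP
39.70 github.com/gorilla/mux.(*Route).URLHost
EOF
sort -k2 snap1.txt > s1.txt   # join(1) needs its input sorted on the key
sort -k2 snap2.txt > s2.txt
# Join on function name (field 2); print growth in MB, largest first.
join -j 2 s1.txt s2.txt | awk '{printf "%8.2f MB  %s\n", $3 - $2, $1}' | sort -rn
```

The ~66 MB of growth attributed to net/http.initNPNRequest.ServeHTTP lines up with the `list` output above, which pins the allocation to the `req.TLS = &tls.ConnectionState{}` line.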

Jens Alfke

Apr 1, 2015, 1:43:30 PM
to mobile-c...@googlegroups.com

On Apr 1, 2015, at 10:30 AM, Brett Harrison <brett.h...@zyamusic.com> wrote:

I am not a Go programmer but I was hoping to gather some data that would be useful in finding these leaks. 

It’s nearly impossible to inspect Go programs using regular tools, because Go doesn’t use the platform’s calling conventions, stack layout, or memory allocator. (It uses its own lightweight threads and a garbage collector.)

At a high level you can watch the total memory usage of the Go process, or its CPU usage, but to get any finer grained than that you have to use Go tools. I do know that the SG team have been doing a lot of work tracking down memory issues lately, so you might try building your own SG from the master branch (it’s pretty easy even if you don’t know Go) and seeing if you still see excessive memory use.

—Jens

Adam Fraser

Apr 1, 2015, 2:12:16 PM
to mobile-c...@googlegroups.com
Hi Brett,

I'd be interested in more details about your Sync Gateway usage during the period of your testing, to understand whether this is actually a potential memory "leak", versus increased memory usage in response to activity in the environment.  

I saw your earlier post (about several thousand active users eventually using up 6GB of RAM).   Some more specifics about usage over the 12 hours might help understand whether the memory usage is expected or an issue.  

Thanks,
Adam

Brett Harrison

Apr 1, 2015, 4:02:03 PM
to mobile-c...@googlegroups.com
Here are some stats.  This server has been up for about 1 day 6 hours and has had about 500,000 sessions on it (HAProxy stat).
From the top output below you can see that sync_gateway currently holds 40% of the memory.  Used memory increases slowly with traffic until it is all used up, and then the process crashes and restarts.

I have tried to compile and run the master build, but after a while it starts logging a lot of errors about sockets disconnecting, so I went back to 1.0.4.

  538 root      20   0 3765652 2.349g   5960 S   1.3 40.0  35:11.40 sync_gateway                                                                                                  

    1 root      20   0   33480   2808   1492 S   0.0  0.0   0:01.28 init                                                                                                          

    2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd                                                                                                      

    3 root      20   0       0      0      0 S   0.0  0.0   0:03.77 ksoftirqd/0                                                                                                   

    4 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0                                                                                                   

    5 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/0:0H                                                                                                  

    6 root      20   0       0      0      0 S   0.0  0.0   0:00.53 kworker/u8:0    


root@sync01:~# date

Wed Apr  1 18:30:53 CDT 2015

root@sync01:~# free

             total       used       free     shared    buffers     cached

Mem:       6153524    2664664    3488860        216     128776     179636

-/+ buffers/cache:    2356252    3797272

Swap:      2096444          0    2096444

root@sync01:~# date

Wed Apr  1 19:00:14 CDT 2015

root@sync01:~# free

             total       used       free     shared    buffers     cached

Mem:       6153524    2735788    3417736        216     129692     181180

-/+ buffers/cache:    2424916    3728608

Swap:      2096444          0    2096444

root@sync01:~# date 

Wed Apr  1 19:53:01 CDT 2015

root@sync01:~# free

             total       used       free     shared    buffers     cached

Mem:       6153524    2884424    3269100        216     131268     184148

-/+ buffers/cache:    2569008    3584516

Swap:      2096444          0    2096444
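Incidentally, the three `free` samples above already give a growth rate. A quick sketch using the "-/+ buffers/cache" used column (values in KB, copied from the output; this assumes the growth is all attributable to sync_gateway):

```shell
# Memory growth rate between the 18:30 and 19:53 `free` samples above,
# using the "-/+ buffers/cache" used column (KB).
awk 'BEGIN {
    start = 2356252                       # used KB at 18:30
    end   = 2569008                       # used KB at 19:53
    mins  = (19 * 60 + 53) - (18 * 60 + 30)
    mb    = (end - start) / 1024
    printf "growth: %.1f MB over %d min (~%.0f MB/hour)\n", mb, mins, mb * 60 / mins
}'
```

At roughly 150 MB/hour of sustained traffic, a 6 GB box runs out of memory in well under two days, which is consistent with the crash-and-restart cycle described.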

Adam Fraser

Apr 1, 2015, 5:33:28 PM
to mobile-c...@googlegroups.com
A few more questions to try to narrow down the possible memory usage:

1. What are the read/write ops in that time window?   
2. What's your distribution of docs across channels (many channels, few channels, etc)?
3. What type of bucket are you using (if Couchbase Server, what version)?

Thanks,
Adam

Brett Harrison

Apr 1, 2015, 5:57:39 PM
to mobile-c...@googlegroups.com
Over 1 hour I am averaging about 85 reads and less than 1 set per second on Couchbase Server 3.0.1 Community Edition.  (There are 4 sync_gateways connected to that cluster.)

I am using the sync_gateway to deliver dynamic content to the app (iOS and Android).  The content does not change very often (no changes over the last 5 days).
The app is currently only pulling this content.  There are no documents pushed from the app.  I have 35 documents in a channel called "public".  Everyone uses the guest user to get the "public" documents (this was made before the "!" channel became available).

Of the 35 documents, 4 are marked as deleted and 30 have 2 attachments each (image and audio).

Adam Fraser

Apr 2, 2015, 1:55:29 PM
to mobile-c...@googlegroups.com
Given the very low write load, this seems unlikely to be related to in-memory channel cache handling.  

It's possible you're accumulating unreleased TCP connections and that's resulting in the memory usage.  That problem, along with potential OS-level tuning to address it, is described in this issue: https://github.com/couchbase/sync_gateway/issues/742
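A quick way to check for that kind of connection buildup (a sketch; `ss -tan` reports the same states on newer systems) is to histogram socket states and watch for CLOSE_WAIT piling up:

```shell
# Count sockets by TCP state; a large, growing CLOSE_WAIT count points
# at connections the server never releases. (Guarded so the snippet is
# a no-op on systems without netstat.)
if command -v netstat >/dev/null 2>&1; then
    netstat -tan | awk 'NR > 2 { states[$6]++ } END { for (s in states) print states[s], s }' | sort -rn
fi
```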


Brett Harrison

Apr 2, 2015, 2:58:04 PM
to mobile-c...@googlegroups.com
I don't have many open connections.  Only about 94.  You can see the memory usage continues to climb compared to yesterday.

2954740 used yesterday, 3842568 used today

Could it have anything to do with document attachments?


root@sync01:~# date

Thu Apr  2 18:53:34 CDT 2015

root@sync01:~# lsof -p 538 | grep -i established | wc -l

94

root@sync01:~# free

             total       used       free     shared    buffers     cached

Mem:       6153524    3842568    2310956        216     151072     229764

-/+ buffers/cache:    3461732    2691792

Swap:      2096444          0    2096444

Adam Fraser

Apr 2, 2015, 6:18:13 PM
to mobile-c...@googlegroups.com
I don't know of any current issues with memory usage related to attachments.

I'm not sure I completely understand your environment usage, based on your notes.  Can you confirm the following?

1. A server that has been up for a day or two has had 500,000 sessions in that time
2. Each sync gateway node averages 85 reads/sec
3. Each sync gateway node averages less than one write per second - and zero writes in the last five days?
4. You only have 94 open connections

In particular, how does the 500k number from #1 align with the 94 number from #4?  How are you measuring the number of connections?

Thanks,
Adam

Brett Harrison

Apr 2, 2015, 8:42:07 PM
to mobile-c...@googlegroups.com
1.  I was wrong about the 500,000 sessions.  Those were 500,000 sessions since HAProxy was started on March 25th.
The sync_gateway has crashed and restarted since then.  The last restart was 2 days and 12 hours ago, so the current running instance of the sync_gateway has not handled all of those sessions; it has handled 2.5 days of traffic.

2 and 3.  These numbers came from the Couchbase Server dashboard ("gets per sec" and "sets per sec").
I have not changed any of the documents in 5 days.  The only time the documents change is when I change them over the REST API on the admin port.
The app users only pull the documents; basically all the documents are read-only in the app.  I'm not sure why Couchbase Server is showing any writes; I assumed that the sync_gateway was saving small documents periodically.

4.  To get the current connections I used lsof -p <pid> | grep -i established | wc -l on the sync_gateway server.

A new update to my app went live in the app store yesterday, where I changed the Puller to not use continuous mode, so I expect less traffic on the sync_gateway going forward.  That would mean fewer connections.

Memory usage continues to increase however.

                              total       used       free     shared    buffers     cached

Mem:       6153524    4030928    2122596        216     152852     242036

From the top command...

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                     

      538 root      20   0 4085080 3.365g   5960 S   1.0 57.3  80:19.47 sync_gateway

compare to my earlier top

      538 root      20   0 3765652 2.349g   5960 S   1.3 40.0  35:11.40 sync_gateway 

It's now using 57% of the memory, where before it was using 40%.


Zachary Gramana

Apr 3, 2015, 6:47:07 PM
to mobile-c...@googlegroups.com
Brett,

I’ve opened a new issue for this on GH, https://github.com/couchbase/sync_gateway/issues/768, so let’s move the conversation over there.

The rest of the response is there.

Best,
Zack

--
You received this message because you are subscribed to the Google Groups "Couchbase Mobile" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mobile-couchba...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/mobile-couchbase/880c1f2a-86e9-415e-98a3-4f29028948d3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
