HELP - Go program occasionally freezing under heavy load without any clue or trace


brai...@gmail.com

Jul 10, 2013, 11:24:31 AM
to golan...@googlegroups.com, Alexandru Mocioi, Dragos STOICA
This is urgent. Please give me some ideas: what other parameters or options should I compile with in order to find out why the program freezes completely?

Hardware and software environment
- Linux ubuntu 2.6.38-8-server #42-Ubuntu SMP Mon Apr 11 03:49:04 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux
- 4 core Intel(R) Xeon(R) CPU 5160  @ 3.00GHz
- 8 GB RAM, SCSI RAID

Application description and interactions with other services
The Go program acts as a middleware HTTP server (net/http) talking to >400 clients (browsers + jQuery) and to the following services:
- CouchDB as the database, through a pool (rrpool) of 50 connections over plain HTTP (on the same machine)
- CouchDB-Lucene as a full text indexer (on the same machine)
- PostgreSQL database through github.com/bmizerany/pq module (on another machine)
- Memcache for session caches through github.com/bradfitz/gomemcache/memcache
- goroutines that spawn external programs (xelatex & php scripts) as PDF generators and wait for completion

The program ran fine for 2 weeks under a moderate load of ~4000 incoming documents per day!
Suddenly, under heavy load, the program started to freeze completely, with these symptoms:
- Not a single line (incoming requests or anything else) appears in the log!
- Killing the process with kill -9 leaves it in a <defunct> state for a minute or two until it is gone and can be restarted!
- A simple "heartbeat" goroutine that writes a message to the log every 5 seconds stops writing
- The HTTP event loop that dispatches request handlers is frozen; a simple "echo, I'm alive" handler does not respond

At that time, CouchDB reports nothing strange in its logs and is running perfectly, PostgreSQL is running OK, and the Lucene logs are clean!
The main program is NOT eating memory (it stays at 480 MB RAM) or CPU (it sits at 0% while frozen).
I tried compiling it to run on just a single core, with the -race flag ... anything, but I got the same result!
Under heavy load it simply freezes at random places in the program!
=========

Please tell me whether there are tools for debugging a completely frozen program, and what other compile options are available to check for deadlocks or similar problems!
I've set the GOGCTRACE=2 environment variable and got garbage collection information in the log file, such as:
scvg41: inuse: 9, idle: 5, sys: 15, released: 2, consumed: 12 (MB)
gc318(1): 3+3+0 ms, 9 -> 4 MB 69792 -> 31708 (4289323-4257615) objects, 0(0) handoff, 0(0) steal, 0/0/0 yields
gc319(1): 4+0+0 ms, 4 -> 4 MB 31709 -> 31686 (4289324-4257638) objects, 0(0) handoff, 0(0) steal, 0/0/0 yields
gc320(1): 3+0+0 ms, 4 -> 4 MB 31684 -> 31684 (4289324-4257640) objects, 0(0) handoff, 0(0) steal, 0/0/0 yields

Other ideas?
Thanks in advance,
Teo

Kyle Lemons

Jul 10, 2013, 1:48:03 PM
to brai...@gmail.com, golang-nuts, Alexandru Mocioi, Dragos STOICA
Send SIGQUIT (^\) to the process, or enable net/http/pprof and visit the daemon's /debug/pprof/goroutine?debug=2 for a goroutine trace, perhaps?




Rodrigo Kochenburger

Jul 10, 2013, 1:52:19 PM
to golan...@googlegroups.com, Alexandru Mocioi, Dragos STOICA
Have you tried running w/ GOMAXPROCS > 1 but w/ the race detector enabled?

You could also try something like this to get debug information: http://dustin.github.io/2013/07/04/siginfo.html

Also, try generating a core dump that you can inspect in GDB: http://man7.org/linux/man-pages/man5/core.5.html
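
The linked post is not reproduced here. As a rough sketch of the general idea (dump all goroutine stacks on a signal without killing the process), something like the following could work on Linux. The choice of SIGUSR1 and the helper names allStacks and dumpOnSignal are assumptions for illustration.

```go
package main

import (
	"os"
	"os/signal"
	"runtime"
	"syscall"
	"time"
)

// allStacks returns the stack traces of all goroutines, growing the buffer
// until the whole dump fits.
func allStacks() []byte {
	buf := make([]byte, 64<<10)
	for {
		n := runtime.Stack(buf, true)
		if n < len(buf) {
			return buf[:n]
		}
		buf = make([]byte, 2*len(buf))
	}
}

// dumpOnSignal writes all goroutine stacks to stderr each time sig arrives.
// Unlike SIGQUIT, the process keeps running afterwards.
func dumpOnSignal(sig os.Signal) {
	c := make(chan os.Signal, 1)
	signal.Notify(c, sig)
	go func() {
		for range c {
			os.Stderr.Write(allStacks())
		}
	}()
}

func main() {
	dumpOnSignal(syscall.SIGUSR1)
	// Demonstration only: signal ourselves instead of running
	// `kill -USR1 <pid>` from a shell.
	syscall.Kill(os.Getpid(), syscall.SIGUSR1)
	time.Sleep(200 * time.Millisecond) // give the handler time to print
}
```

This depends on the Go scheduler still being able to run the handler goroutine, so it may stay silent in a hard freeze; in that case SIGQUIT or a core dump remains the fallback.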

Dave Cheney

Jul 10, 2013, 7:35:11 PM
to Rodrigo Kochenburger, golang-nuts, Alexandru Mocioi, Dragos STOICA
Some suggestions:

- build your binary with the race detector enabled, go install -race
$YOURCOMMAND, do the same with your tests, go test -race ./...
- when you get to the point your application is hung, send it a
SIGQUIT, this will panic the process and return a full stack trace of
all the running goroutines, paste it here.
- can you post the source?

Constantin Teodorescu

Jul 11, 2013, 3:57:21 AM
to golan...@googlegroups.com, Rodrigo Kochenburger, Alexandru Mocioi, Dragos STOICA
On Thursday, July 11, 2013 2:35:11 AM UTC+3, Dave Cheney wrote:
> Some suggestions:
>
> - build your binary with the race detector enabled, go install -race
> $YOURCOMMAND, do the same with your tests, go test -race ./...

done that
 
> - when you get to the point your application is hung, send it a
> SIGQUIT, this will panic the process and return a full stack trace of
> all the running goroutines, paste it here.

I'm attaching it as a file. I noticed that ALL THE GOROUTINES are in the [IO wait] or [chan receive] state, with the exception of:
goroutine 40887 [running]:
runtime/race._Cfunc___tsan_read(0xdabffb8, 0xabf020, 0x68670b)
At the moment of the freeze, even the "TIC-TAC clock is working" heartbeat goes quiet; nothing appears in the log!
We have another 3-second loop that does a syscall.Unlink("/tmp/t/still-alive") ... THIS ONE IS STILL WORKING!!!

 
> - can you post the source

Yes, mid.go is attached!
 

On Thu, Jul 11, 2013 at 3:52 AM, Rodrigo Kochenburger <div...@gmail.com> wrote:
> Have you tried running w/ GOMAXPROCS > 1 but w/ the race detector enabled?


Yes! I did that with -race and with 1 processor and 2 processors ... and got the same random freezes! :-(
mid_loop_2013-07-11_02-56-00.log
mid.go

Dave Cheney

Jul 14, 2013, 9:21:10 PM
to Constantin Teodorescu, golang-nuts, Rodrigo Kochenburger, Alexandru Mocioi, Dragos STOICA
Hello,

I had a look at your code and the stack trace and I couldn't see
anything obvious. There was a _lot_ of code to go through so I didn't
read it all but looking at your description I cannot see that there is
a problem on the Go side -- your application appears to be doing work,
or in this case, waiting for input from clients or responses from
couchdb.

You have 36 connections outstanding to couchdb, 28 sending data to the
client and 151 waiting for a response.

Some suggestions

1. look at couchdb, is it complaining about load, connection
exhaustion etc. I don't know couchdb at all so I cannot assist you
there
2. look at your process, have you run out of file descriptors, you
have at least 215 fds in use, the number may be higher if you are also
doing file io from your server process.
3. You probably want to look into setting a deadline with your http
server to reap unused connections.

Cheers

Dave

Andrew Deane

Jul 16, 2013, 4:05:10 AM
to golan...@googlegroups.com, Rodrigo Kochenburger, Alexandru Mocioi, Dragos STOICA
Hi,

In the past I have had similar issues when using unbuffered channels to signal between a series of goroutines.

If you can easily replicate the freeze, try adding buffers to your quit, exitChan, ceasChan, and dispatcherChan channels and check the results. From its name alone I suspect that it is the dispatcherChan that is the real issue.

Andrew.