Crashing without messages - how do I debug?


Tom Maiaroto

Aug 19, 2014, 9:29:21 AM8/19/14
to golan...@googlegroups.com
Hi,

I have an application that's crashing after some time (a few hours), but it leaves no messages in the log output. 
It's a long running application that gathers data from the internet every hour or so. It does this using some goroutines that gather and then store the data to a database and write out to a log file.

I noticed some things were blocking, and an observer pattern I used was leaking memory like crazy. So I cleaned that all up and found a logging solution that works better. Even so, with the logging feature disabled it still crashes. The memory usage doesn't climb immediately, though. It still slowly climbs, but I've seen it come down a little too. I'm still not 100% confident there isn't a small memory leak, of course... but that's just the problem. It's hard for me to track down.

Is there anything I can do to get more details when the application exits? 

Maybe something goes wrong with http.ListenAndServe()? And it's not crashing, but just exiting?
It's using about 40-50% memory when it crashes. I am also running this app on a VPS with 512MB of RAM. I hope I don't need a more powerful server. I wanted to keep this app fairly lean.

Any ideas for more detailed logging would be much appreciated, thanks!


Tamás Gulácsi

Aug 19, 2014, 9:38:27 AM8/19/14
to golan...@googlegroups.com
Dump stderr and the exitcode, too.

James Bardin

Aug 19, 2014, 10:03:42 AM8/19/14
to golan...@googlegroups.com
Try getting rid of the log.Fatal around the ListenAndServe? log.Fatal calls os.Exit, and maybe the process is dying before other messages get printed out.
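For example, something like this (untested sketch; the addresses are just illustrative) keeps the error visible instead of exiting immediately:

```go
package main

import (
	"log"
	"net/http"
)

// run returns the error from ListenAndServe instead of passing it to
// log.Fatal, which calls os.Exit and can cut off other log output and
// deferred cleanup.
func run(addr string) error {
	return http.ListenAndServe(addr, nil)
}

func main() {
	// Using a deliberately bad address here just to show the error path;
	// in the real app this would be your actual listen address.
	if err := run("no-port"); err != nil {
		log.Println("server exited:", err)
	}
	log.Println("process still alive; any final cleanup can run here")
}
```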

Another possibility is it's being killed by the OOM-killer, which would terminate the process immediately. Have you checked syslog/dmesg?

You can also hit it with a SIGQUIT after the memory has grown a bit to force it to crash and print a stack trace, if you're unsure what's going on within your process.

Tom Maiaroto

Aug 19, 2014, 11:45:37 AM8/19/14
to golan...@googlegroups.com
Sorry for being a complete noob here (and I'm going to Google around a bit of course), but how can I dump stderr and the exitcode? 

Also, I believe it is getting killed by the OOM killer (I didn't think to check the syslog). Here's part of the syslog: https://gist.github.com/tmaiaroto/f219889c5f702a76619a

Does this ring any bells? I realized I don't need Postgres running locally (I'm using RDS) so I've stopped that process. I'm not sure that's the problem though, but perhaps I'm opening too many connections?

Thanks guys!

James Bardin

Aug 19, 2014, 11:52:25 AM8/19/14
to Tom Maiaroto, golan...@googlegroups.com
That's absolutely OOM.

You either need more memory, or you need to figure out where you're wasting memory.

If you are leaking something, you can start here http://blog.golang.org/profiling-go-programs, and look at using the "net/http/pprof" package to help you debug the running app.





Tom Maiaroto

Aug 19, 2014, 12:24:39 PM8/19/14
to golan...@googlegroups.com, tom.ma...@gmail.com
I've run the profiling before and saw the goroutines keep going. I thought that was ok though?
How do I ensure they are returning and how can I see those which have not returned?
Did I miss it under one of the links in the web front-end for pprof?

I haven't run pprof in a while though, so I'll re-run it and get a new graph. Is there anything in particular I should look for?

Thanks.

James Bardin

Aug 19, 2014, 1:52:56 PM8/19/14
to Tom Maiaroto, golan...@googlegroups.com
On Tue, Aug 19, 2014 at 12:24 PM, Tom Maiaroto <tom.ma...@gmail.com> wrote:
I've run the profiling before and saw the goroutines keep going. I thought that was ok though?
How do I ensure they are returning and how can I see those which have not returned?

It's not OK if you keep adding to the number of goroutines. They need to return at some point.
 
Did I miss it under one of the links in the web front-end for pprof?

The stack trace will show all the goroutines.


Tom Maiaroto

Aug 19, 2014, 2:20:52 PM8/19/14
to golan...@googlegroups.com, tom.ma...@gmail.com
I see:

profiles:
848 block
4463 goroutine
1180 heap
10 threadcreate

Are these numbers supposed to decrease? Or is it just a count of how many goroutines executed since the app started (meaning they could have returned)?

James Bardin

Aug 19, 2014, 2:32:26 PM8/19/14
to Tom Maiaroto, golan...@googlegroups.com
Each of those is a link you can click on for more detail.
4463 is the number of currently running goroutines.


Tom Maiaroto

Aug 19, 2014, 3:12:32 PM8/19/14
to golan...@googlegroups.com, tom.ma...@gmail.com
Yeah, I've clicked them and looked at them a bit. I just don't know what to look for: what would indicate a goroutine I really need to ensure has returned, which ones were the most expensive, etc.
So those numbers do decrease then?

James Bardin

Aug 19, 2014, 3:43:58 PM8/19/14
to Tom Maiaroto, golan...@googlegroups.com

On Tue, Aug 19, 2014 at 3:12 PM, Tom Maiaroto <tom.ma...@gmail.com> wrote:
Yeah, I've clicked them and looked at them a bit. I just don't know what to look for: what would indicate a goroutine I really need to ensure has returned, which ones were the most expensive, etc.
So those numbers do decrease then?

There's probably a large number of goroutines in the same blocks of code, which would give you a hint as to where they're coming from.

Goroutines act just like function calls, and they need to return to free their resources. If you start a goroutine and it gets blocked somewhere, those resources are never going to be freed. Remember that many things have inherent concurrency -- every request you handle with the http.Server starts at least one goroutine, so does every request with http.Client.

Tom Maiaroto

Aug 19, 2014, 3:52:56 PM8/19/14
to golan...@googlegroups.com, tom.ma...@gmail.com
I see a lot of:

goroutine 10723 [select]:
net/http.(*persistConn).readLoop(0xc208605ce0)
	/usr/local/go/src/pkg/net/http/transport.go:868 +0x829
created by net/http.(*Transport).dialConn
	/usr/local/go/src/pkg/net/http/transport.go:600 +0x93f

goroutine 10845 [chan receive]:
database/sql.(*DB).connectionOpener(0xc208446d80)
	/usr/local/go/src/pkg/database/sql/sql.go:583 +0x48
created by database/sql.Open
	/usr/local/go/src/pkg/database/sql/sql.go:442 +0x27c

goroutine 10724 [select]:
net/http.(*persistConn).writeLoop(0xc208605ce0)
	/usr/local/go/src/pkg/net/http/transport.go:885 +0x38f
created by net/http.(*Transport).dialConn
	/usr/local/go/src/pkg/net/http/transport.go:601 +0x957

... 

I read somewhere that to insert a lot of rows I'd have to wrap a function in a goroutine and pass it a new connection.


I'm wondering if this was incorrect advice... Or maybe only applied to certain database drivers...
Basically I call StoreRow() in this case, from a variety of places (and in a loop with hundreds of results) like so:


...looking at my comments, I just seem to have a lot of questions around best practices for the process of shoving as much data into a database as quickly as possible.

Dave Cheney

Aug 19, 2014, 5:54:29 PM8/19/14
to golan...@googlegroups.com, tom.ma...@gmail.com
I think the database angle is a false positive.

I suspect you're dialing without a timeout, so a slow remote server will cause the number of goroutines to increase with time and you'll get killed by the OOM killer.
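Something like this on the client side bounds the dial (the durations are illustrative, not a recommendation):

```go
package main

import (
	"log"
	"net"
	"net/http"
	"time"
)

// newClient builds an http.Client whose transport gives up on slow
// dials and TLS handshakes instead of hanging forever and pinning a
// goroutine (plus its buffers) in memory.
func newClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			Dial: (&net.Dialer{
				Timeout:   10 * time.Second, // max time to establish the TCP connection
				KeepAlive: 30 * time.Second, // probe interval to detect dead peers
			}).Dial,
			TLSHandshakeTimeout: 10 * time.Second,
		},
	}
}

func main() {
	c := newClient()
	log.Printf("client ready: %T", c.Transport)
}
```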

Tom Maiaroto

Aug 19, 2014, 10:37:08 PM8/19/14
to golan...@googlegroups.com, tom.ma...@gmail.com
It's an RDS Postgres micro... Is that not going to be fast enough? It's only a couple hundred inserts every few minutes. Each of which gets some parsing and analyzing. I use another service across HTTP to get data like geolocation and then save to the database. So I imagine inserts should be somewhat staggered and not 200 every second... Even still, if that was the case, isn't that manageable?

Again, I guess I'm at a loss for finding good examples of how to handle high insert volume. If I dial with a timeout, do you suspect that there will be timeouts in this case then? And if so, I just lose the data? Hmmm

Dave Cheney

Aug 19, 2014, 10:42:00 PM8/19/14
to Tom Maiaroto, golan...@googlegroups.com




On 20 Aug 2014, at 12:37, Tom Maiaroto <tom.ma...@gmail.com> wrote:

It's an RDS Postgres micro... Is that not going to be fast enough? It's only a couple hundred inserts every few minutes. Each of which gets some parsing and analyzing. I use another service across HTTP to get data like geolocation and then save to the database. So I imagine inserts should be somewhat staggered and not 200 every second... Even still, if that was the case, isn't that manageable?

Again, I guess I'm at a loss for finding good examples of how to handle high insert volume. If I dial with a timeout, do you suspect that there will be timeouts in this case then? And if so, I just lose the data? Hmmm

Depends entirely on your application.  I think you probably need to focus on limiting the amount of work being accepted by your server. 

Tom Maiaroto

Aug 19, 2014, 10:42:27 PM8/19/14
to golan...@googlegroups.com, tom.ma...@gmail.com
I mean, I'm going to try to follow this http://www.acloudtree.com/how-to-shove-data-into-postgres-using-goroutinesgophers-and-golang/ as best I can, though I do need to analyze the data before inserting. Perhaps waiting on hundreds of HTTP requests is the issue?

Dave Cheney

Aug 19, 2014, 10:47:12 PM8/19/14
to Tom Maiaroto, golan...@googlegroups.com

Why bother running the function in another goroutine when the caller waits for it to finish? That isn't really concurrency. 
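(Untested sketch, with invented table and column names: a plain transaction with a prepared statement does the same batch of inserts on one pooled connection, with no extra goroutines.)

```go
package main

import (
	"database/sql"
	"errors"
	"fmt"
)

// storeRows inserts a batch inside one transaction: one pooled
// connection, one prepared statement, one commit, no goroutines.
// The table and column names are made up for illustration.
func storeRows(db *sql.DB, values []string) error {
	if db == nil {
		return errors.New("nil db")
	}
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	stmt, err := tx.Prepare("INSERT INTO items (value) VALUES ($1)")
	if err != nil {
		tx.Rollback()
		return err
	}
	defer stmt.Close()
	for _, v := range values {
		if _, err := stmt.Exec(v); err != nil {
			tx.Rollback()
			return err
		}
	}
	return tx.Commit()
}

func main() {
	// No driver is wired up here; this just exercises the error path.
	fmt.Println(storeRows(nil, nil))
}
```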


Tom Maiaroto

Aug 19, 2014, 10:54:08 PM8/19/14
to golan...@googlegroups.com, tom.ma...@gmail.com
Yeah, I yanked that from somewhere. I wasn't completely sure why, but it did make a copy of the session/connection. Was it an attempt to manage the pooling, perhaps? (Which I think is already managed.) I'm not sure. I'm going to try removing that again now.

Tom Maiaroto

Aug 20, 2014, 1:36:50 AM8/20/14
to golan...@googlegroups.com, tom.ma...@gmail.com
Ok, that may have been it. After removing that, the memory usage only spiked to 18% and the program hasn't crashed yet. It took a few minutes (like 10-15) but it's now down to 15%. It's dropping very slowly. Any way to recover more quickly? Or is it ok to let it do its thing?

Thanks for your help. Very much appreciated.

Tom Maiaroto

Aug 20, 2014, 9:26:27 AM8/20/14
to golan...@googlegroups.com, tom.ma...@gmail.com
Ah, still crashing. It just takes a good 5 or 6 hours. Guess I have more areas to keep looking into.

James Bardin

Aug 20, 2014, 9:42:00 AM8/20/14
to Tom Maiaroto, golan...@googlegroups.com
If you have a long-running service communicating over a network, you need to put timeouts everywhere. You may still be leaking resources in other ways, but this is something that can be easily taken care of.

Judging by the name "SocialHarvester", you're probably talking to a number of remote APIs, which from experience will fail in every way possible. So I'll repeat it: everything you do over the network needs a timeout. You need Dial timeouts, timeouts for the http clients, timeouts in the http server, and hopefully the database drivers you're using handle timeouts correctly too.

Once you eliminate the possibility of other systems wasting resources, then it may be easier to trace problems in your own code.





Tom Maiaroto

Aug 20, 2014, 9:47:36 AM8/20/14
to golan...@googlegroups.com, tom.ma...@gmail.com
Yes, it does gather data from other APIs. I didn't realize I should be specifically setting timeouts for everything, that's good to know. I figured there was some sort of default timeout on all requests? I guess even if that was the case, maybe it's too long.
I'll start there, thanks!

James Bardin

Aug 20, 2014, 10:05:18 AM8/20/14
to Tom Maiaroto, golan...@googlegroups.com

On Wed, Aug 20, 2014 at 9:47 AM, Tom Maiaroto <tom.ma...@gmail.com> wrote:
Yes, it does gather data from other APIs. I didn't realize I should be specifically setting timeouts for everything, that's good to know. I figured there was some sort of default timeout on all requests? I guess even if that was the case, maybe it's too long.
I'll start there, thanks!


A couple of default timeouts were added in 1.3, which is nice. The DefaultTransport has a 30s Dial timeout, a 30s KeepAlive interval, and a 10s TLSHandshakeTimeout.

This solves a lot of problems, because a failure while dialing can hang forever before there's an active TCP session, and the keepalive will eventually take down most truly dead TCP connections. Any defaults added for client or server communication, though, are likely to be wrong for most users, and may break existing code.

Also, adding timeouts to (or just logging the timing of) critical sections of your code will likely fish out some problems where things aren't returning in a timely fashion. For example, production systems will almost always log all slow queries at the database level.