500 errors on /disco/ctrl/nodeinfo

Bob Corsaro

Apr 14, 2015, 10:21:34 AM
to disc...@googlegroups.com
I've inherited a disco cluster and there's an issue I'm trying to resolve.

The disco status webpage usually fails to load. In the debug console, requests to /disco/ctrl/nodeinfo fail with a 500, although they occasionally succeed. I also see a lot of 500s on /disco/ctrl/jobinfo. How can I begin to debug these issues?


Tim Spurway

Apr 14, 2015, 6:44:57 PM
to disc...@googlegroups.com
Hi Bob,

Well, I can say that random 500s are a pretty rare occurrence in current disco builds, and they really point to something being wrong in the installation.  I would:

- figure out what version of disco is running
- if you're getting 500s on the console, try the `ddfs` and `disco` commands from the command line and see if errors occur there as well
- capture master logs during 500s
- determine if there are any system resource issues (disk space, cpu, network issues)
- determine if possible to upgrade to latest version
- inspect master logs to find tracebacks to help diagnose root issues
- inspect/capture logs during GC to determine if that is breaking
- do detailed memory inspection during GC to determine if an OOM issue exists on any of the nodes
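
As a starting point for the first few checks, here is a rough Python sketch (not part of disco itself) that hits the two failing endpoints repeatedly, tallies the status codes that come back, and reports disk usage on the machine it runs on. The master URL is an assumption, so point it at wherever your master's web UI actually listens; the endpoint paths are copied from the original post, and the helper names `fetch_status`, `probe` and `disk_used_percent` are made up for this example.

```python
import shutil
import urllib.error
import urllib.request
from collections import Counter

# Assumption: the master's web UI is reachable at this address; adjust as needed.
MASTER_URL = "http://localhost:8989"
# Paths copied from the original post.
ENDPOINTS = ["/disco/ctrl/nodeinfo", "/disco/ctrl/jobinfo"]


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions; the code is still useful.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure and similar.
        return None


def probe(base_url, endpoints, attempts=10, fetch=fetch_status):
    """Request each endpoint `attempts` times and count the status codes seen."""
    results = {}
    for path in endpoints:
        results[path] = Counter(fetch(base_url + path) for _ in range(attempts))
    return results


def disk_used_percent(path="/"):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    for path, counts in probe(MASTER_URL, ENDPOINTS).items():
        print(path, dict(counts))
    print("disk used on /: %.1f%%" % disk_used_percent("/"))
```

Running it on the master and on each node while the 500s are happening should show whether the failures are intermittent and whether they line up with a nearly full disk. It only looks at the local filesystem, so it does not replace reading the master logs.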

please let us know if you can get some more information!

tim

Bob Corsaro

Apr 15, 2015, 12:57:04 PM
to disc...@googlegroups.com, tspu...@gmail.com
I think it was a disk space issue. I added some nodes and deleted a bunch of disco:results:* files from ddfs, and the cluster seems to have settled down. Shouldn't these disco:results files be garbage collected? They had been accumulating since October 2014. Also, the new nodes were added a few days ago and the cluster is still very imbalanced. Is this normal?

All jobs are running with no errors, so I'm mostly happy.