Better way to log Nomad errors?


Justin DynamicD

Jul 6, 2018, 2:37:11 AM
to Nomad
So taking a (very) familiar scenario: if I run a Docker container with an incorrect config, Nomad makes it extremely painful to diagnose the error thanks to poor logging and its tendency to immediately clean up after itself.

Basically what happens is the container gets cleaned up immediately unless I play "quickly log while it's starting so I have a chance to see the errors", because the most detail I get out of nomad logs is:

07/05/18 23:13:09  Terminated  Exit Code: 1, Exit Message: "Docker container exited with non-zero exit code: 1"

This amount of detail is basically useless.  If I remote onto the server to get more details (already less than ideal), I have to beat Nomad's cleanup before it deletes the container and all the logs.  Basically I revert to running a dev Nomad cluster where I can check the logs more rapidly before it has a chance to clean up ... but honestly that's still me chasing the Docker logs.

Is there a way to turn on verbose logging that actually gives us a chance to read the error, without having to recreate the service in a friendlier environment?  Basically, is there a way to get better logging out of Nomad and capture whatever a job may be spewing forth?

Michael Schurter

Jul 9, 2018, 8:04:35 PM
to Justin DynamicD, Nomad
Using a log aggregator like the ELK stack is the recommended way to ensure logs outlive not just the allocation, but also the node the allocation ran on. There are a variety of ways to use and configure log aggregation with Nomad, but unfortunately we don't have any best practices documented at this point.

If you're using Docker, Nomad does support a Docker logging stanza. By default these logs are collected by Nomad and written to the local log files you're familiar with. You can then configure your Docker tasks to log to Logstash using Docker's built-in gelf logging plugin.
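For anyone following along, here's a rough sketch of what that logging stanza can look like in a job file. The image name, tag, and Logstash address are placeholders; adjust for your environment:

```hcl
job "example" {
  datacenters = ["dc1"]

  group "app" {
    task "web" {
      driver = "docker"

      config {
        # Placeholder image
        image = "myorg/web:latest"

        # Ship the container's stdout/stderr to Logstash via
        # Docker's built-in GELF logging driver.
        logging {
          type = "gelf"
          config {
            gelf-address = "udp://logstash.example.internal:12201"
            tag          = "web"
          }
        }
      }

      resources {
        cpu    = 100
        memory = 128
      }
    }
  }
}
```

Note that when you point Docker at a remote logging driver like this, the container's output goes to that endpoint, so it survives Nomad's cleanup of the container itself.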

While we've considered various ways of storing logs past the initial GC of an allocation, it would never be as useful as a full log aggregation system, so we have not prioritized it. Watch issue #1137 for updates (although the duplicate issue #3478 is actually a better description of what I think you're looking for).

--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/nomad/issues
IRC: #nomad-tool on Freenode
---
You received this message because you are subscribed to the Google Groups "Nomad" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nomad-tool+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/nomad-tool/0db12a4d-9672-4242-a1f7-8b9cf57f89c4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Justin DynamicD

Jul 10, 2018, 1:15:53 PM
to Nomad
First off, thank you for your response.  I appreciate the feedback.  I do want to unpack your statements a bit to make sure I understand everything you're saying, as well as clarify my own stance.

While I totally understand that solutions like ELK and Splunk solve my quandary, it's a bit like suggesting a battleship when I'm asking for a boat to cross the river.  This isn't a data-lake scenario; I simply need the last few seconds of logs from a single container.  ELK and the like are wonderful aggregators, but _within_ a tech I should be able to find the same logs.  Again: once things are running I can see logs from the scheduler, and this is good.  I'm simply saying we should be able to see more than a simple exit status 0 or 1 if Nomad is going to aggressively clean up (or at least have the option to delay cleanup by x).  The detail during initialization of a new job is just lacking, and setting up a 3-node ELK stack just for that insight seems like overkill and not necessarily appropriate.  That's one heck of a Vagrant setup for devs ;)

You do mention "By default these logs are collected by Nomad and written to the local log files you're familiar with" ... as I think about it, I guess my issue is fairly unique to the Docker driver.  Other batches/tasks end up with their logs in stdout and are just fine ... but Docker container logs are a different beast, as they die with the container, and the Docker driver aggressively cleans up after itself.  Since the original Nomad job logs remain, this is where my question comes in: is there a way it can grab more than just that exit status?  Once things are running it redirects great, but that initialization gap is a fairly decent-sized pain point.


Thank you again for your response.  

Michael Schurter

Jul 10, 2018, 3:50:58 PM
to Justin DynamicD, Nomad
On Tue, Jul 10, 2018 at 10:15 AM Justin DynamicD <justin....@gmail.com> wrote:
First off, thank you for your response.  I appreciate the feedback.  I do want to unpack your statements a bit to make sure I understand all you're saying as well as clarify my own stance.

My pleasure! Debugging-containers-at-a-distance is hard, and we'd love to make it easier!
 
...


You do mention "By default these logs are collected by Nomad and written to the local log files you're familiar with" ... as I think about it, I guess my issue is fairly unique to the docker driver.  Other batches/tasks will end up with their logs in stdout and be just fine ... but docker container logs are a different beast as they die with the container and the docker driver aggressively cleans up after itself.  As the original nomad job logs remain, this is where my question comes in:  is there a way that it can grab more than just that exit status?  Once things are running it redirects great, but that initialization gap is a fairly decent sized pain point.

Hm, I guess I'm unclear on what initialization gap you're referring to. The task event you included ("Exit Code: 1 ...") occurs when the Docker container's entrypoint runs and then exits with a non-zero status code. Since at that point the entrypoint has already been executed, neither Docker nor Nomad has any additional metadata about why it exited. It's up to the entrypoint's code (e.g. redis, your application, etc.) to log why it's exiting, and those logs are accessible via the nomad logs command, API, and UI until the allocation is garbage collected.
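To make that concrete, here's a sketch of pulling those entrypoint logs before GC kicks in. The job name, allocation ID, and task name below are placeholders, and these commands assume a reachable Nomad cluster:

```shell
# Find the allocation ID for the failing job
# ("example" is a placeholder job name)
nomad job status example

# Dump the task's stderr, where most entrypoints write their
# failure message before exiting
nomad logs -stderr -tail 29f8a237 web

# Or follow stdout live while the task restarts
nomad logs -f 29f8a237 web
```

Even after a task dies, these work until the allocation itself is garbage collected, which is the window Justin is describing.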

Are you asking that Nomad include something like the last line logged in the task event? If so, feel free to open an issue. That seems like an interesting enhancement to pursue.

I'm just having trouble imagining what other data Nomad could provide. I've found myself facing this error a lot too, so if there's a way to improve it, I'd love to!

-schmichael
 

Justin DynamicD

Jul 12, 2018, 6:28:43 PM
to Nomad
That may help (including the last line logged in the task event) ...

I think the best way to demonstrate the gap would be to set up a simple A/B example scenario where you can see exactly where my pain point is.  Unfortunately I'm a little strapped for time, so I won't get to this for a couple of days.  But once I get some time to post/outline the pain point, you may see either:

1.  Some obvious step I'm missing.
2.  More clarity on whether this is a simple thing for Nomad or a much larger undertaking (I fully agree you don't want to go too far down the logging rabbit hole).