How to monitor/alert on container/pod death or restart

David Rosenstrauch

Aug 8, 2018, 3:59:01 PM

to Kubernetes user discussion and Q&A

As we're getting ready to go to production with our k8s-based system,
we're trying to pin down exactly how we're going to do all the needed
monitoring/alerting for it. We can easily collect many of the metrics
we need (using kube-state-metrics to feed into prometheus, and/or
Datadog) and alert off of those.

However, there's other important k8s-related info about our system that
we need to be able to access, monitor, and alert on, most notably things
like:

* If a container crashes and is restarted by k8s

* If k8s kills a container and restarts it (e.g., due to exceeding cpu
or memory limits, or due to repeated failure of liveness check)

* If k8s kills a container but cannot restart it

* If an entire pod crashes and is restarted by k8s

etc.

How would one go about gaining access to those k8s-related events in
an automated fashion, and setting up monitoring/alerting off of those?

Thanks,

DR

Marcio Garcia

Aug 8, 2018, 4:45:15 PM

to kubernet...@googlegroups.com

Hi David,

You can use DataDog to achieve this.

> --
> You received this message because you are subscribed to the Google Groups
> "Kubernetes user discussion and Q&A" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to kubernetes-use...@googlegroups.com.
> To post to this group, send email to kubernet...@googlegroups.com.
> Visit this group at https://groups.google.com/group/kubernetes-users.
> For more options, visit https://groups.google.com/d/optout.
>

David Rosenstrauch

Aug 8, 2018, 5:06:08 PM

to kubernet...@googlegroups.com

Thanks for the response, Marcio. We've actually recently started using
Datadog already. (At least in dev/qa.) But DD is a bit of a sea of
metrics, and I'm not clear how we would accomplish one of the specific
tasks I've mentioned - for example, alerting when k8s has killed a
container or pod. Any pointers on how I might go about setting up an
alert like that?

Thanks,

DR

Tim Hockin

Aug 8, 2018, 5:11:20 PM

to Kubernetes user discussion and Q&A

Most of what you're asking for is available via the k8s API, if you watch it.

On Wed, Aug 8, 2018 at 12:58 PM David Rosenstrauch <dar...@darose.net> wrote:

> As we're getting ready to go to production with our k8s-based system,
> we're trying to pin down exactly how we're going to do all the needed
> monitoring/alerting for it. We can easily collect many of the metrics
> we need (using kube-state-metrics to feed into prometheus, and/or
> Datadog) and alert off of those.
>
> However, there's other important k8s-related info about our system that
> we need to be able to access, monitor, and alert on, most notably things
> like:
>
> * If a container crashes and is restarted by k8s

Represented in the pod.status block.

> * If k8s kills a container and restarts it (e.g., due to exceeding cpu
> or memory limits, or due to repeated failure of liveness check)

Also in pod.status.

> * If k8s kills a container but cannot restart it

In pod.status and/or events, depending on exactly what you want to know.

> * If an entire pod crashes and is restarted by k8s

There's not really a concept of a pod "crashing", just containers being restarted.

> etc.
>
> How would one go about gaining access to those k8s-related events in
> an automated fashion, and setting up monitoring/alerting off of those?
>
> Thanks,
>
> DR
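
To make the pod.status pointers above a bit more concrete, here is a minimal Python sketch that inspects one pod object as parsed JSON (for example, what the API server returns for a pod). The field names (containerStatuses, restartCount, lastState.terminated, state.waiting) come from the Kubernetes Pod API; the function name container_alerts and the SAMPLE_POD data are hypothetical and exist only for illustration.

```python
from typing import Dict, List


def container_alerts(pod: Dict) -> List[str]:
    """Return human-readable findings for one pod object (parsed JSON)."""
    findings = []
    name = pod.get("metadata", {}).get("name", "<unknown>")
    for cs in pod.get("status", {}).get("containerStatuses", []):
        cname = cs.get("name", "<unknown>")
        restarts = cs.get("restartCount", 0)
        # lastState.terminated describes the previous run of the container,
        # including why it ended (e.g. OOMKilled, Error) and its exit code.
        last = cs.get("lastState", {}).get("terminated")
        if restarts > 0 and last:
            findings.append(
                f"{name}/{cname}: restarted {restarts}x, last exit "
                f"reason={last.get('reason')} exitCode={last.get('exitCode')}"
            )
        # state.waiting with reason CrashLoopBackOff means the kubelet is
        # delaying further restarts of a repeatedly failing container.
        waiting = cs.get("state", {}).get("waiting")
        if waiting and waiting.get("reason") == "CrashLoopBackOff":
            findings.append(f"{name}/{cname}: in CrashLoopBackOff")
    return findings


# Hypothetical sample data, not taken from a real cluster.
SAMPLE_POD = {
    "metadata": {"name": "web-0"},
    "status": {
        "containerStatuses": [
            {
                "name": "app",
                "restartCount": 3,
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {
                    "terminated": {"reason": "OOMKilled", "exitCode": 137}
                },
            },
            {
                "name": "sidecar",
                "restartCount": 0,
                "state": {"running": {}},
                "lastState": {},
            },
        ]
    },
}

if __name__ == "__main__":
    for line in container_alerts(SAMPLE_POD):
        print(line)
```

In a real setup the pod objects would come from listing or watching pods through the API rather than from a literal dict.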

Marcio Garcia

Aug 8, 2018, 5:16:03 PM

to kubernet...@googlegroups.com

David,

In Datadog events you can see the killed pods.

But if you have containers that need to be killed because they don't exit when they receive a stop signal, you'll see a lot of events like KILLED and DESTROYED. These are not necessarily errors; they could just be a container being restarted. Keep that in mind.

Agrawal, Punit

Aug 8, 2018, 5:34:22 PM

to kubernet...@googlegroups.com

David,

What we do is export the Kubernetes cluster events to Cloud PubSub using Stackdriver Export, and then we have SumoLogic set up to ingest logs from PubSub.

Then we use the SumoLogic Scheduled Search Capabilities to send alerts based on certain events.

Punit Agrawal

Site Reliability Engineer, Lead

New Product Development
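
For anyone wanting to try the event-based approach described above, here is a rough Python sketch of the filtering step only: pick out the Kubernetes events worth forwarding and serialize them as JSON lines for whatever log pipeline you use. The event fields (type, reason, message, count, involvedObject) are from the Kubernetes Event API. The ALERT_REASONS set is an assumed starting point rather than a complete list, and the function names and SAMPLE_EVENTS are hypothetical.

```python
import json
from typing import Dict, Iterable, List

# Assumed set of event reasons worth alerting on; adjust for your cluster.
ALERT_REASONS = {"Killing", "BackOff", "Unhealthy", "FailedScheduling"}


def select_alert_events(events: Iterable[Dict]) -> List[Dict]:
    """Keep only events whose reason is in ALERT_REASONS, flattened."""
    selected = []
    for ev in events:
        if ev.get("reason") not in ALERT_REASONS:
            continue
        obj = ev.get("involvedObject", {})
        selected.append({
            "reason": ev.get("reason"),
            "type": ev.get("type"),
            "kind": obj.get("kind"),
            "namespace": obj.get("namespace"),
            "name": obj.get("name"),
            "message": ev.get("message"),
            "count": ev.get("count", 1),
        })
    return selected


def to_json_lines(records: Iterable[Dict]) -> str:
    """Serialize records one JSON object per line, ready to ship to a log sink."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


# Hypothetical sample events, shaped like Kubernetes Event objects.
SAMPLE_EVENTS = [
    {"type": "Normal", "reason": "Killing", "count": 1,
     "message": "example: container is being stopped",
     "involvedObject": {"kind": "Pod", "namespace": "default", "name": "web-0"}},
    {"type": "Normal", "reason": "Pulled", "count": 1,
     "message": "example: image pulled",
     "involvedObject": {"kind": "Pod", "namespace": "default", "name": "web-0"}},
    {"type": "Warning", "reason": "BackOff", "count": 5,
     "message": "example: back-off restarting failed container",
     "involvedObject": {"kind": "Pod", "namespace": "default", "name": "web-1"}},
]

if __name__ == "__main__":
    print(to_json_lines(select_alert_events(SAMPLE_EVENTS)))
```

Note that a "Killing" event has type Normal and also appears during ordinary rollouts, so alerting on it alone is likely to be noisy.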

Rodrigo Campos

Aug 8, 2018, 9:58:11 PM

to kubernet...@googlegroups.com

It really depends on the monitoring solution. Usually these metrics are exported and you can just predicate on them, in the language the solution provides.

In my case, I'm using a hosted solution (signalfx) that gives you a DaemonSet which sends those metrics to them. You can then predicate on them. We have alerts on restarts increasing significantly, on the number of pods ready, on average cpu used for each app, etc.

Does this help?
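
As an illustration of the "restarts increase significantly" kind of predicate, here is a small Python sketch that compares two snapshots of per-container restart counters and flags large jumps. Everything in it is an assumption for illustration: the function name restart_increases, the (namespace, pod, container) key, and the default threshold of 3. A hosted monitoring product would express the same idea in its own rule language.

```python
from typing import Dict, Tuple

Key = Tuple[str, str, str]  # (namespace, pod, container)


def restart_increases(
    previous: Dict[Key, int],
    current: Dict[Key, int],
    threshold: int = 3,
) -> Dict[Key, int]:
    """Return containers whose restart count grew by at least `threshold`."""
    flagged = {}
    for key, now in current.items():
        before = previous.get(key, 0)
        # A value lower than before means the counter was reset (for example
        # a pod recreated under the same name), so count from zero instead.
        delta = now - before if now >= before else now
        if delta >= threshold:
            flagged[key] = delta
    return flagged


if __name__ == "__main__":
    # Hypothetical snapshots taken a few minutes apart.
    prev = {("default", "web-0", "app"): 2, ("default", "web-1", "app"): 7}
    cur = {
        ("default", "web-0", "app"): 6,
        ("default", "web-1", "app"): 1,
        ("default", "db-0", "pg"): 0,
    }
    print(restart_increases(prev, cur))
```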

Chaitanya Potu

Sep 6, 2018, 3:42:57 PM

to Kubernetes user discussion and Q&A

Use Prometheus and Alertmanager to set up this kind of monitoring and alerting.
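
A minimal sketch of what that could look like, assuming kube-state-metrics is already being scraped: the Python below runs a PromQL expression over its kube_pod_container_status_restarts_total counter through the Prometheus HTTP API (/api/v1/query) and parses the result. The 15m window, the base URL, the function names, and SAMPLE_RESPONSE are assumptions for illustration. In practice the same expression would more likely live in a Prometheus alerting rule routed through Alertmanager than in a script.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List

# Containers whose restart counter grew during the last 15 minutes.
# The metric is exported by kube-state-metrics; the window is an assumption.
QUERY = "increase(kube_pod_container_status_restarts_total[15m]) > 0"


def build_query_url(base_url: str, query: str = QUERY) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": query})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def parse_vector(body: Dict) -> List[Dict]:
    """Flatten an instant-vector response into one dict per container."""
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error')}")
    rows = []
    for sample in body["data"]["result"]:
        labels = sample["metric"]
        _timestamp, value = sample["value"]  # the value arrives as a string
        rows.append({
            "namespace": labels.get("namespace"),
            "pod": labels.get("pod"),
            "container": labels.get("container"),
            "restarts": float(value),
        })
    return rows


def fetch_restarting_containers(base_url: str) -> List[Dict]:
    """Query a live Prometheus server (base_url is deployment-specific)."""
    with urllib.request.urlopen(build_query_url(base_url), timeout=10) as resp:
        return parse_vector(json.load(resp))


# Hypothetical response in the shape the API returns for an instant vector.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"namespace": "default", "pod": "web-0", "container": "app"},
                "value": [1533762000.0, "2"],
            }
        ],
    },
}

if __name__ == "__main__":
    print(parse_vector(SAMPLE_RESPONSE))
```

Because increase() extrapolates over the window, the returned values can be fractional, so comparing against 0 is safer than testing for exact integers.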

Agrawal, Punit

Sep 6, 2018, 4:02:59 PM

to kubernet...@googlegroups.com

We pipe the k8s events into sumologic using an HTTP collector and then use sumologic alerting.

punit agrawal

dev-ops lead

new product development

ebay
