Custom monitoring with stackdriver

jorge.g...@endocode.com

Nov 3, 2018, 4:48:46 AM

to Google Stackdriver Discussion Forum

Hi,

I'm planning to migrate from Nagios to Stackdriver and I can't find documentation on how to do it. Like everyone, my applications run on custom ports, and I have to monitor whether they're open and listening. It would also be ideal if I could execute my own scripts or binaries (the exec plugin, perhaps?).

Finally, I see no option to force the execution of a check from the Stackdriver console or dashboard.

Is anyone else suffering from the same disease?

Thanks and best regards.

Igor Peshansky

Nov 3, 2018, 12:46:10 PM

to jorge.g...@endocode.com, Google Stackdriver Discussion Forum

Stackdriver works differently than Nagios. In Stackdriver, you ingest metrics from a VM by installing an agent on your VM to scrape the system and applications and write values to Stackdriver periodically, or by adding custom code that uses the client libraries to write metrics. Then you can set up dashboards that plot the ingested metrics as graphs or set up alerts to notify you when some metric values meet certain conditions.

The Stackdriver Linux monitoring agent has the exec plugin bundled, so you can enable it and configure it to scrape metrics. Then you can follow https://cloud.google.com/monitoring/agent/custom-metrics-agent to ingest whatever you're scraping into Stackdriver. You can do the same with any of the other bundled plugins that you can make work for you. For system entities and any of the third-party applications mentioned at https://cloud.google.com/monitoring/agent/plugins/, the Stackdriver Linux agent also ingests a curated set of metrics that have predefined dashboards.

If you want to verify that your externally serving application is up and available, you can instead add a Stackdriver Uptime check (https://cloud.google.com/monitoring/uptime-checks/management), which will probe your application's externally visible address periodically from outside of your VM and verify that it's serving correctly. You can then set up alerts in Stackdriver to notify you when the uptime checks fail consistently.

Hope this helps.

Igor
-- sent from a mobile device, please excuse tyops and omissns


Jorge González

Nov 5, 2018, 7:19:53 AM

to ig...@google.com, google-stackdr...@googlegroups.com

Hi Igor,

Many thanks for your reply, it's very helpful but only half of what I need.

From your email, I take it the answer on active checks is no: Stackdriver has no active checks other than those configured from the console (not the agent), and those are very simple. Also, no, checks cannot be forced to execute. Monitoring is therefore mostly left to the passive agent, which reports on whatever is being monitored.

I spent about 3 hours reading the custom metrics documentation and trying to understand and follow it, and I failed miserably. Perhaps I had a bad 3 hours, or maybe it's just not very easy to do.

Yes, I know I can use uptime checks to monitor externally serving ports. The problem is internal ports, and I have many (as, I believe, do a lot of people; it's an extremely common practice). So moving from Nagios (or something Nagios-like) to Stackdriver seems to be a very complicated process that doesn't even deliver the same functionality.

Thanks again Igor, and best regards.


Igor Peshansky

Nov 5, 2018, 8:19:23 AM

to Jorge González, Google Stackdriver Discussion Forum

Actually, the way I read the Nagios docs, the Stackdriver agent performs active checks (namely, it pings the collection points periodically, rather than have an external service initiate the checks by prodding the agent). The concepts don't quite map because one does not then ask the agent for the data — rather, the agent sends the information to the Stackdriver servers.

If you give us an example of what you're trying to do with active checks, we may be able to suggest a way to accomplish that with the Stackdriver agent.

Igor
-- sent from a mobile device, please excuse tyops and omissns

Jorge González

Nov 5, 2018, 8:25:53 AM

to ig...@google.com, google-stackdr...@googlegroups.com

Hi Igor,

Thanks again for your answer. What I would like to know is how to re-run a single check to get updated values instantly, instead of waiting for the agent's next run (of all its configured checks).

Let's say I get a KO for a service or check (e.g., "port 1234 is open"), and I want to recheck immediately whether that's really the case or whether it was a false positive (for whatever reason).

In any case, I take it my other assumptions were correct, right?

Best regards.

Igor Peshansky

Nov 5, 2018, 9:16:33 AM

to Jorge González, Google Stackdriver Discussion Forum, Joy Wang

Jorge,

As you've found, there's no way to trigger an agent collection manually. Even with Uptime Monitoring, the checks are automatic and periodic. The general way to deal with false positives in Stackdriver (or any other timeseries-based system) is to set up your alert conditions to compensate for an odd failure (e.g., alert if 3 out of 5 checks fail, but there are better ways).
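To make the "3 out of 5" idea concrete, here is a minimal sketch of the rule in bash. It only illustrates the logic: in Stackdriver the rule would live in the alerting policy's condition, not in a script. The function name `should_alert` and the 1 = pass / 0 = fail encoding are assumptions made for this example.

```shell
#!/bin/bash
# should_alert THRESHOLD RESULT...
# Each RESULT is 1 (check passed) or 0 (check failed).
# Succeeds (exit 0) when at least THRESHOLD of the results are failures.
should_alert() {
  local threshold="$1"
  shift
  local failures=0 result
  for result in "$@"; do
    if [ "$result" = "0" ]; then
      failures=$((failures + 1))
    fi
  done
  [ "$failures" -ge "$threshold" ]
}

# Last five results, oldest first: three failures, so the rule fires.
if should_alert 3 1 0 0 1 0; then
  echo "alert"
else
  echo "ok"
fi
```

A single failed check among five (for example `should_alert 3 1 1 0 1 1`) does not fire, which is what filters out the odd false positive.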

Probing internal endpoints is on the Uptime Monitoring roadmap, but I'll let our PMs talk about timelines. In the meantime, you can simulate an uptime check by writing an exec script whose data collection will curl an endpoint (or a set of endpoints) and verify the results.

You can definitely implement a custom collection mechanism where your exec script does not report a point until it's tried the corresponding endpoint a few times.
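A minimal sketch of such a retrying probe, assuming bash is available on the VM. The function name `probe_with_retries`, the retry count, the delay, and the health URL in the usage note are all hypothetical.

```shell
#!/bin/bash
# probe_with_retries ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND up to ATTEMPTS times. Prints 1 as soon as one attempt
# succeeds, or 0 once every attempt has failed, so a single transient
# failure never produces a "down" data point.
probe_with_retries() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@" >/dev/null 2>&1; then
      echo 1
      return 0
    fi
    # Wait between attempts, but not after the last one.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  echo 0
}
```

For an internal HTTP endpoint this could be called as `probe_with_retries 3 2 curl -fsS --max-time 5 http://localhost:8080/healthz`, and the printed 0 or 1 is the value the exec script would then report.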

Hope this helps.

Igor
-- sent from a mobile device, please excuse tyops and omissns

Jorge González

Nov 19, 2018, 12:39:28 PM

to luis.gruber...@circuitsandbox.net, Igor Peshansky, google-stackdr...@googlegroups.com, j...@google.com

Hi,

I have developed a way to get mostly what I want, though I strongly dislike it, so I would like to ask for a better direction if possible.

I have a script that generates a JSON data file, something like:

#!/bin/bash

echo '{
  "timeSeries": [
    {
      "metric": {
        "type": "custom.googleapis.com/port-open",
        "labels": {
          "my_label": "port"
        }
      },
      "resource": {
        "type": "gce_instance",
        "labels": {
          "project_id": "my-project",
          "instance_id": "my-instance",
          "zone": "my-zone-"
        }
      },
      "points": [
        {
          "interval": {
            "endTime": "'$(date -Iseconds)'"
          },
          "value": {
            "doubleValue": "'$(netstat -antl | grep LIST | grep -w 8080 | wc -l)'"
          }
        }
      ]
    }
  ]
}' > /tmp/data.json

Another script gets an OAuth token, runs the previous script, and sends the data:

#!/bin/bash

# Generate the data file (/tmp/data.json)
bash data.sh 2>/dev/null

# Get an OAuth access token (stderr is silenced inside the substitution)
ACCESS_TOKEN="$(gcloud auth application-default print-access-token 2>/dev/null)"

# Send the data
curl -vX POST "https://monitoring.googleapis.com/v3/projects/my-project/timeSeries" \
  -H "Authorization: Bearer ${ACCESS_TOKEN}" \
  -H "Content-Type: application/json" \
  -d @/tmp/data.json 2>/dev/null

Since I am sending the data manually, I don't need the Stackdriver agent, just something (a while loop would do) to make sure the script is executed every X. But there has to be an easier way to do this with the agent, right?

Thanks and kind regards.

Igor Peshansky

Nov 19, 2018, 12:57:26 PM

to jorge.g...@endocode.com, luis.gruber...@circuitsandbox.net, google-stackdr...@googlegroups.com, j...@google.com

You can use the exec plugin to execute your netstat command and generate the data. You would then use the custom metrics via the agent mechanism to redirect those exec metrics into your chosen custom metric ("custom.googleapis.com/port-open"). The agent will take care of periodically invoking your command and filling in the rest of the request (monitored resource, OAuth token, etc).
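As a rough sketch of the script side of this, assuming the agent's collectd exec plugin and its plain-text `PUTVAL` protocol: the script below counts listening sockets on a port and prints one value per interval. The function names and the `exec-port_open` / `gauge-<port>` identifier are hypothetical choices for this example. `COLLECTD_HOSTNAME` and `COLLECTD_INTERVAL` are environment variables that the exec plugin sets for the scripts it runs. The redirection of these values into "custom.googleapis.com/port-open" is done in the agent configuration, as described on the custom metrics page, and is not shown here.

```shell
#!/bin/bash
# Count sockets in LISTEN state on PORT, reading `netstat -antl` output on stdin.
# Only the local address column ($4) is matched, so a remote port with the
# same number is not counted.
count_listeners() {
  local port="$1"
  awk -v p="$port" '$6 == "LISTEN" && $4 ~ (":" p "$") { n++ } END { print n + 0 }'
}

# Format one value in collectd's plain-text protocol.
# Identifier layout: host/plugin-plugin_instance/type-type_instance
format_putval() {
  local host="$1" interval="$2" port="$3" value="$4"
  printf 'PUTVAL "%s/exec-port_open/gauge-%s" interval=%s N:%s\n' \
    "$host" "$port" "$interval" "$value"
}

# Report the listener count for PORT once per interval, forever.
run_forever() {
  local port="$1"
  local host="${COLLECTD_HOSTNAME:-$(hostname -f)}"
  local interval="${COLLECTD_INTERVAL:-60}"
  while true; do
    format_putval "$host" "$interval" "$port" \
      "$(netstat -antl | count_listeners "$port")"
    sleep "$interval"
  done
}

# Uncomment when this file is installed as the exec plugin's target:
# run_forever 8080
```

With this approach the script only produces the number. The agent supplies the monitored resource, the timestamp, and the OAuth token, which is everything the hand-built JSON request had to spell out.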

Note that the tcpconns collectd plugin, which ships with the agent, can basically do the equivalent of your netstat command. It would interfere somewhat with the default config, so you'd need to use the aggregation filter instead of "AllPortsSummary true", but it's doable. You might find the exec approach easier.

Igor

Jorge González

Nov 19, 2018, 1:43:09 PM

to Igor Peshansky, luis.gruber...@circuitsandbox.net, google-stackdr...@googlegroups.com, j...@google.com

Hi Igor,

Thank you once again for your quick answer.

My problem is that I find the exec plugin extremely complicated. I have tried to follow the documentation for adding it and failed miserably. I'm not sure whether I'm slow or incapable of understanding it, or whether the documentation just doesn't help: adding a custom metric seems terribly complicated, and I see no easy, useful example either.

Best regards.

Igor Peshansky

Nov 20, 2018, 2:17:21 PM

to Jorge González, luis.gruber...@circuitsandbox.net, google-stackdr...@googlegroups.com, Joy Wang

Thanks, Jorge, that's good feedback. I've opened an internal tracking bug to improve our documentation with a more complete end-to-end example. Stay tuned.

Igor

Jorge González

Nov 21, 2018, 5:41:35 AM

to Igor Peshansky, luis.gruber...@circuitsandbox.net, google-stackdr...@googlegroups.com, j...@google.com

Thanks one more time! Keep me posted, please.
