How to monitor a batch job with Prometheus, Pushgateway and Grafana?

Benjamin BALET

Jul 27, 2016, 4:06:20 PM
to Prometheus Developers
Hi,

I'm new here and I saw that someone suggested writing a tutorial about batch jobs. I'm studying Prometheus and I have some questions about that topic.

I have some batch jobs written in PowerShell (the language doesn't matter here, as the Pushgateway has a built-in HTTP API that I can use with Invoke-WebRequest). Of course, I don't want to feed Prometheus with log events but with some metrics, such as:

 1. Date/time of the last execution and its outcome.
 2. Duration of the execution.
 3. In some cases, the number of records (e.g. one of my batch jobs updates the list of users, and it is interesting to know the actual number of users and to correlate other data with that metric).

Let's take the example of a batch job whose role is to "update_users". This job is used with two ALM systems that we will call "QC" and "PC" (the same code is used, but with two targets).

I'm a bit confused by the definitions in the documentation, mainly because English is not my native language.

The job is "update_users", but can I use any string for the instance name? I mean something other than an IP address, for example "QC" and "PC" instead of the server name?

Now comes the problem of the metric types:

 1. Is the outcome a counter or a gauge? If it is a counter, I will need two counters: one for the number of successful executions and another for the number of failed executions. If it is a gauge, can it have two values, 0 or 1, for OK and KO?
 2. The duration of execution should be a gauge with a timestamp.
 3. The number of records should be a gauge with a timestamp.

If I'm correct, the message that I should push would have this content:

    # HELP job_update_users_outcome Outcome of the batch job
    # TYPE job_update_users_outcome counter
    job_update_users_outcome{label="OK"} 1 1398355504000
    # Another example with an unfortunate outcome
    job_update_users_outcome{label="KO"} 1 1398355504000
    # HELP job_update_users_duration duration of the script execution in seconds.
    # TYPE job_update_users_duration gauge
    job_update_users_duration 2398.28 1398355504000
    # HELP job_update_users_records number of records.
    # TYPE job_update_users_records gauge
    job_update_users_duration 3000 1398355504000
    EOF

Will I get meaningful graphs in Grafana with this approach, showing the three metrics I described at the beginning of the question?

Benjamin

Brian Brazil

Jul 27, 2016, 4:22:04 PM
to Benjamin BALET, Prometheus Developers
On 27 July 2016 at 21:06, Benjamin BALET <benjami...@gmail.com> wrote:
> The job is "update_users", but can I use any string for the instance name? I mean something other than an IP address, for example "QC" and "PC" instead of the server name?

I'd not have an instance label, as this is not a single process you're monitoring. Instead, have some other label with QC/PC.
 
> Now comes the problem of the metric types:
>
>  1. Is the outcome a counter or a gauge? If it is a counter, I will need two counters: one for the number of successful executions and another for the number of failed executions. If it is a gauge, can it have two values, 0 or 1, for OK and KO?

Gauge with 0 or 1, there'd be no label on it.

It's best to also export the time of the last success, as that's what you'd want to alert on.

As you want the last execution, also export that as a Gauge.
 
>  2. The duration of execution should be a gauge with a timestamp.
>  3. The number of records should be a gauge with a timestamp.

Yes, however don't use timestamps.

Brian
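
The advice in this reply could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: `run_job` is a hypothetical callable that returns the number of records processed and raises on failure, and the metric names are made up for the example. Every value is a plain gauge without a sample timestamp, and the last-success time is only produced when the job succeeds.

```python
import time


def collect_metrics(run_job, clock=time.time):
    """Run a batch job and return its metrics as a dict of gauge values.

    `run_job` is a hypothetical callable: it returns the number of records
    processed and raises an exception on failure. `clock` returns Unix time
    in seconds and is injectable so the function can be tested.
    """
    start = clock()
    metrics = {}
    try:
        records = run_job()
        metrics["update_users_outcome"] = 1  # 1 = success
        metrics["update_users_records"] = records
        succeeded = True
    except Exception:
        metrics["update_users_outcome"] = 0  # 0 = failure
        succeeded = False
    end = clock()
    metrics["update_users_duration_seconds"] = end - start
    # Time of the last execution, whatever the outcome.
    metrics["update_users_last_run_timestamp_seconds"] = end
    if succeeded:
        # Time of the last success: the value to alert on.
        metrics["update_users_last_success_timestamp_seconds"] = end
    return metrics
```

The timestamps here are metric values (Unix time in seconds), not the optional per-sample timestamps of the exposition format, which is the distinction the reply draws.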
 



Julius Volz

Jul 27, 2016, 5:02:08 PM
to Brian Brazil, Benjamin BALET, Prometheus Developers
Also check out the best practices around monitoring batch jobs: https://prometheus.io/docs/practices/instrumentation/#batch-jobs

Julius Volz

Jul 27, 2016, 5:05:49 PM
to Brian Brazil, Benjamin BALET, Prometheus Developers
Yeah, don't include client-side sample timestamps. *Do* send the last successful run timestamp as the *value* of a metric though.

Then you can alert on an expression like:

  time() - myjob_last_successful_run_timestamp_seconds > 3600

...to alert you when the batch job hasn't run in an hour, for example. See also the example about that in https://www.digitalocean.com/community/tutorials/how-to-query-prometheus-on-ubuntu-14-04-part-2
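
To make the arithmetic of that expression concrete, here is a small Python sketch of the same comparison. The function name and parameters are hypothetical; Prometheus evaluates the expression itself, so this only mirrors the logic.

```python
import time


def job_is_stale(last_success_timestamp_seconds, now=None, max_age_seconds=3600):
    """Mirror of `time() - last_success > 3600`.

    Returns True when the last successful run is older than `max_age_seconds`.
    Both timestamps are Unix times in seconds.
    """
    if now is None:
        now = time.time()
    return now - last_success_timestamp_seconds > max_age_seconds
```

Note that the comparison only works when the metric value is in seconds, like `time()`. A value in milliseconds makes the difference negative, so the condition would never be true.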

Benjamin BALET

Jul 28, 2016, 3:22:56 AM
to Julius Volz, Brian Brazil, Prometheus Developers
Hi,

Thank you for the replies. I've read the documentation, but a tutorial would be nice for noobs like me.

So if I understood correctly:
1. For "instance", I can optionally send the server where the job was executed, but it is not mandatory.
2. In my example, all metric types are gauge.
3. If I want to distinguish the job that targets my "PC" system from the one that targets "QC", I'd label the metrics. For example, {label="QC"} or {label="PC"}.
4. The last successful execution is a timestamp sent as a metric of type "gauge" in case of success.

I'd send this message in case of success:

    # HELP job_update_users_outcome Outcome of the batch job (0=failed, 1=success).
    # TYPE job_update_users_outcome gauge
    job_update_users_outcome{label="QC"} 1
    # HELP job_update_users_duration duration of the script execution in seconds.
    # TYPE job_update_users_duration gauge
    job_update_users_duration{label="QC"} 2398.280
    # HELP job_update_users_records number of records.
    # TYPE job_update_users_records gauge
    job_update_users_duration{label="QC"} 3000
    # HELP job_update_users_last_successful_run_timestamp Last successful run (not sent if failed).
    # TYPE job_update_users_last_successful_run_timestamp gauge
    job_update_users_last_successful_run_timestamp{label="QC"} 1398355504000
    EOF

But if my job fails, I would send:

    # HELP job_update_users_outcome Outcome of the batch job (0=failed, 1=success).
    # TYPE job_update_users_outcome gauge
    job_update_users_outcome{label="QC"} 0
    # HELP job_update_users_duration duration of the script execution in seconds.
    # TYPE job_update_users_duration gauge
    job_update_users_duration{label="QC"} 2541.333
    # HELP job_update_users_records number of records.
    # TYPE job_update_users_records gauge
    job_update_users_duration{label="QC"} 3000
    EOF

If my job targets the "PC" system and fails, I'd send:

    # HELP job_update_users_outcome Outcome of the batch job (0=failed, 1=success).
    # TYPE job_update_users_outcome gauge
    job_update_users_outcome{label="PC"} 0
    # HELP job_update_users_duration duration of the script execution in seconds.
    # TYPE job_update_users_duration gauge
    job_update_users_duration{label="PC"} 3331.124
    # HELP job_update_users_records number of records.
    # TYPE job_update_users_records gauge
    job_update_users_duration{label="PC"} 2800
    EOF

Brian Brazil

Jul 28, 2016, 3:43:00 AM
to Benjamin BALET, Julius Volz, Prometheus Developers
On 28 July 2016 at 08:22, Benjamin BALET <benjami...@gmail.com> wrote:
> Hi,
>
> Thank you for the replies. I've read the documentation, but a tutorial would be nice for noobs like me.
>
> So if I understood correctly:
> 1. For "instance", I can optionally send the server where the job was executed, but it is not mandatory.
> 2. In my example, all metric types are gauge.
> 3. If I want to distinguish the job that targets my "PC" system from the one that targets "QC", I'd label the metrics. For example, {label="QC"} or {label="PC"}.
> 4. The last successful execution is a timestamp sent as a metric of type "gauge" in case of success.
>
> I'd send this message in case of success:


I don't think you want the instance label here (if you do, you should be using the node exporter rather than the pushgateway).
Put the "label" here as a grouping label, otherwise QC and PC will overwrite each other.
 
>     # HELP job_update_users_outcome Outcome of the batch job (0=failed, 1=success).
>     # TYPE job_update_users_outcome gauge
>     job_update_users_outcome{label="QC"} 1

No need for the job_ prefix here, it's not adding anything.
 
>     # HELP job_update_users_duration duration of the script execution in seconds.
>     # TYPE job_update_users_duration gauge
>     job_update_users_duration{label="QC"} 2398.280

We'd recommend putting _seconds in the name for clarity.
 
>     # HELP job_update_users_records number of records.
>     # TYPE job_update_users_records gauge
>     job_update_users_duration{label="QC"} 3000

You've the wrong metric name here.
 
>     # HELP job_update_users_last_successful_run_timestamp Last successful run (not sent if failed).
>     # TYPE job_update_users_last_successful_run_timestamp gauge
>     job_update_users_last_successful_run_timestamp{label="QC"} 1398355504000

This should be in seconds, as that's what the time() function uses and is the standard for this sort of metric (and Prometheus generally).

Brian
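
Putting the review comments together, a push could look roughly like the Python sketch below. It is an illustration, not a reference implementation: the function names are hypothetical, the Pushgateway address in the usage note is an assumption, and the sketch assumes label values without a `/`. The grouping label goes into the URL path rather than the body, metric names drop the `job_` prefix, the duration carries `_seconds`, and the timestamp value is in seconds.

```python
import urllib.request
from urllib.parse import quote


def render_metrics(metrics):
    """Render {name: (help_text, value)} in the text exposition format.

    All metrics are declared as gauges and no sample timestamps are written.
    """
    lines = []
    for name, (help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def push_url(base_url, job, grouping):
    """Build the push URL: /metrics/job/<job>/<label>/<value>/...

    Grouping labels live in the path, so QC and PC land in separate groups.
    Assumes label values contain no '/'.
    """
    path = "".join(
        f"/{quote(key, safe='')}/{quote(value, safe='')}"
        for key, value in grouping.items()
    )
    return f"{base_url.rstrip('/')}/metrics/job/{quote(job, safe='')}{path}"


def push(base_url, job, grouping, metrics):
    """POST the metrics to the Pushgateway and return the HTTP status code."""
    request = urllib.request.Request(
        push_url(base_url, job, grouping),
        data=render_metrics(metrics).encode("utf-8"),
        method="POST",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

With a Pushgateway assumed to listen on `http://localhost:9091`, a successful QC run might be pushed as `push("http://localhost:9091", "update_users", {"label": "QC"}, {"update_users_outcome": ("Outcome of the batch job (0=failed, 1=success).", 1), "update_users_last_success_timestamp_seconds": ("Last successful run.", 1398355504)})`. POST is used here because it only replaces metrics with the same name within the group, so a failed run that omits the last-success metric leaves the previously pushed value in place; PUT would replace the whole group.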


