Help with PromQL for monitoring a kubernetes cron job

Karsten Köhler

Jan 27, 2020, 11:13:43 AM
to Prometheus Users
Hey there,

I am trying to set up monitoring for a kubernetes cron job and have some trouble with the query language. This is the current query I have:

topk(1, kube_job_created{job_name=~"some-cron-job.+"}) == on(job_name) group_right kube_job_status_succeeded


With that query I wanted to fetch the name of the job that was executed last and then match that name to the job's exit status. Unfortunately, the query returns no results. I checked kube_job_created and kube_job_status_succeeded, and both have a job_name label with the same values in their results:

kube_job_created
kube_job_created{endpoint="http",instance="100.106.0.21:8080",job="kube-state-metrics",job_name="some-cronjob-1579306020",namespace="some-namespace",pod="prometheus-operator-kube-state-metrics",service="prometheus-operator-kube-state-metrics"} 1579306021
kube_job_created{endpoint="http",instance="100.106.0.21:8080",job="kube-state-metrics",job_name="some-cronjob-1579392420",namespace="some-namespace",pod="prometheus-operator-kube-state-metrics",service="prometheus-operator-kube-state-metrics"} 1579392432
kube_job_created{endpoint="http",instance="100.106.0.21:8080",job="kube-state-metrics",job_name="some-cronjob-1579478820",namespace="some-namespace",pod="prometheus-operator-kube-state-metrics",service="prometheus-operator-kube-state-metrics"} 1579478825
kube_job_created{endpoint="http",instance="100.106.0.21:8080",job="kube-state-metrics",job_name="some-cronjob-1579565220",namespace="some-namespace",pod="prometheus-operator-kube-state-metrics",service="prometheus-operator-kube-state-metrics"} 1579565232
...


kube_job_status_succeeded
kube_job_status_succeeded{endpoint="http",instance="100.106.0.21:8080",job="kube-state-metrics",job_name="some-cronjob-1579306020",namespace="some-namespace",pod="prometheus-operator-kube-state-metrics",service="prometheus-operator-kube-state-metrics"} 1
kube_job_status_succeeded{endpoint="http",instance="100.106.0.21:8080",job="kube-state-metrics",job_name="some-cronjob-1579392420",namespace="some-namespace",pod="prometheus-operator-kube-state-metrics",service="prometheus-operator-kube-state-metrics"} 1
kube_job_status_succeeded{endpoint="http",instance="100.106.0.21:8080",job="kube-state-metrics",job_name="some-cronjob-1579478820",namespace="some-namespace",pod="prometheus-operator-kube-state-metrics",service="prometheus-operator-kube-state-metrics"} 1
kube_job_status_succeeded{endpoint="http",instance="100.106.0.21:8080",job="kube-state-metrics",job_name="some-cronjob-1579565220",namespace="some-namespace",pod="prometheus-operator-kube-state-metrics",service="prometheus-operator-kube-state-metrics"} 1
...


As the result of the query above, I expected to see the value of kube_job_status_succeeded for the job with the greatest timestamp (or something similar). I think I am missing something in my query. Could you please help me figure out what I am doing wrong?

I appreciate any advice on this.


Cheers,
Karsten

Brian Candler

Jan 27, 2020, 12:39:58 PM
to Prometheus Users
It's simple: you are using the == operator between the two sets of timeseries.  One has value 1, and the other has a value like 1579306021 - therefore they are not equal, and so the left-hand timeseries is dropped from the result set.

You can try using * instead of == as the operator.
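
If the goal is to match purely on labels and return the status value unchanged, the "and" set operator may be closer to the original intent: it keeps the left-hand samples that have a label match on the right-hand side and never compares the values. A minimal sketch, assuming the job names follow the some-cronjob-<timestamp> pattern shown in the sample output (not tested against this cluster):

```promql
# Keep the kube_job_status_succeeded sample whose job_name matches the most
# recently created job. Only the job_name label is matched; the creation
# timestamp and the status value are never compared with each other.
kube_job_status_succeeded
  and on(job_name)
topk(1, kube_job_created{job_name=~"some-cronjob.+"})
```

The result carries the value of kube_job_status_succeeded itself, rather than the product of the creation timestamp and the status that the * variant returns.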

Karsten Köhler

Jan 28, 2020, 5:13:19 AM
to Prometheus Users
Hey Brian, many thanks for your response. I thought the operator intends to compare the labels, not the values. If I use * the query works fine.

I have one follow-up question for that. The query I have right now only tells me about the result of the last job. I now want to extend that to give me the results of all jobs of the last x hours.
I use the following query to get all jobs that were created up to one hour ago
time() - kube_job_created{job_name=~"some-cron-job-.+"} < 3600

which results in 
{endpoint="http",instance="100.118.0.24:8080",job="kube-state-metrics",job_name="some-cronjob-1580200800",namespace="some-namespace",pod="prometheus-operator-kube-state-metrics",service="prometheus-operator-kube-state-metrics"} 3199.5999999046326
{endpoint="http",instance="100.118.0.24:8080",job="kube-state-metrics",job_name="some-cronjob-1580201400",namespace="some-namespace",pod="prometheus-operator-kube-state-metrics",service="prometheus-operator-kube-state-metrics"} 2607.5999999046326
{endpoint="http",instance="100.118.0.24:8080",job="kube-state-metrics",job_name="some-cronjob-1580202000",namespace="some-namespace",pod="prometheus-operator-kube-state-metrics",service="prometheus-operator-kube-state-metrics"} 2000.5999999046326

If I try to use that query as the left part of my original query like this
time() - kube_job_created{job_name=~"some-cronjob.+"} < 3600 * on(job_name) group_right kube_job_status_succeeded{job_name=~"some-cronjob.+"}

which results in the following error
Error executing query: invalid parameter 'query': parse error at char 142: vector matching only allowed between instant vectors

I guess the left part of the query is not an instant vector anymore, but I don't know what other type it is then? To me, it still looks like an instant vector (see the result above).

If I omit the <3600, the query works again, but I'm not sure how to interpret the result then. The result looks like that:
time() - kube_job_created{job_name=~"some-cronjob.+"} * on(job_name) group_right kube_job_status_succeeded{job_name=~"some-cronjob.+"}

{endpoint="http",instance="100.118.0.24:8080",job="kube-state-metrics",job_name="some-cronjob-1576761000",namespace="some-namespace",pod="prometheus-operator-kube-state-metrics",service="prometheus-operator-kube-state-metrics"} 1580204721.851
{endpoint="http",instance="100.118.0.24:8080",job="kube-state-metrics",job_name="some-cronjob-1580199000",namespace="some-namespace",pod="prometheus-operator-kube-state-metrics",service="prometheus-operator-kube-state-metrics"} 5719.851000070572
{endpoint="http",instance="100.118.0.24:8080",job="kube-state-metrics",job_name="some-cronjob-1580199600",namespace="some-namespace",pod="prometheus-operator-kube-state-metrics",service="prometheus-operator-kube-state-metrics"} 5112.851000070572

The kube_job_status_succeeded value of the first job (some-cronjob-1576761000) is 0, but in the result above the value is 1580204721.851. Just for my understanding, shouldn't this also be 0, because it multiplies something with 0?


To summarize my question: I want to know how to modify my query, so that it returns the status_succeeded (or something similar) for all jobs that ran in the last x hours.

Again, I appreciate any advice.


Cheers,
Karsten

Brian Candler

Jan 28, 2020, 10:12:00 AM
to Prometheus Users
On Tuesday, 28 January 2020 10:13:19 UTC, Karsten Köhler wrote:
Hey Brian, many thanks for your response. I thought the operator intends to compare the labels, not the values. If I use * the query works fine.
 
I have one follow-up question for that. The query I have right now only tells me about the result of the last job. I now want to extend that to give me the results of all jobs of the last x hours.
I use the following query to get all jobs that were created up to one hour ago
time() - kube_job_created{job_name=~"some-cron-job-.+"} < 3600

which results in 
{endpoint="http",instance="100.118.0.24:8080",job="kube-state-metrics",job_name="some-cronjob-1580200800",namespace="some-namespace",pod="prometheus-operator-kube-state-metrics",service="prometheus-operator-kube-state-metrics"} 3199.5999999046326
{endpoint="http",instance="100.118.0.24:8080",job="kube-state-metrics",job_name="some-cronjob-1580201400",namespace="some-namespace",pod="prometheus-operator-kube-state-metrics",service="prometheus-operator-kube-state-metrics"} 2607.5999999046326
{endpoint="http",instance="100.118.0.24:8080",job="kube-state-metrics",job_name="some-cronjob-1580202000",namespace="some-namespace",pod="prometheus-operator-kube-state-metrics",service="prometheus-operator-kube-state-metrics"} 2000.5999999046326

If I try to use that query as the left part of my original query like this
time() - kube_job_created{job_name=~"some-cronjob.+"} < 3600 * on(job_name) group_right kube_job_status_succeeded{job_name=~"some-cronjob.+"}

which results in the following error
Error executing query: invalid parameter 'query': parse error at char 142: vector matching only allowed between instant vectors

I guess the left part of the query is not an instant vector anymore, but I don't know what other type it is then? To me, it still looks like an instant vector (see the result above).


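I think it is just operator precedence: * binds more tightly than <, so 3600 * on(job_name) group_right ... is parsed as a single operand, and vector matching is not allowed with the scalar 3600 on the left-hand side.  Try adding parentheses, i.e.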

(time() - kube_job_created{job_name=~"some-cronjob.+"} < 3600) * on(job_name) group_right kube_job_status_succeeded{job_name=~"some-cronjob.+"}
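
Note that this product returns the age of each job multiplied by its status, not the status alone. If only the kube_job_status_succeeded value is wanted for jobs created in the last hour, a variant using the "and" set operator might look like the sketch below. It assumes the same job_name pattern as the query above and is untested here:

```promql
# Left side: the status samples to return (their values are kept as-is).
# Right side: jobs created less than 3600 seconds before the evaluation time;
# it acts only as a filter, matched on the job_name label.
kube_job_status_succeeded{job_name=~"some-cronjob.+"}
  and on(job_name)
(time() - kube_job_created{job_name=~"some-cronjob.+"} < 3600)
```

Changing 3600 to another number of seconds adjusts the "last x hours" window.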


 
If I omit the <3600, the query works again, but I'm not sure how to interpret the result then. The result looks like that:
time() - kube_job_created{job_name=~"some-cronjob.+"} * on(job_name) group_right kube_job_status_succeeded{job_name=~"some-cronjob.+"}

{endpoint="http",instance="100.118.0.24:8080",job="kube-state-metrics",job_name="some-cronjob-1576761000",namespace="some-namespace",pod="prometheus-operator-kube-state-metrics",service="prometheus-operator-kube-state-metrics"} 1580204721.851
{endpoint="http",instance="100.118.0.24:8080",job="kube-state-metrics",job_name="some-cronjob-1580199000",namespace="some-namespace",pod="prometheus-operator-kube-state-metrics",service="prometheus-operator-kube-state-metrics"} 5719.851000070572
{endpoint="http",instance="100.118.0.24:8080",job="kube-state-metrics",job_name="some-cronjob-1580199600",namespace="some-namespace",pod="prometheus-operator-kube-state-metrics",service="prometheus-operator-kube-state-metrics"} 5112.851000070572

The kube_job_status_succeeded value of the first job (some-cronjob-1576761000) is 0, but in the result above the value is 1580204721.851. Just for my understanding, shouldn't this also be 0, because it multiplies something with 0?

Operator precedence again: * binds more tightly than -, so it's interpreted as time() - (foo * bar)
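
Written out with explicit parentheses, the two groupings look like this. The first is how the query without parentheses is read, which explains the 1580204721.851 value in the quoted result; the second is the grouping that multiplies the age by the status:

```promql
# As parsed without parentheses: the multiplication happens first. For a job
# whose succeeded value is 0, the product is 0, and time() - 0 is simply the
# evaluation timestamp.
time() - (kube_job_created{job_name=~"some-cronjob.+"} * on(job_name) group_right kube_job_status_succeeded{job_name=~"some-cronjob.+"})

# With the subtraction grouped first: the job's age in seconds is multiplied
# by its status, so a succeeded value of 0 yields 0.
(time() - kube_job_created{job_name=~"some-cronjob.+"}) * on(job_name) group_right kube_job_status_succeeded{job_name=~"some-cronjob.+"}
```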


Regards,

Brian.

Karsten Köhler

Jan 29, 2020, 2:44:05 AM
to Prometheus Users
Yes, you are right. With the parentheses everything works as I expected. Thanks a lot for your help :)

Brian Candler

Jan 29, 2020, 4:48:07 AM
to Prometheus Users
Great.

Just one point to be aware of, if you hadn't already noticed: time() doesn't mean the current time, it means the time for which the query is evaluated.  If you're just doing an instant query, it won't be a problem.  But if you try to graph one of these expressions, then as the time axis sweeps across a range of times, the value of time() will also sweep in step.

Karsten Köhler

Jan 29, 2020, 6:42:41 AM
to Prometheus Users
That's a good hint, thank you. I have no plans to graph those expressions; I intend to use them for alerting.
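
For the alerting use case, an expression along the following lines could serve as a starting point. It is only a sketch: kube_job_status_failed does not appear anywhere in this thread and is assumed to be exported by the same kube-state-metrics deployment, and the one-hour window and job_name pattern are carried over from the queries above.

```promql
# Hypothetical alert expression: returns a sample for every job that was
# created less than an hour before the evaluation time and reports at least
# one failed pod. kube_job_status_failed is an assumption, not from the thread.
kube_job_status_failed{job_name=~"some-cronjob.+"} > 0
  and on(job_name)
(time() - kube_job_created{job_name=~"some-cronjob.+"} < 3600)
```

Because time() is the evaluation time, each rule evaluation looks at the hour before that evaluation, so the window slides forward with every run.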