Ceph Exporter


Vaibhav Bhembre

Jan 5, 2016, 12:34:49 PM
to Prometheus Developers
Hello friends,

We have just released the Ceph exporter we have been using (https://github.com/digitalocean/ceph_exporter). It is written in Go and uses Ceph's official Go client (https://github.com/ceph/go-ceph) to issue stats calls to a Ceph cluster.
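To give a sense of the mechanics, here is a minimal sketch of issuing a stats call through go-ceph (error handling kept terse; the "df" mon command shown is just one example of the kind of call involved):

-----------
// A minimal, self-contained sketch: connect to the cluster using the
// default config and ask the monitors for usage data ("ceph df").
package main

import (
	"encoding/json"
	"fmt"
	"log"

	"github.com/ceph/go-ceph/rados"
)

func main() {
	conn, err := rados.NewConn()
	if err != nil {
		log.Fatal(err)
	}
	// Reads /etc/ceph/ceph.conf for monitor addresses and keys.
	if err := conn.ReadDefaultConfigFile(); err != nil {
		log.Fatal(err)
	}
	if err := conn.Connect(); err != nil {
		log.Fatal(err)
	}
	defer conn.Shutdown()

	// Mon commands take a JSON payload; "df" returns cluster usage.
	cmd, _ := json.Marshal(map[string]string{"prefix": "df", "format": "json"})
	buf, _, err := conn.MonCommand(cmd)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(buf))
}
-----------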

We are currently tracking the following metrics:

Cluster-level
  • Total storage capacity of cluster.
  • Used capacity of the cluster.
  • Available capacity.
  • Total no. of RADOS objects present.
  • No. of Degraded PGs.
  • No. of Unclean PGs.
  • No. of Undersized PGs.
  • No. of Stale PGs.
  • No. of Degraded RADOS objects.
  • Percentage of Degraded RADOS objects.
  • No. of OSDs in DOWN state.
Pool-level
  • Total storage capacity of each pool.
  • Total no. of RADOS objects in each pool.
  • No. of Read/Write IO attempts.
Monitor-level
  • Total storage capacity of each monitor.
  • Used storage capacity.
  • Available storage capacity.
  • Percentage of available storage capacity.
  • Total size of Monitor backing store (FileStore).
  • Extent of FileStore occupied by SSTables.
  • Extent of FileStore occupied by logs.
  • Extent of FileStore occupied by other misc stuff.
  • Clock skew on the monitors.
  • Latency of each of the monitors.
  • No. of nodes in Monitor Quorum.
This just scratches the surface of what we can do with a tool like this though. All credit goes to Prometheus developers (you guys!) for not only building an amazing metrics and monitoring platform but also making it extremely easy to build such integrations with it.

It'd be quite helpful to make sure that this tool meets the standards set by this group. We have made sure to follow the exporter guidelines (https://goo.gl/98hIjJ) for naming and other conventions.

Any feedback would be appreciated.

Brian Brazil

Jan 5, 2016, 12:56:48 PM
to Vaibhav Bhembre, Prometheus Developers
It's always great to see more exporters!

I note this in the readme "Hence, no additional setup is necessary other than having a working ceph cluster." Does this mean that no metrics are exposed if the cluster is broken? Generally you should be running one exporter per daemon so that even when things are failing, you can still get the metrics you need to debug.

Some comments on the metrics:


The _total suffix usually indicates a counter; avoid it with other types such as gauges. objects_total seems inconsistent in not having the 'cluster' in there. You should be using ConstMetric rather than setting metrics in Collect; this avoids races.
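For illustration, a minimal sketch of the ConstMetric pattern (the metric name and the stats call are hypothetical stand-ins):

-----------
package main

import "github.com/prometheus/client_golang/prometheus"

// clusterCollector builds its metrics fresh on every scrape via
// MustNewConstMetric, so scrapes share no mutable state and stale
// label sets vanish on their own.
type clusterCollector struct {
	capacity *prometheus.Desc
}

func newClusterCollector() *clusterCollector {
	return &clusterCollector{
		capacity: prometheus.NewDesc(
			"ceph_cluster_capacity_bytes", // hypothetical name
			"Total cluster capacity in bytes.",
			nil, nil,
		),
	}
}

func (c *clusterCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.capacity
}

func (c *clusterCollector) Collect(ch chan<- prometheus.Metric) {
	// A new metric is created per scrape; nothing is retained between
	// scrapes, so there is no shared state to race on.
	ch <- prometheus.MustNewConstMetric(
		c.capacity, prometheus.GaugeValue, fetchCapacity(),
	)
}

// fetchCapacity stands in for a real stats call to Ceph.
func fetchCapacity() float64 { return 0 }

func main() {
	prometheus.MustRegister(newClusterCollector())
}
-----------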


The _count suffix is part of a Summary or Histogram; avoid it with other types. Generally you can remove suffixes such as _count and _total without any loss of meaning. degraded_objects_percent: if this can be calculated from other metrics, then don't export it. In general avoid percentages; ratios are preferred for consistency (and should be calculated in Prometheus).
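For example, a ratio like `degraded_objects / objects` (hypothetical metric names) can be computed at query time or in a recording rule rather than exported.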


pool_read_io_total sounds like a counter rather than a gauge. I wouldn't usually put 'io' in a name, as read implies that.


monitor_total_kbs should be converted to bytes, both because that's our standard and to be consistent with the rest of the exporter. For monitor_clock_skew and monitor_latency, put the units in the metric name (preferably seconds).
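A sketch of the size conversion (assuming Ceph's kB figures are KiB):

-----------
package collectors

// kbToBytes converts a size Ceph reports in kB (assumed KiB here)
// into bytes, matching the Prometheus convention of base units.
func kbToBytes(kb float64) float64 {
	return kb * 1024
}
-----------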

Brian 

Vaibhav Bhembre

Jan 5, 2016, 2:51:09 PM
to Prometheus Developers, vai...@digitalocean.com
Thanks for your detailed review, Brian! :) I appreciate you taking the time to go through the source and list your thoughts. I will try to address them inline.
In the current version, we track and report the behavior of a cluster across all states; primarily, that means surfacing specific issues when things are not working as desired. When a cluster enters a degraded or broken state, several indicators start glowing red, and Ceph's own reporting takes care of exposing most of that information. One case where we currently log an error (and retry) in the exporter is when the client connection to the cluster fails due to either a network issue or a problem with the monitor quorum. Both of these issues are quite significant for us, and a separate monitoring utility raises an alarm when such a state is entered.

I like the idea of scraping stats from individual daemons in isolation. In our scenario, we didn't find the metrics they surface all that useful for debugging the various issues we encountered (unlike with other daemons such as mysqld or httpd). But when the cluster is "down" (as seen externally), it could be beneficial to know the state of the running processes across the ensemble. I will look into writing a couple more exporters specifically for the monitor and OSD daemons.

Some comments on the metrics:


The _total suffix usually indicates a counter; avoid it with other types such as gauges. objects_total seems inconsistent in not having the 'cluster' in there. You should be using ConstMetric rather than setting metrics in Collect; this avoids races.

I see. Is "cluster_capacity" a fine alternative to "cluster_bytes_total" if it's a gauge? Also, the collection process is currently synchronized by https://github.com/digitalocean/ceph_exporter/blob/master/exporter.go#L68-L69. Do you feel like it could still be susceptible to any concurrency issues?
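(For readers without the source open, the synchronization is roughly the following; this is a paraphrase, not the exact code:)

-----------
package exporter

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// CephExporter fans a scrape out to its sub-collectors; the mutex
// serializes concurrent scrapes so shared metric state isn't mutated
// by two scrapes at once.
type CephExporter struct {
	mu         sync.Mutex
	collectors []prometheus.Collector
}

func (c *CephExporter) Describe(ch chan<- *prometheus.Desc) {
	for _, cc := range c.collectors {
		cc.Describe(ch)
	}
}

func (c *CephExporter) Collect(ch chan<- prometheus.Metric) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, cc := range c.collectors {
		cc.Collect(ch)
	}
}
-----------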
 


The _count suffix is part of a Summary or Histogram; avoid it with other types. Generally you can remove suffixes such as _count and _total without any loss of meaning. degraded_objects_percent: if this can be calculated from other metrics, then don't export it. In general avoid percentages; ratios are preferred for consistency (and should be calculated in Prometheus).

I will update the source to remove both suffixes. Agreed that the percentage doesn't need to be computed inline, but since the one being surfaced is bubbled up by Ceph itself, I'm planning to keep it for now.


pool_read_io_total sounds like a counter rather than a gauge. I wouldn't usually put 'io' in a name, as read implies that.

Good point! Truncating `io`. :) I also didn't realize the Counter type had a Set(); I always thought it just had Inc(). I will update the type.


monitor_total_kbs should be converted to bytes, both because that's our standard and to be consistent with the rest of the exporter. For monitor_clock_skew and monitor_latency, put the units in the metric name (preferably seconds).

I can do the conversion inline if that's the requirement for recording sizes/capacities. For both clock skew and latency, measuring time in seconds is too low a resolution. But I see your point, and I will add the unit to the metric name. :)


And please let me know (or feel free to open issues/contribute) if there are other changes you would like to see therein.



Brian Brazil

Jan 5, 2016, 3:06:21 PM
to Vaibhav Bhembre, Prometheus Developers
That sounds more like blackbox monitoring; I wouldn't bundle that in with whitebox monitoring like this.
 

I like the idea of scraping stats from individual daemons in isolation. In our scenario, we didn't find the metrics they surface all that useful for debugging the various issues we encountered (unlike with other daemons such as mysqld or httpd). But when the cluster is "down" (as seen externally), it could be beneficial to know the state of the running processes across the ensemble. I will look into writing a couple more exporters specifically for the monitor and OSD daemons.

Some comments on the metrics:


The _total suffix usually indicates a counter; avoid it with other types such as gauges. objects_total seems inconsistent in not having the 'cluster' in there. You should be using ConstMetric rather than setting metrics in Collect; this avoids races.

I see. Is "cluster_capacity" a fine alternative to "cluster_bytes_total" if it's a gauge?

cluster_capacity_bytes or cluster_bytes would both be fine.
 
Also, the collection process is currently synchronized by https://github.com/digitalocean/ceph_exporter/blob/master/exporter.go#L68-L69. Do you feel like it could still be susceptible to any concurrency issues?

That'll prevent races, but you still have the other issues arising from not using ConstMetric, such as labels not going away when they should.
 


The _count suffix is part of a Summary or Histogram; avoid it with other types. Generally you can remove suffixes such as _count and _total without any loss of meaning. degraded_objects_percent: if this can be calculated from other metrics, then don't export it. In general avoid percentages; ratios are preferred for consistency (and should be calculated in Prometheus).

I will update the source to remove both suffixes. Agreed that the percentage doesn't need to be computed inline, but since the one being surfaced is bubbled up by Ceph itself, I'm planning to keep it for now.

If there's a percentage that's covered by other metrics, it's best to drop it.
 


pool_read_io_total sounds like a counter rather than a gauge. I wouldn't usually put 'io' in a name, as read implies that.

Good point! Truncating `io`. :) I also didn't realize the Counter type had a Set(); I always thought it just had Inc(). I will update the type.

The presence of Set() on Go's Counter is an API wart in my opinion, and everything I'm aware of that's currently using it should be using ConstMetric instead.
 


monitor_total_kbs should be converted to bytes, both because that's our standard and to be consistent with the rest of the exporter. For monitor_clock_skew and monitor_latency, put the units in the metric name (preferably seconds).

I can do the conversion inline if that's the requirement for recording sizes/capacities. For both clock skew and latency, measuring time in seconds is too low a resolution. But I see your point, and I will add the unit to the metric name. :)

All values in Prometheus are floats, so you shouldn't have a problem using seconds.
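For example (the helper name is illustrative):

-----------
package collectors

import "time"

// seconds converts a measured skew or latency into the float64 seconds
// Prometheus expects; sub-second precision is preserved, e.g. 500µs
// becomes 0.0005.
func seconds(d time.Duration) float64 {
	return d.Seconds()
}
-----------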

Brian
 



Vaibhav Bhembre

Jan 6, 2016, 12:23:25 PM
to Prometheus Developers, vai...@digitalocean.com
The repository has been updated with the takeaways from the above conversation (other than the Gauge-to-ConstMetric conversion in a few instances; I will look at those selectively). Thanks for all your help, Brian!



chie...@gmail.com

Nov 21, 2016, 1:46:01 AM
to Prometheus Developers

Please help me. How do I install ceph_exporter?
I have installed Prometheus + node_exporter.

Ben Kochie

Nov 21, 2016, 2:11:05 AM
to chie...@gmail.com, Prometheus Developers
There are two Ceph exporters you can try.  The documentation for them is included on their homepages.



On Mon, Nov 21, 2016 at 7:46 AM, <chie...@gmail.com> wrote:

Please help me. How do I install ceph_exporter?
I have installed Prometheus + node_exporter.

Ben Kochie

Nov 21, 2016, 2:11:52 AM
to chie...@gmail.com, Prometheus Developers
Please take additional questions to the Prometheus users list: https://groups.google.com/forum/#!forum/prometheus-users

On Mon, Nov 21, 2016 at 8:11 AM, Ben Kochie <sup...@gmail.com> wrote:
There are two Ceph exporters you can try.  The documentation for them is included on their homepages.


On Mon, Nov 21, 2016 at 7:46 AM, <chie...@gmail.com> wrote:

Please help me. How do I install ceph_exporter?
I have installed Prometheus + node_exporter.


robu...@gmail.com

Nov 10, 2017, 6:27:14 AM
to Prometheus Developers
Hi Vaibhav Bhembre,

I tried to install your ceph_exporter but failed with:

-----------
user@vm23[/0]:/opt/ceph_exporter/ceph_exporter # go install exporter.go
exporter.go:24:2: cannot find package "github.com/ceph/go-ceph/rados" in any of:
        /usr/lib/go-1.7/src/github.com/ceph/go-ceph/rados (from $GOROOT)
        ($GOPATH not set)
exporter.go:25:2: cannot find package "github.com/digitalocean/ceph_exporter/collectors" in any of:
        /usr/lib/go-1.7/src/github.com/digitalocean/ceph_exporter/collectors (from $GOROOT)
        ($GOPATH not set)
exporter.go:26:2: cannot find package "github.com/prometheus/client_golang/prometheus" in any of:
        /usr/lib/go-1.7/src/github.com/prometheus/client_golang/prometheus (from $GOROOT)
        ($GOPATH not set)
exporter.go:27:2: cannot find package "github.com/prometheus/client_golang/prometheus/promhttp" in any of:
        /usr/lib/go-1.7/src/github.com/prometheus/client_golang/prometheus/promhttp (from $GOROOT)
        ($GOPATH not set)
-----------

What I did was git-clone https://github.com/digitalocean/ceph_exporter
and then run "go install exporter.go".
Since I'm unfortunately a "no-Go-er" ;-/ this is probably me missing a dependency or something else about how to build the package in Go.
Since a Docker container is not an option here, I would be glad of any hint on how to build the ceph_exporter.

best regards

Conor Broderick

Nov 10, 2017, 6:31:54 AM
to robu...@gmail.com, Prometheus Developers
Your Go environment isn't set up correctly.

You can find setup instructions here.


robu...@gmail.com

Nov 10, 2017, 6:45:33 AM
to Prometheus Developers
Sorry for the fuss; I finally figured it out.
For the next poor fellow:

mkdir <somepath>
cd <somepath>
export GOPATH=<somepath>
go get -u github.com/digitalocean/ceph_exporter

that's it.
If it complains about missing rados headers, one has to install them first:

apt-get install librados-dev

then try again
