Ceph Exporter


Vaibhav Bhembre

Jan 5, 2016, 12:34:49 PM
to Prometheus Developers
Hello friends,

We have just released the Ceph exporter we have been using (https://github.com/digitalocean/ceph_exporter). It is written in Go and uses Ceph's official Go client (https://github.com/ceph/go-ceph) to issue stats calls to a Ceph cluster.
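To give a sense of the mechanics, here is a minimal sketch of issuing a stats call through go-ceph (error handling kept terse; the "df" mon command shown is just one example of the kind of call involved):

-----------
// A minimal, self-contained sketch: connect to the cluster using the
// default config and ask the monitors for usage data ("ceph df").
package main

import (
	"encoding/json"
	"fmt"
	"log"

	"github.com/ceph/go-ceph/rados"
)

func main() {
	conn, err := rados.NewConn()
	if err != nil {
		log.Fatal(err)
	}
	// Reads /etc/ceph/ceph.conf for monitor addresses and keys.
	if err := conn.ReadDefaultConfigFile(); err != nil {
		log.Fatal(err)
	}
	if err := conn.Connect(); err != nil {
		log.Fatal(err)
	}
	defer conn.Shutdown()

	// Mon commands take a JSON payload; "df" returns cluster usage.
	cmd, _ := json.Marshal(map[string]string{"prefix": "df", "format": "json"})
	buf, _, err := conn.MonCommand(cmd)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(buf))
}
-----------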

We are currently tracking the following metrics:

Cluster-level
  • Total storage capacity of cluster.
  • Used capacity of the cluster.
  • Available capacity.
  • Total no. of RADOS objects present.
  • No. of Degraded PGs.
  • No. of Unclean PGs.
  • No. of Undersized PGs.
  • No. of Stale PGs.
  • No. of Degraded RADOS objects.
  • Percentage of Degraded RADOS objects.
  • No. of OSDs in DOWN state.
Pool-level
  • Total storage capacity of each pool.
  • Total no. of RADOS objects in each pool.
  • No. of Read/Write IO attempts.
Monitor-level
  • Total storage capacity of each monitor.
  • Used storage capacity.
  • Available storage capacity.
  • Percentage of available storage capacity.
  • Total size of Monitor backing store (FileStore).
  • Extent of FileStore occupied by SSTables.
  • Extent of FileStore occupied by logs.
  • Extent of FileStore occupied by other misc stuff.
  • Clock skew on the monitors.
  • Latency of each of the monitors.
  • No. of nodes in Monitor Quorum.
This just scratches the surface of what we can do with a tool like this though. All credit goes to Prometheus developers (you guys!) for not only building an amazing metrics and monitoring platform but also making it extremely easy to build such integrations with it.

It'd be quite helpful to make sure that this tool meets the standards set by this group. We have made sure to follow the exporter guidelines (https://goo.gl/98hIjJ) for naming and other conventions.

Any feedback would be appreciated.

Brian Brazil

Jan 5, 2016, 12:56:48 PM
to Vaibhav Bhembre, Prometheus Developers
It's always great to see more exporters!

I note this in the readme "Hence, no additional setup is necessary other than having a working ceph cluster." Does this mean that no metrics are exposed if the cluster is broken? Generally you should be running one exporter per daemon so that even when things are failing, you can still get the metrics you need to debug.

Some comments on the metrics:


The _total suffix usually indicates a counter; avoid it with other types such as gauges. objects_total seems inconsistent in not having the 'cluster' in there. You should be using ConstMetric rather than setting metrics in Collect; this avoids races.
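For illustration, a minimal sketch of the ConstMetric pattern (the metric name and the stats call are hypothetical stand-ins):

-----------
package main

import "github.com/prometheus/client_golang/prometheus"

// clusterCollector builds its metrics fresh on every scrape via
// MustNewConstMetric, so scrapes share no mutable state and stale
// label sets vanish on their own.
type clusterCollector struct {
	capacity *prometheus.Desc
}

func newClusterCollector() *clusterCollector {
	return &clusterCollector{
		capacity: prometheus.NewDesc(
			"ceph_cluster_capacity_bytes", // hypothetical name
			"Total cluster capacity in bytes.",
			nil, nil,
		),
	}
}

func (c *clusterCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.capacity
}

func (c *clusterCollector) Collect(ch chan<- prometheus.Metric) {
	// A new metric is created per scrape; nothing is retained between
	// scrapes, so there is no shared state to race on.
	ch <- prometheus.MustNewConstMetric(
		c.capacity, prometheus.GaugeValue, fetchCapacity(),
	)
}

// fetchCapacity stands in for a real stats call to Ceph.
func fetchCapacity() float64 { return 0 }

func main() {
	prometheus.MustRegister(newClusterCollector())
}
-----------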


The _count suffix is part of a Summary or Histogram; avoid it with other types. Generally you can remove suffixes such as _count and _total without any loss of meaning. degraded_objects_percent: if this can be calculated from other metrics, then don't export it. In general avoid percentages; ratios are preferred for consistency (and should be calculated in Prometheus).
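For example, a ratio like `degraded_objects / objects` (hypothetical metric names) can be computed at query time or in a recording rule rather than exported.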


pool_read_io_total sounds like a counter rather than a gauge. I wouldn't usually put 'io' in a name, as read implies that.


monitor_total_kbs should be converted to bytes, both because that's our standard and to be consistent with the rest of the exporter. For monitor_clock_skew and monitor_latency, put the units in the metric name (preferably seconds).
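A sketch of the size conversion (assuming Ceph's kB figures are KiB):

-----------
package collectors

// kbToBytes converts a size Ceph reports in kB (assumed KiB here)
// into bytes, matching the Prometheus convention of base units.
func kbToBytes(kb float64) float64 {
	return kb * 1024
}
-----------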

Brian 

Vaibhav Bhembre

Jan 5, 2016, 2:51:09 PM
to Prometheus Developers, vai...@digitalocean.com
Thanks for your detailed review, Brian! :) I appreciate you taking the time to go through the source and list your thoughts. I will try to address them inline.
In the current version, we track and report the behavior of a cluster across all states; primarily, that means surfacing specific issues when things are not working as desired. When a cluster enters a degraded or broken state, several indicators start glowing red, and Ceph's own reporting takes care of exposing most of that information. One case where we currently log an error (and retry) in the exporter is when the client connection to the cluster fails due to either a network issue or a problem with the monitor quorum. Both of these issues are quite significant for us, and a separate monitoring utility raises an alarm when such a state is entered.

I like the idea of scraping stats from individual daemons in isolation. In our scenario, we didn't find the metrics they surface all that useful for debugging the various issues we encountered (unlike with other daemons such as mysqld or httpd). But when the cluster is "down" (as seen externally), it could be beneficial to know the state of the running processes across the ensemble. I will look into writing a couple more exporters specifically for the monitor and OSD daemons.

Some comments on the metrics:


The _total suffix usually indicates a counter; avoid it with other types such as gauges. objects_total seems inconsistent in not having the 'cluster' in there. You should be using ConstMetric rather than setting metrics in Collect; this avoids races.

I see. Is "cluster_capacity" a fine alternative to "cluster_bytes_total" if it's a gauge? Also, the collection process is currently synchronized by https://github.com/digitalocean/ceph_exporter/blob/master/exporter.go#L68-L69. Do you feel like it could still be susceptible to any concurrency issues?
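(For readers without the source open, the synchronization is roughly the following; this is a paraphrase, not the exact code:)

-----------
package exporter

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// CephExporter fans a scrape out to its sub-collectors; the mutex
// serializes concurrent scrapes so shared metric state isn't mutated
// by two scrapes at once.
type CephExporter struct {
	mu         sync.Mutex
	collectors []prometheus.Collector
}

func (c *CephExporter) Describe(ch chan<- *prometheus.Desc) {
	for _, cc := range c.collectors {
		cc.Describe(ch)
	}
}

func (c *CephExporter) Collect(ch chan<- prometheus.Metric) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, cc := range c.collectors {
		cc.Collect(ch)
	}
}
-----------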
 


The _count suffix is part of a Summary or Histogram; avoid it with other types. Generally you can remove suffixes such as _count and _total without any loss of meaning. degraded_objects_percent: if this can be calculated from other metrics, then don't export it. In general avoid percentages; ratios are preferred for consistency (and should be calculated in Prometheus).

I will update the source to remove both suffixes. Agreed that the percentage doesn't need to be computed inline, but since the one being surfaced is bubbled up by Ceph itself, I'm planning to keep it for now.


pool_read_io_total sounds like a counter rather than a gauge. I wouldn't usually put 'io' in a name, as read implies that.

Good point! Truncating `io`. :) I also didn't realize the Counter type had a Set(); I always thought it just had Inc(). I will update the type.


monitor_total_kbs should be converted to bytes, both because that's our standard and to be consistent with the rest of the exporter. For monitor_clock_skew and monitor_latency, put the units in the metric name (preferably seconds).

I can do the conversion inline if that's the requirement for recording sizes/capacities. For both clock skew and latency, measuring time in seconds is too low a resolution. But I see your point, and I will add the unit to the metric name. :)


And please let me know (or feel free to open issues/contribute) if there are other changes you would like to see therein.



Brian Brazil

Jan 5, 2016, 3:06:21 PM
to Vaibhav Bhembre, Prometheus Developers
That sounds more like blackbox monitoring; I wouldn't bundle that in with whitebox monitoring like this.
 

I like the idea of scraping stats from individual daemons in isolation. In our scenario, we didn't find the metrics they surface all that useful for debugging the various issues we encountered (unlike with other daemons such as mysqld or httpd). But when the cluster is "down" (as seen externally), it could be beneficial to know the state of the running processes across the ensemble. I will look into writing a couple more exporters specifically for the monitor and OSD daemons.

Some comments on the metrics:


The _total suffix usually indicates a counter; avoid it with other types such as gauges. objects_total seems inconsistent in not having the 'cluster' in there. You should be using ConstMetric rather than setting metrics in Collect; this avoids races.

I see. Is "cluster_capacity" a fine alternative to "cluster_bytes_total" if it's a gauge?

cluster_capacity_bytes or cluster_bytes would both be fine.
 
Also, the collection process is currently synchronized by https://github.com/digitalocean/ceph_exporter/blob/master/exporter.go#L68-L69. Do you feel like it could still be susceptible to any concurrency issues?

That'll prevent races, but you still have the other issues arising from not using ConstMetric, such as labels not going away when they should.
 


The _count suffix is part of a Summary or Histogram; avoid it with other types. Generally you can remove suffixes such as _count and _total without any loss of meaning. degraded_objects_percent: if this can be calculated from other metrics, then don't export it. In general avoid percentages; ratios are preferred for consistency (and should be calculated in Prometheus).

I will update the source to remove both suffixes. Agreed that the percentage doesn't need to be computed inline, but since the one being surfaced is bubbled up by Ceph itself, I'm planning to keep it for now.

If there's a percentage that's covered by other metrics, it's best to drop it.
 


pool_read_io_total sounds like a counter rather than a gauge. I wouldn't usually put 'io' in a name, as read implies that.

Good point! Truncating `io`. :) I also didn't realize the Counter type had a Set(); I always thought it just had Inc(). I will update the type.

The presence of Set() on Go's Counter is an API wart in my opinion, and everything I'm aware of that's currently using it should be using ConstMetric instead.
 


monitor_total_kbs should be converted to bytes, both because that's our standard and to be consistent with the rest of the exporter. For monitor_clock_skew and monitor_latency, put the units in the metric name (preferably seconds).

I can do the conversion inline if that's the requirement for recording sizes/capacities. For both clock skew and latency, measuring time in seconds is too low a resolution. But I see your point, and I will add the unit to the metric name. :)

All values in Prometheus are floats, so you shouldn't have a problem using seconds.
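For example (the helper name is illustrative):

-----------
package collectors

import "time"

// seconds converts a measured skew or latency into the float64 seconds
// Prometheus expects; sub-second precision is preserved, e.g. 500µs
// becomes 0.0005.
func seconds(d time.Duration) float64 {
	return d.Seconds()
}
-----------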

Brian
 



Vaibhav Bhembre

Jan 6, 2016, 12:23:25 PM
to Prometheus Developers, vai...@digitalocean.com
The repository has been updated with the takeaways from the above conversation (other than the Gauge-to-ConstMetric conversion in a few instances; I will look at those selectively). Thanks for all your help, Brian!



chie...@gmail.com

Nov 21, 2016, 1:46:01 AM
to Prometheus Developers

Please help me. How do I install ceph_exporter?
I have installed Prometheus + node_exporter.

Ben Kochie

Nov 21, 2016, 2:11:05 AM
to chie...@gmail.com, Prometheus Developers
There are two Ceph exporters you can try.  The documentation for them is included on their homepages.



On Mon, Nov 21, 2016 at 7:46 AM, <chie...@gmail.com> wrote:

Please help me. How do I install ceph_exporter?
I have installed Prometheus + node_exporter.

Ben Kochie

Nov 21, 2016, 2:11:52 AM
to chie...@gmail.com, Prometheus Developers
Please take additional questions to the Prometheus users list: https://groups.google.com/forum/#!forum/prometheus-users

On Mon, Nov 21, 2016 at 8:11 AM, Ben Kochie <sup...@gmail.com> wrote:
There are two Ceph exporters you can try.  The documentation for them is included on their homepages.


On Mon, Nov 21, 2016 at 7:46 AM, <chie...@gmail.com> wrote:

Please help me. How do I install ceph_exporter?
I have installed Prometheus + node_exporter.


robu...@gmail.com

Nov 10, 2017, 6:27:14 AM
to Prometheus Developers
Hi Vaibhav Bhembre,

I tried to install your ceph_exporter but failed with:

-----------
user@vm23[/0]:/opt/ceph_exporter/ceph_exporter # go install exporter.go
exporter.go:24:2: cannot find package "github.com/ceph/go-ceph/rados" in any of:
        /usr/lib/go-1.7/src/github.com/ceph/go-ceph/rados (from $GOROOT)
        ($GOPATH not set)
exporter.go:25:2: cannot find package "github.com/digitalocean/ceph_exporter/collectors" in any of:
        /usr/lib/go-1.7/src/github.com/digitalocean/ceph_exporter/collectors (from $GOROOT)
        ($GOPATH not set)
exporter.go:26:2: cannot find package "github.com/prometheus/client_golang/prometheus" in any of:
        /usr/lib/go-1.7/src/github.com/prometheus/client_golang/prometheus (from $GOROOT)
        ($GOPATH not set)
exporter.go:27:2: cannot find package "github.com/prometheus/client_golang/prometheus/promhttp" in any of:
        /usr/lib/go-1.7/src/github.com/prometheus/client_golang/prometheus/promhttp (from $GOROOT)
        ($GOPATH not set)
-----------

What I did was git-clone https://github.com/digitalocean/ceph_exporter
and then run "go install exporter.go".
Since I'm unfortunately a "no-Go-er" ;-/ this is probably me missing a dependency or something else about how to build the package in Go.
Since a Docker container is not an option here, I would be glad of any hint on how to build the ceph_exporter.

best regards

Conor Broderick

Nov 10, 2017, 6:31:54 AM
to robu...@gmail.com, Prometheus Developers
Your Go environment isn't set up correctly.

You can find setup instructions here.


robu...@gmail.com

Nov 10, 2017, 6:45:33 AM
to Prometheus Developers
Sorry for the fuss; I finally figured it out.
For the next poor fellow:

mkdir <somepath>
cd <somepath>
export GOPATH=<somepath>
go get -u github.com/digitalocean/ceph_exporter

that's it.
If it complains about missing rados headers, one has to install them first:

apt-get install librados-dev

then try again
