On the Cloud Foundry Runtime team, we're beginning to discuss monitoring for deployments that may not have internet access. I was hoping to get feedback from the community on this decision, which will likely affect the architecture of Cloud Foundry.
To briefly go over where we are and how we got here:
I really like Datadog. It's easy to use, the charts look great, and their support has been fantastic. But Datadog just won't work in everyone's circumstances (e.g. non-AWS deployments with no internet access). We're about at the point where it's time to consider options for monitoring that will not require external internet access.
We're thinking that statsd is our most obvious candidate to replace Datadog. It seems to be a buzzworthy tool, and Datadog can optionally use a customized statsd backend (potentially saving us some work).
However, many of our metrics currently rely on tags, which are specific to Datadog. For instance, we record the CPU load average of each component and use tags to specify that a data point belongs to job "dea" and index 0. Then we can filter the data by tags, so we could see the mean CPU load average across all DEAs, or we could show a dashboard with a separate chart for each Cloud Controller. Once I learned that we were accomplishing this through a non-standard statsd backend, I began to wonder whether needing tags at all was a statsd antipattern. Surely our use case is very typical for users of statsd.
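To make the tag question concrete, here is a rough Python sketch of the same data point in both wire formats. The "cf" prefix, the metric name, the function names, and the host and port are made up for illustration and are not what our components emit today. Plain statsd has no notion of tags, so the usual workaround is to fold the job and index into the dotted metric name; Datadog's statsd extension instead appends tags after "|#".

```python
import socket


def plain_statsd_gauge(metric, value, job, index, prefix="cf"):
    """Build a plain statsd gauge line.

    Plain statsd has no tags, so job and index are folded into the
    dotted metric name (hypothetical naming scheme).
    """
    return f"{prefix}.{job}.{index}.{metric}:{value}|g"


def dogstatsd_gauge(metric, value, job, index, prefix="cf"):
    """Build a gauge line in Datadog's statsd extension format.

    Tags are appended after '|#' as comma-separated key:value pairs.
    """
    return f"{prefix}.{metric}:{value}|g|#job:{job},index:{index}"


def send_udp(line, host="127.0.0.1", port=8125):
    """Send one metric line as a UDP datagram (8125 is statsd's default port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


if __name__ == "__main__":
    print(plain_statsd_gauge("cpu.load_avg", 0.42, "dea", 0))
    print(dogstatsd_gauge("cpu.load_avg", 0.42, "dea", 0))
```

With the name-based scheme, filtering has to happen in the frontend rather than in the collector. For example, if the frontend were Graphite, a wildcard such as cf.dea.*.cpu.load_avg could select every DEA, though that is an assumption about a frontend we have not chosen yet.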
So it's going to take some non-trivial effort to switch to statsd, and we haven't even touched on frontends yet. I don't think I've talked to anyone on the team who has had significant experience using statsd, so we're not certain that statsd is the best choice.
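Whichever tool we use, our major use cases include:
- Record arbitrary time series data
- View charts from the recorded data and from realtime data
- Alert on-call support when the recorded data meets certain predefined criteria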
What do you say, community? Should we go forward with statsd?
2 August 2013 15:41
2 August 2013 14:45
We've been working on what we call an "AdminUI". It started out as just a tool to nicely show all of the useful data that the /varz endpoints return. But as we organized the data, it quickly became the main monitoring tool our admins use to keep track of the system. Although it began as a read-only tool, we've started to add more features, like the ability to kick off new DEAs and to send out email notifications when things go bad (for example, when components are down). I was hoping to show it at the conference next month and get feedback on whether the community would be interested in seeing it as a new sub-project, but if there's interest before that we can look at doing it sooner.
In full disclosure, we started on it back in the v1 days and are working hard to port it to v2.
-Doug
On Thu, Aug 1, 2013 at 12:39 AM, Mark Rushakoff <mrush...@pivotallabs.com> wrote:
Whichever tool we use, our major use cases include:
- Record arbitrary time series data
- View charts from the recorded data and from realtime data
- Alert on-call support when the recorded data meets certain predefined criteria
What do you say, community? Should we go forward with statsd?
+1 for statsd as the collection mechanism.
To unsubscribe from this group and stop receiving emails from it, send an email to vcap-dev+u...@cloudfoundry.org.