|Cloud Foundry Runtime Architecture Discussion: statsd?||Mark Rushakoff||7/31/13 11:39 PM|
On the Cloud Foundry Runtime team, we're beginning to discuss monitoring for deployments that may not have internet access. I was hoping to get feedback from the community on this decision, which will likely affect the architecture of Cloud Foundry.
To briefly go over where we are and how we got here:
I really like Datadog. It's easy to use, the charts look great, and their support has been fantastic. But Datadog just won't work in everyone's circumstances (e.g. non-AWS deployments with no internet access). We're about at the point where it's time to consider options for monitoring that will not require external internet access.
We're thinking that statsd is our most obvious candidate to replace Datadog. It seems to be a buzzworthy tool, and Datadog can optionally use a customized statsd backend (potentially saving us some work).
However, many of our metrics currently rely on tags, which are specific to Datadog. For instance, we record the CPU load average of each component and use tags to specify that a data point came from the job "dea" at index 0. We can then filter the data by tags, so we could see the mean CPU load average across all DEAs, or show a dashboard with a separate chart for each Cloud Controller. Once I learned that we were accomplishing this through a non-standard statsd backend, I began to wonder if needing tags at all was a statsd antipattern. Surely our use case is very typical for users of statsd.
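To make the tag-based model concrete, here is a minimal Python sketch of recording tagged data points and then filtering or averaging them by tag. It is not Datadog's API: the `DataPoint` type, the `tagged` and `select` helpers, the metric name `cpu.load_avg`, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass(frozen=True)
class DataPoint:
    """One metric sample with free-form tags (hypothetical type for illustration)."""
    metric: str
    value: float
    tags: frozenset = field(default_factory=frozenset)


def tagged(metric, value, **tags):
    # Store tags as "key:value" strings so they can be matched as a set.
    return DataPoint(metric, value, frozenset(f"{k}:{v}" for k, v in tags.items()))


def select(points, metric, **tags):
    """Return the values of all points for `metric` that carry every requested tag."""
    wanted = {f"{k}:{v}" for k, v in tags.items()}
    return [p.value for p in points if p.metric == metric and wanted <= p.tags]


# Sample values are made up.
points = [
    tagged("cpu.load_avg", 0.50, job="dea", index=0),
    tagged("cpu.load_avg", 1.50, job="dea", index=1),
    tagged("cpu.load_avg", 0.25, job="cloud_controller", index=0),
]

# Mean load across all DEAs, then the series for a single Cloud Controller.
dea_average = mean(select(points, "cpu.load_avg", job="dea"))
cc_zero = select(points, "cpu.load_avg", job="cloud_controller", index=0)
```

The point of the sketch is that the same samples can be sliced along any tag after the fact, which is the capability a plain statsd setup would need to reproduce some other way.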
So it's going to take some non-trivial effort to switch to statsd, and we haven't even touched on frontends yet. I don't think I've talked to anyone on the team who has had significant experience using statsd, so we're not certain that statsd is the best choice.
Whichever tool we use, our major use cases include:
What do you say, community? Should we go forward with statsd?
|Re: [vcap-dev] Cloud Foundry Runtime Architecture Discussion: statsd?||Mike Youngstrom||8/1/13 9:34 AM|
I'm not very experienced with monitoring tools and metric data collection and such.
However, in the absence of a non-internet solution from you guys, we created a simple collector historian that just puts the data into a relational database. This seems to work well enough for us at the moment. It lets us use our enterprise reporting solution to create charts and dashboards, and Nagios for component monitoring and alerts.
I'd be happy to clean it up and contribute it if you thought it'd be useful to anyone else.
The schema is rather normalized with the tags and such so queries are somewhat ugly but it seems to work.
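As an illustration of why a normalized tag schema makes queries awkward, here is a minimal Python sketch using the standard-library `sqlite3` module. This is not the actual historian schema, which was not posted: the table names, column names, `record` helper, and sample rows are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE metric (
        id          INTEGER PRIMARY KEY,
        name        TEXT    NOT NULL,
        value       REAL    NOT NULL,
        recorded_at INTEGER NOT NULL
    );
    -- One row per tag, so a metric with two tags has two rows here.
    CREATE TABLE metric_tag (
        metric_id INTEGER NOT NULL REFERENCES metric(id),
        tag_key   TEXT    NOT NULL,
        tag_value TEXT    NOT NULL
    );
""")


def record(name, value, recorded_at, **tags):
    """Insert one sample and its tags (hypothetical helper)."""
    cur = conn.execute(
        "INSERT INTO metric (name, value, recorded_at) VALUES (?, ?, ?)",
        (name, value, recorded_at),
    )
    conn.executemany(
        "INSERT INTO metric_tag (metric_id, tag_key, tag_value) VALUES (?, ?, ?)",
        [(cur.lastrowid, k, str(v)) for k, v in tags.items()],
    )


# Sample values and the timestamp are made up.
record("cpu.load_avg", 0.50, 1375315200, job="dea", index=0)
record("cpu.load_avg", 1.50, 1375315200, job="dea", index=1)
record("cpu.load_avg", 0.25, 1375315200, job="cloud_controller", index=0)

# Filtering on a tag needs a join; filtering on two tags needs two joins.
avg_dea_load = conn.execute(
    """
    SELECT AVG(m.value)
    FROM metric m
    JOIN metric_tag t ON t.metric_id = m.id
    WHERE m.name = ? AND t.tag_key = 'job' AND t.tag_value = ?
    """,
    ("cpu.load_avg", "dea"),
).fetchone()[0]
```

Each additional tag in a filter adds another self-join on the tag table, which is presumably the ugliness referred to above.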
I'm interested in seeing what you come up with for a true non-public-internet solution.
|Re: Cloud Foundry Runtime Architecture Discussion: statsd?||dug...@gmail.com||8/2/13 6:45 AM|
We've been working on what we call an "AdminUI". It started out as just a tool to nicely show all of the useful data that the /varz endpoints return. But as we organized the data, it quickly became the main monitoring tool our admins use to keep track of the system. Although it started out as a read-only tool, we've begun to add more things, like the ability to kick off new DEAs and to send out email notifications when things go bad (for example, when components are down). I was hoping to show it at the conference next month and get feedback on whether the community would be interested in seeing it as a new sub-project, but if there's interest before that we can look at doing it sooner.
In full disclosure, we started on it back in the v1 days and are working hard to port it to v2.
|Re: [vcap-dev] Cloud Foundry Runtime Architecture Discussion: statsd?||David Laing||8/2/13 7:41 AM|
There is definitely interest - https://groups.google.com/a/cloudfoundry.org/forum/#!searchin/vcap-dev/cf-console/vcap-dev/-qaQqPWXlpM/4JdIkJ5WWUAJ
Any chance you could release your AdminUI to github.com/cloudfoundry-community?
Open source @ City Index - github.com/cityindex
|Re: [vcap-dev] Cloud Foundry Runtime Architecture Discussion: statsd?||Jamie Van Dyke||8/2/13 8:06 AM|
I hate to be a 'me too', but I'm also building a web dashboard. It covers BOSH and Cloud Foundry. Give me a few weeks and I'll have something you can all get your teeth into.
In simple terms, it's a set of Rails workers, an API and a JS front end. In its current state the workers poll the BOSH and CF APIs for information and put it in a Redis database. The JS front end queries the Rails API, which pulls the information out of Redis. That way there are no blocking requests.
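The decoupling described here (workers fill a cache, the API only reads from it) can be sketched in a few lines of Python. This is an illustration of the pattern, not the dashboard's code: an in-memory `Cache` class stands in for Redis, and the key names and fetcher functions are hypothetical.

```python
import threading


class Cache:
    """In-memory stand-in for the Redis store (assumption for illustration)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def set(self, key, value):
        with self._lock:
            self._data[key] = value

    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)


def poll_once(cache, sources):
    """Worker step: call each (possibly slow) upstream fetcher and cache the result."""
    for key, fetch in sources.items():
        cache.set(key, fetch())


def run_worker(cache, sources, interval, stop):
    # Background loop; `stop` is a threading.Event used to end the worker.
    while not stop.is_set():
        poll_once(cache, sources)
        stop.wait(interval)


def api_read(cache, key):
    """API handler: answers from the cache only, so it never waits on BOSH or CF."""
    return cache.get(key, {"status": "pending"})


# Hypothetical fetchers standing in for calls to the BOSH and CF APIs.
sources = {
    "bosh:vms": lambda: ["dea/0", "dea/1"],
    "cf:apps": lambda: {"count": 3},
}

cache = Cache()
before = api_read(cache, "cf:apps")  # nothing polled yet
poll_once(cache, sources)
after = api_read(cache, "cf:apps")
```

The trade-off is that the front end sees data that is at most one polling interval old, in exchange for reads that return immediately.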
Urgent issues at work have put it on hold for a week, but I'll be back on it soon.
|Re: [vcap-dev] Cloud Foundry Runtime Architecture Discussion: statsd?||David Laing||8/2/13 9:47 AM|
|Re: [vcap-dev] Cloud Foundry Runtime Architecture Discussion: statsd?||Mike Youngstrom||10/17/13 3:23 PM|
Have you considered adding support for vcOPS? It could be a good fit for a number of organizations, especially those using VMware.
|Re: [vcap-dev] Cloud Foundry Runtime Architecture Discussion: statsd?||Wayne E. Seguin||10/17/13 7:18 PM|
StatsD is an excellent tool choice for solving the first use case for collecting metric names. Use bucket namespacing schemes instead of tags.
For the other use cases, charting and alerting, each of those spaces should be researched and discussed independently and then tied together.
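A minimal Python sketch of what bucket namespacing could look like: the values that were Datadog tags (job and index) become levels in a dotted bucket name, which is then sent as a StatsD gauge. The `cf` prefix, the ordering of the levels, and the helper names are assumptions, not an agreed convention.

```python
import socket


def bucket_name(metric, job, index, prefix="cf"):
    """Fold what would have been tags into a dotted StatsD bucket name."""
    parts = [prefix, job, str(index), metric]
    # Dots separate namespace levels, so replace any dots inside a component.
    return ".".join(p.replace(".", "_") for p in parts)


def gauge_packet(bucket, value):
    """Encode a gauge in the StatsD line format: <bucket>:<value>|g"""
    return f"{bucket}:{value}|g".encode("ascii")


def send(packet, host="127.0.0.1", port=8125):
    # StatsD listens on UDP (port 8125 by default); sends are fire-and-forget.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(packet, (host, port))


bucket = bucket_name("cpu.load_avg", job="dea", index=0)
packet = gauge_packet(bucket, 0.5)
# send(packet)  # uncomment with a StatsD daemon listening on localhost
```

With a Graphite backend, a wildcard target such as `averageSeries(cf.dea.*.cpu_load_avg)` could then stand in for the "all DEAs" tag filter, though the hierarchy has to be chosen up front rather than sliced arbitrarily afterwards.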
Wayne E. Seguin
wayneeseguin on irc.freenode.net
|Re: [vcap-dev] Cloud Foundry Runtime Architecture Discussion: statsd?||David Laing||10/18/13 12:17 AM|
|Re: Cloud Foundry Runtime Architecture Discussion: statsd?||Simon Johansson||3/13/14 5:34 AM|
Sorry to bring up an old topic like this.
But we already have graphite/statsd infrastructure and would love to use it to collect stats for debugging and alerting.
Please let me know if this is not on the roadmap; in that case I will probably pick it up, since we need it. :)
|Re: [vcap-dev] Re: Cloud Foundry Runtime Architecture Discussion: statsd?||David Lee||3/13/14 7:39 AM|
On the Pivotal CF side of things we are just about to release our Ops Metrics add-on. This tool takes all the varz and BOSH health data and exposes them back out via JMX.
We've considered releasing this on the OSS side but:
1. We aren't sure if the community prefers some other protocol (such as statsd) and wouldn't find JMX that useful.
2. We wanted to make more changes to the upstream (varz), which might make a better integration point.
(Also, we do have plans to rework the generated metrics to make them easier to interpret.)
Do you have any other constraints on your side? For example, if JMX were to become available soon, would using another project (say jmxtrans) to get the data over to statsd be okay? Do you have any debugging and alerting processes that would not be well solved by this?
|Re: [vcap-dev] Re: Cloud Foundry Runtime Architecture Discussion: statsd?||Simon Johansson||3/13/14 8:36 AM|
Well, JMX exporting is better than no exporting :) Also, the more open source stuff there is in the ecosystem, the more traction CF will get. I say release it!
I've been bitten by jmxtrans before, where it basically ate up all my FDs because I had lots and lots of checks.
I'm sure there are other JMX -> Statsd bridges out there.
Also I've poked around a bit in the collector code base and it seems like it would be trivial to add Statsd support.
|Re: [vcap-dev] Re: Cloud Foundry Runtime Architecture Discussion: statsd?||Simon Johansson||3/17/14 3:55 AM|
I just sent a pull request for adding graphite as a historian for collector.
I chose graphite directly over statsd since everything will be under an index key, so bucketing is not necessary.
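For readers who have not written to Graphite directly, here is a minimal Python sketch of its plaintext protocol, where each sample is one line of the form `<metric path> <value> <unix timestamp>`. This is not the code in the pull request: the path layout, helper names, and sample values are assumptions.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    # Plaintext protocol: "<metric path> <value> <unix timestamp>" plus a newline.
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def send_lines(lines, host="127.0.0.1", port=2003):
    # Carbon's plaintext listener accepts newline-terminated lines over TCP
    # (port 2003 by default); host and port here are placeholders.
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


# Hypothetical path layout: job and index are levels in the metric path.
line = graphite_line("cf.dea.0.cpu_load_avg", 0.5, timestamp=1375315200)
# send_lines([line])  # uncomment with a carbon-cache listening on localhost
```

Because every sample already carries its own path and timestamp, it can be written straight to Carbon without a StatsD daemon in between.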
|Re: [vcap-dev] Re: Cloud Foundry Runtime Architecture Discussion: statsd?||James Bayer||3/17/14 8:05 AM|
thanks for the PR simon! at pivotal, we use a customized hosted version of graphite for run.pivotal.io, but we use the DataDogHQ plugin which is specific to their SaaS.
david lee is the PM for metrics and logging and he and a small team are working on overhauling the system and app metrics architecture to move from the varz/collector approach to something that looks conceptually like loggregator. instead of having separate collection and transport mechanisms and systems for system logs, app logs, system metrics and app metrics we want to have a unified multi-tenant system whereby the system logs/metrics are just another tenant (albeit with special configs perhaps). the team is about ready to share some material with the community for feedback and review including write-up and diagrams, etc.
the reason why that is important is that there are tradeoffs for spending time in the collector code base vs delivering on this unified approach. i'll ask david and his team to review this submission and hopefully it's something we can accept easily even if we call it an experimental community-contributed feature.
|Re: [vcap-dev] Re: Cloud Foundry Runtime Architecture Discussion: statsd?||Simon Johansson||3/17/14 8:46 AM|
> instead of having separate collection and transport mechanisms and systems for system logs, app logs, system metrics and app metrics we want to have a unified multi-tenant system whereby the system logs/metrics are just another tenant.
Sounds really interesting. Looking forward to it :) Do you have any ETA?
> the reason why that is important is that there are tradeoffs for spending time in the collector code base vs delivering on this unified approach. i'll ask david and his team to review this submission and hopefully it's something we can accept easily even if we call it an experimental community contributed feature.
As luck would have it, my code is 100% flawless, so a merge should be no problem ;).
But in the highly unlikely event that my PR would be deemed unfit, no hard feelings, if it allows you guys to deliver shiny new cool stuff quicker!