Cloud Foundry Runtime Architecture Discussion: statsd?


Mark Rushakoff

Aug 1, 2013, 2:39:06 AM
to vcap...@cloudfoundry.org

On the Cloud Foundry Runtime team, we're beginning to discuss monitoring for deployments that may not have internet access. I was hoping to get feedback from the community on this decision, which will likely affect the architecture of Cloud Foundry.

To briefly go over where we are and how we got here:

  • Prior to AWS, we were using OpenTSDB
  • We decided to use CloudWatch for Amazon AWS deployments of Cloud Foundry
  • We decided to use Datadog instead of CloudWatch
    • IIRC the reasons we used Datadog included but were not limited to:
      • Higher resolution when viewing charts
      • More charting functions
      • Out-of-the-box integration with PagerDuty
      • Costs money now, but lets us focus on higher priorities until appropriate to revisit

I really like Datadog. It's easy to use, the charts look great, and their support has been fantastic. But Datadog just won't work in everyone's circumstances (e.g. non-AWS deployments with no internet access). We're about at the point where it's time to consider options for monitoring that will not require external internet access.

We're thinking that statsd is our most obvious candidate to replace Datadog. It seems to be a buzzworthy tool, and Datadog can optionally use a customized statsd backend (potentially saving us some work).

However, many of our metrics currently rely on tags, which are specific to Datadog. For instance, we record the CPU load average of each component and use tags to specify that a data point has a job of "dea" and an index of 0. Then we can filter the data by tags, so we can see the mean CPU load average across all DEAs, or show a dashboard with a separate chart for each Cloud Controller. Once I learned that we were accomplishing this through a non-standard statsd backend, I began to wonder whether needing tags at all was a statsd antipattern. Surely our use case is very typical for users of statsd.

So it's going to take some non-trivial effort to switch to statsd, and we haven't even touched on frontends yet. I don't think I've talked to anyone on the team who has had significant experience using statsd, so we're not certain that statsd is the best choice.

Whichever tool we use, our major use cases include:

  • Record arbitrary time series data
  • View charts from the recorded data and from realtime data
  • Alert on-call support when the recorded data meets certain predefined criteria

What do you say, community? Should we go forward with statsd?

Mike Youngstrom

Aug 1, 2013, 12:34:41 PM
to vcap...@cloudfoundry.org
I'm not very experienced with monitoring tools and metric data collection and such.

However, in the absence of a non-internet solution from you guys, we created a simple collector historian that just puts the data into a relational database. This seems to work well enough for us at the moment. It allows us to use our enterprise reporting solution to create charts and dashboards, as well as Nagios for component monitoring and alerts.

I'd be happy to clean it up and contribute it if you thought it'd be useful to anyone else.

The schema is rather normalized, with the tags and such, so queries are somewhat ugly, but it seems to work.
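
To illustrate why a normalized tag schema tends to produce awkward queries, here is a minimal sketch using SQLite. The table names, column names, and sample values are assumptions made up for the example; they are not the actual historian schema, which is not shown in this thread.

```python
import sqlite3

# In-memory database; a real historian would use a persistent store.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE metric (
        id INTEGER PRIMARY KEY, name TEXT, value REAL, ts INTEGER
    );
    -- One row per tag per data point (hypothetical layout).
    CREATE TABLE metric_tag (
        metric_id INTEGER REFERENCES metric(id), key TEXT, value TEXT
    );
""")


def record(name, value, ts, tags):
    """Store one data point plus one metric_tag row for each tag."""
    cur = conn.execute(
        "INSERT INTO metric (name, value, ts) VALUES (?, ?, ?)",
        (name, value, ts),
    )
    conn.executemany(
        "INSERT INTO metric_tag (metric_id, key, value) VALUES (?, ?, ?)",
        [(cur.lastrowid, k, str(v)) for k, v in tags.items()],
    )


def average_for_job(name, job):
    """Average a metric over every data point tagged with the given job."""
    row = conn.execute(
        """
        SELECT AVG(m.value)
        FROM metric m
        JOIN metric_tag t ON t.metric_id = m.id
        WHERE m.name = ? AND t.key = 'job' AND t.value = ?
        """,
        (name, job),
    ).fetchone()
    return row[0]


record("cpu_load_avg", 0.5, 1375340000, {"job": "dea", "index": 0})
record("cpu_load_avg", 1.5, 1375340000, {"job": "dea", "index": 1})
record("cpu_load_avg", 9.0, 1375340000, {"job": "cloud_controller", "index": 0})

dea_avg = average_for_job("cpu_load_avg", "dea")
```

Filtering on a second tag (say, both job and index) needs another join against `metric_tag`, which is where queries against this kind of layout start to get ugly.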

I'm interested in seeing what you come up with for a true non-public-internet solution.

Mike

dug...@gmail.com

Aug 2, 2013, 9:45:17 AM
to vcap...@cloudfoundry.org
We've been working on what we call an "AdminUI".  It started out as just a tool to nicely show all of the useful data that the /varz endpoints return.  But as we organized the data, it quickly became the main monitoring tool our admins use to keep track of the system.  Although it started out as a read-only tool, we've begun to add more things, like the ability to kick off new DEAs and to send out email notifications when things go bad (for example, when components are down).  I was hoping to show it at the conference next month and get feedback on whether the community would be interested in seeing it as a new sub-project, but if there's interest before that we can look at doing it sooner.

In full disclosure, we started on it back in the v1 days and are working hard to port it to v2.

-Doug

David Laing

Aug 2, 2013, 10:41:09 AM
to vcap...@cloudfoundry.org
There is definitely interest - https://groups.google.com/a/cloudfoundry.org/forum/#!searchin/vcap-dev/cf-console/vcap-dev/-qaQqPWXlpM/4JdIkJ5WWUAJ

Any chance you could release your AdminUI to github.com/cloudfoundry-community ?
--
David Laing
Open source @ City Index - github.com/cityindex
http://davidlaing.com
Twitter: @davidlaing

Jamie Van Dyke

Aug 2, 2013, 11:06:08 AM
to vcap...@cloudfoundry.org
I hate to be a 'me too', but I'm also building a web dashboard. It covers BOSH and Cloud Foundry. Give me a few weeks and I'll have something you can all get your teeth into.

In simple terms, it's a set of Rails workers, an API, and a JS front end. In its current state the workers poll the BOSH and CF APIs for information and put it in a Redis DB. The JS front end queries the Rails API, which pulls the information out of Redis. That way there are no blocking requests.
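
The poll-and-cache pattern described above can be sketched roughly as follows. This is an illustration only, in Python instead of Rails, with an in-memory class standing in for Redis and hard-coded lambdas standing in for the BOSH and CF API calls; all names and key formats are hypothetical.

```python
import json
import time


class Cache:
    """In-memory stand-in for Redis: string keys mapped to string values."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def poll_once(cache, sources, now=time.time):
    """Worker side: call each (possibly slow) fetcher and cache the result as JSON."""
    for name, fetch in sources.items():
        payload = {"fetched_at": now(), "data": fetch()}
        cache.set(f"dashboard:{name}", json.dumps(payload))


def read_for_frontend(cache, name):
    """API side: answer from the cache only, never from the upstream API."""
    raw = cache.get(f"dashboard:{name}")
    return json.loads(raw) if raw is not None else None


# Hypothetical fetchers; real workers would call the BOSH and CF APIs here.
sources = {
    "bosh_vms": lambda: [{"job": "dea", "index": 0, "state": "running"}],
    "cf_apps": lambda: [{"name": "example-app", "instances": 2}],
}

cache = Cache()
poll_once(cache, sources, now=lambda: 1375455968)
vms = read_for_frontend(cache, "bosh_vms")
```

Because the front end only ever reads cached values, a slow upstream API delays the next refresh but does not block a dashboard request.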

Urgent issues at work have put it on hold for a week, but I'll be back on it soon.

--
Jamie van Dyke: Chief Science Officer at PharmMD Inc.
phone: 615-713-2020
web: www.pharmmd.com
twitter: @fearoffish

David Laing

Aug 2, 2013, 12:47:47 PM
to vcap-dev

Yay!

Mike Youngstrom

Oct 17, 2013, 6:23:31 PM
to vcap...@cloudfoundry.org
Have you considered adding support for vcOPS?  It could be a good fit for a number of organizations, especially those using VMware.


Mike



Wayne E. Seguin

Oct 17, 2013, 10:18:58 PM
to vcap...@cloudfoundry.org
On Thu, Aug 1, 2013 at 12:39 AM, Mark Rushakoff <mrush...@pivotallabs.com> wrote:

Whichever tool we use, our major use cases include:

  • Record arbitrary time series data
  • View charts from the recorded data and from realtime data
  • Alert on-call support when the recorded data meets certain predefined criteria

What do you say, community? Should we go forward with statsd?


StatsD is an excellent tool choice for the first use case, collecting metrics. Use bucket namespacing schemes instead of tags.
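
As a rough sketch of what a bucket namespacing scheme could look like, the example below folds the job and index that would otherwise be tags into a dot-separated bucket name and formats a gauge in the statsd plain-text wire format. The `deployment.job.index.metric` ordering and all sample names are assumptions chosen for illustration, not an agreed Cloud Foundry convention.

```python
import socket


def bucket_name(deployment, job, index, metric):
    """Fold what would be tags into a dot-separated statsd bucket name."""
    parts = [deployment, job, str(index), metric]
    # Dots separate hierarchy levels, so replace any dots inside a part.
    return ".".join(p.replace(".", "_") for p in parts)


def gauge_line(bucket, value):
    """Format a statsd gauge as <bucket>:<value>|g."""
    return f"{bucket}:{value}|g"


def send_gauge(bucket, value, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; 8125 is the conventional statsd port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(gauge_line(bucket, value).encode("ascii"), (host, port))


line = gauge_line(bucket_name("cf", "dea", 0, "cpu_load_avg"), 0.42)
```

With a hierarchy like this, a Graphite front end can approximate tag filtering with wildcards, for example `averageSeries(cf.dea.*.cpu_load_avg)` for the average across all DEAs.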

For the other use cases, charting and alerting, a search and discussion of those spaces should be done independently, and the results then tied together.

  ~Wayne


--
  ~Wayne

Wayne E. Seguin
wayneeseguin on irc.freenode.net

David Laing

Oct 18, 2013, 3:17:16 AM
to vcap-dev

+1 for statsd as the collection mechanism.


si...@simonjohansson.com

Mar 13, 2014, 8:34:44 AM
to vcap...@cloudfoundry.org
Sorry to bring up an old topic like this.

But we already have graphite/statsd infrastructure and would love to use it to collect stats for debugging and alerting.

Please let me know if this is not on the road map, in that case I will probably pick it up since we need it. :)

David Lee

Mar 13, 2014, 10:39:19 AM
to vcap...@cloudfoundry.org
Hi Simon,

On the Pivotal CF side of things we are just about to release our Ops Metrics add-on. This tool takes all the varz and BOSH health data and exposes them back out via JMX.

We've considered releasing this on the OSS side but:
1. We aren't sure if the community prefers some other protocol (such as statsd) and wouldn't find JMX that useful.
2. We wanted to make more changes to the upstream (varz), which might make a better integration point.

(Also, we do have plans to rework the generated metrics to make it easier to interpret.)

Do you have any other constraints on your side? For example, if JMX were to become available soon, would using another project (say jmxtrans) to get the data over to statsd be okay? Do you have any debugging and alerting processes that would not be well solved by this?

Thanks,

-Dave





Simon Johansson

Mar 13, 2014, 11:36:54 AM
to vcap...@cloudfoundry.org
Well, JMX exporting is better than no exporting :) Also, the more open source stuff there is in the ecosystem, the more traction CF will get. I say release it!

I've been bitten by jmxtrans before, where it basically ate up all my FDs because I had lots and lots of checks.
I'm sure there are other JMX -> statsd bridges out there.

Also I've poked around a bit in the collector code base and it seems like it would be trivial to add Statsd support.


Simon Johansson

Mar 17, 2014, 6:55:39 AM
to vcap...@cloudfoundry.org
I just sent a pull request for adding graphite as a historian for collector.
I chose graphite directly over statsd since everything will be under an index key, so bucketing is not necessary.
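
For readers unfamiliar with writing to Graphite directly, here is a minimal sketch of its plaintext protocol, where each data point is a line of the form `<path> <value> <timestamp>`. This is not the code from the pull request; the metric path and helper names are hypothetical, and 2003 is the conventional Carbon plaintext port.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one data point for Graphite's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_to_graphite(lines, host="127.0.0.1", port=2003):
    """Send pre-formatted lines to a Carbon listener over TCP."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


# Hypothetical path: job and index are already part of the key.
line = graphite_line("cf.dea.0.cpu_load_avg", 0.42, 1395053739)
```

Since each data point already carries its own timestamp and fully qualified path, nothing needs to be aggregated over a flush interval first, which is the step statsd would otherwise perform.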

James Bayer

Mar 17, 2014, 11:05:46 AM
to vcap...@cloudfoundry.org
thanks for the PR simon! at pivotal, we use a customized hosted version of graphite for run.pivotal.io, but we use the DataDogHQ plugin which is specific to their SaaS.

david lee is the PM for metrics and logging and he and a small team are working on overhauling the system and app metrics architecture to move from the varz/collector approach to something that looks conceptually like loggregator. instead of having separate collection and transport mechanisms and systems for system logs, app logs, system metrics and app metrics we want to have a unified multi-tenant system whereby the system logs/metrics are just another tenant (albeit with special configs perhaps). the team is about ready to share some material with the community for feedback and review including write-up and diagrams, etc.

the reason why that is important is that there are tradeoffs for spending time in the collector code base vs delivering on this unified approach. i'll ask david and his team to review this submission and hopefully it's something we can accept easily even if we call it an experimental community-contributed feature.
--
Thank you,

James Bayer

Simon Johansson

Mar 17, 2014, 11:46:49 AM
to vcap...@cloudfoundry.org
 instead of having separate collection and transport mechanisms and systems for system logs, app logs, system metrics and app metrics we want to have a unified multi-tenant system whereby the system logs/metrics are just another tenant.

Sounds really interesting. Looking forward to it :) Do you have any ETA?

the reason why that is important is that there are tradeoffs for spending time in the collector code base vs delivering on this unified approach. i'll ask david and his team to review this submission and hopefully it's something we can accept easily even if we call it an experimental community contributed feature.

As luck would have it, my code is 100% flawless, so a merge should be no problem ;).

But in the highly unlikely event that my PR would be deemed unfit, no hard feelings, if it allows you guys to deliver shiny new cool stuff quicker! 
