Whether or not you'd need other tools really depends on what you want out of it.
Have a chat with Splunk sales - they can provide you with technical contacts - we got them in for a full PoC before even deciding whether Splunk was the best choice (it's a great product, but it's pretty expensive!)
Other solutions are available too - I think logstash is a free equivalent for certain use cases - I'd imagine Splunk's UI is likely superior.
Will
The always crafty @portertech has been developing 'tails' recently which is also worth a look for realtime access to server logs. There's a demo @ http://portertech.no.de/
Cheers,
Anthony
http://crankstations.com
@anthonygoddard
Vladimir
On Sun, 14 Aug 2011, Anthony Goddard wrote:
> Graylog2 is worth a mention here too, we use it for exactly this purpose - giving devs fast access to the log data they need. http://graylog2.org
> I've also used it to trace and identify problems across multiple hosts simultaneously, which is pretty neat. Logs are metrics too... ftw.
> anyone have experience with using splunk in a dev ops environment? I'm
> looking for some feedback. We are looking at it as a tool that would
> give developers access to both historical as well as "real time" (i.e.
> App or web server startup) logs without granting them direct access to
> the servers. I'm also interested in its log correlation features that
> would help ops to do problem determination across web, app and db
> servers simultaneously.
We use splunk extensively here at NI. We love it. Up front, I will say it's expensive as hell and is a significant chunk of our systems management budget when stacked up against open source and SaaS stuff. However, we've decided it's totally worth it. Why?
Because there's a difference between "ops tools" and "devops tools." We talk about a lot of tools that pretty much only a sysadmin could love (Nagios, I'm looking at you). There is indeed an interesting explosion of log aggregation and management options especially given the NoSQL space. But I judge DevOps tools based on how much developers enjoy using them. The entire point is to enable collaboration via exposing information to them and empowering them to self-service.
This is best illustrated via a timely anecdote. We had a Splunk implementation for our internal Web systems which I used to manage. When we started up a new group in R&D for creation of SaaS products, that team started from scratch tooling-wise. We implemented other more critical stuff first (monitoring, for example), but log management was in the second tier and we looked at the field and re-selected splunk. Our ops team started implementing it (setting up forwarders across UNIX and Windows Amazon instances and Microsoft Azure as well).
During this process I got an email from our FPGA Compile Cloud developers, saying "Hey can we get the ssh keys for the boxes, or can you write us a script to get us the logs or something..." I pointed them at the dev splunk server login. Suddenly I get this IM:
Splunk can do the job but depending on volume, it's going to cost you.
You can get around it with various tricks in aggregation but only to a
degree (and I would argue at the expense of using it to its fullest).
There are some opensource alternatives depending on the functionality
of splunk you're after:
- Graylog2 has a great web interface and supports syslog and GELF
(Graylog Extended Log Format)
- Logstash has agent and centralized logging modes.
We're using a combination of the two. Graylog2 is hampered by its
usage of MongoDB (imho) and thus we can only really use it for
near-time log data - about the last 4-5 hours. We use a combination of
logstash agents (with gelf output) and the GELF log4j appender to ship
the logs over to it.
I'm looking at a long term strategy using the logstash server
implementation. In server mode, a centralized logstash instance
accepts logs and shovels them into Elastic Search. It provides a basic
web application for searching the logs. The nice part is that
ElasticSearch is "infinitely scalable" (I know, I know).
The nice thing about logstash as an agent is you can easily multiplex
log destinations so we can continue to ship stuff to Graylog2 using
the GELF output but also ship it to the logstash server instance for
long-term archival.
The only "downside" to logstash is that it only runs under JRuby. That
might turn you off but the upshot is that there is a single jarfile
that you can download and run with an embedded ES instance. The agent
mode is pretty much awesome in a bag because it can input, filter, and
output in so many different ways.
I personally always default to an opensource project of some kind if
it exists until I know what I need. Logstash is a pretty safe way to
do that.
--
John E. Vincent
http://about.me/lusis
What's the pain point on Mongo?
--
Nathaniel Eliot
T9 Productions
--
Opscode, Inc.
Adam Jacob, Chief Product Officer
T: (206) 619-7151 E: ad...@opscode.com
We have a m1.xlarge for mongo and our capped collection is 10GB.
That's the most we could fit (with the additional indexes needed for
our search patterns) to be useful. I couldn't justify bumping up
another instance size for this.
devops-t...@googlegroups.com wrote on 08/15/2011 01:36:37 PM:
> +1 for Splunk as well. We use it to monitor our VMware virtual
> infrastructure and several home grown apps. Expensive compared to
> opensource? Hell ya, but it has a huge ROI. You can spend the same
> amount of money on cobbling together your own logging environment.
> Some things are just worth paying for.
I totally agree - I think the hidden costs of open source are often not sufficiently calculated.
At the most fundamental level, if I have to spend an extra man-month cobbling stuff together, that's a lot of money right there and can easily justify a five figure buy. But even that is low - that's thinking from the old world of "IT/ops is just a cost center, they would just be screwing around if they weren't implementing some open source stuff." In the new world we are helping drive innovation and product features and there's an opportunity cost to our time - that man-month is a man-month I'm not working on getting the next greatest thing to market. Companies pay people's salaries because they expect to make 10x+ that amount on their backs in revenue... Time is money.
A lot of the time, there's not something good enough to buy. Before Splunk, the best efforts were the LogLogics of the world which I regard as "write only" log stores useful for compliance and audit but not for ever looking at the logs for troubleshooting etc. I don't buy in every niche, I FOSS it up. But I don't mind doing it in this case.
devops-t...@googlegroups.com wrote on 08/15/2011 12:21:08 PM:
> Can you elaborate on what other tools you tried, and what makes
> Splunk better? I'm admittedly coming at this from an open-source
> perspective; I'd like to know where the best-of-breed FOSS options
> still fall down, in the hopes of bridging that gap.
Sure. With Splunk, it automatically pulls in and understands many different log types by default; it makes them searchable in a Google-like interface. It automatically does field extraction and generates faceted navigation even for unfamiliar types; you can drill down/exclude data with a click on the results or on the timeline. The cool thing here is that you don't have to be sending syslog or anything, it understands arbitrary logs in native formats.
It also has built-in apps for UNIX and Windows metrics.
It has a rich forwarder/indexer/etc. architecture that you can scale up as much as you want. We have light forwarders on each node in Amazon and Azure; these push to an intermediate caching forwarder specific to the environment (our test env for UI Builder, for example) and that forwards to our central server.
People can easily create saved reports, alerts, and dashboards; this isn't programming, it's simple business user accessible configuration. I'll be honest, I consider most open source graphing to be a bit of a joke. "Two lines on one graph? Inconceivable!"
When it comes down to it, it's that
1. It requires very little configuration to do the job
2. The UI and usability are extreme
I'm sure there's good FOSS out there too, but I've never seen anything that comes close on those two fronts.
E
Scott M
Some people, when confronted with a problem, think “I know, I'll use cacti.” Now they have two problems.
> Some people, when confronted with a problem, think “I know, I'll
> use cacti.” Now they have two problems.
LOL, yeah, I have yet to be shown a RRD/cacti graph that doesn't make me want to hit someone in the face. God forbid that real engineering data visualization was stuck in that era.
Ernest
So I assume, since you are doing REAL engineering, you don't do that x86 scatter computing crap right ;-) Strictly s390 based z196 mainframes with Unified Resource Manager right ;-) Then you can get some real purdy graphs.
Scott M
Regarding alternatives to Splunk: I have implemented logstash but have
not yet rolled it out to all my nodes, though I intend to. I particularly
like the different outputs that it supports. I'm actually looking into
replacing ganglia-logtailer (predecessor to logster) with the logstash
statsd output plugin.
https://gist.github.com/1124364
Vladimir
While not open source (anymore), I want to point out that LogZilla has a pretty slick web interface and it scales as well as Splunk. Also LZ is way easier to use and WAY less expensive.
Clayton Dukes has a whitepaper on Cisco.com:
http://www.cisco.com/en/US/technologies/collateral/tk869/tk769/white_paper_c11-557812.html
Best Regards
Scott M
On this topic, what graphing/visualisation packages is everyone using these days?
I would love for Cacti to have an API to reconfigure it; automating its configuration would solve a bunch of problems for us. Colleagues have also made similar comments about the look and feel of the graphs produced by RRDTool.
Thanks,
Mark.
Graphite, Reconnoiter, Ganglia, Cacti, in-house stuff.
--
Jason Dixon
DixonGroup Consulting
http://www.dixongroup.net/
More on statsd:
http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/
Nicholas
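For anyone who hasn't looked at what statsd actually receives, here is a rough Python sketch of the wire format described in the Etsy post: one "bucket:value|type" string per UDP datagram. The bucket names, the helper function names and the localhost target are made up for illustration; 8125 is the usual statsd port, but check your own setup.

```python
import socket


def format_statsd(bucket, value, metric_type="c", sample_rate=None):
    """Build one statsd payload, e.g. 'app.logins:1|c' for a counter."""
    payload = f"{bucket}:{value}|{metric_type}"
    # A sample rate below 1 tells statsd the client only sent a fraction
    # of the events, so it can scale the counter back up.
    if sample_rate is not None and sample_rate < 1:
        payload += f"|@{sample_rate}"
    return payload


def send_statsd(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; statsd never replies."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload.encode("ascii"), (host, port))
    finally:
        sock.close()


if __name__ == "__main__":
    # Hypothetical bucket names, for illustration only.
    send_statsd(format_statsd("app.logins", 1))              # counter
    send_statsd(format_statsd("app.render_time", 320, "ms"))  # timer
```

Because it is plain UDP, the application does not block or fail if the statsd daemon is down, which is a large part of why it is cheap to instrument everything.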
I suspect you already know about the cli in Cacti? Here is a really short tutorial for one of the graph projects I maintain.
http://crunchtools.com/software/crunchtools/cacti/graph-mysql-stats/#ScriptedAutomated
It is annoying that Cacti can't be re-configured easily, but the recursive trees of the templates are awesome for certain types of data acquisition such as enumerating all BGP connections or domain names on a server.
As for "nice" looking, when did RRD fall so out of vogue? Maybe I am just becoming an old man. Someone please attach a screenshot or two of a library that replaces rrd that is so much better, please.
Scott M
Check out the video at 3.33 over here:
(disclaimer: I wrote Visage)
I'm happy to use a tool that produces butt-ugly graphs, but RRDtool
generates graphs that obscure and obfuscate valuable data.
RRDtool takes a similar approach to downsampling the data on graphs as
it does to downsampling the data it stores over time, and that's
tremendously dangerous when you need to understand exactly how a
system was functioning at a particular point in time.
I wrote Visage specifically to deal with this problem. Each data point
should be inspectable, and composition of graphs should be dynamic and
user driven.
On the data storage front, I'm starting to use OpenTSDB, and am in the
process of writing a Visage backend for it as well. It natively uses
gnuplot to render graphs, and suffers from the same problem as
RRDtool.
Cheers,
Lindsay
--
w: http://holmwood.id.au/~lindsay/
t: @auxesis
I'm as bitchy about RRD precision as the next guy but it's fair to
admit that you CAN get decent precision storage. IIRC 3 years of 1
minute precision of a single metric is something like 10MB per RRD?
The problem is that everything up until now has operated on 5 minute
increments. I think that's the default step for rrdcreate.
I think there are two reasons for this:
- The Nagios/Cacti/Cricket/et al. kind of tools suck(ed) at being
asked to poll any more often.
- "Things" being polled could be negatively impacted by polling more often than that
Possibly starting with netflow and now statsd/graphite types of apps,
push models are becoming much more prevalent. Mind you there are still
OTHER problems with RRDs but getting the precision you want is
possible without the smoothing.
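The storage figure above is easy to sanity-check. A back-of-the-envelope sketch in Python, assuming one data source, one archive (RRA), 8 bytes per stored value (RRDtool stores doubles) and ignoring the file header, so treat the result as a lower bound rather than an exact file size:

```python
BYTES_PER_VALUE = 8  # RRDtool stores each consolidated value as a double


def rra_rows(years, step_seconds):
    """Number of rows needed to keep `years` of data at one row per step."""
    return years * 365 * 24 * 3600 // step_seconds


def rra_bytes(years, step_seconds, data_sources=1):
    """Approximate size of one archive, ignoring header overhead."""
    return rra_rows(years, step_seconds) * data_sources * BYTES_PER_VALUE


if __name__ == "__main__":
    for step in (60, 300):
        size_mb = rra_bytes(3, step) / 1e6
        print(f"3 years at {step}s step: {rra_rows(3, step)} rows, ~{size_mb:.1f} MB")
```

That works out to roughly 12.6 MB for three years at 1-minute resolution versus about 2.5 MB at the 5-minute step, so the "something like 10MB" recollection is in the right ballpark.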
Since software such as cacti stores an average value in the rrd, it is
hard to compare values between two time periods: it is difficult for
instance to compare values from a 1 min average rrd with values from
30 min or 2 hour average rrds, because averages are "squashed" on the
graphs with the bigger time increments.
One of my ex-colleagues mitigated this problem by creating an RRA to
store the highest value as well and not just the average. This allowed
you for instance to compare max values for bandwidth usage between now
and 12 months ago.
Cheers,
Gildas
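To make the "squashing" concrete, here is a toy Python sketch of what happens when fine-grained samples are consolidated into one coarser bucket. It is not RRDtool code, only an imitation of its AVERAGE and MAX consolidation functions, and the sample values are invented:

```python
def consolidate(samples, factor, how):
    """Collapse every `factor` consecutive samples into a single value.

    `how` imitates an RRDtool consolidation function: "AVERAGE" or "MAX".
    """
    funcs = {
        "AVERAGE": lambda xs: sum(xs) / len(xs),
        "MAX": max,
    }
    fn = funcs[how]
    return [fn(samples[i:i + factor]) for i in range(0, len(samples), factor)]


if __name__ == "__main__":
    # Six 1-minute bandwidth readings (made-up numbers) with one short spike.
    one_minute = [10, 10, 10, 90, 10, 20]
    print(consolidate(one_minute, 6, "AVERAGE"))  # the spike is averaged away
    print(consolidate(one_minute, 6, "MAX"))      # the spike survives
```

With only the AVERAGE archive, the spike to 90 becomes a flat 25 once the data is rolled up; keeping a MAX archive alongside it preserves the peak, which is the trick described above.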
http://vuksan.com/blog/2010/12/14/misconceptions-about-rrd-storage/
As John has pointed out there is no need for averaging.
Vladimir
> On this topic, what graphing/visualisation packages is everyone using these days?
I wrote a time series storage engine because I despise RRD so much:
https://github.com/dsully/circulardb
(The original version actually was written right around the same time as RRD
was released in 1999).
It doesn't include any graphing, but I can recommend client side graphing
using the Flot JavaScript library.
http://code.google.com/p/flot/
--dan
--------------------------------------------------------------
<dsully> please describe web 2.0 to me in 2 sentences or less.
<jwb> you make all the content. they keep all the revenue.
It does. Look in the api/ folder on recent (0.8.7*) versions. It's a
little bit rough in places (not exactly idempotent), but it works OK
for bulk adding of hosts and graphs. Also check out the Autom8 plugin
for adding graphs automatically as configuration changes (e.g. as
interfaces are used).
Howie
Yes, the api is how I do build testing for mysql_stats. It automatically adds the test host and graphs. I also use it for new server builds.
We're also going the statsd+graphite route with a lot of success. For
a graphing dashboard system, we wrote "pencil":
https://github.com/fetep/pencil (uses graphite to render).
It lets you define graphs, have dashboards, global/cluster/host view,
navigation (drill in/out), etc.
--
petef
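For context on the graphite side of that pairing: carbon accepts metrics over a plaintext protocol, one "path value timestamp" line per data point. A minimal Python sketch, assuming a carbon listener on localhost at the usual plaintext port 2003; the metric path and the helper names are hypothetical:

```python
import socket
import time


def format_graphite(path, value, timestamp=None):
    """Build one line of Graphite's plaintext protocol: 'path value timestamp'."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def send_graphite(lines, host="127.0.0.1", port=2003):
    """Send pre-formatted lines to carbon over TCP (assumed host and port)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


if __name__ == "__main__":
    # Only prints the line; call send_graphite() once a carbon daemon is running.
    print(format_graphite("servers.web01.load", 0.42), end="")
```

statsd flushes its aggregated counters and timers to graphite in this same format, so anything that can write a line of text to a socket can feed the same dashboards.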
-Noah
Noah Campbell
415-513-3545
noahca...@gmail.com
Click on Enlarge next to the metric graphs. There is also a version of the
UI where all graphs are rendered using flot; however, that needs
polishing. That looks something like this
Actually for all you Ganglia users if you are using Ganglia Web 2.0+ you
can set this override and you will see it :-)
$conf['graph_engine'] = "flot";
As I said it needs work.
Vladimir
Ok, screenshots coming this afternoon. :) It's not super pretty, but
very functional (for our needs, at least).
--
petef
pencil screenshots: http://fetep.github.com/pencil/
--
petef
For your information :
> On this topic, what graphing/visualisation packages is everyone using these days?
http://sourceforge.net/projects/gbrrdgraphix/
"gbRRDGraphix is a graphical user interface built in Gambas language
to use simply RRDTool commands and 'flow-tools' Netflow utilities.
Added to the project, a Scheduler to update RRDTool database, a
complet Web Site to display all RRDtool graphics"
http://observium.org/wiki/Main_Page
"Observium is an autodiscovering PHP/MySQL/SNMP based network
monitoring which includes support for a wide range of network hardware
and operating systems including Cisco, Linux, FreeBSD, Juniper,
Foundry, HP and many more.
Observium has grown out of a lack of easy to configure and easy use
NMSes. It is intended to provide a more navigable interface to the
health and performance of your network. Its design goals include
collecting as much historical data about devices as possible, being
completely autodiscovered with little or no manual intervention, and
having a very intuitive interface.
Observium is not intended to replace a Nagios-type up/down monitoring
system, but rather to complement it with an easy to manage, intuitive
representation of historical and current performance statistics,
configuration visualisation and syslog capture."
Best Regards,
Guillaume FORTAINE
On Aug 15, 2011 5:38 PM, "Ernest Mueller" <ernest....@ni.com> wrote: