Monitoring td-agent with Zabbix


Kellan Strong

May 28, 2014, 1:28:19 PM
to flu...@googlegroups.com
Hello All,

I am using the Fluentd version of ELK (EFK? FEK? whatever) with Elasticsearch and Kibana.

Recently I have run into an issue where td-agent goes into a defunct state: Kibana stops receiving input, but tail -f /var/log/td-agent/td-agent.log shows no indication of a failure, and catting the log shows nothing either. Restarting the agent fixes it, but I'd really like some insight into what might have caused it, and whether anyone is monitoring td-agent in Zabbix.

I am currently monitoring whether td-agent is running in Zabbix, and it will alert on that, but I also want a trigger for when it goes into a state where it's no longer inputting data into Elasticsearch/Kibana.

Thank you,

Kiyoto Tamura

May 28, 2014, 2:35:59 PM
to flu...@googlegroups.com
Hi Kellan-

If you have not already done so, turn on verbose logging to see if you can get any additional log messages: http://docs.fluentd.org/articles/trouble-shooting#turn-on-verbose-logging
That will help us identify the problem.

Thanks for reporting this problem.

Kiyoto





--
Check out Fluentd, the open source data collector for high-volume data streams

Masahiro Nakagawa

May 28, 2014, 4:24:10 PM
to flu...@googlegroups.com
Hi,

> I am using the Fluentd version of ELK (EFK? FEK? whatever) with Elasticsearch and Kibana.

Thanks! One user said EFK.

> tail -f /var/log/td-agent/td-agent.log shows no indication of a failure, and catting the log shows nothing either

This is strange. Unless the problem is something like a deadlock, the logs should contain error or exception messages...

> I want a trigger for when it goes into a state where it's no longer inputting data into Elasticsearch/Kibana.

Hmm... I don't know the details of Zabbix, but I have one idea.
Fluentd has the in_monitor_agent plugin, which provides an API for getting buffer status:

http://docs.fluentd.org/articles/monitoring#monitoring-agent

So you can check buffer status for monitoring.
If the buffer length stays at 0 across several intervals, the monitoring script can treat that state as an error.
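For example, enabling the monitor agent is just one more <source> block (24220 is the port used in the Fluentd monitoring docs; bind and port are up to you):

<source>
  type monitor_agent
  bind 0.0.0.0
  port 24220
</source>

With that in place, http://localhost:24220/api/plugins.json returns one entry per plugin, and output plugins report buffer_queue_length, buffer_total_queued_size, and retry_count. A rough sketch of a check script that a Zabbix UserParameter or external check could wrap (the script itself is illustrative, not something that ships with Fluentd or Zabbix):

#!/usr/bin/env python3
# Print the worst buffer_queue_length and retry_count reported by
# in_monitor_agent, so Zabbix can alert on them. Illustrative sketch only.
import json
from urllib.request import urlopen

# 24220 matches the monitor_agent <source> above; adjust if you bind elsewhere.
with urlopen("http://127.0.0.1:24220/api/plugins.json", timeout=5) as resp:
    plugins = json.loads(resp.read().decode("utf-8"))["plugins"]

# Input plugins report these fields as null, hence the "or 0".
queue = max((p.get("buffer_queue_length") or 0) for p in plugins)
retry = max((p.get("retry_count") or 0) for p in plugins)
print("buffer_queue_length=%d retry_count=%d" % (queue, retry))

A trigger can then fire when retry_count keeps increasing between checks, or when buffer_queue_length stays at 0 even though the sources should be producing data.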

BTW, if you want to send Fluentd's error messages to another system, please check this article:

http://docs.fluentd.org/articles/logging#capture-fluentd-logs
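The idea in that article is that Fluentd re-emits its own internal log messages as events tagged fluent.info, fluent.warn, and fluent.error whenever a matching <match> block exists, so a minimal sketch looks like this (the forward destination is hypothetical; point it at whatever aggregator or alerting pipeline you already have):

<match fluent.**>
  type forward
  <server>
    host monitor.example.com
    port 24224
  </server>
</match>
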
YAMANO Yuji

May 28, 2014, 10:30:17 PM
to flu...@googlegroups.com, vaid....@gmail.com
On Wed, 28 May 2014 10:28:19 -0700 (PDT), Kellan Strong <vaid....@gmail.com> wrote:

> Recently I have run into an issue where td-agent goes into a defunct
> state: Kibana stops receiving input, but tail -f
> /var/log/td-agent/td-agent.log shows no indication of a failure, and
> catting the log shows nothing either. Restarting the agent fixes it,
> but I'd really like some insight into what might have caused it, and
> whether anyone is monitoring td-agent in Zabbix.

A thread dump might help you figure out what's happening.
See https://github.com/frsyuki/sigdump for more details.
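For reference, sigdump ships with td-agent and writes a dump of every Ruby thread to /tmp/sigdump-<pid>.log when the process receives SIGCONT. A small sketch of triggering it (the pid file path below is the usual td-agent default, and you may need the worker process's pid from ps rather than the supervisor's):

#!/usr/bin/env python3
# Ask a running td-agent for a thread dump via sigdump (SIGCONT).
import os
import signal

# Default td-agent pid file on most installs; adjust if yours differs.
with open("/var/run/td-agent/td-agent.pid") as f:
    pid = int(f.read().strip())

os.kill(pid, signal.SIGCONT)
print("look for /tmp/sigdump-%d.log" % pid)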

--
YAMANO Yuji
OGIS-RI

YAMANO Yuji

May 28, 2014, 11:02:12 PM
to flu...@googlegroups.com
I'm wondering how you guys monitor td-agent.

Do you monitor both the status from the monitor agent and td-agent.log?
What kind of tools do you use? Treasure Agent Monitoring Service?

On Thu, 29 May 2014 05:24:09 +0900, Masahiro Nakagawa <repea...@gmail.com> wrote:

> Hmm... I don't know the details of Zabbix, but I have one idea.
> Fluentd has the in_monitor_agent plugin, which provides an API for
> getting buffer status:
>
> http://docs.fluentd.org/articles/monitoring#monitoring-agent
>
> So you can check buffer status for monitoring.
> If the buffer length stays at 0 across several intervals, the
> monitoring script can treat that state as an error.
>
> BTW, if you want to send Fluentd's error messages to another system,
> please check this article:
> http://docs.fluentd.org/articles/logging#capture-fluentd-logs

--
YAMANO Yuji
OGIS-RI

Kellan Strong

May 29, 2014, 11:51:56 AM
to flu...@googlegroups.com
Thank you for your feedback, and sorry for the low-info message; I was being rushed by other issues. I think the problem might be due to the queue limit. Digging deeper, I found this:

2014-05-28 18:00:57 -0700 [warn]: temporarily failed to flush the buffer. next_retry=2014-05-28 17:59:54 -0700 error_class="Errno::ETIMEDOUT" error="Connection timed out - connect(2)" instance=70185829420620

Followed by a bunch of Ruby warnings.

Here is my config.

<source>
  type syslog
  port 5140
  tag apache
  format /[^ ]* {1,2}[^ ]* [^ ]* (?<host>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*)(?:\[(?<pid>[0-9]+)\])?[^\:]*\: (?<client_ip>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] (?<code>[^ ]*) (?<size>[^ ]*) (?<method>\S+) (?<url>[^ ]*)?/
  time_format %d/%b/%Y:%H:%M:%S %z
</source>


# log files in syslog format
<source>
  type tail
  path /var/local/fluent/auth.log,/var/local/fluent/cron.log,/var/local/fluent/daemon.log,/var/local/fluent/lpr.log,/var/local/fluent/kern.log,/var/local/fluent/mail.log,/var/local/fluent/netdevices.log,/var/local/fluent/sudo.log,/var/local/fluent/user.log,/var/local/fluent/syslog
  pos_file /var/log/td-agent/tail-syslog.pos
  tag system.local
  format syslog
</source>

<source>
  type tail
  path /var/log/auth.log,/var/log/cron.log,/var/log/daemon.log,/var/log/lpr.log,/var/log/kern.log,/var/log/mail.log,/var/log/netdevices.log,/var/log/sudo.log,/var/log/user.log,/var/log/syslog
  pos_file /var/log/td-agent/local-syslog.pos
  tag system.local
  format syslog
</source>


# log files with no date
<source>
  type tail
  path /var/local/fluent/bootstrap.log,/var/local/fluent/fontconfig.log
  pos_file /var/log/td-agent/tail-none.pos
  tag system.local
  format none
</source>


# alternatives.log custom format
<source>
  type tail
  path /var/local/fluent/alternatives.log
  pos_file /var/log/td-agent/tail-alternatives.pos
  tag system.local
  format /^(?<time>[^ ]*) (?<ident>[a-zA-Z0-9_\/\.\-]*): (?<message>.*)$/
  time_format %Y-%m-%d %H:%M:%S
</source>

# dpkg.log custom log format
<source>
  type tail
  path /var/local/fluent/dpkg.log
  pos_file /var/log/td-agent/tail-dpkg.pos
  tag system.local
  format /^(?<time>[^ ]*) (?<message>.*)$/
  time_format %Y-%m-%d %H:%M:%S
</source>


# geo lookup the apache logs
<match apache.**>
  type geoip
  geoip_lookup_key client_ip

  # Add fields via placeholders keyed on the geoip_lookup_key field (one or more settings are required)
  <record>
    city            ${city['client_ip']}
    latitude        ${latitude['client_ip']}
    longitude       ${longitude['client_ip']}
    country_code3   ${country_code3['client_ip']}
    country         ${country_code['client_ip']}
    country_name    ${country_name['client_ip']}
    dma             ${dma_code['client_ip']}
    area            ${area_code['client_ip']}
    region          ${region['client_ip']}
  </record>

  # Settings for tag
  remove_tag_prefix apache.
  tag geoip.${tag}
</match>


# send everything to elasticsearch
<match **>
  type elasticsearch
  logstash_format true
  flush_interval 5s
</match>

Masahiro Nakagawa

May 29, 2014, 12:06:03 PM
to flu...@googlegroups.com
> What kind of tools do you use?

Yes. We use Treasure Agent Monitoring Service for td-agent
and we also send some metrics to Librato Metrics.



Masahiro Nakagawa

May 29, 2014, 12:11:59 PM
to flu...@googlegroups.com
> I think the problem might be due to the queue limit. Digging deeper, I found this:

If the queue limit were the problem, it should be logged in td-agent.log.

> <match **>
>  type elasticsearch
>  logstash_format true
>  flush_interval 5s
> </match>

Elasticsearch sometimes gets stuck, so if you use Elasticsearch in production,
I recommend adding more Elasticsearch servers for load-balanced insertion.
fluent-plugin-elasticsearch supports the 'hosts' option for this case, for example:
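A sketch of what that might look like, with hypothetical node names, plus the standard buffered-output knobs that relate to the queue-limit theory above (the values are illustrative, not recommendations):

<match **>
  type elasticsearch
  # Load-balance bulk inserts across several nodes (node names are hypothetical).
  hosts es-node1:9200,es-node2:9200,es-node3:9200
  logstash_format true
  flush_interval 5s

  # Buffered-output options; tune these only if td-agent.log actually shows
  # the queue filling up or retries piling on.
  buffer_type file
  buffer_path /var/log/td-agent/buffer/es
  buffer_chunk_limit 8m
  buffer_queue_limit 256
  retry_wait 1s
  num_threads 2
</match>

Spreading inserts across hosts means one stuck node no longer stalls every flush, and a file buffer keeps queued data across td-agent restarts.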





Kellan Strong

May 30, 2014, 12:44:03 PM
to flu...@googlegroups.com
Cool, so clustering Elasticsearch is needed to fix this... We were going to do that in the future anyway. Thank you.