Periodic "hang" with no obvious cause

Damon Miller

Aug 6, 2015, 4:34:07 PM
to Fluentd Google Group
Hi folks.  I'm testing FluentD for delivery of some log data to an HTTPS destination using a simple plugin, and I'm seeing periodic "hangs" where the agent simply stops.  During this time no log messages are delivered, and nothing is written to the agent's own log.  Only when I force a restart does FluentD flush its buffer and "catch up", resuming delivery of log data.  Here are some excerpts from the td-agent logfile that show normal processing, a hang (manifested as an absence of log messages), a restart, then a flush of the buffered data:

2015-08-06 15:26:05 +0000 [info]: => 200 (OK)
2015-08-06 15:27:05 +0000 [info]: Message Length:  1
2015-08-06 15:27:05 +0000 [info]: POST https://collectors.destination.com/receiver/v1/http/Be65b8Kbhdvhn4IHuIw==
2015-08-06 15:27:05 +0000 [info]: => 200 (OK)
2015-08-06 15:27:23 +0000 [info]: Message Length:  4
2015-08-06 15:27:23 +0000 [info]: POST https://collectors.destination.com/receiver/v1/http/Be65b8Kbhdvhn4IHuIw==
2015-08-06 17:28:56 +0000 [info]: shutting down fluentd
2015-08-06 17:29:04 +0000 [info]: reading config file path="/etc/td-agent/td-agent.conf"
2015-08-06 17:29:04 +0000 [info]: starting fluentd-0.12.7

[abbreviated]

2015-08-06 17:29:04 +0000 [info]: initialized for Test Buffered HTTP source
2015-08-06 17:29:04 +0000 [info]: listening dRuby uri="druby://127.0.0.1:24230" object="Engine"
2015-08-06 17:29:06 +0000 [info]: Message Length:  131
2015-08-06 17:29:06 +0000 [info]: POST https://collectors.destination.com/receiver/v1/http/H9GPQLvRgnVXAoFydMSh5QY9SZLlW8CwmBbSb0Jg==
2015-08-06 17:29:06 +0000 [info]: => 200 (OK)
2015-08-06 17:29:06 +0000 [info]: => 200 (OK)
2015-08-06 17:29:07 +0000 [info]: Message Length:  45935
2015-08-06 17:29:07 +0000 [info]: POST https://collectors.destination.com/receiver/v1/http/H9GPQLvRgnVXAoFydMSh5QY9SZLlW8CwmBbSb0Jg==
2015-08-06 17:29:10 +0000 [info]: Message Length:  675
2015-08-06 17:29:10 +0000 [info]: POST https://collectors.destination.com/receiver/v1/http/Be65b8Kbhdvhn4IHuIw==
2015-08-06 17:29:10 +0000 [info]: => 200 (OK)
2015-08-06 17:29:17 +0000 [info]: => 200 (OK)
2015-08-06 17:29:18 +0000 [info]: Message Length:  45067
2015-08-06 17:29:18 +0000 [info]: POST https://collectors.destination.com/receiver/v1/http/H9GPQLvRgnVXAoFydMSh5QY9SZLlW8CwmBbSb0Jg==

As you can see, roughly two hours passed during which FluentD wasn't processing anything and data was being queued locally (disk buffer).  When I noticed this, I tried to restart FluentD using the init script, but that always hangs while trying to stop the process.  I then issued "kill -9" against the Ruby processes, which allowed the init script to start FluentD.  Immediately after that, the buffers were flushed, e.g. 45,935 messages being POSTed.  This is running on an Amazon Linux instance in EC2 (kernel 3.14.35-28.38.amzn1.x86_64).  I'm happy to post config data; here is one of the six source/destination pairs on this host:

--

# Data incoming from Syslog-NG
<source>
  type tcp
  port 5556
  bind 127.0.0.1
  tag test1.tcp
  format none
  log_level info
</source>

# Relay data via HTTPS
<match customer1.**>
  buffer_type file
  buffer_path /var/log/td-agent/buffer/customer1
  buffer_chunk_limit 16m
  buffer_queue_limit 256
  flush_interval 5s
  disable_retry_limit false
  retry_limit 15
  retry_wait 1s
  type test_http_buf
  hostname collectors.destination.com
  protocol https
  url /receiver/v1/http/Be65b8Kbhdvhn4IHuIw==
  port 443
</match>

--

I have the monitor agent enabled, and it does sometimes show an increase in the disk buffer size for one of the destinations, but that is expected: bandwidth throttling is being tested on that host.  Other than that, I'm at a loss as to what to monitor from a performance perspective, or more generally how to identify bottlenecks in FluentD's configuration here.
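For reference, the monitor agent here is the standard in_monitor_agent input, roughly:

<source>
  type monitor_agent
  bind 127.0.0.1
  port 24220
</source>

Polling "curl http://127.0.0.1:24220/api/plugins.json" reports buffer_queue_length, buffer_total_queued_size, and retry_count per plugin, which is how I see the disk buffer growing.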


Thanks in advance for any guidance.


Best,

Damon

Mr. Fiber

Aug 7, 2015, 2:26:05 AM
to Fluentd Google Group
> I have the monitor agent enabled, and it does sometimes show an increase in the disk buffer size for one of the destinations, but that is expected: bandwidth throttling is being tested on that host.

Does that mean other plugins keep working while test_http_buf is blocked, or that all processing hangs?
If the former, could you show me the test_http_buf plugin code?
Fluentd runs each output plugin's flushing in its own thread by default.
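For reference, a buffered output in the v0.12 API looks roughly like this sketch (the class and registered name are illustrative, not Damon's actual plugin):

require 'fluent/output'

module Fluent
  class ExampleBufferedHttpOutput < BufferedOutput
    # Each <match> using this type gets its own flush thread.
    Fluent::Plugin.register_output('example_buffered_http', self)

    def format(tag, time, record)
      [tag, time, record].to_msgpack
    end

    # Runs in this output's flush thread; raising here makes Fluentd
    # keep the chunk and retry it later with backoff.
    def write(chunk)
      chunk.msgpack_each do |tag, time, record|
        # POST the record to the HTTP endpoint here
      end
    end
  end
end

So if only test_http_buf stops, its flush thread is stuck; if everything stops, the problem is elsewhere.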


Masahiro


Damon Miller

Aug 7, 2015, 12:15:27 PM
to Fluentd Google Group
Absolutely.  The code is attached ("test_http_buf.rb").  Note that the HTTP endpoint may respond with an HTTP 429, indicating that throttling is occurring due to excessive bandwidth use.  That's expected, and it's what I meant when I mentioned that the disk buffer is occasionally used.  That part seems to be working fine: when traffic is allowed again, FluentD flushes the buffer and "catches up".  What I can't determine is why it occasionally hangs without logging anything or delivering data.
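The gist of the 429 handling (simplified here; the attached test_http_buf.rb is the real code, and "post_chunk" is a made-up helper) is that a non-200 response raises, so FluentD keeps the chunk buffered and retries it later:

def write(chunk)
  response = post_chunk(chunk)  # hypothetical helper: POSTs the chunk over HTTPS
  # On 429 (throttled) this raises, so the chunk stays in the disk
  # buffer and FluentD retries it with backoff -- the "catch up" behavior.
  raise "HTTP #{response.code}" unless response.code == '200'
end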


Thank you!

Damon
test_http_buf.rb

Damon Miller

Aug 10, 2015, 6:42:56 PM
to Fluentd Google Group
This happened again, and I had to restart FluentD to get data flowing.  When I write "restart", I mean that I tried the init script, but it hung, so I had to kill the Ruby processes manually before I could restart.  Here are log excerpts showing where the events happen:

The hang occurred at 12:31 UTC on 10 Aug; I noticed 20 minutes later and manually restarted:

2015-08-10 12:31:10 +0000 [info]: Message Length:  1
2015-08-10 12:31:10 +0000 [info]: POST https://collectors.destination.com/receiver/v1/http/Be65b8Kbhdvhn4IHuIw==
2015-08-10 12:31:10 +0000 [info]: => 200 (OK)
2015-08-10 12:31:28 +0000 [info]: Message Length:  4
2015-08-10 12:31:28 +0000 [info]: POST https://collectors.destination.com/receiver/v1/http/Be65b8Kbhdvhn4IHuIw==
2015-08-10 12:51:23 +0000 [info]: shutting down fluentd
2015-08-10 12:51:23 +0000 [info]: shutdown from Test HTTP source
2015-08-10 12:51:23 +0000 [info]: shutdown from Test HTTP source

After the restart, data begins processing again and the buffer is flushed (e.g. 49,287 messages POSTed):

2015-08-10 12:53:15 +0000 [info]: Message Length:  18
2015-08-10 12:53:15 +0000 [info]: POST https://collectors.destination.com/receiver/v1/http/85_H9GPQLvRgnVXAoFydMSh5QY9SZLlW8CwmBbSb0Jg==
2015-08-10 12:53:15 +0000 [info]: Message Length:  4
2015-08-10 12:53:15 +0000 [info]: POST https://collectors.destination.com/receiver/v1/http/Be65b8Kbhdvhn4IHuIw==
2015-08-10 12:53:15 +0000 [info]: => 200 (OK)
2015-08-10 12:53:15 +0000 [info]: => 200 (OK)
2015-08-10 12:53:17 +0000 [info]: Message Length:  49287
2015-08-10 12:53:17 +0000 [info]: POST https://collectors.destination.com/receiver/v1/http/85_H9GPQLvRgnVXAoFydMSh5QY9SZLlW8CwmBbSb0Jg==

It had been about 70 hours since the previous restart when FluentD hung.  Is there anything I can do to diagnose the cause of the hang?  Could I restart nightly (assuming the init script can actually restart)?
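If nightly restarts end up being the stopgap, the cron entry would be something like this (assuming the init script can actually stop td-agent, which so far it can't when hung):

0 4 * * * /etc/init.d/td-agent restart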


Best,


Damon

Mr. Fiber

Aug 11, 2015, 6:07:01 AM
to Fluentd Google Group
Hmm... could you send a signal to fluentd for dumping instance state?


fluentd sets up the sigdump gem, so you can export memory, thread, and other state to a file for debugging.
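By default sigdump is bound to SIGCONT and writes to /tmp/sigdump-<pid>.log, so something like:

kill -CONT <pid of td-agent worker process>
less /tmp/sigdump-<pid>.log

The dump includes a backtrace for every thread plus object counts, which should show where the worker is stuck.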

YAMANO Yuji

Aug 19, 2015, 10:15:15 PM
to flu...@googlegroups.com, windha...@gmail.com
Damon,

Any progress on the issue?

I had a similar problem at a previous project. In my case, it was stuck in Ruby's
native code.

Thread #<Thread:0x007f1e03898898> status=sleep priority=0
/usr/lib64/fluent/ruby/lib/ruby/1.9.1/net/http.rb:800:in `block in connect'
/usr/lib64/fluent/ruby/lib/ruby/1.9.1/timeout.rb:55:in `timeout'
/usr/lib64/fluent/ruby/lib/ruby/1.9.1/timeout.rb:100:in `timeout'
/usr/lib64/fluent/ruby/lib/ruby/1.9.1/net/http.rb:800:in `connect'
/usr/lib64/fluent/ruby/lib/ruby/1.9.1/net/http.rb:756:in `do_start'
/usr/lib64/fluent/ruby/lib/ruby/1.9.1/net/http.rb:745:in `start'
/usr/lib64/fluent/ruby/lib/ruby/1.9.1/net/http.rb:1285:in `request'
/usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluent-plugin-sumologic-0.0.1/lib/fluent/plugin/out_sumologic.rb:54:in `write'
/usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.45/lib/fluent/buffer.rb:296:in `write_chunk'
/usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.45/lib/fluent/buffer.rb:276:in `pop'
/usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.45/lib/fluent/output.rb:310:in `try_flush'
/usr/lib64/fluent/ruby/lib/ruby/gems/1.9.1/gems/fluentd-0.10.45/lib/fluent/output.rb:132:in `run'

I don't know if the problem was ever fixed, because I left the project.
I haven't seen it with the newer td-agent shipped with Ruby 2 at my current project.

Damon Miller

Aug 22, 2015, 9:44:09 PM
to Fluentd Google Group, windha...@gmail.com
Yamano,


Unfortunately I haven't made any progress.  It took me a little while, but I did implement Masahiro's suggestion of adding sigdump support to my plugin.  Since then I haven't experienced a "total" hang, but I have had one instance of data for one output stopping completely.  I had to restart immediately to remedy that and wasn't able to capture a stack trace.  However, I am currently alarming because that same instance of the output plugin hasn't sent data in approximately 40 minutes, despite the fact that I am continuing to receive incoming data.  It seems likely that I'm in a "bad" state and the only option I have to address the problem is restarting FluentD.

There are two Ruby processes, one the parent of the other.  I sent a CONT signal to each and the results are attached.  PID 26114 is the child process, presumably responsible for the actual processing of data given its much larger stack trace.  To be honest, I don't really know how to interpret these.  Could you tell me which function/method led you to conclude that the problem was Ruby's native code in your case?  I see "block in connect" from the HTTP client, and I see that in my case too, though followed by a call to "connect".  I also see a number of sleeping threads in FluentD's output plugin code ("output.rb").  I'd love some help on this...

The stack traces are attached, and also included inline below.

PARENT PROCESS (26111):

Sigdump at 2015-08-23 01:33:09 +0000 process 26111 (/usr/sbin/td-agent)
  Thread #<Thread:0x007fb69b19e7a0> status=run priority=0
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/sigdump-0.2.2/lib/sigdump.rb:38:in `backtrace'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/sigdump-0.2.2/lib/sigdump.rb:38:in `dump_backtrace'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/sigdump-0.2.2/lib/sigdump.rb:24:in `block in dump_all_thread_backtrace'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/sigdump-0.2.2/lib/sigdump.rb:23:in `each'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/sigdump-0.2.2/lib/sigdump.rb:23:in `dump_all_thread_backtrace'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/sigdump-0.2.2/lib/sigdump.rb:16:in `block in dump'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/sigdump-0.2.2/lib/sigdump.rb:107:in `open'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/sigdump-0.2.2/lib/sigdump.rb:107:in `_open_dump_path'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/sigdump-0.2.2/lib/sigdump.rb:14:in `dump'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/sigdump-0.2.2/lib/sigdump.rb:7:in `block in setup'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/supervisor.rb:252:in `call'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/supervisor.rb:252:in `waitpid'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/supervisor.rb:252:in `supervise'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/supervisor.rb:134:in `start'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/command/fluentd.rb:167:in `<top (required)>'
      /opt/td-agent/embedded/lib/ruby/site_ruby/2.1.0/rubygems/core_ext/kernel_require.rb:73:in `require'
      /opt/td-agent/embedded/lib/ruby/site_ruby/2.1.0/rubygems/core_ext/kernel_require.rb:73:in `require'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/bin/fluentd:6:in `<top (required)>'
      /opt/td-agent/embedded/bin/fluentd:23:in `load'
      /opt/td-agent/embedded/bin/fluentd:23:in `<top (required)>'
      /usr/sbin/td-agent:7:in `load'
      /usr/sbin/td-agent:7:in `<main>'
  Built-in objects:
    59,100: TOTAL
    29,039: T_STRING
    12,244: T_NODE
     7,368: T_ARRAY
     4,731: T_DATA
     1,868: T_CLASS
     1,558: T_OBJECT
       850: FREE
       610: T_REGEXP
       426: T_HASH
       147: T_ICLASS
       126: T_MODULE
        59: T_RATIONAL
        38: T_STRUCT
        19: T_FILE
         9: T_FLOAT
         4: T_BIGNUM
         3: T_MATCH
         1: T_COMPLEX
  All objects:
    21,868: String
     3,753: RubyVM::InstructionSequence
     3,740: Array
       831: Class
       586: Gem::Requirement
       465: Regexp
       386: Gem::Dependency
       352: Hash
       204: Gem::Version
       173: Proc
       126: Module
       115: RubyVM::Env
       100: Encoding
        99: Gem::Specification
        99: Gem::StubSpecification
        99: Time
        91: Gem::StubSpecification::StubLine
        59: Rational
        38: Range
        21: Fluent::Config::ConfigureProxy
        16: MatchData
        14: Fluent::Config::Element
        13: OptionParser::Switch::RequiredArgument
        13: Mutex
        11: OptionParser::Switch::NoArgument
         9: Float
         6: OptionParser::OptionMap
         5: IO
         4: File
         4: Fluent::Config::V1Parser
         4: StringScanner
         4: Bignum
         3: Object
         3: OptionParser::List
         3: Fluent::Registry
         3: IRB::Notifier::LeveledNotifier
         3: JSON::Ext::Generator::State
         2: OptionParser::Switch::OptionalArgument
         2: UnboundMethod
         2: OptionParser::Switch::PlacedArgument
         2: ThreadSafe::Cache
         2: Binding
         1: Fluent::Supervisor
         1: Fluent::Supervisor::LoggerInitializer
         1: OptionParser::CompletingHash
         1: Fluent::Log
         1: Fluent::StatusClass
         1: Thread::Backtrace
         1: IRB::Notifier::NoMsgNotifier
         1: IRB::StdioOutputMethod
         1: IRB::Notifier::CompositeNotifier
         1: Resolv
         1: Resolv::Hosts
         1: OptionParser
         1: URI::Generic
         1: Resolv::DNS
         1: URI::Parser
         1: Monitor
         1: Gem::PathSupport
         1: Fluent::EngineClass
         1: Gem::Platform
         1: Resolv::DNS::Config
         1: OpenSSL::X509::Store
         1: Fluent::PluginClass
         1: Complex
         1: ThreadGroup
         1: IOError
         1: Thread
         1: RubyVM
         1: NoMemoryError
         1: SystemStackError
         1: Random
         1: ARGF.class
         1: Data
         1: fatal
  String 437,281 bytes
   Array 2 elements
    Hash 0 pairs


CHILD PROCESS (26114):

Sigdump at 2015-08-23 01:33:11 +0000 process 26114 (/usr/sbin/td-agent)
  Thread #<Thread:0x007fb69b19e7a0> status=run priority=0
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/sigdump-0.2.2/lib/sigdump.rb:38:in `backtrace'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/sigdump-0.2.2/lib/sigdump.rb:38:in `dump_backtrace'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/sigdump-0.2.2/lib/sigdump.rb:24:in `block in dump_all_thread_backtrace'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/sigdump-0.2.2/lib/sigdump.rb:23:in `each'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/sigdump-0.2.2/lib/sigdump.rb:23:in `dump_all_thread_backtrace'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/sigdump-0.2.2/lib/sigdump.rb:16:in `block in dump'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/sigdump-0.2.2/lib/sigdump.rb:107:in `open'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/sigdump-0.2.2/lib/sigdump.rb:107:in `_open_dump_path'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/sigdump-0.2.2/lib/sigdump.rb:14:in `dump'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/sigdump-0.2.2/lib/sigdump.rb:7:in `block in setup'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.3.0/lib/cool.io/loop.rb:88:in `call'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.3.0/lib/cool.io/loop.rb:88:in `run_once'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.3.0/lib/cool.io/loop.rb:88:in `run'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/engine.rb:163:in `run'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/supervisor.rb:481:in `run_engine'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/supervisor.rb:140:in `block in start'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/supervisor.rb:266:in `call'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/supervisor.rb:266:in `main_process'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/supervisor.rb:241:in `block in supervise'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/supervisor.rb:240:in `fork'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/supervisor.rb:240:in `supervise'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/supervisor.rb:134:in `start'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/command/fluentd.rb:167:in `<top (required)>'
      /opt/td-agent/embedded/lib/ruby/site_ruby/2.1.0/rubygems/core_ext/kernel_require.rb:73:in `require'
      /opt/td-agent/embedded/lib/ruby/site_ruby/2.1.0/rubygems/core_ext/kernel_require.rb:73:in `require'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/bin/fluentd:6:in `<top (required)>'
      /opt/td-agent/embedded/bin/fluentd:23:in `load'
      /opt/td-agent/embedded/bin/fluentd:23:in `<top (required)>'
      /usr/sbin/td-agent:7:in `load'
      /usr/sbin/td-agent:7:in `<main>'
  Thread #<Thread:0x007fb68f56a5e8> status=sleep priority=0
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/output.rb:165:in `sleep'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/output.rb:165:in `wait'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/output.rb:165:in `cond_wait'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/output.rb:149:in `run'
  Thread #<Thread:0x007fb68f48b190> status=sleep priority=0
      /opt/td-agent/embedded/lib/ruby/2.1.0/net/http.rb:920:in `connect'
      /opt/td-agent/embedded/lib/ruby/2.1.0/net/http.rb:920:in `block in connect'
      /opt/td-agent/embedded/lib/ruby/2.1.0/timeout.rb:76:in `timeout'
      /opt/td-agent/embedded/lib/ruby/2.1.0/net/http.rb:920:in `connect'
      /opt/td-agent/embedded/lib/ruby/2.1.0/net/http.rb:863:in `do_start'
      /opt/td-agent/embedded/lib/ruby/2.1.0/net/http.rb:858:in `start'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/net-http-persistent-2.9.4/lib/net/http/persistent.rb:700:in `start'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/net-http-persistent-2.9.4/lib/net/http/persistent.rb:965:in `reset'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/net-http-persistent-2.9.4/lib/net/http/persistent.rb:628:in `connection_for'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/net-http-persistent-2.9.4/lib/net/http/persistent.rb:994:in `request'
      /etc/td-agent/plugin/test_http_buf.rb:88:in `block in write'
      /etc/td-agent/plugin/test_http_buf.rb:87:in `upto'
      /etc/td-agent/plugin/test_http_buf.rb:87:in `write'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/buffer.rb:325:in `write_chunk'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/buffer.rb:304:in `pop'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/output.rb:321:in `try_flush'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/output.rb:140:in `run'
  Thread #<Thread:0x007fb68f489e80> status=sleep priority=0
      /opt/td-agent/embedded/lib/ruby/2.1.0/net/http.rb:920:in `connect'
      /opt/td-agent/embedded/lib/ruby/2.1.0/net/http.rb:920:in `block in connect'
      /opt/td-agent/embedded/lib/ruby/2.1.0/timeout.rb:76:in `timeout'
      /opt/td-agent/embedded/lib/ruby/2.1.0/net/http.rb:920:in `connect'
      /opt/td-agent/embedded/lib/ruby/2.1.0/net/http.rb:863:in `do_start'
      /opt/td-agent/embedded/lib/ruby/2.1.0/net/http.rb:858:in `start'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/net-http-persistent-2.9.4/lib/net/http/persistent.rb:700:in `start'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/net-http-persistent-2.9.4/lib/net/http/persistent.rb:965:in `reset'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/net-http-persistent-2.9.4/lib/net/http/persistent.rb:628:in `connection_for'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/net-http-persistent-2.9.4/lib/net/http/persistent.rb:994:in `request'
      /etc/td-agent/plugin/test_http_buf.rb:88:in `block in write'
      /etc/td-agent/plugin/test_http_buf.rb:87:in `upto'
      /etc/td-agent/plugin/test_http_buf.rb:87:in `write'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/buffer.rb:325:in `write_chunk'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/buffer.rb:304:in `pop'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/output.rb:321:in `try_flush'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/output.rb:140:in `run'
  Thread #<Thread:0x007fb68f488b70> status=sleep priority=0
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/output.rb:165:in `sleep'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/output.rb:165:in `wait'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/output.rb:165:in `cond_wait'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/output.rb:149:in `run'
  Thread #<Thread:0x007fb68f486910> status=sleep priority=0
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/output.rb:165:in `sleep'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/output.rb:165:in `wait'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/output.rb:165:in `cond_wait'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/output.rb:149:in `run'
  Thread #<Thread:0x007fb68f4859e8> status=sleep priority=0
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.3.0/lib/cool.io/loop.rb:88:in `run_once'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.3.0/lib/cool.io/loop.rb:88:in `run'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/plugin/socket_util.rb:126:in `run'
  Thread #<Thread:0x007fb68f484e58> status=sleep priority=0
      /opt/td-agent/embedded/lib/ruby/2.1.0/drb/drb.rb:959:in `accept'
      /opt/td-agent/embedded/lib/ruby/2.1.0/drb/drb.rb:959:in `accept'
      /opt/td-agent/embedded/lib/ruby/2.1.0/drb/drb.rb:1614:in `main_loop'
      /opt/td-agent/embedded/lib/ruby/2.1.0/drb/drb.rb:1464:in `block in run'
  Thread #<Thread:0x007fb68f536d60> status=sleep priority=0
      /opt/td-agent/embedded/lib/ruby/2.1.0/webrick/server.rb:170:in `select'
      /opt/td-agent/embedded/lib/ruby/2.1.0/webrick/server.rb:170:in `block in start'
      /opt/td-agent/embedded/lib/ruby/2.1.0/webrick/server.rb:32:in `start'
      /opt/td-agent/embedded/lib/ruby/2.1.0/webrick/server.rb:160:in `start'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/plugin/in_monitor_agent.rb:215:in `block in start'
  Thread #<Thread:0x007fb68f5368d8> status=sleep priority=0
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.3.0/lib/cool.io/loop.rb:88:in `run_once'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.3.0/lib/cool.io/loop.rb:88:in `run'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/plugin/socket_util.rb:126:in `run'
  Thread #<Thread:0x007fb68f536518> status=sleep priority=0
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.3.0/lib/cool.io/loop.rb:88:in `run_once'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.3.0/lib/cool.io/loop.rb:88:in `run'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/plugin/socket_util.rb:126:in `run'
  Thread #<Thread:0x007fb68f536158> status=sleep priority=0
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.3.0/lib/cool.io/loop.rb:88:in `run_once'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.3.0/lib/cool.io/loop.rb:88:in `run'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/plugin/socket_util.rb:126:in `run'
  Thread #<Thread:0x007fb68f535d20> status=sleep priority=0
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.3.0/lib/cool.io/loop.rb:88:in `run_once'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/cool.io-1.3.0/lib/cool.io/loop.rb:88:in `run'
      /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.7/lib/fluent/plugin/socket_util.rb:126:in `run'
  Built-in objects:
   191,173: TOTAL
   103,613: FREE
    42,287: T_STRING
    15,087: T_DATA
    12,949: T_ARRAY
     6,445: T_NODE
     3,170: T_OBJECT
     2,632: T_CLASS
     2,321: T_HASH
     1,384: T_STRUCT
       646: T_REGEXP
       173: T_ICLASS
       157: T_MODULE
       121: T_MATCH
       100: T_FILE
        59: T_RATIONAL
        19: T_BIGNUM
         9: T_FLOAT
         1: T_COMPLEX


Thank you!

Damon
sigdump-26111.log
sigdump-26114.log

Mr. Fiber

Aug 24, 2015, 9:25:41 AM
to Fluentd Google Group
From your log, `connect` seems to be blocked.
How about setting open_timeout on your HTTP client?
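Your backtrace shows net-http-persistent, so something along these lines (values illustrative):

require 'net/http/persistent'

http = Net::HTTP::Persistent.new 'test_http_buf'
http.open_timeout = 5   # seconds allowed for the TCP/SSL connect
http.read_timeout = 5   # seconds allowed for each read of the response

With these set, a dead connect raises a timeout error instead of blocking the flush thread forever, and Fluentd's normal retry takes over.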

YAMANO Yuji

Aug 24, 2015, 10:50:15 PM
to flu...@googlegroups.com, windha...@gmail.com
Damon,

> Could you tell me what function/method led to conclude that the
> problem was Ruby's native code in your case? I see "block in
> connect" from the HTTP client and I also see that in my case, though
> followed by a call to "connect".

I don't remember exactly. In my case, the connection was established.
So after taking a look at the code in net/http.rb, ext/socket, and ext/openssl,
I concluded it might be stuck somewhere in TCPSocket or SSLSocket.

YAMANO Yuji

Aug 24, 2015, 10:59:01 PM
to flu...@googlegroups.com, windha...@gmail.com
I talked with colleagues working on the project this morning.
The problem is not fixed yet; they restart fluentd periodically :-(

Damon Miller

Aug 27, 2015, 10:10:27 AM
to Fluentd Google Group
Thank you for your suggestion!  I added support for open and read timeouts, set them to five seconds, and I'm monitoring now to see if anything changes.


Best,

Damon

Damon Miller

Sep 10, 2015, 11:30:02 PM
to Fluentd Google Group
This worked perfectly!  Implementing open and read timeouts completely eliminated the hang problem.  I've got another question but I'll post that in a new thread.  Thanks so much for the responses!

Damon

Chingis Dugarzhapov

Nov 16, 2016, 8:49:16 AM
to Fluentd Google Group
Hi Damon,

I'm experiencing the same "hanging" issue.  I set up babysitting scripts to restart fluentd, but they are not reliable.  Can you describe which open and read timeouts you changed, and where?

Thanks in advance!
Chingis