logrotate kills td-agent completely


Mark Moorcroft

Sep 23, 2014, 3:14:13 PM
to flu...@googlegroups.com

As a followup to the issues with stopping and starting td-agent (fluentd), I am finding that logrotate stops td-agent and it never restarts.


/var/log/td-agent/td-agent.log {
  daily
  rotate 30
  compress
  delaycompress
  notifempty
  create 640 td-agent td-agent
  sharedscripts
  postrotate
    pid=/var/run/td-agent/td-agent.pid
    test -s $pid && kill -USR1 "$(cat $pid)"
  endscript
}



So "kill -USR1 $pid" doesn't restart td-agent so that a fresh log file gets opened; it kills it entirely.
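
To help narrow down when the signal is being sent to a dead or wrong process, the postrotate step could be wrapped in a more defensive helper. This is only a sketch: the function name `signal_reopen` and its return codes are made up for illustration and are not part of the td-agent package. It sends USR1 only when the pid file exists, is non-empty, and names a live process.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with td-agent): send USR1 only if the
# pid file points at a process that is actually running.
signal_reopen() {
    pidfile=$1
    # Pid file missing or empty: nothing to signal.
    if [ ! -s "$pidfile" ]; then
        echo "no usable pid file at $pidfile" >&2
        return 1
    fi
    pid=$(cat "$pidfile")
    # kill -0 only checks that the process exists; it sends no signal.
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "stale pid $pid in $pidfile" >&2
        return 2
    fi
    kill -USR1 "$pid"
}
```

In the logrotate stanza above, the postrotate body would then call something like `signal_reopen /var/run/td-agent/td-agent.pid`, and the messages on stderr would show whether the pid file was stale at rotation time.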


I am trying to figure out what's going on with /etc/init.d/td-agent as well, because that doesn't work either.

Mark Moorcroft

Sep 23, 2014, 4:01:16 PM
to flu...@googlegroups.com

This is very frustrating. I can't find anything in the logs to indicate why logrotate killed td-agent on 3 of my hosts. And manually sending a "kill -USR1" now appears to do what you would expect.

Also, as I am testing the init.d script, td-agent appears to take about 30 seconds to shut down, which is well within the default 60-second countdown. Yet "service td-agent stop" most often fails with an error, and the lock/pid files are never removed, even though td-agent is no longer running.

Kiyoto Tamura

Sep 23, 2014, 4:24:18 PM
to flu...@googlegroups.com
Hi Mark-

Just to make sure: all of this is happening on CentOS 5?


Mark Moorcroft

Sep 23, 2014, 4:29:10 PM
to flu...@googlegroups.com

In my testing I find that an init.d shutdown seems to take exactly 60 seconds to report "process finished code=0" in the logs. Increasing the countdown timer to 90 seconds seems to have given me a reliable shutdown on 9 different clients. I guess if shutdown takes exactly 60 seconds and the countdown is also 60 seconds, you are left with only a split second of wiggle room. Probably changing the counter to 61 would have been just as effective?
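
The margin problem can be illustrated with a wait loop whose timeout is a parameter. This is a minimal sketch and not the actual code from /etc/init.d/td-agent: the function name `wait_for_exit` and its arguments are hypothetical. It polls once per second and gives up when the timeout is reached, so a process that needs the full 60 seconds leaves no slack against a 60-second limit.

```shell
#!/bin/sh
# Hypothetical helper: wait up to $2 seconds for process $1 to exit.
# Returns 0 if the process is gone, 1 if it is still there at the timeout.
# Assumes the process is not an unreaped child of the calling shell,
# because kill -0 still succeeds for a zombie.
wait_for_exit() {
    pid=$1
    timeout=$2
    elapsed=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

A stop routine could then call something like `wait_for_exit "$pid" 90` and remove the lock/pid files only when it returns 0.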

I will have to watch what happens with logrotate. I don't see td-agent being very useful if you have to constantly monitor to see if it's dead or if it's failing to reconnect to the server.



Mark Moorcroft

Sep 23, 2014, 4:31:19 PM
to flu...@googlegroups.com

All of my findings and testing were on CentOS 6 up until I changed the counter to 90 seconds. Then I tried it on both 5 and 6. The logrotate failures were all on CentOS 6.

Mark Moorcroft

Sep 23, 2014, 10:21:33 PM
to flu...@googlegroups.com

Another followup: on one of my CentOS 5 boxes, when I restart td-agent I get one of these for each running process:

kernel: ruby-timer-thr[12281]: segfault at 000000004167be68 rip 00002af6cb302442 rsp 000000004167be60 error 6


The change to 90 second wait count in init.d seems to have stabilized restarting td-agent though.



Mark Moorcroft

Sep 26, 2014, 7:07:52 PM
to flu...@googlegroups.com

After some restarts of the client and server today, since changing the init.d from the 60 second wait loop to 90, I have yet to see init fail to restart the td-agent service. I don't know what property makes it take EXACTLY 60 seconds to log the restart event in td-agent.log, but having the wait loop exactly the same interval must have been a bad idea. 

Still on the lookout for logrotate killing the process with USR1, which obviously should not be happening.



Masahiro Nakagawa

Sep 29, 2014, 9:54:32 AM
to flu...@googlegroups.com
BTW, the long shutdown time is resolved in master.
I just backported the fix to the v0.10 branch.


The next td-agent 1 and 2 releases will ship with this change.
Note that td-agent 1's Cool.io is v1.1, so updating Cool.io is needed.


