in_tail missing log rotation from svlogd (runit)

1,105 views
Skip to first unread message

Sean

unread,
Jan 8, 2018, 4:03:30 PM1/8/18
to Fluent-Bit
Hello!

I'm attempting to use fluent-bit to tail a log created/rotated by runit's svlogd. Fairly often, when the log is rotated, fluent-bit does not reset the file offset. Sometimes, though, it does catch it. I have been unable to determine why it is caught sometimes and not others.

Here's the configs:

[service]

    flush
5
    daemon off
    log_level debug
    parsers_file fluent
-bit-parsers.conf


[input]
    name tail
    path logs
/web/current
    parser json
    db tmp
/fluent-bit-request.db
    tag requests


[output]
    name forward
    match
*



[parser]

    name json
    format json
    time_key timestamp
    time_format
%d/%b/%Y:%H:%M:%S %z

And here are the debug log entries when the file rotation is missed:

[2018/01/08 19:11:56] [debug] [in_tail] file=logs/web/current read=547 lines=1
[2018/01/08 19:11:57] [debug] [in_tail] file=logs/web/current event
[2018/01/08 19:11:57] [debug] [input tail.0] [mem buf] size = 821
[2018/01/08 19:11:57] [debug] [in_tail] file=logs/web/current read=522 lines=1
[2018/01/08 19:11:57] [debug] [in_tail] removed logs/web/current
[2018/01/08 19:12:00] [debug] [task->buffer] worker_id=0
[2018/01/08 19:12:00] [debug] [task] created task=0x1a95cc0 id=0 OK
[2018/01/08 19:12:00] [debug] [out_forward] request 821 bytes to flush
[2018/01/08 19:12:00] [debug] [out_fw] 2 entries tag='requests' tag_len=8
[2018/01/08 19:12:00] [debug] [task] destroy task=0x1a95cc0 (task_id=0)
[2018/01/08 19:12:00] [debug] [dyntag tail.0] 0x1acb760 destroy (tag=requests)
[2018/01/08 19:12:05] [debug] [in_tail] append new file: logs/web/current
[2018/01/08 19:12:05] [debug] [in_tail] add to scan queue logs/web/current, offset=999240
[2018/01/08 19:12:05] [debug] [in_tail] file=logs/web/current promote to TAIL_EVENT
[2018/01/08 19:12:06] [debug] [in_tail] file=logs/web/current event
[2018/01/08 19:12:07] [debug] [in_tail] file=logs/web/current event

As you can see, after the rotation the offset is not reset and kept at it's previous value (999240 this case). To correct the problem, the db file must be removed and fluent-bit restarted.

Here is the sosreport:

$ fluent-bit --sosreport -c current/config/fluent-bit.conf
Fluent-Bit v0.12.11
Copyright (C) Treasure Data

[2018/01/08 20:58:59] [ info] [engine] started
[2018/01/08 20:58:59] [debug] [in_tail] inotify watch fd=20
[2018/01/08 20:58:59] [debug] [in_tail] scanning path logs/web/current
[2018/01/08 20:58:59] [debug] [in_tail] add to scan queue logs/web/current, offset=602013
[2018/01/08 20:58:59] [debug] [router] input=tail.0 'DYNAMIC TAG'

Fluent Bit Enterprise - SOS Report
==================================
The following report aims to be used by Fluent Bit and Fluentd Enterprise
Customers of Treasure Data. For more details visit:



[Fluent Bit]
    Edition Community Edition
    Version 0.12.11
    Built Flags JSMN_PARENT_LINKS JSMN_STRICT FLB_HAVE_TLS FLB_HAVE_SQLDB FLB_HAVE_BUFFERING FLB_HAVE_FLUSH_LIBCO FLB_HAVE_FORK FLB_HAVE_PROXY_GO FLB_HAVE_REGEX FLB_HAVE_C_TLS FLB_HAVE_SETJMP FLB_HAVE_ACCEPT4 FLB_HAVE_INOTIFY

[Operating System]
    Name Linux
    Release 4.14.12-x86_64-linode92
    Version #1 SMP Fri Jan 5 15:34:44 UTC 2018

[Hardware]
    Architecture x86_64
    Processors 1

[Built Plugins]
    Inputs              cpu mem kmsg tail proc disk netif dummy head health serial stdin tcp mqtt lib forward random syslog 
    Filters             grep stdout kubernetes parser record_modifier 
    Outputs             counter es exit file forward http influxdb kafka-rest nats null plot stdout td lib flowcounter 

[SERVER] Runtime configuration
    Flush 5
    Daemon Off
    Log_Level Debug

[INPUT] Instance
    Name tail.0 (tail, id=0)
    Flags DYN_TAG 
    Threaded No
    Tag requests
    path                logs/web/current
    parser              json
    db                  tmp/fluent-bit-request.db


[OUTPUT] Instance
    Name forward.0 (forward, mask_id=1)
    Match *
    TLS Active No
    Retry Limit 1
    Host.TCP_Port 24224
    Host.Name 127.0.0.1


Has anyone else encountered this problem? Is there more debug information I can provide?

Thanks,

- Sean



Sean

unread,
Jan 8, 2018, 4:42:23 PM1/8/18
to Fluent-Bit
After more debugging, I believe I tracked down the problem.

Before svlogd rotates the file, it chmod's the file (line 208 of src/svlogd.c). This triggers an ATTRIB inotify event. fluent-bit reads the ATTRIB event on line 74 of tail_fs_inotify.c and treats it as the watched file being removed:

    /* File was removed */
   
if (ev.mask & (IN_ATTRIB | IN_IGNORED)) {
        flb_debug
("[in_tail] removed %s", file->name);
        flb_tail_file_remove
(file);
       
return 0;
   
}

Is there any reason why the IN_DELETE_SELF inotify event was not used here? I'm not too familiar with the internal workings of inotify, but the documentation seems to point to IN_DELETE_SELF as the event to watch for a file being removed.

In the meantime, I'm going to patch this on my servers and see how it goes.

Thanks,

- Sean

Eduardo Silva

unread,
Jan 11, 2018, 12:23:15 PM1/11/18
to Sean, Fluent-Bit
hi Sean,

thanks for reaching about this issue. 

Looking at inotify events description the current values should be fine, e.g:


IN_ATTRIB (*)
                                           chmod(2)), timestamps (e.g., utimensat(2)), extended
                  attributes (setxattr(2)), link count (since Linux 2.6.25;
                  e.g., for the target of link(2) and for unlink(2)), and
                  user/group ID (e.g., chown(2)).
                  Metadata changed—for example, permissions (e.g.,
IN_IGNORED
                  Watch was removed explicitly (inotify_rm_watch(2)) or
                  unmounted). See also BUGS.

but you are right too, any file permission change will lead to a wrong "deleted file" behavior. The thing is that unlink(2) syscall triggers IN_ATTRIB and not IN_DELETE or IN_IGNORED.. 

still testing...


--
You received this message because you are subscribed to the Google Groups "Fluent-Bit" group.
To unsubscribe from this group and stop receiving emails from it, send an email to fluent-bit+unsubscribe@googlegroups.com.
To post to this group, send email to fluen...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/fluent-bit/24ccd512-2bf3-4d78-a3de-859507b18d6f%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.



--
Eduardo Silva
Open Source, Treasure Data
http://www.treasuredata.com/opensource

http://twitter.com/edsiper
  http://www.linkedin.com/in/edsiper

Eduardo Silva

unread,
Jan 11, 2018, 12:50:23 PM1/11/18
to Sean, Fluent-Bit
I think I've found the root cause of the problem:

- unlink(2) triggers IN_ATTRIB
- chmod triggers IN_ATTRIB
- we cannot trust in IN_DELETE/IN_DELETE_SELF since they will only work when no more process have open references to that file.

In the branch in_tail_attrib I pushed a workaround where it adds a double check for IN_ATTRIB case, so when that happens it will stat(2) and check if a real deletion was performed (st_nlink == 0), the commit is here:


would you please do some tests ?

regards, 

Sean West

unread,
Jan 12, 2018, 6:35:19 PM1/12/18
to Eduardo Silva, Fluent-Bit
Hi Eduardo,

Thanks for the patch! 

Initial tests look good. I have one server with the patch and one without. I'll let you know how it looks after the weekend.

- Sean

Sean West

unread,
Jan 15, 2018, 7:52:08 PM1/15/18
to Eduardo Silva, Fluent-Bit
Looks great! Thanks again.

- Sean

Eduardo Silva

unread,
Jan 16, 2018, 11:46:27 AM1/16/18
to Sean West, Fluent-Bit
thanks Sean, I've merged the fix into GIT master (upcoming v0.13)

regards, 
Reply all
Reply to author
Forward
0 new messages