duplicate log records when using in_tail plugin


ruud

Feb 5, 2015, 7:23:28 AM
to flu...@googlegroups.com
hello group,

I have the following situation:
- fluentd version fluentd-0.10.58
- elasticsearch plugin fluent-plugin-elasticsearch-0.7.0


configuration
<match fluentd.events>
    type elasticsearch
    logstash_format true
</match>

<source>
    type tail
    pos_file /var/log/td-agent/fluentd-tail.pos
    format json
    path /data/f_events/*.log
    tag fluentd.events
    read_from_head true
    path_field true
</source>

When I start from scratch, with no pos-file and an empty elasticsearch cluster, and I put a large logfile (648051 lines) in /data/f_events and wait a bit until everything is processed, 745500 records end up in elasticsearch.

I have experienced this with smaller files too. For a small log file I did this:
I queried all records from elasticsearch and dumped them to a file to inspect them.
I sorted the records on timestamp, then located and marked the duplicates. It appeared they all occurred in the same time frame, but not all records in that time frame were duplicated.

I saw no strange things in either the logging of fluentd or elasticsearch.

Has anyone heard of this? Is this a misconfiguration of some kind, or is the in_tail plugin broken? Do I need to supply extra information?


Hopefully I can work around this issue, since we need a reliable feeder for elasticsearch in order to analyze our log records with kibana.

thanks in advance, Ruud

Mr. Fiber

Feb 5, 2015, 7:46:22 PM
to flu...@googlegroups.com
Could you paste your entire fluentd log?
and
If this problem is reproducible, could you write the procedure here?


Masahiro


ruud

Feb 9, 2015, 6:00:10 AM
to flu...@googlegroups.com


On Friday, February 6, 2015 at 1:46:22 AM UTC+1, repeatedly wrote:
Could you paste your entire fluentd log?
and
If this problem is reproducible, could you write the procedure here?

Hello Masahiro,
this problem is reproducible, but the outcome differs from time to time. Not all logfiles give a problem, but when I give a large log file to fluentd, the problem is bound to happen.

I did this to prepare the fluentd log file for you:
- stop td-agent
- stop elasticsearch
- remove all cluster data in elasticsearch
- start elasticsearch
- remove the pos-file
- copy a new large logfile (9999 lines) to the log directory td-agent is listening to
- start td-agent

I use this sequence to be sure to have a fresh start, without interfering issues.

When elasticsearch and fluentd seem to be ready, I check the total document count AND the count per index. The index counts are terribly wrong. I suspect something is very misconfigured or broken.

I analyzed the dates of the loglines and counted them per date (as they go into per-date indexes; a small script for this counting is sketched after the list below). These are the numbers:
2014-09-04 = 53
2014-09-08 = 55
2014-09-10 = 54
2014-09-11 = 110
2014-09-15 = 52
2014-09-16 = 223
2014-09-17 = 276
2014-09-18 = 61
2014-09-23 = 272
2014-09-24 = 54
2014-09-25 = 163
2014-09-26 = 263
2015-01-12 = 107
2015-01-13 = 4693
2015-01-16 = 208
2015-01-17 = 106
2015-01-18 = 617
2015-02-06 = 54
2015-02-13 = 324
2015-02-14 = 108
2015-02-18 = 382
2015-02-20 = 55
2015-02-25 = 275
2015-02-26 = 470
2015-02-27 = 327
2015-02-28 = 55
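
The per-date counting can be done with a small script along these lines (a sketch; it assumes each log line is a JSON object carrying an @timestamp field):

#!/usr/bin/env ruby
# count log lines per date, assuming every line is a JSON object with an @timestamp field
require 'json'

counts = Hash.new(0)
File.foreach(ARGV[0]) do |line|
  stamp = JSON.parse(line)['@timestamp']   # e.g. "2015-01-13T10:22:33.123+01:00"
  counts[stamp[0, 10]] += 1                # first 10 characters = YYYY-MM-DD
end
counts.sort.each { |date, n| puts "#{date} = #{n}" }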

Now, when I dump the index counts, I get


logstash-2014.09.04 has 110 documents
logstash-2014.09.08 has 311 documents
logstash-2014.09.10 has 55 documents
logstash-2014.09.11 has 240 documents
logstash-2014.09.15 has 55 documents
logstash-2014.09.16 has 144 documents
logstash-2014.09.17 has 0 documents
logstash-2014.09.18 has 14 documents
logstash-2014.09.23 has 550 documents
logstash-2014.09.24 has 23 documents
logstash-2014.09.25 has 590 documents
logstash-2014.09.26 has 1279 documents
logstash-2015.01.12 has 352 documents
logstash-2015.01.13 has 15438 documents
logstash-2015.01.16 has 0 documents
logstash-2015.01.17 has 0 documents
logstash-2015.01.18 has 150 documents
logstash-2015.02.06 has 0 documents
logstash-2015.02.13 has 990 documents
logstash-2015.02.14 has 69 documents
logstash-2015.02.18 has 0 documents
logstash-2015.02.20 has 0 documents
logstash-2015.02.25 has 0 documents
logstash-2015.02.26 has 0 documents
logstash-2015.02.27 has 0 documents
logstash-2015.02.28 has 35 documents

Which totals 20405, so that's not 9999.
Only a few indices contain the right number of documents.

Now, to your question: the fluentd logfile contains

2015-02-09 11:20:58 +0100 [info]: starting fluentd-0.10.58
2015-02-09 11:20:58 +0100 [info]: gem 'fluent-mixin-config-placeholders' version '0.3.0'
2015-02-09 11:20:58 +0100 [info]: gem 'fluent-mixin-plaintextformatter' version '0.2.6'
2015-02-09 11:20:58 +0100 [info]: gem 'fluent-plugin-elasticsearch' version '0.7.0'
2015-02-09 11:20:58 +0100 [info]: gem 'fluent-plugin-mongo' version '0.7.4'
2015-02-09 11:20:58 +0100 [info]: gem 'fluent-plugin-record-reformer' version '0.4.0'
2015-02-09 11:20:58 +0100 [info]: gem 'fluent-plugin-rewrite-tag-filter' version '1.4.1'
2015-02-09 11:20:58 +0100 [info]: gem 'fluent-plugin-s3' version '0.5.1'
2015-02-09 11:20:58 +0100 [info]: gem 'fluent-plugin-scribe' version '0.10.14'
2015-02-09 11:20:58 +0100 [info]: gem 'fluent-plugin-tail_path' version '0.0.2'
2015-02-09 11:20:58 +0100 [info]: gem 'fluent-plugin-td' version '0.10.23'
2015-02-09 11:20:58 +0100 [info]: gem 'fluent-plugin-td-monitoring' version '0.2.0'
2015-02-09 11:20:58 +0100 [info]: gem 'fluent-plugin-webhdfs' version '0.4.1'
2015-02-09 11:20:58 +0100 [info]: gem 'fluentd' version '0.10.58'
2015-02-09 11:20:58 +0100 [info]: using configuration file: <ROOT>

  <match fluentd.events>
    type elasticsearch
    logstash_format true
  </match>
  <source>
    type tail
    pos_file /var/log/td-agent/fluentd-events.pos

    format json
    path /data/f_events/*.log
    tag fluentd.events
    read_from_head true
  </source>
</ROOT>
2015-02-09 11:20:58 +0100 [info]: adding source type="tail"
2015-02-09 11:20:58 +0100 [info]: adding match pattern="fluentd.events" type="elasticsearch"
2015-02-09 11:20:58 +0100 [info]: following tail of /data/f_events/middel3.log
2015-02-09 11:22:00 +0100 [info]: Connection opened to Elasticsearch cluster => {:host=>"localhost", :port=>9200, :scheme=>"http"}

I will test what happens when I create a 9999-line file by adding 1 line per second instead of handing it to fluentd all at once, and will let you know.
I hope someone has a clue what is happening here.

Ruud

ruud

Feb 9, 2015, 7:47:23 AM
to flu...@googlegroups.com


On Monday, February 9, 2015 at 12:00:10 PM UTC+1, ruud wrote:

I have analyzed the logfile and the contents of elasticsearch a bit further. I did this:
- I made a dump of all ES records into a file (a sketch of such a dump script follows further down), edited it to contain only the time stamps, sorted it and saved that copy
- I made a copy of the logfile that fluentd had processed, edited it to contain only the time stamps, sorted the file and saved it
- with a diff utility I compared the sorted files. They should be identical. However:

There are many changes, divided into only a few intervals (maybe 10): in each interval, records are either duplicated, left alone or ignored. Looking at it, the way this occurs appears very regular, but the regularity changes over time. In the beginning, each record is duplicated once, except when two log lines have the same time stamp; then only one of them is duplicated. Later on, all log lines are duplicated twice. In the ignore intervals it seems irregular, however. I attached a few screenshots.
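
The ES dump in the first step can be made with a script roughly like this (a sketch; it fetches everything in a single search request, so it is only suitable for these small test files, no scroll API):

#!/usr/bin/env ruby
# dump the @timestamp of every ES record into a sortable text file
require 'open-uri'
require 'json'

result = open('http://localhost:9200/logstash-*/_search?size=20000') { |f| JSON.parse(f.read) }
stamps = result['hits']['hits'].map { |hit| hit['_source']['@timestamp'] }.sort
File.write('es_timestamps.txt', stamps.join("\n") + "\n")
puts "dumped #{stamps.size} of #{result['hits']['total']} timestamps"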


regards, Ruud
duplicationtwiceormore.png
delete.png
duplication.png

ruud

Feb 9, 2015, 10:06:11 AM
to flu...@googlegroups.com
I have just completed a test in which I first created an empty log file in the configured directory and after that wrote 9999 log records to that file, waiting a second in between.
I did the same comparison as in the above post. ES and the logfile both contain 9999 records, and the time stamps are all right.

Seems to be no problem here. I will repeat the first test (the records delivered all at once) with a smaller amount to see if that changes the behaviour and post back the result.
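
An appender for this kind of test can be as simple as the following (a sketch; the record built here is a hypothetical placeholder, not my real event format):

#!/usr/bin/env ruby
# append 'count' JSON log lines to the watched file, one line per second
require 'json'

logfile = ARGV[0] || '/data/f_events/append_test.log'
count   = (ARGV[1] || 9999).to_i

File.open(logfile, 'a') do |f|
  count.times do |n|
    f.puts({ '@timestamp' => Time.now.strftime('%Y-%m-%dT%T.%L+01:00'),
             'message'    => "test line #{n}" }.to_json)
    f.flush        # flush each line so in_tail sees it right away
    sleep 1
  end
end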

Ruud
 

Mr. Fiber

Feb 9, 2015, 9:24:48 PM
to flu...@googlegroups.com
2015-02-09 11:20:58 +0100 [info]: following tail of /data/f_events/middel3.log

From this log, in_tail seems to work correctly. It is not tailing multiple files.

I will wait for your next test result.




ruud

Feb 10, 2015, 7:26:57 AM
to flu...@googlegroups.com


On Tuesday, February 10, 2015 at 3:24:48 AM UTC+1, repeatedly wrote:
2015-02-09 11:20:58 +0100 [info]: following tail of /data/f_events/middel3.log

From this log, in_tail seems to work correctly. It is not tailing multiple files.

I will wait for your next test result.

It looks as if it is fairly reproducible when I follow this pattern:
- init.d/td stop
- remove pos-file
- init.d/elasticsearch stop; rm -rf /data/elasticsearch/cluster/node; init.d/elasticsearch start (ie start with a new index)
- put a log file in the watched directory
- init.d/td start

When a file is 250 lines or more, the number of records in ES will be doubled, probably each line duplicated exactly once. Trying this with log files of 10, 100, 150, 200 and 225 lines, no problems arise.
It seems that dropping a log file in the watched directory while td-agent is running does not lead to errors.

Then I tried only stopping the agent, dropping a logfile in the watched directory and starting the agent. I tried this with files of 500, 5000 and 10000 lines respectively with no problem. Only when I added a 647506-line file did problems start.
In the logging, this showed up:

2015-02-10 13:13:58 +0100 [info]: adding source type="tail"
2015-02-10 13:13:58 +0100 [info]: adding match pattern="fluentd.events" type="elasticsearch"
2015-02-10 13:13:58 +0100 [warn]: parameter 'path_key' in <source>
  type tail
  pos_file /var/log/td-agent/fluentd-events6.pos
  .......
</source> is not used.
2015-02-10 13:13:58 +0100 [info]: following tail of /data/f_events/klein5.log
2015-02-10 13:13:58 +0100 [info]: following tail of /data/f_events/groot.log
2015-02-10 13:14:04 +0100 [info]: Connection opened to Elasticsearch cluster => {:host=>"localhost", :port=>9200, :scheme=>"http"}
2015-02-10 13:14:35 +0100 [warn]: Could not push logs to Elasticsearch, resetting connection and trying again. Operation timed out after 5002 milliseconds with 0 bytes received
2015-02-10 13:14:37 +0100 [info]: Connection opened to Elasticsearch cluster => {:host=>"localhost", :port=>9200, :scheme=>"http"}
2015-02-10 13:14:50 +0100 [info]: following tail of /data/f_events/klein4.log
2015-02-10 13:14:50 +0100 [info]: following tail of /data/f_events/klein7.log
2015-02-10 13:14:50 +0100 [info]: following tail of /data/f_events/klein9.log
2015-02-10 13:14:50 +0100 [info]: following tail of /data/f_events/klein14.log

And the result was 735099 ES hits instead of 647506

I am wondering if this is reproducible on your machine. I have attached the smallest logfile that gives problems when started with an empty ES index (250 lines).

regards, Ruud
klein10.log

Mr. Fiber

Feb 11, 2015, 8:19:29 AM
to flu...@googlegroups.com
Okay, I will check it later.
Please wait...

--

ruud

Feb 11, 2015, 9:32:55 AM
to flu...@googlegroups.com


On Wednesday, February 11, 2015 at 2:19:29 PM UTC+1, repeatedly wrote:
Okay, I will check it later.
Please wait...


hi Repeatedly,
in the meantime I have collected further information. This is my test:
With td-agent running, I started with an empty logfile and added loglines to it, with a sleep interval between them.
I started with low volumes and long sleeps and increased the volume and speed. Each line is approx. 168 characters long.

After each run, I checked the number of ES records against the line count of the logfile. The table below lists the number of added records per run in column 1, the sleep interval in seconds in column 2 and the resulting number of ES records in the last column:
 
ADD      SLEEP   ES-RECORDS
20       0       20
2000     1       2020
980      0.1     3000
2000     0.1     5000
5000     0.1     10000
10000    0.1     20000
40000    0.1     60000
40000    0.05    100000
100000   0.01    200000
100000   0.01    300000
200000   0       2036455

As you can see, everything works fine until I added the lines without a sleep interval between them (the last entry; the difference between the actual and expected number of ES records is huge). I suspect this behaviour is related to the other tests, in the sense that fluentd can seemingly be confronted with too much data at once.

In the code I found this:
                if lines.size >= MAX_LINES_AT_ONCE
                  # not to use too much memory in case the file is very large
                  read_more = true
                  break
                end
Can this all simply be a memory issue? What can I do about it in that case?

regards, Ruud

ruud

Feb 11, 2015, 11:01:12 AM
to flu...@googlegroups.com


On Wednesday, February 11, 2015 at 3:32:55 PM UTC+1, ruud wrote:
Can this all simply be a memory issue? What can I do about it in that case?

Hi,
In trying to find the limit that fluentd is hitting, I tried writing log records to a logfile with no sleep in between, increasing the number of lines until the resulting number of ES records did not match the number of loglines.
I started out with 10000 records and increased the number in steps up to 80000 without problems. When I added 100000 records, the logging of td-agent came up with this:

2015-02-11 16:36:41 +0100 [info]: following tail of /data/f_events/test_rk9.log
2015-02-11 16:36:41 +0100 [info]: following tail of /data/f_events/test_rk7.log
2015-02-11 16:36:41 +0100 [info]: following tail of /data/f_events/test_rk8.log
2015-02-11 16:37:41 +0100 [info]: following tail of /data/f_events/test_rk12.log
2015-02-11 16:37:43 +0100 [info]: Connection opened to Elasticsearch cluster => {:host=>"localhost", :port=>9200, :scheme=>"http"}
2015-02-11 16:45:45 +0100 [warn]: Could not push logs to Elasticsearch, resetting connection and trying again. Operation timed out after 5000 milliseconds with 0 bytes received
2015-02-11 16:45:47 +0100 [info]: Connection opened to Elasticsearch cluster => {:host=>"localhost", :port=>9200, :scheme=>"http"}
2015-02-11 16:46:20 +0100 [warn]: Could not push logs to Elasticsearch, resetting connection and trying again. Operation timed out after 5001 milliseconds with 0 bytes received
2015-02-11 16:46:24 +0100 [info]: Connection opened to Elasticsearch cluster => {:host=>"localhost", :port=>9200, :scheme=>"http"}
2015-02-11 16:46:58 +0100 [warn]: temporarily failed to flush the buffer. next_retry=2015-02-11 16:45:03 +0100 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Could not push logs to Elasticsearch after 2 retries. Operation timed out after 5000 milliseconds with 0 bytes received" plugin_id="object:3f89d3ee94a8"
  2015-02-11 16:46:58 +0100 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-elasticsearch-0.7.0/lib/fluent/plugin/out_elasticsearch.rb:177:in `rescue in send'
  2015-02-11 16:46:58 +0100 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-elasticsearch-0.7.0/lib/fluent/plugin/out_elasticsearch.rb:167:in `send'
  2015-02-11 16:46:58 +0100 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-elasticsearch-0.7.0/lib/fluent/plugin/out_elasticsearch.rb:161:in `write'
  2015-02-11 16:46:58 +0100 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.58/lib/fluent/buffer.rb:300:in `write_chunk'
  2015-02-11 16:46:58 +0100 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.58/lib/fluent/buffer.rb:280:in `pop'
  2015-02-11 16:46:58 +0100 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.58/lib/fluent/output.rb:315:in `try_flush'
  2015-02-11 16:46:58 +0100 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.10.58/lib/fluent/output.rb:136:in `run'
2015-02-11 16:47:40 +0100 [warn]: Could not push logs to Elasticsearch, resetting connection and trying again. Operation timed out after 5000 milliseconds with 0 bytes received
2015-02-11 16:47:42 +0100 [info]: Connection opened to Elasticsearch cluster => {:host=>"localhost", :port=>9200, :scheme=>"http"}
2015-02-11 16:48:15 +0100 [warn]: Could not push logs to Elasticsearch, resetting connection and trying again. Operation timed out after 5000 milliseconds with 0 bytes received
2015-02-11 16:48:19 +0100 [info]: Connection opened to Elasticsearch cluster => {:host=>"localhost", :port=>9200, :scheme=>"http"}
2015-02-11 16:48:27 +0100 [warn]: retry succeeded. plugin_id="object:3f89d3ee94a8"
2015-02-11 16:49:02 +0100 [warn]: Could not push logs to Elasticsearch, resetting connection and trying again. Operation timed out after 5000 milliseconds with 0 bytes received
2015-02-11 16:49:04 +0100 [info]: Connection opened to Elasticsearch cluster => {:host=>"localhost", :port=>9200, :scheme=>"http"}
2015-02-11 16:49:37 +0100 [warn]: Could not push logs to Elasticsearch, resetting connection and trying again. Operation timed out after 5000 milliseconds with 0 bytes received
2015-02-11 16:49:41 +0100 [info]: Connection opened to Elasticsearch cluster => {:host=>"localhost", :port=>9200, :scheme=>"http"}

In the meantime the number of ES records kept increasing far beyond the expected number. I had to stop the agent to stop this. When I stopped, ES reported 2569616 records instead of the actual 340000 records in the logfile.

Hopefully this helps in tracking this issue down.

regards, Ruud
 

Mr. Fiber

Feb 11, 2015, 8:31:30 PM
to flu...@googlegroups.com
I tried your logfile but could not reproduce the problem.

- fluentd log

2015-02-12 10:18:44 +0900 [info]: reading config file path="fluent.conf"
2015-02-12 10:18:44 +0900 [info]: starting fluentd-0.10.59
...
2015-02-12 10:18:44 +0900 [info]: gem 'fluent-plugin-elasticsearch' version '0.7.0'
...
2015-02-12 10:18:44 +0900 [info]: using configuration file: <ROOT>
  <source>
    type tail
    path /Users/repeatedly/tmp/fluentd/es/log/*.log
    pos_file /Users/repeatedly/tmp/fluentd/es/pos_file
    format json
    read_from_head true
    refresh_interval 10s
    tag es.test
  </source>
  <match es.**>
    type copy
    <store>
      type flowcounter_simple # check in_tail emitted document number
      unit second
    </store>
    <store>
      type elasticsearch
      logstash_format true
      flush_at_shutdown true
    </store>
  </match>
</ROOT>
2015-02-12 10:18:44 +0900 [info]: adding source type="tail"
2015-02-12 10:18:44 +0900 [info]: adding match pattern="es.**" type="copy"
2015-02-12 10:18:54 +0900 [info]: following tail of /Users/repeatedly/tmp/fluentd/es/log/klein10.log
2015-02-12 10:18:55 +0900 [info]: plugin:out_flowcounter_simple count:250       indicator:num   unit:second
^C2015-02-12 10:19:09 +0900 [info]: shutting down fluentd
2015-02-12 10:19:10 +0900 [info]: Connection opened to Elasticsearch cluster => {:host=>"localhost", :port=>9200, :scheme=>"http"}
2015-02-12 10:19:11 +0900 [info]: process finished code=0

- put klein10.log

/Users/repeatedly/tmp/fluentd/es% ls
fluent.conf  klein10.log  log/
/Users/repeatedly/tmp/fluentd/es% cp klein10.log log/
/Users/repeatedly/tmp/fluentd/es% cat pos_file
/Users/repeatedly/tmp/fluentd/es/log/klein10.log        000000000000a4e8        000000000224c4f7

- es log

[2015-02-12 10:19:10,271][INFO ][cluster.metadata         ] [Scimitar] [logstash-2015.02.03] creating index, cause [auto(bulk api)], shards [5]/[1], mappings []
[2015-02-12 10:19:10,449][INFO ][cluster.metadata         ] [Scimitar] [logstash-2015.02.07] creating index, cause [auto(bulk api)], shards [5]/[1], mappings []
[2015-02-12 10:19:10,469][INFO ][cluster.metadata         ] [Scimitar] [logstash-2015.02.06] creating index, cause [auto(bulk api)], shards [5]/[1], mappings []
[2015-02-12 10:19:10,489][INFO ][cluster.metadata         ] [Scimitar] [logstash-2015.02.02] creating index, cause [auto(bulk api)], shards [5]/[1], mappings []
[2015-02-12 10:19:10,509][INFO ][cluster.metadata         ] [Scimitar] [logstash-2015.02.01] creating index, cause [auto(bulk api)], shards [5]/[1], mappings []
[2015-02-12 10:19:10,528][INFO ][cluster.metadata         ] [Scimitar] [logstash-2015.02.05] creating index, cause [auto(bulk api)], shards [5]/[1], mappings []
[2015-02-12 10:19:10,548][INFO ][cluster.metadata         ] [Scimitar] [logstash-2015.02.04] creating index, cause [auto(bulk api)], shards [5]/[1], mappings []
[2015-02-12 10:19:11,143][INFO ][cluster.metadata         ] [Scimitar] [logstash-2015.02.03] update_mapping [fluentd] (dynamic)
[2015-02-12 10:19:11,152][INFO ][cluster.metadata         ] [Scimitar] [logstash-2015.02.06] update_mapping [fluentd] (dynamic)
[2015-02-12 10:19:11,162][INFO ][cluster.metadata         ] [Scimitar] [logstash-2015.02.05] update_mapping [fluentd] (dynamic)
[2015-02-12 10:19:11,164][INFO ][cluster.metadata         ] [Scimitar] [logstash-2015.02.01] update_mapping [fluentd] (dynamic)
[2015-02-12 10:19:11,169][INFO ][cluster.metadata         ] [Scimitar] [logstash-2015.02.02] update_mapping [fluentd] (dynamic)
[2015-02-12 10:19:11,186][INFO ][cluster.metadata         ] [Scimitar] [logstash-2015.02.07] update_mapping [fluentd] (dynamic)
[2015-02-12 10:19:11,190][INFO ][cluster.metadata         ] [Scimitar] [logstash-2015.02.04] update_mapping [fluentd] (dynamic)

- check script

require 'open-uri'
require 'json'

result = open('http://localhost:9200/_stats') { |f| JSON.parse(f.read) }
total = 0
result['indices'].each_pair { |i, status|
  num = status['total']['docs']['count']
  puts "#{i} : #{num}"
  total += num
}

puts "the number of docs: #{total}"

- check script result

logstash-2015.02.03 : 38
logstash-2015.02.07 : 38
logstash-2015.02.02 : 46
logstash-2015.02.06 : 29
logstash-2015.02.01 : 36
logstash-2015.02.05 : 34
logstash-2015.02.04 : 29
the number of docs: 250


Hmm...



ruud

Feb 12, 2015, 5:03:21 AM
to flu...@googlegroups.com


On Thursday, February 12, 2015 at 2:31:30 AM UTC+1, repeatedly wrote:
I tried your logfile but could not reproduce the problem.
the number of docs: 250
Hmm...


hi Repeatedly,
after my last post, I had this feeling I was maybe blaming the wrong plugin...
Your reply definitely supports this, so I am happy with your report.

Anyhow, I noticed a difference between our configs, so I changed mine to match your version of it. It did not make much of a difference: I saw the same behaviour, but now with the flowcounter output I know for sure that in_tail is not the issue.

Therefore, I focused on the elasticsearch plugin.
I noticed that
- setting either reload_connections to false or reload_on_failure to true did not make any difference
- raising request_timeout to 20 makes a lot of difference: the initial-file test and the append test now pass.
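
For illustration, a <match> section with these parameters would look roughly like this (request_timeout and reload_on_failure are the parameters mentioned above; the buffer values are only an example, not carefully tuned):

<match fluentd.events>
    type elasticsearch
    logstash_format true
    # raised from the 5s default; with this the initial-file test and the append test pass
    request_timeout 20s
    # these two alone did not make a difference
    reload_connections false
    reload_on_failure true
    # standard Fluentd 0.10 buffered-output knobs, values only as an example
    buffer_chunk_limit 8m
    buffer_queue_limit 64
    flush_interval 5s
</match>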

I don't have a comfortable feeling about this, anyhow. It probably just means that the limit is raised; once fluentd hits the limit again, the same bug will be exposed again.

I feel the plugin has to react in an adequate way to timeouts. Do you know who is the right person to report this to?

Nonetheless, I am glad to have found a way to configure td-agent so that it can be used. I will continue testing to see if it now is reliable enough to fit our needs.
Thanks for your input.

Ruud

ruud

Feb 12, 2015, 10:45:56 AM
to flu...@googlegroups.com


On Thursday, February 12, 2015 at 11:03:21 AM UTC+1, ruud wrote:

I feel the plugin has to react in an adequate way to timeouts. Do you know who is the right person to report this to?

Nonetheless, I am glad to have found a way to configure td-agent so that it can be used. I will continue testing to see if it now is reliable enough to fit our needs.
Thanks for your input.

hi repeatedly,

Further testing shows there is a difference between the issue with large logfiles and the issue with rapidly adding lots of data to a file. The latter seems to be solved by setting the retry parameters, but the large file (my first post) continues to give the described outcome (that is, runs of duplicated records and runs of forgotten records).
The forgotten (deleted) records are a serious problem, since they can result in one or more empty indices in ES. This hits a kibana bug, making it hang when querying a time interval that makes kibana search the empty index.

I would be very interested to know the result of the test with my large testfile on your computer. On my computer it is guaranteed to fail (the number of records in ES is either too high or too low, depending on the length of the duplicate and delete runs), regardless of configuration tweaking. You can find it at http://we.tl/kWCtgDHrcz

thanks in advance, Ruud

Mr. Fiber

Feb 12, 2015, 11:31:52 AM
to flu...@googlegroups.com
You can find it at http://we.tl/kWCtgDHrcz

I have an account for Microsoft and Office online but I can't download it.
Please upload it to another place which doesn't need authentication / authorization.


--

ruud

Feb 16, 2015, 9:12:21 AM
to flu...@googlegroups.com

hi repeatedly,
you can use the output of the script below. It seems that changes in the buffer parameters change the outcome, but nonetheless I have not found a configuration that results in a stable and correct outcome.

regards, Ruud

#!/usr/bin/env ruby
count = 600000
if (ARGV.length != 1)
        puts "usage: scriptname logfile"
        exit
end
logfile = ARGV[0]
def stamp_periode
        '2015-02-' + ("%02d" % (rand( 15) + 1)) + Time.now.strftime( 'T%T.%L+01:00')
end
def json_testboodschap
        key = ($component.nil? ? 'test-rbg-events-reception01' : $component)
        '{"@timestamp":"' + stamp_periode +
        '","componentName":"' + key + '","eventType":"TRANS' + rand(5).to_s +
        '","documentKey":"r-' + key + '-1999-' + rand( 9999).to_s +
        '","success":' + (rand( 20) / 18 < 1).to_s +  ',"message":"boodschap ' +
        rand( 1000000).to_s + '"}'
end
begin
        File.open( logfile, 'w') do |f|
                n = 0
                while n < count
                        n += 1
                        f.puts json_testboodschap
                end
        end
rescue => exp
        $stderr.puts( "exception '#{exp}' (#{exp.class})")
        exit -1
end


Mr. Fiber

Feb 17, 2015, 2:05:34 AM
to flu...@googlegroups.com
Thanks!

This issue was reproduced with 600000 lines.
I also tested using logstash and the problem was reproduced there as well.
I tried it without logstash_format and it worked properly.

So I think elasticsearch has a problem when creating multiple indices
and inserting lots of records into these indices in a short time.


Masahiro


Mr. Fiber

Feb 17, 2015, 2:12:02 AM
to flu...@googlegroups.com
I'm not sure what the cause of this problem is.
But setting the replica count to 0 reduces the load on ES, so
it may avoid the problem.

For safety, inserting slowly seems better.
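
For example, an index template along these lines should apply that to new logstash indices (an untested sketch using the ES 1.x template API; adjust the template name and pattern to your setup):

curl -XPUT 'http://localhost:9200/_template/logstash_no_replica' -d '
{
  "template": "logstash-*",
  "settings": { "number_of_replicas": 0 }
}'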

Mr. Fiber

Feb 17, 2015, 2:46:09 AM
to flu...@googlegroups.com
I checked the API response and ES returns a reject exception.

{"create"=>{"_index"=>"logstash-2015.02.09", "_type"=>"fluentd", "_id"=>"AUuWe67AbI-Dj1qyphbw", "status"=>429, "error"=>"EsRejectedExecutionException[rejected execution (queue capacity 50) on org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1@3b403c5f]"}}

So maybe, your ES cluster needs more power to handle these documents.


Conclusion: this is not a fluentd issue.


Masahiro

ruud

Feb 17, 2015, 2:57:07 AM
to flu...@googlegroups.com
hi Masahiro,

thank you so much for finding the true source of the problem. I was clearly having tunnel vision about this issue. I will try to tweak the ES configuration to get rid of this and continue with testing fluentd as a logstash replacement.

One question I want to ask. Thanks to the ES exception you were able to diagnose the cause of this issue. What did you do to get it?

regards, Ruud

Mr. Fiber

Feb 17, 2015, 3:12:21 AM
to flu...@googlegroups.com
One question I want to ask. Thanks to the ES exception you were able to diagnose the cause of this issue. What did you do to get it?


Putting `p` into ES ruby library.
You can check these values via REST API.

ruud

Feb 18, 2015, 5:21:05 AM
to flu...@googlegroups.com


On Tuesday, February 17, 2015 at 9:12:21 AM UTC+1, repeatedly wrote:
Putting `p` into ES ruby library.
You can check these values via REST API.

hi Masahiro,
I have tried the REST calls you suggested in your previous post and got lots of interesting output from the _cat/thread_pool request,
but none of it was similar to your output (with the create field and the error message). I do not understand what you mean by 'putting `p`'. Can you please post the ruby code you used to generate that?

I tried this:
$ curl -XGET 'http://srv1070bh.sdu.nl:9200/_cat/thread_pool?v&h=bulk.rejected'
bulk.rejected
          139


Interestingly, although it makes clear 'something has happened' in elasticsearch, it does not explain the difference between the size of the log file (647506) and the number of ES records (644889).

Ruud

no jihun

Jun 3, 2015, 8:04:04 PM
to flu...@googlegroups.com
Hi ruud,
I am in a similar situation.

A 10M message source:
python fluentd client > fluentd > elasticsearch
=> produces about 5% more documents indexed than expected.

So I am curious whether you have since solved the duplicated indexing problem entirely.

Maithri V M

May 17, 2016, 12:42:34 AM
to Fluentd Google Group
Hello there,

I'm also facing this issue, and it is happening with files of just 300,000 lines of logs. I see duplicate records for the first 250 lines, but it is not moving beyond that by any means. Both the fluentd output to a file and elasticsearch reflect the same outcome; hence I am not able to concretely attribute it to elasticsearch alone. Please help.

Please find my code here, as posted as a new question.

Looking forward to your valuable inputs. Thanks in advance!

Bill Milligan

Oct 20, 2018, 10:58:24 PM
to Fluentd Google Group

I'm also having the exact same issue on the latest versions of everything, and I'm seeing it not with thousands of lines of logs, but with only a handful added one at a time. I'm getting between four and ten duplicates created. I have no idea if the problem is on the fluentd side or the elastic side, but I really need a solution.

Bill Milligan

Oct 20, 2018, 10:59:31 PM
to Fluentd Google Group

My td-agent:

<source>
  @type tail
  path c:/users/bill/Documents/lifelog/bill.log
  format json
  pos_file c:/users/bill/Documents/lifelog/bill.log.pos
  time_key Meta.Event Timestamp
  tag bill.life
</source>
<match *.**>
  @type elasticsearch
  host localhost
  port 9200
  logstash_format true
</match>