limit size of sent data to elasticsearch


ruud

Feb 15, 2016, 3:35:39 AM
to Fluentd Google Group
hi group,
I hope you can advise me on a problem I have with large log files.

When processing large initial logfiles with fluentd (2.3.0), it turns out that not all records are visible in elasticsearch. In some cases, complete days cannot be retrieved.

It turns out this happens when json records in the communication to elasticsearch on port 9200 are split into two pieces. ES does not seem to handle that situation correctly. In the worst case, when the split falls inside the value of the timestamp field, the day index of ES gets corrupted and none of the records of that index (day) will be found.


my question
Since ES is not very responsive about this issue, I hope there is a way to fix this at the fluentd side of the communication. Is there a way to configure fluentd so that, when sending batches to ES, the batches are split up BETWEEN json records instead of in the middle of one?


info
To clarify the situation, this is what I did to find this issue:
  • started with an empty elasticsearch and empty fluentd directory
  • started tcpdump on port 9200 of the elasticsearch (and fluentd) machine
  • put a large log file in the monitored log directory
  • fluentd started to process the logfile and sent its data to ES
  • inspected the result of the tcpdump.
It appears that some records are silently dropped by ES, and the records of some days result in an error message in ES.
These correspond to the following situation in the tcpdump -A -i any 'port 9200' output:

:    :      :
{"@timestamp":"2015-12-21T15:12:00.000+00:00","url":"192.168.1.100","component":"pdfserver","factuurnummer":"21691153","een_klantnummer":"103575592","success":true,"path":"/data/factuur-logs/a22.log"}
{"@timestamp":"2015-12-21T15:15:00.000+00:00","url":"192.168.1.100","component":"pdfserver","factuurnummer":"21680672","een_klantnummer":"103011721","success":true,"path":"/data/factuur-logs/a22.log"}
{"@timestamp":"2015-12-21T15:15:00.000+00:00","url":"192.168.1.100","component":"pdfserve  << END OF TCP RECORD
17:05:06.524011 IP 172.22.25.227.9200 > 172.22.25.227.33655: Flags [.], ack 43879, win 2388, options [nop,nop,TS val 12919195 ecr 12919195], length 0
E..4..@.@...........#..w.U.de8A...  T.......
..!...!.
17:05:06.524068 IP 172.22.25.227.33655 > 172.22.25.227.9200: Flags [.], seq 43879:65767, ack 1, win 342, options [nop,nop,TS val 12919195 ecr 12919194], length 21888
E.U...@.@..e.........w#.e8A..U.d...V.......
..!...!.r","factuurnummer":"21680672","een_klantnummer":"103011721","success":true,"path":"/data/factuur-logs/a22.log"}
{"@timestamp":"2015-12-21T15:19:00.000+00:00","url":"192.168.1.100","component":"pdfserver","factuurnummer":"21690974","een_klantnummer":"103110656","success":true,"path":"/data/factuur-logs/a22.log"}

As you can see, the json record is divided into two pieces. In this case the record was not found in ES, and there was nothing in the ES logging. It is simply ignored.

As said, when this occurs WITHIN the value of a timestamp field, the day index of ES gets corrupted and nothing of that day will be found. In the ES log, a message like this is recorded for each query that includes that index:

 :   :   :   :
        at org.elasticsearch.search.SearchService.parseSource(SearchService.java:633)
        ... 9 more
[2016-02-11 12:37:47,881][DEBUG][action.search.type       ] [Agony] All shards failed for phase: [query]
[2016-02-11 12:37:47,878][DEBUG][action.search.type       ] [Agony] [logstash-2015.12.22][3], node[stwLqEprShKfaaP5jto1BQ], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@783242c8] lastShard [true]
org.elasticsearch.search.SearchParseException: [logstash-2015.12.22][3]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"facets":{"0":{"date_histogram":{"field":"@timestamp","interval":"30m"},"global":true,"facet_filter":{"fquery":{"query":{"filtered":{"query":{"query_string":{"query":"*"}},"filter":{"bool":{"must":[{"range":{"@timestamp":{"from":1448986135627,"to":1455190667873}}},{"range":{"@timestamp":{"from":1450652207645,"to":1450824559922}}}]}}}}}}}},"size":0}]]
        at org.elasticsearch.search.SearchService.parseSource(SearchService.java:649)
        at org.elasticsearch.search.SearchService.createContext(SearchService.java:511)
        at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:483)
        at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:252)
        at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:206)
        at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:203)
        at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:517)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.search.facet.FacetPhaseExecutionException: Facet [0]: (key) field [@timestamp] not found
        at org.elasticsearch.search.facet.datehistogram.DateHistogramFacetParser.parse(DateHistogramFacetParser.java:160)
        at org.elasticsearch.search.facet.FacetParseElement.parse(FacetParseElement.java:93)
        at org.elasticsearch.search.SearchService.parseSource(SearchService.java:633)
        ... 9 more

I think this is really an elasticsearch issue, but I hope I can work around it by not feeding ES input it cannot handle. Your help is greatly appreciated!

regards, Ruud

Mr. Fiber

Feb 15, 2016, 4:05:09 AM
to Fluentd Google Group
If a large batch request causes this problem, setting a smaller buffer_chunk_limit in the elasticsearch configuration may help.


Masahiro


ruud

Feb 15, 2016, 4:31:32 AM
to Fluentd Google Group


On Monday, February 15, 2016 at 10:05:09 AM UTC+1, repeatedly wrote:
If a large batch request causes this problem, setting a smaller buffer_chunk_limit in the elasticsearch configuration may help.


Masahiro


Hi Masahiro,
thanks for your hint. I will try that and report back. But I don't think this will solve the problem that json records are sent to ES in two pieces, which seems to be the cause. I really hope td-agent is able to control the way in which the data is sent.

If I split the large log file into small files, all of which have valid and complete json records, no problem arises, since fluentd will send smaller data packets to ES and the json records are not split in two. (The files must be small enough not to exceed the fluentd make-new-data-packet limit.)
I assume that this manual action of mine can be done by td-agent too, but I don't know whether that is the case. Is it possible?

Ruud

Mr. Fiber

Feb 15, 2016, 4:49:26 AM
to Fluentd Google Group
If I split the large log file into small files, all of which have valid and complete json records, no problem arises, since fluentd will send smaller data packets to ES and the json records are not split in two.

Hmm... I can't understand why this resolves the problem.

Batch request size depends on the output's buffer_chunk_limit, not on the data source size.
So even if you have a 1TB log file, the ES plugin doesn't send a 1TB batch request.

And in_tail doesn't read the entire file content in one read operation. The default is 1000 lines (read_lines_limit).

So a smaller read_lines_limit or buffer_chunk_limit should resolve your problem.
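
For reference, a minimal sketch of where each parameter lives: read_lines_limit in the in_tail <source>, buffer_chunk_limit in the buffered output <match>. The host and the values below are only illustrative, not recommendations:

<source>
  type tail
  path /data/factuur-logs/*.log
  tag fluentd.events
  format json
  read_from_head true
  # in_tail parameter: lines read per I/O iteration (default 1000)
  read_lines_limit 100
</source>

<match fluentd.events>
  type elasticsearch
  host 172.22.25.227
  logstash_format true
  # buffered-output parameter: maximum size of one chunk, i.e. one bulk request
  buffer_chunk_limit 8k
  flush_interval 5s
</match>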


ruud

Feb 15, 2016, 5:18:23 AM
to Fluentd Google Group
I have changed the elasticsearch configuration of fluentd (not the configuration of ES itself; I thought you proposed the latter, so I couldn't see how that would be a solution ;) )

I added a buffer_chunk_limit line:

<match fluentd.events>
  type copy
  <store>
    type elasticsearch
    host 172.22.25.227
    logstash_format true
    request_timeout 100
    reload_connections false
    reload_on_failure true
    disable_retry_limit false
    flush_at_shutdown true
    buffer_queue_limit 1024
    buffer_chunk_limit 512 
  </store>                 
</match>  

When restarting td-agent, the log says:

 2016-02-15 11:02:26 +0100 [info]: following tail of /data/factuur-logs/a22.log
2016-02-15 11:02:26 +0100 [warn]: Size of the emitted data exceeds buffer_chunk_limit.
2016-02-15 11:02:26 +0100 [warn]: This may occur problems in the output plugins ``at this server.``
2016-02-15 11:02:26 +0100 [warn]: To avoid problems, set a smaller number to the buffer_chunk_limit
2016-02-15 11:02:26 +0100 [warn]: in the forward output ``at the log forwarding server.``

Apart from this message, I see the same situation in the tcpdump output, namely splitting of json records:

12:03:26.912369 IP 172.22.25.227.35553 > 172.22.25.227.9200: Flags [P.], seq 21991:43879, ack 1, win 342, options [nop,nop,TS val 21361165 ecr 21361164], length 21888
E.U.kP@.@.............#.B$.'.Si-...V.......
.E.^M.E..:"2015-12-18T13:09:00.000+00:00","url":"192.168.1.137","component":"pdfserver","factuurnummer":"21695819","een_klantnummer":"103641633","success":true,"path":"/data/factuur-logs/a22.log"}
{"index":{"_index":"logstash-2015.12.18","_type":"fluentd"}}
{"@timestamp":"2015-12-18T13:09:00.000+00:00","url":"192.168.1.137","component":"pdfserver","factuurnummer":"21695010","een_klantnummer":"103569185","success":true,"path":"/data/factuur-logs/a22.log"}
{"index":{"_index":"logstash-2015.12.18","_type":"fluentd"}}
{"@timestamp":"2015-12-18T13:09:00.000+00:00","url":"192.168.1.137","component":"pdfserver","factuurnummer":"21695010","een_klantnummer":"103569185","success":true,"path":"/data/factuur-logs/a22.log"}
   :      :     :
{"index":{"_index":"logstash-2015.12.22","_type":"fluentd"}}
{"@timestamp":"2015-12-22T11:09:00.000+00:00","url":"192.168.1.137","component":"pdfserver","factuurnummer":"21694829","een_klantnummer":"103414322","success":true,"path":"/data/factuur-logs/a22.log"}
{"index":{"_index":"logstash-2015.12.22","_type":"fluentd"}}
{"@timestamp":"2015-12-22T11:09:00.000+00:00","url":"192.168.1.137","component":"pdfserver","factuurnummer":"21694829","een_klantnummer":"103414322","success":true,"path":"/data/factuur-logs/a22.log"}
{"index":{"_index":"logstash-2015.12.22","_type":"fluentd"}}
{"@timestamp":"2015-12-22T11:21:00.000+00:00","url":"192.168.1.137","component":"pdfserver","factuurnummer":"21686466","een_klantnummer":"000630343","success":true,"path":"/data/factuur-logs/a22.log"}
{"index":{"_index":"logstash-2015.12.22","_type":"fluentd"}}
{"@timestamp":"2015-12-22T11:21:0    << SPLIT TIMESTAMP
11:03:26.912663 IP 172.22.25.227.35553 > 172.22.25.227.9200: Flags [P.], seq 65767:87655, ack 1, win 342, options [nop,nop,TS val 21361165 ecr 21361164], length 21888

Resulting in the same ES behaviour.
I could not find any documentation on read_lines_limit.....


regards, Ruud

Mr. Fiber

Feb 15, 2016, 5:19:46 AM
to Fluentd Google Group
I could not find any documentation on read_lines_limit.....

I pasted the link in the previous reply.



Mr. Fiber

Feb 15, 2016, 5:23:34 AM
to Fluentd Google Group
I assume your input is in_tail. Is this right?

ruud

Feb 15, 2016, 6:00:17 AM
to Fluentd Google Group
yes, that is the plugin I use.

I have tested a bit further. I have a 700-line log file that causes the reported errors. When I stop ES and td-agent, split the file into two 350-line parts, and restart everything, the logfiles are processed and imported into ES correctly.

What is more: the tcpdump still shows split-up json records, but now they don't seem to cause problems. In the 700-line log file situation, the split-up records showed exactly the point of the ES failure.

I am really stumped now. I cannot imagine how to debug this. I cannot tell whether this is a td-agent or an ES issue....

By the way, when I follow the link to the in_tail page, I still cannot find the string read_line*

regards, Ruud

Mr. Fiber

Feb 15, 2016, 6:34:20 AM
to Fluentd Google Group
I am really stumped now. I cannot imagine how to debug this. I cannot tell whether this is a td-agent or an ES issue....

out_elasticsearch uses the elasticsearch-ruby library.
So if broken records are generated by out_elasticsearch,
elasticsearch-ruby may have a problem.
Or, if it sends two separate requests and ES behaves differently for different requests, it is an Elasticsearch problem.

By the way, when I follow the link to the in_tail page, I still cannot find the string read_line*

Here is a screenshot of the in_tail page.

ruud

Feb 15, 2016, 8:37:17 AM
to Fluentd Google Group
oops, I was looking at a cached version of the page.

I changed read_lines_limit and buffer_chunk_limit to unreasonably low numbers to be sure the output is affected:
    read_lines_limit 5
    buffer_chunk_limit 50k

but the only result is that I get even fewer hits in Kibana. Obviously more records are dropped during processing by ES.
Moreover, the tcpdump output still shows data packets of the same large size:

14:20:50.153074 IP 172.22.25.227.36612 > 172.22.25.227.9200: Flags [.], seq 103:21991, ack 1, win 342, options [nop,nop,TS val 24321975 ecr 24321973], length 21888  <<< unchanged by new settings
E.U...@.@.............#..]... .....V.......
.s...s..{"index":{"_index":"logstash-2015.12.16","_type":"fluentd"}}
{"@timestamp":"2015-12-16T16:27:00.000+00:00","url":"172.22.25.185","component":"pdfserver","factuurnummer":"21688273","een_klantnummer":"103606121","success":true,"path":"/data/factuur-logs/a22.log"}
{"index":{"_index":"logstash-2015.12.16","_type":"fluentd"}}

I hope this can be fixed, otherwise I have to give up on using Kibana altogether.... As soon as the histograms prove to be unreliable, this solution is a no-op....

regards, Ruud

ruud

Feb 15, 2016, 9:11:20 AM
to Fluentd Google Group
Now I see that splitting the file indeed has no effect as long as both parts are initially put in the watched directory. But when I split the file, put only one part in the watched directory, and add the other only after the first has been processed according to the logging, there is a big difference.

In the first situation, I get 441 ES hits when querying the whole date range. In the second situation, I get 705 hits. Which is by the way a bit strange too, because the log files contain 700 lines when concatenated, but that is another issue.

Concluding: the longer I test, the more I realize I am making wrong assumptions about what and where the problem is. And that makes it even harder to come up with a debug strategy. What would be a good next step to test?

regards, Ruud

Mr. Fiber

Feb 15, 2016, 10:11:25 AM
to Fluentd Google Group
50k is larger than 21888, so it seems like expected behaviour.
An Elasticsearch batch request includes a metadata line for each record, so
a very small buffer_chunk_limit is needed to avoid TCP packet separation.

But AFAIK, many users use the default buffer parameters with Elasticsearch,
so this TCP packet separation should not be a problem.
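
For illustration only (this request is hand-built, not taken from the thread): fluent-plugin-elasticsearch sends each buffer chunk as one Bulk API request in which every document is preceded by an action/metadata line, which is why each record costs an extra line on the wire. Assuming ES on localhost:9200, such a request looks like:

% curl -s -XPOST 'http://localhost:9200/_bulk' --data-binary $'{"index":{"_index":"logstash-2015.12.22","_type":"fluentd"}}\n{"@timestamp":"2015-12-22T11:21:00.000+00:00","component":"pdfserver","success":true}\n'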


Mr. Fiber

Feb 15, 2016, 10:17:55 AM
to Fluentd Google Group
Does your network or system have a specific configuration?
I remember a similar problem with fluent-plugin-webhdfs in an environment with limited TCP packet sizes.

Mr. Fiber

Feb 15, 2016, 10:30:46 AM
to Fluentd Google Group
In the first situation, I get 441 ES hits when querying the whole date range. In the second situation, I get 705 hits. Which is by the way a bit strange too, because the log files contain 700 lines when concatenated, but that is another issue.

Sorry, I can't picture the situation correctly.
If you can write a reproducible sequence and the relevant information here,
I can try it.



ruud

Feb 18, 2016, 4:10:00 AM
to Fluentd Google Group


On Monday, 15 February 2016 16:11:25 UTC+1, repeatedly wrote:
50k is larger than 21888, so it seems like expected behaviour.
An Elasticsearch batch request includes a metadata line for each record, so
a very small buffer_chunk_limit is needed to avoid TCP packet separation.


hi repeatedly,
there you have a good point. 21888 is more like 21K. Before doing anything else, I checked whether using a lower buffer_chunk_limit has any effect.

I put this in the configuration:

 <match fluentd.events>
  type copy
  <store>
    type elasticsearch
    host 172.22.25.227
    logstash_format true
    request_timeout 100
    reload_connections false
    reload_on_failure true
    disable_retry_limit false
    flush_at_shutdown true
    buffer_queue_limit 1024
    read_lines_limit 15
    buffer_chunk_limit 15k                                       
  </store>


And after restarting, the td-agent.log says this:

2016-02-18 09:28:58 +0100 [info]: using configuration file: <ROOT>
  <source>
    type tail_path
    pos_file /var/lib/td-agent/events.pos
    format json
    path /data/factuur-logs/*.log
    tag fluentd.events
    read_from_head true
    refresh_interval 10s
    path_key path
  </source>

  <match fluentd.events>
    type copy
    <store>
      type elasticsearch
      host 172.22.25.227
      logstash_format true
      request_timeout 100
      reload_connections false
      reload_on_failure true
      disable_retry_limit false
      flush_at_shutdown true
      buffer_queue_limit 1024
      read_lines_limit 15
      buffer_chunk_limit 15k
    </store>
  </match>
</ROOT>
2016-02-18 09:28:58 +0100 [warn]: section <store> is not used in <match fluentd.events> of copy plugin
2016-02-18 09:28:58 +0100 [info]: following tail of /data/factuur-logs/a22.log
2016-02-18 09:28:58 +0100 [warn]: Size of the emitted data exceeds buffer_chunk_limit.
2016-02-18 09:28:58 +0100 [warn]: This may occur problems in the output plugins ``at this server.``
2016-02-18 09:28:58 +0100 [warn]: To avoid problems, set a smaller number to the buffer_chunk_limit
2016-02-18 09:28:58 +0100 [warn]: in the forward output ``at the log forwarding server.``
2016-02-18 09:29:59 +0100 [info]: Connection opened to Elasticsearch cluster =>
.......

The tcpdump still shows data packets of the same length, namely 21888.
The warning 'section store is not used' disappears when I remove read_lines_limit from the configuration, but removing it has no effect on the result.
The split-up records don't appear in ES, and in the last test only 470 of 700 records are retrievable.

I don't know which process determines the 21888 packet size or how to configure it, but clearly buffer_chunk_limit is not the place to do this....

I will try to write up everything that is needed to reproduce my test. I hope it can be reproduced, otherwise the real problem may be in a place that makes tracking it down difficult.

thanks for your help, Ruud

ruud

Feb 22, 2016, 5:42:56 AM
to Fluentd Google Group
This can be used to reproduce the test I did:

-  /etc/td-agent/td-agent.conf contains

<source>
  type tail_path
  pos_file /var/lib/td-agent/events.pos
  format json
  path /data/factuur-logs/*.log
  tag fluentd.events
  read_from_head true
  refresh_interval 10s
  path_key path
</source>

<match fluentd.events>
  type copy
  <store>
    type elasticsearch
    host URL_OF_LOCALHOST_FORMAT_xxx.xxx.xxx.xxx

    logstash_format true
    request_timeout 100
    reload_connections false
    reload_on_failure true
    disable_retry_limit false
    flush_at_shutdown true
    buffer_queue_limit 1024
    buffer_chunk_limit 15k
  </store>
</match>

- create the directory /data/factuur-logs with my logfile, which can be downloaded from
- in a separate terminal, I start the packet dump with
   sudo /usr/sbin/tcpdump -A -i any 'port 9200' > /tmp/dump.txt
- sudo service elasticsearch stop; sudo /etc/init.d/td-agent stop ; sudo rm /var/lib/td-agent/events.pos; sudo rm -r /data/elasticsearch/facturen/nodes; sudo /etc/init.d/td-agent start; sudo service elasticsearch start

- wait for td-agent.log to contain 'Connection opened to Elasticsearch cluster' line
- stop tcpdump command
- determine number of elasticsearch records

On my machine, the number of ES records does not equal the number of lines in the log file that is processed by fluentd.

regards, Ruud

Mr. Fiber

Feb 22, 2016, 4:55:37 PM
to Fluentd Google Group
I can't reproduce the problem on my Mac OS X.
* ruby 2.1.3p242
* fluentd v0.12.20
* elasticsearch 2.1.0
* fluent-plugin-elasticsearch 1.4.0

- fluentd log

% fluentd -c fluentd.conf
2016-02-23 06:43:39 +0900 [info]: reading config file path="fluentd.conf"
2016-02-23 06:43:39 +0900 [info]: starting fluentd-0.12.20
2016-02-23 06:43:39 +0900 [info]: gem 'fluent-mixin-config-placeholders' version '0.3.0'
2016-02-23 06:43:39 +0900 [info]: gem 'fluent-mixin-rewrite-tag-name' version '0.1.0'
2016-02-23 06:43:39 +0900 [info]: gem 'fluent-plugin-elasticsearch' version '1.4.0'
2016-02-23 06:43:39 +0900 [info]: gem 'fluent-plugin-flowcounter-simple' version '0.0.3'
2016-02-23 06:43:39 +0900 [info]: gem 'fluent-plugin-forest' version '0.3.0'
2016-02-23 06:43:39 +0900 [info]: gem 'fluent-plugin-mongo' version '0.7.8'
2016-02-23 06:43:39 +0900 [info]: gem 'fluent-plugin-s3' version '0.6.5'
2016-02-23 06:43:39 +0900 [info]: gem 'fluent-plugin-secure-forward' version '0.3.4'
2016-02-23 06:43:39 +0900 [info]: gem 'fluent-plugin-td' version '0.10.28'
2016-02-23 06:43:39 +0900 [info]: gem 'fluentd' version '0.12.20'
2016-02-23 06:43:39 +0900 [info]: adding match pattern="fluentd.events" type="copy"
2016-02-23 06:43:39 +0900 [info]: adding source type="tail"
2016-02-23 06:43:39 +0900 [warn]: 'pos_file PATH' parameter is not set to a 'tail' source.
2016-02-23 06:43:39 +0900 [warn]: this parameter is highly recommended to save the position to resume tailing.
2016-02-23 06:43:39 +0900 [info]: using configuration file: <ROOT>
  <source>
    @type tail
    path /Users/repeatedly/tmp/fluentd/es_test/*.log
    tag fluentd.events
    format json
    read_from_head true
    refresh_interval 10s
  </source>
  <match fluentd.events>
    @type copy
    <store>
      @type elasticsearch

      request_timeout 100
      reload_connections false
      reload_on_failure true
      disable_retry_limit false
      flush_at_shutdown true
      buffer_queue_limit 1024
      buffer_chunk_limit 15k
      flush_interval 5s
    </store>
  </match>
</ROOT>
2016-02-23 06:43:39 +0900 [info]: following tail of /Users/repeatedly/tmp/fluentd/es_test/a22.log
2016-02-23 06:43:39 +0900 [warn]: Size of the emitted data exceeds buffer_chunk_limit.
2016-02-23 06:43:39 +0900 [warn]: This may occur problems in the output plugins ``at this server.``
2016-02-23 06:43:39 +0900 [warn]: To avoid problems, set a smaller number to the buffer_chunk_limit
2016-02-23 06:43:39 +0900 [warn]: in the forward output ``at the log forwarding server.``
2016-02-23 06:43:44 +0900 [info]: Connection opened to Elasticsearch cluster => {:host=>"localhost", :port=>9200, :scheme=>"http"}
^C2016-02-23 06:44:47 +0900 [info]: shutting down fluentd
2016-02-23 06:44:47 +0900 [info]: shutting down input type="tail" plugin_id="object:3fcc30480710"
2016-02-23 06:44:47 +0900 [info]: shutting down output type="copy" plugin_id="object:3fcc308ac7f8"
2016-02-23 06:44:47 +0900 [info]: process finished code=0

I removed pos_file for easy testing and added flush_interval 5s.

- ES log

% ./bin/elasticsearch
[2016-02-23 06:43:13,042][INFO ][node                     ] [The Symbiote] version[2.1.0], pid[24180], build[72cd1f1/2015-11-18T22:40:03Z]
[2016-02-23 06:43:13,042][INFO ][node                     ] [The Symbiote] initializing ...
[2016-02-23 06:43:13,086][INFO ][plugins                  ] [The Symbiote] loaded [], sites []
[2016-02-23 06:43:13,108][INFO ][env                      ] [The Symbiote] using [1] data paths, mounts [[/ (/dev/disk1)]], net usable_space [304gb], net total_space [930.7gb], spins? [unknown], types [hfs]
[2016-02-23 06:43:14,367][INFO ][node                     ] [The Symbiote] initialized
[2016-02-23 06:43:14,367][INFO ][node                     ] [The Symbiote] starting ...
[2016-02-23 06:43:14,429][INFO ][transport                ] [The Symbiote] publish_address {127.0.0.1:9300}, bound_addresses {127.0.0.1:9300}, {[fe80::1]:9300}, {[::1]:9300}
[2016-02-23 06:43:14,437][INFO ][discovery                ] [The Symbiote] elasticsearch/p112L1NuQuK6CKhWQKTXKQ
[2016-02-23 06:43:17,461][INFO ][cluster.service          ] [The Symbiote] new_master {The Symbiote}{p112L1NuQuK6CKhWQKTXKQ}{127.0.0.1}{127.0.0.1:9300}, reason: zen-disco-join(elected_as_master, [0] joins received)
[2016-02-23 06:43:17,470][INFO ][http                     ] [The Symbiote] publish_address {127.0.0.1:9200}, bound_addresses {127.0.0.1:9200}, {[fe80::1]:9200}, {[::1]:9200}
[2016-02-23 06:43:17,470][INFO ][node                     ] [The Symbiote] started
[2016-02-23 06:43:17,484][INFO ][gateway                  ] [The Symbiote] recovered [0] indices into cluster_state
[2016-02-23 06:43:45,044][INFO ][cluster.metadata         ] [The Symbiote] [fluentd] creating index, cause [auto(bulk api)], templates [], shards [5]/[1], mappings [fluentd]
[2016-02-23 06:43:45,268][INFO ][cluster.metadata         ] [The Symbiote] [fluentd] update_mapping [fluentd]

- the number of documents test

% curl -XGET 'http://localhost:9200/_stats' # before import
{"_shards":{"total":0,"successful":0,"failed":0},"_all":{"primaries":{},"total":{}},"indices":{}}
% curl -XGET 'http://localhost:9200/_stats' | jq -r '.indices["fluentd"]["primaries"]["docs"]["count"]'
700

I tried several times but it always prints 700.
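
A jq-free way to read the same total, assuming ES on localhost:9200 and the index name fluentd used in this test, would be:

% curl -XGET 'http://localhost:9200/_cat/count?v'
% curl -XGET 'http://localhost:9200/fluentd/_count?pretty'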


ruud

Feb 23, 2016, 3:33:28 AM
to Fluentd Google Group
- the number of documents test

% curl -XGET 'http://localhost:9200/_stats' # before import
{"_shards":{"total":0,"successful":0,"failed":0},"_all":{"primaries":{},"total":{}},"indices":{}}
% curl -XGET 'http://localhost:9200/_stats' | jq -r '.indices["fluentd"]["primaries"]["docs"]["count"]'
700

I tried several times but it always prints 700.


Hi repeatedly,
I don't have the jq command, but when I do the _stats call and edit the output, I get these counts:

 "logstash-2016.01.06":{"xxx":{"count":30,
"logstash-2015.12.28":{"xxx":{"count":34,
"logstash-2015.12.18":{"xxx":{"count":8,
"logstash-2016.01.03":{"xxx":{"count":2,
"logstash-2015.12.17":{"xxx":{"count":27,
"logstash-2015.12.16":{"xxx":{"count":10,
"logstash-2015.12.29":{"xxx":{"count":37,
"logstash-2016.01.05":{"xxx":{"count":81,
"logstash-2016.01.04":{"xxx":{"count":74,
"logstash-2015.12.23":{"xxx":{"count":36,
"logstash-2015.12.24":{"xxx":{"count":21,
"logstash-2016.01.01":{"xxx":{"count":2,
"logstash-2015.12.30":{"xxx":{"count":37,
"logstash-2015.12.21":{"xxx":{"count":71,
"logstash-2015.12.31":{"xxx":{"count":8,
"logstash-2015.12.22":{"xxx":{"count":0,

....which will differ from your counts. I am glad not everyone is hitting this issue, but it makes it harder to find someone who is experiencing this problem too.

Maybe it is a machine configuration issue, or a java issue. I have no clue where to start looking or where to ask for help, since I don't know the source of the issue.
Anything that helps me in pinpointing the cause of the trouble is greatly appreciated!

regards, Ruud

ruud

Feb 23, 2016, 3:58:16 AM
to Fluentd Google Group


On Thursday, 18 February 2016 10:10:00 UTC+1, ruud wrote:

The tcpdump still shows data packets of the same length, namely 21888.
The warning 'section store is not used' disappears when I remove read_lines_limit from the configuration, but removing it has no effect on the result.
The split-up records don't appear in ES, and in the last test only 470 of 700 records are retrievable.

I don't know which process determines the 21888 packet size or how to configure it, but clearly buffer_chunk_limit is not the place to do this....

I am still very interested in how to influence the output packet length (here 21888). buffer_chunk_limit seems a logical candidate, but it has no effect. Any ideas?

regards, Ruud

Mr. Fiber

Feb 23, 2016, 7:16:56 AM
to Fluentd Google Group
which will differ from your counts.

Ah, I remember this is a known issue with Elasticsearch.
Creating an index needs CPU / IO resources.
So if you send lots of such requests,
Elasticsearch becomes busy and rejects several of them.


I'm not sure whether this problem is related to your TCP separation issue.
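
One way to verify this kind of rejection, assuming ES 1.x/2.x on localhost:9200, is to look at the bulk thread pool counters (the rejected column):

% curl -XGET 'http://localhost:9200/_cat/thread_pool?v'
% curl -XGET 'http://localhost:9200/_nodes/stats/thread_pool?pretty'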


ruud

Feb 23, 2016, 10:12:27 AM
to Fluentd Google Group


On Tuesday, 23 February 2016 13:16:56 UTC+1, repeatedly wrote:
which will differ from your counts.

Ah, I remember this is a known issue with Elasticsearch.
Creating an index needs CPU / IO resources.
So if you send lots of such requests,
Elasticsearch becomes busy and rejects several of them.


I'm not sure whether this problem is related to your TCP separation issue.


Hi repeatedly,
again, thank you for your time helping me. With the help of your last post, I was able to find out that it really is an elasticsearch capacity issue. Elasticsearch must have enough resources to cope with the expected bulk requests (in this case from td-agent).

Clearly, if the capacity is too low, records get rejected, and that seems to correspond with TCP record boundaries.

In my elasticsearch configuration I raised threadpool.bulk.queue_size to 2000, and after that the empty indices and rejected records vanished.
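
For reference, that setting lives in elasticsearch.yml (ES 1.x/2.x key syntax); 2000 is simply the value used here, not a general recommendation:

# elasticsearch.yml
threadpool.bulk.queue_size: 2000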

I still have no clue how to solve this at the td-agent side of the communication (i.e. configuring td-agent to offer smaller bulks to ES, or at a slower pace), but I am very glad I now know how to handle this kind of problem in the future.
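
As a hedged sketch only, these are standard Fluentd v0.12 buffered-output parameters that would make td-agent send smaller bulks less frequently; they are not a confirmed fix for this case, and the values are illustrative:

<match fluentd.events>
  type elasticsearch
  host 172.22.25.227
  logstash_format true
  # upper bound on the size of one chunk / bulk request
  buffer_chunk_limit 1m
  # how often buffered chunks are flushed
  flush_interval 10s
  # flush one chunk at a time
  num_threads 1
</match>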

Ruud