Hi group,
I hope you can advise me on a problem I have with large log files.
When processing large initial log files with fluentd (2.3.0), it turns out that not all records are visible in Elasticsearch. In some cases, complete days cannot be retrieved.
It turns out this happens when JSON records in the communication to Elasticsearch on port 9200 are split across two TCP packets. ES does not seem to handle that situation correctly. In the worst case, when the split falls inside the value of the timestamp field, the day index of ES gets corrupted and none of the records of that index (day) can be found.
My question: since ES is not very responsive about this issue, I hope there is a way to fix this at the fluentd side of the communication. Is there a way to configure fluentd so that, when sending batches to ES, the batches are split up BETWEEN JSON records instead of in the middle of one?
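To make this concrete, this is the kind of output section I have in mind (just a sketch: the tag pattern and the buffer values are examples, and I do not know whether buffer_chunk_limit is even the right knob, which is exactly my question):

<match factuur.**>
  # fluent-plugin-elasticsearch output; my guess is that a smaller buffer
  # chunk might make each flush end on a complete JSON record
  type elasticsearch
  host 172.22.25.227
  port 9200
  logstash_format true
  buffer_type memory
  buffer_chunk_limit 1m
  flush_interval 10s
</match>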
Info: to clarify the situation, this is what I did to find this issue:
- started with an empty Elasticsearch and an empty fluentd directory
- started tcpdump on port 9200 of the Elasticsearch (and fluentd) machine
- put a large log file in the monitored log directory
- let fluentd process the log file and send its data to ES
- inspected the result of tcpdump
It appears that some records are silently dropped by ES, and that the records of some days result in an error message in ES. Both cases correspond to situations like this in the
tcpdump -A -i any 'port 9200' output:
: : :
{"@timestamp":"2015-12-21T15:12:00.000+00:00","url":"192.168.1.100","component":"pdfserver","factuurnummer":"21691153","een_klantnummer":"103575592","success":true,"path":"/data/factuur-logs/a22.log"}
{"@timestamp":"2015-12-21T15:15:00.000+00:00","url":"192.168.1.100","component":"pdfserver","factuurnummer":"21680672","een_klantnummer":"103011721","success":true,"path":"/data/factuur-logs/a22.log"}
{"@timestamp":"2015-12-21T15:15:00.000+00:00","url":"192.168.1.100","component":"pdfserve << END OF TCP RECORD
17:05:06.524011 IP 172.22.25.227.9200 > 172.22.25.227.33655: Flags [.], ack 43879, win 2388, options [nop,nop,TS val 12919195 ecr 12919195], length 0
E..4..@.@...........#..w.U.de8A... T.......
..!...!.
17:05:06.524068 IP 172.22.25.227.33655 > 172.22.25.227.9200: Flags [.], seq 43879:65767, ack 1, win 342, options [nop,nop,TS val 12919195 ecr 12919194], length 21888
E.U...@.@..e.........w#.e8A..U.d...V.......
..!...!.r","factuurnummer":"21680672","een_klantnummer":"103011721","success":true,"path":"/data/factuur-logs/a22.log"}
{"@timestamp":"2015-12-21T15:19:00.000+00:00","url":"192.168.1.100","component":"pdfserver","factuurnummer":"21690974","een_klantnummer":"103110656","success":true,"path":"/data/factuur-logs/a22.log"}
As you can see, the JSON record is divided into two pieces. In this case the record was not found in ES, with nothing in the ES logs; it is simply ignored.
As said, when the split occurs WITHIN the value of the timestamp field, the day index of ES gets corrupted and nothing of that day will be found. In the ES logs a message like this is recorded for each query that includes that index:
: : : :
at org.elasticsearch.search.SearchService.parseSource(SearchService.java:633)
... 9 more
[2016-02-11 12:37:47,881][DEBUG][action.search.type ] [Agony] All shards failed for phase: [query]
[2016-02-11 12:37:47,878][DEBUG][action.search.type ] [Agony] [logstash-2015.12.22][3], node[stwLqEprShKfaaP5jto1BQ], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@783242c8] lastShard [true]
org.elasticsearch.search.SearchParseException: [logstash-2015.12.22][3]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"facets":{"0":{"date_histogram":{"field":"@timestamp","interval":"30m"},"global":true,"facet_filter":{"fquery":{"query":{"filtered":{"query":{"query_string":{"query":"*"}},"filter":{"bool":{"must":[{"range":{"@timestamp":{"from":1448986135627,"to":1455190667873}}},{"range":{"@timestamp":{"from":1450652207645,"to":1450824559922}}}]}}}}}}}},"size":0}]]
at org.elasticsearch.search.SearchService.parseSource(SearchService.java:649)
at org.elasticsearch.search.SearchService.createContext(SearchService.java:511)
at org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:483)
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:252)
at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:206)
at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:203)
at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:517)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.search.facet.FacetPhaseExecutionException: Facet [0]: (key) field [@timestamp] not found
at org.elasticsearch.search.facet.datehistogram.DateHistogramFacetParser.parse(DateHistogramFacetParser.java:160)
at org.elasticsearch.search.facet.FacetParseElement.parse(FacetParseElement.java:93)
at org.elasticsearch.search.SearchService.parseSource(SearchService.java:633)
... 9 more
I think this is really an Elasticsearch issue, but I hope I can work around it by not feeding ES input it cannot handle. Your help is greatly appreciated!
regards, Ruud