| Duplicate events in ES index ... and some rant | Manish Sapariya | 6/13/13 3:00 AM | Hi, I am seeing the same event stored multiple times in the ES index. I am using lumberjack to collect the events, which feeds into logstash. I would appreciate any guidance in isolating this problem. I am not sure what other information I could share, so I am keeping this very short for now. === rant === Over the last month I have been facing one issue or another with this setup. This causes doubt in my mind about choosing the lumberjack + logstash + ES + kibana combination. My end goal is to extract metrics regularly from the large set of log files that our system generates, but I am not even able to get past HTTP access logs. I am sure people are using this toolchain at a larger scale than I am. Over the last month I have run into the following problems. - Excessive load causes logstash to choke with CLOSE_WAIT - I reduced the amount of logs fed to logstash - reduced the number of filters in the logstash config - Sometimes when I search using Kibana, ES throws an exception and logstash cannot push events to it. I have to restart the whole toolchain to recover. (I have not discussed this on the list before this rant) - And now duplicate events. (Things have been good for the last week, after reducing the filters and the number of log files, but I was very nervous to see the same events injected multiple times.) Some positive comments would really help motivate me. :-) Thanks and Regards, Manish |
| Re: [logstash-users] Duplicate events in ES index ... and some rant | Jordan Sissel | 6/13/13 9:34 AM | On Thu, Jun 13, 2013 at 3:00 AM, Manish Sapariya <msap...@gmail.com> wrote: Lumberjack is designed to guarantee that every event will be sent. To do this, it can sometimes send an event repeatedly. The reason for this is that lumberjack spools a few hundred events up before sending them off - if that full spool is not acknowledged, it is resent. If logstash receives 1000 events in a lumberjack payload, and processes 500 of them before lumberjack thinks there's a timeout, lumberjack will reconnect and resend the full 1000, giving you the first 500 duplicated.
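The resend behavior above can be sketched as a toy simulation (this is not the real lumberjack protocol, just an illustration of at-least-once delivery: an unacknowledged spool is resent in full, so a partially processed payload shows up downstream as duplicates):

```python
# Toy model of at-least-once delivery with full-spool retransmission.

def deliver(spool, server_capacity):
    """The server indexes up to `server_capacity` events, then times out
    without acking; the client reconnects and resends the whole spool."""
    indexed = []
    indexed.extend(spool[:server_capacity])  # first attempt: partial, no ack
    indexed.extend(spool)                    # retry: full spool resent
    return indexed

events = list(range(1000))
result = deliver(events, 500)
duplicates = len(result) - len(set(result))
print(duplicates)  # → 500: the first 500 events were indexed twice
```

No data is lost, but the first half of the payload is indexed twice, which is exactly the trade-off at-least-once delivery makes.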
If you really hate duplicates, you can try setting '--window-size 1' on lumberjack so that each payload will only contain 1 event and lumberjack will wait for that event to be acknowledged before it sends another. You can still get duplicates in this situation, but the number of duplicates is much reduced.
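For example, something like the following invocation ('--window-size 1' is the flag Jordan mentions; the host/port flags and log path here are illustrative placeholders, so check your lumberjack build's help output for the exact option names):

```
# one event per payload; each must be acknowledged before the next is sent
lumberjack --window-size 1 \
  --host logstash.example.com --port 5043 \
  /var/log/httpd/access_log
```

This trades throughput for a smaller duplicate window: at most one in-flight event can be resent after a timeout.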
excessive CLOSE_WAIT has been reported before with lumberjack and tcp inputs. I haven't seen this myself and I can't reproduce it :(
But at least you are not alone! :)
This may have to do with how well provisioned elasticsearch is. In cases of elasticsearch failures, restarting elasticsearch should be all you need to do - logstash and lumberjack shouldn't care. If they do care and don't recover on their own, that's a bug.
There are ways to prevent duplicate events by setting a 'document_id' in the elasticsearch output; done carefully, this causes duplicate events to overwrite each other in elasticsearch instead of creating new documents.
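A minimal sketch of what this looks like in a logstash config ('document_id' is the option Jordan names; 'host' and the 'my_fingerprint' field are illustrative placeholders - you would populate the fingerprint earlier in the filter chain, e.g. from a hash of the event content):

```
output {
  elasticsearch {
    host => "localhost"
    # document_id makes indexing idempotent: a resent event carrying the
    # same id overwrites the existing document instead of adding a new one.
    # "my_fingerprint" is a hypothetical field set by an earlier filter.
    document_id => "%{my_fingerprint}"
  }
}
```

The key design point is that the id must be deterministic for a given event, so that the retransmitted copy computes the same id as the original.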
Hope this helps clarify things. Let me know if you have any further questions :) -Jordan |
| Re: [logstash-users] Duplicate events in ES index ... and some rant | Manish Sapariya | 6/13/13 8:27 PM | Really appreciate the input, Jordan. I will experiment with the lumberjack window size and the ES document_id. If I see the CLOSE_WAIT problem again, what information can I share that would help debug the issue? Another person and I have already shared complete Java stack traces. If something else would help, please let me know. Regards, Manish |
| Re: [logstash-users] Duplicate events in ES index ... and some rant | Mathieu Martin | 6/13/13 9:24 PM | Thanks for bringing the duplication issue up, I've been using Lumberjack and didn't notice the behaviour (probably because I didn't look).
But Jordan's answer was so enlightening that I sent it back to him as a pull request ;-) (There's got to be a word for that)
So yeah, your rant gave birth to a new section in the documentation.
|