Duplicate events in ES index ... and some rant

Duplicate events in ES index ... and some rant Manish Sapariya 6/13/13 3:00 AM
Hi,
I am seeing the same event stored multiple times in the ES index.
I am using lumberjack to collect the events, which feeds to logstash.

I would appreciate any guidance in isolating this problem. I am not
sure what other information I could share, so I am keeping this very
short for now.

=== rant ===

Over the last month I have been facing one issue or another with this setup.
This causes doubt in my mind about choosing the lumberjack + logstash + ES + kibana
combination. My end goal is to regularly extract metrics from the large set of
log files that our system generates, but I am not even able to get past HTTP
access logs.

I am sure people are using this toolchain at a larger scale than I am.

Over the last month I have run into the following problems.
 - Excessive load causes logstash to choke with CLOSE_WAIT
    - I reduced the amount of logs fed to logstash
    - reduced the number of filters in the logstash config
 - Sometimes when I search using Kibana, ES throws an exception and
   logstash cannot push events to ES. I have to restart the whole toolchain
   to recover. (I have not discussed this on the list before this rant.)
 - And now duplicate events. (Things have been good for the last week, after reducing the filters and the number of log files,
   but I was very nervous to see the same events injected multiple times.)

Some positive comments would really help motivate me.  :-)

Thanks and Regards,
Manish




Re: [logstash-users] Duplicate events in ES index ... and some rant Jordan Sissel 6/13/13 9:34 AM



On Thu, Jun 13, 2013 at 3:00 AM, Manish Sapariya <msap...@gmail.com> wrote:
Hi,
I am seeing the same event stored multiple times in the ES index.

Lumberjack is designed to guarantee that every event will be sent. To do this, it can sometimes send an event repeatedly. The reason is that lumberjack spools a few hundred events up before sending them off - if that full spool is not acknowledged, it is resent. If logstash receives 1000 events in a lumberjack payload, and processes 500 of them before lumberjack thinks there's a timeout, lumberjack will reconnect and resend the full 1000, giving you the first 500 duplicated.

If you really hate duplicates, you can try setting '--window-size 1' on lumberjack so that each payload will only contain 1 event and lumberjack will wait for that event to be acknowledged before it sends another. You can still get duplicates in this situation, but the number of duplicates is much reduced.
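As a rough sketch, an invocation with that flag might look like the following; the host, port, certificate path, and log path here are placeholders for your own setup, not values from this thread:

```
# Hypothetical lumberjack invocation; host, port, and paths are placeholders.
# --window-size 1 makes lumberjack wait for an acknowledgement after every
# single event, trading throughput for far fewer duplicates on reconnect.
lumberjack --host logstash.example.com --port 5043 \
  --ssl-ca-path /etc/lumberjack/ca.crt \
  --window-size 1 \
  /var/log/apache2/access.log
```

The trade-off is throughput: with a window of 1, every event costs a round trip, so only use it if duplicates hurt more than ingestion speed.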


I am using lumberjack to collect the events, which feeds to logstash.

I would appreciate any guidance in isolating this problem. I am not
sure what other information I could share, so I am keeping this very
short for now.

=== rant ===

Over the last month I have been facing one issue or another with this setup.
This causes doubt in my mind about choosing the lumberjack + logstash + ES + kibana
combination. My end goal is to regularly extract metrics from the large set of
log files that our system generates, but I am not even able to get past HTTP
access logs.

I am sure people are using this toolchain at a larger scale than I am.

Over the last month I have run into the following problems.
 - Excessive load causes logstash to choke with CLOSE_WAIT
    - I reduced the amount of logs fed to logstash
    - reduced the number of filters in the logstash config

Excessive CLOSE_WAIT has been reported before with the lumberjack and tcp inputs. I haven't seen this myself and I can't reproduce it :(
But at least you are not alone! :)
 
 - Sometimes when I search using Kibana, ES throws an exception and
   logstash cannot push events to ES. I have to restart the whole toolchain
   to recover. (I have not discussed this on the list before this rant.)

This may have to do with how well provisioned elasticsearch is. In cases of elasticsearch failures, restarting elasticsearch should be all you need to do - logstash and lumberjack shouldn't care. If they do care and don't recover on their own, that is a bug.
 
 - And now duplicate events. (Things have been good for the last week, after reducing the filters and the number of log files,
   but I was very nervous to see the same events injected multiple times.)


There are ways to prevent duplicate events by setting a 'document_id' in the elasticsearch output; done carefully, this causes duplicate events to overwrite themselves instead of creating new documents in elasticsearch.
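A minimal sketch of that idea, assuming you derive a stable id from the event itself (the fingerprint filter shown here is one way to do that, and the field names and key are placeholders):

```
filter {
  # Derive a stable checksum from the message, so the same log line
  # always produces the same id. The key is an arbitrary secret.
  fingerprint {
    source => "message"
    method => "SHA1"
    key => "any-long-random-string"
    target => "fingerprint"
  }
}
output {
  elasticsearch {
    # Re-sent copies of an event share the same fingerprint, so they
    # overwrite one document instead of creating duplicates.
    document_id => "%{fingerprint}"
  }
}
```

Note the "done carefully" caveat: the id must be derived from the event content, not from something that changes on resend (like a receive timestamp), or duplicates will still get distinct ids.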

Hope this helps clarify things. Let me know if you have any further questions :)

-Jordan
Re: [logstash-users] Duplicate events in ES index ... and some rant Manish Sapariya 6/13/13 8:27 PM
Really appreciate the inputs, Jordan.
I will try experimenting with the lumberjack window size and the ES document_id.

If I see the CLOSE_WAIT problem again, what information can I share that
could help debug the issue? Another person and I have shared the complete
Java stack trace. If something else would help, please let me know.

Regards,
Manish
Re: [logstash-users] Duplicate events in ES index ... and some rant Mathieu Martin 6/13/13 9:24 PM
Thanks for bringing the duplication issue up. I've been using Lumberjack and hadn't noticed the behaviour (probably because I didn't look).

But Jordan's answer was so enlightening that I sent it back to himself as a pull request ;-) (There's got to be a word for that)

So yeah, your rant gave birth to a new section in the documentation.



--
Remember: if a new user has a bad time, it's a bug in logstash.