OOM errors, shard destroyed, index destroyed, etc.
From: courtenay <court3...@gmail.com>
Date: Wed, 13 Jun 2012 15:57:38 -0700 (PDT)
Local: Wed, Jun 13 2012 6:57 pm
Subject: OOM errors, shard destroyed, index destroyed, etc.

Hey all,

I've had a less than wonderful time with ES over the past few days after a
year of good service.

We have about 60gb of data in a troublesome index. 3 nodes, 2gb ram each, 1
replica (2 copies of data). It's been running fine up until the weekend.

Here's what happens:

- One of the nodes goes OOM.
- Master node locks up
- Other nodes remove master for timeout
- Cluster stops responding
- Other nodes either go OOM or lock up or start spewing errors
- Those crash too

So I generally have to restart the master and the OOM nodes, and sometimes
the whole cluster.
When it comes back up:

- Cluster reports unassigned shards
- Cluster rebuilds, eventually
- Cluster crashes again with OOM.

Amazingly, it's always consistently the same shard that is unassigned.
One of the crashes actually deleted all copies of that shard! So we had to
rebuild the whole index from scratch (it's going to take about a week. On
production systems.). Grr.

So we rebuilt the cluster with way more shards (32 instead of 5) and upgraded
the hardware (now running on 5 beefy machines with 17 or 58gb of ram each).
And while it's a different shard now that goes bad and has to be
recovered, it's still consistently the same one, and we still get OOMs.

So it appears to my uneducated self that there's a bad document there which
is causing a leak -> OOM errors, which cascade failures to the whole
cluster and cause split-brain issues, which in turn cause the deletions and
other weird behavior.

- Is there a way of inspecting the contents of that shard?
- Is there a way of preventing the OOM error?

The OOM crash log and our config are
here: https://gist.github.com/21c34e632adff490c08c

Appreciate the help in advance!

-- Courtenay
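
On the point about the same shard always being unassigned after a restart: a minimal sketch, in Python, of listing shards with unassigned copies from a cluster health response. It assumes the cluster health endpoint accepts `level=shards` on the version in use and returns per-shard `status` and `unassigned_shards` fields; the base URL and the helper names `fetch_health` and `unassigned_shards` are hypothetical. It only reports which shards are affected, it does not inspect their contents.

```python
import json
from urllib.request import urlopen


def unassigned_shards(health):
    """Return sorted (index, shard_id, status) tuples for shards
    that report at least one unassigned copy."""
    found = []
    for index_name, index_info in health.get("indices", {}).items():
        for shard_id, shard in index_info.get("shards", {}).items():
            if shard.get("unassigned_shards", 0) > 0:
                found.append((index_name, int(shard_id), shard.get("status")))
    return sorted(found)


def fetch_health(base_url="http://localhost:9200"):
    # Assumption: shard-level detail is requested with level=shards.
    with urlopen(base_url + "/_cluster/health?level=shards") as resp:
        return json.load(resp)
```

Calling `unassigned_shards(fetch_health())` after each restart and keeping the results would show whether the same shard number really comes back every time.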


 
From: courtenay <court3...@gmail.com>
Date: Wed, 13 Jun 2012 17:22:48 -0700 (PDT)
Local: Wed, Jun 13 2012 8:22 pm
Subject: Re: OOM errors, shard destroyed, index destroyed, etc.

More information inline.

On Wednesday, June 13, 2012 3:57:38 PM UTC-7, courtenay wrote:

> Hey all,

> I've had a less than wonderful time with ES over the past few days after a
> year of good service.

> We have about 60gb of data in a troublesome index. 3 nodes, 2gb ram each,
> 1 replica (2 copies of data). It's been running fine up until the weekend.

> Here's what happens:

> - One of the nodes goes OOM.

This just happened again, on the same shard. (15)

[16:31:37,949][WARN ][index.engine.robin       ] [Adam II] [discussions_production_barf][15] failed engine
java.lang.OutOfMemoryError: Java heap space

And again 20 minutes later. I managed to capture some graphs! (attached)

Attachments (screenshots of the graphs):
  Screen Shot 2012-06-13 at 5.09.11 PM.png (35K)
  Screen Shot 2012-06-13 at 5.09.01 PM.png (94K)
  Screen Shot 2012-06-13 at 5.08.49 PM.png (63K)
  Screen Shot 2012-06-13 at 5.08.34 PM.png (59K)
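
To check from the logs whether it really is always the same shard, the "failed engine" lines can be tallied per index and shard. This is a rough sketch that assumes every such line has the same `[index][shard] failed engine` shape as the one quoted above; the function name `count_failed_engines` is made up for illustration.

```python
import re
from collections import Counter

# Matches "[index_name][shard_number] failed engine" anywhere in a log line.
FAILED_ENGINE = re.compile(
    r"\[(?P<index>[^\[\]]+)\]\[(?P<shard>\d+)\] failed engine"
)


def count_failed_engines(lines):
    """Count 'failed engine' log lines per (index, shard) pair."""
    counts = Counter()
    for line in lines:
        match = FAILED_ENGINE.search(line)
        if match:
            counts[(match.group("index"), int(match.group("shard")))] += 1
    return counts
```

Feeding it the lines of each node's log file (for example `count_failed_engines(open(path))`, with `path` pointing at wherever the logs live) gives one count per shard, so a single dominant entry would back up the "same shard" observation.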

 
From: Shay Banon <kim...@gmail.com>
Date: Thu, 14 Jun 2012 23:44:29 +0200
Local: Thurs, Jun 14 2012 5:44 pm
Subject: Re: OOM errors, shard destroyed, index destroyed, etc.

First, which version are you using?
Second, are your nodes uneven in terms of the memory allocated to ES (heap
size env var) or in machine memory?


 
From: Courtenay <court3...@gmail.com>
Date: Thu, 14 Jun 2012 16:14:48 -0700
Local: Thurs, Jun 14 2012 7:14 pm
Subject: Re: OOM errors, shard destroyed, index destroyed, etc.

0.19.4

The machines have different amounts of memory, yes. Either 6, 15, or 25 gb allocated to ES.

So I can stop the "leak" by changing the thread pool model, but then what happens is that it indexes about 30 documents then freezes up.

So... maybe something is preventing indexing, and the cached thread pool model would just keep spawning threads until it runs out of memory.

And all my servers are sitting on 100-200% CPU despite not actually indexing or having much heavy load. Where can I start looking? Extra logs?
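
The hypothesis above (something blocks indexing, so an unbounded pool keeps growing while a bounded one stalls) can be illustrated with a toy model. This is plain Python, not Elasticsearch code: the function name `started_before_stall` is invented, the "stuck resource" is just an event that is never set during the measurement, and the pool size of 30 is picked only to echo the "about 30 documents" observation, not because it is a known ES setting.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def started_before_stall(num_tasks, max_workers=None, settle=0.2):
    """Submit num_tasks tasks that all block on the same stuck resource.

    max_workers=None imitates an unbounded (cached-style) pool: one
    thread per task. Returns how many tasks managed to start while the
    resource stayed stuck.
    """
    stuck = threading.Event()  # stays unset while we measure
    lock = threading.Lock()
    started = [0]

    def task():
        with lock:
            started[0] += 1
        stuck.wait()  # stands in for an indexing call that never returns

    workers = max_workers if max_workers is not None else num_tasks
    pool = ThreadPoolExecutor(max_workers=workers)
    for _ in range(num_tasks):
        pool.submit(task)

    # Poll until the number of started tasks stops changing.
    previous = -1
    while previous != started[0]:
        previous = started[0]
        time.sleep(settle)
    count = started[0]

    stuck.set()  # release everything so the pool can shut down cleanly
    pool.shutdown(wait=True)
    return count
```

With 100 blocked tasks, the unbounded variant should end up with all 100 started (one thread each, which is where the memory would go), while `max_workers=30` should stop at 30 started and leave the rest queued. If the real cluster behaves like the second case, the thing to hunt for is whatever the first batch of indexing threads is stuck on, rather than the pool type itself.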



 