I've had a less than wonderful time with ES over the past few days after a year of good service.
We have about 60gb of data in a troublesome index. 3 nodes, 2gb ram each, 1 replica (2 copies of data). It's been running fine up until the weekend.
Here's what happens:
- One of the nodes goes OOM.
- Master node locks up
- Other nodes remove master for timeout
- Cluster stops responding
- Other nodes either go OOM or lock up or start spewing errors
- Those crash too
So I generally have to restart the master and the OOM nodes, and sometimes the whole cluster. When it comes back up:
- Cluster reports unassigned shards
- Cluster rebuilds, eventually
- Cluster crashes again with OOM.
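For anyone wanting to pin down which shard is unassigned after a restart, here is a minimal sketch. It assumes the cluster health API is reachable at `http://localhost:9200` and is queried with `level=shards`; the index name `myindex` and the sample response are made up for illustration and are not output from our cluster.

```python
import json
import urllib.request


def fetch_health(base_url="http://localhost:9200"):
    """Fetch shard-level cluster health. The base URL is an assumption."""
    with urllib.request.urlopen(base_url + "/_cluster/health?level=shards") as resp:
        return json.loads(resp.read().decode("utf-8"))


def unhealthy_shards(health):
    """Return sorted (index, shard_id, status) tuples for every non-green shard."""
    bad = []
    for index_name, index in health.get("indices", {}).items():
        for shard_id, shard in index.get("shards", {}).items():
            if shard.get("status") != "green":
                bad.append((index_name, int(shard_id), shard.get("status")))
    return sorted(bad)


# Hypothetical, trimmed-down response used only to exercise the parser.
sample = json.loads("""
{
  "status": "red",
  "unassigned_shards": 2,
  "indices": {
    "myindex": {
      "status": "red",
      "shards": {
        "14": {"status": "green", "primary_active": true, "unassigned_shards": 0},
        "15": {"status": "red", "primary_active": false, "unassigned_shards": 2}
      }
    }
  }
}
""")

for index_name, shard_id, status in unhealthy_shards(sample):
    print(index_name, shard_id, status)
```

Against a live cluster, `unhealthy_shards(fetch_health())` would be the call to run after each restart, to check whether the same shard number keeps showing up.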
Amazingly, it's always the same shard that is unassigned. One of the crashes actually deleted all copies of that shard! So we had to rebuild the whole index from scratch (it's going to take about a week. On production systems.). Grr.
So we rebuilt the cluster with way more shards (32 instead of 5) and upgraded the hardware (now running on 5 beefy machines with 17 or 58gb of ram each). And while it's a different shard now that goes bad and has to be recovered, it's still the same one every time, and we still get OOMs.
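To put rough numbers on the two layouts, here is a back-of-the-envelope sketch. It assumes the 60gb figure is the primary data size, that the rebuilt cluster still uses 1 replica, and that shards are spread evenly across nodes; none of that is measured, and `shard_layout` is just a helper name made up for this example.

```python
def shard_layout(primary_data_gb, primaries, replicas, nodes):
    """Estimate shard sizes and counts, assuming an even spread of data and shards."""
    copies = primaries * (1 + replicas)  # primaries plus their replicas
    return {
        "gb_per_primary": primary_data_gb / primaries,
        "shard_copies": copies,
        "copies_per_node": copies / nodes,
    }


# Old layout: 5 shards, 1 replica, 3 nodes.
old = shard_layout(60, primaries=5, replicas=1, nodes=3)
# New layout: 32 shards, 1 replica (assumed unchanged), 5 nodes.
new = shard_layout(60, primaries=32, replicas=1, nodes=5)

print("old:", old)
print("new:", new)
```

Under these assumptions each shard is much smaller in the new layout, so if a single shard still takes a node down, the sheer size of the shard looks less likely to be the cause.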
So it appears to my uneducated self that there's a bad document in there which is causing a leak -> OOM errors, which cascade failures to the whole cluster and cause split-brain issues, which in turn cause the deletions and other weird behavior.
- Is there a way of inspecting the contents of that shard?
- Is there a way of preventing the OOM error?
On Thu, Jun 14, 2012 at 2:22 AM, courtenay <court3...@gmail.com> wrote:
> More information inline.
> On Wednesday, June 13, 2012 3:57:38 PM UTC-7, courtenay wrote:
>> Hey all,
>> I've had a less than wonderful time with ES over the past few days after
>> a year of good service.
>> We have about 60gb of data in a troublesome index. 3 nodes, 2gb ram each,
>> 1 replica (2 copies of data). It's been running fine up until the weekend.
>> Here's what happens:
>> - One of the nodes goes OOM.
> This just happened again, on the same shard. (15)