memcache unavailable after machine failure

kir...@feedly.com

May 7, 2013, 10:55:30 PM
to spymem...@googlegroups.com
We have spymemcached 2.8.4 configured to access two servers on separate machines. One server went down, the client kept trying to reconnect, and its operation queue backed up. Things started working again only after I removed the downed host from the config and restarted the client. Here is the config:

ConnectionFactoryBuilder builder = new ConnectionFactoryBuilder();
builder.setProtocol(Protocol.BINARY);
builder.setHashAlg(DefaultHashAlgorithm.KETAMA_HASH);
builder.setOpTimeout(2500);
builder.setLocatorType(Locator.CONSISTENT);
builder.setTranscoder(new CompressingEncoder(2048));

MemcachedClient newClient = new MemcachedClient(builder.build(), new ArrayList<InetSocketAddress>(newInstances));
((SerializingTranscoder) newClient.getTranscoder()).setCompressionThreshold(2048);

From the logs:

net.spy.memcached.MemcachedConnection handleIO WARNING: sun.nio.ch.SelectionKeyImpl@1f34602a exceeded continuous timeout threshold
net.spy.memcached.MemcachedConnection queueReconnect WARNING: Closing, and reopening {QA sa=yyy/xxx:5700, #Rops=22, #Wops=1340, #iq=1, topRop=Cmd: 16 Opaque: 35945, topWop=Cmd: -1 Opaque: -1 Keys: e-40415eb63a7d7114v1 ...
net.spy.memcached.protocol.TCPMemcachedNodeImpl setupResend WARNING: Discarding partially completed op: Cmd: 16 Opaque: 35945
net.spy.memcached.MemcachedClient$10$1 receivedStatus WARNING: Unsuccessful stat fetch: {OperationStatus success=false:  cancelled}
net.spy.memcached.protocol.TCPMemcachedNodeImpl setupResend WARNING: Discarding partially completed op: Cmd: -1 Opaque: -1 Keys: e-4738fd2e35acbeb3v1 e-62ddc12de341d3aev1
<many similar messages...>
net.spy.memcached.MemcachedConnection addOperation WARNING: Could not redistribute to another node, retrying primary node for e-d618466439778a23v1.
<many similar messages...>
net.spy.memcached.MemcachedConnection attemptReconnects INFO: Reconnecting {QA sa=yyy/xxx:5700, #Rops=0, #Wops=1509, #iq=0, topRop=null, topWop=Cmd: -1 Opaque: -1 Keys: e-40415eb63a7d7114v1 ..., toWrite=0, interested=0}
net.spy.memcached.MemcachedConnection handleIO INFO: Connection state changed for sun.nio.ch.SelectionKeyImpl@e6dd284
net.spy.memcached.MemcachedConnection handleIO INFO: Reconnecting due to exception on {QA sa=yyy/xxx:5700, #Rops=0, #Wops=1520, #iq=0, topRop=null, topWop=Cmd: -1 Opaque: -1 Keys: e-40415eb63a7d7114v1 ..., toWrite=0, interested=0}
java.net.NoRouteToHostException: No route to host
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
        at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:369)
        at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:242)
        at net.spy.memcached.MemcachedConnection.run(MemcachedConnection.java:836)
net.spy.memcached.MemcachedConnection queueReconnect WARNING: Closing, and reopening {QA sa=yyy/xxx:5700, #Rops=0, #Wops=1520, #iq=0, topRop=null, topWop=Cmd: -1 Opaque: -1 Keys: e-40415eb63a7d7114v1 ..., toWrite=0, interested=0}, attempt 1.

This keeps happening in a loop. The reconnection attempts happen periodically, which seems fine, but I don't understand why requests backed up. Shouldn't they all have been redistributed to the remaining server? Is there a config setting I missed?

Thanks.

Matt Ingenthron

May 8, 2013, 8:13:00 PM
to spymem...@googlegroups.com
It looks correct to me, and I agree it should have redistributed.

The only situations where it would generate that message are if the other nodes were not active, or if there is a bug.

If you can help debug, it'd be great if you could point a debugger at this section of the MemcachedConnection class:
    } else {
      // Look for another node in sequence that is ready.
      for (Iterator<MemcachedNode> i = locator.getSequence(key); placeIn == null
          && i.hasNext();) {
        MemcachedNode n = i.next();
        if (n.isActive()) {
          placeIn = n;
        }
      }
      // If we didn't find an active node, queue it in the primary node
      // and wait for it to come back online.
      if (placeIn == null) {
        placeIn = primary;
        this.getLogger().warn(
            "Could not redistribute "
                + "to another node, retrying primary node for %s.", key);
      }
    }

We want to see whether the locator actually returns a sequence there.
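
If it helps, here's a quick sketch of a hypothetical helper (not part of spymemcached; it just uses the same NodeLocator/MemcachedNode calls as the code above) for printing the sequence for a key without a debugger:

// Hypothetical helper, not part of spymemcached: print the ketama sequence
// for a key so we can see whether an active node ever appears in it.
void dumpSequence(NodeLocator locator, String key) {
  for (Iterator<MemcachedNode> i = locator.getSequence(key); i.hasNext();) {
    MemcachedNode n = i.next();
    System.out.println(key + " -> " + n.getSocketAddress()
        + " active=" + n.isActive());
  }
}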

Thanks,

Matt



Masanobu Horiyama

May 31, 2013, 12:41:08 AM
to spymem...@googlegroups.com
Hi,

  We have the same problem with spymemcached 2.8.4. We found cases where, with some nodes down and one node live, certain keys never hash to the live node.

I've written a test that reproduces the problem.

Masa

package net.spy.memcached;

import org.easymock.EasyMock;
import org.testng.Assert;
import org.testng.annotations.Test;

import java.net.InetSocketAddress;
import java.util.*;

/**
 * @author mhoriyam
 *
 * This test reproduces an issue we are seeing where when several memcached nodes are down, redistribute
 * fails to hash to a live memcached node.
 *
 */
public class KetamaNodeLocatorTest {

    @Test
    public void testMissingNode() {

        List<MemcachedNode> nodes = new ArrayList<MemcachedNode>();
        nodes.add(createMockNode(new InetSocketAddress("localhost", 333)));
        nodes.add(createMockNode(new InetSocketAddress("localhost", 444)));
        nodes.add(createMockNode(new InetSocketAddress("localhost", 12345)));

        KetamaNodeLocator locator = new KetamaNodeLocator(nodes, DefaultHashAlgorithm.KETAMA_HASH);

        Iterator<MemcachedNode> i = locator.getSequence("serverList");

        // The sequence yields 7 nodes, with localhost:333 and localhost:444
        // repeated, but localhost:12345 never appears.
        Set<MemcachedNode> foundNodes = new HashSet<MemcachedNode>();
        while (i.hasNext()) {
            foundNodes.add(i.next());
        }

        // This fails. localhost:12345 is never found.
        Assert.assertEquals(foundNodes, nodes);
    }

    private MemcachedNode createMockNode(InetSocketAddress sock) {
        MemcachedNode mockNode = EasyMock.createMock(MemcachedNode.class);
        EasyMock.expect(mockNode.getSocketAddress()).andReturn(sock).anyTimes();
        EasyMock.replay(mockNode);
        return mockNode;
    }
}


ingenthr

Jun 5, 2013, 3:23:04 AM
to spymem...@googlegroups.com
I filed issue 278, thanks! Please track it there.

kir...@feedly.com

Jun 6, 2013, 10:20:22 AM
to spymem...@googlegroups.com
Thanks for figuring out a test case! Were you seeing similar behavior, with the memcached operation queues filling up? I would actually be OK with misses, but in my situation the queues filled up and my entire application became unresponsive because all threads were stuck waiting on memcached. Should the client keep trying to contact a node even when it's dead?

Thanks
Kireet

Masanobu Horiyama

Jun 9, 2013, 11:11:40 PM
to spymem...@googlegroups.com

Yes, we were seeing the same behavior. Once a node went down and the client could not locate a live one, the queues would fill up and the client would become unusable. The client hashes most keys to a live node, but there is something like a less-than-1% chance that a key never hashes to one. You can verify that by generating random keys and counting how often the lookup fails.
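
To put a rough number on it, here is a sketch of such a probe (it reuses the createMockNode() helper from my test above and slots into the same class; the exact percentage varies with the ring layout):

    @Test
    public void estimateMissRate() {
        MemcachedNode live = createMockNode(new InetSocketAddress("localhost", 12345));
        List<MemcachedNode> nodes = Arrays.asList(
                createMockNode(new InetSocketAddress("localhost", 333)),
                createMockNode(new InetSocketAddress("localhost", 444)),
                live);
        KetamaNodeLocator locator = new KetamaNodeLocator(nodes, DefaultHashAlgorithm.KETAMA_HASH);

        Random random = new Random();
        int trials = 100000;
        int misses = 0;
        for (int t = 0; t < trials; t++) {
            String key = "k-" + random.nextLong();
            boolean found = false;
            // Walk the bounded ketama sequence, looking for the live node.
            for (Iterator<MemcachedNode> it = locator.getSequence(key); it.hasNext();) {
                if (it.next() == live) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                misses++;
            }
        }
        System.out.printf("%d of %d keys (%.3f%%) never reach the live node%n",
                misses, trials, 100.0 * misses / trials);
    }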

Unfortunately, that 1% is quite significant under heavy load and quickly kills the client. We tried tweaking parameters like opQueueMaxBlockTime and operationTimeout with limited success. We even modified the KetamaNodeLocator class to retry more than 7 times to find a live node, which, as expected, reduced the probability of a bad hash. In the end, we created a mechanism by which our clients poll for changes in the topology and simply restart themselves without the down nodes.
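
For what it's worth, here is a minimal sketch of that restart mechanism (an illustration, not our production code: buildClient() is a placeholder for the builder setup from the first message, and the probe interval and timeouts are arbitrary):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import net.spy.memcached.MemcachedClient;

// Hypothetical watchdog, not our exact code: probe each configured host, and
// if the set of reachable hosts changes, swap in a fresh client built without
// the dead nodes.
public class TopologyWatchdog {
    private final List<InetSocketAddress> allHosts;
    private volatile List<InetSocketAddress> currentHosts;
    private volatile MemcachedClient client;

    public TopologyWatchdog(List<InetSocketAddress> allHosts) throws IOException {
        this.allHosts = allHosts;
        this.currentHosts = new ArrayList<InetSocketAddress>(allHosts);
        this.client = buildClient(currentHosts);
    }

    public MemcachedClient getClient() {
        return client;
    }

    public void start() {
        ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
        exec.scheduleAtFixedRate(new Runnable() {
            public void run() {
                List<InetSocketAddress> reachable = new ArrayList<InetSocketAddress>();
                for (InetSocketAddress addr : allHosts) {
                    Socket probe = new Socket();
                    try {
                        probe.connect(addr, 1000); // 1s connect timeout
                        reachable.add(addr);
                    } catch (IOException down) {
                        // host unreachable; leave it out of the new client
                    } finally {
                        try { probe.close(); } catch (IOException ignored) { }
                    }
                }
                if (!reachable.isEmpty() && !reachable.equals(currentHosts)) {
                    try {
                        MemcachedClient old = client;
                        client = buildClient(reachable); // "restart" without dead nodes
                        currentHosts = reachable;
                        old.shutdown(5, TimeUnit.SECONDS); // drain in-flight ops
                    } catch (IOException e) {
                        // keep the old client if the rebuild fails
                    }
                }
            }
        }, 10, 10, TimeUnit.SECONDS);
    }

    private MemcachedClient buildClient(List<InetSocketAddress> hosts) throws IOException {
        // Placeholder: in real code this would use the ConnectionFactoryBuilder
        // configuration shown at the top of the thread.
        return new MemcachedClient(hosts);
    }
}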

Quick correction -- The assert in the test should have been:

        // This fails. localhost:12345 is never found.
        Assert.assertTrue(foundNodes.containsAll(nodes));

Masa

kir...@feedly.com

Jun 13, 2013, 12:26:02 PM
to spymem...@googlegroups.com
This seems like a critical stability issue. Is there an ETA for a fix? I don't see any activity on the issue.

Kireet