Redis ElastiCache Occasional Address Not Found Errors


Rick Waugh

Mar 12, 2015, 7:26:34 PM
to redi...@googlegroups.com
This is odd, and I have a ticket open with AWS on this. I'm using node.js with the node_redis npm module, and I've created an ElastiCache Redis instance: one primary, one read replica.

Very easy to do. But under load, I started to get the occasional error: "Redis connection to xxxx-lt-001.19atpm.0001.usw2.cache.amazonaws.com:6379 failed - getaddrinfo ENOTFOUND".

I'm stumped. The number of connections is only a couple of hundred at a time. CPU usage on Redis is minimal. CPU usage on my node.js server is about 25 percent. I'm also using DynamoDB, and it's barely breathing.

This is an API server for a game, and it's a big server I'm running against, a c4.xlarge. I'm starting to wonder if everything is so fast that I'm simply overwhelming the network on the EC2 instance, but I wouldn't expect that to produce this particular error. And even while getting the errors, data is still being updated/inserted, as I can watch the values change.

Anyone seen anything odd like this and resolved it?

Josiah Carlson

Mar 12, 2015, 8:24:43 PM
to redi...@googlegroups.com
If I'm interpreting the error message correctly, your clients are periodically unable to connect to your Redis server because of a DNS lookup failure. DNS is UDP-based, so if/when the network from your node.js clients is overwhelmed (by you, by other VMs on the same physical machine, by other machines on the network segment), DNS lookups can fail with some regularity (1 in 10,000 is rare, but it adds up quickly if your throughput is high).

My first suggestion is to reuse your Redis connections; node.js clients are typically configured to use a single connection per Redis server for each node.js process, not one per request. This gives you essentially transparent pipelining, which can increase overall throughput at the cost of slightly higher latency on an individual Redis call on average (depending on the mix of calls you're making, it can even come out faster). If you can't reuse one connection (because of BLPOP, pubsub, etc.), say why; maybe someone has a solution that avoids that usage, which would let you move to one connection, or at least stop reconnecting so often.
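A minimal sketch of that one-client-per-process pattern with node_redis (the host name is a placeholder):

// redis-client.js -- created once, require()d everywhere
var redis = require('redis');

// One long-lived connection per process; node_redis queues concurrent
// commands on it, which is what gives you the transparent pipelining.
var client = redis.createClient(6379, 'your-redis-host');

module.exports = client;

Since node caches modules, every require('./redis-client') in the process hands back the same client rather than opening a new connection.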

My second suggestion, which is the same for anyone who is disconnecting from and reconnecting to Redis servers repeatedly (Apache + mod_php users in particular, who can't use connection pooling): see if you can use Twemproxy/Nutcracker [1]. Put it on each of your node.js servers and connect through it instead of directly to your Redis server; Twemproxy will automatically pipeline requests, and will keep only as many connections to Redis as are actually necessary, likely preventing the DNS lookup failures.
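For reference, a minimal nutcracker.yml for that might look like the following (the pool name, timeout, and server entry are illustrative; see the Twemproxy README for the full option list):

alpha:
  listen: 127.0.0.1:22121
  hash: fnv1a_64
  distribution: ketama
  redis: true
  auto_eject_hosts: false
  server_retry_timeout: 2000
  servers:
    - xxxx-lt-001.19atpm.0001.usw2.cache.amazonaws.com:6379:1

Your node processes then connect to 127.0.0.1:22121 instead of the ElastiCache hostname, so the remote name is resolved over a handful of persistent proxy connections instead of on every new client connection.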

And on a related note: there are several commercial providers of Redis on AWS, with a variety of price/performance/feature tradeoffs. If you are looking for something more than, or different from, the standard AWS options, know that you do have options.


 - Josiah




Rick Waugh

Mar 13, 2015, 5:35:31 PM
to redi...@googlegroups.com
Thanks, Josiah. I've been looking at the connections and how I set them up. The issue with node is always that it's difficult to know whether reusing an object in an asynchronous environment can cause problems with overwriting data and making a mess, so I think we all tend to "over-allocate" objects. I may try this.

Rick Waugh

Mar 14, 2015, 11:10:13 AM
to redi...@googlegroups.com
Josiah, I made some changes, and they fixed my load-testing problems. I am still worried about how the client works, though, and about requests overwriting each other's data.

In order to keep my connections down to almost nothing, I have created the following global variables in my main server.js file:

var redis = require('redis'); // node_redis
redis_write_end_point = redis.createClient(6379, 'test-lt.19xxx.ng.0001.usw2.cache.amazonaws.com'); // no var: intentionally global
redis_read_end_point = redis.createClient(6379, 'test-lt-002.19xxx.0001.usw2.cache.amazonaws.com');

Then these are used throughout the app, so they are always reused. But that means everyone hitting that node.js process is sharing the same two connections, and I'm really concerned about people ending up with data swapped between them. I'm not sure how else I can reuse connections, though. Is there another way, or is this going to work?
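One detail worth noting with long-lived shared clients like these: node_redis reports connection failures (like the ENOTFOUND above) as 'error' events, and an unhandled 'error' event crashes a node process, so each client wants at least a logging listener:

redis_write_end_point.on('error', function (err) {
  console.error('redis write endpoint:', err); // logged; the client keeps reconnecting
});
redis_read_end_point.on('error', function (err) {
  console.error('redis read endpoint:', err);
});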


Josiah Carlson

Mar 14, 2015, 9:23:22 PM
to redi...@googlegroups.com
Take a moment and consider what you are saying. If there were any real cause for concern about responses from the Redis server being delivered to the wrong callback, the author of your library would have already addressed it. Anything else would be the definition of bad software, and insane.

Under the covers, Redis executes the commands it receives on any given connection one at a time, in order (Redis is also single-threaded, and uses an event loop, similar to what node.js and basically every other async server does, to multiplex between connections), and it replies in order. Your client keeps a FIFO queue (maybe an array, maybe something else) attached to its connection to Redis; as each reply comes back, it is handed to the callback popped off the front of the queue, leaving the next callback ready for the next reply.
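A toy model of that dispatch logic, to make the ordering argument concrete (this illustrates the idea and is not node_redis's actual source):

function QueueClient(send) {
  var pending = [];                 // callbacks for in-flight commands, oldest first
  this.command = function (args, callback) {
    pending.push(callback);         // commands go out in order...
    send(args);
  };
  this.onReply = function (reply) {
    pending.shift()(null, reply);   // ...and replies come back in the same order
  };
}

// Two interleaved GETs can never receive each other's reply, because the
// server answers in arrival order and the queue pops in that same order:
var client = new QueueClient(function (args) { /* write to the socket */ });
client.command(['GET', 'a'], function (err, v) { console.log('a =', v); });
client.command(['GET', 'b'], function (err, v) { console.log('b =', v); });
client.onReply('1'); // delivered to the GET a callback
client.onReply('2'); // delivered to the GET b callback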

If you're using explicitly blocking commands like BLPOP, BRPOP, etc., or SUBSCRIBE/PSUBSCRIBE, then you will need more connections. But if you're using commands that don't block waiting for data, everything should just work with the two connections you have.
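If you do end up needing pubsub later, the usual node_redis pattern is just a second client dedicated to it, since a connection in subscriber mode can't issue regular commands (host name is a placeholder):

var redis = require('redis');

var db  = redis.createClient(6379, 'your-redis-host'); // regular commands
var sub = redis.createClient(6379, 'your-redis-host'); // subscriber mode only

sub.subscribe('game:events');
sub.on('message', function (channel, message) {
  console.log(channel, message);
});

db.set('foo', 'bar'); // the db connection stays free for normal traffic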

 - Josiah


Rick Waugh

Mar 15, 2015, 12:09:01 PM
to redi...@googlegroups.com
Tx, Josiah. Fixed my Redis problems, still having other network problems. Very weird. I'm hardly taxing my server.

Felix Gallo

Mar 15, 2015, 12:12:00 PM
to redi...@googlegroups.com

AWS Redis ElastiCache is unsafe at any speed. It is poorly implemented, it breaks all the time, turning on AOF will cause you to lose data because the underlying instances have comically little disk space, and support is poor. It's frankly shameful. Run your own Redis instances.

F.

Rick Waugh

Mar 15, 2015, 4:51:26 PM
to redi...@googlegroups.com
We are truly running it as a cache, not a DB, and you can't use AOF if you do Multi-AZ, which is sad. But it ran well in a two-hour test today with 30,000 concurrent users; I'm hoping that continues. We're a small shop, and I really don't want to run my own servers. They said they have some major announcements coming up on it, so hopefully those will address your concerns.

Cihangir Savas

Mar 16, 2015, 2:21:09 AM
to redi...@googlegroups.com
We are also sporadically having problems when autoscaling creates new instance(s): the new instances are unable to connect to Redis, they just time out...
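One thing that might soften that cold-start window: the node_redis of this era accepts reconnect/timeout options at creation time (these option names are from its README of the time, so check your version; the endpoint is a placeholder):

var redis = require('redis');

var client = redis.createClient(6379, 'your-elasticache-endpoint', {
  connect_timeout: 10000,  // give a cold node time to come up before erroring
  retry_max_delay: 2000,   // cap the backoff between reconnect attempts
  max_attempts: 20         // only give up after a sustained outage
});

client.on('error', function (err) {
  console.error('redis:', err); // reconnection continues behind the scenes
});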



Rick Waugh

Mar 16, 2015, 1:45:32 PM
to redi...@googlegroups.com
That's a bit disturbing. I load tested up to the point of forcing a server spawn, and it seemed fine. I'd be interested to hear if there is any resolution to that. Has Amazon said anything?

Cihangir SAVAS

Mar 16, 2015, 3:37:07 PM
to Rick Waugh, redi...@googlegroups.com
It doesn't happen very often, but it happens. When it does, our health check marks the server as unhealthy and terminates it. We haven't gotten anything useful out of AWS beyond requests for iptables, tcpdump, and netstat output...


-- 
Cihangir SAVAS


Zachary Boschert

Mar 17, 2015, 2:56:57 AM
to redi...@googlegroups.com
I'd be interested in what you mean by ElastiCache being "unsafe at any speed" and "poorly implemented, breaks all the time".
What are you doing yourself that ElastiCache isn't doing for you? And any idea why ElastiCache isn't doing those things?

Stefano Fratini

Mar 18, 2015, 12:25:49 AM
to redi...@googlegroups.com
Without getting into other people's minds, I can tell you from my personal experience that:

- Redis on ElastiCache is designed to be used as a caching layer only.
-- There are no guarantees in terms of auto-healing speed for a problematic node. According to https://aws.amazon.com/blogs/aws/elasticache-redis-multi-az/: "The entire failover process, from detection to the resumption of normal caching behavior, will take several minutes. Your application's caching tier should have a strategy (and some code!) to deal with a cache that is momentarily unavailable." (A sketch of that fallback pattern follows this list.)

- Redis configuration parameters cannot be changed in the latest supported version (http://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/CacheParameterGroups.Redis.html)
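That fallback strategy mostly amounts to "treat cache errors as cache misses". A sketch in node.js (redisClient and loadProfileFromDynamo are hypothetical names standing in for your client and your source-of-truth lookup):

function getProfile(id, callback) {
  redisClient.get('profile:' + id, function (err, cached) {
    if (!err && cached) return callback(null, JSON.parse(cached)); // cache hit
    // cache miss OR cache error: serve from the source of truth either way
    loadProfileFromDynamo(id, function (err, profile) {
      if (err) return callback(err);
      // best-effort write-back; ignore failures while the cache heals
      redisClient.setex('profile:' + id, 300, JSON.stringify(profile), function () {});
      callback(null, profile);
    });
  });
}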

In general it feels like the folks at AWS consider Redis a Memcached copy, or little more.

Setting up a master + a slave + Sentinel on all your connecting nodes is not very difficult, and it gives you flexibility and fast failovers.
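For reference, the Sentinel side of that is only a few lines per node (the master address, quorum, and timeouts here are illustrative):

# sentinel.conf, running next to each connecting app node
sentinel monitor mymaster 10.0.0.5 6379 2        # master ip/port, quorum of 2
sentinel down-after-milliseconds mymaster 5000   # consider the master down after 5s
sentinel failover-timeout mymaster 60000
sentinel parallel-syncs mymaster 1

Clients then ask any Sentinel for the current master address instead of hard-coding it.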

Just my 2 cents

Rick Waugh

Mar 18, 2015, 6:53:20 PM
to redi...@googlegroups.com
Using it as a caching module in AWS makes perfect sense. They have DynamoDB as their managed NoSQL service and RDS as their relational DB service; Redis is a great caching server, so it fits with the rest of their stack. Nothing says you can't run it yourself if you want more of its native functionality.

The failover is troublesome and takes too long. I do hope that when they make the new announcement on Redis it provides more/better service, because it's a fantastic tool, and combined with DynamoDB it makes an incredible environment for a scalable managed solution.