Kong high availability without clustered database?

stefanegg

Oct 28, 2016, 3:23:50 AM

to Kong

Hi all,

We have a basic Kong installation consisting of 3 Kong nodes (0.9.2) and a single Postgres DB host. There is a load balancer in front of the Kong nodes, as well as in the upstream path, so things are highly available. The DB is currently not clustered since it would require more effort on our end, and also since I was under the impression that the Kong nodes will continue serving traffic even when the DB is down (by caching the latest configuration).

However, a few seconds after I shut down the Postgres DB, all requests to existing APIs get an HTTP 500 response with '{"message":"An unexpected error occurred"}', and Kong only resumes serving API requests once the DB is online again. Our Kong config is quite simple and small at this time; however, we are using the key-auth and datadog plugins.

Did we miss an essential piece of configuration, or did I misunderstand the way the Kong nodes are dependent on the DB?

Thanks!

Logroñoide

Oct 28, 2016, 6:27:22 AM

to Kong

Hi Stefan,

I'm also testing different HA options with Kong. Basically this is my configuration:

- Kubernetes + Docker

- Postgres as datastore in Master-Slave configuration

- Redis + Sentinel as cache for rate limiting in Master-Slave configuration

- Kong (0.9.3) multinode (3 minimum).

- Openstack Load Balancing as a Service to Kubernetes NodePorts.

- Only rate limiting plugin at the moment.

So everything is redundant and self-healing, except Postgres (not yet...).

If I take down the Postgres database, my cluster keeps proxying all requests to the upstream service, provided it was already serving that API before the outage. This holds even for several minutes (more than 5 minutes for sure). Latency increases, but it works. So I think Kong can continue working if the datastore goes down.

If I take down the redis service and I have enabled the 'fault tolerant' option for redis in the rate-limiting plugin, the cluster keeps on proxying all requests without enforcing the rate limiting. This is exactly what I expected from this plugin. If I take down redis without 'fault tolerant' enabled, then I get an 'Error 500'.
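The difference between the two settings can be sketched in a few lines of Python. This is only an illustration of fail-open versus fail-closed behaviour under stated assumptions, not the plugin's actual code: `check_rate_limit`, `CounterStoreDown` and `broken_store` are hypothetical names, and the 429 for an exceeded limit is assumed.

```python
class CounterStoreDown(Exception):
    """Raised by the hypothetical counter store when it cannot be reached."""


def check_rate_limit(increment, limit, fault_tolerant):
    """Return an HTTP-like status code for a single request.

    increment: callable that bumps and returns the current request count;
               it may raise CounterStoreDown if the store is unreachable.
    """
    try:
        count = increment()
    except CounterStoreDown:
        # Fail open: let the request through without enforcing the limit.
        # Fail closed: surface an error to the client instead.
        return 200 if fault_tolerant else 500
    # Store reachable: enforce the limit as usual.
    return 200 if count <= limit else 429


def broken_store():
    """Stand-in for a counter store that is down."""
    raise CounterStoreDown("connection refused")
```

With `broken_store`, the call returns 200 when `fault_tolerant` is true and 500 when it is false, which matches the two outcomes I saw.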

As a result of these tests I gave my thumbs up to the deployment of Kong, but I will try to test with more plugins.

When I take down Postgres, this is what appears in the error.log file:

2016/10/28 10:18:47 [error] 12696#0: [lua] cluster.lua:82: [postgres error] connection refused, context: ngx.timer

2016/10/28 10:18:50 [error] 12696#0: [lua] postgres.lua:158: [postgres] could not cleanup TTLs: connection refused, context: ngx.timer

Nothing about the plugin or the api.

Could you please post the errors from your error.log file, so we can better understand what is going wrong in your deployment?

Cheers

Diego

Sent from ProtonMail, encrypted email based in Switzerland.

--
You received this message because you are subscribed to the Google Groups "Kong" group.
To unsubscribe from this group and stop receiving emails from it, send an email to konglayer+...@googlegroups.com.
To post to this group, send email to kong...@googlegroups.com.
Visit this group at https://groups.google.com/group/konglayer.
To view this discussion on the web visit https://groups.google.com/d/msgid/konglayer/ed1ef4b9-4728-4a7b-94d1-1046ae1d4a41%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

stefanegg

Oct 28, 2016, 11:33:36 AM

to Kong, stefan.egg...@gmail.com

Hi Diego,

Thanks for the comprehensive info. Maybe worth mentioning that we are using Rancher 1.1.2 with Cattle container orchestration. It is basically using an overlay network with integrated load balancing to direct traffic within the cluster. I am pretty sure though that I ruled out any network/platform specific issues (touching wood) by running basic connectivity tests (nslookup/ping/telnet etc.).

I noticed I am getting the same errors even when calling the /status admin API endpoint, hence all entries below are Kong/NGINX error log lines from calling this specific URL (in my case "http://localhost:13888/status", which maps to port 8001 inside the Kong container: 0.0.0.0:13888->8001/tcp).

I basically tested/simulated 3 different scenarios:

1. stopping the Postgres DB container ("kong-db" will not get resolved anymore)

[error] 106#0: [lua] responses.lua:101: [postgres error] kong-db could not be resolved (3: Host not found), client: 172.17.0.1, server: kong_admin, request: "GET /status HTTP/1.1", host: "localhost:13888"

2. leaving the Postgres DB container running, but without a running Postgres daemon inside:

[error] 109#0: [lua] responses.lua:101: [postgres error] connection refused, client: 172.17.0.1, server: kong_admin, request: "GET /status HTTP/1.1", host: "localhost:13888"

3. changing the pg_host in kong.conf from a hostname ("kong-db") to the Postgres DB container's IP address:

[error] 117#0: [lua] responses.lua:101: [postgres error] no route to host, client: 172.17.0.1, server: kong_admin, request: "GET /status HTTP/1.1", host: "localhost:13888"

With all 3 scenarios, I get the same error message on the client that I initially described (HTTP 500 "An unexpected error occurred"). So in comparison with your environment, Diego, mine doesn't seem to play as nice :-(.
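To compare the three scenarios side by side, the log lines can be reduced to the Postgres error cause and the server block that logged it. A small Python sketch follows; the regular expression is an assumption derived only from the lines quoted in this thread, not a documented Kong log format, and `parse_pg_error` is a hypothetical helper.

```python
import re

# Assumed pattern, based only on the log lines quoted in this thread.
PG_ERROR = re.compile(
    r"\[postgres error\] (?P<cause>.*?), (?:client|context): "
    r"(?:.*?server: (?P<server>\w+))?"
)


def parse_pg_error(line):
    """Extract the error cause and (if present) the server name from a log line.

    Returns None when the line carries no '[postgres error]' marker.
    """
    match = PG_ERROR.search(line)
    if match is None:
        return None
    return {"cause": match.group("cause"), "server": match.group("server")}
```

Applied to the three lines above, the cause differs each time (name resolution, connection refused, no route to host), while the server is always kong_admin.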

Thanks,

Stefan

Logroñoide

Oct 31, 2016, 4:36:47 AM

to stefanegg, Kong

Hi Stefan,

I think that any request to the admin API during database downtime (during a failover, for example) is going to return an error, but requests to the proxied APIs will still work. So maybe the /status response is what you should expect from the admin perspective (the database is not working).

If you call your upstream API through the proxy port (8000 or 8443), it should work.
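A quick way to check both sides at once is to probe the admin port and the proxy port and compare the results. Here is a rough Python sketch using only the standard library. The helper names and the wording of the verdicts are my own, and the URLs in the usage comment are assumptions (default ports, hypothetical API path), so adjust them to your port mappings.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=5):
    """Return the HTTP status code for a GET to url, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, timeout, ...


def diagnose(admin_status, proxy_status):
    """Summarise which side of Kong is affected, given two status codes."""
    admin_ok = admin_status == 200
    proxy_ok = proxy_status is not None and proxy_status < 500
    if proxy_ok and not admin_ok:
        return "admin API failing, proxy still serving"
    if not proxy_ok and not admin_ok:
        return "admin API and proxy both failing"
    if proxy_ok and admin_ok:
        return "admin API and proxy both healthy"
    return "proxy failing while admin API is healthy"


# Usage (assumed URLs; the API path is hypothetical):
#   admin = http_status("http://localhost:8001/status")
#   proxy = http_status("http://localhost:8000/my-api")
#   print(diagnose(admin, proxy))
```

If the first verdict comes back while the database is down, the 500 on /status is just the admin API reporting the outage.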

Cheers

Diego

stefanegg

Oct 31, 2016, 8:39:45 AM

to Kong, stefan.egg...@gmail.com, logro...@protonmail.com

Hi Diego,

Of course, you are correct, sorry for that. I re-tried it with an actual API, but still got the same results.

After some more digging, I found that a colleague had added the udp-log plugin to 2 separate APIs, and I narrowed the issue down to this specific plugin. That is, as soon as I removed the plugin from these 2 APIs, "offline mode" worked perfectly, but after I re-added it to ANY API (not even the one I tested with), "offline mode" broke. I assume this should not happen, so I will post an issue on GitHub.

Thanks a lot,

Stefan

Logroñoide

Oct 31, 2016, 8:57:54 AM

to stefanegg, Kong

Hi Stefan,

I have just tried the udp-log plugin and you are right: it does not work in 'database offline' mode. I get a nice 500 error.

For logs I use loggly, and it works in 'database offline' mode.

Cheers

Diego
