All cluster endpoints are active despite health checks

Vadim Lazovskiy

Mar 14, 2023, 3:02:24 AM
to envoy-users

Hi there!

I'm trying to set up a proxy for PostgreSQL servers that forwards connections to the current Postgres master.
I have an HTTP endpoint that returns 200 on the master server and 500 on standbys.
The problem is that even though active health checks fail on the standby endpoints, connections are still distributed to every endpoint in the cluster.

Here is the configuration:

admin:
  address:
    socket_address: { address: 127.0.0.1, port_value: 9901 }
static_resources:
  listeners:
  - name: postgres
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 5432
    filter_chains:
    - filters:
      - name: envoy.filters.network.tcp_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
          stat_prefix: postgres
          cluster: postgres_cluster
  clusters:
  - name: postgres_cluster
    connect_timeout: 1s
    type: STATIC
    lb_policy: ROUND_ROBIN
    outlier_detection: null
    health_checks:
    - http_health_check:
        path: /master
      interval: 1s
      no_traffic_interval: 5s
      timeout: 1s
      unhealthy_threshold: 2
      healthy_threshold: 1
      transport_socket_match_criteria:
        useTlsAuth: true
    transport_socket_matches:
    - name: useTlsAuth
      match:
        useTlsAuth: true
      transport_socket:
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
          common_tls_context:
            tls_certificates:
            - certificate_chain:
                filename: /etc/envoy/tls/client.crt
              private_key:
                filename: /etc/envoy/tls/client.key
    load_assignment:
      cluster_name: postgres_cluster
      endpoints:
      - lb_endpoints:
        - endpoint:
            hostname: postgres-0
            address:
              socket_address:
                address: 10.0.0.1
                port_value: 5432
            health_check_config:
              hostname: postgres-0
              port_value: 443
        - endpoint:
            hostname: postgres-1
            address:
              socket_address:
                address: 10.0.0.2
                port_value: 5432
            health_check_config:
              hostname: postgres-1
              port_value: 443
        - endpoint:
            hostname: postgres-2
            address:
              socket_address:
                address: 10.0.0.3
                port_value: 5432
            health_check_config:
              hostname: postgres-2
              port_value: 443

In the clusters dump, I can see

"health_status": { "failed_active_health_check": true, "eds_health_status": "HEALTHY" },

for standbys, and

"health_status": { "eds_health_status": "HEALTHY" },

for the master.

There are two issues here:

  1. Even though an endpoint is failing its active health check, it is not removed from load balancing.
  2. If I stop Postgres on a target server, the corresponding endpoint is still not removed from load balancing. Envoy reports a "Connection refused" error in the log, and the client connection is closed immediately.

Why is that happening, and how can I make Envoy respect active health checks?


Thank you!

Stephan Zuercher

Mar 14, 2023, 11:11:21 AM
to Vadim Lazovskiy, envoy-users
If you check the lb_healthy_panic stat for the cluster, I believe you’ll see that it’s panic routing. The default threshold for panic routing is 50% healthy hosts. For this use, I expect you’ll want to set the threshold to 0 to disable the feature. 
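
For reference, the panic threshold is set per cluster under common_lb_config; a minimal, untested sketch of what disabling it would look like here:

      clusters:
      - name: postgres_cluster
        common_lb_config:
          healthy_panic_threshold:
            value: 0

The per-cluster lb_healthy_panic counter is visible on the admin interface (http://127.0.0.1:9901/stats in this config) if you want to confirm that panic routing is what's happening.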


Vadim Lazovskiy

Mar 14, 2023, 2:46:42 PM
to envoy-users
Thank you for the reply!

I've set healthy_panic_threshold with no luck:

      clusters:
      - name: postgres_cluster
        connect_timeout: 1s
        type: STRICT_DNS
        lb_policy: ROUND_ROBIN
        common_lb_config:
          healthy_panic_threshold:
            value: 0


Here is the clusters dump:
{
 "cluster_statuses": [
  {
   "name": "postgres_cluster",
   "host_statuses": [
    {
     "address": {
      "socket_address": {
       "address": "10.0.0.1",
       "port_value": 5432
      }
     },
     "stats": [
      {
       "name": "cx_connect_fail"
      },
      {
       "value": "9",
       "name": "cx_total"
      },
      {
       "name": "rq_error"
      },
      {
       "name": "rq_success"
      },
      {
       "name": "rq_timeout"
      },
      {
       "value": "9",
       "name": "rq_total"
      },
      {
       "type": "GAUGE",
       "name": "cx_active"
      },
      {
       "type": "GAUGE",
       "name": "rq_active"
      }
     ],
     "health_status": {
      "eds_health_status": "HEALTHY"
     },
     "weight": 1,
     "hostname": "postgres-0",
     "locality": {}
    },
    {
     "address": {
      "socket_address": {
       "address": "10.0.0.2",
       "port_value": 5432
      }
     },
     "stats": [
      {
       "name": "cx_connect_fail"
      },
      {
       "value": "8",
       "name": "cx_total"
      },
      {
       "name": "rq_error"
      },
      {
       "name": "rq_success"
      },
      {
       "name": "rq_timeout"
      },
      {
       "value": "8",
       "name": "rq_total"
      },
      {
       "type": "GAUGE",
       "name": "cx_active"
      },
      {
       "type": "GAUGE",
       "name": "rq_active"
      }
     ],

     "health_status": {
      "failed_active_health_check": true,
      "eds_health_status": "HEALTHY"
     },
     "weight": 1,
     "hostname": "postgres-1",
     "locality": {}
    },
    {
     "address": {
      "socket_address": {
       "address": "10.0.0.3",
       "port_value": 5432
      }
     },
     "stats": [
      {
       "name": "cx_connect_fail"
      },
      {
       "value": "8",
       "name": "cx_total"
      },
      {
       "name": "rq_error"
      },
      {
       "name": "rq_success"
      },
      {
       "name": "rq_timeout"
      },
      {
       "value": "8",
       "name": "rq_total"
      },
      {
       "type": "GAUGE",
       "name": "cx_active"
      },
      {
       "type": "GAUGE",
       "name": "rq_active"
      }
     ],

     "health_status": {
      "failed_active_health_check": true,
      "eds_health_status": "HEALTHY"
     },
     "weight": 1,
     "hostname": "postgres-2",
     "locality": {}
    }
   ],
   "circuit_breakers": {
    "thresholds": [
     {
      "max_connections": 1024,
      "max_pending_requests": 1024,
      "max_requests": 1024,
      "max_retries": 3
     },
     {
      "priority": "HIGH",
      "max_connections": 1024,
      "max_pending_requests": 1024,
      "max_requests": 1024,
      "max_retries": 3
     }
    ]
   },
   "observability_name": "postgres_cluster"
  }
 ]
}


And here are some health check logs:
[2023-03-14 18:39:13.539][1][debug][client] [source/common/http/codec_client.cc:130] [C0] response complete
[2023-03-14 18:39:13.539][1][debug][hc] [source/common/upstream/health_checker_impl.cc:342] [C0] hc response=200 health_flags=healthy
[2023-03-14 18:39:13.859][1][debug][client] [source/common/http/codec_client.cc:130] [C1] response complete
[2023-03-14 18:39:13.859][1][debug][hc] [source/common/upstream/health_checker_impl.cc:342] [C1] hc response=404 health_flags=/failed_active_hc
[2023-03-14 18:39:14.218][1][debug][client] [source/common/http/codec_client.cc:130] [C2] response complete
[2023-03-14 18:39:14.218][1][debug][hc] [source/common/upstream/health_checker_impl.cc:342] [C2] hc response=404 health_flags=/failed_active_hc


And the logs for several client connections:
[2023-03-14 18:40:53.137][49][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:397] [C61] Creating connection to cluster postgres_cluster
[2023-03-14 18:40:53.137][49][debug][pool] [source/common/conn_pool/conn_pool_base.cc:245] trying to create new connection
[2023-03-14 18:40:53.137][49][debug][pool] [source/common/conn_pool/conn_pool_base.cc:143] creating a new connection
[2023-03-14 18:40:53.137][49][debug][connection] [source/common/network/connection_impl.cc:864] [C62] connecting to 10.0.0.1:5432
[2023-03-14 18:40:53.137][49][debug][connection] [source/common/network/connection_impl.cc:880] [C62] connection in progress
[2023-03-14 18:40:53.137][49][debug][conn_handler] [source/server/active_tcp_listener.cc:332] [C61] new connection from 127.0.0.1:37112
[2023-03-14 18:40:53.185][49][debug][connection] [source/common/network/connection_impl.cc:669] [C62] connected
[2023-03-14 18:40:53.185][49][debug][pool] [source/common/conn_pool/conn_pool_base.cc:289] [C62] attaching to next stream
[2023-03-14 18:40:53.185][49][debug][pool] [source/common/conn_pool/conn_pool_base.cc:175] [C62] creating stream
[2023-03-14 18:40:53.185][49][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:660] [C61] TCP:onUpstreamEvent(), requestedServerName:
[2023-03-14 18:40:53.876][49][debug][connection] [source/common/network/connection_impl.cc:637] [C62] remote close
[2023-03-14 18:40:53.876][49][debug][connection] [source/common/network/connection_impl.cc:247] [C62] closing socket: 0
[2023-03-14 18:40:53.876][49][debug][pool] [source/common/conn_pool/conn_pool_base.cc:407] [C62] client disconnected, failure reason:
[2023-03-14 18:40:53.876][49][debug][pool] [source/common/conn_pool/conn_pool_base.cc:204] [C62] destroying stream: 0 remaining
[2023-03-14 18:40:53.876][49][debug][connection] [source/common/network/connection_impl.cc:137] [C61] closing data_to_write=0 type=0
[2023-03-14 18:40:53.876][49][debug][connection] [source/common/network/connection_impl.cc:247] [C61] closing socket: 1
[2023-03-14 18:40:53.876][49][debug][conn_handler] [source/server/active_tcp_listener.cc:76] [C61] adding to cleanup list
[2023-03-14 18:40:54.488][50][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:249] [C63] new tcp proxy session
[2023-03-14 18:40:54.488][50][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:397] [C63] Creating connection to cluster postgres_cluster
[2023-03-14 18:40:54.488][50][debug][pool] [source/common/conn_pool/conn_pool_base.cc:245] trying to create new connection
[2023-03-14 18:40:54.488][50][debug][pool] [source/common/conn_pool/conn_pool_base.cc:143] creating a new connection
[2023-03-14 18:40:54.488][50][debug][connection] [source/common/network/connection_impl.cc:864] [C64] connecting to 10.0.0.3:5432
[2023-03-14 18:40:54.488][50][debug][connection] [source/common/network/connection_impl.cc:880] [C64] connection in progress
[2023-03-14 18:40:54.488][50][debug][conn_handler] [source/server/active_tcp_listener.cc:332] [C63] new connection from 127.0.0.1:37122
[2023-03-14 18:40:54.535][50][debug][connection] [source/common/network/connection_impl.cc:669] [C64] connected
[2023-03-14 18:40:54.535][50][debug][pool] [source/common/conn_pool/conn_pool_base.cc:289] [C64] attaching to next stream
[2023-03-14 18:40:54.535][50][debug][pool] [source/common/conn_pool/conn_pool_base.cc:175] [C64] creating stream
[2023-03-14 18:40:54.535][50][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:660] [C63] TCP:onUpstreamEvent(), requestedServerName:
[2023-03-14 18:40:55.211][50][debug][connection] [source/common/network/connection_impl.cc:637] [C64] remote close
[2023-03-14 18:40:55.211][50][debug][connection] [source/common/network/connection_impl.cc:247] [C64] closing socket: 0
[2023-03-14 18:40:55.211][50][debug][pool] [source/common/conn_pool/conn_pool_base.cc:407] [C64] client disconnected, failure reason:
[2023-03-14 18:40:55.211][50][debug][pool] [source/common/conn_pool/conn_pool_base.cc:204] [C64] destroying stream: 0 remaining
[2023-03-14 18:40:55.211][50][debug][connection] [source/common/network/connection_impl.cc:137] [C63] closing data_to_write=0 type=0
[2023-03-14 18:40:55.211][50][debug][connection] [source/common/network/connection_impl.cc:247] [C63] closing socket: 1
[2023-03-14 18:40:55.211][50][debug][conn_handler] [source/server/active_tcp_listener.cc:76] [C63] adding to cleanup list
[2023-03-14 18:40:55.817][1][debug][main] [source/server/server.cc:215] flushing stats
[2023-03-14 18:40:56.311][49][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:249] [C65] new tcp proxy session
[2023-03-14 18:40:56.311][49][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:397] [C65] Creating connection to cluster postgres_cluster
[2023-03-14 18:40:56.311][49][debug][pool] [source/common/conn_pool/conn_pool_base.cc:245] trying to create new connection
[2023-03-14 18:40:56.311][49][debug][pool] [source/common/conn_pool/conn_pool_base.cc:143] creating a new connection
[2023-03-14 18:40:56.311][49][debug][connection] [source/common/network/connection_impl.cc:864] [C66] connecting to 10.0.0.2:5432
[2023-03-14 18:40:56.311][49][debug][connection] [source/common/network/connection_impl.cc:880] [C66] connection in progress
[2023-03-14 18:40:56.311][49][debug][conn_handler] [source/server/active_tcp_listener.cc:332] [C65] new connection from 127.0.0.1:37128
[2023-03-14 18:40:56.362][49][debug][connection] [source/common/network/connection_impl.cc:669] [C66] connected
[2023-03-14 18:40:56.362][49][debug][pool] [source/common/conn_pool/conn_pool_base.cc:289] [C66] attaching to next stream
[2023-03-14 18:40:56.362][49][debug][pool] [source/common/conn_pool/conn_pool_base.cc:175] [C66] creating stream
[2023-03-14 18:40:56.362][49][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:660] [C65] TCP:onUpstreamEvent(), requestedServerName:
[2023-03-14 18:40:57.089][49][debug][connection] [source/common/network/connection_impl.cc:637] [C66] remote close
[2023-03-14 18:40:57.089][49][debug][connection] [source/common/network/connection_impl.cc:247] [C66] closing socket: 0
[2023-03-14 18:40:57.089][49][debug][pool] [source/common/conn_pool/conn_pool_base.cc:407] [C66] client disconnected, failure reason:
[2023-03-14 18:40:57.089][49][debug][pool] [source/common/conn_pool/conn_pool_base.cc:204] [C66] destroying stream: 0 remaining
[2023-03-14 18:40:57.089][49][debug][connection] [source/common/network/connection_impl.cc:137] [C65] closing data_to_write=0 type=0
[2023-03-14 18:40:57.089][49][debug][connection] [source/common/network/connection_impl.cc:247] [C65] closing socket: 1
[2023-03-14 18:40:57.089][49][debug][conn_handler] [source/server/active_tcp_listener.cc:76] [C65] adding to cleanup list



As you can see, Envoy still distributes connections evenly across all cluster endpoints, regardless of their health check status.

On Tuesday, March 14, 2023 at 5:11:21 PM UTC+2, Stephan Zuercher wrote:

Ajay Sindwani

Oct 25, 2023, 7:32:58 PM
to envoy-users
Hi Stephan,

Thanks for sharing. lb_healthy_panic did help with a different configuration issue in my case.
