Re: Istio 0.4.0 ignoring circuit breaker configuration

Matt Klein

Jan 19, 2018, 1:08:55 PM

to Christian Posta, Envoy Users, Kamesh Sampath, Istio Users

+envoy-users

It's on my list to go through this email today. Sorry for the delay.

On Fri, Jan 19, 2018 at 10:02 AM, Christian Posta <christi...@gmail.com> wrote:

On Fri, Jan 19, 2018 at 10:04 AM, Kamesh Sampath <ksam...@redhat.com> wrote:

On Tuesday, January 16, 2018 at 3:29:36 AM UTC+5:30, Christian Posta wrote:
So I'm seeing something similar to you wrt outlier detection. Basically, the trick is to get outlier detection to kick in and you should see something like this:

```
cluster.out.httpbin.istio-samples.svc.cluster.local|http|version=v1.outlier_detection.ejections_active: 1
cluster.out.httpbin.istio-samples.svc.cluster.local|http|version=v1.outlier_detection.ejections_consecutive_5xx: 1
cluster.out.httpbin.istio-samples.svc.cluster.local|http|version=v1.outlier_detection.ejections_overflow: 0
cluster.out.httpbin.istio-samples.svc.cluster.local|http|version=v1.outlier_detection.ejections_success_rate: 0
cluster.out.httpbin.istio-samples.svc.cluster.local|http|version=v1.outlier_detection.ejections_total: 1
```

Note the "ejections_active" stat.
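To check the same counters without reading the whole dump by eye, here is a minimal Python sketch. It assumes the text comes from Envoy's admin `/stats` endpoint in the `name: value` form shown above. The helper names `parse_envoy_stats` and `outlier_stats` are hypothetical and not part of Envoy or Istio.

```python
def parse_envoy_stats(text):
    """Turn Envoy admin /stats text ("name: value" per line) into a dict."""
    stats = {}
    for line in text.splitlines():
        # Split on the last ": " because stat names here contain '|' and '='.
        name, sep, value = line.rpartition(": ")
        if not sep:
            continue  # not a "name: value" line
        try:
            stats[name.strip()] = int(value.strip())
        except ValueError:
            continue  # skip values that are not plain integers
    return stats


def outlier_stats(stats, cluster):
    """Return only the outlier_detection.* counters for one cluster."""
    prefix = "cluster." + cluster + ".outlier_detection."
    return {k[len(prefix):]: v for k, v in stats.items() if k.startswith(prefix)}


# Two lines copied from the dump above, used as sample input.
sample = """\
cluster.out.httpbin.istio-samples.svc.cluster.local|http|version=v1.outlier_detection.ejections_active: 1
cluster.out.httpbin.istio-samples.svc.cluster.local|http|version=v1.outlier_detection.ejections_total: 1
"""
cluster = "out.httpbin.istio-samples.svc.cluster.local|http|version=v1"
summary = outlier_stats(parse_envoy_stats(sample), cluster)
print(summary)
```

A non-zero `ejections_active` in the summary means at least one host is currently ejected. `ejections_total` only says an ejection happened at some point.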

Once this happens, it seems like the host that's been ejected still gets *some* of the requests.

Wonder if Matt (cc'd) can direct us to the right place?

Wonder if Envoy still allows a fraction of requests to the ejected endpoint/host and this is by design? I guess I would expect an ejected endpoint to not receive any more requests for the `interval_ms` setting.

I see this happening too. It's a bit confusing as to whether the host has been ejected or not, since requests continue to flow to the host/pod.

Yah -- let's ask on the envoy mailing list. Will CC you.

On Sun, Jan 14, 2018 at 7:02 PM, <sene...@gmail.com> wrote:
Hi Christian,

Many thanks for the support and confirmation.
I have since doubled back on my test cases and discovered the following:

The test tool being used (httperf) to simulate concurrent load appears not to be maintaining multiple concurrent open sessions (despite being configured to do so):

--wsess=N1,N2,X
Requests the generation and measurement of sessions instead of individual requests. A session consists of a sequence of bursts which are spaced out by the user think-time.
Each burst consists of a fixed number L of calls to the server (L is specified by option --burst-length). The calls in a burst are issued as follows: at first, a single call is issued.
Once the reply to this first call has been fully received, all remaining calls in the burst are issued concurrently.
The concurrent calls are issued either as pipelined calls on an existing persistent connection or as individual calls on separate connections.
Whether a persistent connection is used depends on whether the server responds to the first call with a reply that includes a ''Connection: close'' header line. If such a line is present, separate connections are used.

Per the excerpt from the man page - I expect the Bookinfo deployment is issuing ''Connection: close'' headers in responses - which are being honoured by httperf.
After shifting over to an alternate test tool (gobench) - the behaviour as described by yourself and the Istio documentation is being observed. Fantastic!

*************************************************************************************

As a side note, with respect to simpleCB policy and pod ejection:

For a given configuration such as:

circuitBreaker:
  simpleCb:
    maxConnections: 1
    httpMaxRequests: 1
    httpMaxPendingRequests: 1
    httpMaxRequestsPerConnection: 1
    sleepWindow: 5m
    httpDetectionInterval: 1s
    httpConsecutiveErrors: 1
    httpMaxEjectionPercent: 100

I'm able to elicit multiple 5xx responses from httpbin pods in an Istio-managed LB pool - such that Istio/Envoy reports:

cluster.httpbin_service.outlier_detection.ejections_active: 0
cluster.httpbin_service.outlier_detection.ejections_consecutive_5xx: 1
cluster.httpbin_service.outlier_detection.ejections_overflow: 0
cluster.httpbin_service.outlier_detection.ejections_success_rate: 0
cluster.httpbin_service.outlier_detection.ejections_total: 1

However - all hosts in the pool continue to respond.
Any thoughts or feedback appreciated?

Rgds..

On Sunday, 14 January 2018 08:28:03 UTC+11, Christian Posta wrote:
So in my experience, Envoy circuit breaking values aren't strict, or there is some leeway given, but circuit breaking does kick in. You can see in this post (http://blog.christianposta.com/microservices/01-microservices-patterns-with-envoy-proxy-part-i-circuit-breaking/) that if you set the max connections/pending requests to "1" and send two concurrent connections, they usually both still go through. I think this level of padding/leeway is fine for real apps. But if you change to 3 or more concurrent connections, you'll see the circuit breaking behavior. I've also verified this within Istio, and things look like they work as expected. Would you like to start a discussion on the Envoy users list to get a better understanding of what those thresholds really are and whether they can be made stricter?
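To reproduce the "two usually pass, three or more trip the breaker" observation, here is a minimal Python sketch of a concurrent load generator that tallies response codes. It is not the tool used in the blog post. The helper names `fetch_status` and `tally_concurrent` are hypothetical. The sketch assumes a tripped breaker shows up as a non-2xx status, typically 503 from Envoy.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for one GET, including 4xx/5xx responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still what we want.
        return err.code


def tally_concurrent(url, concurrency, total, fetch=fetch_status):
    """Issue `total` GETs with up to `concurrency` in flight; count status codes."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return Counter(pool.map(fetch, [url] * total))
```

For example, `tally_concurrent("http://10.99.89.197:9080/reviews/0", concurrency=3, total=100)` run from inside the mesh returns a `Counter` of status codes. Comparing runs at concurrency 1, 2 and 3 shows where rejections start. The `fetch` parameter exists only so the function can be exercised without a network.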

On Fri, Jan 12, 2018 at 3:17 AM, <sene...@gmail.com> wrote:
Hi Experts,

I am having trouble with circuit breaker implementation on Kubernetes.
Istio/Envoy appears to be ignoring circuit breaker configuration. Kindly seeking your help please.

Environment:
# istioctl version
Version: 0.4.0
GitRevision: 24089ea97c8d244493c93b499a666ddf4010b547
GitBranch: 6401744b90b43901b2aa4a8bced33c7bd54ffc13
User: root@cc5c34bbd1ee
GolangVersion: go1.9.1

Bookinfo & sleep applications deployed, with the following route rules & destination policies:

Routing:
apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: details-default
  namespace: default
spec:
  destination:
    name: details
  precedence: 1
  route:
  - labels:
      version: v1
---
apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: productpage-default
  namespace: default
spec:
  destination:
    name: productpage
  precedence: 1
  route:
  - labels:
      version: v1
---
apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: ratings-default
  namespace: default
spec:
  destination:
    name: ratings
  precedence: 1
  route:
  - labels:
      version: v1
---
apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: reviews-default
  namespace: default
spec:
  destination:
    name: reviews
  precedence: 1
  route:
  - labels:
      version: v2
---

Destination:
apiVersion: config.istio.io/v1alpha2
kind: DestinationPolicy
metadata:
  name: reviews-cb
  namespace: default
spec:
  destination:
    name: reviews
    labels:
      version: v2
  circuitBreaker:
    simpleCb:
      maxConnections: 1
      httpMaxRequests: 1
      httpMaxPendingRequests: 1
      httpMaxRequestsPerConnection: 1
      sleepWindow: 3m
      httpDetectionInterval: 30s
      httpConsecutiveErrors: 1
      httpMaxEjectionPercent: 100

From the sleep container I am accessing the reviews service via curl, in the fashion:
for i in {1..100}; do curl http://10.99.89.197:9080/reviews/0; echo "."; done;
and have tried x2 of these running simultaneously.

I find that all requests are returned a 2xx response, but expect one of the connections to be rejected due to the cb configuration!?

Envoy:
curl http://localhost:15000/stats | grep reviews
cluster.out.reviews.default.svc.cluster.local|http|version=v2.bind_errors: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.external.upstream_rq_200: 200
cluster.out.reviews.default.svc.cluster.local|http|version=v2.external.upstream_rq_2xx: 200
cluster.out.reviews.default.svc.cluster.local|http|version=v2.lb_healthy_panic: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.lb_local_cluster_not_ok: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.lb_recalculate_zone_structures: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.lb_subsets_active: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.lb_subsets_created: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.lb_subsets_fallback: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.lb_subsets_removed: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.lb_subsets_selected: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.lb_zone_cluster_too_small: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.lb_zone_no_capacity_left: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.lb_zone_number_differs: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.lb_zone_routing_all_directly: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.lb_zone_routing_cross_zone: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.lb_zone_routing_sampled: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.max_host_weight: 1
cluster.out.reviews.default.svc.cluster.local|http|version=v2.membership_change: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.membership_healthy: 1
cluster.out.reviews.default.svc.cluster.local|http|version=v2.membership_total: 1
cluster.out.reviews.default.svc.cluster.local|http|version=v2.retry_or_shadow_abandoned: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.update_attempt: 62
cluster.out.reviews.default.svc.cluster.local|http|version=v2.update_empty: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.update_failure: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.update_rejected: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.update_success: 62
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_cx_active: 4
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_cx_close_notify: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_cx_connect_fail: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_cx_connect_timeout: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_cx_destroy: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_cx_destroy_local: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_cx_destroy_local_with_active_rq: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_cx_destroy_remote: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_cx_destroy_remote_with_active_rq: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_cx_destroy_with_active_rq: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_cx_http1_total: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_cx_http2_total: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_cx_max_requests: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_cx_none_healthy: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_cx_overflow: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_cx_protocol_error: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_cx_rx_bytes_buffered: 2360
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_cx_rx_bytes_total: 117995
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_cx_total: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_cx_tx_bytes_buffered: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_cx_tx_bytes_total: 106600
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_flow_control_backed_up_total: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_flow_control_drained_total: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_flow_control_paused_reading_total: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_flow_control_resumed_reading_total: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_rq_200: 200
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_rq_2xx: 200
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_rq_active: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_rq_cancelled: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_rq_maintenance_mode: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_rq_pending_active: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_rq_pending_failure_eject: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_rq_pending_overflow: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_rq_pending_total: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_rq_per_try_timeout: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_rq_retry: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_rq_retry_overflow: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_rq_retry_success: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_rq_rx_reset: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_rq_timeout: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_rq_total: 200
cluster.out.reviews.default.svc.cluster.local|http|version=v2.upstream_rq_tx_reset: 0
cluster.out.reviews.default.svc.cluster.local|http|version=v2.version: 10892423629305340143

Advice much appreciated please. If you require any further information - please let me know.

--
You received this message because you are subscribed to the Google Groups "Istio Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to istio-users...@googlegroups.com.
To post to this group, send email to istio...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/istio-users/ba7855ff-91c4-4472-becc-10c7cd83cef1%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--
Christian Posta
twitter: @christianposta
http://blog.christianposta.com

Matt Klein

Software Engineer

mkl...@lyft.com

https://calendly.com/mattklein123

Matt Klein

Jan 19, 2018, 1:41:47 PM

to Christian Posta, Envoy Users, Kamesh Sampath, Istio Users

It's hard for me to say what is happening without seeing the actual envoy config and a full dump of the stats. Just guessing, the issue is one of:

1) Reaching the "healthy panic" threshold for a very small cluster such that Envoy routes to all hosts. https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/load_balancing.html#panic-threshold (This can be confirmed via stats. The threshold is configurable in runtime; not sure about config.)

2) Reaching outlier detection max eject threshold: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/outlier#ejection-algorithm (settings can be configured and it should be clear from stats if ejections are overflowing).
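To make the first possibility concrete, here is a minimal Python sketch of the panic-threshold arithmetic. It is a simplified model, not Envoy's implementation. It assumes the documented default threshold of 50% and a strict "less than" comparison, and the function names are hypothetical.

```python
def healthy_percent(total_hosts, ejected_hosts):
    """Percentage of hosts in the cluster that are still considered healthy."""
    if total_hosts <= 0:
        raise ValueError("cluster has no hosts")
    return 100.0 * (total_hosts - ejected_hosts) / total_hosts


def in_panic_mode(total_hosts, ejected_hosts, panic_threshold=50.0):
    """Simplified model: panic when the healthy share drops below the threshold.

    Assumption: 50.0 mirrors Envoy's documented default and the comparison is
    strict; check the linked docs for the exact behavior of your version.
    """
    return healthy_percent(total_hosts, ejected_hosts) < panic_threshold


# Small clusters cross the threshold after very few ejections.
for total, ejected in [(1, 1), (2, 1), (3, 2), (4, 1)]:
    print(total, ejected, in_panic_mode(total, ejected))
```

Under this model, a single-host cluster (such as the `membership_total: 1` case in the stats earlier in the thread) is in panic as soon as its only host is ejected, so traffic keeps flowing to it. A non-zero `lb_healthy_panic` counter in the stats would be consistent with this explanation.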

If you reply with the Envoy (not Istio) config and a full dump of stats after you repro I can probably tell what is going on.

Christian Posta

Jan 19, 2018, 5:14:23 PM

to Matt Klein, Envoy Users, Kamesh Sampath, Istio Users

Thanks Matt!

I suspect we're seeing the healthy panic threshold. I'll verify this.

BTW... Istio configures the envoy proxies with xDS. The actual config file is quite small and just points all the xDS clusters to pilot. Is there a way to get Envoy to dump the effective configuration as we would expect to see it in the json file if we were manually configuring it? I know we can call into the various /clusters /routes endpoints but it'd be awesome if we can actually see the config that envoy thinks it sees. Otherwise we can grab what the pilot is exposing but I'd rather envoy tell us what the config really is.
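As a stopgap, the admin endpoints mentioned above can be captured together at repro time. The following Python sketch is only an illustration. It assumes the admin interface is on `localhost:15000`, as in the `curl` command earlier in the thread, and that `/stats`, `/clusters` and `/routes` are served by the Envoy build in use. `http_get` and `snapshot_admin` are hypothetical helper names. It does not reconstruct the effective JSON config.

```python
import urllib.request


def http_get(url, timeout=5.0):
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def snapshot_admin(base="http://localhost:15000",
                   paths=("/stats", "/clusters", "/routes"),
                   fetch=http_get):
    """Collect the output of several admin endpoints into one dict."""
    base = base.rstrip("/")
    # One entry per endpoint path, keyed by the path itself.
    return {path: fetch(base + path) for path in paths}
```

Calling `snapshot_admin()` inside the sidecar container right after a repro gives one dict to attach to a report. The `fetch` parameter is there so the function can be tested without a running proxy.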

Matt Klein

Jan 19, 2018, 6:20:15 PM

to Christian Posta, Envoy Users, Kamesh Sampath, Istio Users

Is there a way to get Envoy to dump the effective configuration as we would expect to see it in the json file if we were manually configuring it?

Unfortunately not. I don't think this would be too hard of a feature to add and it would be worthwhile (it's been asked for previously). Feel free to open an issue to track.

Christian Posta

Jan 19, 2018, 6:36:12 PM

to Matt Klein, Envoy Users, Kamesh Sampath, Istio Users

Awesome, thanks!

https://github.com/envoyproxy/envoy/issues/2421
