linkerd - external circuit breaking - empty response on HTTP 500

Betson Thomas

Apr 21, 2017, 5:33:32 PM
to linkerd-users, Betson Thomas
Hello,

Using the optimized yaml file provided by Kevin (https://groups.google.com/forum/#!topic/linkerd-users/IhnDgSgecIs), we're able to successfully reach endpoints external to the k8s cluster. While testing circuit breaking, we set up a sample service (located outside the kubernetes cluster) that always returns a failing HTTP 500 response, but it appears that when proxying through linkerd we receive no reply at all. Making the same call directly returns the 500 response as expected.

We have tried a few different failure accrual settings including explicitly setting it to none. All behave in the same way.
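
For reference, the failure accrual variants we tried go under the router's client section, roughly like this (a sketch rather than our exact config; the consecutiveFailures values shown are only illustrative):

routers:
- protocol: http
  client:
    # disable failure accrual entirely
    failureAccrual:
      kind: none
    # or: mark a host dead after 5 consecutive failures,
    # then probe it again after a constant backoff
    # failureAccrual:
    #   kind: io.l5d.consecutiveFailures
    #   failures: 5
    #   backoff:
    #     kind: constant
    #     ms: 10000
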
Thanks in advance for your help.

http_proxy=$(kubectl get svc l5d -o jsonpath="{.status.loadBalancer.ingress[0].*}"):4140 curl -vL http://ForcedFailureReturnsHTTP500     
*   Trying [...]...
* TCP_NODELAY set
* Connected to redacted.elb.amazonaws.com (...) port 4140 (#0)
> Host: ForcedFailureReturnsHTTP500
> User-Agent: curl/7.51.0
> Accept: */*
> Proxy-Connection: Keep-Alive

* Curl_http_done: called premature == 0
* Empty reply from server
* Connection #0 to host redacted.elb.amazonaws.com left intact
curl: (52) Empty reply from server

-Betson

Kevin Lingerfelt

Apr 24, 2017, 9:03:32 PM
to Betson Thomas, linkerd-users, Betson Thomas
Hi Betson,

When these requests fail do you see any messages in linkerd's log? Can you try turning up linkerd's log level and sending the logs from one of your linkerd processes when it processes one of the failing requests? If you're in Kubernetes, you can crank up logging with the following args:

      containers:
      - name: l5d
        image: buoyantio/linkerd:0.9.1
        args:
        - "/io.buoyant/linkerd/config/config.yaml"
        - "-com.twitter.finagle.tracing.debugTrace=true"
        - "-log.level=DEBUG"
        volumeMounts:
        - name: "linkerd-config"
          mountPath: "/io.buoyant/linkerd/config"
          readOnly: true

Kevin

Betson Thomas

May 2, 2017, 2:13:00 PM
to linkerd-users
Sorry for the delay on this. Attached is the log showing a ClosedChannelException. This is also against the official 1.0.0 release.

-Betson
linkerd-1.0.0-HTTP500-error.log.txt

Betson Thomas

May 4, 2017, 2:37:13 PM
to linkerd-users
Additional observations:

Failure Accrual: NONE

Response Classifier | Response                             | Log
Retryable5XX        | No Response                          | ClosedChannelException
NonRetryable5XX     | (Expected) 500 Internal Server Error | (none)


Failure Accrual: 5 CONSECUTIVE FAILURES

Response Classifier | Response                             | Log
Retryable5XX        | No Response                          | "Marking Node as Dead" and ClosedChannelException
NonRetryable5XX     | (Expected) 500 Internal Server Error | "Marking Node as Dead"
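
For clarity, "Retryable5XX" and "NonRetryable5XX" above refer to linkerd's built-in HTTP response classifiers, configured roughly like this (a sketch; depending on linkerd version the key sits either directly on the router or under its service section):

routers:
- protocol: http
  # retry reads that fail with a 5XX ("Retryable5XX" above)
  responseClassifier:
    kind: io.l5d.http.retryableRead5XX
  # or treat every 5XX as a non-retryable failure ("NonRetryable5XX" above):
  # responseClassifier:
  #   kind: io.l5d.http.nonRetryable5XX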

-Betson

Betson Thomas

May 11, 2017, 1:25:17 PM
to linkerd-users
Additionally, when an external node is marked as dead, requests are still sent to it (confirmed at the remote service).

-Betson

Kevin Lingerfelt

May 16, 2017, 2:46:22 PM
to Betson Thomas, linkerd-users
Hi Betson,

Sorry I didn't respond sooner. I'm actually not able to reproduce the behavior that you describe, where an http_proxy request through linkerd to a failing external service hangs instead of returning. Here is what I tried:

First I deployed the linkerd-egress.yml config to the "default" namespace in my k8s cluster. The linkerd config exposes a linkerd router on port 4140 with default failure accrual (5 consecutive failures) and retries enabled.
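
For anyone following along, the router in that config is along the lines of the egress example in the linkerd docs, something like this (a sketch, not the exact contents of linkerd-egress.yml; the port in the dtab comments is just illustrative):

routers:
- protocol: http
  label: outgoing
  dtab: |
    /ph  => /$/io.buoyant.rinet ;           # /ph/7003/host -> /$/io.buoyant.rinet/7003/host
    /svc => /ph/80 ;                        # no port in the Host header -> assume port 80
    /svc => /$/io.buoyant.porthostPfx/ph ;  # /svc/host:7003 -> /ph/7003/host
  servers:
  - port: 4140
    ip: 0.0.0.0

That dtab shape is why the client metrics further down appear under $/io.buoyant.rinet/7003/cayenne3.test.svc.cluster.local.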

Next I deployed a separate service "cayenne3" to the "test" namespace in my k8s cluster. The "cayenne3" service exposes an http server on port 7003 that returns 500s for all requests. The service is reachable by DNS at http://cayenne3.test.svc.cluster.local:7003.

Then I made an http_proxy request to that service:

$ http_proxy=http://$LINKERD_IP:4140 curl -v http://cayenne3.test.svc.cluster.local:7003
*   Trying $LINKERD_IP...
* TCP_NODELAY set
* Connected to $LINKERD_IP ($LINKERD_IP) port 4140 (#0)
> Host: cayenne3.test.svc.cluster.local:7003
> User-Agent: curl/7.50.3
> Accept: */*
> Proxy-Connection: Keep-Alive
< HTTP/1.1 500 Internal Server Error
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Tue, 16 May 2017 18:21:15 GMT
< Content-Length: 23
< l5d-success-class: 0.0
< Via: 1.1 linkerd
internal service error
* Curl_http_done: called premature == 0
* Connection #0 to host $LINKERD_IP left intact

That "internal service error" message is coming from the failing backend that I deployed, not linkerd.

Linkerd's metrics (:9990/admin/metrics.json) confirm this behavior. First, looking at the "service" metrics:

"rt/outgoing/service/svc/cayenne3.test.svc.cluster.local:7003/failures": 910,
"rt/outgoing/service/svc/cayenne3.test.svc.cluster.local:7003/failures/com.twitter.finagle.service.ResponseClassificationSyntheticException": 910,
"rt/outgoing/service/svc/cayenne3.test.svc.cluster.local:7003/requests": 910,
"rt/outgoing/service/svc/cayenne3.test.svc.cluster.local:7003/retries/per_request.avg": 4.491379310344827,
"rt/outgoing/service/svc/cayenne3.test.svc.cluster.local:7003/retries/total": 3968,
"rt/outgoing/service/svc/cayenne3.test.svc.cluster.local:7003/success": 0,

I sent linkerd 910 requests, and all 910 failed. Requests were retried 4.5 times on average, so those 910 requests to linkerd resulted in 4878 total requests from the linkerd client to the failing backend (910 initial attempts plus 3968 retries). You can see that confirmed by the "client" metrics:

"rt/outgoing/client/$/io.buoyant.rinet/7003/cayenne3.test.svc.cluster.local/failure_accrual/probes": 6,
"rt/outgoing/client/$/io.buoyant.rinet/7003/cayenne3.test.svc.cluster.local/failure_accrual/removals": 1,
"rt/outgoing/client/$/io.buoyant.rinet/7003/cayenne3.test.svc.cluster.local/failure_accrual/removed_for_ms": 468561,
"rt/outgoing/client/$/io.buoyant.rinet/7003/cayenne3.test.svc.cluster.local/failure_accrual/revivals": 0,
"rt/outgoing/client/$/io.buoyant.rinet/7003/cayenne3.test.svc.cluster.local/failures": 4878,
"rt/outgoing/client/$/io.buoyant.rinet/7003/cayenne3.test.svc.cluster.local/failures/com.twitter.finagle.service.ResponseClassificationSyntheticException": 4878,
"rt/outgoing/client/$/io.buoyant.rinet/7003/cayenne3.test.svc.cluster.local/requests": 4878,
"rt/outgoing/client/$/io.buoyant.rinet/7003/cayenne3.test.svc.cluster.local/success": 0,

Those metrics also show that failure accrual kicked in eventually. After failure accrual starts, linkerd will short-circuit requests before actually sending to the failing backend, and probe the backend on an interval to see if it has recovered. In my setup the backend has been probed 6 times, with 0 revivals. In linkerd's logs, I also see:

I 0516 18:12:19.109 UTC THREAD25 TraceId:4065c1d3d752a50d: FailureAccrualFactory marking connection to "$/io.buoyant.rinet/7003/cayenne3.test.svc.cluster.local" as dead. Remote Address: Inet(cayenne3.test.svc.cluster.local/$CAYENNE_IP:7003,Map())

I don't see any ClosedChannelExceptions as you described. Looking at the logs you sent, it looks like you also encountered a CancelledRequestException, which usually indicates that the client you're using to talk to linkerd canceled the request before it completed. This is often triggered by aggressive timeouts in your client. Since linkerd uses a retry budget, it might retry a failing request multiple times, and those retries could exceed your client's timeout. For instance, in my test, one request to linkerd was retried 100 times, taking a total of 371ms, before returning. If my client had a timeout enabled, it might have given up before the retries completed.

Retry behavior is fully configurable if you want to constrain the number of retry attempts per request. In general, I recommend not setting timeouts in the client that talks to linkerd, since linkerd manages timeout and retry behavior for you.
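
The knobs for that are the retry budget and backoff, roughly like this (a sketch only; exact placement of the retries block varies a bit across linkerd versions, and the values here are just examples):

routers:
- protocol: http
  retries:
    budget:
      minRetriesPerSec: 5    # always allow a small baseline of retries
      percentCanRetry: 0.2   # beyond that, retry at most 20% of requests
      ttlSecs: 10            # window over which the budget is computed
    backoff:
      kind: jittered
      minMs: 10
      maxMs: 10000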

Hope that helps,
Kevin

Betson Thomas

May 17, 2017, 2:49:50 PM
to linkerd-users
Thanks Kevin. We mimicked your external service setup and reproduced your results. However, I've received word of a discrepancy on our side: requests were still being sent to nodes marked as dead, separate from the probes. I'll get details on this and confirm.
We will retest with the sample egress configuration and our external service setup to identify any difference in performance.

-Betson

Kevin Lingerfelt

May 23, 2017, 1:56:26 PM
to Betson Thomas, linkerd-users
Hi Betson,

Alex provided a helpful writeup of how circuit breaking happens when all connections have been marked as dead. This seems relevant to your last question, and you can read about it in this forum post:


Hope that helps,
Kevin

Betson Thomas

May 24, 2017, 9:05:26 AM
to Kevin Lingerfelt, linkerd-users
Thanks Kevin. Yes, Shaik is on our team. I'll follow up on that thread. 

-Betson