Cloud Run tracing issue with concurrent requests

705 views
Skip to first unread message

ja...@pinian.co

unread,
Sep 3, 2021, 8:45:23 AM9/3/21
to Google Stackdriver Discussion Forum
Cloud Run fails to trace correctly when concurrent requests are made with the same TraceID. Specifically, it forwards the x-cloud-trace-context header to the application server as expected, but it does not record its own span with Cloud Trace.

I have a Cloud Run API service. When making sequential requests to it with an x-cloud-trace-context header as follows, each request is correctly traced.

for I in {1..3}; do
  curl -D - -o /dev/null -H"X-Cloud-Trace-Context: 00fb1bfed62a3789a04a24d3ab8ddd14/${I};o=1" -H"Accept: */*" https://pinian-core-s6qd3oo4jq-uc.a.run.app/v1/user?userId=ff4WFv2tGQ949AEAAQ
done

You can tell that tracing happened because the response header of x-cloud-trace-context is 00fb1bfed62a3789a04a24d3ab8ddd14;o=1. You can see this three times, one per response.

However, if multiple requests are made concurrently in a short enough window, only one of the requests will have its span recorded.

for I in {1..3}; do
  curl -D - -o /dev/null -H"X-Cloud-Trace-Context: 00fb1bfed62a3789a04a24d3ab8ddd14/${I};o=1" -H"Accept: */*" https://pinian-core-s6qd3oo4jq-uc.a.run.app/v1/user?userId=ff4WFv2tGQ949AEAAQ &
done

This is identical, other than the &.

The output is a little harder to read, but you can observe one request with an x-cloud-trace-context of 00fb1bfed62a3789a04a24d3ab8ddd14;o=1 and two requests with 00fb1bfed62a3789a04a24d3ab8ddd14, i.e. missing ;o=1 demonstrating that they haven't been handled properly. This is also confirmed by looking at the Trace dashboard for this TraceID.

This problem also occurs when the requests to the service are different, e.g. requesting /v1/user?userId=XXX, then /v1/questions, then /v1/topic?topicId=YYY. It happens regardless of whether the same SpanID is used for each request. 

I have also tested running many more requests concurrently.

As long as the TraceID is the same and the requests happen concurrently, only one in a given window of time will have its span recorded correctly. 

I have confirmed that it's not an application issue, because the application server is recording its spans correctly as part of the same trace. This also confirms that Cloud Run is forwarding the trace header.

My expectation is that Cloud Run records spans for each of these requests correctly, and this seems like a bug.

Jamie.
Screen Shot 2021-09-03 at 00.19.04.png

Eyamba Ita

unread,
Sep 9, 2021, 11:34:05 AM9/9/21
to Google Stackdriver Discussion Forum
Hello Jamie,

I am the Product Manager for Cloud Trace. Thanks for bringing this to our attention. We have filed it in our internal bug system and will be taking a look to determine exactly where the issue may lie in the execution chain. 
It seems like we are evaluating these spans generated concurrently as duplicates and that obviously has some downstream processing consequences.
I will let you know once we have pinpointed the source of the issue and introduced a fix if necessary.


Thanks
Eyamba

ja...@pinian.co

unread,
Mar 14, 2022, 2:46:18 PM3/14/22
to Google Stackdriver Discussion Forum
Hi, 

Is there any update on this? I'm still seeing the problem, and it really lessens the utility of tracing. Because none of the spans are properly connected in the UI, it's hard to tell what's happening as part of each request.

Jamie.

Eyamba Ita

unread,
Mar 14, 2022, 3:52:54 PM3/14/22
to Google Stackdriver Discussion Forum

Hello Jamie,

Apologies for the delayed response. 

Upon further investigation, it seems that the concurrent requests are not specifying a new parent ID and all claiming to be the root span of traceid 00fb1bfed62a3789a04a24d3ab8ddd14.

This means that upon ingestion into our product, all these 3 new spans are evaluated as being the same span from different times and thus two are discarded and 1 is kept.

This is why we recommend that every new request that isn’t propagating an existing traceid should use a globally unique parent id.  I suspect if you try that it should resolve the seemingly inconsistent behavior you have observed.

Happy to chat further.

Eyamba

Jamie Talbot

unread,
Mar 14, 2022, 4:21:01 PM3/14/22
to Eyamba Ita, Google Stackdriver Discussion Forum
Hey Eyamba, 

Thanks for the update. Although in my example I'm doing a few curl disconnected requests, this is just a simplified version. In reality, I have:

Browser -> Cloud Load Balancer -> Cloud Run -> Frontend (NodeJS server)

In order to render a page, the Frontend server makes three concurrent requests to the API server:

Frontend -> Cloud Load Balancer -> Cloud Run -> Api Server (Go server)

So, in that situation, the concurrent requests are actually part of a common trace, and do share a parent. And, those API requests are enhanced with a X-Cloud-Trace-Context that is generated automatically by the Open Telemetry AutoInstrumentation, with the official Open Telemetry Cloud Trace Propagator (@google-cloud/opentelemetry-cloud-trace-propagator).

Under these circumstances, the same behaviour occurs, which you can see in the following screenshot. 

Screen Shot 2022-03-14 at 16.05.19.png

For this request, the three parallel HTTPS GET requests are traced by the Frontend server. Those requests pass through Cloud Run (not traced) to the API server which is traced (Recv.*).

To be clear, Recv./pinian.ContentService/GetPosition should be a child of the 132ms HTTPS GET, and the other Recv spans children of the other two HTTPS GETs.

Note also that CloudRun *does* report a trace for the HTTPS GET on line 3 (45ms), and *does* report a trace for the HTTPS GET on line 7 (38ms), which are peer requests to the same API service sent with the same propagation. Presumably because enough time has elapsed to not be counted as a duplicate. But, none of these requests are duplicates.

Hope that helps explain the problem.

Jamie.





--
© 2021 Google Inc. 1600 Amphitheatre Parkway, Mountain View, CA 94043
 
Email preferences: You received this email because you signed up for the Google Stackdriver Discussion Google Group (google-stackdr...@googlegroups.com) to participate in discussions with other members of the GoogleStackdriver community.
---
You received this message because you are subscribed to the Google Groups "Google Stackdriver Discussion Forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to google-stackdriver-d...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/google-stackdriver-discussion/49e1ca32-7c6c-4a20-95eb-790407f83ff8n%40googlegroups.com.

Eyamba Ita

unread,
Mar 16, 2022, 12:02:01 PM3/16/22
to Google Stackdriver Discussion Forum
Jamie,

Thanks for sharing additional details. It really helps clarify the issue for us.
Yes, it seems you are right that there is something inconsistent with the way Cloud Run is propagating context headers.
It seems to only be passing along the traceid and not the span parent id.
I will pass these details along to my counterparts on the CloudRun team to reproduce and implement a fix. 
Once that is implemented, they will circle back to notify you. I will stay in the loop until this is closed.
Again, apologies for the delay.

Thanks
Eyamba
Reply all
Reply to author
Forward
0 new messages