Inconsistent transcoding of response to JSON

75 views
Skip to first unread message

Øyvind Wergeland

unread,
May 12, 2020, 5:49:40 AM5/12/20
to Google Cloud Endpoints
Hi,

We are using ESP as a sidecar in GKE to transcode calls to our gRPC services. For somewhat large responses, we see that sometimes the some of the protobuf fields are missing in the JSON response, but the JSON is valid with all closing curly brackets.

When runnning with --enable-debug we see that the received gRPC response is always of the same size for a given request, but the serialized string is sometimes shorter:

From /var/log/nginx/error.log - these are all responses for the exact same request, and contain the exact same protobuf content every time from the gRPC service:

2020/05/11 18:27:52 [debug] 11#11: *1 NgxEspGrpcServerCall::Write: Writing 388422 bytes
2020/05/11 18:27:52 [debug] 11#11: *1 NgxEspGrpcServerCall::Write: completed synchronously
...
2020/05/11 18:27:52 [debug] 11#11: *1 http chunk: 886108
...
2020/05/11 18:28:00 [debug] 11#11: *5 NgxEspGrpcServerCall::Write: Writing 388422 bytes
2020/05/11 18:28:00 [debug] 11#11: *5 NgxEspGrpcServerCall::Write: completed synchronously
...
2020/05/11 18:28:00 [debug] 11#11: *5 http chunk: 927729
...
2020/05/11 18:28:02 [debug] 11#11: *7 NgxEspGrpcServerCall::Write: Writing 388422 bytes
2020/05/11 18:28:02 [debug] 11#11: *7 NgxEspGrpcServerCall::Write: completed synchronously
...
2020/05/11 18:28:02 [debug] 11#11: *7 http chunk: 927729


From the GKE container log:

2020-05-11T18:27:53.264270096Z 127.0.0.1 - - [11/May/2020:18:27:53 +0000] "POST /someService/someMethod HTTP/1.1" 200 886122 "-" "curl/7.54.0"
2020-05-11T18:28:01.173099703Z 127.0.0.1 - - [11/May/2020:18:28:01 +0000] "POST /someService/someMethod HTTP/1.1" 200 927743 "-" "curl/7.54.0"
2020-05-11T18:28:03.372017566Z 127.0.0.1 - - [11/May/2020:18:28:03 +0000] "POST /someService/someMethod HTTP/1.1" 200 927743 "-" "curl/7.54.0"

Note that this does not always occur on the first request, but on the debug instance it happens typically in one of the 10 first requests after restart (no other HTTP requests are sent to this instance). After the first time it happens very seldom on that instance, but much more often in our production environment.

We have reproduced with tags 1.15.0, 1.30.0, 1.48.0 and 1.50.0 of Docker image gcr.io/endpoints-release/endpoints-runtime

I'm a bit at loss on how to handle this issue, as it is reproducible, but not in a completely consistent manner. We are fairly certain it happens somewhere in  NgxEspTranscodedGrpcServerCall::ConvertResponseMessage  (https://github.com/cloudendpoints/esp/blob/master/src/nginx/transcoded_grpc_server_call.cc#L196), but there is a lot of code to track down from there.

Any idea on what's causing this or how to isolate the issue further is greatly appreciated!


Best regards,

Øyvind Matheson Wergeland
Director of Software Engineering
Exabel AS

Teju Nareddy

unread,
May 12, 2020, 5:01:49 PM5/12/20
to Google Cloud Endpoints
Can you share some details about which specific fields are missing in the JSON response? Are they enums, primitive types, or messages? If you can post the varying responses you're seeing, that would help.

My initial hunch is that the output is missing fields with default values. https://github.com/cloudendpoints/esp/blob/b33a63d7103e37c80aea492c3243e923f2d428f1/start_esp/start_esp.py#L754

        Whether to always print primitive fields
        for grpc-json transcoding. By default primitive fields with default
        values will be omitted in JSON output. For example, an int32 field set
        to 0 will be omitted. Setting this flag to true will override the
        default behavior and print primitive fields regardless of their values.
        Defaults to false

Wayne, can you take a look. I don't have a good understanding of how ESP buffers the response during transcoding. The logs lines for `Write` may be dependent on the amount of of data buffered, hence why the size varies.

qiwz...@google.com

unread,
May 12, 2020, 7:24:17 PM5/12/20
to Google Cloud Endpoints
It seems that ESP received serialized proto response message is 388422 bytes, but after translated into JSON,  its size changes.   Sometime is 886108 bytes and sometime is 927729 bytes.  The translated json should be in the debug log from this line
.   Maybe you can do a diff to see what is the different, or you can diff the result you received from your Http client.  

Bottom line is: the transcoding code is calling this protobuf function,  the problem could be inside that function. 

Øyvind Wergeland

unread,
May 13, 2020, 10:23:34 AM5/13/20
to Google Cloud Endpoints
Hi,

Thank you very much for your feedback.

First of all: All the responses in the original post are for the exact same gRPC response from the backend.

I'm not sure if I can share the exact proto definitions and messages, but when this message is truncated, it is in a repeated field. The repeated message is a nested nested message with an enum and a oneof of some other nested messages. 

It may of course be completely coincidental that this happens in the repeated field.

The message structure is fairly similar to:

message GrpcResponseMessage {
  // several fields
  message Result = 7;
  // some more fields
}

message Result {
  // several fields with lot of data
  repeated Warning warnings = 14;
  // a few more fields
}

message Warning {
  // two deprecated fields, not set in this response
  // an enum field:
  WarningType type = 3;

  oneof details {
    WarningTypeOne type_one = 4
    WarningTypeTwo type_two = 5
    WarningTypeThree type_three = 6
    WarningTypeFour type_four = 7
  }
}

I noticed that even though the JSON is syntactically correct, the JSON may be invalid. At least one observation contains the same field twice in the last message, but with an empty value, e.g like this:

…, {"type":"TYPE_ONE","typeOne":{"message":"warning details"},"typeOne":{}}]}}}

while there should have been more warnings in the list. The next warning is of the same type.

I've managed to build ESP with added debug logging in NgxEspTranscodedGrpcServerCall::ReadTranslatedResponse, which confirms that the truncation has already happend, as size and std::string(reinterpret_cast<const char *>(buffer), size).size() are the exact same size as the number of bytes returned to the HTTP client (I piped the response body from curl to wc -c). The sizes matches in client and debug log both when the message is complete and when it is truncated.

Could this be caused by the binary_input stream being closed somehow before all the fields are translated to JSON in https://github.com/protocolbuffers/protobuf/blob/master/src/google/protobuf/util/json_util.cc#L84 ?

By the way, I built the docker image with /bin/bash ./script/linux-build-docker -i <image> -p generic - is this the correct way to build images similar to gcr.io/endpoints-release/endpoints-runtime?

I also had to change the logging calls from ngx_log_debug1(NGX_LOG_DEBUG_HTTP, ...) to ngx_log_error(NGX_LOG_DEBUG, ...) to get the debug entries in /var/log/nginx/error.log.

We're starting this container with --enable-debug.

Regards, Øyvind

Wayne Zhang

unread,
May 13, 2020, 2:55:52 PM5/13/20
to Øyvind Wergeland, Google Cloud Endpoints
A couple ideas on how to debug this
== try to validate if the binary data before transcoding is good,  try to write the grpc_msg data to a file from this line (or HEX it and write to the log, I don't think you can write binary to the log) and compare it to the one from your grpc server,  or compare it between different requests. 
== with the binary data, try to write a stand alone code to just call this transcoding function to see if we can reproduce the problem,  maybe call it repeatedly with thousands of times to see if the problem occurs.  

Does the truncate happen at the end of JSON, or at the middle.   Even at the end,   the transcoing_code may add the necessary ending blankets to form a valid json.   This may indicate the input binary data is truncated. 

--
You received this message because you are subscribed to a topic in the Google Groups "Google Cloud Endpoints" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/google-cloud-endpoints/eqxNqq8k4jA/unsubscribe.
To unsubscribe from this group and all its topics, send an email to google-cloud-endp...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/google-cloud-endpoints/2aea1288-703e-4312-97f6-1681c4dd2dba%40googlegroups.com.

Øyvind Wergeland

unread,
May 15, 2020, 3:05:57 AM5/15/20
to Google Cloud Endpoints
Thanks for the suggestions. We've managed to verify that the binary data received by ESP always is complete and identical by writing to disk.

The only difference between responses we have found so far is that the slicing varies (by debugging the message serializer), so a vague theory is that a specific slicing could trigger the bug.

We also found a possible bug here, as transcoder_->ResponseOutput()->Next() is only called once, and not until it returns false, per documentation. However, adding an additional call to Next() always yield false response and 0 bytes, both when the end result is truncated and not, so it seems that the implementation always reads all its input if there is any.

We have not yet tried to call the transcoding function repeatedly with the same data.

The truncation always happen at the end.

To unsubscribe from this group and all its topics, send an email to google-cloud-endpoints+unsub...@googlegroups.com.

Øyvind Wergeland

unread,
May 15, 2020, 11:22:55 AM5/15/20
to Google Cloud Endpoints
We've managed to isolate the bug.

The problem was as expected triggered by the variations in slicing. The bug occurs when the last tag in a slice is an unknown type of variable length (I'm not sure why we have unknown tags in the message though). The problem is that the part of the field value that is available in the next slice is not skipped, and the message reading is corrupted from that slice.

Unknown fields are skipped here with a call to WireFormat::SkipField

That leads to a call to to CodedInputStream::Skip here, which is delegated to CodedInputStream::SkipFallback, thats calls the unsupported method GrpcZeroCopyInputStream::Skip in ESP here.

It looks like the missing return value check in ProtoStreamObjectSource::WriteMessage could mask the problem.

The fix is to implement GrpcZeroCopyInputStream::Skip.

Regards, Øyvind

Wayne Zhang

unread,
May 15, 2020, 12:29:58 PM5/15/20
to Øyvind Wergeland, Google Cloud Endpoints
That is a great debug.  It seems that the issue is in the protobuf library.  Could you file an issue and PR to fix it?   Thanks

-Wayne

To unsubscribe from this group and all its topics, send an email to google-cloud-endp...@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "Google Cloud Endpoints" group.
To unsubscribe from this group and stop receiving emails from it, send an email to google-cloud-endp...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/google-cloud-endpoints/3a67d207-9137-477e-bbee-83ed718d4acc%40googlegroups.com.

Øyvind Wergeland

unread,
May 15, 2020, 1:26:46 PM5/15/20
to Google Cloud Endpoints
Thanks. Actually, it looks like the bug is in ESP with the missing implementation of `Skip` (I've verified that it is being called by adding debug logging in it):


I've tried to verify that it works if it is implemented, and I'll make an issue and PR if it is successful.

The reason we triggered the bug was that those field I believed were deprecated and never set, in fact was set, but deleted from the proto defintiion (and hence of unknown type to ESP), and one of those fields was a string field. We've worked around the issue by reintroducing the fields to the proto definition.

Regards, Øyvind
To unsubscribe from this group and all its topics, send an email to google-cloud-endpoints+unsub...@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "Google Cloud Endpoints" group.
To unsubscribe from this group and stop receiving emails from it, send an email to google-cloud-endpoints+unsub...@googlegroups.com.

Wayne Zhang

unread,
May 15, 2020, 2:28:09 PM5/15/20
to Øyvind Wergeland, Google Cloud Endpoints
Thanks.  I filed this issue to track it. 

To unsubscribe from this group and all its topics, send an email to google-cloud-endp...@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "Google Cloud Endpoints" group.
To unsubscribe from this group and stop receiving emails from it, send an email to google-cloud-endp...@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "Google Cloud Endpoints" group.
To unsubscribe from this group and stop receiving emails from it, send an email to google-cloud-endp...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/google-cloud-endpoints/f5d681bc-6d2d-4eba-b02a-e9119d8b70d0%40googlegroups.com.

Øyvind Wergeland

unread,
May 15, 2020, 2:42:02 PM5/15/20
to Google Cloud Endpoints
Excellent! We have confirmed that it works with a proper implementation of Skip, and are working to prepare a PR.

To unsubscribe from this group and all its topics, send an email to google-cloud-endpoints+unsub...@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "Google Cloud Endpoints" group.
To unsubscribe from this group and stop receiving emails from it, send an email to google-cloud-endpoints+unsub...@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "Google Cloud Endpoints" group.
To unsubscribe from this group and stop receiving emails from it, send an email to google-cloud-endpoints+unsub...@googlegroups.com.

Øyvind Wergeland

unread,
May 18, 2020, 10:52:51 AM5/18/20
to Google Cloud Endpoints
Thank you very much for taking over the PR.

I just wondered of we should be concerned regarding the missing check on the return value of Skip here in protobuf? It did mask the bug in this case, but I'm not sure if an actual error would occur in ESP, since gRPC streaming is not supported.

Wayne Zhang

unread,
May 18, 2020, 12:32:35 PM5/18/20
to Øyvind Wergeland, Google Cloud Endpoints
You can file an issue with protobuf repo to seek their input. 

-Wayne

To unsubscribe from this group and all its topics, send an email to google-cloud-endp...@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "Google Cloud Endpoints" group.
To unsubscribe from this group and stop receiving emails from it, send an email to google-cloud-endp...@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "Google Cloud Endpoints" group.
To unsubscribe from this group and stop receiving emails from it, send an email to google-cloud-endp...@googlegroups.com.

--
You received this message because you are subscribed to a topic in the Google Groups "Google Cloud Endpoints" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/google-cloud-endpoints/eqxNqq8k4jA/unsubscribe.
To unsubscribe from this group and all its topics, send an email to google-cloud-endp...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/google-cloud-endpoints/e8edc3e9-6031-457d-9c7c-4b21d3dc1a40%40googlegroups.com.
Reply all
Reply to author
Forward
0 new messages