ESP and custom nginx troubleshooting


tv...@slb.com

Jun 22, 2018, 9:11:12 AM
to Google Cloud Endpoints
Hi,

I'm running a grpc-service on kubernetes, using the ESP. I have a serious issue I can't figure out, and I was hoping for some ideas. 

Some of my requests are incorrectly returned to clients empty (0 bytes), but with status code OK. The exact same requests are returned correctly both a few seconds before and after these incidents. When I check the 'produced api' logs there are 500s logged for these requests. To verify, I added additional logging to my custom nginx file, and it turns out the very same requests that the ESP logs as 500 are logged as 200 by nginx itself. This of course means the clients never receive the error and never retry, as they should. The client is implemented in Go and is making gRPC requests, not REST, so the transcoding should not come into play.

To make matters even stranger, these requests never even seem to make it to my actual backend, so the root cause of the 500 seems to be inside the ESP somewhere. The incidents also seem to happen in bursts, typically about 5-10 of them within the same second or so. The only other log entries I've found that seem relevant within the same time window are related to nginx buffering requests to file (see example below), but I normally see many of these warnings without any issues at all.

2018/06/20 22:54:51 [warn] 11#11: *163265 a client request body is buffered to a temporary file /var/cache/nginx/client_temp/0000016957, client: 10.0.0.15, server: , request: "POST /(redacted)/Read HTTP/2.0", host: "(redacted).cloud.goog"

For now, I'm mostly interested in making sure the 500s are not incorrectly masked as 200 for the clients; the root cause of the 500s I can look at later. I suspect I somehow mask these 500s as 200s by some mistake in my custom nginx file, but I can't figure out what it might be. The problem could also be in the ESP code, for example correctly logging the request as an error but for whatever reason still returning it as OK.

Below is my custom nginx file:

daemon off;

user nginx nginx;

pid /var/run/nginx.pid;

worker_processes auto;
events {
    use epoll;
}

# Logging to stderr enables better integration with Docker and GKE/Kubernetes.
error_log stderr warn;

http {
  include /etc/nginx/mime.types;
  server_tokens off;

  # HTTP subrequests
  endpoints_resolver 8.8.8.8;
  endpoints_certificates /etc/nginx/trusted-ca-certificates.crt;

  set_real_ip_from  0.0.0.0/0;
  set_real_ip_from  0::/0;
  real_ip_header    X-Forwarded-For;
  real_ip_recursive on;

  # Format log as json for nice integration with StackDriver
  # Full list of available variables: http://nginx.org/en/docs/varindex.html
  log_format trace escape=json
    '{'
      # This one is picked up specifically by StackDriver and shown as the headline of the log entry
      '"message": "$uri",'
      '"http_response_code": "$status",'
      '"http_method":"$request_method",'
      '"http_user_agent":"$http_user_agent",'
      '"http_referrer":"$http_referer",'
      '"request_size_in_bytes":"$request_length",'
      '"response_size_in_bytes":"$bytes_sent",'
      '"X-Request-ID":"$http_x_request_id",'
      '"X-Parent-Request-ID":"$http_x_parent_request_id"'
    '}';

  server {

    # Desperately try to get rid of those 'buffering' warnings, does not seem to work so far.
    client_max_body_size 10M;
    client_body_buffer_size 512M;

    server_name "";
    # don't send the nginx version number in error pages and Server header
    server_tokens off;

    listen 9001 ssl backlog=16384;
    listen 9000 http2 ssl backlog=16384;
    ssl_certificate /etc/nginx/ssl/fullchain.pem;
    ssl_certificate_key /etc/nginx/ssl/privkey.pem;

    # make the server choose the best cipher instead of the browser
    # Perfect Forward Secrecy(PFS) is frequently compromised without this
    ssl_prefer_server_ciphers on;

    # support only believed secure ciphersuites using the following priority:
    # 1.) prefer PFS enabled ciphers
    # 2.) prefer AES128 over AES256 for speed (AES128 has completely adequate security for now)
    # 3.) Support DES3 for IE8 support
    #
    # disable the following ciphersuites completely
    # 1.) null ciphers
    # 2.) ciphers with low security
    # 3.) fixed ECDH cipher (does not allow for PFS)
    # 4.) known vulnerable ciphers (MD5, RC4, etc)
    # 5.) little-used ciphers (Camellia, Seed)
    ssl_ciphers 'kEECDH+ECDSA+AES128 kEECDH+ECDSA+AES256 kEECDH+AES128 kEECDH+AES256 kEDH+AES128 kEDH+AES256 DES-CBC3-SHA +SHA !aNULL !eNULL !LOW !kECDH !DSS !MD5 !EXP !PSK !SRP !CAMELLIA !SEED !3DES !SHA1';

    # Use 2048 bit Diffie-Hellman RSA key parameters
    # (otherwise Nginx defaults to 1024 bit, lowering the strength of encryption when using PFS)
    # Generated by OpenSSL with the following command:
    # openssl dhparam -outform pem -out /etc/nginx/ssl/dhparam2048.pem 2048
    ssl_dhparam /etc/nginx/custom/dhparam2048.pem;

    # Cache SSL Sessions for up to 10 minutes
    # This improves performance by avoiding the costly session negotiation process where possible
    ssl_session_cache builtin:1000 shared:SSL:10m;
    ssl_session_timeout  10m;
    ssl_session_tickets off; # Requires nginx >= 1.5.9

    # google domain verification (in order to use PubSub push notifications we have
    # to prove ownership of the target URL)
    location /google8b0fc1b90cdd1d3d.html { alias /etc/nginx/google-domain-verification.html; }

    access_log /dev/stdout trace;

    location / {
      # Enable CORS
      add_header Access-Control-Allow-Methods 'GET,OPTIONS,PUT,POST,DELETE' always;
      add_header Access-Control-Allow-Credentials 'true' always;
      add_header Access-Control-Allow-Origin '$http_origin' always;
      add_header Access-Control-Allow-Headers 'Access-Control-Allow-Origin,Authorization,DNT,User-Agent,Keep-Alive,Content-Type,accept,origin,X-Requested-With' always;

      # when serving user-supplied content, include a X-Content-Type-Options: nosniff header along with the Content-Type: header,
      # to disable content-type sniffing on some browsers.
      add_header X-Content-Type-Options 'nosniff';

      add_header Cache-Control 'no-cache, no-store, must-revalidate';
      add_header Cache-Pragma 'no-cache';

      # This is dynamic content, so should never be cached.
      expires -1;

      # The AppSec scan says we need to define these headers. They're not really relevant for APIs serving nothing but json
      # but doesn't do any harm, so added here to silence the AppSec tests.
      add_header Content-Security-Policy 'script-src \'self\'';
      add_header X-XSS-Protection '1; mode=block;';

      add_header Strict-Transport-Security 'max-age=86400; includeSubDomains';

      add_header X-Frame-Options "DENY";

      if ($request_method = OPTIONS ) {
          return 200;
      }

      # Begin Endpoints v2 Support
      endpoints {
        on;
        server_config /etc/nginx/server_config.pb.txt;
        api /etc/nginx/endpoints/service.json;
        metadata_server http://169.254.169.254;
      }
      # End Endpoints v2 Support

      # WARNING: only first backend is used
      grpc_pass 127.0.0.1:8000 override;
    }

    include /var/lib/nginx/extra/*.conf;
  }

  server {
    # expose /nginx_status and /endpoints_status but on a different port to
    # avoid external visibility / conflicts with the app.
    listen 8090;
    location /nginx_status {
      stub_status on;
      access_log off;
    }
    location /endpoints_status {
      endpoints_status;
      access_log off;
    }
    location /healthz {
      return 200;
      access_log off;
    }
    location / {
      root /dev/null;
    }
  }
}


Dan Ciruli

Jun 22, 2018, 12:05:02 PM
to Torbjorn Vik, Google Cloud Endpoints
Hi Torbjorn,

We will have someone look immediately.

DC


tv...@slb.com

Jun 22, 2018, 2:53:33 PM
to Google Cloud Endpoints
Oh, I'm running ESP version 1.19.0, by the way. That might be relevant.

Tao Li

Jun 23, 2018, 2:57:44 PM
to Google Cloud Endpoints


On Friday, June 22, 2018 at 11:53:33 AM UTC-7, tv...@slb.com wrote:
Oh, I'm running ESP version 1.19.0, by the way. That might be relevant.

Can you try with v1.20.0? In the meantime, we are investigating and will get back to you soon.

lei...@google.com

Jun 23, 2018, 3:51:42 PM
to Google Cloud Endpoints
Hi Torbjorn,
Can you share the steps to reproduce the problem? Meanwhile, since your setup has a client making grpc requests and a grpc service, can you try a simple hello-world type grpc service (e.g., the grpc client sends a "hello" and the grpc service replies with a "world") without your custom nginx config, to see if it has the same problem?
The link https://cloud.google.com/endpoints/docs/grpc/about-grpc includes the instructions for a hello-world grpc service. If a hello-world grpc service without your custom config works, you can add your custom settings to the nginx config file one by one until you see which one leads to the problem. If a hello-world grpc service without your custom config does not work, the possible causes of the problem are narrowed down to Cloud Endpoints itself.
Thanks,
Lei


tv...@slb.com

Jun 24, 2018, 10:39:11 AM
to Google Cloud Endpoints
Hi Lei,

It's difficult for me to give exact steps to reproduce, given the issue is so intermittent. For example, over the last 24 hours it's happened once out of about 2.5M requests. For the same reason it's impractical to modify my nginx config file line by line, given it could take weeks to narrow down the issue that way. If there was a way to get more verbose logging from the ESP, that might be more useful for understanding what's happening. Given the ESP is actually logging 500s, maybe some more logging info could help us. I looked for some sort of 'verbose' flag for the ESP, but was unable to find anything... do you know if this exists?

T.

lei...@google.com

Jun 24, 2018, 1:25:15 PM
to Google Cloud Endpoints
Hi Torbjorn,
Yes, a more verbose logging flag for ESP exists. More verbose logging can be achieved by configuring the log level in the line "error_log stderr warn;" of your custom nginx config file. The syntax of the directive is: error_log [log_file] [log_level];
The first parameter defines where the log is written, e.g., "stderr" selects standard error. The second parameter defines the logging level and can be one of the following values: debug, info, notice, warn, error, crit, alert, or emerg. These levels are listed in order of decreasing verbosity, with debug being the most verbose option and warn somewhere in the middle.
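For example, the most verbose option while troubleshooting would be to change the line you already have to the following (keeping stderr as the target, exactly as in your posted config), and switch it back to warn or info afterwards since debug output is very chatty:

error_log stderr debug;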
Thanks,
Lei

tv...@slb.com

Jun 24, 2018, 8:12:04 PM
to Google Cloud Endpoints
Ah, I see. I'm no expert in NGINX, and I wasn't sure if this logging level was only for Nginx itself or if it would somehow control the ESP-plugin/system/piece/stuff as well. I've set it to 'debug' now and we'll see if we get something more useful out of it. 

Thanks!

tv...@slb.com

Jun 26, 2018, 10:47:38 AM
to Google Cloud Endpoints
I've set the nginx logging level to 'debug' now (and also upgraded to 1.20.0) and have a bit more information to share after another burst of errors this morning. The timing of the errors seems to coincide with a burst of two other sets of log messages. At the time of the problem, I get a handful of these:

2018-06-26 05:56:21.000 PDT  2018/06/26 12:56:21 [debug] 11#11: *20498 Failed to send the trailers - rc=-1

...and these:

2018-06-26 05:56:21.000 PDT  2018/06/26 12:56:21 [info] 10#10: *20554 client canceled stream 29, client: 10.0.0.6...

Given I just enabled the debug-level logging, I unfortunately don't have a very long history for verification, but during the last 24 hours the two log messages above appeared only one other time. In that case they did _not_ coincide with the 500-masking issue, so there seems to be a strong correlation, but not necessarily a perfect 1-to-1 relationship.

Does this spark any thoughts or ideas?

tv...@slb.com

Jun 26, 2018, 10:49:46 AM
to Google Cloud Endpoints
Oh, and the requests that fail are always against a streaming API (the client sends one request message, the server streams back many messages); that might be relevant.

lei...@google.com

Jun 26, 2018, 5:00:38 PM
to Google Cloud Endpoints
These two logs ("Failed to send the trailers ..." and "client canceled stream ...") indicate that the TCP connection between the Go client and the ESP nginx proxy was broken for a short burst, which may be due to network problems or to the Go client disconnecting. These grpc failures happen with a small probability ("the last 24 hours it's happened 1 time out of about 2.5M requests"). To make the grpc requests more reliable, you can add a retry mechanism at the Go client. But even with a retry mechanism, grpc transmission failures may still occur (e.g., if the network is disconnected for 10 seconds while the grpc client only retries three times within 3 seconds after the first failure).



lei...@google.com

Jun 26, 2018, 6:13:54 PM
to Google Cloud Endpoints
As a workaround for the issue of an empty grpc response with grpc status code OK, when the Go client receives such a response it can detect that it is empty and retry the request a few times. The following is pseudo-code for such retry logic at the Go client:
retryTimes = 0
while (grpcResponse.status != OK || grpcResponse.size == 0) {
   if (retryTimes >= MAX_RETRY_TIMES) {
     break;
   }
   retryTimes++;
   // Retry the grpc request and refresh grpcResponse
   ....
}
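To make that concrete in Go (the thread's client language), here is a minimal sketch of that retry-on-empty-response logic for a server-streaming call. Everything project-specific is a hypothetical placeholder: the pb import path, MyServiceClient, ReadRequest/ReadResponse and the Read method stand in for the redacted API.

package client

import (
	"context"
	"errors"
	"io"

	pb "example.com/yourproject/api" // hypothetical generated gRPC stubs
)

const maxRetries = 3

// readWithRetry calls the (hypothetical) server-streaming Read method and
// retries when the stream fails, or when it ends with status OK but carries
// zero messages -- the masked-500 symptom described in this thread.
func readWithRetry(ctx context.Context, client pb.MyServiceClient, req *pb.ReadRequest) ([]*pb.ReadResponse, error) {
	var lastErr error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		stream, err := client.Read(ctx, req)
		if err != nil {
			lastErr = err
			continue
		}
		var msgs []*pb.ReadResponse
		var recvErr error
		for {
			msg, err := stream.Recv()
			if err == io.EOF {
				break // server closed the stream with grpc-status OK
			}
			if err != nil {
				recvErr = err
				break
			}
			msgs = append(msgs, msg)
		}
		if recvErr == nil && len(msgs) > 0 {
			return msgs, nil // non-empty stream that terminated cleanly
		}
		if recvErr != nil {
			lastErr = recvErr
		} else {
			lastErr = errors.New("stream ended OK but returned no messages")
		}
	}
	return nil, lastErr
}

One caveat with this approach: if some requests can legitimately produce zero messages, the client cannot distinguish that from the masked failure, so the workaround only fits APIs where an empty stream is never a valid answer.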

tv...@slb.com

Jun 27, 2018, 1:41:46 PM
to Google Cloud Endpoints
Yeah, that workaround would probably work. But I'm still very interested in figuring out what the proper solution would be, given this is an API we expose to our external clients and I don't want to have to tell them to implement workarounds like that (at least not without saying we're working on the proper solution).

So those log messages above indicate that there is a loss of connection, but that still doesn't explain why the ESP logs these requests as 500 while nginx logs them as 200... That's the main problem here, as these errors are being incorrectly masked as OK.

lei...@google.com

Jun 27, 2018, 2:29:32 PM
to Google Cloud Endpoints
Regarding "it still doesn't explain why the ESP logs these requests as 500 while nginx logs them as 200... That's the main problem here, as these errors are being incorrectly masked as OK": I think the reason ESP logs these requests as 500 while nginx logs them as 200 is that you added additional logging/changes in your custom nginx file (cited from your first post: "When I check the 'produced api' logs there are 500s logged for these requests. To verify, I added additional logging to my custom nginx file, and it turns out the very same requests that the ESP logs as 500 are logged as 200 by nginx itself.")

Since there are 500s logged for these requests in the 'produced api' logs, you can view such errors there. If you need additional logging, you can configure the log level through "error_log [log_file] [log_level]", which will surface these errors in logs like the following from one of your previous posts.
2018-06-26 05:56:21.000 PDT  2018/06/26 12:56:21 [debug] 11#11: *20498 Failed to send the trailers - rc=-1
2018-06-26 05:56:21.000 PDT  2018/06/26 12:56:21 [info] 10#10: *20554 client canceled stream 29, client: 10.0.0.6...

tv...@slb.com

Jun 27, 2018, 2:49:13 PM
to Google Cloud Endpoints
The problem is of course not the logging. The problem is that the ESP is identifying a problem as an error (as indicated by the logs), but the response is returned as OK to the client (again indicated by the logs). I want this error that the ESP somehow detects to bubble up all the way to the client.

qiwz...@google.com

Sep 17, 2018, 1:36:35 PM
to Google Cloud Endpoints
According to the gRPC protocol (https://github.com/grpc/grpc/blob/master/doc/PROTOCOL-HTTP2.md), gRPC always responds with HTTP status ":status 200". The actual status is carried in "grpc-status" in the trailers. ESP implements it accordingly; the code is here: https://github.com/cloudendpoints/esp/blob/master/src/nginx/error.cc#L157

Nginx only cares about the HTTP status, so it logs 200. If your client is a gRPC client, it should not read the HTTP status; its status should come from grpc-status in the trailers.
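For anyone following along, here is a small illustrative sketch (mine, not from the thread) of what that means for a Go client: the real status is decoded from the grpc-status / grpc-message trailers via the error returned by Recv, never from the HTTP/2 :status pseudo-header. The pb import path and ReadResponse type are again hypothetical placeholders.

package client

import (
	"io"
	"log"

	"google.golang.org/grpc/status"

	pb "example.com/yourproject/api" // hypothetical generated gRPC stubs
)

// drainStream reads a server-streaming response to completion and logs the
// real gRPC status. The HTTP/2 :status is 200 even when the call fails; the
// outcome arrives in the grpc-status / grpc-message trailers, which grpc-go
// exposes through the error returned by Recv.
func drainStream(stream interface{ Recv() (*pb.ReadResponse, error) }) {
	for {
		_, err := stream.Recv()
		if err == io.EOF {
			return // trailers carried grpc-status 0 (OK)
		}
		if err != nil {
			st, _ := status.FromError(err) // decodes grpc-status / grpc-message
			log.Printf("stream aborted: code=%s message=%q", st.Code(), st.Message())
			return
		}
		// handle the received message here
	}
}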

Thanks
-Wayne

Torbjorn Vik

Dec 18, 2018, 10:59:59 AM
to Google Cloud Endpoints
Hi Wayne,

I completely missed your response here, this is really helpful. Thanks!

T.