getaddrinfo(3) query for "www.googleapis.com" fails under a moderate load on a GCE instance

Vladimir Roubtsov

Sep 12, 2017, 12:58:02 PM
to gce-discussion
[cross-posted from cloud-dns-discuss, hoping to get more traction in this group]

There are bug reports you can find scattered around different SDKs that mention libcurl errors like

"Could not resolve host: www.googleapis.com; Name or service not known"

Although a retry will frequently work around and mask this issue, it appears that something is wrong with the GCE DNS/resolver infrastructure: there is a flaw that is not libcurl-specific but can be reproduced via parallel getaddrinfo(3) queries under a moderate instance load. What's more, this failure seems to begin only when resolving the "www.googleapis.com" name. It never happens with "www.google.com". After it occurs with "www.googleapis.com", it seems to "stick" and become reproducible with many other external addresses.

Steps to reproduce:

1. On a Linux GCE instance (I have been using CentOS 7, uname -a: "Linux ... 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 15:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux"), compile the attached file

>g++ -o crash_addrinfo main.cpp -lpthread

2. Run ~200 (*) threads querying www.google.com, e.g.:

>./crash_addrinfo www.google.com 80 200
(... all threads start, nothing bad happens for a long time...)

3. now change the name to www.googleapis.com:

>./crash_addrinfo www.googleapis.com 80 200
(almost immediately lots of threads fail with EAI_NONAME):
...
started
started
started
*** rc -2 Name or service not known
*** rc -2 Name or service not known
*** rc -2 Name or service not known
*** rc -2 Name or service not known
*** rc -2 Name or service not known
*** rc -2 Name or service not known
...

With my production binaries, I've used ltrace to confirm that it is the getaddrinfo() libc call that's failing. I have run the above tests on two other, virtually identical, CentOS images outside of Google Cloud and the issue does not reproduce anywhere else.

Please advise,
Vlad

(*) It is possible to see the issue with fewer threads, e.g. 20, but the probability is much higher, essentially 100%, with 100-200.
main.cpp

Navi Aujla (Google Cloud Support)

Sep 12, 2017, 3:22:25 PM
to gce-discussion
Hello Vlad, 

Thank you for providing details along with the steps to reproduce the issue. However, this forum is meant for general discussion of the Google Cloud Platform.

For this reported problem, please open an issue in the public issue tracker [1] and we will verify it and work on it.

[1] https://issuetracker.google.com

Vladimir Roubtsov

Sep 12, 2017, 3:32:23 PM
to gce-discussion

Vladimir Roubtsov

Sep 20, 2017, 11:23:56 AM
to gce-discussion
bump

Paul Nash

Sep 20, 2017, 4:05:11 PM
to Vladimir Roubtsov, gce-discussion
The issue is filed, and will be triaged by the team, and any updates will be provided on the issue you filed (thanks for that). Unfortunately I don't think we can guarantee a specific ETA. FWIW, we're not aware of any widespread reports of issues like this, which factors into the order in which the team can engage on issues.

One question because I'm curious -- you reported this specifically on CentOS 7. Since you have the "repro" case, I'm curious if you have/could try this on another OS, like a current Ubuntu image? If you have any extra info along these lines, please add it to the issue so our engineers can see it.

Thanks,
-P

--

Paul R. Nash | Group Product Manager, Compute Engine | paul...@google.com | 206-876-1620

Vladimir Roubtsov

Sep 20, 2017, 4:22:26 PM
to gce-discussion
Paul, thanks for your follow-up.

Before I filed my issue with the repro, I searched for similar error messages, and there are scattered reports by users of various SDKs that rely on libcurl. There are quite a few, and they don't get much attention because they are sporadic and go away after some retries. It is also possible that the SDKs (which, sadly, I can't use for C++ work) throttle the load under the covers and make the issue even less likely. However, it is hard to imagine why any node within GCP should fail to resolve "www.googleapis.com", ever, and I was hoping this fact would attract more attention.

As for CentOS 7 vs something else, my report already states that I've tried two different CentOS 7 machines outside of GCP and could not reproduce the issue there. It is very likely specific to GCP. Regarding other Linux types entirely, I am not a trial GCP user and would incur additional charges by doing what you suggest.



Vladimir Roubtsov

Oct 17, 2017, 6:00:55 PM
to gce-discussion
More than a month has passed, and the issue has been completely ignored both here and in the issue tracker. I am out of here.