Bug reports scattered across various SDKs mention libcurl errors such as:
"Could not resolve host: www.googleapis.com; Name or service not known"
Although a retry frequently works around and masks this issue, something appears to be wrong with the GCE DNS/resolver infrastructure: the flaw is not libcurl-specific and can be reproduced with parallel getaddrinfo(3) queries under moderate instance load.
What's more, the failure seems to begin only when resolving the name "www.googleapis.com"; it never happens with "www.google.com".
Once it has occurred with "www.googleapis.com", it seems to "stick" and becomes reproducible with many other external addresses.
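For readers without access to the attachment, here is a minimal sketch of what "parallel getaddrinfo(3) queries" could look like. It is not the attached main.cpp: the names LookupStats and run_parallel_lookups, the hints used, and the per-thread iteration count are assumptions made for illustration. Only the error line format is modelled on the output shown in this report.

```cpp
// Hypothetical sketch of a parallel getaddrinfo() stress test.
// NOT the attached main.cpp; names and parameters are illustrative only.
#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>

#include <atomic>
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

struct LookupStats {
    int ok;      // lookups that returned 0
    int failed;  // lookups that returned a non-zero EAI_* code
};

// Resolve `host` from `num_threads` threads concurrently,
// `iterations` times per thread, and count the outcomes.
LookupStats run_parallel_lookups(const std::string& host,
                                 int num_threads,
                                 int iterations) {
    std::atomic<int> ok{0};
    std::atomic<int> failed{0};

    auto worker = [&]() {
        for (int i = 0; i < iterations; ++i) {
            addrinfo hints{};                 // zero-initialized hints
            hints.ai_family = AF_UNSPEC;      // IPv4 or IPv6
            hints.ai_socktype = SOCK_STREAM;

            addrinfo* result = nullptr;
            int rc = getaddrinfo(host.c_str(), nullptr, &hints, &result);
            if (rc != 0) {
                // Same shape as the "*** rc -2 ..." lines in this report.
                std::fprintf(stderr, "*** rc %d %s\n", rc, gai_strerror(rc));
                ++failed;
            } else {
                freeaddrinfo(result);
                ++ok;
            }
        }
    };

    std::vector<std::thread> threads;
    threads.reserve(num_threads);
    for (int t = 0; t < num_threads; ++t) threads.emplace_back(worker);
    for (auto& th : threads) th.join();

    return LookupStats{ok.load(), failed.load()};
}
```

Calling something like run_parallel_lookups("www.googleapis.com", 200, 100) from main() and comparing the failed count against the same call for "www.google.com" would exercise the pattern described above. The thread count of 200 follows the 100-200 range mentioned in the footnote; the iteration count is arbitrary.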
Steps to reproduce:
1. On a Linux GCE instance (I have been using CentOS 7, uname -a: "Linux ... 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 15:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux"), compile the attached file:
>g++ -o crash_addrinfo main.cpp -lpthread
(... all threads start, nothing bad happens for a long time...)
(almost immediately lots of threads fail with EAI_NONAME):
...
started
started
started
*** rc -2 Name or service not known
*** rc -2 Name or service not known
*** rc -2 Name or service not known
*** rc -2 Name or service not known
*** rc -2 Name or service not known
*** rc -2 Name or service not known
...
With my production binaries, I've used ltrace to confirm that it is the getaddrinfo() libc call that is failing. I have also run the above tests on two other, virtually identical CentOS images outside of Google Cloud, and the issue does not reproduce on either of them.
Please advise,
Vlad
(*) It is possible to see the issue with fewer threads, e.g. 20, but the probability is much higher, essentially 100%, with 100-200 threads.