I have services running on Kubernetes. I have the Prometheus Operator installed (via Helm), and the services use Python 2.7. I should also mention that we are using gevent.
The code is as follows:
    def setup(self, contextId, port, isTest=False):
        if self._workerThread is not None:
            logging.error(u"Metrics already setup")
            return
        logging.info(u"Starting metrics listener on port {}".format(port), extra=makeExtraFromContextId(contextId))
        self._port = port
        self._isTest = isTest
        if not self._isTest:
            self._workerThread = Thread(target=self._httpWorkerFunction, args=[contextId, self._event])
            self._workerThread.start()

    def _httpWorkerFunction(self, contextId, event):
        try:
            app = make_wsgi_app()
            httpd = make_server('', self._port, app)
        except Exception as ex:
            logging.exception(u"Failed to start metrics listener on port {} - {}".format(self._port, ex.message),
                              extra=makeExtraFromContextId(contextId))
            return
        while not event.is_set():
            try:
                httpd.handle_request()
            except Exception as ex:
                logging.exception(u"Exception in metrics listener request handle {}".format(ex.message))
Everything was working fine until I suddenly got an alert that some of the services were down.
These are two different services, one with 2 pods and the other with 3. One pod of the first service and two pods of the second stopped responding.
I tried calling the metrics endpoint myself from within the other pods, but the request just hangs and eventually times out.
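A probe along these lines can reproduce the check. This is only a minimal sketch: the function name `probe_metrics`, the `/metrics` path, and the default timeout are assumptions for illustration, and the pod IP and port have to be filled in. It separates "nothing is listening" from "the connection is accepted but no response ever comes back", which is the symptom I am seeing.

```python
import socket


def probe_metrics(host, port, path="/metrics", timeout=5.0):
    """Hypothetical helper: report how a metrics endpoint reacts to a GET.

    Returns "refused", "connect-timeout", "read-timeout", "closed",
    "error", or the first line of the HTTP response.
    """
    try:
        # Only the TCP handshake happens here; the kernel can complete it
        # even if the application never calls accept().
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.timeout:
        return "connect-timeout"
    except socket.error:
        return "refused"

    try:
        request = "GET {} HTTP/1.0\r\nHost: {}\r\n\r\n".format(path, host)
        sock.sendall(request.encode("ascii"))
        data = sock.recv(1024)  # blocks until data arrives or the timeout expires
    except socket.timeout:
        return "read-timeout"
    except socket.error:
        return "error"
    finally:
        sock.close()

    if not data:
        return "closed"
    # First line of the response, e.g. the HTTP status line
    return data.split(b"\r\n", 1)[0].decode("ascii", "replace")
```

Called as `probe_metrics(pod_ip, port)` from another pod, a result of "read-timeout" would mean the socket is still listening but the worker loop is no longer handling requests, while "refused" would mean the listener is gone entirely.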
I haven't restarted them yet so that we can investigate, but I really have no idea what happened.
Any help would be much appreciated.