metrics endpoint stopped responding

Nadav Kremer

May 12, 2020, 4:39:28 AM5/12/20
to Prometheus Users

I have services running on Kubernetes. The Prometheus Operator is installed (via Helm), and the services run Python 2.7. I should also mention that we are using gevent.

The code is as follows:

    # Requires (elsewhere in the module):
    #     from threading import Thread
    #     from wsgiref.simple_server import make_server
    #     from prometheus_client import make_wsgi_app
    # makeExtraFromContextId is our own helper that builds the `extra` dict for logging.

    def setup(self, contextId, port, isTest=False):
        if self._workerThread is not None:
            logging.error(u"Metrics already setup")
            return

        logging.info(u"Starting metrics listener on port {}".format(port), extra=makeExtraFromContextId(contextId))

        self._port = port
        self._isTest = isTest

        if not self._isTest:
            # Serve metrics from a background thread so setup() returns immediately.
            self._workerThread = Thread(target=self._httpWorkerFunction, args=[contextId, self._event])
            self._workerThread.start()

    def _httpWorkerFunction(self, contextId, event):
        try:
            app = make_wsgi_app()
            httpd = make_server('', self._port, app)
        except Exception as ex:
            logging.exception(u"Failed to start metrics listener on port {} - {}".format(self._port, ex.message),
                              extra=makeExtraFromContextId(contextId))
            return

        # handle_request() serves one scrape at a time; the loop exits once the event is set.
        while not event.is_set():
            try:
                httpd.handle_request()
            except Exception as ex:
                logging.exception(u"Exception in metrics listener request handle {}".format(ex.message))

Everything was working fine until I suddenly got an alert that some of the services were down.
These are two different services, one with 2 pods and the other with 3; 1 pod of the first service and 2 pods of the second stopped responding.

I tried calling the metrics endpoint myself from within the other pods, but the request just hangs and eventually times out.
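
For reference, this is roughly how I'm probing the endpoint from inside another pod (the service name and port below are placeholders for our real ones):

    import socket
    import urllib2

    # Placeholders: substitute the real service DNS name and metrics port.
    url = "http://my-service:9090/metrics"
    try:
        response = urllib2.urlopen(url, timeout=5)
        print("status {}, {} bytes".format(response.getcode(), len(response.read())))
    except (urllib2.URLError, socket.timeout) as ex:
        # On the affected pods the call never returns data and ends up here after the timeout.
        print("metrics request failed: {}".format(ex))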

I haven't restarted them yet so we can investigate, but I really have no idea what happened.

Any help would be much appreciated

Thomas Frössman

Jul 25, 2020, 11:48:39 AM7/25/20
to Prometheus Users
I have also noted the same issue with ONE  out of many django projects running under uwsgi. We use the django-prometheus package which has built in support for starting one http server per uwsgi process so port 8000-8016 are exporters that represents the 16 processes. Some times one of the prometheus export http servers just stops responding while it's parent process hasn't changed or stopped working. Restarting uwsgi fixes it but within days to weeks one hangs again.  It's probably something a little bit specific to this project since I haven't seen it in any of our other projects.. I don't know if it's the same root cause as your problem but I will investigate it given time.