Hello,
I would like to ask your opinion about a problem I recently found with how the Python Prometheus client behaves in our system.
Description of the problem
    from prometheus_client import Gauge, values
    from prometheus_client.values import MultiProcessValue

    def test_gauge_all_across_forks(self):
        # Simulate a fork by changing the pid that the value class sees.
        pid = 0
        values.ValueClass = MultiProcessValue(lambda: pid)
        g1 = Gauge('g1', 'help', registry=None)
        g1.set(1)
        pid = 1
        g2 = Gauge('g2', 'help', registry=None)
        g2.set(2)
        # this works fine
        self.assertEqual(1, self.registry.get_sample_value('g1', {'pid': '0'}))
        self.assertEqual(2, self.registry.get_sample_value('g2', {'pid': '1'}))
        # this metric has never been reported from pid:1, thus should not be present
        self.assertIsNone(self.registry.get_sample_value('g1', {'pid': '1'}))
Or, more verbosely, the steps to reproduce:
- multiprocessing environment
- report a metric X from parent process (with pid 0)
- fork
- continue reporting metric X from parent process (with pid 0)
- report a metric Y from child process (with pid 1)
- collect metrics via the normal multiprocess collector
Expectation:
1. metric X is reported with label "pid: 0", non-zero value
2. metric Y is reported with label "pid: 1", non-zero value
3. metric X is NOT reported with label "pid: 1" (i.e. it is not reported from pid 1 - should not be present)
Actual:
1) and 2) hold, 3) does not.
I wondered whether we could fix this on our side by changing the way we report metrics, but I found that this is not possible.
Results of my investigation
As of current master, the `values` list is used to store all metric values that are synced via a memory-mapped dict.
When there is a fork, this list gets copied over to the child process. Then, when the client detects that the pid has changed, it "resets" the memory-mapped file,
goes over all metric values,
and tries to read them from the new file.
But for the parent's metrics there is nothing in the brand-new memory-mapped file. So `self._file.read_value(self._key)` initializes each of them... with 0!
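To make the mechanism easier to follow, here is a toy model of what I described above. It is only a sketch and not the client's actual code: `ToyFile`, `simulate`, `check_pid` and `set_metric` are names I made up, and a plain dict stands in for the memory-mapped file. The only behavior I am assuming is the one described above: reading a missing key creates it with 0.

```python
class ToyFile:
    """Hypothetical stand-in for one process's memory-mapped file."""

    def __init__(self):
        self.data = {}

    def read_value(self, key):
        # Assumption taken from the description above:
        # a key that is missing from the file is created with 0.0.
        if key not in self.data:
            self.data[key] = 0.0
        return self.data[key]

    def write_value(self, key, value):
        self.data[key] = value


def simulate():
    files = {0: ToyFile()}            # pid -> that process's "file"
    state = {"pid": 0, "file_pid": 0}
    values = []                       # all known metric keys; a fork copies this list

    def check_pid():
        # On a pid change, switch to a fresh file and re-read every known value.
        if state["file_pid"] != state["pid"]:
            new_file = files.setdefault(state["pid"], ToyFile())
            for key in values:
                new_file.read_value(key)   # parent metrics get created here as 0.0
            state["file_pid"] = state["pid"]

    def set_metric(key, value):
        check_pid()
        if key not in values:
            values.append(key)
        files[state["pid"]].write_value(key, value)

    set_metric("g1", 1.0)   # reported from the parent (pid 0)
    state["pid"] = 1        # "fork": the child inherits the values list
    set_metric("g2", 2.0)   # reported from the child (pid 1)
    return {pid: dict(f.data) for pid, f in files.items()}


snapshot = simulate()
print(snapshot)
```

In this model the file for pid 1 ends up containing `g1` with value 0.0 even though the child never reported it, which is the same symptom as expectation 3 failing.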
Why is it a problem
Although it is not blocking us from using the client (yet), it creates N (number of parent metrics) times M (number of child processes) extra time series that just
take up space in Prometheus. It will eventually become a problem, though, since the number of metrics keeps growing and can reach some critical mass.
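To give a feeling for the N times M growth, here is a back-of-the-envelope calculation. The numbers are made up for illustration and are not from our deployment, and `phantom_series` is just a hypothetical helper name.

```python
def phantom_series(parent_metrics: int, child_processes: int) -> int:
    """Zero-valued series created if every child re-initializes every parent metric."""
    return parent_metrics * child_processes


# Illustrative numbers only: 200 parent metrics, 16 child processes.
print(phantom_series(200, 16))
```

With these example numbers that is 3200 extra series, and the count grows with every metric added to the parent and every additional child process.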
TL;DR What do you guys think? I cannot think of a decent solution just yet.
Thank you!