Ruby Client pre-fork server support.


Jeffery Utter

Nov 10, 2015, 7:14:10 PM
to Prometheus Developers
There is an outstanding issue on the GitHub issue tracker for the Ruby client library about support for multi-process (forking) servers: https://github.com/prometheus/client_ruby/issues/9 .

I have a potential fix for this here: https://github.com/jeffutter/prometheus-client_ruby/tree/multi-worker . However, since it is a non-trivial change, I wanted to get some feedback first, and the contributing guidelines said to bring that up here.

Right now there are plenty of broken tests, since I re-architected things a little. If this seems like a generally good solution, I'll fix things up and get it ready for a pull request.

Thanks,
Jeff

Uriel Corfa

Nov 11, 2015, 5:02:45 AM
to Jeffery Utter, Prometheus Developers

This is an issue that also affects other languages used in a pre-fork model. There's a multiprocess branch for the Python client too, but it seems like better discovery mechanisms would solve the issue while keeping the clients simple. Has there been any effort in that direction?

Currently, the only way I have to monitor workers behind a pre-forking gunicorn/uwsgi server is to add startup code in each worker that tries to bind to each port in a range until it finds one that is open, and then scrape the worker via that port. It seems really clumsy. But monitoring each worker separately is more likely to help detect issues, so I'd rather stick with that approach than use a registry shared across processes.

Another option would be to have some kind of machine-wide (or user-wide) authority that is easily discoverable by the workers (e.g. it runs on a fixed port on localhost, or provides a Unix socket at a predetermined path) and lets workers self-announce. One easy way to do this would be to have that central authority tell the worker "please export on port 12345" and tell Prometheus "an XYZ worker is now exporting on localhost:12345". I'm not sure that's even possible to do in a portable way without races on port grabbing.

Does anyone have a better solution?

--
You received this message because you are subscribed to the Google Groups "Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-devel...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Brian Brazil

Nov 11, 2015, 5:14:46 AM
to Uriel Corfa, Jeffery Utter, Prometheus Developers
On Wed, Nov 11, 2015 at 10:02 AM, Uriel Corfa <ur...@corfa.fr> wrote:

This is an issue that also affects other languages that are used in a prefork model. There's a multiprocess branch for the python client too but it seems like better discovery mechanisms would solve the issue while keeping clients simple. Has there been any effort in that direction?


Discovery isn't sufficient to handle the problem. What we want is for a multi-process server to have the same semantics as a multi-threaded server. This is a problem with gauges: consider a gauge that records the last time something was processed. If the process that last processed something dies, you'll lose data and the value will go backwards. That implies we need metrics that persist beyond the life of any one process, just as they persist beyond the life of any one thread.

Brian 
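Brian's gauge example can be sketched in a few lines of Ruby (the pids, values, and aggregation rule here are purely illustrative, not the client_ruby API):

```ruby
# Each worker keeps a "last processed" timestamp in its own memory;
# the scrape reports the most recent timestamp across live workers.
worker_gauges = {
  1001 => 100, # worker pid => timestamp (seconds) of its last processed item
  1002 => 250, # this worker processed the most recent item
}

aggregate = ->(gauges) { gauges.values.max }

before = aggregate.call(worker_gauges) # 250

# Worker 1002 dies; its in-memory gauge dies with it ...
worker_gauges.delete(1002)

# ... and the exposed value goes backwards, even though nothing was "un-processed".
after = aggregate.call(worker_gauges)  # 100
```

Persisting the gauge beyond the life of any one process, as Brian suggests, is exactly what keeps `after` from dropping below `before`.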


Matthias Rampke

Nov 11, 2015, 8:42:56 AM
to Jeffery Utter, Prometheus Developers
I'm not sure using PStore is the right thing to do here. If I understand it correctly, this will incur disk I/O for pretty much every metric increment, which won't be sustainable in a high-throughput environment. Additionally, you'll clutter TMPDIR with all those files.

At the very least, the persistence/multi-process support should be optional, off by default, and equipped with a warning (and the means) to make sure these files go to an in-memory filesystem.

/MR
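The behaviour Matthias is worried about can be seen with Ruby's stdlib PStore directly: every update is a full transaction that rewrites the backing file. The path and metric name below are just examples:

```ruby
require 'pstore'
require 'tmpdir'

# Every increment below is a full PStore transaction: PStore re-reads the
# file, applies the change, and atomically rewrites it -- i.e. file I/O on
# every metric update unless the path lives on a tmpfs.
path  = File.join(Dir.mktmpdir, "metrics.pstore")
store = PStore.new(path)

def increment(store, key, by = 1)
  store.transaction do
    store[key] = (store[key] || 0) + by
  end
end

3.times { increment(store, :http_requests_total) }

# Read-only transaction at "scrape time".
total = store.transaction(true) { store[:http_requests_total] } # 3
```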




--
Matthias Rampke
Engineer

SoundCloud Ltd. | Rheinsberger Str. 76/77, 10115 Berlin, Germany | +49 173 6395215

Managing Director: Alexander Ljung | Incorporated in England & Wales with Company No. 6343600 | Local Branch Office | AG Charlottenburg | HRB 110657B

Brian Brazil

Nov 11, 2015, 8:45:32 AM
to Matthias Rampke, Jeffery Utter, Prometheus Developers
On Wed, Nov 11, 2015 at 1:42 PM, Matthias Rampke <m...@soundcloud.com> wrote:
I'm not sure using PStore is the right thing to do here. If I understand it correctly, this will incur disk I/O for pretty much any metrics increment. This won't be sustainable in a high throughput environment. Additionally, you'll clutter TMPDIR with all those files.

At the very least, the persistence/multi-process support should be optional, off by default, and equipped with a warning (and the means) to make sure these go to an in-memory filesystem.

That's pretty much what I'm doing on the Python side; I still have to resolve the issue of hitting disk.

Brian
 


Jeffery Utter

Nov 11, 2015, 9:21:48 AM
to Prometheus Developers, jeff...@sadclown.net
Yeah, it will incur I/O for all metrics; however, most (if not all) modern Linux distros mount /tmp as tmpfs, which is in-memory and thus won't require disk writes.

With my current implementation the PStore support is optional and disabled by default. The implementation still defaults to in-memory hashes.

I agree there should be some warning (perhaps a mention in the README is enough?).

I would love to find a purely in-memory solution that works for pre-forking workers. However, the only solution I have been able to find is the 'raindrops' gem, which would require all counters (with all possible labels) to be known up front, before the server forks.
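The constraint Jeff mentions can be approximated with the stdlib alone: a fixed table of counter slots laid out before the fork, which workers then update in place. This is only a sketch of the constraint, not the raindrops API (raindrops uses shared memory and atomic increments instead of a file), and it is Unix-only because it uses Kernel#fork:

```ruby
require 'tmpdir'

# The full set of counters must exist before the server forks; workers can
# only touch slots allocated up front. Each counter is an 8-byte slot at a
# fixed offset in a shared file.
SLOTS = { http_requests_total: 0, http_errors_total: 1 } # fixed pre-fork
path = File.join(Dir.mktmpdir, "slots.bin")
File.binwrite(path, "\0" * (SLOTS.size * 8))

def incr(path, slot, by = 1)
  File.open(path, "r+b") do |f|
    f.flock(File::LOCK_EX)            # serialize the read-modify-write
    f.seek(slot * 8)
    value = f.read(8).unpack1("Q")
    f.seek(slot * 8)
    f.write([value + by].pack("Q"))
  end                                 # closing the file releases the lock
end

pids = SLOTS.values.map do |slot|
  fork { 5.times { incr(path, slot) } } # each "worker" bumps its counter
end
pids.each { |pid| Process.wait(pid) }

counts = SLOTS.transform_values do |slot|
  File.open(path, "rb") { |f| f.seek(slot * 8); f.read(8).unpack1("Q") }
end
```

A new counter (or a new label combination) would need a slot that does not exist in the pre-fork layout, which is exactly the limitation described above.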



Shane Hansen

Nov 11, 2015, 11:58:40 AM
to Jeffery Utter, Prometheus Developers
I can see a couple of options that would make sense and be pluggable into any Python/Ruby framework.

1. Use shared memory + a simple binary format.
2. Use a long-running helper process and send metrics to it, à la statsd.
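Shane's second option is essentially the statsd model: workers fire counter deltas at a long-running aggregator over UDP. A minimal sketch with stdlib sockets; the "name:delta" wire format is invented here for illustration:

```ruby
require 'socket'

# Long-running aggregator: binds a UDP port and sums incoming deltas.
server = UDPSocket.new
server.bind("127.0.0.1", 0)          # port 0 = pick any free port
port = server.addr[1]

# A "worker" sends one datagram per increment -- this per-increment trip
# through the network stack is exactly Brian's objection to this design.
worker = UDPSocket.new
3.times { worker.send("http_requests_total:1", 0, "127.0.0.1", port) }

totals = Hash.new(0)
3.times do
  msg, _addr = server.recvfrom(512)  # blocking read; fine for a sketch
  name, delta = msg.split(":")
  totals[name] += Integer(delta)
end
```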



Brian Brazil

Nov 11, 2015, 12:09:12 PM
to Shane Hansen, Jeffery Utter, Prometheus Developers
On Wed, Nov 11, 2015 at 4:58 PM, Shane Hansen <shanem...@gmail.com> wrote:
I can see a couple options that would make sense and be pluggable to any python/ruby framework.

1. Use shared memory + a simple binary format.

The challenge there is Windows support, plus the various limits Unix systems put on shared memory; the management around those tends to be clunky. You're also going to get contention.
 
2. Use a long running helper process and send metrics to it ala statsd.

This will have the same problems as statsd: having to pass through the network stack hundreds of times per request is unlikely to end well.

We need something that works like shared memory but can grow over time, is preferably per-process, and continues to exist after subprocess death.

Brian 




Shane Hansen

Nov 11, 2015, 12:32:21 PM
to Brian Brazil, Jeffery Utter, Prometheus Developers
Depending on your definition of "network stack", Unix domain sockets might be an option on Linux/Unix. Windows, however, would require a totally different transport, perhaps something built around named pipes.

There's complexity associated with doing any sort of IPC, but if you want to deal with subprocesses you have to accept IPC, and if you're doing IPC your choices are pretty much shared memory, filesystems, pipes, or sockets. Pick your poison :).

Brian Brazil

Nov 11, 2015, 12:40:50 PM
to Shane Hansen, Jeffery Utter, Prometheus Developers
On Wed, Nov 11, 2015 at 5:32 PM, Shane Hansen <shanem...@gmail.com> wrote:
Depending on your definition of "network stack", unix domain sockets might be an option on linux/unix. However windows would require a totally different transport in that case. Perhaps something around named pipes.

There's complexity associated with doing any sort of IPC, but assuming you want to deal with subprocesses, then you have to accept IPC and if you are doing IPC then your choices are pretty much: shared memory, filesystems, pipes, or sockets. Pick your poison ; ).

It'd be best to avoid anything that hits the kernel, which leaves us with shared memory and mmapped files. I'm hopeful about the mmapped-files approach.




Brian Brazil

Nov 11, 2015, 1:40:26 PM
to Jeffery Utter, Prometheus Developers
On Wed, Nov 11, 2015 at 6:36 PM, Jeffery Utter <jeff...@sadclown.net> wrote:
As far as mmapped files go in Ruby, there is one library, last updated six years ago, that doesn't work on Ruby 2.1, so I don't know whether that will be a good solution. Also, as far as I know there are no viable shared-memory solutions for Ruby; searching for ways to share memory between workers in Ruby only turns up relatively complex IPC systems like DRb or cod.

I still don't think a file-based solution on an in-memory filesystem will be terrible.

In Python I'm looking at BDB and friends, which is a little messy due to Python 2/3 changes. I'd rather not have to come up with our own file format. Having a file per process avoids most of the IPC aspects.

Brian




Shane Hansen

Nov 11, 2015, 1:53:36 PM
to Brian Brazil, Jeffery Utter, Prometheus Developers
I'm more familiar with Python land, where you can use things like ctypes to get a basically pure-Python mmap implementation. I'd expect you can do something similar in Ruby land. The FFI bindings below are fairly recently maintained.

Jeffery Utter

Nov 11, 2015, 10:52:50 PM
to Prometheus Developers, brian....@robustperception.io, jeff...@sadclown.net
From what I gather (and I am far from an expert on this topic), mmap only works on *nix systems and would not work on Windows. Moreover, it basically gives you a shared block of memory that is entirely unmanaged? It would require writing a C library to use this memory and provide some sort of mutex/lock-like functionality? The benefit is that it is about the fastest way to share anything between processes?

I am not opposed to this solution; however, it is beyond my experience and probably not something I have time to learn deeply enough to implement something useful. Are there other potential solutions that might not be too daunting? Is it reasonable to provide multiple adapters for now: one memory/hash-backed, one backed by some sort of file (PStore or otherwise), and maybe a Redis one?

I realize disk or network I/O may not be optimal for some very high-throughput applications, but in a typical Ruby app you are already doing plenty of disk/network I/O. I imagine the overhead would be negligible until there is a better solution.

Brian Brazil

Nov 12, 2015, 5:49:41 AM
to Jeffery Utter, Prometheus Developers
On Thu, Nov 12, 2015 at 3:52 AM, Jeffery Utter <jeff...@sadclown.net> wrote:
From what I gather ( and I am far from an expert on this topic ). mmap only works on *nix systems, and would not work on windows.

There's an equivalent on Windows, but we should rely on some library that takes care of all this for us.
 
Moreover it basically gives you a shared block of memory that is entirely un-managed? It would require writing a C library to use this memory and provide some sort of mutex/lock like functionality? The benefit is, is that it is about the fastest way to share anything between processes?

We don't need to actively share data if there's a file per process; only at scrape time do we access all the files.

I am not opposed to this solution, however It is beyond my experience and probably not something I have time to learn about deeply enough to implement something useful. Are there other potential solutions that might not be too daunting? Is it reasonable to provide multiple adapters now, one memory/hash backed, one backed by some sort of file (PStore or otherwise) and maybe a redis one?

I realize disk or network IO may not be optimal for some very high-throughput applications but in a typical Ruby app you are already doing plenty of disk/network IO. I imagine the overhead would be negligible until there is a better solution.

It's not just high throughput we need to worry about; latency is also an issue. If, for example, every metric change involved a disk seek, that would limit you to only a handful of metrics, whereas we want users to be free to use hundreds to thousands of metrics without a second thought.

Brian

 

Jeffery Utter

Nov 12, 2015, 7:29:23 AM
to Brian Brazil, Prometheus Developers
Another solution I'm looking into is using something like 0mq or nanomsg.

I have just started looking into nanomsg, but its docs propose:

"Zero-Copy

While ZeroMQ offers a "zero-copy" API, it's not true zero-copy. Rather it's "zero-copy till the message gets to the kernel boundary". From that point on data is copied as with standard TCP. nanomsg, on the other hand, aims at supporting true zero-copy mechanisms such as RDMA (CPU bypass, direct memory-to-memory copying) and shmem (transfer of data between processes on the same box by using shared memory). The API entry points for zero-copy messaging are nn_allocmsg and nn_freemsg functions in combination with NN_MSG option passed to send/recv functions."

Sounds like that could alleviate some of the concern around latency from hitting the kernel?

On Thu, Nov 12, 2015 at 4:49 AM, Brian Brazil <brian....@robustperception.io> wrote:
On Thu, Nov 12, 2015 at 3:52 AM, Jeffery Utter <jeff...@sadclown.net> wrote:
From what I gather ( and I am far from an expert on this topic ). mmap only works on *nix systems, and would not work on windows.

There's an equivalent on Windows, but we should rely on some library that takes care of all this for us.
 
Yeah.. but someone has to write this library :) 

We don't need to actively share data if there's a file per process. Only at scrape time do access all the files.

By 'files' you mean chunks of memory allocated by mmap? So the parent would allocate a chunk of memory for each worker and pass those down to the workers; the workers would write their stats to this memory, and at scrape time whichever worker gets the request would read the memory allocated to each worker and aggregate it?
 

It's not just high throughput we need to worry about, latency is also an issue. If for example every metric change involved a disk seek that'd limit you to only a handful of metrics, whereas we want users to be free to use hundreds to thousands of metrics without a second thought.

Ahh, this is a very valid point. I didn't really think of multiple metrics; I was sort of thinking they could be batched once per request. However, if every worker is writing to the same file, there could be data loss: two workers read at the same time, one increments and writes, and the other writes later. I think my current implementation locks the file, reads, writes, and unlocks for every increment, which could have this performance problem.
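The batching Jeff describes could look like this: increments go to a plain in-process hash during the request, and one locked read-modify-write flushes them at request end, instead of locking per increment. Illustrative code, not the client_ruby branch; Marshal as the on-disk format is an arbitrary choice:

```ruby
require 'tmpdir'

path = File.join(Dir.mktmpdir, "shared_counters.dump")

# One exclusive lock per request, not per increment; the lock makes the
# read-merge-write cycle atomic across workers sharing the file.
def flush(path, pending)
  File.open(path, File::RDWR | File::CREAT) do |f|
    f.flock(File::LOCK_EX)
    data = f.size.zero? ? {} : Marshal.load(f.read)
    pending.each { |k, v| data[k] = (data[k] || 0) + v }
    f.rewind
    f.write(Marshal.dump(data))
    f.truncate(f.pos)
  end
end

pending = Hash.new(0)                  # per-request, in memory
pending[:http_requests_total] += 1     # cheap: no I/O, no lock
pending[:http_requests_total] += 1
flush(path, pending)                   # single locked write at request end
flush(path, { http_errors_total: 1 })  # a second "request"

final = Marshal.load(File.binread(path))
```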

Brian Brazil

Nov 12, 2015, 7:38:43 AM
to Jeffery Utter, Prometheus Developers
On Thu, Nov 12, 2015 at 12:29 PM, Jeffery Utter <jeff...@sadclown.net> wrote:
We don't need to actively share data if there's a file per process. Only at scrape time do access all the files.

By 'files' you mean chunks of memory allocated by mmap? So the parent would allocate chunks of memory for each worker, and pass those down to the workers. The workers would write their stats to this memory. At scrape time, whichever worker gets the request would read from the memory allocated to each of the workers and aggregate them?

I'm looking at BDB and friends. They appear to have ways to get data to the filesystem via mmap without hitting disk (i.e. without calling fsync), which is what we want. It also avoids having to mess around with custom memory layouts.

Brian




Jeffery Utter

Dec 4, 2015, 11:26:51 PM
to Prometheus Developers, jeff...@sadclown.net
I have been working on another approach to this. I have a rough draft over here: https://github.com/jeffutter/prometheus-client_ruby/tree/aggregate_stats

The idea this time is that nothing changes about the stats stored in memory per process. When the middlewares are called with the persist: true option, then at the end of each Rack request they persist their current state to a file on disk scoped by their parent pid and their own pid.

When the exporter middleware gets hit, it reads all of the files for its children and exports them as individually tagged stats, like so:

# TYPE http_requests_total counter
# HELP http_requests_total A counter of the total number of HTTP requests made.
http_requests_total{method="get",host="127.0.0.1:5000",path="/",code="200",pid="13336"} 489
http_requests_total{method="get",host="127.0.0.1:5000",path="/",code="200",pid="13337"} 526
http_requests_total{method="get",host="127.0.0.1:5000",path="/",code="200",pid="13338"} 485
http_requests_total{method="get",host="127.0.0.1:5000",path="/",code="200",pid="13339"} 500

I think this yields a much better solution than my previous approach, for a few reasons:

  • It is considerably simpler: there is no sharing of memory between workers, which seems incredibly complex with forking workers in Ruby.
  • I/O is minimized compared to a solution that syncs on every counter increment.
  • Prometheus can aggregate the stats between processes (just as it would between discrete servers).
  • You gain visibility into metrics on a per-process basis.
  • This could potentially reveal problem processes that could be individually managed (if one cared to).
If this approach seems like something reasonable for inclusion, let me know and I will clean up the code and add test coverage. Currently I'm using a PStore, as it seems reasonably fast, it works, and I don't have to worry about marshalling the labels/values into strings for GDBM. If there is a more suitable/standard format, let me know.
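The persist-then-export flow described above could be sketched like this (the file naming, Marshal format, and helper names are assumptions for illustration, not the actual aggregate_stats branch code):

```ruby
require 'tmpdir'

dir = Dir.mktmpdir

# Called at the end of a rack request: each worker dumps its counters to a
# file named after its pid.
def persist(dir, pid, counters)
  File.binwrite(File.join(dir, "metrics-#{pid}.dump"), Marshal.dump(counters))
end

# Two "workers" persist their state (pids and values from the sample output).
persist(dir, 13336, { "http_requests_total" => 489 })
persist(dir, 13337, { "http_requests_total" => 526 })

# The exporter reads every per-pid file and emits one pid-labelled series
# per worker, as in the sample exposition above.
lines = Dir.glob(File.join(dir, "metrics-*.dump")).sort.flat_map do |file|
  pid = file[/metrics-(\d+)\.dump/, 1]
  Marshal.load(File.binread(file)).map do |name, value|
    %(#{name}{pid="#{pid}"} #{value})
  end
end
```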
 


Brian Brazil

unread,
Dec 5, 2015, 2:54:44 AM12/5/15
to Jeffery Utter, Prometheus Developers
On Sat, Dec 5, 2015 at 4:26 AM, Jeffery Utter <jeff...@sadclown.net> wrote:
I have been working on another approach to this. I have a rough draft over here: https://github.com/jeffutter/prometheus-client_ruby/tree/aggregate_stats

The idea this time is that the per-process in-memory stats are not modified at all. When the middlewares are called with the persist: true option, they persist their current state at the end of each Rack request to a file on disk, scoped by their parent PID and their own PID.

When the exporter middleware gets hit, it reads all of the files for its children and exports them as individually tagged stats:

# TYPE http_requests_total counter
# HELP http_requests_total A counter of the total number of HTTP requests made.
http_requests_total{method="get",host="127.0.0.1:5000",path="/",code="200",pid="13336"} 489
http_requests_total{method="get",host="127.0.0.1:5000",path="/",code="200",pid="13337"} 526
http_requests_total{method="get",host="127.0.0.1:5000",path="/",code="200",pid="13338"} 485
http_requests_total{method="get",host="127.0.0.1:5000",path="/",code="200",pid="13339"} 500
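A condensed sketch of that persist/export cycle (the file naming and JSON layout here are made up for illustration; the branch's actual format may differ):

```ruby
require 'json'
require 'tmpdir'

DIR = Dir.mktmpdir

# Each worker dumps its in-memory counters to a file named after its parent
# PID and its own PID at the end of a request ...
def persist(dir, ppid, pid, counters)
  File.write(File.join(dir, "metrics_#{ppid}_#{pid}.json"), JSON.dump(counters))
end

# ... and the exporter middleware globs its children's files, tagging each
# sample with the pid it came from.
def export(dir, ppid)
  Dir[File.join(dir, "metrics_#{ppid}_*.json")].flat_map do |path|
    pid = path[/_(\d+)\.json\z/, 1]
    JSON.parse(File.read(path)).map do |labels, value|
      %(http_requests_total{#{labels},pid="#{pid}"} #{value})
    end
  end
end

persist(DIR, 100, 13336, { 'method="get",code="200"' => 489 })
persist(DIR, 100, 13337, { 'method="get",code="200"' => 526 })
lines = export(DIR, 100).sort
```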

I think this yields a much better solution than my previous approach for a few reasons:

I'm doing something pretty similar with Python; in addition, I'm doing what aggregation I can at exposition time. That allows users to reuse rules/dashboards whichever way they run things, and in general you don't want the same metric exposed with different label sets. For counters it's easy; for gauges it's more difficult, as you need to know how the gauge is used. See https://github.com/prometheus/client_python/pull/66/files#diff-52e98fdb8931e88f7d37feb48ba6ded2R39
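For counters, the exposition-time aggregation described here amounts to summing values across PIDs and dropping the pid label; a toy sketch:

```ruby
# Per-PID counter samples (values made up for illustration).
per_pid = {
  { method: 'get', code: '200', pid: '13336' } => 489,
  { method: 'get', code: '200', pid: '13337' } => 526,
}

# Counters are safe to combine: strip the pid label and add the values.
aggregated = per_pid.each_with_object(Hash.new(0)) do |(labels, value), acc|
  acc[labels.reject { |k, _| k == :pid }] += value
end
```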

Brian

Matthias Rampke

unread,
Dec 5, 2015, 7:37:38 AM12/5/15
to Brian Brazil, prometheus-developers, Jeffery Utter

I would also prefer not to expose the pid directly – either aggregate at exposition time, or use a labelling scheme that keeps the label value consistent over time (I'm not sure what such a scheme might be).

My concern is two-fold. For one, per-process exposition blows up the number of metrics and data points considerably. Additionally, if the labels keep changing, that's also constantly producing new time series, assuming the processes are not extremely long-lived.

We restart our Ruby processes a lot, and Prometheus is reasonably good at short-lived time series, but not *that* good.

Apart from that, I think this is a good compromise between simplicity and I/O. To take it even further, what would you think about a minimum write-out interval? For an app serving many short requests it might be better to serve 100ms-old metrics than to write to disk several thousand times a second.
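A minimal sketch of such a write-out throttle (class and method names are hypothetical):

```ruby
# Skip the disk write unless at least `interval` seconds have elapsed since
# the last one, trading a little staleness for far fewer writes.
class ThrottledWriter
  attr_reader :writes

  def initialize(interval: 0.1)
    @interval = interval
    @last_write = nil
    @writes = 0
  end

  def maybe_persist(now = Process.clock_gettime(Process::CLOCK_MONOTONIC))
    return if @last_write && now - @last_write < @interval
    @last_write = now
    @writes += 1 # in the real middleware this would dump state to disk
  end
end

w = ThrottledWriter.new(interval: 0.1)
w.maybe_persist(0.00) # writes
w.maybe_persist(0.05) # skipped: only 50ms elapsed
w.maybe_persist(0.20) # writes again
```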

/mr

Jeffery Utter

unread,
Dec 5, 2015, 8:04:16 AM12/5/15
to Brian Brazil, Prometheus Developers
Isn't it also incredibly complicated, or even impossible, to aggregate the quantile values of summaries? This is what led me to exposing each PID instead of aggregating everything.

From what I can tell, the Quantile library in Ruby throws the original data away as the values are aggregated. You get a result showing the value at the 95th percentile, but when you combine that with the same figure from another process, the "95th percentile" could mean something entirely different.
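A toy illustration of the point: the 95th percentile of the merged observations is not any simple function of the two per-process 95th percentiles (naive nearest-rank percentile, made-up latencies):

```ruby
# Nearest-rank percentile of a raw sample set.
def p95(values)
  sorted = values.sort
  sorted[(0.95 * (sorted.length - 1)).round]
end

worker_a = Array.new(100) { |i| i + 1 }        # latencies 1..100
worker_b = Array.new(100) { |i| (i + 1) * 10 } # latencies 10..1000

combined = p95(worker_a + worker_b)            # true p95 over all data
averaged = (p95(worker_a) + p95(worker_b)) / 2 # naive combination
```

Averaging (or summing, or max-ing) the per-process quantiles gives a different answer than computing the quantile over the pooled observations, because the raw data needed to merge them is already gone.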

Matthias Rampke

unread,
Dec 5, 2015, 8:31:25 AM12/5/15
to Jeffery Utter, prometheus-developers, Brian Brazil

Yes, summaries are inherently not aggregatable (but histograms are). For these cases a per-process label is unavoidable.

So it should really use histograms then … but for now, how about keeping per-process summaries while aggregating counters?
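Histogram buckets, by contrast, merge trivially: cumulative per-bucket counts from each process just add up (bounds and counts below are made up):

```ruby
# Cumulative bucket counts from two worker processes, keyed by upper bound.
worker_buckets = [
  { 0.1 => 30, 0.5 => 45, Float::INFINITY => 50 },
  { 0.1 => 10, 0.5 => 70, Float::INFINITY => 80 },
]

# Aggregating across processes is a plain per-bucket sum.
totals = worker_buckets.each_with_object(Hash.new(0)) do |buckets, acc|
  buckets.each { |le, count| acc[le] += count }
end
```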

I'd like to hear more opinions though; maybe I'm just wrong and exposing many metrics is the way to go. My concerns are definitely practical, not fundamental, and can be scaled around.

/mr

Brian Brazil

unread,
Dec 5, 2015, 5:33:03 PM12/5/15
to Matthias Rampke, prometheus-developers, Jeffery Utter
On Sat, Dec 5, 2015 at 12:37 PM, Matthias Rampke <m...@soundcloud.com> wrote:

I would also prefer not to expose the pid directly – either aggregate at exposition time, or use a labelling scheme that keeps the label value consistent over time (not sure what such a scheme might be).


For gauges we have no choice but to expose the pid when we don't have hints from the user as to what they're using the gauge for. For counters we can avoid it.

My concern is two-fold. For one, per-process exposition blows up the number of metrics and data points considerably. Additionally, if the labels keep changing, that's also constantly producing new time series, assuming the processes are not extremely long-lived.

We restart our Ruby processes a lot, and Prometheus is reasonably good at short-lived time series, but not *that* good.

Apart from that, I think this is a good compromise between simplicity and I/O. To take it even further, what would you think about a minimum write-out interval? For an app serving many short requests it might be better to serve 100ms-old metrics than to write to disk several thousand times a second.


I thought this was an approach that didn't hit disk at every write? Any mmap approach not calling fsync/fdatasync should be okay on this front.

Brian




Matthias Rampke

unread,
Dec 7, 2015, 4:37:29 AM12/7/15
to Brian Brazil, Jeffery Utter, prometheus-developers
On Dec 5, 2015 11:33 PM, "Brian Brazil"
<brian....@robustperception.io> wrote:
>
> I thought this was an approach that didn't hit disk at every write? Any mmap approach not calling fsync/fdatasync should be okay on this front.

You're right, without sync it will just work. Yay for POSIX semantics!

Thinking about it more, having some metrics per-process and some preaggregated would be quite inconsistent and confusing. Jeffery – I think your approach is sound.

Something I'm still not sure how to handle is the ever-changing PIDs … on one hand, it's the only sane identifier I see at exposition time; but I'd be interested in any approach to relabel it down to a fixed set at ingestion time…

/MR

Brian Brazil

unread,
Dec 7, 2015, 4:44:18 AM12/7/15
to Matthias Rampke, Jeffery Utter, prometheus-developers
On Mon, Dec 7, 2015 at 9:37 AM, Matthias Rampke <m...@soundcloud.com> wrote:
On Dec 5, 2015 11:33 PM, "Brian Brazil"
<brian....@robustperception.io> wrote:
>
> I thought this was an approach that didn't hit disk at every write? Any mmap approach not calling fsync/fdatasync should be okay on this front.

You're right, without sync it will just work. Yay for POSIX semantics!

Thinking about it more, having some metrics per-process and some preaggregated would be quite inconsistent and confusing. Jeffery – I think your approach is sound.

I don't see it as inconsistent. Some metrics we know are safe to combine without any other information (any counter); others require user input.
 

Something I'm still not sure how to handle is the ever-changing PIDs … on one hand, it's the only sane identifier I see at exposition time; but I'd be interested in any approach to relabel it down to a fixed set at ingestion time…

This is why gauge handling in the Python client is user-configurable: some gauges you want to live past the life of the process, others you don't.
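A sketch of such a per-gauge aggregation hint (mode names here are illustrative, not the Python client's actual option names):

```ruby
# Per-PID samples of the same gauge, e.g. in-flight requests per worker.
per_pid = { '13336' => 4, '13337' => 7 }

# The user declares how their gauge should be combined across processes.
def aggregate_gauge(per_pid, mode)
  case mode
  when :sum then per_pid.values.sum # e.g. total in-flight requests
  when :min then per_pid.values.min
  when :max then per_pid.values.max
  when :all then per_pid            # keep one series per pid
  end
end
```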



danielm...@gocardless.com

unread,
Nov 28, 2018, 10:04:26 AM11/28/18
to Prometheus Developers
Hello,

We at GoCardless have taken a stab at this problem, with the intention of not only solving the pre-fork issue, but also bringing the Ruby Client more in line with current best practices.

We've created an [RFC issue](https://github.com/prometheus/client_ruby/issues/94) on the main repo explaining our approach.

We've also created [a PR](https://github.com/prometheus/client_ruby/pull/95) that implements this proposal.

This brings the Ruby Client more in line with current best practices on things like up-front label declaration, and proposes a solution for pre-fork servers.

The way we're working around pre-fork is using files which, thanks to FS caching, are actually much faster than we expected. A counter increment is on the order of just 9μs, compared to 6μs for mmaps, while being a much more stable approach (we found some stability issues with mmaps). There is more information on this and other approaches in our PR.
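For illustration only (not GoCardless's actual store code): a file-backed counter increment doing a read-modify-write of a packed float under an exclusive flock, which the page cache keeps fast since nothing is fsynced:

```ruby
require 'tmpdir'

PATH = File.join(Dir.mktmpdir, 'counter.bin')

# Read-modify-write of a single 8-byte float under an exclusive lock, so
# concurrent worker processes cannot lose increments.
def increment_file_counter(path, by = 1.0)
  File.open(path, File::RDWR | File::CREAT, 0o644) do |f|
    f.binmode
    f.flock(File::LOCK_EX)
    current = f.read.unpack1('d') || 0.0 # nil (empty file) becomes 0.0
    f.rewind
    f.write([current + by].pack('d'))
  end # closing the file releases the lock
end

3.times { increment_file_counter(PATH) }
value = File.binread(PATH).unpack1('d')
```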

As mentioned earlier in this thread, this approach is optional. We implement it by abstracting the storage of data away from the rest of the code, making these "stores" swappable. By default we ship a thread-safe in-memory store, but the user can choose which store to use, and can even implement their own easily.
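The swappable-store idea can be sketched as follows (interface and class names are hypothetical, not the PR's actual API):

```ruby
# Default store: a Mutex-guarded hash, thread-safe within one process.
class SynchronizedStore
  def initialize
    @mutex = Mutex.new
    @values = Hash.new(0.0)
  end

  def increment(key, by = 1.0)
    @mutex.synchronize { @values[key] += by }
  end

  def get(key)
    @mutex.synchronize { @values[key] }
  end
end

# Metrics only talk to the store interface, so any object responding to
# increment/get (file-backed, mmap-backed, ...) can be swapped in.
class Counter
  def initialize(name, store: SynchronizedStore.new)
    @name = name
    @store = store
  end

  def increment(labels = {})
    @store.increment([@name, labels])
  end

  def value(labels = {})
    @store.get([@name, labels])
  end
end

c = Counter.new(:http_requests_total)
2.times { c.increment(code: '200') }
```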

The PR also includes a benchmarking script for the different data stores, for anyone wanting to develop their own, and we have a *lot* more information on our approach, more benchmarks, an MMap Store, etc in a separate repo (so as to not clutter the main one with unnecessary stuff):
https://github.com/gocardless/prometheus-client-ruby-data-stores-experiments

A note on the PR: it changes a lot of things and is a pretty big refactor, so it is correspondingly large. However, it's written as a coherent list of commits with descriptive messages. I recommend going through it commit by commit (rather than the whole diff at once) and reading the commit descriptions, since they generally explain the reasoning behind each change, the tradeoffs we evaluated, etc.

On the topic of mmaps: we haven't ruled them out as an option, and we even have a small gem in our second repo that would allow anyone to add the MMapStore easily to their project and test/improve it. That said, we consider that the performance gain probably does not justify the extra risk for most Ruby projects, and we chose to err towards releasing something usable and stable earlier. However, we welcome all contributions on this!

Our "experiments" repo explains in more detail the specific stability problems we encountered and our work around them.

We'd love your comments on this!

Thank you
Daniel Magliola

Ben Kochie

unread,
Nov 28, 2018, 1:24:11 PM11/28/18
to danielm...@gocardless.com, prometheus...@googlegroups.com
We also made our own fork of the Ruby client with pre-fork support. Sorry I haven't been able to budget the time to discuss this more upstream.


Our version uses a forked and modified mmap C code to reduce overhead of handling metrics between processes. It does most of the heavy lifting in C, rather than Ruby. We're able to scrape 40k metrics (large number of per-action+controller histograms) in under 1s.

I would love to combine our work and get the official Ruby client to a place where neither of us need a fork. :-)


danielm...@gocardless.com

unread,
Nov 29, 2018, 11:40:55 AM11/29/18
to Prometheus Developers
Hello Ben!

Yeah, we saw your work; it looks pretty impressive. And we're definitely hoping our work gets upstreamed.

Once that happens, the good news is that it would be very easy to also mix in what you've done: you can make a new data store and provide it as an alternative for people who need that kind of performance and can accept the few limitations (like no Windows support, for example).

Take a look at our PR if you want, particularly the "data_stores" directory which documents quite extensively how one would add a new store, and also the "spec/benchmarks" one, which you can use to compare performance between different stores.

Daniel