Monitoring third-party services from user-facing clients

rile...@gmail.com

Mar 1, 2017, 9:55:35 PM

to Prometheus Developers

Ahoy!

We run a web app that uses some third-party services directly from the browser. For example, our users talk to Firebase directly. Because none of our servers are involved, we have no mechanism for understanding what e.g. the average latency is. Ouch! It is starting to be a real pain for us to be guessing at this and other operational metrics for opaque services.

So, we are contemplating writing code that will live in our user-facing app, collecting metrics. Every X seconds it would send all of its metrics to a new aggregator service, which we would also create. Then we'd have Prometheus scrape the aggregator and get some idea of our actual users' experiences with latency to the various services that are otherwise completely opaque to us!
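As a rough illustration of the aggregator half of this idea, here is a minimal TypeScript sketch. It assumes each browser sends a bundle of counter increments keyed by metric name; the `Bundle` shape, the `Aggregator` class, and the metric names are hypothetical and not taken from any existing project. It sums incoming bundles and renders the totals in the Prometheus text exposition format so they could be served on a scrape endpoint.

```typescript
// Hypothetical bundle shape: counter increments accumulated by one browser
// since its last send, keyed by metric name.
type Bundle = { [metric: string]: number };

// Prometheus metric names must match this pattern.
const METRIC_NAME = /^[a-zA-Z_:][a-zA-Z0-9_:]*$/;

class Aggregator {
  // Null-prototype object so names like "constructor" behave as plain keys.
  private totals: { [metric: string]: number } = Object.create(null);

  // Add one client's bundle into the running totals.
  ingest(bundle: Bundle): void {
    for (const name of Object.keys(bundle)) {
      const delta = bundle[name];
      // Clients are untrusted: skip invalid names and non-counter values.
      if (!METRIC_NAME.test(name)) continue;
      if (typeof delta !== "number" || !isFinite(delta) || delta < 0) continue;
      this.totals[name] = (this.totals[name] || 0) + delta;
    }
  }

  // Render the totals as Prometheus text exposition format.
  render(): string {
    const lines: string[] = [];
    for (const name of Object.keys(this.totals).sort()) {
      lines.push("# TYPE " + name + " counter");
      lines.push(name + " " + this.totals[name]);
    }
    return lines.join("\n") + "\n";
  }
}

// Two browsers report requests made and total seconds spent waiting.
const aggregator = new Aggregator();
aggregator.ingest({ firebase_requests_total: 3, firebase_request_seconds_total: 0.6 });
aggregator.ingest({ firebase_requests_total: 2 });
console.log(aggregator.render());
```

With totals like these, average latency could be derived in Prometheus by dividing the rate of the seconds counter by the rate of the request counter. A real service would probably also restrict metric names to a known allowlist, since any client can post anything.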

We are thinking of open-sourcing a solution with these two parts - an aggregator that can receive bundles of stats from our many concurrent users, and some JS code that can send those bundles. I have three questions:

1) Does something like this already exist?

2) Does this kind of real-user monitoring with Prometheus basically seem like a good idea? Are we being naive in thinking we can easily aggregate metrics from a million concurrent users and provide the aggregate to Prometheus? (We know we cannot have high-cardinality labels like userId.)

3) This feels like a generalizable problem that many apps might be able to benefit from. I think there are many apps out there that use third-party services like this, basically completely unmonitored. Would anyone else here find this kind of tool useful if we did the work of open-sourcing it? To the maintainers of Prometheus: would you find this kind of thing a resonant addition to the ecosystem of clients & exporters?

Thanks for any guidance!

Riley

Brian Brazil

Mar 1, 2017, 10:43:07 PM

to rile...@gmail.com, Prometheus Developers

On 2 March 2017 at 02:55, <rile...@gmail.com> wrote:

1) Does something like this already exist?

There's https://github.com/outbrain/torch anyway.

2) Does this kind of real-user monitoring with Prometheus basically seem like a good idea? Are we being naive in thinking we can easily aggregate metrics from a million concurrent users and provide the aggregate to Prometheus? (We know we cannot have high-cardinality labels like userId.)

The challenge there is the scale more than anything. 1M concurrent users makes everything complicated.

Brian

--
You received this message because you are subscribed to the Google Groups "Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-developers+unsub...@googlegroups.com.
To post to this group, send email to prometheus-developers@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/92e5a588-69fb-41b7-8a92-f100dea1ca0d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

Brian Brazil

www.robustperception.io

Richard Hartmann

Mar 1, 2017, 11:28:05 PM

to rile...@gmail.com, Prometheus Developers

Scaling to frequent data pushes from n users might become a concern at some point, but with a scalable service this does not seem to be an unsolvable problem.

That being said, even as someone who self-hosts _everything_ for privacy reasons, I would like to enable our devs to gather metrics about the local performance of their stuff and changes to that over time.

Richard

Sent by mobile; excuse my brevity and the wall of text Gmail appends by default.

Stuart Nelson

Mar 2, 2017, 4:17:43 AM

to Richard Hartmann, rile...@gmail.com, Prometheus Developers

There's a wealth of general browser information that I would love to have a frontend exporter/client library for, and a place to send it to so that Prometheus can scrape it :)

cf. https://githubengineering.com/browser-monitoring-for-github-com/

Riley Eynon-Lynch

Mar 2, 2017, 12:00:08 PM

to Prometheus Developers, richih.ma...@gmail.com, rile...@gmail.com

Thanks for the early feedback everyone. I think there's something promising here.

I drew the idea up in a little more detail at https://github.com/peardeck/prometheus-user-monitoring - no implementation yet, but I'd love feedback on the README. If you're interested in something like this existing and don't mind starring the repo on GitHub, we can maybe convince The Business to let us create & maintain this as an OSS project long-term.

It would also be great to hear "I'd never use this because X." We'll definitely build something like this for ourselves but it would be nice to find out early that the solution is not widely useful.

Thanks again!

Julius Volz

Mar 8, 2017, 11:28:59 AM

to Riley Eynon-Lynch, Prometheus Developers, Richard Hartmann, rile...@gmail.com

Yeah, I think it'd be good to have something like this. You'll probably want to send events + timings to the backend immediately as they happen, StatsD-style (because browsers can be closed at any point in time, and also the per-user interaction rate isn't high enough to make local pre-aggregation of events helpful).

StatsD "scales" by only sending every n-th event (randomly) and then considering this factor in the backend. For example, send a counter only every 10th time, but then treat it as an increase by 10 instead of 1 in the backend. It's a bit annoying to manage though, as you have to dynamically adjust this sampling factor depending on how many users / events you have, so that you don't topple over. Not sure if there's a better solution though.
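The sampling scheme described above can be sketched roughly as follows. This is an illustrative TypeScript sketch, not StatsD's actual code: `shouldSend` and `scaledIncrement` are hypothetical names, and the random source is injectable only so the behaviour is easy to check.

```typescript
// Client side: decide whether to send this event.
// sampleRate is the fraction of events sent, e.g. 0.1 means roughly
// every 10th event goes out.
function shouldSend(sampleRate: number, random: () => number = Math.random): boolean {
  return random() < sampleRate;
}

// Backend side: scale a received increment by 1 / sampleRate so the counter
// still estimates the true total. At a rate of 0.1, one received event
// counts as an increase of 10.
function scaledIncrement(value: number, sampleRate: number): number {
  if (!(sampleRate > 0 && sampleRate <= 1)) {
    throw new RangeError("sampleRate must be in (0, 1]");
  }
  return value / sampleRate;
}

// Simulate 100000 events sampled at 10% and estimate the original count.
const rate = 0.1;
let estimated = 0;
for (let i = 0; i < 100000; i++) {
  if (shouldSend(rate)) {
    estimated += scaledIncrement(1, rate);
  }
}
console.log("estimated total events:", Math.round(estimated));
```

The estimate is only approximate: it is unbiased on average, but it gets noisier as the sample rate drops or the event count shrinks.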

Richard Hartmann

Mar 8, 2017, 11:41:52 AM

to Julius Volz, rile...@gmail.com, Prometheus Developers, Riley Eynon-Lynch

The server tells new clients the rate to sample at, and clients send the sample rate along with their metrics. Similar schemes are standard in networking.

Be aware that users can take the info, though. But outlier removal should fix that.
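One possible shape for this, as a hedged TypeScript sketch: the server picks a rate from a traffic budget, and the backend weights each event by the rate the client reports. The `SampledEvent` shape, `chooseSampleRate`, `weightFor`, and the budget numbers are all assumptions made for illustration. Because the reported rate comes back from an untrusted client, the sketch rejects rates the server could never have issued; that is a simple sanity check, not a substitute for outlier removal.

```typescript
// Hypothetical event shape: the client echoes the sample rate it was given.
interface SampledEvent {
  metric: string;
  value: number;
  sampleRate: number;
}

// Server side: pick a rate that keeps expected traffic near a budget.
// unsampledEventsPerSec is an estimate of the traffic with no sampling.
function chooseSampleRate(unsampledEventsPerSec: number, budgetEventsPerSec: number): number {
  if (unsampledEventsPerSec <= budgetEventsPerSec) {
    return 1; // under budget: send everything
  }
  return budgetEventsPerSec / unsampledEventsPerSec;
}

// Backend side: weight an event by 1 / sampleRate, or return null when the
// reported rate is outside the range the server has ever handed out.
function weightFor(event: SampledEvent, minIssuedRate: number): number | null {
  const reported = event.sampleRate;
  if (!(reported >= minIssuedRate && reported <= 1)) {
    return null;
  }
  return event.value / reported;
}

// About 10000 events/s expected against a budget of 1000 events/s.
const issuedRate = chooseSampleRate(10000, 1000);
const weight = weightFor({ metric: "firebase_requests_total", value: 1, sampleRate: issuedRate }, issuedRate);
console.log("issued rate:", issuedRate, "weight per received event:", weight);
```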

Richard

Sent by mobile; excuse my brevity and the wall of text Gmail appends by default.

Riley Eynon-Lynch

Mar 9, 2017, 4:21:53 PM

to Prometheus Developers, juliu...@gmail.com, rile...@gmail.com, ri...@peardeck.com

Thanks for the feedback, everyone. We have completed an MVP of an internal version of the project, with a simple aggregator written as a Node.js service. I hope to put it in production next week, see how it goes (not sure what to expect from Node perf-wise yet), and check back in!

Some use cases that have been exciting us recently, in addition to the metrics that GitHub wrote about, are:

* Maybe we missed a bug on an obscure client configuration that is killing X% of our users. Now we can alarm on a drop in active sessions from 7d ago, approximating active sessions by extrapolating from the rate of metrics we get from clients.

* Maybe a third-party system is newly blocked by popular firewall configs, or maybe the pdf service is failing silently. Now we can alarm on a drop in usage of a particular feature from 7d ago.

* If a user reports a lack of connectivity, having a graph of current success rates might help us respond more usefully.

We haven't been totally general with our internal implementation, but the architecture I outlined in https://github.com/peardeck/prometheus-user-monitoring is indeed really simple to plug into a k8s cluster.

Thanks again!

Riley

Riley Eynon-Lynch

Mar 26, 2017, 2:03:56 PM

to Prometheus Developers, juliu...@gmail.com, rile...@gmail.com, ri...@peardeck.com

Writing back with an update - we've been running our aggregator in production for just over a week, and have found pretty great success! We haven't yet implemented the various suggestions here re. controlling the amount of total traffic, excluding outliers, etc., but as an MVP it's been really useful! We have a baseline for our user experience, as actually measured from their clients, that we can alert on automatically and investigate manually if a customer complains!

We put the source for the MVP up, and have a simple quickstart at https://github.com/peardeck/prometheus-user-metrics#try-it-locally - estimated total time for trying it out is about 5 minutes. I'd love to hear any feedback if you have the time to try it out!