New to Prometheus and have a few questions regarding first setup


mahony...@gmail.com

Oct 24, 2017, 4:43:22 PM10/24/17
to Prometheus Users
Preamble: So I just started using Prometheus. I got started because my DBA wanted Percona (PMM) set up and it came with all of this bundled in. When I saw the dashboards and so forth I liked it. So now I am setting it up in a testbed/POC we are doing at the moment in AWS.

Before continuing,
i) the AWS setup is in a different account [with VPC peering] and pretty restrictive security group/ports/inbound/outbound rules, so the simpler I can make this the better.
ii) I don't have a huge amount of time to spend on this atm, so easy wins help until I can revisit at a later point in the year.
iii) Autodiscovery looks cool. However we have just started moving to AWS so it is a slow process :P

1. I liked the look of the Percona versions of the Grafana dashboards [System Overview etc.]. I originally hoped to use telegraf as the remote agent for metrics, as it seemed modular and I could just add whatever I wanted. However the labels are obviously different and I just can't spare the time to even look into that. I wanted to use the Percona dashboards so that we could have some homogenization and make it easy for users to get used to them.
Q: Unless I am overlooking something simple? If I use node_exporter I get the machines into these pre-created dashboards, but is there any simple relabelling setup to rename telegraf metrics to their node_exporter equivalents? Or am I being a moron?

2. Currently I am adding targets via static_configs pointed at <host>:<port>. I know the Percona stuff does autodiscovery via consul, but once again time is against me so I can't delve too deep into attempting this, although I want to.
Q: I attempted to rename the __address__ and remove the port [currently using 42000 as that is what PMM uses, and for simplicity it was easy to re-use], so just the hostname is visible. However this didn't work at all. I was kinda hoping I could pull "nodename" from the OS metric outputs and slap that into "target_label: instance" but went down a rabbit hole on that one.

3. When I get more au fait with the system, I plan on using DNS discovery later, with DNS peering across VPCs.
Q: Any gotchas or opinions on this? [Our build scripts populate Route 53 with our instance DNS info]

4. In my environment, I won't *really* be paying attention to these metrics on a 10s [the default, I believe] scrape setting. They are most likely to be used after the fact, or if needed in real time it would be known in advance, allowing me to change the scrape config and reload. However 10s data points would be handy after the fact.
Q: Is it possible to set up the node_exporter to collect every 10s, and collect the data from the node every minute/2 mins etc.? Is there much point to this, as the data size would be the same, other than the number of network connections?

5. Q: Can I collect Java info via the regular node_exporter - i.e. use client_java [which I just discovered on the GitHub page] to pull this data into Prometheus [we currently use Nagios/Cacti on-prem to monitor heap usage, GC etc.] and just have a new dashboard for Java-related stuff? Or will I have to open another port to allow this data to be collected? Creating a single page per account for the devs to see Java stats in a nice graph/dash would be good :)
Or should I be using the jmx_exporter? [I'll probably be looking at the apache_exporter and ha_exporter also when I get time, so extra ports are not a massive issue]

6. Q: [kinda related to 4] Cost in AWS: We use multiple accounts, with a central management account. All instances are in the same region, across 3 AZs, all on private IPs using VPC peering. I understand that cross-AZ data transfer is $0.01/GB. If I have say 30 instances scraped, with 10 in each AZ, we are only looking at a small amount of data per scrape, so the cost should be negligible?

7. Q: Is there any easy way to visually browse the incoming data? I played around with Chronograf when I was looking at InfluxDB for a different project and found it handy enough. Actually, I might look at that tomorrow. :)

I understand some of these questions are probably rudimentary, and I have read the manual, and tried a few things, but I could spend a lot of time barking up the wrong tree, when a few questions here could point me in the right direction.

Thanks for any help,

B


Ben Kochie

Oct 25, 2017, 11:41:27 AM10/25/17
to mahony...@gmail.com, Prometheus Users
On Tue, Oct 24, 2017 at 10:43 PM, <mahony...@gmail.com> wrote:
Preamble: So I just started using Prometheus. I got started because my DBA wanted Percona (PMM) set up and it came with all of this bundled in. When I saw the dashboards and so forth I liked it. So now I am setting it up in a testbed/POC we are doing at the moment in AWS.

Before continuing,
i) the AWS setup is in a different account [with VPC peering] and pretty restrictive security group/ports/inbound/outbound rules, so the simpler I can make this the better.

We recommend running a Prometheus instance inside each VPC; this makes security simpler and also makes sure that VPC networking isn't a problem for monitoring.
 
ii) I don't have a huge amount of time to spend on this atm, so easy wins help until I can revisit at a later point in the year.
iii) Autodiscovery looks cool. However we have just started moving to AWS so it is a slow process :P

1. I liked the look of the Percona versions of the Grafana dashboards [System Overview etc.]. I originally hoped to use telegraf as the remote agent for metrics, as it seemed modular and I could just add whatever I wanted. However the labels are obviously different and I just can't spare the time to even look into that. I wanted to use the Percona dashboards so that we could have some homogenization and make it easy for users to get used to them.
Q: Unless I am overlooking something simple? If I use node_exporter I get the machines into these pre-created dashboards, but is there any simple relabelling setup to rename telegraf metrics to their node_exporter equivalents? Or am I being a moron?

We don't recommend telegraf, because it doesn't fit with the Prometheus best practices.
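
If you did want to keep telegraf for a while, metric_relabel_configs can rewrite individual metric names at scrape time, but the Percona dashboards also rely on node_exporter's exact label sets, so renames alone rarely line up. A rough sketch, where the metric names and the telegraf port are assumptions:

scrape_configs:
  - job_name: 'telegraf'
    static_configs:
      - targets: ['db01.example.internal:9273']
    metric_relabel_configs:
      # Rewrite one telegraf metric name to the node_exporter name a dashboard expects
      # (both names here are illustrative).
      - source_labels: [__name__]
        regex: 'mem_available'
        target_label: __name__
        replacement: 'node_memory_MemAvailable'

You would need one such rule per metric, which is why switching to node_exporter is the simpler path.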

 

2. Currently I am adding targets via static_configs pointed at <host>:<port>. I know the Percona stuff does autodiscovery via consul, but once again time is against me so I can't delve too deep into attempting this, although I want to.
Q: I attempted to rename the __address__ and remove the port [currently using 42000 as that is what PMM uses, and for simplicity it was easy to re-use], so just the hostname is visible. However this didn't work at all. I was kinda hoping I could pull "nodename" from the OS metric outputs and slap that into "target_label: instance" but went down a rabbit hole on that one.

Instance labels are designed to be unique identifiers of a target (in combination with job), so keeping the port there is important. However, many setups use relabel configs to add a `node` label for convenience. This usually works quite well.
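
Something like this is the usual shape (hostname is illustrative, reusing your 42000 port; relabel regexes are fully anchored and `replacement` defaults to `$1`):

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['db01.example.internal:42000']
    relabel_configs:
      # Copy the hostname part of __address__ into a separate `node` label,
      # leaving `instance` (host:port) intact as the unique identifier.
      - source_labels: [__address__]
        regex: '([^:]+):\d+'
        target_label: node

You can then use `node` in dashboards and alerts while Prometheus still keeps a unique instance label per target.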


3. When I get more au fait with the system, I plan on using DNS discovery later, with DNS peering across VPCs.
Q: Any gotchas or opinions on this? [Our build scripts populate Route 53 with our instance DNS info]

DNS discovery works very well; you can populate the various jobs into SRV records so Prometheus can get the address and port of each target. You will likely want a very short TTL, say 5 seconds, for this to behave well. But once you start using short TTLs, you will want a lot of DNS lookup caching. These days I recommend that every node have a localhost DNS cache; something like CoreDNS works well (and has built-in Prometheus metrics).
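
As a rough sketch of the Prometheus side (the SRV record name is an assumption; you would create matching SRV records in Route 53):

scrape_configs:
  - job_name: 'node'
    dns_sd_configs:
      - names:
          - '_node-exporter._tcp.example.internal'
        type: 'SRV'
        refresh_interval: 30s   # how often Prometheus re-resolves the records

Since SRV records carry both host and port, no extra relabelling is needed to build __address__.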
 

4. In my environment, I won't *really* be paying attention to these metrics on a 10s [the default, I believe] scrape setting. They are most likely to be used after the fact, or if needed in real time it would be known in advance, allowing me to change the scrape config and reload. However 10s data points would be handy after the fact.
Q: Is it possible to set up the node_exporter to collect every 10s, and collect the data from the node every minute/2 mins etc.? Is there much point to this, as the data size would be the same, other than the number of network connections?

The way Prometheus is designed to work is that metrics sources, like the node_exporter, are stateless when it comes to collection. They simply hand out the latest metric values on request. They have no concept of internal polling intervals. This makes them much easier to deploy and work well with HA or ad-hoc polling.

Prometheus is totally fine polling many targets every 10-15 seconds; the HTTP connections are very cheap. Prometheus also has a very good compression scheme that keeps the storage use per sample quite low. This isn't something to worry too much about.
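
The collection interval is purely a Prometheus-side setting, globally or per job, so there is nothing to configure on the node_exporter. A minimal sketch (hostname illustrative, reusing your 42000 port):

global:
  scrape_interval: 1m        # default for every job

scrape_configs:
  - job_name: 'node'
    scrape_interval: 10s     # per-job override; this is the only collection interval there is
    static_configs:
      - targets: ['db01.example.internal:42000']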
 

5. Q: Can I collect Java info via the regular node_exporter - i.e. use client_java [which I just discovered on the GitHub page] to pull this data into Prometheus [we currently use Nagios/Cacti on-prem to monitor heap usage, GC etc.] and just have a new dashboard for Java-related stuff? Or will I have to open another port to allow this data to be collected? Creating a single page per account for the devs to see Java stats in a nice graph/dash would be good :)
Or should I be using the jmx_exporter? [I'll probably be looking at the apache_exporter and ha_exporter also when I get time, so extra ports are not a massive issue]

You will want to have every application process expose a port for metrics, either via client_java or the jmx_exporter agent. This is fundamentally how Prometheus works. It keeps the security concerns to a minimum: the /metrics endpoints are designed to be secure by default, as they are read-only and inexpensive to generate.
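
For the jmx_exporter route, the usual pattern is to attach it as a java agent so the JVM itself serves /metrics (jar name, port, and config file below are illustrative):

# The agent exposes Prometheus metrics on :8080 of the same host.
java -javaagent:./jmx_prometheus_javaagent.jar=8080:jmx_config.yaml -jar your_app.jar

client_java is the option when you can change the application code and want custom metrics; the agent is the drop-in option for existing JMX data.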
 

6. Q: [kinda related to 4] Cost in AWS: We use multiple accounts, with a central management account. All instances are in the same region, across 3 AZs, all on private IPs using VPC peering. I understand that cross-AZ data transfer is $0.01/GB. If I have say 30 instances scraped, with 10 in each AZ, we are only looking at a small amount of data per scrape, so the cost should be negligible?

As I said above, you want to run Prometheus inside each VPC/AZ; unlike Nagios/Cacti, it's super easy to set up. You don't want cross-AZ issues masking or causing monitoring failures.

Either way, all of the standard Prometheus client libraries use compression on metrics output, so network use is minimized.
 

7. Q: Is there any easy way to visually browse the incoming data? I played around with Chronograf when I was looking at InfluxDB for a different project and found it handy enough. Actually, I might look at that tomorrow. :)

Not exactly, but you can use the /graph expression browser to run ad-hoc queries. This is generally how I debug things and work out the query patterns that end up in Grafana.
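
For example, pasting an expression like this into /graph gives a rough per-host CPU usage over the last 5 minutes (the metric name depends on the node_exporter version; in the 0.15.x releases it is `node_cpu`):

100 - avg by (instance) (rate(node_cpu{mode="idle"}[5m])) * 100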
 

I understand some of these questions are probably rudimentary, and I have read the manual, and tried a few things, but I could spend a lot of time barking up the wrong tree, when a few questions here could point me in the right direction.

Thanks for any help,

B



boma...@gmail.com

Oct 26, 2017, 10:18:34 AM10/26/17
to Prometheus Users
Wow Ben, thanks for all that!


>We recommend running a Prometheus instance inside each VPC; this makes security simpler and also makes sure that VPC networking isn't a problem for monitoring.
Good to know, and we will explore this. We did want to be able to correlate data across accounts, but can investigate this method.


>We don't recommend telegraf, because it doesn't fit with the Prometheus best practices.
I read that article and understand. Thanks for the info


>Instance labels are designed to be unique identifiers of a target (in combination with job), so keeping the port there is important. However, many setups use relabel configs to add a `node` label for convenience. This usually works quite well.
Ok. In the end I have actually just used the pmm-admin tool to put my non-DB instances into consul, so they will all have the same labels. Easy win for me, even if it is probably not the best way of doing things in the long run :) I'll revisit when I have time to learn more about Prometheus.


>You will likely want a very short TTL, say 5 seconds, for this to behave well.
We have a pretty static inventory currently. I just want to minimize the work required, so once we stand up a set of systems they get picked up without further config changes.


>They have no concept of internal polling intervals. This makes them much easier to deploy and work well with HA or ad-hoc polling.
Understood. The only thing here is that CPU was running extremely hot in my tests, and then when I checked, the scrape interval was set to 1s by default in PMM. :)


>You will want to have every application process expose a port for metrics, either via client_java or the jmx_exporter agent.
Got it. Hoping to look at a Java exporter today if I get a chance.

Thanks again for everything, I have enough to work with :)

Ben Kochie

Oct 26, 2017, 10:22:10 AM10/26/17
to boma...@gmail.com, Prometheus Users
On Thu, Oct 26, 2017 at 4:18 PM, <boma...@gmail.com> wrote:
Wow Ben, thanks for all that!

>We recommend running a Prometheus instance inside each VPC; this makes security simpler and also makes sure that VPC networking isn't a problem for monitoring.
Good to know, and we will explore this. We did want to be able to correlate data across accounts, but can investigate this method.

This can be done with recording rules and federation: you have one central Prometheus collect summary metrics from the Prometheus in each VPC.
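
A rough sketch, assuming a per-VPC Prometheus reachable at the addresses shown and 2.0-style rule files: each VPC Prometheus evaluates a recording rule, and the central one scrapes only those aggregated series from /federate.

# On each per-VPC Prometheus: a recording rule producing job-level summaries
# (the metric name is illustrative).
groups:
  - name: vpc_summary
    rules:
      - record: job:node_memory_MemFree:sum
        expr: sum by (job) (node_memory_MemFree)

# On the central Prometheus: federate only the job:* summary series
# (the per-VPC Prometheus addresses are assumptions).
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 'prometheus-vpc-a.example.internal:9090'
          - 'prometheus-vpc-b.example.internal:9090'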
 


>We don't recommend telegraf, because it doesn't fit with the Prometheus best practices.
I read that article and understand. Thanks for the info

>Instance labels are designed to be unique identifiers of a target (in combination with job), so keeping the port there is important. However, many setups use relabel configs to add a `node` label for convenience. This usually works quite well.
Ok. In the end I have actually just used the pmm-admin tool to put my non-DB instances into consul, so they will all have the same labels. Easy win for me, even if it is probably not the best way of doing things in the long run :) I'll revisit when I have time to learn more about Prometheus.

>You will likely want a very short TTL, say 5 seconds, for this to behave well.
We have a pretty static inventory currently. I just want to minimize the work required, so once we stand up a set of systems they get picked up without further config changes.

>They have no concept of internal polling intervals. This makes them much easier to deploy and work well with HA or ad-hoc polling.
Understood. The only thing here is that CPU was running extremely hot in my tests, and then when I checked, the scrape interval was set to 1s by default in PMM. :)

Yes, PMM does that; some of the defaults there are basically bordering on application profiling rather than monitoring. There are new features coming in node_exporter and mysqld_exporter to more explicitly support this kind of profiling, but it's not generally useful for what we consider standard monitoring.
 

>You will want to have every application process expose a port for metrics, either via client_java or the jmx_exporter agent.
Got it. Hoping to look at a Java exporter today if I get a chance.

Thanks again for everything, I have enough to work with :)
