I originally filed this as a bug, since that is where I thought it belonged, but I was told to ask here. Hopefully someone can help me out. Basically, Prometheus decided at a random point to stop collecting data from all inputs into a custom exporter we created for Spectrum Performance Center, except one. I thought it was due to the duplicate HELP and TYPE lines in the output (we are fixing this now), but the devs on GitHub do not think that is the issue. Anyway, stopping and starting Prometheus sometimes fixes the issue, but deleting the Prometheus data directory definitely fixes it. That is worrying, because it means we have lost data. This is in a dev environment for now, but it will be prod eventually. Nothing came out in the logs, and Prometheus showed the scrape job as functioning just fine. Going to the exporter directly shows that it is still working fine as well.
The way our custom exporter works is that you put the router you want to query in the URL path; the exporter queries Spectrum Performance Center for that router and emits the data in Prometheus-friendly format on the fly. So if I want myrouter1 and myrouter2, I create two jobs in Prometheus with different metrics_path values. See the config below for more detail.
Some added detail: we are running only one Prometheus container in AWS ECS, with the data directory presented to the container as a volume. Since we have multiple ECS instances, each one mounts the same EFS mount point, and the Prometheus data directory lives on EFS. That way, if the container or instance goes down, Prometheus can restart in a different location using the same data, and we have not had any issues with this setup.
What did you do?
Wrote a custom exporter for Spectrum Performance Center that emits duplicate HELP and TYPE statements. The exporter is being fixed on our end, but Prometheus scraped the data without complaint. Eventually data stopped being collected, even though Prometheus still showed the scrape as healthy.
How the exporter is called:
http://localhost:8001/int/myrouter1
Output:
# HELP interface_bw Interface BW for Gi1
# TYPE interface_bw summary
interface_bw_in_bits{device="myrouter1",interface="Gi1",} 11
interface_bw_out_bits{device="myrouter1",interface="Gi1",} 16
# HELP interface_bw Interface BW for Gi2
# TYPE interface_bw summary
interface_bw_in_bits{device="myrouter1",interface="Gi2",} 1477
interface_bw_out_bits{device="myrouter1",interface="Gi2",} 125709
# HELP interface_bw Interface BW for Gi3
# TYPE interface_bw summary
interface_bw_in_bits{device="myrouter1",interface="Gi3",} 132831
interface_bw_out_bits{device="myrouter1",interface="Gi3",} 8135
Discovered the issue when promtool choked on the duplicate HELP and TYPE lines.
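For what it's worth, the output above has three problems promtool would flag: the HELP/TYPE lines declare a family named interface_bw while the samples are named interface_bw_in_bits and interface_bw_out_bits, the type should be gauge rather than summary for values like these, and there is a trailing comma inside the label set. This is a hypothetical sketch (not our actual SPC exporter code, whose language I haven't shown) of rendering the same data with exactly one HELP/TYPE pair per metric family:

```python
# Hypothetical rendering helper, NOT the real SPC exporter: it takes a device
# name and a dict of interface -> (in_bits, out_bits) and emits Prometheus
# text format with one HELP/TYPE pair per family, gauge types, and no
# trailing comma in the label set.

def render_metrics(device, interfaces):
    """interfaces maps interface name -> (in_bits, out_bits)."""
    families = [
        ("interface_bw_in_bits", "Inbound interface bandwidth in bits", 0),
        ("interface_bw_out_bits", "Outbound interface bandwidth in bits", 1),
    ]
    lines = []
    for name, help_text, idx in families:
        # Emit HELP/TYPE exactly once per family, before all of its samples.
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        for ifname in sorted(interfaces):
            value = interfaces[ifname][idx]
            lines.append(f'{name}{{device="{device}",interface="{ifname}"}} {value}')
    return "\n".join(lines) + "\n"

print(render_metrics("myrouter1", {"Gi1": (11, 16), "Gi2": (1477, 125709)}))
```

With all interfaces grouped under a single HELP/TYPE per family, the output should pass `promtool check metrics`.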
What did you expect to see?
Prometheus should fail the scrape job and put something in the logs.
What did you see instead? Under which circumstances?
Prometheus went along fat, dumb, and happy until, at some point after a few days, it stopped collecting data for myrouter1.
Environment
System information:
Linux 3.10.0-693.5.2.el7.x86_64 x86_64
Prometheus version:
prometheus, version 2.1.0 (branch: HEAD, revision: 85f23d8)
build user: root@6e784304d3ff
build date: 20180119-12:01:23
go version: go1.9.2
Alertmanager version:
Prometheus configuration file:
# my global config
global:
  scrape_interval: 300s     # Set the scrape interval to every 5 minutes. Default is every 1 minute.
  evaluation_interval: 60s  # Evaluate rules every 1 minute (also the default).
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'myrouter1'
    # scheme defaults to 'http'.
    metrics_path: /int/myrouter1
    static_configs:
      - targets: ['SPCexporter:8001']
    basic_auth:
      username: nunya
      password: bidness

  - job_name: 'myrouter2'
    # scheme defaults to 'http'.
    metrics_path: /int/myrouter2
    static_configs:
      - targets: ['SPCexporter:8001']
    basic_auth:
      username: nunya
      password: bidness

  - job_name: 'aws-gis-transit-prod-dx'
    static_configs:
      - targets: ['cloudwatch-exporter:9100']

  - job_name: 'Grafana Stats'
    scrape_interval: 5s
    static_configs:
      - targets: ['mygrafana']
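As an aside, and not related to the collection problem itself: the two per-router jobs could be collapsed into a single job by listing the router names as targets and rewriting `__metrics_path__` and `__address__` with relabeling. This is a sketch using standard Prometheus relabel_configs, assuming the exporter stays at SPCexporter:8001 (the job name spc-routers is made up; the device label from the exporter still distinguishes the routers):

```yaml
  - job_name: 'spc-routers'
    static_configs:
      - targets: ['myrouter1', 'myrouter2']   # router names, not real addresses
    basic_auth:
      username: nunya
      password: bidness
    relabel_configs:
      # Turn the router name into the per-router metrics path, e.g. /int/myrouter1.
      - source_labels: [__address__]
        target_label: __metrics_path__
        replacement: /int/$1
      # Keep the router name as the instance label for readability.
      - source_labels: [__address__]
        target_label: instance
      # Actually scrape the exporter.
      - target_label: __address__
        replacement: 'SPCexporter:8001'
```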
Logs: n/a; nothing came out in the logs.
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscribe@googlegroups.com.
To post to this group, send email to prometheus-users@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/90566236-2283-42ec-be53-3f6d1fddb338%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
OK, it's happening again. I am attaching a picture of scrape_samples_scraped for the last 24 hours, and I am also including the current output of the exporter. The norm is 6 samples. Any ideas? I have Prometheus set to debug logging and I see nothing in the logs.
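Since the normal value is 6 samples per scrape, one stopgap while this is being debugged could be an alerting rule on scrape_samples_scraped so the drop at least gets flagged. A sketch, assuming the job names from my config above and a made-up rule/alert name:

```yaml
groups:
  - name: scrape-health
    rules:
      - alert: SPCSamplesDropped
        # scrape_samples_scraped is a synthetic series Prometheus records per target.
        expr: scrape_samples_scraped{job=~"myrouter.*"} < 6
        for: 15m
        annotations:
          summary: "{{ $labels.job }} is returning fewer samples than expected"
```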