prometheus stops collecting data even though the exporter is still working


joejg...@gmail.com

Feb 12, 2018, 3:11:52 PM2/12/18
to Prometheus Users

I originally filed this as a bug, since that is where I thought it belonged, but I was told to ask here.  Hopefully someone can help me out.  Basically, at some random point Prometheus stopped collecting data from all but one of the inputs into a custom exporter we created for Spectrum Performance Center.  I thought it was due to the fact that we had duplicate HELP and TYPE lines in the output (we are fixing this now), but the devs on GitHub do not think that is the issue.  Anyway, stopping and starting Prometheus sometimes fixes the issue, but deleting the Prometheus data files definitely fixes it.  This is worrying, as it means we have lost data.  This is in a dev environment for now, but it will be prod eventually.  Nothing came out in the logs, and Prometheus showed the scrape job as functioning just fine.  Going to the exporter directly shows that it is still working just fine.


The way our custom exporter works is you put the router you want to query in Spectrum Performance Center in the path, and it queries SPC for that router and outputs the data in Prometheus-friendly format on the fly.  So if I want myrouter1 and myrouter2, I create 2 jobs in Prometheus with different metrics_path values to get the data.  See the config below for more detail.
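The path-per-router idea can be sketched roughly as below. This is a minimal illustration, not the actual exporter: `fetch_from_spc` is a hypothetical stand-in for the real Spectrum Performance Center query, and the sample values are taken from the output shown later in this post.

```python
# Sketch of a path-per-router exporter: the router name comes from
# the request path, and the data for that router is rendered in
# Prometheus text format on the fly.

def router_from_path(path):
    """Extract the router name from a request path like /int/myrouter1."""
    prefix = "/int/"
    if not path.startswith(prefix):
        raise ValueError("unexpected path: %s" % path)
    return path[len(prefix):]

def fetch_from_spc(router):
    # Hypothetical placeholder: the real exporter queries SPC here.
    # Returns (interface, in_bits, out_bits) tuples.
    return [("Gi1", 11, 16), ("Gi2", 1477, 125709)]

def render_metrics(router):
    """Render SPC data for one router in Prometheus text format."""
    lines = []
    for iface, in_bits, out_bits in fetch_from_spc(router):
        labels = 'device="%s",interface="%s"' % (router, iface)
        lines.append("interface_bw_in_bits{%s} %d" % (labels, in_bits))
        lines.append("interface_bw_out_bits{%s} %d" % (labels, out_bits))
    return "\n".join(lines) + "\n"

print(render_metrics(router_from_path("/int/myrouter1")))
```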


Some added detail: we are running only 1 container of Prometheus in AWS ECS, with the data directory presented to the container as a volume.  Since we have multiple ECS instances, each one mounts an EFS mount point, and the Prometheus data directory lives in EFS.  This is so that if the container or instance goes down, Prometheus can restart using the same data in a different location, and we have not had any issues with this setup.



What did you do?
Wrote a custom exporter for Spectrum Performance Center that puts out duplicate HELP and TYPE lines. The exporter is being fixed on our end, but Prometheus scraped the data without issue. Eventually data stopped being collected even though Prometheus showed the scrape as just fine.

How the exporter is called:

http://localhost:8001/int/myrouter1

Output:

# HELP interface_bw Interface BW for Gi1
# TYPE interface_bw summary
interface_bw_in_bits{device="myrouter1",interface="Gi1",} 11
interface_bw_out_bits{device="myrouter1",interface="Gi1",} 16
# HELP interface_bw Interface BW for Gi2
# TYPE interface_bw summary
interface_bw_in_bits{device="myrouter1",interface="Gi2",} 1477
interface_bw_out_bits{device="myrouter1",interface="Gi2",} 125709
# HELP interface_bw Interface BW for Gi3
# TYPE interface_bw summary
interface_bw_in_bits{device="myrouter1",interface="Gi3",} 132831
interface_bw_out_bits{device="myrouter1",interface="Gi3",} 8135

Discovered the issue when promtool choked on the duplicate HELP and TYPE lines.
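For comparison, valid exposition format emits each HELP/TYPE pair at most once per metric family, with the family name matching the sample names. A cleaned-up version of the output above might look something like this (assuming these are point-in-time readings, so `gauge` rather than `summary`; the help strings are illustrative):

```
# HELP interface_bw_in_bits Interface inbound bandwidth in bits
# TYPE interface_bw_in_bits gauge
interface_bw_in_bits{device="myrouter1",interface="Gi1"} 11
interface_bw_in_bits{device="myrouter1",interface="Gi2"} 1477
interface_bw_in_bits{device="myrouter1",interface="Gi3"} 132831
# HELP interface_bw_out_bits Interface outbound bandwidth in bits
# TYPE interface_bw_out_bits gauge
interface_bw_out_bits{device="myrouter1",interface="Gi1"} 16
interface_bw_out_bits{device="myrouter1",interface="Gi2"} 125709
interface_bw_out_bits{device="myrouter1",interface="Gi3"} 8135
```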

What did you expect to see?

Prometheus should fail the scrape job and put something in the logs.

What did you see instead? Under which circumstances?

Prometheus went along fat, dumb, and happy until, at some point after a few days, it decided to stop collecting data on myrouter1.

Environment

  • System information:

    Linux 3.10.0-693.5.2.el7.x86_64 x86_64

  • Prometheus version:

    prometheus, version 2.1.0 (branch: HEAD, revision: 85f23d8)
    build user: root@6e784304d3ff
    build date: 20180119-12:01:23
    go version: go1.9.2

  • Alertmanager version:

  • Prometheus configuration file:

# my global config
global:
  scrape_interval:     300s # Scrape every 5 minutes. Default is every 1 minute.
  evaluation_interval: 60s # Evaluate rules every 60 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'myrouter1'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    metrics_path: /int/myrouter1

    static_configs:
      - targets: ['SPCexporter:8001']

    basic_auth:
      username: nunya
      password: bidness

  - job_name: 'myrouter2'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    metrics_path: /int/myrouter2

    static_configs:
      - targets: ['SPCexporter:8001']

    basic_auth:
      username: nunya
      password: bidness

  - job_name: 'aws-gis-transit-prod-dx'
    static_configs:
      - targets: ['cloudwatch-exporter:9100']

  - job_name: 'Grafana Stats'
    scrape_interval: 5s
    static_configs:
      - targets: [ 'mygrafana' ]
  • Alertmanager configuration file:
n/a
  • Logs:
nothing came out in the logs

Ben Kochie

Feb 12, 2018, 4:55:50 PM2/12/18
to joejg...@gmail.com, Prometheus Users
What do you see on /targets? The last error should show there.

What does the `up` metric say?
What does the `scrape_samples_scraped` say?
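If it helps, against this config those checks could be run in the expression browser as, for example:

```
up{job="myrouter1"}                      # 1 if the last scrape succeeded, 0 if it failed
scrape_samples_scraped{job="myrouter1"}  # number of samples in the last scrape
```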

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscribe@googlegroups.com.
To post to this group, send email to prometheus-users@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/90566236-2283-42ec-be53-3f6d1fddb338%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Joe Garcia

Feb 12, 2018, 10:20:55 PM2/12/18
to Ben Kochie, Prometheus Users
/targets shows no errors.

Metric says it is up.

Didn't look at scrape_samples_scraped before I wiped the DB files.  If this happens again, I will definitely get that info.


Joe Garcia

Feb 14, 2018, 2:36:23 PM2/14/18
to Prometheus Users


OK, it's happening again.  I am attaching a picture of scrape_samples_scraped for the last 24 hours, and I am also including the current output of the exporter.  The norm is 6 samples.  Any ideas?  I have Prometheus set to debug and I see nothing in the logs.




Joe Garcia

Feb 14, 2018, 2:40:10 PM2/14/18
to Prometheus Users
Ha, I just saw the issue... sorry for the wasted bandwidth.  We are putting out duplicate samples.

Joe Garcia

Feb 15, 2018, 1:12:34 PM2/15/18
to Prometheus Users
OK, just to clarify what was happening, for others.  We built an exporter for CA Spectrum Performance Center (PC).  PC is fed SNMP data from Spectrum: Spectrum polls the SNMP data every 5 minutes and then feeds it to PC.  Our exporter is designed to get the latest data from PC, and we scrape the exporter every 5 minutes.  What was happening was a timing issue.  At scrape time, PC might only have processed the current Spectrum data for some of the interfaces.  So if the scrape job tries to get the data at, say, 01:55:07 and the data was collected at 01:55:00, there is a chance PC hasn't finished processing all the interfaces and is only presenting the ones it has, so we were getting either no data or not all the interfaces.
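One way to guard against that race can be sketched as below: only expose a poll cycle once every interface in the inventory has reported data for it. This is a hedged sketch, not necessarily the fix that was actually applied; the function and its inputs are hypothetical stand-ins for the SPC inventory and the latest per-interface readings.

```python
# Sketch: only expose a poll cycle once every known interface has
# reported data for that cycle, so a half-processed cycle is never
# scraped. Inputs are hypothetical stand-ins for SPC data.

def latest_complete_cycle(expected_interfaces, rows):
    """rows: list of (interface, poll_timestamp, value) tuples.
    Return the newest timestamp for which every expected interface
    has a row, or None if no cycle is complete yet."""
    seen_by_ts = {}
    for iface, ts, _value in rows:
        seen_by_ts.setdefault(ts, set()).add(iface)
    complete = [ts for ts, seen in seen_by_ts.items()
                if seen >= set(expected_interfaces)]
    return max(complete) if complete else None

rows = [
    ("Gi1", 100, 11), ("Gi2", 100, 1477), ("Gi3", 100, 132831),
    ("Gi1", 105, 12), ("Gi2", 105, 1480),   # Gi3 not processed yet
]
# Cycle 105 is incomplete, so the exporter would serve cycle 100.
print(latest_complete_cycle(["Gi1", "Gi2", "Gi3"], rows))  # 100
```

The exporter would then render only the rows belonging to the returned cycle, trading up to one poll interval of freshness for never serving a partial interface set.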