Textfile Collector timestamps


rbar...@keedio.com

Jun 13, 2018, 11:51:09 AM
to Prometheus Users
Hi,

I've been working with Prometheus for some time and I just found that I'm unable to use the textfile collector to scrape metrics that carry a timestamp.

As far as I know (and as per the official documentation), the timestamp is meant for exactly this purpose:

METRIC:
total_cpu_system_rate_across_namenodes 0.003166666666666629 152889604700
ERROR:
"/var/lib/node_exporter/textfile_collector/cloudera_metrics_rbarroso.prom\" contains unsupported client-side timestamps, skipping entire file" source="textfile.go:219"
DOCUMENTATION:

If this feature has been disabled, how can I get these timestamped values into Prometheus? Should I use some other component?

Thanks in advance,

Ben Kochie

Jun 13, 2018, 1:43:40 PM
to rbar...@keedio.com, Prometheus Users
This will require a custom exporter. Another option, if the data is coming from another TSDB, is to implement a remote_read compatible endpoint.
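
For reference, a remote_read endpoint is wired up in the Prometheus server configuration; a minimal sketch, with a placeholder URL standing in for whatever system exposes the endpoint:

    # prometheus.yml fragment (URL is a placeholder)
    remote_read:
      - url: "http://other-tsdb.example:9201/api/v1/read"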


goog001.mus...@gmail.com

Feb 12, 2019, 7:42:07 AM
to Prometheus Users
So I guess the data is timestamped based on when it was scraped by the Prometheus server?

If so, how do we know if there is stale data coming through the textfile (e.g. because the cron job, mtail, or whatever else produces the textfile is down)?

I expect that Prometheus just continues to report the stale metrics from the same text file over and over again as if nothing is wrong.

Perhaps Prometheus looks at node_textfile_mtime_seconds and complains if that gets stale?

Just want to make sure we have a way to ensure we aren't looking at stale metrics. I suppose we could also export a job time/last-modified time as an additional metric and then set up alerting on that if now - currentValue > X... but that smells bad.
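
For what it's worth, a check along those lines can be written directly as PromQL alert expressions; a minimal sketch, where the thresholds and the my_job_last_run_timestamp_seconds metric name are illustrative rather than anything standard:

    # Textfile hasn't been rewritten in over 15 minutes:
    time() - node_textfile_mtime_seconds > 15 * 60

    # Or, if the job exports its own last-run timestamp, alert when that gets old:
    time() - my_job_last_run_timestamp_seconds > 2 * 3600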

goog001.mus...@gmail.com

Feb 13, 2019, 3:10:06 AM
to Prometheus Users
(deleted/reposting..."awesome" google multi-signin UX got me twice now)

So after playing with this for a day I can add the following empirical note:
- my .prom file is only updated (via cron) once per minute
- the data inside should theoretically only change once every 5 minutes (source data is based on a */5 job)
- prometheus is scraping node_exporter metrics every 10s

I can confirm that prometheus does indeed store the metrics every 10 seconds despite node_textfile_mtime_seconds not changing.

This isn't ideal, obviously, because we have way more data being stored than is necessary. I can slow down how often we scrape node_exporter, of course... but that isn't really solving the problem, because I either end up with too low a resolution on the node_exporter metrics or with oversampling (too much repeated data) on the textfile metrics being made available via node_exporter.

It seems like timestamping the data in the textfile and then having Prometheus ignore or dedupe it ("I already have a data point for this metric/timestamp: ignore") would be the more elegant solution. I only started with Prometheus a few days ago, though... so hopefully there is a clean way to handle this case already? I would like to avoid having to keep track of a myriad of open ports to scrape on every box; creating new exporters to run on new ports with different scrape configs on the Prometheus server feels unnecessarily complicated... but maybe that is the official answer?

Stuart Clark

Feb 13, 2019, 3:51:47 AM
to goog001.mus...@gmail.com, Prometheus Users
What you describe is exactly what is expected. The textfile collector of
the node exporter and the Pushgateway are designed for use cases like
cron jobs: the prom metrics file will change occasionally, but will be
included in the metrics from every node exporter scrape. The storage
used on the Prometheus side shouldn't be a concern, due to the way
compression is used.

I would strongly suggest you limit your use of the textfile
collector/push gateway to only the things you can't do any other way
(e.g. short lived cron jobs) and use additional exporters or directly
instrument your custom applications.

This design of the Prometheus ecosystem has lots of advantages which you
would otherwise lose: the automatic metrics like "up", the removal of a
single point of failure on each box, and per-job configuration (e.g.
relabelling rules).

goog001.mus...@gmail.com

Feb 13, 2019, 4:15:15 AM
to Prometheus Users
The reason this came up at all is that Grafana was crashing and wouldn't load the 12 hours of data collected since I set this up yesterday afternoon.

The Grafana folks always point the finger at "too many data points in your TSDB" when people say it is slow.

Running the metric query on the Prometheus side shows that I do indeed have the same metric every 10 seconds even though the metric is stale, so something around 17k data points across my 4 test machines for the last 12 hours.  Of course most of those data points are duplicates... which is why I was wondering what the point of the timestamp in the exposition format is, if not to indicate to Prometheus when the data is from, in order to aid de-duplication.  And of course half the point of this thread is that the timestamp in the exposition format isn't allowed by the textfile collector.


Perhaps 17k data points are just too many for Grafana to deal with, or perhaps the performance problem lies elsewhere.  My first instinct, on realizing that many of these data points are duplicates due to oversampling, was that this is a performance problem related to oversampling un-updated data points.  Without a timestamp on each metric, there is no way for the exporter to indicate to Prometheus that this is the same sample it already scraped... not an updated metric that happens to have the same value.  Right?

I don't want to write custom exporters and open/track more metric ports for every little thing (e.g. cron job status, security updates available, etc).  Of course for our own edge applications we will do custom instrumentation.  For well known services that have a specialized exporter, of course we will use that.  But getting basic server/service health monitoring in place is something that should work fairly easily out of the box I would think.

Perhaps in real world deployments people are still using nagios/naemon for all that kind of basic redlight/greenlight stuff and just using prometheus for more advanced whitebox monitoring?  Or perhaps most companies using prometheus have large dedicated devops teams, everything runs in a DC behind firewalls and they just open a large number of ports and manage the corresponding metric exporters installs/configs on the servers.

As I said, I am new to Prometheus and I am sure you guys have many more years/months of thought about how prometheus should work and why.  So I defer to your expertise...which is why I am asking questions...to better understand the hows/whys so my intuition about how to do things gets better.  The more I explain how/why I expect it to work, the easier it is for you to point out exactly what the problems are with my line of thinking.  

Thanks for your reply.  Happy to hear any further insight or suggestions you may have.


Matt P

Feb 13, 2019, 4:40:22 AM
to Prometheus Users
Short update in case it helps anyone else that comes across this thread.

Grafana was indeed crashing due to too many data points.  In my case it was a custom annotation (cronjob_name_last_run_timestamp) and not the regular data series that was causing the problem.
It seems the key here is to make sure to specify the Prometheus resolution in the query (Grafana may call this "step") in order to group/limit the number of data points returned by Prometheus to Grafana.  I guess this is a good practice in general: limit your Grafana query step/resolution to the resolution of the underlying data to avoid making Grafana slower than necessary.
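
For reference, the step Grafana sends ends up as the step parameter of the Prometheus range-query API; the metric name and timestamps below are placeholders:

    GET /api/v1/query_range?query=backup_success&start=1550000000&end=1550043200&step=60

With step=60, Prometheus returns at most one point per minute for the range, however often the underlying series was scraped.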

So my main issue is resolved.  I will trust you on the "don't worry about Prometheus collecting extra/stale/duplicate data, because this is solved with compression" advice and assume that Prometheus query speed won't be an issue (for now).

Thanks Stuart for taking the time to reply.

Stuart Clark

Feb 13, 2019, 5:39:17 AM
to goog001.mus...@gmail.com, Prometheus Users
We actually use the textfile collector for similar use cases.

One example is we have cron jobs related to backups. The script, when
run, will also create a prom metrics file containing at least a metric
with a 0/1 success/failure value and one containing the Unix timestamp
of the last time it ran. Additionally we'll add metrics for things like
the run time, etc.

We can then use those for alerts/single stat panels - for example
alerting on failure or if the backup failed to run within the last X
hours.
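
For illustration, a file like the one described might look roughly like this; the metric names, values and thresholds are made up rather than Stuart's actual ones:

    # backup.prom, rewritten by the backup script on each run
    backup_success 1
    backup_last_run_timestamp_seconds 1550040000
    backup_duration_seconds 184

Alert expressions along the lines of backup_success == 0 or time() - backup_last_run_timestamp_seconds > 26 * 3600 then cover both the "it failed" and the "it hasn't run recently" cases.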

Stuart Clark

Feb 13, 2019, 5:43:01 AM
to Matt P, Prometheus Users
With regards to Grafana, you do want to be careful about the number of
data points you return. Grafana has a useful variable called $__interval
which can be used within PromQL queries. Using that with a suitable
aggregation function (which one depends on whether the metric is a
counter or a gauge and what it represents), you can easily plot graphs
over arbitrarily long or short time ranges without any problems: the
number of datapoints sent to Grafana stays reasonably constant, with
each point being a min/max/average/whatever over a different time
period (minute/hour/day/etc.) based on the granularity that makes sense
for the time range chosen.
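
As a concrete sketch, with an illustrative gauge name:

    avg_over_time(backup_duration_seconds[$__interval])

For a counter you would use rate() over the same window instead, and min_over_time/max_over_time where those are more meaningful.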

Hope that helps :-)

Matt P

Feb 13, 2019, 6:01:45 AM
to Prometheus Users
Thanks again for all the replies.  Yes it helps.  I am also collecting status (1/0 fail/success), duration (seconds), job timestamp and 2 job specific metrics.

I had assumed that Grafana's problem was the volume of data (which turned out to be correct).  It was also related to annotations being very expensive to render in Grafana relative to normal charts.

I was looking at all the dupe/oversampled metrics in prometheus as the root cause.  

Instead, the solution was to use the "min step" option in Grafana and align that to the underlying data frequency, which I know but which Prometheus seems to ignore.  The interval option in Grafana is for auto-setting the step size (i.e. 1/10 aims for 1 data point per 10 pixels of chart width).  In my case just setting minStep seems to be enough.

I still feel like it is wasteful to collect the same non-updated metric over and over... but I understand Prometheus handles this internally ("compression").

I am also not sure I understand why the exposition formats have the timestamp but it isn't allowed by the textfile collector.  It seems like we are discarding a relevant piece of data (when this data point is REALLY from... as opposed to when it was scraped).  But this may be down to the architectural/design decisions of Prometheus or the Prometheus TSDB (just write, and don't worry about comparing to the last record for a given key).  It just seems it would be useful to say: "yes I scraped it, the metrics endpoint is up, the metric hasn't been updated since the last time I scraped it because the timestamp hasn't changed".  And this raises the question of what the purpose of the timestamp is, where it is allowed, if Prometheus doesn't use it in a way like this.  The exposition formats page is pretty light on details.


Matt P

Feb 13, 2019, 6:10:36 AM
to Prometheus Users
Good tip on the $__interval variable.  I am doing straight metric queries right now to generate my Grafana charts, so just setting the min step size is enough (this gets passed to Prometheus' HTTP API as step=).  I am sure the $__interval variable will come in handy when building fancier queries.  Thanks.

The 1/10 thing I mentioned in my previous reply is called "Resolution" in Grafana; I misspoke when I said interval.  Sorry.

Ben Kochie

Feb 13, 2019, 7:56:41 AM
to fred...@missionfemale.com, Prometheus Users
On Wed, Feb 13, 2019 at 9:07 AM <fred...@missionfemale.com> wrote:
> So after playing with this for a day I can add the following empirical note:
> - my .prom file is only updated (via cron) once per minute
> - the data inside should theoretically only change once every 5 minutes (source data is based on a */5 job)
> - prometheus is scraping node_exporter metrics every 10s
>
> I can confirm that prometheus does indeed store the metrics every 10 seconds despite node_textfile_mtime_seconds not changing.
>
> This isn't ideal obviously because we have way more data being stored than is necessary. I can slow down how often we scrape node_exporter of course... but that isn't really solving the problem because I either have to get too low of resolution on node_exporter metrics or oversampled (too much repeated data) on textfile metrics being made available via node_exporter.

This is basically not an issue for Prometheus. It uses a very good compression method that reduces this data to only a few extra bits in the sample blocks.
 

> It seems like timestamping the data in the textfile and then prometheus ignoring or deduping it (I already have datapoint for this metric/timestamp: ignore) would be the more elegant solution. I only started with prometheus a few days ago though...so hopefully there is a clean way to handle this case already? I would like to avoid having to keep track of a myriad of open ports to scrape on every box...ie just creating new exporters to run on new ports with different scrape configs on prometheus server feels unnecessarily complicated....but maybe that is the official answer?
 
Yes, Prometheus is designed to scrape many different ports on machines. This is an intentional feature, for a bunch of reasons. It avoids a single process becoming a SPoF. If you have the same software running multiple times on the same node, you scrape each instance directly. This also keeps individual exporters clean and lightweight. If you compare the node_exporter to Telegraf, it's small and simple.

The whole original design for Prometheus assumes that side-car exporters would eventually become a thing of the past, as every piece of software would have a compatible endpoint. This is quickly becoming true, as the format has become popular with cloud-native applications and works with more than just Prometheus.

Stuart Clark

Feb 13, 2019, 8:39:22 AM
to Matt P, Prometheus Users

>
> I am also not sure I understand why the exposition formats have the
> timestamp but that isn't allowed by the textfile collector. Seems
> like we are discarding a relevant piece of data (when is this data
> point REALLY from...as opposed to when was it scraped). But this may
> be down to the architectural/design decisions of prometheus or the
> prometheus TSDB (just want to write and not worry about comparing to
> last record for given key). Just seems it would be useful to say:
> "yes I scraped it, the metrics endpoint is up, the metric hasn't been
> updated since the last time I scraped it because the timestamp hasn't
> changed". And this raises the question of what is the purpose of the
> timestamp when it is allowed if prometheus doesn't use it in a way
> like this. The expositional formats page is pretty light on details.
>

I don't have the full history, but the timestamp field is basically
"deprecated" (not officially, but it seems to be usable for only a very
few things).

I think support for it was removed from various client libraries
because people assumed it could be used for things like out-of-order
back population (it can't) or for returning multiple instances of a
metric in a single scrape (also not possible).

But I'd defer to one of the experts for the background...

Matt P

Feb 13, 2019, 8:42:41 AM
to Prometheus Users
Thanks for the additional color Ben and Stuart.  Appreciate you guys taking the time to respond.  It helps.

Brian Brazil

Feb 13, 2019, 8:54:10 AM
to Stuart Clark, Matt P, Prometheus Users
It's not deprecated, but support was explicitly removed from both the Pushgateway and the node exporter textfile collector for these reasons. Some client libraries support it, and for the ones that don't it's more that no one has gotten to it yet. It's something that should rarely be needed, usually only when taking data from another monitoring system that has timestamps (e.g. CloudWatch, Graphite, InfluxDB, and Prometheus federation).

In this case a normal exporter sounds like what you want, as that's a fairly frequent cronjob.
 

Matt P

Feb 13, 2019, 8:59:32 AM
to Prometheus Users
Thanks Brian.  And if the cron job was once a day or once a week?  

Feels like scraping that same set of basic metrics every 10s from a job that only generates those metrics once per week is sub-optimal.

Maybe I just need to get over prometheus scraping/storing the same data point over and over?  

Brian Brazil

Feb 13, 2019, 9:02:58 AM
to Matt P, Prometheus Users
On Wed, 13 Feb 2019 at 13:59, Matt P <goog001.mus...@gmail.com> wrote:
> Thanks Brian.  And if the cron job was once a day or once a week?

That'd be fine. When things get below 15m is when I'd generally consider using a daemon rather than a cronjob for general reliability. Attempting to treat anything that updates that slowly as a counter may produce unexpected results though.
 

> Feels like scraping that same set of basic metrics every 10s from a job that only generates those metrics once per week is sub-optimal.
>
> Maybe I just need to get over prometheus scraping/storing the same data point over and over?

It is quite cheap.
 

Ben Kochie

Feb 13, 2019, 11:52:22 AM
to Matt P, Prometheus Users
On Wed, Feb 13, 2019 at 2:59 PM Matt P <goog001.mus...@gmail.com> wrote:
> Thanks Brian.  And if the cron job was once a day or once a week?
>
> Feels like scraping that same set of basic metrics every 10s from a job that only generates those metrics once per week is sub-optimal.

Yes, this isn't exactly the use case Prometheus is optimized for. For stuff like this, I usually suggest a Pushgateway on localhost of the cron job server, scraped at a slow rate like once per minute. If you expose a cron start timestamp and end timestamp metric, it's good enough to display in monitoring.
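
As an illustration of that, pushing a pair of timestamps to a localhost Pushgateway with the official Python client could look roughly like this; the job name and metric names are placeholders, not a prescribed convention:

    # Sketch of a cron job pushing start/end timestamps to a localhost Pushgateway
    # using the official prometheus_client library; names are placeholders.
    import time
    from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

    registry = CollectorRegistry()
    start = Gauge("cron_job_start_timestamp_seconds",
                  "Unixtime the job started", registry=registry)
    end = Gauge("cron_job_end_timestamp_seconds",
                "Unixtime the job finished", registry=registry)

    start.set(time.time())
    # ... do the actual work here ...
    end.set(time.time())
    push_to_gateway("localhost:9091", job="nightly_backup", registry=registry)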
 

Maybe I just need to get over prometheus scraping/storing the same data point over and over?  

In the above example, a cron job start time, end time, and last success metric trio would take up ~1.5 million samples per year. In old-school systems like Zabbix, this would be something like 90MiB. In Graphite 24MiB.

In Prometheus we compress 20 identical samples down to something like 24 bytes. This means a year of samples in Prometheus is 1.8MiB.

Prometheus, in this case, is on the order of 13 times more efficient than Graphite, and 50 times more efficient than Zabbix.
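
Working those numbers through as a quick sanity check, under the stated assumptions of three series scraped once per minute and roughly 24 bytes per 20 compressed samples:

    samples_per_year = 3 * 60 * 24 * 365          # 1,576,800, i.e. ~1.5 million
    prometheus_bytes = samples_per_year / 20 * 24  # ~1.9e6 bytes, i.e. ~1.8 MiB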


David Leibovic

Jul 7, 2024, 6:41:37 PM
to Prometheus Users
> This will require a custom exporter

I wanted to expand on this, regarding the original question of how to get timestamped metrics into Prometheus. I implemented a custom exporter, and it wasn't as hard as I initially imagined. An exporter is essentially just a web server that exposes the timestamped metrics in the *.prom files such that Prometheus can scrape them. You can write a custom exporter in just a few lines of Python.
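
For readers wanting a concrete starting point, a minimal sketch of such an exporter might look roughly like this (the port, directory, and behaviour are placeholders, not David's actual implementation):

    # Serve the contents of *.prom files, timestamps included, for Prometheus to scrape.
    import glob
    from http.server import BaseHTTPRequestHandler, HTTPServer

    PROM_DIR = "/var/lib/timestamped_metrics"  # hypothetical directory

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_response(404)
                self.end_headers()
                return
            # Concatenate every .prom file; lines may carry client-side timestamps.
            body = "".join(open(p).read()
                           for p in sorted(glob.glob(PROM_DIR + "/*.prom")))
            payload = body.encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    if __name__ == "__main__":
        HTTPServer(("", 9099), MetricsHandler).serve_forever()

Point a normal scrape job at that port and Prometheus will ingest the lines, timestamps and all.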

However, after implementing a custom exporter for timestamped metrics, you may see an error in Prometheus logs:

Error on ingesting samples that are too old or are too far into the future

I believe if the timestamped metrics you are attempting to ingest have timestamps that are older than approximately 1 hour, you may encounter this error. Prometheus has an experimental feature that solves this problem: out_of_order_time_window. See also this blog post announcing the feature. With that, you should be able to ingest timestamped metrics with Prometheus.
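
Assuming the current config layout, enabling it looks roughly like this in prometheus.yml; the window size below is just an example and needs to cover how far back your timestamps go:

    storage:
      tsdb:
        out_of_order_time_window: 12h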

See my write up for more details on the whole process.