Prometheus V2 eating through all available disk


vit...@contextlogic.com

Nov 15, 2017, 4:09:36 PM11/15/17
to Prometheus Users
Hello,

Just wanted to say thank you for the hard work on getting 2.0 out. I'm looking forward to taking advantage of all the performance benefits.

I launched a 2.0 instance a few days ago with a 4TB volume for storing my metrics and noticed all 4TB get used up within a period of 24 hours. Is this expected?

A lot of the Prometheus storage metrics changed in 2.0 so I'm unsure of what kind of data to provide but let me know how I can help.

Vitaliy

Ben Kochie

Nov 15, 2017, 4:34:05 PM11/15/17
to vit...@contextlogic.com, Prometheus Users
That doesn't sound right at all.  What is your sample ingestion rate?

rate(prometheus_tsdb_head_samples_appended_total[5m])

What does `du` look like in the data directory?
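For reference, a sketch of how one might gather both numbers. The sandbox path below is a stand-in; point `du`/`df` at your actual `--storage.tsdb.path`:

```shell
# Stand-in for the real data directory (substitute your --storage.tsdb.path).
DATA_DIR=$(mktemp -d)
mkdir -p "$DATA_DIR/01BZEXAMPLEBLOCK/chunks"
dd if=/dev/zero of="$DATA_DIR/01BZEXAMPLEBLOCK/chunks/000001" bs=1024 count=64 2>/dev/null

# Per-block usage in KiB, largest first; TSDB block directories are ULID-named.
du -sk "$DATA_DIR"/* | sort -rn

# Overall usage, plus free space on the volume.
du -sk "$DATA_DIR"
df -k "$DATA_DIR"
```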

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscribe@googlegroups.com.
To post to this group, send email to prometheus-users@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/3bd6a935-3c48-45c6-8427-2b0cb5e4582f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

David Karlsen

Nov 15, 2017, 7:41:40 PM11/15/17
to Ben Kochie, vit...@contextlogic.com, Prometheus Users
What kind of filesystem is it? Maybe there are a lot of small files and the inodes get consumed fast?
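A quick way to test that hypothesis is to compare inode exhaustion against byte exhaustion on the filesystem holding the data directory. A sketch (the sandbox path is a stand-in for the real mount):

```shell
# Stand-in mount point (substitute the filesystem holding your data dir).
CHECK_DIR=$(mktemp -d)

# IUse% near 100 while Use% stays low would confirm inode exhaustion.
df -i "$CHECK_DIR"
df -h "$CHECK_DIR"
```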

Ben Kochie

Nov 16, 2017, 4:27:52 AM11/16/17
to David Karlsen, vit...@contextlogic.com, Prometheus Users
That shouldn't be a problem with the new TSDB; it uses a small number of large files.

vit...@contextlogic.com

Nov 20, 2017, 8:01:06 PM11/20/17
to Prometheus Users
Sorry for the delayed response; I was dealing with other issues.

My problem seems to be related to this issue:


Sample ingestion rate was about 500k samples/s, and it was working fine for about 4 days until the issue happened. The data directory is a bunch of empty `.tmp` directories like this (`du` output):

0 ./01BZCW3RKNNF94S9N4S5B53XF2.tmp/chunks
0 ./01BZCW3RKNNF94S9N4S5B53XF2.tmp
0 ./01BZCW8HJ1GDA475AKH3S0AKVD.tmp/chunks
0 ./01BZCW8HJ1GDA475AKH3S0AKVD.tmp
0 ./01BZCWDAMMTNH9RPHZ410WFHT3.tmp/chunks
0 ./01BZCWDAMMTNH9RPHZ410WFHT3.tmp
0 ./01BZCWJ2C5ASQ2F1FWTZT9XY4W.tmp/chunks
0 ./01BZCWJ2C5ASQ2F1FWTZT9XY4W.tmp
0 ./01BZCWPZ08HKZ6D3P6NWSMZKMB.tmp/chunks
0 ./01BZCWPZ08HKZ6D3P6NWSMZKMB.tmp
0 ./01BZCWVTA9VJ5ZRSA4237JSJT5.tmp/chunks
0 ./01BZCWVTA9VJ5ZRSA4237JSJT5.tmp
0 ./01BZCX0S38XYAARX59J5EP49YP.tmp/chunks
0 ./01BZCX0S38XYAARX59J5EP49YP.tmp
0 ./01BZCX5KT40Q1XERR7BFTXQE6E.tmp/chunks
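If those `*.tmp` directories are leftovers from failed compactions (as the linked issue suggests), a common workaround is to stop Prometheus and remove them; real block directories have no `.tmp` suffix and should be left alone. A sandboxed sketch of that cleanup, with made-up directory names:

```shell
# Sandbox standing in for the data directory (substitute your real path,
# and stop Prometheus before touching it).
DATA_DIR=$(mktemp -d)
mkdir -p "$DATA_DIR/01BZCW3RKNNF94S9N4S5B53XF2.tmp/chunks"  # abandoned compaction output
mkdir -p "$DATA_DIR/01BZREALBLOCKEXAMPLE00000/chunks"       # a real block: keep it

# Remove only the top-level *.tmp directories.
find "$DATA_DIR" -mindepth 1 -maxdepth 1 -type d -name '*.tmp' -exec rm -r {} +

ls "$DATA_DIR"
```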

