Some real-life benchmarks of Prometheus 2.0 vs. 1.8

Björn Rabenstein

Dec 11, 2017, 2:02:02 PM
to Prometheus Users

Dear Prometheans,

In the process of migrating from Prometheus 1.8 to 2.0 at SoundCloud, we ran our more important Prometheus servers on both versions in parallel. That allowed us to do some side-by-side comparisons under a typical production workload.

The raw data can be found in this spreadsheet. It has to be taken with several grains of salt:

  • All three servers considered are fairly large, dedicated bare-metal instances. Your mileage, especially on cloud instances or shared servers, will certainly vary.
  • All servers were generously provisioned to safely do their job with Prometheus 1.8. Thus, this is not really a stress test for Prometheus 1.8, and (in most cases) even less so for Prometheus 2.0. We will only benefit from 2.0’s resource savings once our server load has grown organically or we have merged sharded servers.

Let me first describe a few gains from the 2.0 migration that don’t really show up in the benchmark numbers. Given that we have a fairly mature set-up and had already tweaked our functional sharding so that we rarely ran into resource problems, we no longer had significant problems with operational complexity or OOMing. But there were more or less noticeable pain points remaining that Prometheus 2.0 addressed, approximately in this order:

  1. With larger servers and/or more series churn, the LevelDB indexing created more and more problems. After a major deploy, the resulting influx of new series led to an indexing backlog, which in turn affected the results of recording rules (the rules were evaluated against a stale state of the index, thus ignoring many of the new series, and the result was then saved for eternity). Also, a large index was quite slow at times; most problems with slow queries were due to time spent in LevelDB lookups. Finally, the re-indexing required during crash recovery dominated total crash-recovery time on larger servers. (Luckily, crashes were rare.) With the dedicated new indexing in Prometheus 2.0, all of these issues have disappeared.
  2. Checkpointing created a real scalability wall. Our largest servers have 128GiB of RAM, which is about the size where checkpointing starts to take annoyingly long. This is bad because more data is lost in case of a crash, and it affects shut-down and start-up times. If one of those large servers became overloaded, going even larger (say 256GiB of RAM) would no longer be an option, because that does nothing to reduce the checkpointing duration. Prometheus 2.0 has a related issue with WAL replay (but it only affects start-up, not shut-down). Luckily, on our dedicated many-core servers, the WAL replay can utilize a lot of otherwise idle cores, which makes it much faster than checkpoint loading; see the toy sketch after this list. However, this will be really different in a cloud deployment where your Prometheus server only has as many cores reserved as it needs under normal load.
  3. Most of our servers really just hold ephemeral metrics data. However, some are used in scenarios where long-term history comes in handy. Since 1.x lacks consistent backups, a disk failure or server migration was a huge problem (luckily constrained to those few cases). While Prometheus 2.0 makes backups elegant and easy, we still need to migrate the long-term data from the 1.x to the 2.0 format. There are some efforts underway in the community, which we will have to jump onto eventually.
  4. Finally, complex dashboards over shorter time frames got a very noticeable speed-up from Prometheus 2.0. (That’s important to mention here because the benchmark results below are about longer time frames.)
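
To illustrate the WAL-replay point from item 2: the following is only a toy sketch, not the actual tsdb code. It fans the decoding of independent segment files out over all available cores, which is why replay finishes so much faster on a many-core bare-metal box than on a cloud instance that has only the cores it needs under normal load. The directory argument and the per-segment work are made up for illustration.

package main

// Toy sketch only (NOT the actual tsdb WAL code): "replay" independent WAL
// segment files concurrently, one worker per core. The point is that the
// wall-clock time shrinks roughly with the number of otherwise idle cores.

import (
    "fmt"
    "os"
    "path/filepath"
    "runtime"
    "sync"
)

func main() {
    if len(os.Args) != 2 {
        fmt.Fprintln(os.Stderr, "usage: waldemo <segment-dir>")
        os.Exit(1)
    }
    segments, err := filepath.Glob(filepath.Join(os.Args[1], "*"))
    if err != nil {
        panic(err)
    }

    jobs := make(chan string)
    var wg sync.WaitGroup
    for i := 0; i < runtime.NumCPU(); i++ { // one decoder per core
        wg.Add(1)
        go func() {
            defer wg.Done()
            for seg := range jobs {
                data, err := os.ReadFile(seg)
                if err != nil {
                    fmt.Fprintln(os.Stderr, err)
                    continue
                }
                // A real replay would decode the records and rebuild the
                // in-memory series here; we only count bytes to keep the
                // sketch short.
                fmt.Printf("replayed %s (%d bytes)\n", seg, len(data))
            }
        }()
    }
    for _, seg := range segments {
        jobs <- seg
    }
    close(jobs)
    wg.Wait()
}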

And now I’d like to give a summary and interpretation of the benchmark results (see the spreadsheet for the raw numbers):

Memory usage

In short, since we are not really putting the servers under pressure, and the memory management of 1.8 vs. 2.0 is so different, there isn’t a lot to read from the numbers. However, we do expect that we could now run servers with high series churn much hotter, but we still have to try that out. A somewhat surprising result is that we managed to reliably OOM a 2.0 server with an expensive (but practically relevant) query that was working just fine on the corresponding 1.8 server. We still need to investigate where exactly the memory is consumed. The query layer is known to use a lot of memory under certain circumstances, but it is the same in 2.0 and 1.8. So query load must cause some additional memory usage in the storage layer or in the staleness handling.
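
If you want to poke at this yourself: Prometheus serves Go’s standard /debug/pprof endpoints on its web port, so a heap profile taken while the expensive query is in flight shows where the memory goes. A minimal sketch (the host name is a placeholder for your own server):

package main

// Minimal sketch: grab a heap profile from a running Prometheus server while
// the expensive query is in flight, then inspect it with `go tool pprof`.
// The host name below is a placeholder.

import (
    "io"
    "net/http"
    "os"
)

func main() {
    resp, err := http.Get("http://prometheus.example.com:9090/debug/pprof/heap")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    out, err := os.Create("heap.pprof")
    if err != nil {
        panic(err)
    }
    defer out.Close()

    if _, err := io.Copy(out, resp.Body); err != nil {
        panic(err)
    }
    // Afterwards: go tool pprof heap.pprof
}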

CPU usage

CPU usage was down by 50% to 75% without query load. It is notable that most of the additional CPU consumption of 1.8 is due to the deletion of expired time series data. It’s quite telling that we didn’t spend a lot of thought on the deletion of expired data when we designed the storage layer for 1.x. (“Deletion is easy, isn’t it?”) In hindsight we have to realize that deletion is the biggest resource drain in 1.x.

Disk I/O

The most spectacular improvement is the I/O utilization as reported by the drive. All servers tested here have SSDs; things would look a bit different on HDDs, but that’s a different story. Reported drive utilization is down by 97.5% to 99.1%. That’s two orders of magnitude! And again, most of the utilization on 1.8 is caused by deletion (about 90%). Deleting relatively small amounts of data is something SSDs really hate.

On-disk storage

I was a bit surprised by the results here. Since 1.8 plays a few more tricks with compression, it should, in principle, be a bit better for chunks completely filled with samples. (To be fair, the varbit compression was hacked together fairly quickly without proper research into how much those additional tricks actually contribute.) The big downside of 1.8 is its constant-sized chunks: when a series ends, we get half-filled chunks, and if compression works well, we get too many samples in a chunk (bad for querying), while if it doesn’t work well, larger chunks would have been more efficient.
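
To make the half-filled-chunk effect concrete, here is a toy calculation. The 1024-byte chunk size is the actual 1.x constant; the bytes-per-sample figure is only an illustrative assumption, not a measured value:

package main

// Toy calculation: how series churn inflates the effective bytes-per-sample
// cost of 1.x's constant-sized chunks. The 1024-byte chunk size is the real
// 1.x constant; the compression figure is an assumption for illustration.

import "fmt"

func main() {
    const chunkBytes = 1024.0
    const bytesPerSample = 1.3 // assumed varbit compression

    full := chunkBytes / bytesPerSample // samples in a completely filled chunk
    fmt.Printf("a full chunk holds ~%.0f samples at %.1f bytes/sample\n", full, bytesPerSample)

    // If a series ends (churn) while its last chunk is only 10% filled, the
    // on-disk cost per sample of that chunk is ten times higher.
    fillFraction := 0.1
    fmt.Printf("a %.0f%% filled chunk costs %.1f bytes/sample\n",
        fillFraction*100, chunkBytes/(full*fillFraction))
}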

The results show that even on the system server, with very little churn, we get slightly better compression per sample. On api-mobile, however, I would have expected 2.0 to fare much better, but the difference is negligible. The k8s server seems to have really nasty-to-compress data, and 2.0 fares a lot better than 1.8 there. It would be very interesting to investigate where exactly the ups and downs happen on both systems.

The on-disk index size is about twice as big for 2.0. While 2.0 repeats parts of the index for each block, I assume the main reason is that the LevelDB used in 1.8 compresses the on-disk data with Snappy, while 2.0 writes the index to disk in the same form in which it is mmap’d. That format does some compression of its own, but it’s not surprising that it is less efficient. However, LevelDB has to unpack the compressed data in memory, which is a pretty big burden on overall performance.

Query performance

The benchmark has two artificial scenarios (a single time series over the whole retention time, and many time series at a single point in the past), which are supposed to explore certain extremes. In addition, we took a fairly complex query directly from our most popular dashboards and ran it with the last 7d as the time range.
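
For reference, this is roughly how such a comparison can be scripted against the HTTP API of both servers. The host names, the query, and the time range below are placeholders, not our actual setup:

package main

// Rough sketch of timing the same range query against a 1.8 and a 2.0 server
// via the HTTP API. Host names, the query, and the time range are placeholders.

import (
    "fmt"
    "io"
    "net/http"
    "net/url"
    "time"
)

func timeQuery(server, query string, start, end time.Time, step time.Duration) (time.Duration, error) {
    params := url.Values{}
    params.Set("query", query)
    params.Set("start", fmt.Sprintf("%d", start.Unix()))
    params.Set("end", fmt.Sprintf("%d", end.Unix()))
    params.Set("step", fmt.Sprintf("%d", int(step.Seconds())))

    t0 := time.Now()
    resp, err := http.Get(server + "/api/v1/query_range?" + params.Encode())
    if err != nil {
        return 0, err
    }
    defer resp.Body.Close()
    // Drain the body so we time the full response, not just the headers.
    if _, err := io.Copy(io.Discard, resp.Body); err != nil {
        return 0, err
    }
    return time.Since(t0), nil
}

func main() {
    end := time.Now()
    start := end.Add(-7 * 24 * time.Hour) // e.g. the 7d dashboard case
    query := `sum(rate(http_requests_total[5m])) by (status)` // placeholder query

    for _, server := range []string{
        "http://prom-1-8.example.com:9090", // placeholder hosts
        "http://prom-2-0.example.com:9090",
    } {
        d, err := timeQuery(server, query, start, end, time.Minute)
        if err != nil {
            fmt.Println(server, "error:", err)
            continue
        }
        fmt.Println(server, "took", d)
    }
}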

The full-range query generally took longer on 2.0, with a larger difference the longer the retention time. This makes sense, as 2.0 has to access more blocks, while 1.8 probably burns most of its time in the index lookup and then only has to access the one dedicated series file. The only time 1.8 was slower was the cold query on the server with the shortest retention time.

The query of many time series in the past is a fairly artificial scenario but exercises interesting code paths. Where the total data size exceeded RAM, the cold query took longer on 1.8 than on 2.0. That is nicely explained by the many files 1.8 has to open and process, while 2.0 only acts within one block; and since all the accessed series had the same metric name, they were also close together in that block thanks to the lexicographical sorting within a block. The hot query, in turn, was much slower on 2.0, which surprised me. My guess is that it’s the staleness handling, not the storage layer itself, that creates this overhead, as staleness has to be checked separately for each series. But that’s another topic for further investigation.

The differences for the “typical complex dashboard query” over 7d weren’t that big (between 40% and 70%). Perhaps another surprise: 1.8 wins for two out of the three servers. As noted above, 2.0 is in general much faster for shorter time ranges, while 1.8 query times don’t depend much on the range. My working hypothesis is that 1.8 spends a lot of time in the index lookup; once that’s done, the one-file-per-series design makes the rest of the query quite fast, no matter how long the time range is. 2.0 has a much faster index lookup but has to repeat it for each block. The case where 2.0 wins is the one with higher series churn, where 1.8 also has to do many index lookups (different series over time). Where that’s not the case, 1.8 has the edge.

Outlook

That concludes this ad-hoc analysis. As you can see, there is certainly a lot of opportunity for further investigation. But I hope it helps a bit in getting an idea of the real-life impact of the migration, at least in a “big dedicated servers” scenario. While a lot got much (and sometimes dramatically) better with 2.0, there are a few surprising nuances in the overall picture.

I’m really curious how users fare who are more resource-constrained in a shared or cloud environment. Also, I’m interested in what happens if the total index size exceeds RAM, i.e. whether index lookups then involve frequent page faults. And finally, I have a (somewhat academic) interest in HDD behavior.

Happy Prometheing,

--
Björn Rabenstein, Engineer
http://soundcloud.com/brabenstein

SoundCloud Ltd. | Rheinsberger Str. 76/77, 10115 Berlin, Germany
Managing Director: Alexander Ljung | Incorporated in England & Wales with Company No. 6343600 | Local Branch Office | AG Charlottenburg  | HRB 110657B

Tom Wilkie

Dec 11, 2017, 3:08:34 PM
to Björn Rabenstein, Prometheus Users
Awesome write-up, Björn! Have you considered putting this on the Prometheus blog?

Björn Rabenstein

Dec 11, 2017, 3:18:34 PM
to Tom Wilkie, Prometheus Users
On 11 December 2017 at 21:08, Tom Wilkie <tom.w...@gmail.com> wrote:
> Awesome write-up, Björn!

Thanks, Tom.

> Have you considered putting this on the Prometheus blog?

I'd say it's still a bit too rough for that. There's so much "needs more investigation" left (some of which would be quite straightforward, like heap profiling, analyzing query timings with the newer code changes, or improving the timing instrumentation). But time is limited, and there are so many other things to do urgently... So I thought the mailing list would be a good way to share something rough that might still be helpful to others.

Fabian Reinartz

Dec 11, 2017, 3:52:30 PM
to Prometheus Users
Really cool, thanks for taking the time to analyse this and write it down.


Regarding the OOMing in 2.0 (probably somewhere in PromQL) for queries that execute successfully in 1.8: the changes made to PromQL to fit the new storage interface were definitely non-trivial. I've hit a few performance-debugging cases over time that pointed towards some of those changes being sub-optimal allocation-wise.

More investigation on the querying side would be great for sure – especially since there should be a good deal of optimization potential, e.g. if index lookups are indeed the culprit, they can trivially be parallelised across blocks. I'm also getting more suspicious of the extremes to which we took deferred/lazy evaluation. It is great for saving allocations, for sure – but it might have big caveats for cache-(un)friendliness.
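
A toy sketch of what that per-block parallelisation could look like – Block and lookup below are stand-ins, not the actual tsdb interfaces:

package main

// Toy sketch of parallelising an index lookup across blocks: each block's
// postings lookup runs in its own goroutine and the per-block results are
// merged afterwards. Block and lookup are stand-ins, not real tsdb types.

import (
    "fmt"
    "sync"
)

type Block struct{ name string }

// lookup stands in for resolving a matcher to series references in one block.
func lookup(b Block, matcher string) []uint64 {
    return []uint64{1, 2, 3} // dummy result
}

func lookupAll(blocks []Block, matcher string) []uint64 {
    results := make([][]uint64, len(blocks))
    var wg sync.WaitGroup
    for i, b := range blocks {
        wg.Add(1)
        go func(i int, b Block) {
            defer wg.Done()
            results[i] = lookup(b, matcher) // one lookup per block, concurrently
        }(i, b)
    }
    wg.Wait()

    merged := make([]uint64, 0)
    for _, r := range results {
        merged = append(merged, r...)
    }
    return merged
}

func main() {
    blocks := []Block{{"01B"}, {"01C"}, {"01D"}}
    fmt.Println(lookupAll(blocks, `{job="some_job"}`))
}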

So far I haven't observed issues with index size >> RAM size myself (but that doesn't say much :). Reasoning purely theoretically again: just as for the time series data itself, large portions of the index are likely never used, and mmap should do a decent job of caching the pages that are actually being used. Generally, the lookups for a single query should not scatter completely randomly across the index, i.e. the format is designed for read amplification to be really low. (I'll actually have more on that elsewhere soon :)

I suppose we really would need better insight into where query time is being spent – but the lazy evaluation makes this near-impossible.
It could be worth it to add 'expansion points', e.g. after evaluating the postings lists, looking up series in the index, etc. – if only to have better insight into what is taking how long.
Though we wouldn't want to drop the property of prometheus/tsdb generally having a streaming API, which PromQL unfortunately cannot make use of at this point.


I too would be interested in some real-world reports from HDD users. In _theory_, chunks for the same series are still aligned sequentially on disk (within a block). Additionally, chunks for series of the same metric are also sequentially aligned (which should easily make up for the block-splitting). That could actually give an extra performance boost for HDDs (compared to 1.8). However, since mmap is being used, this is all subject to how reads are actually scheduled by the OS – and we frankly don't know too much about that, and it could be way sub-optimal.