Dear Prometheans,
In the process of migrating from Prometheus 1.8 to 2.0 at SoundCloud, we ran both versions in parallel on our more important Prometheus servers. That allowed us to do some side-by-side comparisons under a typical production workload.
The raw data can be found in this spreadsheet. It has to be taken with several grains of salt:
Let me first describe a few gains from the 2.0 migration that don’t really show up in the benchmark numbers. Given that we have a fairly mature set-up and had already tweaked functional sharding so that we rarely ran into resource problems, operational complexity and OOMing were no longer significant issues for us. But there were more or less noticeable pain points remaining that Prometheus 2.0 addressed, approximately in this order:
And now I’d like to give a summary and interpretation of the results from the benchmark (see the spreadsheet for the raw numbers):
In short, since we are not putting these servers under real pressure, and the memory management of 1.8 and 2.0 is so different, there isn’t a lot to read from the numbers. However, we do expect that we could now run servers with high series churn much hotter, but we still have to try that out. A somewhat surprising result is that we managed to reliably OOM a 2.0 server with an expensive (but practically relevant) query that worked just fine on the corresponding 1.8 server. We still need to investigate where exactly the memory is consumed. The query layer is known to use a lot of memory under certain circumstances, but it is the same in 2.0 as in 1.8. So query load must cause some additional memory usage in the storage layer or in the staleness handling.
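For illustration, here is a hypothetical sketch of the kind of expensive but practically relevant query meant above (the metric and label names are made up; this is not the actual query from our benchmark): a rate over many thousands of series, aggregated by a high-cardinality label, and evaluated as a range query over a long window.

    # Hypothetical sketch only, not the actual benchmark query:
    # per-second request rate over many series, aggregated by a
    # high-cardinality label, run as a range query spanning days.
    sum by (path) (rate(http_requests_total{job="api"}[5m]))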
CPU usage was down by 50% to 75% without query load. It is notable that most of the additional CPU consumption of 1.8 is due to the deletion of expired time series data. It’s quite telling that we didn’t spend a lot of thought on the deletion of expired data when we designed the storage layer for 1.x. (“Deletion is easy, isn’t it?”) In hindsight, we have to realize that deletion is the biggest resource drain in 1.x.
The most spectacular improvement is the I/O utilization as reported by the drive. All servers tested here have SSDs; things would look a bit different on HDDs, but that’s a different story. Reported drive utilization is down by 97.5% to 99.1%. That’s up to two orders of magnitude! And again, most of the utilization on 1.8 is caused by deletion (about 90%). Deleting relatively small amounts of data is something SSDs really hate.
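Converted from percentages into factors (plain arithmetic on the reported numbers, nothing assumed beyond them):

\[
\frac{1}{1 - 0.975} = 40\times, \qquad \frac{1}{1 - 0.991} \approx 111\times
\]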
I was a bit surprised by the results here. Since 1.8 plays a few more tricks with compression, it should, in principle, be a bit better for chunks completely filled with samples. (To be fair, the varbit compression was hacked together fairly quickly without proper research into how much those additional tricks actually contribute.) The big downside of 1.8 is its constant-sized chunks: when a series ends, we get half-filled chunks; if compression works well, we get too many samples in a chunk (bad for querying); and if it doesn’t work well, larger chunks would be more efficient.
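A toy calculation illustrates the half-filled-chunk effect, assuming the fixed 1 KiB chunk size of 1.x and, purely for illustration, 300 samples in a completely filled chunk:

\[
\frac{1024\,\mathrm{B}}{300\ \mathrm{samples}} \approx 3.4\ \mathrm{B/sample}, \qquad \frac{1024\,\mathrm{B}}{150\ \mathrm{samples}} \approx 6.8\ \mathrm{B/sample}
\]

So a chunk that ends up only half filled doubles the effective cost per sample for the samples it holds.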
The results show that even on the system server, with very little churn, we get slightly better compression per sample. On api-mobile, however, I would have expected 2.0 to fare much better, but the difference is negligible. The k8s server seems to have really nasty-to-compress data, and 2.0 fares a lot better than 1.8 there. It would be very interesting to investigate where exactly the ups and downs happen on both systems.
The on-disk index size is about twice as big for 2.0. While 2.0 repeats parts of the index for each block, I assume the main reason is that LevelDB, used in 1.8, compresses the on-disk data with Snappy, while 2.0 writes the data to disk in the same form in which it is mmap’d. That format is compressed to some degree, too, but it’s not surprising that it is less efficient. The flip side is that LevelDB has to unpack the compressed data in memory, which is a pretty big burden on overall performance.
The benchmark has two artificial scenarios (a single time series over the whole retention time, and many time series at a single point in the past), which are supposed to explore certain extremes. In addition, we took a fairly complex query directly from one of our most popular dashboards and ran it over the last 7d as the time range.
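To illustrate the shape of those three scenarios, here are hypothetical PromQL examples (the metric names and expressions are made up and are not the queries we actually ran):

    # (1) A single time series over the whole retention time,
    #     run as a range query spanning several weeks:
    up{job="prometheus", instance="some-host:9090"}

    # (2) Many time series at a single point in the past,
    #     run as an instant query with a large offset:
    count(container_memory_usage_bytes offset 3w)

    # (3) A typical complex dashboard query, run as a range query
    #     over the last 7d: the per-status share of the total
    #     request rate.
      sum by (status) (rate(http_requests_total{job="api"}[5m]))
    / ignoring(status) group_left
      sum(rate(http_requests_total{job="api"}[5m]))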
The full range query generally took longer on 2.0, with a larger difference the longer the retention time. This makes sense, as 2.0 has to access more blocks, while 1.8 probably burns most of its time in the index lookup and then only has to access the one dedicated series file. The only time 1.8 was slower was the cold query on the server with the shortest retention time.
The query of many time series in the past is a fairly artificial scenario, but it exercises interesting code paths. Where the total data size exceeded RAM, the cold query took longer on 1.8 than on 2.0. That can be nicely explained by the many files 1.8 has to open and process, while 2.0 only acts within one block; and since all the series accessed had the same metric name, they also sat close together in that block thanks to the lexicographical sorting within it. The hot query, in turn, was much slower on 2.0, which surprised me. My guess is that it’s the staleness handling, not the storage layer itself, that creates this overhead, as staleness has to be checked separately for each series. But that’s another topic for more investigation.
The differences for the “typical complex dashboard query” over 7d weren’t that big (between 40% and 70%). Perhaps another surprise: 1.8 wins for two out of the three servers. As noted above, 2.0 is in general much faster for shorter time ranges, while 1.8 query times don’t depend much on the range. My working hypothesis is that 1.8 spends a lot of time in the index lookup; once that’s done, the one-file-per-series design makes the rest of the query quite fast, no matter how long the time range is. 2.0 has a much faster index lookup but has to repeat it for each block. The case where 2.0 wins is the one with higher series churn, where 1.8 also has to do many index lookups (different series over time). Where that’s not the case, 1.8 has the edge.
That ends this ad-hoc analysis. As you can see, there is certainly a lot of opportunity for further investigation. But I hope this helps a bit to get an idea of the real-life impact of the migration, at least in a “big dedicated servers” scenario. While a lot got much (and sometimes dramatically) better with 2.0, there are a few surprising nuances in the overall picture.
I’m really curious how users fare who are more resource-constrained, in a shared or cloud environment. Also, I’m interested in what happens once the total index size exceeds RAM, i.e. whether index lookups will involve frequent page faults. And finally, I have a (somewhat academic) interest in HDD behavior.
Happy Prometheing,