How to deduplicate data while using HA Prometheus and VictoriaMetrics?


Tapas Mohapatra

May 23, 2019, 12:40:49 PM
to victorametrics-users
I want to try VictoriaMetrics as long-term storage for an HA pair of Prometheus instances. How do I deduplicate the data so that only one copy goes into VictoriaMetrics?

Aliaksandr Valialkin

May 23, 2019, 6:30:54 PM
to Tapas Mohapatra, victorametrics-users
Hi Tapas,

VictoriaMetrics doesn't do deduplication, but you can create an HA setup:
- Run a pair of VictoriaMetrics instances in different availability zones (datacenters).
- For each Prometheus HA pair, configure the first Prometheus instance to write to the first VictoriaMetrics instance and the second Prometheus instance to write to the second one (see the remote_write sketch after this list).
- Set up Promxy in front of the VictoriaMetrics pair and query all the data via Promxy.
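
As a minimal sketch of the second item above (vm-1:8428 and vm-2:8428 are placeholder addresses for the two VictoriaMetrics instances, not anything from this thread), the remote_write section of prometheus.yml on the first Prometheus replica could look like this:

```yaml
# prometheus.yml on the first Prometheus replica of the HA pair.
# vm-1:8428 is a placeholder address for the first VictoriaMetrics instance;
# the second replica uses the same block with vm-2:8428 instead.
remote_write:
  - url: http://vm-1:8428/api/v1/write
```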

This setup remains available under the following conditions:
- If a single Prometheus from an HA pair is down, the second Prometheus continues writing data to its VictoriaMetrics.
- If a single VictoriaMetrics instance is down, the second one continues accepting data from the Prometheus instances.

Promxy will take care of data merging and de-duplication during query execution.
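
A minimal Promxy sketch for this layout, roughly following the server_groups format from the Promxy README and reusing the placeholder addresses above, could look like this:

```yaml
# promxy config: query both VictoriaMetrics instances and let Promxy
# merge and deduplicate the results at query time.
# vm-1:8428 / vm-2:8428 are the placeholder addresses from the sketch above.
promxy:
  server_groups:
    - static_configs:
        - targets:
            - vm-1:8428
            - vm-2:8428
```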





--
Best Regards,

Aliaksandr

Tapas Mohapatra

May 23, 2019, 8:05:07 PM
to Aliaksandr Valialkin, victorametrics-users
Thanks a lot, Aliaksandr. I like the DB, as it's simple and easy to use. Most importantly, it runs the same PromQL, so users don't have to learn a new language.

As our requirement is to store 1 year of data worth a few TBs, how can we do downsampling, or is it already done by VictoriaMetrics?

Aliaksandr Valialkin

May 23, 2019, 8:17:28 PM
to Tapas Mohapatra, victorametrics-users
On Fri, May 24, 2019 at 3:05 AM Tapas Mohapatra <tapasmo...@gmail.com> wrote:
As our requirement is to store 1 year of data worth a few TBs, how can we do downsampling, or is it already done by VictoriaMetrics?

VictoriaMetrics stores raw data without downsampling. We plan to implement downsampling in the future via Prometheus recording rules, so VictoriaMetrics could write the results of recording rules into external storage with a higher retention.
VictoriaMetrics has a good compression ratio for the ingested data: each data point occupies less than 0.5 bytes on disk on average. This means 1TB of storage can hold about 2 trillion typical data points. Try estimating the required storage size for 1 year worth of data in your case; see the rough sketch below. It is likely to be much smaller than you originally expected, so all the raw data may be stored without downsampling.
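
For illustration, here is a rough back-of-the-envelope estimate; the series count and scrape interval below are made-up placeholders, and only the 0.5 bytes/sample figure comes from the paragraph above:

```python
# Rough storage estimate for 1 year of raw data.
# active_series and scrape_interval_sec are hypothetical placeholders;
# bytes_per_sample is the average on-disk size mentioned above.
active_series = 1_000_000
scrape_interval_sec = 15
bytes_per_sample = 0.5

seconds_per_year = 365 * 24 * 3600
samples_per_year = active_series * seconds_per_year / scrape_interval_sec
storage_bytes = samples_per_year * bytes_per_sample

print(f"~{samples_per_year:.2e} samples/year, ~{storage_bytes / 1e12:.2f} TB on disk")
# With these placeholder numbers: ~2.10e+12 samples/year, ~1.05 TB on disk.
```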

Tapas Mohapatra

May 23, 2019, 8:21:21 PM
to victorametrics-users
I was looking for downsampling so that query performance can be better. I read your post, and thanks for the quick response. Do you have any ETA for implementing that feature?
Do you also have a clustered solution? If yes, what's the cost?

Aliaksandr Valialkin

May 23, 2019, 8:33:58 PM
to Tapas Mohapatra, victorametrics-users
On Fri, May 24, 2019 at 3:21 AM Tapas Mohapatra <tapasmo...@gmail.com> wrote:
I was looking for downsampling so that query performance can be better. I read your post, and thanks for the quick response.
VictoriaMetrics is optimized for heavy queries: it can scan up to 50 million data points per CPU core, and its performance scales linearly with the number of available CPUs. So it should work reasonably fast on big date ranges such as 1 year. It downsamples the raw data at query time, so the response usually contains a few hundred data points, which is enough for displaying in Grafana. The number of returned data points depends on the `start`, `end` and `step` arguments passed to the API and may be roughly calculated as (end - start) / step.
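
To make the (end - start) / step estimate concrete, here is a tiny sketch with arbitrary example step values (not anything specific to this thread):

```python
# Approximate number of points returned per series by a range query
# over a 1-year interval, computed as (end - start) / step.
# The step values are arbitrary examples.
range_seconds = 365 * 24 * 3600  # 1 year

for step_seconds in (3600, 6 * 3600, 24 * 3600):  # 1h, 6h, 1d
    points = range_seconds // step_seconds
    print(f"step={step_seconds}s -> ~{points} points per series")
# step=3600s  -> ~8760 points
# step=21600s -> ~1460 points
# step=86400s -> ~365 points
```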

Do you have any ETA for implementing that feature?

Not yet.
 
Do you also have a clustered solution? If yes, what's the cost?

The cluster solution is free and open source; it is available in the cluster branch. Read about our commercial offerings here.
