vmrestore and time series not showing in grafana


Mike Cammilleri

Apr 5, 2022, 12:58:36 PM
to victorametrics-users
Hello,

We currently take daily snapshots and then use the vmbackup-prod tool to back them up to our on-prem S3. 
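For context, the backup flow is roughly this (paths, bucket, and endpoint below are placeholders):

# create a snapshot on the production VM host; the response contains the snapshot name
curl http://localhost:8428/snapshot/create
# upload that snapshot to our on-prem S3
/usr/local/bin/vmbackup-prod -storageDataPath=<storageDataPath> -snapshotName=<snapshot-name-from-above> -customS3Endpoint=https://<our-s3-endpoint> -dst=s3://<bucket>/<path>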

I'm trying to test a restore, so I have set up a new VM host (so we are not trying to restore to the same production VM server) and also put Grafana on this test host. Using the vmrestore-prod tool, we perform the restore from S3. All the data appears to restore, and the data is the same size as on the production host. Grafana on the test host has its data source set up to point to the victoria_metrics server that's running on the test host, and Grafana does display the data (just node_exporter information).
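The restore step is roughly this (VM is stopped on the test host while vmrestore-prod runs; same placeholders as above):

# pull the backup from on-prem S3 into the test host's data directory
/usr/local/bin/vmrestore-prod -customS3Endpoint=https://<our-s3-endpoint> -src=s3://<bucket>/<path> -storageDataPath=/newvmdata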

The problem is that it only goes back 30 days. When I try to go back farther, no data is rendered. Our retention period is set to 24 months in production and on the test VM host.

Question: What is the best approach to begin troubleshooting this problem? I've been looking at the expanded query results in Grafana's Explorer, and the response series is just empty when setting the time range to anything greater than 30 days. The data appears to be in the storageDataPath on the file system.

Any hints appreciated.

Thanks!

Mike Cammilleri

Apr 5, 2022, 1:34:19 PM
to victorametrics-users
Update:

Grafana debug log shows
grafana-server: logger=query_data t=2022-04-05T12:30:10.81-0500 lvl=dbug msg="Processing metrics query" query="unsupported value type"

This seems similar to the bug report by simonszu from 28 days ago here:
https://github.com/VictoriaMetrics/VictoriaMetrics/issues/2153

Of course, on production, Victoria Metrics as a data source works fine and can go back the full 24 months.
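For reference, the debug output above comes from raising Grafana's log level in grafana.ini, roughly like this (the exact file location depends on the install):

[log]
level = debug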

hage...@gmail.com

Apr 8, 2022, 3:53:03 PM
to victorametrics-users
Hello! Could you please reset the caches and try your query again?
Please note, resetting the caches requires a VM restart - see https://docs.victoriametrics.com/#cache-removal
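For example, assuming your storageDataPath and a systemd-managed service:

# create the marker file, then restart - VM drops its caches on the next start
touch <storageDataPath>/cache/reset_cache_on_startup
systemctl restart <victoria-metrics-unit>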

Mike Cammilleri

Apr 11, 2022, 1:37:51 PM
to victorametrics-users
Thank you for this reply. I have performed a new restore (from last night's snapshot) and tried putting the reset_cache_on_startup file in <data_dir>/cache/, but the magic number in Grafana is still 41 days, past which no graphs render - except, oddly, the "Disk Space Used Basic" and "Disk Space Used" panels do have time series. I'm so stumped!

hage...@gmail.com

Apr 17, 2022, 10:23:32 AM
to victorametrics-users
What is the `-retentionPeriod` flag value on the node where you restore from the backup?
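If I'm not mistaken, the effective flag values are also exposed as `flag` metrics on the /metrics page, so you can double-check on the restore node with something like:

curl -s http://<vm-test-host>:8428/metrics | grep 'flag{name="retentionPeriod"'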

Mike Cammilleri

Apr 18, 2022, 9:59:32 AM
to victorametrics-users
-retentionPeriod 24

This value is the same as on our production host.

hage...@gmail.com

Apr 20, 2022, 5:50:26 AM
to victorametrics-users
If you export raw series for a period of 2 months - can you see datapoints earlier than 41d?

Mike Cammilleri

Apr 20, 2022, 5:23:45 PM
to victorametrics-users
Thanks for this suggestion. I am able to perform a native export of time series for short periods, like 1m, 5m, 1w, but when I go up to something like 8w, victoria-metrics-prod crashes. There seems to be a lot of pre-processing happening before anything gets output to my file.

Here's the running process:
/usr/local/bin/victoria-metrics-prod -dedup.minScrapeInterval=15s -storageDataPath /newvmdata -retentionPeriod 24 -search.maxExportSeries 23501600000 -httpListenAddr <vm-ip>:8428

Here's the API call I made:
curl -G -g 'http://<vm-server>:8428/api/v1/export/native?match[]={__name__=~".*"}&start=-8w' > /newvmdata/export/export.bin

This is running on a host with 40 CPUs and 32G of memory, running Oracle Linux 7.9. Our tsdb is about 1.5T. Interestingly, I am seeing 100% swap utilization when running the export.
victoria-metrics-prod --version
victoria-metrics-20220412-133902-tags-v1.76.1-0-gf8de318bf
 
When the crash happens, log output looks like this:
Apr 20 15:51:35 <host> victoria-metrics-prod: github.com/VictoriaMetrics/VictoriaMetrics/lib/storage.(*partition).partsMerger(0xc0084fe180, 0xc00ce0df78)
Apr 20 15:51:35 <host> victoria-metrics-prod: github.com/VictoriaMetrics/VictoriaMetrics/lib/storage/partition.go:960 +0x185
Apr 20 15:51:35 <host> victoria-metrics-prod: github.com/VictoriaMetrics/VictoriaMetrics/lib/storage.(*partition).bigPartsMerger(0xc0084fe180)
Apr 20 15:51:35 <host> victoria-metrics-prod: github.com/VictoriaMetrics/VictoriaMetrics/lib/storage/partition.go:910 +0x4f
Apr 20 15:51:35 <host> victoria-metrics-prod: github.com/VictoriaMetrics/VictoriaMetrics/lib/storage.(*partition).startMergeWorkers.func2()
Apr 20 15:51:35 <host> victoria-metrics-prod: github.com/VictoriaMetrics/VictoriaMetrics/lib/storage/partition.go:903 +0x25
Apr 20 15:51:35 <host> victoria-metrics-prod: created by github.com/VictoriaMetrics/VictoriaMetrics/lib/storage.(*partition).startMergeWorkers
Apr 20 15:51:35 <host> victoria-metrics-prod: github.com/VictoriaMetrics/VictoriaMetrics/lib/storage/partition.go:902 +0x9c
Apr 20 15:51:35 <host> victoria-metrics-prod: goroutine 1283 [select]:
Apr 20 15:51:35 <host> victoria-metrics-prod: github.com/VictoriaMetrics/VictoriaMetrics/lib/storage.(*partition).partsMerger(0xc0084fe180, 0xc00ce08f78)
Apr 20 15:51:35 <host> victoria-metrics-prod: github.com/VictoriaMetrics/VictoriaMetrics/lib/storage/partition.go:960 +0x185
Apr 20 15:51:35 <host> victoria-metrics-prod: github.com/VictoriaMetrics/VictoriaMetrics/lib/storage.(*partition).bigPartsMerger(0xc0084fe180)
Apr 20 15:51:35 <host> victoria-metrics-prod: github.com/VictoriaMetrics/VictoriaMetrics/lib/storage/partition.go:910 +0x4f
Apr 20 15:51:35 <host> systemd: victoria_metrics.service failed.

I'm hoping there isn't something else wrong with our Victoria Metrics implementation that may be more of a root cause? Thanks again.

hage...@gmail.com

May 2, 2022, 10:59:29 AM
to victorametrics-users
The log seems incomplete. I'm afraid VM is being killed by the OOM killer due to high memory usage. This may happen because of the too-expensive query "{__name__=~".*"}". Is it possible to export only one specific series which you know for sure is missing after 41d?
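If the host keeps kernel logs, the OOM kill should be visible there - something like:

# look for the kernel's OOM-kill record around the time of the crash
dmesg -T | grep -i 'out of memory'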

Sorry for the late response, I was on PTO.

Mike Cammilleri

May 2, 2022, 2:57:01 PM
to victorametrics-users
Hey - some progress. I did an export of node_cpu_seconds_total for the past 8 weeks.
curl -g -G 'http://<vm-prod-server>:8428/api/v1/export/native?match[]={__name__=~"node_cpu_seconds_total"}&start=-8w' > /home/mikec/export.bin

I then imported it into the test host with a new label applied.
curl -X POST http://<vm-test-host>:8428/api/v1/import/native?extra_label=foo=bar -T /home/mikec/export.bin

In Grafana I can render data going back 56 days (the full 8 weeks I imported). I first tried bringing this metric up in Explorer, which worked. I then brought up the usual Node Exporter dashboard (Node Exporter Full) and it will also render the node_cpu_seconds_total metrics in the panels.

So, I'm wondering what the difference is between the full restore and this individual import with a new label - besides, obviously, the new label itself. Thanks for your assistance thus far.

Mike Cammilleri

May 2, 2022, 5:18:06 PM
to victorametrics-users
And to clarify - when I perform an import of the same data without adding extra_label=foo=bar, Grafana cannot render anything from day 41 and older on that time series. If I do set the extra_label parameter during import, I can render time series older than 41 days.

I'm not sure this is relevant, but our Prometheus has two instances scraping the same hosts as an HA pair, and they both remote write to our Victoria Metrics host which has the -dedup.minScrapeInterval=15s set. Our two Prometheus hosts have identical prometheus.yml configurations.
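For reference, the relevant part of both prometheus.yml files is essentially this (host is a placeholder):

remote_write:
  - url: http://<vm-host>:8428/api/v1/write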

hage...@gmail.com

May 5, 2022, 9:42:36 AM
to victorametrics-users
Thanks for the response!
From what you're saying, it looks like the backup restore works properly. The data export/import proves that the database contains those values for 56d. The problem seems to be in querying the data: for some reason, VM does not return you data older than 41d, and it looks like a cache problem. That adding a new label to the series during the import makes the data visible only confirms this to me.

> oddly, the "Disk Space Used Basic" and "Disk Space Used" panels do have time series

Does it always display only those panels? What happens if you refresh the dashboard multiple times - do the results remain consistent?

Can you pick one query which is supposed to return data for all 56d and add a `nocache=1` GET param to it?

Thanks!

Mike Cammilleri

May 5, 2022, 2:46:26 PM
to victorametrics-users
I tried a query on both a metric I can't render in the dashboard (node_cpu_seconds_total) and the one seemingly random metric that does render, node_filesystem_avail_bytes.

Picking timestamps that form a small range back on January 4, 2022 still brings up an empty set:
curl -G http://<vm-test-host>:8428/api/v1/query_range -d 'query=node_filesystem_avail_bytes{instance="<hostname>:9100"}' -d 'start=1641341963' -d 'end=1641356363' -d 'nocache=1'
{"status":"success","data":{"resultType":"matrix","result":[]}}

However, in Grafana I can get the graph to render data points for node_filesystem_avail_bytes when I define the time range as "From: now-150d" and "To: now" (any number larger than 41 days works; in this case I used 150), but when I use specific dates/times in those fields (e.g. 2022-01-04 00:00:00 to 2022-01-06 23:59:59) it will not render. The graph just says "no data."

When running the same query for node_cpu_seconds_total I get the expected result of no data every time. Setting the time range to "From: now-150d" makes no difference for this (and most other) metrics:
curl -G http://<vm-test-host>:8428/api/v1/query_range -d 'query=node_cpu_seconds_total{instance="<hostname>:9100"}' -d 'start=1641341963' -d 'end=1641356363' -d 'nocache=1'
{"status":"success","data":{"resultType":"matrix","result":[]}}

It appears that something is not quite right with the timestamps? I've tried cache clearing, but no change. Are there other things to consider with backfilling and timestamps?
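If it helps, I can also dump the raw samples with the JSON export endpoint to eyeball the timestamps directly (same placeholders as above; the timestamps in the output should be millisecond values):

curl -G http://<vm-test-host>:8428/api/v1/export -d 'match[]=node_cpu_seconds_total{instance="<hostname>:9100"}' -d 'start=1641341963' -d 'end=1641356363'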

hage...@gmail.com

May 6, 2022, 6:15:13 AM
to victorametrics-users
> Are there other things to consider with backfilling and timestamps?

No, it should just work.

I've run out of ideas :-(
I'll try to share this case with my teammates.

Mike Cammilleri

May 10, 2022, 12:21:16 PM
to victorametrics-users
Thanks for your help. I set up the restored data as a data source in our production environment - and our production Grafana can render the data going back 2 years. So this is not a Victoria Metrics problem - I think this is a Grafana issue. It is still bothersome, since in a disaster recovery scenario I should be able to use a fresh Grafana install on a new instance to read this data. I have tried several new Grafana server instances on different hosts and none can read the data past 41 days, yet the production Grafana can. I cannot find any differences - and it fails with simple queries in Grafana's Explorer, so it's not dashboard related. I can take this up with the Grafana community. Thank you!