I'm running Prometheus at home to monitor a few VMs on my homelab, as well as my Linodes. My homelab hypervisor runs Arch Linux and libvirtd.
Sometimes I need to patch the OS on the hypervisor and reboot it. The process up until now has been to suspend the guests, reboot the server, and restore the guests. (Yeah, I know, this is not recommended for Prometheus - but it's a homelab, not a production environment. And I've seen weird clock-skew errors on production VMware before... :) )
This has been working for a couple of years at this point. But recently (like, the last month or so) I've been getting Prometheus "Error on ingesting out-of-order samples" errors spewing in the logs after this operation, and no data is stored in Prometheus. Restarting the Prometheus process fixes it immediately; the errors stop and data is stored in the database again.
The problem has something to do with the combination of suspending/resuming the VM and clock synchronization (whether it's the hypervisor -> guest sync or NTP, I don't know). I'm looking for advice on how to debug the problem (or for someone who has already encountered this...).
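One way to narrow this down might be to log when the guest's wall clock gets stepped around a suspend/restore. Below is a minimal sketch, not a tested fix: it assumes that a step shows up as a difference between the wall clock (time.time) and the monotonic clock (time.monotonic), which is never stepped. The names clock_jump and watch, the 2-second tolerance, and the 5-second interval are made up for illustration.

```python
import time


def clock_jump(prev_wall, prev_mono, wall, mono, tolerance=2.0):
    """Return how far the wall clock moved relative to the monotonic clock.

    A positive value means the wall clock was stepped forward, a negative
    value means it was stepped backward. Differences within `tolerance`
    seconds are treated as noise and reported as 0.0.
    """
    drift = (wall - prev_wall) - (mono - prev_mono)
    return drift if abs(drift) > tolerance else 0.0


def watch(interval=5.0):
    """Poll both clocks forever and print a line whenever they disagree."""
    prev_wall, prev_mono = time.time(), time.monotonic()
    while True:
        time.sleep(interval)
        wall, mono = time.time(), time.monotonic()
        jump = clock_jump(prev_wall, prev_mono, wall, mono)
        if jump:
            stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
            print(f"{stamp} wall clock stepped by {jump:+.1f}s", flush=True)
        prev_wall, prev_mono = wall, mono
```

Calling watch() inside the guest before suspending it, and comparing its output against the timestamps of the first out-of-order errors in the Prometheus log, could show whether a backward step lines up with the start of the errors.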
Thanks!
--
Harald