VMware Snapshot Best Practice

Mandy Geise

Aug 3, 2024, 12:51:30 PM

to ricomadin

While snapshots should definitely not be considered long-term backups, they do work well in the short term. Best practice is to delete snapshots no later than 72 hours after they were created. If you can verify the Windows updates within that time period, snapshots would be excellent for your use case. The best method would be to script the snapshot creation AND DELETION (you do not want to forget to delete snapshots, or to miss one or more while deleting them manually). My scripting method is to use PowerCLI (a VMware extension to PowerShell).
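To make the 72-hour rule concrete, here is a minimal sketch of the age check a cleanup script would need. It is plain Python rather than PowerCLI, and it does not talk to vCenter at all: the `SnapshotRecord` type, its field names, and the sample inventory are all hypothetical stand-ins for whatever snapshot list you export from your own tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# The 72-hour ceiling comes from the advice above; adjust it to your own policy.
MAX_SNAPSHOT_AGE = timedelta(hours=72)


@dataclass(frozen=True)
class SnapshotRecord:
    """Hypothetical inventory row; the field names are illustrative only."""
    vm_name: str
    snapshot_name: str
    created: datetime


def find_stale_snapshots(snapshots, now, max_age=MAX_SNAPSHOT_AGE):
    """Return the snapshots that have reached max_age, oldest first."""
    stale = [s for s in snapshots if now - s.created >= max_age]
    return sorted(stale, key=lambda s: s.created)


# Made-up sample data, not taken from a real environment.
now = datetime(2024, 8, 3, 12, 0)
inventory = [
    SnapshotRecord("web01", "pre-patch", now - timedelta(hours=5)),
    SnapshotRecord("sql01", "pre-patch", now - timedelta(hours=80)),
    SnapshotRecord("dc01", "pre-patch", now - timedelta(days=14)),
]

for snap in find_stale_snapshots(inventory, now):
    age_hours = (now - snap.created) / timedelta(hours=1)
    print(f"{snap.vm_name}/{snap.snapshot_name}: {age_hours:.0f} h old, overdue for deletion")
```

Sorting oldest first means the report leads with the snapshots most likely to have been forgotten. The actual deletion step is left out on purpose, since it depends on your environment.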

Recently I had a long discussion with my customer about problems in their VMware environment. We needed to find a solution for the problems below, (almost) all caused by snapshots: insufficient disk space on the datastore, problems with the backup of VMs...

EDIT: Download RVTools (free) if you want to get a list of the snapshots and their details; just enter the host or vCenter IP. It can give you a lot more details than just snapshots, so it is very handy to have in your toolbox.

VMware snapshots are good, but they involve manual action, and the huge downside is that performance decreases, especially when the file system changes, which it will. The longer the snapshot exists, the lower the performance and the more space you will use on your storage.

If you want to be able to do proper rollbacks, you would need to take offline snapshots, meaning you shut the system down and then take the snapshot. Otherwise you risk data corruption, especially on database-based systems like SQL and Exchange.

By the way, an advantage of automated storage-based snapshots is that you can roll back the whole cluster, or any VM that resides on it, to a certain point in time. This would be a huge help if some crypto-virus messed up all your systems. You could be back up and running within minutes.

Another advantage is that you can use the storage snapshots in combination with some hardware configuration to create a test environment. I wrote up a blog article on how to do this here: Build your own lab environment with VMware - IT-Admins

Thank you for the quick responses!! I have managed Windows updates manually since this last incident. We have a WSUS server sitting on a 2k3 machine, but I don't have much experience using WSUS. I would love to set one up on a 2012 server we have; it would sure beat going into each VM and clicking which updates I want to use. I am going to read the blog frossmark wrote about creating a lab environment. And thanks for the tool reference, BSOD. Can anyone tell me if there is downtime involved when reverting to an older snapshot or deleting and consolidating snapshots?

I'm looking to see if anyone has developed a best practice with regard to prioritizing the servers in the environment and creating a volume collection to capture snapshots for them. My original thought was to create High, Medium, and Low priority volume collections and set snapshot schedules for each of them. I am still fairly new to the way that Nimble captures and stores its snapshots, and I have been reading some other community posts that explain the differences between Dell, EMC, etc. and Nimble to get a better idea of how Nimble separates itself from the others.

95% of the clients we're putting SANs in do not have very diverse setups (Exchange and SQL databases sitting on different volumes than the VM itself, etc.). I read an article stating that 50% of clients that have installed a Nimble are storing over a month's worth of snapshots. I am used to working with other vendors in a previous job, where you'd store maybe one week's worth before purging the oldest.

Has anyone developed this type of schedule (high, medium, low) and prioritized the volumes into a collection for snapshot storage? If so, how often are you capturing snapshots in each collection? How many are you retaining? Is there a better way that I should be thinking about the snapshot retention for the smaller to mid-sized businesses?

Each company is different, so what's best for one might not work for another. In our case, based on different applications, we set up a combination of hourly, daily, weekly, monthly, and yearly snapshot schedules, and also set up retention plans based on different requirements on the primary and DR sites. We have almost completely gotten rid of traditional backups (we only do a tape backup once a year for longer retention).

Each application/service typically has a unique retention requirement. With your Nimble array you can set up widely different policies for single or multiple volumes. While I was IT Director at Money Mailer I used policies that made sense for my unique environment.

For Exchange 2010 (Win2K8R2 VM on ESX with five 1TB volumes mounted with the Nimble Windows Toolkit - iSCSI initiator in the VM) all the volumes were added to a volume collection with a synchronized (VSS enabled) snapshot schedule that ran every 6 hours and replicated to my DR site. I would keep those for 7 days. In that same policy I also had a job that ran on Sunday night, with those snapshots kept for 52 weeks (used for legal discovery).

For SQL 2005 (similar config to Exchange) the DB and LOG volumes were included in a SQL volume collection. Those ran every 4 hours with synchronization (VSS enabled). I kept 7 days of 4hr snaps locally but only kept the last 6 snaps at the DR site.

For volumes used for file/art data (Money Mailer is an advertising company) I would snap every hour and keep 7 days locally and 2 at DR; I would also snap Sunday night and keep 52 locally and 2 at the DR site.
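As a rough way to compare the local schedules described above, the sketch below works out how many snapshots each one holds at steady state (interval divided into retention). The function name and the labels are my own. The exact count on a real array can differ by one depending on when pruning runs, so treat the figures as estimates. The DR-side numbers are left out because the posts do not always say whether they are counts or days.

```python
from datetime import timedelta


def snapshots_retained(interval: timedelta, retention: timedelta) -> int:
    """Approximate steady-state count: one snapshot per interval, each kept for retention."""
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    return retention // interval  # timedelta // timedelta yields an int


# Local schedules as described in the posts above (label, interval, retention).
local_schedules = [
    ("Exchange, every 6 hours, kept 7 days", timedelta(hours=6), timedelta(days=7)),
    ("Exchange, weekly, kept 52 weeks", timedelta(weeks=1), timedelta(weeks=52)),
    ("SQL, every 4 hours, kept 7 days", timedelta(hours=4), timedelta(days=7)),
    ("File/art data, hourly, kept 7 days", timedelta(hours=1), timedelta(days=7)),
]

for label, interval, retention in local_schedules:
    print(f"{label}: about {snapshots_retained(interval, retention)} snapshots")
```

This gives roughly 28, 52, 42, and 168 snapshots respectively, which shows how quickly an hourly schedule outgrows the others even at the same 7-day retention.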

I'd have to echo Frank and Jason's responses. I think you'll find the Nimble protection templates to be an excellent complement to the data protection plan for a number of different scenarios. Especially when you start looking at utilizing them for Dev/Test situations, you'll really start to appreciate the benefits of redirect-on-write snapshots.

Start out with a simple template, get familiar with the interface and options. I'd be willing to bet after a week of testing out the capabilities you'll start to see why there are customers holding on to a large number of snapshots.

Don't forget to check out InfoSight->Data Protection->Planning after you've got a week of snaps or so! It really shows you how efficient the snapshots are and takes the guesswork out of bandwidth required for a DR site.

I pretty much agree with what everyone else has said. I've got a pretty intense snapshot schedule on my system (might be a little over the top), but it has worked pretty well. Here are a few things to watch out for:

The latency you speak of will be the duration of the database I/O freeze that occurs. This can vary a lot depending on the Nimble OS version on the array and the software used to take the snapshot. These are roughly the timings I've found for VSS-consistent snaps on SQL (on 2 volumes):

Keeping in mind the 10s hard limit for VSS, on 1.4.x we'd see occasional timeouts with native and regular timeouts with Commvault (Commvault snaps disks in series rather than in parallel). It appeared the variations were due to how busy the array was at the time. After upgrading to OS 2.1.x we observe lower and much more consistent times. Commvault supports Nimble replication on 2.1.x and actually relies on a Nimble volcol, which is why the snap&repl engine is faster than the snap-only engine, which still snaps in series.
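The series-versus-parallel point can be illustrated with a deliberately simplified model: if volumes are snapped one after another, the freeze lasts roughly the sum of the per-volume times, while snapping them together costs roughly the slowest one. This is only a sketch of that reasoning, not how Commvault or Nimble measure anything, and the per-volume durations below are invented for illustration rather than taken from the timings mentioned above.

```python
# The 10-second ceiling is the VSS hard limit mentioned above.
VSS_FREEZE_LIMIT_SECONDS = 10.0


def freeze_window(per_volume_seconds, parallel):
    """Estimate the I/O freeze: slowest volume if parallel, sum of all if in series."""
    durations = list(per_volume_seconds)
    if not durations:
        return 0.0
    return max(durations) if parallel else sum(durations)


def fits_vss_limit(per_volume_seconds, parallel, limit=VSS_FREEZE_LIMIT_SECONDS):
    """True when the estimated freeze stays within the limit."""
    return freeze_window(per_volume_seconds, parallel) <= limit


# Hypothetical per-volume snapshot times in seconds (DB and LOG volumes).
sample_durations = [4.0, 7.0]

for mode, parallel in (("series", False), ("parallel", True)):
    window = freeze_window(sample_durations, parallel)
    verdict = "within limit" if fits_vss_limit(sample_durations, parallel) else "would time out"
    print(f"{mode}: {window:.1f} s, {verdict}")
```

With these made-up numbers the series case overshoots the limit while the parallel case does not, which matches the pattern of timeouts described for the engine that snaps in series.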

I haven't tried doing the quiescing snapshots on SQL since I was on 1.x. I'm on 2.1.7 now, so I'll give it another shot. I think the issue was that I'm using VMDKs for my SQL volumes, not direct-iSCSI-mounted volumes. So I had Nimble-->VMware-->VSS-->VMware snapshot-->Nimble snapshot-->VMware remove snapshot-->VSS remove snapshot as the process flow, which overall took too long. At least I think that's how that works.

Would love to know how you get on after trying it again. It seems like more users are doing what you are with SQL on VMs, so it would be good to hear from others whether they have come across the same issue or whether their setup is working fine.

VMware Snapshot is a native VMware solution for quickly and easily safeguarding the data in your virtual machines (VMs). It is also one of the most popular data protection tools for virtual machines.

The VMware Snapshot feature is straightforward to navigate and offers all the basic functionalities inside the vSphere environment. Below, we will tackle all of the snapshot operations and dive deep into them one-by-one.

One example of a live state restoration scenario would be opening a text file and taking a memory snapshot. Afterward, the VM with the open text file goes into a blue screen. You may restore your VM from the memory snapshot taken before the blue screen incident, and it will revert to the live state where the text file is open.

To create this type of snapshot, check the option Quiesce guest file system and press OK. This option will also require a powered-on VM, and it will tell you that the VM needs VMware Tools installed, so install it first before taking a snapshot.

You MUST have the VMware PowerCLI module installed before running the cmdlets below. Once it is installed, import the module. Open PowerShell with elevated permissions (Run as Admin) and run the following commands to get started.

Administrators cannot correctly execute certain VMware functions if a VM has an existing snapshot. Examples include increasing disk space (corruption on the snapshot may occur) and performing storage vMotion migration.

For example, if you try to delete Snap_2 using Delete or Delete All functions, the consolidation process will not take place, and the process will instead discard the snapshot together with its snapshot files.

There will be instances where reverting or deleting snapshots will fail due to errors on the snapshots themselves. There are countless reasons why failure occurs, but one of the most common is disk descriptor file inconsistencies. A common method to address this issue is to perform a manual snapshot consolidation on the VM.