Images vs. snapshots: when to use one or the other?


Kirill Katsnelson

Jan 6, 2020, 9:18:28 PM
to gce-discussion
This topic has always confused me, so I'm hoping to get a bit of clarification on this.

I am running a computing cluster under GCE for compute-intensive batch loads. What I'm describing further is a simplified minimal setup that still exposes the essence of the question. Let's assume that there are only 2 types of nodes: the single control VM, responsible for scheduling compute batch loads, and multiple compute nodes, which are preemptible. The compute nodes are created and deleted freely on demand; the control node is always up and watching (and not preemptible, of course). In my architecture, there are two disks attached to every node:

1. The boot disk of an instance, obviously. This is reified from an image upon instance creation. All VMs (control and compute) are created from the single image, prepared by my tools in advance with all services ready to go.

2. The common node software (CNS) disk. This is a non-bootable, shared zonal pd-ssd drive, also preimaged, which is mounted as /opt, read-only, on every node. This one starts its life as a snapshot created by my tools. When the cluster is quiescent for maintenance, I can roll out a new CNS disk by instantiating the disk from the snapshot, detaching the old one from the (terminated) control node, the only node existing at this time (again, only in our simplified exposition, but this is irrelevant), and deleting the old disk. All new transient compute nodes are then created by the control node with the new shared R/O disk attached, to be mounted as /opt during boot by the node.
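For concreteness, the roll-out above can be sketched as a handful of gcloud calls. All names here (the zone, the snapshot, the disk names, the control node) are placeholders, not my actual naming:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Sketch of the CNS disk roll-out; all names are placeholders.
roll_out_cns_disk() {
  local zone=$1 snapshot=$2 new_disk=$3 old_disk=$4 control_node=$5

  # Reify the new CNS disk from its pre-built snapshot.
  gcloud compute disks create "$new_disk" --zone="$zone" \
      --source-snapshot="$snapshot" --type=pd-ssd

  # The control node is terminated during maintenance, so the old
  # disk can be safely detached and deleted.
  gcloud compute instances detach-disk "$control_node" \
      --disk="$old_disk" --zone="$zone"
  gcloud compute disks delete "$old_disk" --zone="$zone" --quiet

  # Attach the new disk read-only; nodes mount it as /opt during boot.
  gcloud compute instances attach-disk "$control_node" \
      --disk="$new_disk" --zone="$zone" --mode=ro
}
```

New compute nodes then simply get the same disk attached in ro mode at creation time.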

This has worked very well for me. But now I am preparing my toolset for an OSS release for the benefit of other researchers in my field. At some point I wondered whether I should try to avoid introducing an extra concept for the users (images vs. snapshots). I am targeting ML scientists, who may be absolutely naive w.r.t. infrastructure management. The simpler it is for them, the better.

What strikes me as odd is how little difference there is between images and snapshots in my use case. One minor point I'm aware of: images are more expensive to store passively but have no instantiation cost, while snapshots are cheaper to store but may incur traffic charges when reifying them as a provisioned disk; but this is minuscule in the overall picture (when you run 24 P100s for hours or days on end, even preemptible, your tab for these dwarfs everything else by 2-3 orders of magnitude). Also, I recall seeing in the blog that even that is going to change (was set to Jan 1, now shifted to Mar 1, IIRC): image storage will become cheaper, but there will be a traffic charge for instantiating an image as a boot disk. This blurs the line between the two types of passive storage even more.

So my question is, what is the best use case for images as opposed to snapshots? One bullet point I know of is that snapshots may form an incremental backup chain for a live drive, while images are totally passive (but this is entirely irrelevant in my case). Aside from this, I cannot really think of any deep, fundamental difference between the two. I understand that a boot disk can be created from either an image or a snapshot, and so can a non-boot disk, like my CNS disk. Am I missing anything here? There seems to be nearly complete overlap between the two types of disk image storage.

Can I use images for both types of disks? Can I use snapshots for both? And should I, i.e. what are the downsides, if any, of choosing one of these solutions for both the individual boot disks and the shared read-only non-boot disk?

Thanks,

 -kkm

Cameron Thomas Otway

Jan 7, 2020, 12:06:48 PM
to gce-discussion

I'm going to try my best to answer your question, so feel free to correct me if I miss any details:


-PD SLOs:


Our PD SLOs state that: “Each device instantiation shall not have more than 43 unhealthy minutes, corresponding to 99.9% of total minutes in a 30 day sliding window, and will not have more than 15 consecutive unhealthy minutes per 30 day sliding window."


Note that an "unhealthy minute" is any minute in which at least one IO operation fails, or takes longer than 2 seconds to complete. 


PDs perform in line with their SLOs. But, in general, transient slowness or unavailability can happen. Of course we try to minimize it, but there is no way to prevent it 100%.


-Snapshots:


Snapshots require data to be available in exactly the same way as reads/writes within a VM. So, if the data is temporarily unavailable, snapshots may fail (typically, hang until the data becomes available again). Generally, snapshots should be entirely consistent. When a snapshot is requested, we freeze the device at that point in time (of course new writes come in, but we keep the old data around too). A snapshot may take a long time to finish if there is unavailability, but there is no reason it will not eventually finish, and it will be entirely consistent with the state of the device *at the time it was requested*. With snapshot locality, all snapshots will be created and stored in the region where you define them to be. [1] Snapshots are ideal for DR (disaster recovery): they are faster to create than images, and smaller, since they do not contain the OS.
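To illustrate the scheduled snapshots described in [1], a snapshot schedule can be created and attached to a disk roughly like this; the policy, disk, region, and zone names below are placeholders, not anything from your setup:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Illustrative only: policy, disk, region and zone names are placeholders.
setup_snapshot_schedule() {
  local region=$1 zone=$2 policy=$3 disk=$4

  # Daily snapshots at 04:00 UTC, retained for 14 days.
  gcloud compute resource-policies create snapshot-schedule "$policy" \
      --region="$region" --start-time=04:00 --daily-schedule \
      --max-retention-days=14

  # Attach the schedule to the disk to be backed up.
  gcloud compute disks add-resource-policies "$disk" \
      --zone="$zone" --resource-policies="$policy"
}
```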


-Images: 


Public images are provided and maintained by Google, so they are kept functional and up to date.

With custom images, you control access to the boot disk, since you're the owner of it. Having everything tailored for your environment gives you the flexibility you need. Storing an image in Cloud Storage is optional; by default, your image will be stored in the multi-region closest to the image source. **“You can create an image from a disk even while it is attached to a running VM instance. However, your image will be more reliable if you put the instance in a state that is easier for the image to capture.”[2]** An image includes the operating system and boot loader, and is typically larger than a snapshot.


-Possible best use case for this scenario: 


Perhaps you can explore instance templates as opposed to creating snapshots from your control node? You can define machine types and boot disk images, or create an instance template based on existing images (creating the R/O /opt dir). [3] On the other hand, it's easier to resize your boot partition with snapshots, and overall they are easier to work with.
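A minimal sketch of such an instance template along the lines of [3]; the template, image, and disk names are placeholders, and the R/O disk is attached as a non-boot disk for the guest to mount as /opt:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Illustrative only: template, image and disk names are placeholders.
make_compute_template() {
  local tpl=$1 image=$2 cns_disk=$3
  gcloud compute instance-templates create "$tpl" \
      --machine-type=custom-4-6144 \
      --image="$image" \
      --disk=name="$cns_disk",mode=ro,auto-delete=no \
      --preemptible
}
```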


[1]: https://cloud.google.com/compute/docs/disks/scheduled-snapshots#snapshot_locality_rules_and_snapshot_labels

[2]: https://cloud.google.com/compute/docs/images/create-delete-deprecate-private-images

[3]: https://cloud.google.com/compute/docs/instance-templates/create-instance-templates


Kirill Katsnelson

Jan 9, 2020, 8:10:16 PM
to gce-discussion
Thank you Cameron for the detailed answer; it's great to get an answer from an SRE, more so such a detailed one! I think I am using the machinery for what it's made for, based on your detailed suggestions.

I had to think about my follow-up for a while, trying to frame a coherent question. I am asking rather out of curiosity, but this is something I could never fully grok.

What I'm trying to understand is why there exist two types of storage of cold bits which can be turned into a live disk, namely the image and the snapshot. Is there a fundamental difference between them, or do they exist only for the convenience of their intended usage patterns?

I think I understand how either of the two is intended to be used in practice, but what, if anything, is the fundamental thing that sets them apart? Suppose I have a provisioned disk. I can image it, or I can snapshot it. Then, from either the image or the snapshot, I can reconstruct the original disk.

In the physical world, we have the USB stick to install the OS and the same USB stick for backups (a bit of exaggeration, but we aren't really far from TB-sized FLASH); but in GCE there are images for the former and snapshots for the latter. Two seemingly different things.

For all I know, with images, some UEFI magic happens on the first boot, but I suspect it's more the VM BIOS than the image machinery at play--is it? Snapshots do form incremental backup chains; images are read-only once created. Storage scopes may be different (an image can be mirrored to all multi-regions; a snapshot can be multi-regional or regional). But is this where the difference really ends?

As for the rest, I roughed down my initial picture to simplify the explanation. The reality is more complex than that.

1. I do pre-image boot disks using Daisy, and base them on the GCE-provided Debian 10 [1]. There are a few big changes that go into it (like preinstalling NVidia drivers, dependency libraries for all software, systemd units for the Slurm services, and many lesser tweaks, such as scripts to pull the Slurm config from metadata). This ensures both a quick node startup on the controller's demand (no startup scripts to run, no tweaks, no installs, everything is ready to go) and software consistency (everything extra is on the shared disk). Actually, I am super-impressed by the new-VM start time in GCE: it takes on average 40s between the request to start and the moment the machine registers with the controller and pulls its workload, for a CPU-only node, and close to 80s for a machine with a P100. I won't name names, but no other major cloud service I tried even comes close!

2. The shared /opt disk: my scripts also build it (again using Daisy), and it's kept as a snapshot. [This is where I kinda stumbled: why do I image one as an image, and the other as a snapshot?] Of course, the disk must be resurrected from its snapshot for anything at all to work, but still, I snapshot them in advance and instantiate when it's time to roll out the updated software. This disk is important to share, in RO mode, to ensure software consistency on all nodes. I can be sure that both the control software (Slurm, which manages the workload) and the actual scientific software used in experiments are consistent everywhere. It is rebuilt more often than the base image.

3. There are other nodes in the cluster that are permanent: besides the control node, there are the login and the file server nodes. But again, they all initially boot from exactly the same common image, and they can be dropped along with their boot disks and recreated on maintenance. Of course, the file server disks are permanent and snapshotted on a schedule, as are the "home" disks of the login nodes. But all boot disks are expendable by design.

4. Instance templates would probably not be of much help, as there are quite a few possible node configurations (currently I have 3: one for the general CPU pipeline part; one for shuffling training examples, which is very I/O-intensive and benefits from local SSDs; and the third is the GPU node, which also comes in different varieties by GPU accelerator, P100, V100, T4, to evaluate the best price/performance on a particular load type). Since all of the compute nodes are instantiated by the central controller, all it takes is a "delta" configuration file describing the node variant to match Slurm's internal description, which is added to the common distributed config. I'm literally posting the content of one of these, the whole two lines of it:

--machine-type: custom-4-6144
--accelerator: type=nvidia-tesla-p100,count=1

This can be used both with gcloud as a --flags-file argument and easily read by Python scripts as well. I'm putting them alongside Slurm's own config files so they always stay in sync. Templates would have to be created via the API, but since Slurm is already using config files, it's simpler to follow suit and also use files.
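For illustration, such a two-line delta file plugs into gcloud roughly like this; the node name, zone, and file path are made up:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Illustrative only: node name, zone and flags file path are placeholders.
create_compute_node() {
  local node=$1 zone=$2 flags_file=$3
  gcloud compute instances create "$node" --zone="$zone" \
      --flags-file="$flags_file" --preemptible
}
```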

Slurm has a concept of nodes being in a "power-save" state, and runs custom programs, which I use to call my quite simple scripts: delete idle nodes going into "power-save", or create a new VM when the controller wants to "wake up" a node. It's stable, powerful and fun software, used by many physical HPC facilities, and has means to recover with backoff from, e.g., a node failing to start. I had to add very small, limited-scope patches so that it shines in my usage pattern, but it would work even out of the box.

 ---
[1] As an SRE, you might be interested in this: the Debian 10 build of the cloud kernel (linux-image-4.19.x-cloud-amd64) lacks RTC support entirely (not even as modules; I think it had CONFIG_RTC_CLASS=n), unlike the similar Ubuntu package. This makes the google-clock-skew-daemon service angry. I noticed that a few months ago and have been replacing the kernel in my images with a "regular" linux-image package since; I do not know how much of an issue the lack of RTC really is, nor whether their cloud kernel build has since added the RTC class. Just letting you know.

 -kkm

Frederic Gervais

Jan 10, 2020, 11:29:13 AM
to gce-discussion

I understand the differences between a disk image and a snapshot can be blurry when looking at the high-level implementations by cloud providers.


Here are the differences at a low level:


An Image is: all the bytes of a hard drive, stored as a file elsewhere.


A Snapshot is: a point in time (to put it simply). A snapshot depends on the content of the disk at the moment the snapshot is made. Explanation: you are using a computer and, poof, you create a snapshot. The size of the snapshot starts at 0 bytes. From that point on, any content that you modify on the disk is not actually written to the disk; it gets written to the snapshot file, which is stored somewhere else. You write a 1 GB file on your disk? The bytes on the disk did not actually change. You will find that the snapshot, storing the delta since it was created, now has a size of 1 GB.


A Backup is: all the bytes of a hard drive stored as a file elsewhere, plus “a point in time”. It differs from a snapshot because, like an image, a backup depends on nothing but itself.


To be clear, you will not be able to restore data using a snapshot if a nuclear bomb lands on your disk. You would be able to create a new disk with the past content if you had created a backup though.


Kirill Katsnelson

Jan 13, 2020, 11:51:31 PM
to gce-discussion
Hi Frederic, thanks for your reply, but I'm afraid I'm even more confused now. I certainly can nuke a disk and recover it from a snapshot! And, as far as I could find, the "Backup" is not a concept in GCE at all. I understand indeed that you use Snapshots to make backups (in the generic sense), and that you use Images to create a new instance's boot disk (I capitalize Image and Snapshot to make them stand out as GCE object types). Let's just try this little experiment: create a disk from an image, snapshot it, nuke it, and then restore it from the snapshot (I've snipped many lines of the YAML descriptions of the image and the snapshot, leaving, I hope, the relevant ones):

$ gcloud compute images describe burrmill-compute-v001-200105

archiveSizeBytes: '564512384'

diskSizeGb: '10'

storageLocations:

- us


$ time gcloud compute disks create test --zone=us-west1-b --image=burrmill-compute-v001-200105 --type=pd-ssd --size=10GB

Created [https://www.googleapis.com/compute/v1/projects/[REDACTED]/zones/us-west1-b/disks/test].

NAME  ZONE       SIZE_GB  TYPE    STATUS

test  us-west1-b  10     pd-ssd  READY

real    0m18.417s


$ gcloud compute disks snapshot test --zone=us-west1-b --snapshot-names=test --storage-location=us

Creating snapshot(s) test...done.


$ gcloud compute snapshots describe test

diskSizeGb: '10'

storageBytes: '564512384'

storageLocations:

- us


$ gcloud compute disks delete test -q --zone=us-west1-b

Deleted [https://www.googleapis.com/compute/v1/projects/[REDACTED]/zones/us-west1-b/disks/test].


$ time gcloud compute disks create test --zone=us-west1-b --source-snapshot=test --type=pd-ssd --size=10GB

Created [https://www.googleapis.com/compute/v1/projects/[REDACTED]/zones/us-west1-b/disks/test].

NAME  ZONE       SIZE_GB  TYPE    STATUS

test  us-west1-b  10     pd-ssd  READY

real    0m23.606s


What's curious here is that the underlying cold storage size looks byte-for-byte exact between the original image and the snapshot. This does not seem a random coincidence to me. :) The number is somewhat more round in binary, 564512384=0x21A5C680, i.e. it ends in a whopping 7 LSB zeros, but that is about as much margin for rounding error as this observation admits: ±128 bytes.

I would not read too much into this one-point statistic comparing the provisioning time from the image and from the snapshot, but I'd say 18s and 24s land in the same ballpark, too.

So, in a sense, the question boils down to this: do I gain or lose anything from preparing the cold, ready-to-be-provisioned bytes of a non-bootable disk as an Image, as opposed to a Snapshot, provided that the disk is read-only for its lifetime and won't ever need to be backed up? Or does the line get blurred to complete oblivion in this particular use case?

Thanks,

 -kkm

Nicholas Elkaim

Jan 14, 2020, 1:36:59 PM
to gce-discussion
For the purpose of storing data to be used as a disk image (not an "Image" image), the lines kinda get blurred to the point where the differences aren't that big anymore. The biggest differences at this point would be that snapshots are cheaper ($0.03 per GB/month vs $0.085) and usually faster to create than images.

Kirill Katsnelson

Jan 14, 2020, 2:33:54 PM
to gce-discussion
Yeah, I also feel this way. There are a few little tie-breakers toward using the snapshot: using things for what they are intended; no need to change the process (it works, so don't fix it); and this pricing point, too.

Thank you Nicholas, Frederic, Cameron, everyone who took time to think about and answer my question. GCE is awesome!

 -kkm