Gsutil Cp Skip Existing

Ene Vinson

Jul 25, 2024, 7:28:00 PM
to westvitmure

The gsutil cp command allows you to copy data between your local filesystem and the cloud, copy data within the cloud, and copy data between cloud storage providers. For example, to copy all text files from the local directory to a bucket you could do:
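A minimal sketch of such a command, wrapped in a shell function so the paths can be swapped out. The function name and the gs://my_bucket URL in the usage comment are placeholders chosen for illustration, not names taken from the gsutil help text.

```shell
# Upload every .txt file in a local directory to a bucket.
upload_text_files() {
  # $1: local directory, $2: destination bucket URL (both supplied by the caller)
  gsutil cp "$1"/*.txt "$2"
}

# Hypothetical usage:
#   upload_text_files . gs://my_bucket
```

The shell expands the *.txt wildcard before gsutil runs, so only matching local files are passed to the command.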

You can pass a list of URLs (one per line) to copy on STDIN instead of as command line arguments by using the -I option. This allows you to use gsutil in a pipeline to upload or download files / objects as generated by a program, such as:
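For instance, find can play the role of the program that generates the list. This is only a sketch: the function name, the search pattern, and the bucket URL are assumptions made for illustration, while -m (parallel operations) and -I (read URLs from STDIN) are existing gsutil flags.

```shell
# Feed a generated list of local files to gsutil on STDIN.
upload_found_files() {
  # $1: local directory to search, $2: file name pattern, $3: destination URL
  find "$1" -type f -name "$2" | gsutil -m cp -I "$3"
}

# Hypothetical usage:
#   upload_found_files ./logs '*.log' gs://my_bucket
```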

The same rules apply for downloads: recursive copies of buckets and bucket subdirectories produce a mirrored filename structure, while copying individually (or wildcard) named objects produces flatly named files.

This will cause dir and all of its files and nested subdirectories to be copied under the specified destination, resulting in objects with names like gs://my_bucket/data/dir/a/b/c. Similarly you can download from bucket subdirectories by using a command like:

Note that dir could be a local directory on each machine, or it could be a directory mounted off of a shared file server; whether the latter performs acceptably may depend on a number of things, so we recommend you experiment and find out what works best for you.

Note that by default, the gsutil cp command does not copy the object ACL to the new object, and instead will use the default bucket ACL (see gsutil help defacl). You can override this behavior with the -p option (see OPTIONS below).

At the end of every upload or download, the gsutil cp command validates that the checksum of the source file/object matches the checksum of the destination file/object. If the checksums do not match, gsutil will delete the invalid copy and print a warning message. This very rarely happens, but if it does, please contact gs-team@google.com.

The cp command will retry when failures occur, but if enough failures happen during a particular copy or delete operation the command will skip that object and move on. At the end of the copy run, if any failures were not successfully retried, the cp command will report the count of failures and exit with non-zero status.

gsutil automatically uses the Google Cloud Storage resumable upload feature whenever you use the cp command to upload an object that is larger than 2 MB. You do not need to specify any special command line options to make this happen. If your upload is interrupted you can restart the upload by running the same cp command that you ran to start the upload. Until the upload has completed successfully, it will not be visible at the destination object and will not replace any existing object the upload is intended to overwrite. (However, see the section on PARALLEL COMPOSITE UPLOADS, which may leave temporary component objects in place during the upload process.)

Similarly, gsutil automatically performs resumable downloads (using HTTP standard Range GET operations) whenever you use the cp command to download an object larger than 2 MB. In this case the partially downloaded file will be visible as soon as it starts being written. Thus, before you attempt to use any files downloaded by gsutil you should make sure the download completed successfully, by checking the exit status from the gsutil command. This can be done in a bash script, for example, by doing:
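One way to write that check is sketched below. The function name, the error message, and the URLs in the usage comment are illustrative assumptions; the only behaviour relied on is that gsutil cp exits with a non-zero status when the copy fails.

```shell
# Download an object and refuse to continue if gsutil reports a failure.
download_checked() {
  # $1: source object URL, $2: local destination path
  if ! gsutil cp "$1" "$2"; then
    echo "download of $1 failed; do not use $2" >&2
    return 1
  fi
}

# Hypothetical usage:
#   download_checked gs://your-bucket/your-object ./local-file || exit 1
```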

gsutil can automatically use object composition to perform uploads in parallel for large, local files being uploaded to Google Cloud Storage. This means that, if enabled (see next paragraph), a large file will be split into component pieces that will be uploaded in parallel. Those components will then be composed in the cloud, and the temporary components in the cloud will be deleted after successful composition. No additional local disk space is required for this operation.

If the transfer fails prior to composition, running the command again will take advantage of resumable uploads for those components that failed, and the component objects will be deleted after the first successful attempt. Any temporary objects that were uploaded successfully before gsutil failed will still exist until the upload is completed successfully. The temporary objects will be named in the following fashion: RANDOM_ID/gsutil/tmp/parallel_composite_uploads/for_details_see/gsutil_help_cp/HASH where RANDOM_ID is some numerical value, and HASH is an MD5 hash (not related to the hash of the contents of the file or object).

One important caveat is that files uploaded in this fashion are still subject to the maximum number of components limit. For example, if you upload a large file that gets split into 10 components, and try to compose it with another object with 1015 components, the operation will fail because it exceeds the 1024 component limit. If you wish to compose an object later and the component limit is a concern, it is recommended that you disable parallel composite uploads for that transfer.
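The arithmetic in that example can be checked with a few lines of shell. This is a local illustration only, not a gsutil feature: the function name is made up, and the 1024 limit is simply the figure quoted in the text above.

```shell
# Component limit quoted in the text above.
MAX_COMPONENTS=1024

# Succeeds (exit status 0) only if composing two objects stays within the limit.
# $1, $2: component counts of the two objects
can_compose() {
  [ $(( $1 + $2 )) -le "$MAX_COMPONENTS" ]
}

# The case from the text: 10 + 1015 = 1025 components.
if can_compose 10 1015; then
  echo "compose allowed"
else
  echo "compose would exceed $MAX_COMPONENTS components"
fi
```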

On Windows 7 you can change the TMPDIR environment variable from Start -> Computer -> System -> Advanced System Settings -> Environment Variables. You need to reboot after making this change for it to take effect. (Rebooting is not necessary after running the export command on Linux and MacOS.)
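On Linux and MacOS, the export command mentioned above might look like the following sketch. The function name and the directory in the usage comment are hypothetical; the change applies to the current shell session and to commands started from it.

```shell
# Point TMPDIR at a directory of your choice for this shell session.
use_gsutil_tmpdir() {
  # $1: directory to use for temporary files (created if missing)
  mkdir -p "$1" && export TMPDIR="$1"
}

# Hypothetical usage:
#   use_gsutil_tmpdir "$HOME/gsutil_tmp"
```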

If the log file already exists, gsutil will use the file as an input to the copy process, and will also append log items to the existing file. Files/objects that are marked in the existing log file as having been successfully copied (or skipped) will be ignored. Files/objects without entries will be copied and ones previously marked as unsuccessful will be retried. This can be used in conjunction with the -c option to build a script that copies a large number of objects reliably, using a bash script like the following:
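A sketch of such a loop follows. The function name, the one-second pause, and the paths in the usage comment are assumptions for illustration; -c, -L, and -R are the cp options the text describes.

```shell
# Re-run the copy until gsutil exits with status 0.
copy_until_done() {
  # $1: local source directory, $2: destination URL, $3: log file path
  until gsutil cp -c -L "$3" -R "$1" "$2"; do
    sleep 1   # brief pause before the next attempt
  done
}

# Hypothetical usage:
#   copy_until_done ./dir gs://bucket cp.log
```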

The -c option will cause copying to continue after failures occur, and the -L option will allow gsutil to pick up where it left off without duplicating work. The loop will continue running as long as gsutil exits with a non-zero status (such a status indicates there was at least one failure during the gsutil run).

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 3.0 License, and code samples are licensed under the Apache 2.0 License. For details, see our Site Policies.

The gsutil cp command allows you to copy data between your local filesystem and the cloud, within the cloud, and between cloud storage providers. For example, to upload all text files from the local directory to a bucket, you can run:
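Since this thread's title asks about skipping existing objects, here is one hedged way to write that upload with the -n (no-clobber) option of gsutil cp, which leaves items that already exist at the destination untouched. The function name and the bucket URL in the usage comment are placeholders, not names from the help text.

```shell
# Upload .txt files, skipping any that already exist at the destination.
upload_new_text_files() {
  # $1: local directory, $2: destination bucket URL
  gsutil cp -n "$1"/*.txt "$2"
}

# Hypothetical usage:
#   upload_new_text_files . gs://my-bucket
```

Dropping -n gives the plain upload, which overwrites existing objects with the same name.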

The gsutil cp command attempts to name objects in ways that are consistent with the Linux cp command. This means that names are constructed depending on whether you're performing a recursive directory copy or copying individually-named objects, or whether you're copying to an existing or non-existent directory.

When you perform recursive directory copies, object names are constructed to mirror the source directory structure starting at the point of recursive processing. For example, if dir1/dir2 contains the file a/b/c, then the following command creates the object gs://my-bucket/dir2/a/b/c:

In contrast, copying individually-named files results in objects named by the final path component of the source files. For example, assuming again that dir1/dir2 contains a/b/c, the following command creates the object gs://my-bucket/c:
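Because the naming is meant to match the Linux cp command, both cases can be tried locally without a bucket. The sketch below uses plain cp on a scratch directory; the gsutil commands shown in the comments are the assumed equivalents for the two examples above, and the scratch directory names are made up.

```shell
# Build the sample tree from the text: dir1/dir2 contains a/b/c.
work=$(mktemp -d)
mkdir -p "$work/dir1/dir2/a/b" "$work/recursive" "$work/single"
touch "$work/dir1/dir2/a/b/c"

# Recursive copy mirrors the tree from dir2 downward.
# Assumed gsutil equivalent: gsutil cp -r dir1/dir2 gs://my-bucket
cp -r "$work/dir1/dir2" "$work/recursive"

# Copying one named file keeps only the final path component.
# Assumed gsutil equivalent: gsutil cp dir1/dir2/a/b/c gs://my-bucket
cp "$work/dir1/dir2/a/b/c" "$work/single"

# List the resulting files to compare the two layouts.
find "$work/recursive" "$work/single" -type f
```

The listing shows recursive/dir2/a/b/c for the first copy and single/c for the second, matching the two object names in the text.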

The same rules apply for uploads and downloads: recursive copies of buckets and bucket subdirectories produce a mirrored filename structure, while copying individually or wildcard-named objects produces flatly-named files.

This causes dir and all of its files and nested subdirectories to be copied under the specified destination, resulting in objects with names like gs://my-bucket/data/dir/a/b/c. Similarly, you can download from bucket subdirectories using the following command:
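Both directions use the same recursive form, sketched here as one helper. The function name is hypothetical, and the paths in the usage comments are taken from the object names in the paragraph above.

```shell
# Recursively copy a directory tree to or from a bucket subdirectory.
copy_tree() {
  # $1: source (local directory or gs:// URL), $2: destination
  gsutil cp -r "$1" "$2"
}

# Hypothetical usage:
#   copy_tree dir gs://my-bucket/data    # upload: objects like gs://my-bucket/data/dir/a/b/c
#   copy_tree gs://my-bucket/data dir    # download from the bucket subdirectory
```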

Copying subdirectories is useful if you want to add data to an existing bucket directory structure over time. It's also useful if you want to parallelize uploads and downloads across multiple machines (potentially reducing overall transfer time compared with running gsutil -m cp on one machine). For example, if your bucket contains this structure:

Note that dir could be a local directory on each machine, or a directory mounted off of a shared file server. The performance of the latter depends on several factors, so we recommend experimenting to find out what works best for your computing environment.

If both the source and destination URL are cloud URLs from the same provider, gsutil copies data "in the cloud" (without downloading to and uploading from the machine where you run gsutil). In addition to the performance and cost advantages of doing this, copying in the cloud preserves metadata such as Content-Type and Cache-Control. In contrast, when you download data from the cloud, it ends up in a file with no associated metadata, unless you have some way to keep or re-create that metadata.

Copies spanning locations and/or storage classes cause data to be rewritten in the cloud, which may take some time (but is still faster than downloading and re-uploading). Such operations can be resumed with the same command if they are interrupted, so long as the command parameters are identical.

Note that by default, the gsutil cp command does not copy the object ACL to the new object, and instead uses the default bucket ACL (see gsutil help defacl). You can override this behavior with the -p option.

When copying in the cloud, if the destination bucket has Object Versioning enabled, by default gsutil cp copies only live versions of the source object. For example, the following command causes only the single live version of gs://bucket1/obj to be copied to gs://bucket2, even if there are noncurrent versions of gs://bucket1/obj:
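A sketch of that command, with a second helper added for comparison. The function names are hypothetical. The -A option in the second helper is not mentioned in the text above; it is the gsutil cp flag for copying all versions of the source, shown here only as a contrast to the default.

```shell
# Default behaviour: only the live version of the source object is copied.
copy_live_version() {
  # $1: source object URL, $2: destination bucket URL
  gsutil cp "$1" "$2"
}

# Contrast (assumption, not from the text): -A also copies noncurrent versions.
copy_all_versions() {
  gsutil cp -A "$1" "$2"
}

# Hypothetical usage:
#   copy_live_version gs://bucket1/obj gs://bucket2
```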

The cp command retries when failures occur, but if enough failures happen during a particular copy or delete operation, or if a failure isn't retryable, the cp command skips that object and moves on. If any failures were not successfully retried by the end of the copy run, the cp command reports the number of failures and exits with a non-zero status.

gsutil automatically resumes interrupted downloads and interrupted resumable uploads, except when performing streaming transfers. In the case of an interrupted download, a partially downloaded temporary file is visible in the destination directory with the suffix _.gstmp in its name. Upon completion, the original file is deleted and replaced with the downloaded contents.
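Given that suffix, leftover partial downloads can be spotted with find before the files are used. This is a local helper sketch, not part of gsutil, and the function name and directory in the usage comment are made up for illustration.

```shell
# Print any partially downloaded temporary files under a directory.
list_partial_downloads() {
  # $1: directory that gsutil downloaded into
  find "$1" -type f -name '*_.gstmp'
}

# Hypothetical usage:
#   list_partial_downloads ./downloads
```

An empty result suggests no interrupted downloads remain in that directory; rerunning the same cp command resumes any that are listed.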
