Recursive copy using gsutil cp: Getting around the inability to exclude files

Solace

Dec 20, 2011, 9:27:04 PM
to gsutil-discuss
Hi,

I want to avoid copying .svn directories, so I can't directly use
gsutil's recursive copy switch until an exclude-files option is added.
(I'm using gsutil version 2011-12-12 09:48:10-08:00.)

So I instead tried

gsutil cp directory/* gs://bucket/directory

where I've created the "directory" folder in my bucket using the
online browser.

However I get the error

"CommandException: Destination URI must name a bucket or directory for
the multiple source form of the cp command."

It seems a GS destination must be either a bucket or a file, never a
directory.

My current workarounds are to do an svn export to get a copy without
the .svn directories, to use recursive mode to copy whole directories,
and to use a shell loop to copy individual files from the directories
I want copied selectively.
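The shell loop looks roughly like this (the bucket name and file
patterns below are just placeholders):

     cd directory
     for f in *.html css/*.css; do
         gsutil cp "$f" "gs://bucket/directory/$f"
     done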

Any tips on how to do this better?

Thanks.

Google Storage Team

Dec 21, 2011, 1:02:42 PM
to gsutil-...@googlegroups.com
Hi,

> where I've created the "directory" folder in my bucket using the online browser.

The browser UI is creating an object named "directory" and simulating folder behavior (using the slash character in object names as a delimiter) but your "directory" is really a flat object, not a folder. That's why when you try to copy multiple files to that object, you get this error:

"CommandException: Destination URI must name a bucket or directory for the multiple source form of the cp command."

As with local shell commands, the target of a multiple source copy must be a container, i.e., a directory/folder in your local file system or a bucket in the Google Cloud Storage world.
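For example (file names here are just placeholders), copying multiple sources to the bucket itself works:

gsutil cp file1.txt file2.txt gs://bucket

while a single-source copy can spell out a full object name containing slashes, which is how the "folder" illusion gets built up:

gsutil cp file1.txt gs://bucket/directory/file1.txt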

Are you running on Linux (or some variant of Unix)? If so, you could use find and xargs to filter out your .svn files, like this:

find . -not -path '*/.svn/*' -type f | xargs -I '{}' gsutil cp '{}' gs://bucket

This is inefficient for a large number of files because it invokes gsutil once per file. A better approach in that case is to write a simple wrapper shell script that passes all supplied source arguments to a single gsutil invocation, like this:

     #!/bin/sh
     # gsutil_wrapper.sh - bundles command-line args into one gsutil copy command (using -m for parallelism)
     gsutil -m cp "$@" gs://bucket

With such a script, the previous find/xargs pipeline could be rewritten in a simpler and more efficient way:

find . -not -path '*/.svn/*' -type f | xargs gsutil_wrapper.sh
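
Note that xargs has to be able to execute the wrapper, so remember to make the script executable and either put it on your PATH or call it by path, for example:

chmod +x gsutil_wrapper.sh
find . -not -path '*/.svn/*' -type f | xargs ./gsutil_wrapper.sh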

Hope that helps,

Marc
Google Cloud Storage Team

Google Storage Team

Dec 21, 2011, 2:51:37 PM
to gsutil-...@googlegroups.com
P.S. Apparently, a number of other users have asked for this capability. A colleague pointed me to this thread, which has some additional discussion about this problem.

Marc

Solace

Dec 23, 2011, 7:31:53 AM
to gsutil-discuss
Google Storage Team wrote:
> Hi,
>
> > where I've created the "directory" folder in my bucket using the online
> > browser.
>
> The browser UI is creating an object named "directory" and simulating
> folder behavior (using the slash character in object names as a delimiter)
> but your "directory" is really a flat object, not a folder. That's why when
> you try to copy multiple files to that object, you get this error:
>
> "CommandException: Destination URI must name a bucket or directory for the
> multiple source form of the cp command."
>
> As with local shell commands, the target of a multiple source copy must be
> a container, i.e., a directory/folder in your local file system or a bucket
> in the Google Cloud Storage world.


I've tried again without first creating the directory in the browser,
but the error still occurs.

I understand that the storage file hierarchy is just conceptual, but I
think that the multiple-file "gsutil cp" command should work with
destination paths by treating the destination as a directory name (key
prefix), only giving an error if the destination name exists (as a
file/key).
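To spell out what I mean (this is the behavior I'd like, not how
gsutil behaves today):

gsutil cp directory/* gs://bucket/directory

would create gs://bucket/directory/file1, gs://bucket/directory/file2,
and so on, and would only give an error if an object named exactly
"directory" already existed.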

>
> Are you running on Linux (or some variant of Unix)? If so, you could use
> find and xargs to filter out your .svn files, like this:
>
> find . -not -path '*/.svn/*' -type f | xargs -I '{}' gsutil cp '{}' gs://bucket
>
>
> This is inefficient for a large number of files because it invokes gsutil
> once per file. A better approach in that case is to write a simple wrapper
> shell script that passes all supplied source arguments to a single gsutil
> invocation, like this:
>
>      #!/bin/sh
>      # gsutil_wrapper.sh - bundles command-line args into one gsutil copy command (using -m for parallelism)
>      gsutil -m cp "$@" gs://bucket
>
>
> With such a script, the previous find/xargs pipeline could be rewritten in
> a simpler and more efficient way:
>
> find . -not -path '*/.svn/*' -type f | xargs gsutil_wrapper.sh

Great suggestions.

I found that find can be made to exclude all hidden files via

find . -not -path '*/.*'

But both of the copy methods given will create all the files in the
root of the bucket, whereas I want to re-create the directory structure.

Your first method can however be made to work:

find . -not -path '*/.*' -type f -printf '%P\n' | xargs -I '{}' gsutil cp '{}' gs://bucket/'{}'

But I don't see a way of making the parallel version work.


> Hope that helps,

Yes. Thank you Marc for your detailed reply.

Google Storage Team

Dec 24, 2011, 8:07:18 PM
to gsutil-...@googlegroups.com
Another option you might try is to use gsutil -m cp -r to upload the whole directory tree in one shot and then remove the .svn files (and any other objectionable files) after the fact. Depending on how big this file tree is, it might be easier than trying to retrofit a custom filtering scheme outside of gsutil.

You can remove the .svn files after the fact like this:

gsutil rm gs://bucket/\*.svn\*

Note the backslashes, which prevent the * metacharacters from being consumed by the local shell.
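The upload step itself would be something like this (with "directory" standing in for your local tree):

gsutil -m cp -r directory gs://bucket

followed by the rm command above.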

By the way,

> I understand that the storage file hierarchy is just conceptual, but I think that the multiple-file "gsutil cp" command should work with
> destination paths by treating the destination as a directory name (key prefix), only giving an error if the destination name exists (as a
> file/key).

I can't disagree with that (although I might argue for making that behavior conditional). As noted earlier, similar comments have been raised previously, so we'll definitely take this need into consideration as we plan gsutil enhancements going forward.

Thanks for the feedback,

Marc
Google Cloud Storage Team