gsutil 3.0 release

Mike Schwartz (Google Storage Team)

Mar 20, 2012, 2:30:26 PM
to gs-dis...@googlegroups.com, gsutil-...@googlegroups.com
Hi,

A new major release of gsutil (v3.0) is available as a zipfile (http://commondatastorage.googleapis.com/pub/gsutil.zip) or tarball (http://commondatastorage.googleapis.com/pub/gsutil.tar.gz).

This release provides significant functionality and performance enhancements, a new hierarchical file tree abstraction layer, and numerous bug fixes. However, it also changes the behavior of the ls command and the * wildcard in ways that may require changes to scripts that depend on ls and * behavior. Please see the Release Notes (http://commondatastorage.googleapis.com/pub/ReleaseNotes_3.0.txt) for full details; below are highlights of the new features and potentially required script changes.

New Features:
  • Built-in help for all commands and many additional topics (try "gsutil help").
  • Support for copying data to/from bucket sub-directories (see “gsutil help cp”).
  • Support for renaming bucket sub-directories (see “gsutil help mv”).
  • Support for listing individual bucket sub-directories and for listing directories recursively (see “gsutil help ls”).
  • Support for Cross-Origin Resource Sharing (CORS) configuration (see "gsutil help cors").
  • Multi-threading support for the setacl command (see “gsutil help setacl”).
  • Support for using the UNIX “file” command to do content type recognition as an alternative to filename extensions (see "gsutil help metadata").
  • The gsutil update command is no longer beta/experimental.

As part of the bucket sub-directory support, we changed the * wildcard to match only up to directory boundaries, and introduced the new ** wildcard to span directories the way * used to. We made this change both to be more consistent with how wildcards work in command interpreters (like bash) and to enable a variety of use cases for distributing large transfers across many machines. For example, you can run the following commands on 3 machines:
 gsutil cp -R gs://my_bucket/result_set_[0-3]* dir
 gsutil cp -R gs://my_bucket/result_set_[4-6]* dir
 gsutil cp -R gs://my_bucket/result_set_[7-9]* dir
and end up with all of the result_set_* directories nested under dir.

Script Changes You May Need to Make To Use gsutil 3.0:
If your script depends on listing the entire (flat) contents of a bucket using something like:
   gsutil ls gs://my_bucket
you'll need to change to use:
   gsutil ls gs://my_bucket/**

If your script uses the * wildcard to name objects spanning directories, such as:
   gsutil cp gs://mybucket/*.txt ./dir
(where you want to match objects several directories down from the top-level bucket), you'll need to change to use:
   gsutil cp gs://mybucket/**.txt ./dir

Mike Schwartz and the Google Cloud Storage Team

Chetan Shah

Mar 20, 2012, 4:44:27 PM
to gsutil-discuss
I downloaded the zip file and got the following error message.
Please let me know what I am doing wrong.

c:\code\gsutil>python gsutil
Traceback (most recent call last):
  File "gsutil", line 314, in <module>
    main()
  File "gsutil", line 102, in main
    gsutil_ver)
  File "c:\code\gsutil\gslib\command_runner.py", line 47, in __init__
    self.command_map = self._LoadCommandMap()
  File "c:\code\gsutil\gslib\command_runner.py", line 57, in _LoadCommandMap
    __import__('gslib.commands.%s' % module_name)
  File "c:\code\gsutil\gslib\commands\help.py", line 15, in <module>
    import fcntl
ImportError: No module named fcntl


Chetan Shah

Mar 20, 2012, 5:00:53 PM
to gsutil-discuss
Needless to say, my earlier installation of gsutil was working perfectly.

Michael Schwartz

Mar 20, 2012, 6:59:35 PM
to gsutil-...@googlegroups.com
Hi, the code had a dependency on a Linux/MacOS-specific library, which caused this problem. Please try again with gs://pub/gsutil_3.1.tar.gz or gs://pub/gsutil_3.1.zip.
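
For reference, the failing import was fcntl, which only exists on Unix-like systems, so an unconditional import fails on Windows. As a general pattern (this is just an illustrative sketch, not necessarily the exact change made in 3.1), a Unix-only import can be guarded so the module still loads on Windows; the lock_file helper below is a hypothetical example of code that would use fcntl:

  # Illustrative sketch only, not gsutil's actual fix: guard a Unix-only
  # import and fall back to a no-op where the platform can't support it.
  try:
      import fcntl          # available on Linux/MacOS
  except ImportError:
      fcntl = None          # e.g. on Windows

  def lock_file(fp):
      """Take an advisory lock where supported; do nothing elsewhere."""
      if fcntl is not None:
          fcntl.flock(fp.fileno(), fcntl.LOCK_EX)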

Thanks,

Mike

Chetan Shah

Mar 21, 2012, 11:47:46 AM
to gsutil-discuss
Hi Mike -

Thank you for the fix.

Would you mind also updating this page: https://developers.google.com/storage/docs/gsutil_install

The Windows tab on that page points to the "tar.gz" file, which caused some grief before I stumbled onto this link: http://code.google.com/p/gsutil/

Thanks,

-Chetan

Mike Schwartz (Google Storage Team)

Mar 21, 2012, 12:58:11 PM
to gsutil-...@googlegroups.com
Thanks, Chetan. We'll update our documentation as you suggested.

Mike

Evan Worley

Mar 23, 2012, 4:18:54 PM
to gs-dis...@googlegroups.com, gsutil-...@googlegroups.com
Very cool stuff, all. I'm wondering about the multi-threading support. All the documentation leads me to believe that it only applies to a large number of files, and not a single large file. I've got some 100GB+ files to upload to Google Cloud Storage, and the transfer rate is far below the available bandwidth. Have you considered parallelism at the single-file level? We've run some tests and found that we can achieve much better throughput with concurrent connections on a decent network (I've observed up to 6-8x higher throughput).

Thanks for any information,
Evan Worley

Tom Huppi

Mar 23, 2012, 4:36:25 PM
to gsutil-...@googlegroups.com, gs-dis...@googlegroups.com

FWIW, I did some work along the same lines, but the need for dealing
with the large files vanished and I never put the work into
production. IIRC, I found a performance increase significantly greater
than 6-8x, but I am working within a production data center with
pretty good connectivity. As I recall, I was getting a 10 GB file up
(and also down) in around 30 seconds, where any other form of
single-stream transfer (rsync, scp, etc.) would take many times
that, particularly for a transcontinental transfer.

The approach I took was to split the file and store it in parts in
Google Storage, adhering to a convention I defined. The download
part of the code knew how to interpret the convention to fork
downloads and re-assemble the file. This is somewhat cumbersome, of
course; it would be great if there were some way to build support for
such an operation into the Google Storage infrastructure itself.
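
To make that concrete, here's a rough sketch of the sort of convention I mean. The part naming, chunk size, and helper names are illustrative only, not my production code:

  # Rough sketch of a split/parallel-transfer convention; the part naming,
  # chunk size, and helper names are illustrative, not production code.
  import os
  import subprocess
  from multiprocessing.pool import ThreadPool

  CHUNK = 256 * 1024 * 1024  # arbitrary 256 MB part size for this sketch

  def split_and_upload(local_path, gs_dest, workers=8):
      """Split local_path into CHUNK-sized pieces and upload them in
      parallel as <gs_dest>.part-0000, <gs_dest>.part-0001, ..."""
      parts = []
      with open(local_path, 'rb') as src:
          i = 0
          while True:
              data = src.read(CHUNK)
              if not data:
                  break
              name = '%s.part-%04d' % (local_path, i)
              with open(name, 'wb') as dst:
                  dst.write(data)
              parts.append((name, '%s.part-%04d' % (gs_dest, i)))
              i += 1
      ThreadPool(workers).map(
          lambda p: subprocess.check_call(['gsutil', 'cp', p[0], p[1]]), parts)

  def download_and_reassemble(gs_dest, out_path, workers=8):
      """Fetch every <gs_dest>.part-* object and concatenate them in order."""
      uris = sorted(subprocess.check_output(
          ['gsutil', 'ls', gs_dest + '.part-*']).decode().split())
      def fetch(uri):
          local = os.path.basename(uri)
          subprocess.check_call(['gsutil', 'cp', uri, local])
          return local
      with open(out_path, 'wb') as out:
          for local in ThreadPool(workers).map(fetch, uris):
              with open(local, 'rb') as f:
                  out.write(f.read())

  # e.g. split_and_upload('huge.bin', 'gs://my_bucket/huge.bin')
  #      download_and_reassemble('gs://my_bucket/huge.bin', 'huge.bin')

The zero-padded part suffix is what lets a plain lexical sort of "gsutil ls" output recover the original order.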

Thanks,

- Tom

Mike Schwartz (Google Storage Team)

Mar 23, 2012, 10:04:39 PM
to gsutil-...@googlegroups.com, gs-dis...@googlegroups.com
Hi Tom and Evan,

Google Cloud Storage does not support parallel reads of the chunks within a single file. But thank you for your input - we're always interested in hearing about customers' requirements.

Mike

Mike Schwartz (Google Storage Team)

Mar 23, 2012, 10:33:15 PM
to gsutil-...@googlegroups.com, gs-dis...@googlegroups.com
Hi,

Sorry, I didn't answer your question clearly and correctly:

Google Cloud Storage does support parallel *reads* within a single file, in the form of range GETs. But we don't support parallel *writes* within a single file.
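
To make the read side concrete, here is a minimal sketch of a parallel download using Range GETs. It assumes a publicly readable object (the bucket and object names are hypothetical) and uses the third-party Python requests library; authenticated access would need an Authorization header on each request.

  # Sketch only: parallel read of a single object via HTTP Range GETs.
  import requests
  from multiprocessing.pool import ThreadPool

  def ranged_download(url, out_path, workers=8, chunk=64 * 1024 * 1024):
      # Find the object size, then pre-size the destination file so each
      # worker can write its byte range at the right offset.
      size = int(requests.head(url).headers['Content-Length'])
      with open(out_path, 'wb') as f:
          f.truncate(size)

      def fetch(rng):
          start, end = rng
          resp = requests.get(url,
                              headers={'Range': 'bytes=%d-%d' % (start, end)})
          resp.raise_for_status()  # expect 206 Partial Content
          with open(out_path, 'r+b') as f:
              f.seek(start)
              f.write(resp.content)

      ranges = [(s, min(s + chunk, size) - 1) for s in range(0, size, chunk)]
      ThreadPool(workers).map(fetch, ranges)

  # e.g. ranged_download(
  #     'https://commondatastorage.googleapis.com/my_bucket/big_file', 'big_file')

The equivalent trick is not available on the write side today, which is why a single large upload can't be parallelized the same way.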

Mike

Simon Leinen

Mar 24, 2012, 6:59:56 AM
to gs-dis...@googlegroups.com, gsutil-...@googlegroups.com

Thanks for clarifying - I was surprised when I saw your previous message.  So you support "scatter", but not "gather"... I can understand why parallel-write support is more tricky.  What about the following extension of the interface:

An additional operation that, given an ordered set of files, efficiently and destructively* merges (concatenates) them into a single large file.  It's fine if this imposes some size restrictions (e.g. "a multiple of 2097152 bytes") on the partial files (except for the last one).  In this case, there should be a way for a client to find out what this size restriction is.

*by destructively, I mean that the original files disappear if the merge operation is successful.

Given this, client tools and libraries could implement parallel uploading relatively easily.  And that could solve the problem of the original poster.

Probably other cloud storage APIs already have this or something similar (Azure "block blobs"?).
-- 
Simon.

Evan Worley

Mar 26, 2012, 1:44:34 PM
to gs-dis...@googlegroups.com, gsutil-...@googlegroups.com
Thanks for your thoughts, Simon. Indeed, other cloud providers offer operations like this. I'm not familiar with Azure services, but Amazon S3 does offer multi-part upload (http://aws.typepad.com/aws/2010/11/amazon-s3-multipart-upload.html), which has some size restrictions on the parts (namely that all parts must be >= 5 MB except for the last part).

- Evan

