How to load data from a Google Storage bucket into my AI Platform Jupyter Notebook


Matan G

Sep 22, 2019, 3:15:15 PM
to google-dl-platform
I am trying to load an .npz file so I can train a CNN, but I am having trouble loading the file from a Google Storage bucket.  I'm using the gs:// path syntax from gsutil, but it doesn't seem to work inside Jupyter.  Am I just missing a library?


# load train and test dataset
# (the snippet assumes the usual imports: numpy's load and scikit-learn's train_test_split)
from numpy import load
from sklearn.model_selection import train_test_split

def load_dataset():
    # load dataset straight from the bucket path (this is the line that fails)
    data = load('gs://for-imet/iMet_data_unsampled.npz')
    X, y = data['arr_0'], data['arr_1']
    # separate into train and test datasets
    trainX, testX, trainY, testY = train_test_split(X, y, test_size=0.3, random_state=1)
    print(trainX.shape, trainY.shape, testX.shape, testY.shape)
    return trainX, trainY, testX, testY


Here is the error traceback:
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-5-9cf853d8acf7> in <module>
     35 
     36 # load dataset
---> 37 trainX, trainY, testX, testY = load_dataset()
     38 # make all one predictions
     39 train_yhat = asarray([ones(trainY.shape[1]) for _ in range(trainY.shape[0])])

<ipython-input-5-9cf853d8acf7> in load_dataset()
     10 def load_dataset():
     11     # load dataset
---> 12     data = load('gs://for-imet/iMet_data_unsampled.npz') #######################
     13     X, y = data['arr_0'], data['arr_1']
     14     # separate into train and test datasets

/usr/local/lib/python3.5/dist-packages/numpy/lib/npyio.py in load(file, mmap_mode, allow_pickle, fix_imports, encoding)
    426         own_fid = False
    427     else:
--> 428         fid = open(os_fspath(file), "rb")
    429         own_fid = True
    430 

FileNotFoundError: [Errno 2] No such file or directory: 'gs://for-imet/iMet_data_unsampled.npz'

John Fields

Sep 22, 2019, 6:33:09 PM
to google-dl-platform
There is a BERT tutorial that mentions the use of storage buckets: “The bucket location must be in the same region as your virtual machine and your TPU node. VMs and TPU nodes are located in specific zones.” Not sure if this helps, but I'm passing it along in case it is useful.

Matan G

Sep 23, 2019, 10:11:29 AM
to google-dl-platform
Thanks, John.  Yes, the bucket is in the same region and zone.
I found out that you can run shell commands in Jupyter by prefixing them with an exclamation mark, e.g. !gsutil -m cp gs://bucket/file ...
You can also make an entire Jupyter cell act like the command line by starting it with '%%bash'.
I tried that with the gsutil command above to reach the file in the bucket. It gave me a listing of the bucket's contents, but still did not let me read the file into the notebook.
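To be concrete, the two patterns look roughly like this (each goes in its own cell; the bucket name is just the one from my first post):

# run a single shell command by prefixing it with "!"
!gsutil ls gs://for-imet/

%%bash
# with the %%bash cell magic (it must be the first line), the whole cell runs as bash
gsutil ls gs://for-imet/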

Martin Gorner

Sep 23, 2019, 6:56:17 PM
to Matan G, google-dl-platform
For loading data from a GCS bucket, there are two solutions:
1) Copy it to local disk first:
gsutil cp ...
2) Use TensorFlow. In TensorFlow, all file access functions understand gs:// filenames. I recommend using tf.data.Dataset to read your data; it handles out-of-memory datasets automatically. If you just want file access, use tf.io.gfile.
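Roughly, in a notebook, both options look something like the sketch below (bucket and file names are taken from the snippet earlier in the thread; the tf.io.gfile variant assumes a recent TensorFlow version where that API exists):

# Option 1: copy the file to local disk first, then load it as a normal local file
!gsutil cp gs://for-imet/iMet_data_unsampled.npz .

from numpy import load
data = load('iMet_data_unsampled.npz')
X, y = data['arr_0'], data['arr_1']

# Option 2: open the gs:// path directly through TensorFlow's file API,
# then hand the file object to numpy
import tensorflow as tf
from numpy import load

with tf.io.gfile.GFile('gs://for-imet/iMet_data_unsampled.npz', 'rb') as f:
    data = load(f)
X, y = data['arr_0'], data['arr_1']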

Martin



--

Martin Görner | Developer Relations | mgo...@google.com | +1 425 273 0605

Matan G

Sep 30, 2019, 4:36:01 PM
to google-dl-platform
Thank you, sir!

