Using GAE to store images for another machine outside GAE


Virilo Tejedor

unread,
Aug 22, 2019, 12:18:20 PM8/22/19
to Google App Engine
Hi all,

I'd like to create a static web server to store almost 1 TB of images.

It is an open-source dataset that I'd like to use to train a Deep Learning model.

I have free usage of GPUs and an Internet connection on another platform, but they don't provide me with 1 TB of storage.

I also have $600 in Google Cloud credits, and I was wondering if there was an easy way to create something to feed images to the server on the other platform.

The data source is available as an AWS bucket. I tried to connect the GPU machine directly to the AWS bucket via awscli, but it is far too slow. It's as if the bucket were designed for a complete sync rather than continuous random access to individual files.

I've thought of two possible approaches:

- Execute a python script in GAE to download the dataset and to create a GAE web server: https://cloud.google.com/appengine/docs/standard/python/getting-started/hosting-a-static-website

- Execute a python script in GAE to download the dataset and to create a Google Cloud CDN.

Do you think either of these approaches is valid for feeding the model during training?

I'm a newbie with GAE, and any help, starting point or idea will be very welcome.

Thanks in advance

Barry Hunter

unread,
Aug 22, 2019, 12:56:50 PM8/22/19
to google-appengine
Well, putting them into 'static' files in an app won't work! It's a bit hidden, but there is a 10,000-file limit.

... Plus you don't really 'upload' apps incrementally, so you would need 'somewhere' to first download the entire dataset, package it, then upload it. I doubt that would be an easy process with 1 TB (even if you can work around the 10,000-file limit!)


You could upload the data to https://cloud.google.com/storage/ - which is roughly comparable with an S3 bucket. But again you will be downloading all the data, uploading it to Cloud Storage, then just downloading it again for use in the process. (Downloading from Cloud Storage is going to be roughly comparable to S3 - maybe a bit quicker, but not massively.)


... seems wasteful. You're going to have to download the data anyway, so just download from AWS and use it directly. It might be painful, but it should work. If you find AWS slow, then download images in parallel (while individual images might be relatively slow, S3 can sustain high - even massive - concurrency, i.e. downloading lots of images at once!)
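A minimal sketch of that parallel approach in Python, using a thread pool. The `fetch_image` body here just simulates the per-image delay - a real script would issue an HTTP GET (e.g. with `requests` or `boto3`) against the dataset bucket, and the key names below are made up for illustration:

```python
import time
from concurrent.futures import ThreadPoolExecutor

PER_IMAGE_DELAY = 0.2  # simulated per-image latency + transfer, in seconds

def fetch_image(key):
    # Stand-in for a real HTTP GET against the dataset bucket;
    # here we only simulate the time the request would take.
    time.sleep(PER_IMAGE_DELAY)
    return key, b"<image bytes>"

def fetch_all(keys, workers=10):
    # Each worker waits on its own request, so the wall time is roughly
    # len(keys) / workers * PER_IMAGE_DELAY rather than len(keys) * PER_IMAGE_DELAY.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(fetch_image, keys))

if __name__ == "__main__":
    keys = [f"images/{i:06d}.jpg" for i in range(20)]
    start = time.monotonic()
    results = fetch_all(keys, workers=10)
    print(f"fetched {len(results)} images in {time.monotonic() - start:.2f}s")
```

With 20 simulated images at 0.2 s each, a serial loop would take ~4 s; with 10 workers the latencies overlap and it finishes in well under a second.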


This is an exercise in concurrent processing and throughput. Don't get distracted trying to build another storage platform; it's unlikely you will do better than S3.







--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to google-appengi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/google-appengine/dbd0a8f8-859b-4f50-a108-80b21e27267f%40googlegroups.com.

Joshua Smith

unread,
Aug 22, 2019, 1:12:34 PM8/22/19
to Google App Engine
GAE isn’t going to serve those images any faster than S3 does (unless they’re in Glacier storage).

What you propose won’t be possible because of limits on the size of GAE apps (10K files, max file sizes, etc.)

If you are reading the images more than once, putting a cloudfront distribution between your app and the S3 bucket might help.

Virilo Tejedor

unread,
Aug 22, 2019, 1:15:56 PM8/22/19
to Google App Engine
thanks Barry,

I'm not very sure that parallelizing would work, because there is a delay of several seconds to get a single image from this bucket.

I would have to open several threads, and I'll probably run into other limitations.

Virilo Tejedor

unread,
Aug 22, 2019, 1:17:26 PM8/22/19
to Google App Engine
thanks Joshua... does GAE have something similar to CloudFront?

Could I move the S3 bucket to GAE and use GAE's CloudFront equivalent?

I have no credits for AWS.

Joshua Smith

unread,
Aug 22, 2019, 1:29:22 PM8/22/19
to Google App Engine
Google Cloud Storage is like S3 with CloudFront built in.

You could absolutely migrate that dataset to GCS and then access those objects via HTTPS or via the API.
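To make the HTTPS route concrete: objects in a GCS bucket that are publicly readable are served directly at a predictable `storage.googleapis.com` URL, so a training script can fetch them with any HTTP client, with no app server in between. The bucket and object names below are made up for illustration:

```python
from urllib.parse import quote

def gcs_public_url(bucket, blob_name):
    # Publicly readable objects in a GCS bucket are served directly at
    # https://storage.googleapis.com/<bucket>/<object> - no GAE app needed.
    # quote() leaves '/' intact by default, so nested object paths survive.
    return f"https://storage.googleapis.com/{bucket}/{quote(blob_name)}"

# Hypothetical names, for illustration only:
print(gcs_public_url("my-dataset-bucket", "images/000001.jpg"))
# -> https://storage.googleapis.com/my-dataset-bucket/images/000001.jpg
```

For the bulk transfer into the bucket itself, `gsutil` with the `-m` flag runs copies in parallel (e.g. `gsutil -m cp` or `gsutil -m rsync`).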


Barry Hunter

unread,
Aug 22, 2019, 1:31:34 PM8/22/19
to google-appengine
On Thu, Aug 22, 2019 at 6:16 PM Virilo Tejedor wrote:
thanks Barry,

I'm not very sure that parallelizing would work, because there is a delay of several seconds to get a single image from this bucket.

If a single image has a latency of, say, 2 seconds, plus 2 seconds to download, that's 4 seconds per image - say 15/minute.

But if you download, say, 10 in parallel, then that's 150/minute, even with the latency per image.
... Overall the 'latency' gets spread around, so your script wouldn't be waiting for all 10 at the same time: while waiting for one to start, it can be downloading a different one.

AWS does not store all 10 million images on the same physical disk. It's possibly spread over millions of disks.

 

I would have to open several threads, and I'll probably run into other limitations.

In theory, yes. But AWS will cope with high concurrency. It's designed that way. You could easily download 1,000 images concurrently, if you had the bandwidth.


But however you download the data (even if it's just to upload it elsewhere - I still don't understand why), you will have to deal with this latency to download it all in a realistic timeframe.

Downloading them all one by one might take 463 days ;)
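For what it's worth, that figure checks out as back-of-envelope arithmetic under the assumptions above (~4 seconds per image, serial, and the 10 million images mentioned earlier), and the time divides by the number of concurrent downloads:

```python
# Back-of-envelope check of the figures in this thread (both assumptions
# come from the posts above; they are estimates, not measurements).
images = 10_000_000       # dataset size mentioned earlier in the thread
seconds_per_image = 4     # ~2 s latency + ~2 s transfer, downloaded serially

serial_days = images * seconds_per_image / 86_400
print(f"serial: ~{serial_days:.0f} days")        # ~463 days

for workers in (10, 100, 1000):
    # Latencies overlap across workers, so the total divides roughly evenly.
    print(f"{workers} concurrent downloads: ~{serial_days / workers:.1f} days")
```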