Can I use S3 bucket for training data?


G Reina

Feb 4, 2018, 7:06:17 PM
to Discuss
I've got a 1.5 TB dataset on AWS S3. Is it possible to just point my local installation of TensorFlow at the S3 files in order to train the model on my local machine? Or can I only do this from an EC2 instance?

Thanks.
-Tony

Toby Boyd

Feb 5, 2018, 11:24:23 AM
to G Reina, Discuss
I believe S3 is supported, and it wouldn't matter where you're calling it from, assuming you have the correct permissions and such set up. That said, I suspect this would be really slow, although I am just guessing. If your dataset is 1.5 TB and you do, say, 100 epochs (without local caching), you would be moving 150 TB of data. When I am doing work on AWS I normally use EFS (mounted as a local disk) or local disk. I could very well be wrong, but I am not sure S3 would be a great option. Disclosure: I have used EFS a good bit; I do not have any direct S3 experience.
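That data-movement estimate is just multiplication, but it's worth making explicit as a quick back-of-envelope check (numbers taken from this thread, assuming no local caching):

```python
# Back-of-envelope: total data moved if every epoch re-reads the full dataset
# from S3 with no local caching. The 1.5 TB / 100 epochs figures are from
# this thread.

def total_transfer_tb(dataset_tb, epochs):
    """TB pulled over the network across all epochs."""
    return dataset_tb * epochs

if __name__ == "__main__":
    print(total_transfer_tb(1.5, 100))  # 150.0 TB
```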

Toby

--
You received this message because you are subscribed to the Google Groups "Discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to discuss+unsubscribe@tensorflow.org.
To post to this group, send email to dis...@tensorflow.org.
To view this discussion on the web visit https://groups.google.com/a/tensorflow.org/d/msgid/discuss/45bbfc04-19ac-49b0-9369-41767f368d91%40tensorflow.org.

Sebastian Raschka

Feb 5, 2018, 1:02:13 PM
to Toby Boyd, G Reina, Discuss
I think you would need to mount the S3 bucket as a file system to do that, e.g., using tools like S3FS. For the reasons mentioned, this would probably not be a good idea for training the model anyway. I can see two issues with that:

1) It will be slow due to network transfer speed constraints.

2) It will become relatively expensive, since S3 has a pay-per-access model. I think it's currently around 0.023 cents per GB. After 10 epochs, you would be over 3k bucks. Here, it's probably better to invest that money in a machine with a >1.5 TB hard drive ;).

My recommendation would be to just buy a relatively cheap hard drive to put the dataset on. Instead of buying an external USB hard drive, I'd recommend a regular hard drive with a SATA port plus a docking station (they cost around 15 bucks), which you can connect either to your computer's SATA port, if you have one free, or over USB if it must be (I am using that setup as well; works great).

Best,
Sebastian

Toby Boyd

Feb 5, 2018, 1:07:48 PM
to Sebastian Raschka, G Reina, Discuss
Here is the S3 support, not that it would be a good idea in this case. I remember complaining that it was logging all the time a few weeks ago. Sorry, no how-to; I just wanted you to know it is there and included in default builds.
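For what it's worth, builds with the S3 filesystem support accept `s3://` paths directly in TensorFlow's file APIs. A minimal sketch, where the bucket and key names are made up for illustration:

```python
# Hedged sketch: with TensorFlow's S3 filesystem support compiled in and AWS
# credentials configured in the environment (AWS_ACCESS_KEY_ID etc.), file
# APIs can take s3:// paths directly. Bucket and key names are hypothetical.

def s3_uri(bucket, key):
    """Build an s3:// URI of the form TensorFlow's file APIs accept."""
    return "s3://{}/{}".format(bucket, key)

path = s3_uri("my-training-data", "shards/train-00000.tfrecord")

# Something along these lines would then stream records from the bucket:
#   import tensorflow as tf
#   dataset = tf.data.TFRecordDataset([path])
print(path)  # s3://my-training-data/shards/train-00000.tfrecord
```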

Toby

G Reina

Feb 5, 2018, 1:21:34 PM
to Toby Boyd, Sebastian Raschka, Discuss
Thanks to all. I managed to find goofys (https://github.com/kahing/goofys) last night. It allowed me to mount the S3 bucket locally. From my tests with just Python accessing an HDF5 file on the S3 bucket, the lag was well under 1 second to pull out a batch of data. However, I am concerned about the cost, as Sebastian mentioned. That could be the rate-limiting step.
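In case anyone wants to repeat that latency measurement, a read through the mount point can be timed with nothing but the standard library (the mount path below is hypothetical):

```python
# Rough latency check for pulling a batch-sized chunk of bytes through a
# goofys mount. Works on any readable file; the S3 mount path is hypothetical.
import time

def timed_read(path, offset, nbytes):
    """Read nbytes starting at offset; return (data, seconds elapsed)."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(nbytes)
    return data, time.perf_counter() - start

# Example against a real mount (64 MB starting at byte 0):
#   data, secs = timed_read("/mnt/s3-bucket/train.h5", 0, 64 * 1024 * 1024)
#   print(secs)
```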

Best,
-Tony

Sebastian Raschka

Feb 5, 2018, 1:25:05 PM
to G Reina, Toby Boyd, TensorFlow Mailinglist
Yeah, I think it might get expensive in the long run. I had a typo there and meant 100 epochs, not 10, regarding the 3k bucks (150,000 GB * 0.023 cents/GB).

Best,
Sebastian

Sebastian Raschka

Feb 5, 2018, 1:26:51 PM
to Sebastian Raschka, G Reina, Toby Boyd, TensorFlow Mailinglist
It does seem way too expensive, and I of course forgot to divide by 100 to get dollars, not cents. Argh, Mondays are not my days :P. In any case, probably worth investing in an external drive in the long run :P

Best,
Sebastian

G Reina

Feb 5, 2018, 1:28:46 PM
to Sebastian Raschka, Toby Boyd, Discuss
I have the same math trouble on Mondays!

Yes. For 3k I might as well buy a new rig.

Thanks.
Tony

Ka-Hing Cheung

Feb 7, 2018, 1:51:14 AM
to Discuss
goofys author here. This depends on whether you are running TensorFlow in AWS or not. If yes, then bandwidth is free (assuming everything is in the same region). If you are running this at home, then speed is likely more of a bottleneck. goofys does have some caching support, but obviously you need to have the space for the cache to begin with :-)

(On AWS, bandwidth out is $0.10/GB, and $0.023/GB is the cost of storage; this is region specific, of course.)
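Plugging those rates into the 1.5 TB / 100-epoch scenario from earlier in the thread makes the distinction concrete (prices as quoted here; they vary by region and over time):

```python
# Rough S3 cost estimate using the per-GB rates quoted in this thread.
EGRESS_PER_GB = 0.10          # $/GB transferred out of AWS
STORAGE_PER_GB_MONTH = 0.023  # $/GB-month kept in S3

def egress_cost_usd(dataset_gb, epochs, rate=EGRESS_PER_GB):
    """Dollars to stream the full dataset out of AWS once per epoch."""
    return dataset_gb * epochs * rate

def storage_cost_usd(dataset_gb, months, rate=STORAGE_PER_GB_MONTH):
    """Dollars to keep the dataset sitting in S3 for a number of months."""
    return dataset_gb * months * rate

print(round(egress_cost_usd(1500, 100), 2))  # 15000.0 -> egress dominates
print(round(storage_cost_usd(1500, 1), 2))   # 34.5 per month of storage
```

So training from home would be dominated by the transfer-out charge, while training inside the same AWS region avoids it entirely.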

