ImageNet to Spark RDD in Python for Deep Learning


Wojciech Krukar

Jan 11, 2018, 6:36:16 PM
to BigDL User Group

I am building a Python image recognition model in Spark / BigDL, working with the ImageNet dataset. My problem is how to build the RDD for machine learning in BigDL. I have combed the internet, and all the Python/Spark/BigDL examples I can find somehow omit the process of creating the dataset out of the images.

Is there any ready DL script for Python/Spark/BigDL on ImageNet or a similar dataset that I could reverse engineer? Or some examples of how to build an RDD from ImageNet? Even a chunk of code would help.

(There is an example that builds an RDD from the MNIST dataset, but MNIST is only about 10 MB and is loaded straight into the RDD; I am not sure the same approach applies to ImageNet.)
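
The pattern in that example, as I understand it, is roughly the following (my own paraphrase, not the actual script; the load_mnist helper and the path are placeholders):

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="mnist-rdd")

# Hypothetical helper: reads the ~10 MB MNIST files into NumPy arrays on the driver.
images, labels = load_mnist("/tmp/mnist")   # images: (60000, 28, 28), labels: (60000,)

# The whole dataset is small enough to ship from the driver in one call.
record_rdd = sc.parallelize(list(zip(images, labels)))

This only works because MNIST fits comfortably in driver memory, which is exactly what I doubt will hold for 200 GB of ImageNet.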

What is the proper way to build an RDD from ImageNet (or a similar dataset)?

Maybe it is a vanilla question, but at this point my worry is that experimenting with an RDD built from 200 GB of data, waiting, and then learning that it was all wrong would be a massive waste of resources.

Xin Qiu

Jan 11, 2018, 9:07:52 PM
to BigDL User Group
Hi Wojciech,

There is an old PR for the ImageNet dataset based on BigDL 0.2.0: https://github.com/intel-analytics/BigDL/pull/789. It is unmerged because we were waiting for OpenCV support (already supported in 0.4.0). We will update this PR in the coming days and notify you when we have finished.
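
In the meantime, here is a rough sketch of one way to build an image RDD in plain PySpark. This is only an illustration, not the code from that PR: the HDFS path layout, the PIL-based decoding, the 224x224 resize, and the label_from_path helper are all assumptions; each record is wrapped with Sample.from_ndarray from bigdl.util.common.

import numpy as np
from io import BytesIO
from PIL import Image
from pyspark import SparkContext
from bigdl.util.common import Sample

sc = SparkContext(appName="imagenet-rdd")

# binaryFiles reads (path, bytes) pairs in parallel across the cluster;
# the wildcard path layout is an assumption about how the JPEGs are stored.
raw = sc.binaryFiles("hdfs:///data/imagenet/train/*/*.JPEG")

def to_sample(path_and_bytes):
    path, content = path_and_bytes
    # Decode the JPEG on the executor and resize; real preprocessing
    # (crop, mean subtraction, augmentation) would go here as well.
    img = Image.open(BytesIO(content)).convert("RGB").resize((224, 224))
    arr = np.asarray(img, dtype=np.float32).transpose(2, 0, 1)   # HWC -> CHW
    # label_from_path is hypothetical: map the synset folder name in the path
    # to a 1-based class index, as BigDL expects.
    label = label_from_path(path)
    return Sample.from_ndarray(arr, np.array([label], dtype=np.float32))

sample_rdd = raw.map(to_sample)

An RDD of Sample objects like this is what the BigDL Optimizer takes as its training RDD; the OpenCV-based transformers in 0.4.0 will replace the PIL part once the PR is updated.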

Regarding your question about an "RDD built from 200 GB of data": in my understanding of Hadoop and Spark, 200 GB is relatively small. Hadoop and Spark are very good at processing large-scale datasets, into the terabytes or even petabytes, and the loading time is tiny compared with the training time on Spark. If you have a cluster of 16 servers, each with 8 HDDs, then each server can easily provide 600-700 MB/s of read throughput, so the cluster can read 9.6-11.2 GB of data per second. Loading 200 GB into memory takes less than a minute.
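
To make the arithmetic concrete (the per-disk figure is only a ballpark derived from the 600-700 MB/s per-server number above):

# Back-of-the-envelope estimate of the load time for 200 GB on such a cluster.
servers = 16
disks_per_server = 8
mb_per_disk = 80.0                                   # rough sequential read per HDD, MB/s
server_mb_per_s = disks_per_server * mb_per_disk     # ~640 MB/s per server
cluster_gb_per_s = servers * server_mb_per_s / 1024  # ~10 GB/s for the whole cluster
load_seconds = 200.0 / cluster_gb_per_s              # ~20 seconds
print("throughput %.1f GB/s, load time %.0f s" % (cluster_gb_per_s, load_seconds))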

Best,
-Xin