How to load data from HDFS to create tf.dataset to fit the keras.estimator?


islem mehablia

Apr 6, 2023, 6:56:44 AM
to User Group for BigDL
I want to create a TensorFlow dataset to fit my estimator. My data is stored in HDFS, so I used this code to load it:

def dataset_from_HDFS(img_path):
    # Read the images from HDFS as a distributed ImageSet
    image_set = ImageSet.read(img_path, sc, 4)
    transformer = ChainedPreprocessing(
        [ImageResize(h, w),
         ImageMatToTensor(),
         ImageSetToSample()])
    image_data = transformer(image_set)
    # get_image() returns the preprocessed images; collect() pulls them
    # all to the driver as one numpy array
    samples = image_data.get_image()
    images = np.array(samples.collect())
    return images


I call this function and store the images in a local variable:
train_images=dataset_from_HDFS(train_img_pth)

Then I use this function to create the dataset:

def train_data_creator(config, batch_size):
    # Build the tf.data.Dataset from the arrays already collected on the driver
    dataset = tf.data.Dataset.from_tensor_slices((train_images, train_lab))
    dataset = dataset.map(lambda image, label: (preprocess_image(image), preprocess_label(label)))
    # Use the batch_size passed in by the estimator
    dataset = dataset.batch(batch_size, drop_remainder=True)
    return dataset


Finally, I fit my estimator using the above function like this: 
est.fit(data=train_data_creator,
        batch_size=2,
        epochs=2,
        validation_data=val_data_creator)


My problem is that my machine's memory cannot fit the whole dataset. My code throws memory errors (Ray out-of-memory errors) when I try to read a big dataset (here: train_images=dataset_from_HDFS(train_img_pth)), and sometimes it throws them during the fit process.

I suspect that this approach to creating the tf.dataset is totally wrong, and I don't know what to do instead. Any information would be very helpful.
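For example, would building the dataset from the HDFS file paths only, and decoding the images lazily per batch, be a better direction? Below is only a rough, untested sketch on my side: the train_img_paths list, the JPEG format, and tf.io being able to read hdfs:// paths in my setup are all assumptions, and preprocess_image/preprocess_label are the same helpers as above.

def train_data_creator(config, batch_size):
    # Build the dataset from file paths only, so images are read and decoded
    # lazily per batch instead of being collected into one numpy array up front
    dataset = tf.data.Dataset.from_tensor_slices((train_img_paths, train_lab))

    def load_and_preprocess(path, label):
        image = tf.io.read_file(path)                 # reads hdfs:// paths if HDFS support is available
        image = tf.io.decode_jpeg(image, channels=3)  # assuming JPEG images
        image = tf.image.resize(image, (h, w))
        return preprocess_image(image), preprocess_label(label)

    dataset = dataset.map(load_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    dataset = dataset.batch(batch_size, drop_remainder=True)
    return dataset

With something like this, only the file paths and labels would be held in driver memory, and each batch would read just its own files.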

Thank you, 

Ge Song

Apr 6, 2023, 8:29:52 AM
to User Group for BigDL
Hi Mehablia,

I hope this email finds you well.

May I ask if the data has been successfully stored locally by calling `train_images=dataset_from_HDFS(train_img_pth)`? Additionally, would it be possible for you to provide more error logs? This would greatly assist us in providing you with solutions.

Thanks,
Ge

islem mehablia

Apr 6, 2023, 9:00:26 AM
to User Group for BigDL
Hi Ge,
Thank you for answering my question.

The answer to your first question is yes: the code `train_images=dataset_from_HDFS(train_img_pth)` can access the image data stored in HDFS as numpy arrays, and I can even display them.
About the errors my code throws: when I try to read a large number of images, it shows that Ray is out of memory.
When I tested the code with a small amount of data, it showed this:
"Failed to put object fffffffffffffffffffff10000000 in object store because it is full. Object size is 3,119,515,579 bytes.
The local object store is full of objects that are still in scope and cannot be evicted.
Use 'ray memory' command to list active objects in the cluster."
(I am writing these errors by hand because I am using a different machine.)

Thanks,
Mehablia

Ge Song

Apr 7, 2023, 2:45:42 AM
to User Group for BigDL
Hi Mehablia,

The following are possible solutions you can try to resolve the Ray out-of-memory (OOM) error:

  1. Increase the object store memory.

    Specify the object store size when starting the Ray cluster with OrcaContext as shown below (a fuller end-to-end sketch follows after this list):

    init_orca_context(init_ray_on_spark=True, object_store_memory="4g")

  2. Monitor the object store usage.

    • Execute the `ray memory` command during runtime to check if the store is full. If so, stop the cluster using the `ray stop` command, modify the configuration file to increase the object store size, and then restart the cluster.

    • You could also run a Python command to delete unnecessary objects as below:

      del object_ref          # or: ray.internal.free(object_ref_id)
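
For reference, here is a minimal end-to-end sketch putting the two suggestions together (the bigdl.orca import path and the cluster_mode value are assumptions; adjust them to your environment and data size):

from bigdl.orca import init_orca_context, stop_orca_context

# Start Orca with Ray on Spark and a larger object store
sc = init_orca_context(cluster_mode="local",
                       init_ray_on_spark=True,
                       object_store_memory="4g")  # increase if `ray memory` shows the store filling up

# ... create the estimator and call est.fit(...) here ...

# Drop large driver-side objects that are no longer needed
del train_images
# or, as above, explicitly free a Ray object: ray.internal.free(object_ref_id)

stop_orca_context()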

Thanks,
Ge
