I want to create a TensorFlow dataset to fit my estimator. My data is stored in HDFS, so I use this code to load it:
import numpy as np
from zoo.feature.common import ChainedPreprocessing
from zoo.feature.image import ImageSet, ImageResize, ImageMatToTensor, ImageSetToSample

def dataset_from_HDFS(img_path):
    # read the images from HDFS as a distributed ImageSet (sc is my SparkContext)
    image_set = ImageSet.read(img_path, sc, 4)
    # resize every image to h x w and convert it to a tensor wrapped in a Sample
    transformer = ChainedPreprocessing(
        [ImageResize(h, w),
         ImageMatToTensor(),
         ImageSetToSample()])
    image_data = transformer(image_set)
    samples = image_data.get_image()
    # collect() pulls every image to the driver and builds one big NumPy array
    images = np.array(samples.collect())
    return images
I call this function and store the images in a local variable:
train_images = dataset_from_HDFS(train_img_pth)
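For reference, the result of dataset_from_HDFS is a plain NumPy array that is fully materialized in local (driver) memory; a quick way to inspect its footprint (just a sanity-check sketch, nothing specific to my pipeline) is:

# inspect the array returned by dataset_from_HDFS; it lives entirely in local memory
print(train_images.shape)                        # e.g. (num_images, channels, h, w)
print(train_images.nbytes / (1024 ** 3), "GiB")  # total bytes held on the driver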
Then I use this function to create the dataset:
import tensorflow as tf

def train_data_creator(config, batch_size):
    # train_images and train_lab are the NumPy arrays collected above
    dataset = tf.data.Dataset.from_tensor_slices((train_images, train_lab))
    dataset = dataset.map(lambda image, label: (preprocess_image(image), preprocess_label(label)))
    dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
    return dataset
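If it helps, this is roughly how I sanity-check the creator locally (a minimal sketch assuming eager execution / TF2; config is not used, and the exact shapes depend on my preprocess_image and preprocess_label):

# pull one batch from the dataset returned by the creator to verify shapes and dtypes
ds = train_data_creator(config={}, batch_size=2)
for images, labels in ds.take(1):
    print(images.shape, images.dtype)   # one batch of preprocessed images
    print(labels.shape, labels.dtype)   # the matching batch of labels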
Finally, I fit my estimator using the above function like this:
est.fit(data=train_data_creator,
        batch_size=2,
        epochs=2,
        validation_data=val_data_creator)
My problem is that my machine's memory cannot hold the whole dataset. My code throws memory errors (a Ray out-of-memory error) when I try to read a big dataset, here:
train_images = dataset_from_HDFS(train_img_pth)
and sometimes the error is thrown during the fit process...
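To give a sense of the scale, here is a back-of-the-envelope estimate with hypothetical numbers (my real image size and count differ):

# rough size of the collected array for 100,000 RGB images of 224x224 stored as float32
num_images, img_h, img_w, channels = 100_000, 224, 224, 3
bytes_per_value = 4  # float32
total_gib = num_images * img_h * img_w * channels * bytes_per_value / (1024 ** 3)
print(total_gib)  # roughly 56 GiB that must fit on a single machine before fit() even starts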
I suspect that this approach to creating the tf.data.Dataset is completely wrong, but I don't know what to do instead. Any information would be very helpful.
Thank you,