How to use tf.data.Dataset with list of dictionaries....


Michal Lukáč

Mar 17, 2020, 9:50:43 AM
to Discuss

I have a dataset stored as a list of dictionaries. Each dictionary can be fairly complex and may contain objects and metadata that are critical for data loading. I would like to load this list through the TensorFlow Dataset API. How can I do this? I tried something like the following, but it does not work:


import tensorflow as tf
import json

LABELS_IDS = ["cat", "dog", "animal"]

def parse_record(record):
    image = tf.io.read_file(record["_file"])
    image = tf.image.decode_jpeg(image)
    image = tf.image.convert_image_dtype(image, tf.float32)
    image = tf.image.resize(image, [224, 224])
    image = tf.image.random_flip_left_right(image, seed=None)

    labels = []
    for element in record["_categories"]:
        if element in LABELS_IDS:
            labels.append(LABELS_IDS.index(element))

    one_hot_labels = tf.reduce_sum(tf.one_hot(labels, len(LABELS_IDS)), axis=0)
    return image, one_hot_labels

json_records = [{"_file":"images/test.jpg", "_categories": ["cat", "animal"]}]

train_x = tf.data.Dataset.from_tensor_slices(json_records).map(parse_record)

I find tf.data.Dataset hard to use. Before this I was using DataGenerators, which were great because they were more pythonic. However, they were very slow, and I cannot use multiprocessing because of deadlocks. Has anyone had experience with this?

Paige Bailey

Mar 17, 2020, 12:11:36 PM
to Michal Lukáč, Discuss
Hi, Michal -
Would Dataset.from_generator help with this? API docs with additional examples here.
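
A minimal sketch of the from_generator approach, reusing the record layout from the question. It assumes TensorFlow 2.4 or later (for the output_signature argument); the helper names multi_hot, record_generator, and load_image and the second record are made up for illustration. The dictionary handling stays in plain Python, and only a path string and a fixed-length label vector are handed to TensorFlow:

```python
import numpy as np
import tensorflow as tf

LABELS_IDS = ["cat", "dog", "animal"]

# Illustrative records in the same layout as the question; the files need not exist
# to build the pipeline, only to iterate over the decoded images.
json_records = [
    {"_file": "images/test.jpg", "_categories": ["cat", "animal"]},
    {"_file": "images/other.jpg", "_categories": ["dog", "unknown"]},
]


def multi_hot(categories):
    """Encode a list of category names as a fixed-length 0/1 vector."""
    vec = np.zeros(len(LABELS_IDS), dtype=np.float32)
    for name in categories:
        if name in LABELS_IDS:
            vec[LABELS_IDS.index(name)] = 1.0
    return vec


def record_generator():
    # Plain Python: arbitrary dictionary logic is fine here.
    for record in json_records:
        yield record["_file"], multi_hot(record["_categories"])


def load_image(path, labels):
    # TensorFlow ops only: this function is traced into the tf.data graph.
    image = tf.io.read_file(path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.convert_image_dtype(image, tf.float32)
    image = tf.image.resize(image, [224, 224])
    image = tf.image.random_flip_left_right(image)
    return image, labels


paths_and_labels = tf.data.Dataset.from_generator(
    record_generator,
    output_signature=(
        tf.TensorSpec(shape=(), dtype=tf.string),
        tf.TensorSpec(shape=(len(LABELS_IDS),), dtype=tf.float32),
    ),
)
train_ds = paths_and_labels.map(load_image)
```

Keeping the generator limited to cheap bookkeeping and doing the image decoding in map means the expensive work runs as TensorFlow ops and not as Python code.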

(Also, in the future, please ask similar questions on Stack Overflow or GitHub.)

Thanks!
.pb


--

Paige Bailey   

Product Manager (TensorFlow)

@DynamicWebPaige

webp...@google.com


 

Jiri Simsa

Mar 17, 2020, 3:24:52 PM
to Paige Bailey, Michal Lukáč, Discuss
Thanks, Paige. Note that tf.data.Dataset.from_generator will not result in a performant solution, because execution of the Python generator is serialized by Python's GIL.

Michal, as per Paige's suggestion, asking such a question on Stack Overflow (for usability) or GitHub (for bugs / feature requests) is more appropriate. In addition, providing an example that can be executed "as is", along with the error message you are encountering (if applicable), will increase the likelihood that someone will be able to help you.

Last but not least, there is a trade-off between performance and usability. As you have learned, the "pythonic" way is in general not performant. Using tf.data (and more broadly TensorFlow) allows you to express computation as a dataflow graph, which can be executed more efficiently (e.g. utilizing parallelism of the underlying hardware).
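
As a rough illustration of that trade-off, the sketch below first flattens the records into equal-length columns in Python and then expresses the per-image work purely as TensorFlow ops, so that map can run it in parallel. It assumes TensorFlow 2.4 or later (for tf.data.AUTOTUNE); the helper name load_image, the second record, and the batch size of 32 are illustrative choices, not something from this thread:

```python
import numpy as np
import tensorflow as tf

LABELS_IDS = ["cat", "dog", "animal"]

# Illustrative records in the same layout as the question.
json_records = [
    {"_file": "images/test.jpg", "_categories": ["cat", "animal"]},
    {"_file": "images/other.jpg", "_categories": ["dog", "unknown"]},
]

# One-off Python preprocessing: turn the list of dictionaries into
# equal-length columns that from_tensor_slices can slice row by row.
paths = [record["_file"] for record in json_records]
labels = np.array(
    [[float(name in record["_categories"]) for name in LABELS_IDS]
     for record in json_records],
    dtype=np.float32,
)


def load_image(path, label):
    # Only TensorFlow ops, so this is traced once into the dataflow graph.
    image = tf.io.read_file(path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.convert_image_dtype(image, tf.float32)
    image = tf.image.resize(image, [224, 224])
    return image, label


train_ds = (
    tf.data.Dataset.from_tensor_slices((paths, labels))
    # Let tf.data decide how many images to load and decode concurrently.
    .map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
    # Overlap input preparation with training.
    .prefetch(tf.data.AUTOTUNE)
)
```

The Python loop runs only once, while the dataset is being built; everything that runs per element during training is a graph op.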

Best,

Jiri
