Read TFRecord image data with new TensorFlow Dataset API

Gabriel Perdue

Dec 1, 2017, 1:26:50 PM

to Discuss

Hi,

On Monday I posted a question on Stack Overflow about how to use the Dataset API, and it hasn't received any comments.

https://stackoverflow.com/questions/47521213/read-tfrecord-image-data-with-new-tensorflow-dataset-api

In short, I'm having trouble reading image files encoded in TFRecords. For this example, I've encoded the MNIST dataset into TFRecord files (one for training, one for validation, and one for testing). You can find a gist that reproduces my error, plus links to the data files, here:

https://gist.github.com/gnperdue/56092626d611ae23370a21fdeeb2abe8

I have this sort of code working with the "old" file and batch-queue API. In general, though, I'm curious whether my approach here is philosophically correct. What is the recommended way to consume very large (many millions) of "image" datasets? My actual application domain is physics, and it is very easy to put together, for example, HDF5 files that contain millions of images (they are very sparse), but it is not very efficient to consume them with TensorFlow. I've had decent success converting HDF5 files into TFRecord files, but I have to break each HDF5 file up into many TFRecord files. There is a practical limit of about 20k of my images per TFRecord file, but it is easy to use the file APIs to loop over the files, so that is fine.
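The splitting scheme described above (many images, at most about 20k per TFRecord file) can be sketched in plain Python. This is only an illustration, not the code from the gist: the function name `plan_shards`, the default limit of 20000, and the file-name pattern are all assumptions made for the example.

```python
def plan_shards(num_images, max_per_file=20000, prefix="images"):
    """Return a list of (filename, start, stop) tuples covering range(num_images).

    Each tuple describes one output file holding images[start:stop].
    The name pattern and the default limit are hypothetical choices.
    """
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    if max_per_file <= 0:
        raise ValueError("max_per_file must be positive")

    # Ceiling division: number of files needed to hold every image.
    num_shards = (num_images + max_per_file - 1) // max_per_file

    shards = []
    for i in range(num_shards):
        start = i * max_per_file
        stop = min(start + max_per_file, num_images)
        name = "{}-{:05d}-of-{:05d}.tfrecord".format(prefix, i, num_shards)
        shards.append((name, start, stop))
    return shards


if __name__ == "__main__":
    for name, start, stop in plan_shards(45000):
        print(name, start, stop)
```

Each tuple can then drive one pass over the source file, reading `images[start:stop]` and writing that slice to its own output file.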

Any thoughts on my specific question on SO and/or on my general recommendations question will be deeply appreciated.

Thanks!

pax

Gabe

Marianne Linhares

Dec 1, 2017, 3:23:14 PM

to Gabriel Perdue, Discuss

Hi Gabriel,

> What is the recommended way to consume very large (many millions) of "image" datasets?

I think you're on the right track using TFRecords + the Dataset API.

I didn't try to run your gist, but if it helps, there is a full example of how to do it with MNIST here.

Hope this helps!

--

Marianne Linhares Monteiro

Undergraduate Student at UFCG - Universidade Federal de Campina Grande

Gabriel Perdue

Dec 1, 2017, 7:43:27 PM

to Marianne Linhares, Discuss

Thanks! This is super helpful. I've discovered my problem was not with the Dataset API, but with how I was creating the TFRecord file: I was making one big TFRecord per file. What I don't quite understand now is how my previous application, which used the file and batch-queue API, worked, but I'm sure I'll figure that out shortly. (I'll also go answer my own question on SO this weekend and explain what I had wrong over there.)
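For reference, here is a minimal sketch of what is presumably the intended layout: one serialized example per image instead of one big record per file, read back through the Dataset API. It is not the code from the gist. It assumes the TensorFlow 2.x names (`tf.io.*`, eager iteration); in the 1.x releases current at the time of this thread the equivalent calls lived at the top level (for example `tf.parse_single_example`). The feature keys `image_raw` and `label`, the 28x28x1 `uint8` image shape, and the helper names are assumptions chosen for illustration.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

IMAGE_SHAPE = [28, 28, 1]  # assumed MNIST-like shape


def write_examples(path, images, labels):
    """Write one tf.train.Example per image (one record per image)."""
    with tf.io.TFRecordWriter(path) as writer:
        for image, label in zip(images, labels):
            raw = np.asarray(image, dtype=np.uint8).tobytes()
            feature = {
                # Hypothetical feature keys; they must match parse_example.
                "image_raw": tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[raw])),
                "label": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[int(label)])),
            }
            example = tf.train.Example(
                features=tf.train.Features(feature=feature))
            writer.write(example.SerializeToString())


def parse_example(serialized):
    """Decode a single serialized Example into (image, label) tensors."""
    features = tf.io.parse_single_example(
        serialized,
        {
            "image_raw": tf.io.FixedLenFeature([], tf.string),
            "label": tf.io.FixedLenFeature([], tf.int64),
        },
    )
    image = tf.io.decode_raw(features["image_raw"], tf.uint8)
    image = tf.reshape(image, IMAGE_SHAPE)
    return image, features["label"]


def make_dataset(path, batch_size):
    """Build a batched dataset from one TFRecord file."""
    dataset = tf.data.TFRecordDataset([path])
    return dataset.map(parse_example).batch(batch_size)


if __name__ == "__main__":
    # Synthetic data stands in for real images.
    demo_images = np.zeros([5] + IMAGE_SHAPE, dtype=np.uint8)
    demo_labels = [0, 1, 2, 3, 4]
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.tfrecord")
    write_examples(demo_path, demo_images, demo_labels)
    for image_batch, label_batch in make_dataset(demo_path, batch_size=2):
        print(image_batch.shape, label_batch.numpy())
```

Because every image is its own record, `parse_single_example` sees exactly one image per call, and batching is left to `Dataset.batch`.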

Thanks again!

pax

Gabe
