Re: pandas dataframe to tensorflow dataset with textual data

Paige Bailey

Apr 16, 2020, 3:49:41 PM
to Jeff Verdegan, Discuss
Hi, Jeff -
Thanks for the question! Does this tutorial help?


On Thu, Apr 16, 2020 at 12:45 PM Jeff Verdegan <jver...@youmail.com> wrote:
Hi there! I'm just getting started with TF and pandas, so if this has an obvious answer that I'm just not finding, please point me toward the appropriate docs.

I have a 2-column CSV file: match (Y/N) and text (a few sentences, some of which match my criteria and some of which don't).

I'm following the example at https://www.tensorflow.org/tutorials/structured_data/feature_columns, and it works with their sample data which is all numerical (except for the popped label column).

However, when I try to use my data, I get 
ValueError: Can't convert Python sequence with mixed types to Tensor.

As I mentioned, my input data is all text, but when I head() the dataframe, it seems pandas has injected a row number column.
TypeError: Could not build a TypeSpec for 10972    blah blah this is my text

I even removed all the digits from my sample data, so as far as I can tell, it's the pandas-injected column number that from_tensor_slices doesn't like.

Is there something I should use instead of from_tensor_slices, or ahead of it to prepare the data?
Or an easy way to tell it to treat the injected column numbers as text?
Or tell it to ignore it?

This is the example code I copied from the above link. 
Thanks for any guidance you can give!

import tensorflow as tf

# A utility method to create a tf.data dataset from a Pandas Dataframe
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()
  labels = dataframe.pop('target')
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels)) # FAILS HERE
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  return ds



--

Paige Bailey   

Product Manager (TensorFlow)

@DynamicWebPaige

webp...@google.com


 

Jeff Verdegan

Apr 16, 2020, 3:54:06 PM
to Discuss
Hey, Paige,

Thanks for the quick response! 

Unfortunately, that doesn't help. That's the tutorial I'm starting from. It works using their sample data, which is all numbers (except the "target" column, which gets stripped off for the classification labels).

My problem seems to be that my textual data + the column number that pandas seems to be injecting is making it look like it's mixed types. So ultimately I think I need a way to tell it to ignore that injected column when turning the pandas dataframe into a TF Dataset. (Although, as a rank newbie, I could be just missing something obvious.)
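
For reference, the numbers that head() prints down the left side are the DataFrame's index, not a column, so dict(dataframe) never passes them to from_tensor_slices. A minimal sketch of that, using a made-up two-row frame (the data is hypothetical; only the column names match and text come from this thread):

```python
import pandas as pd

# Hypothetical two-row stand-in for the CSV described above.
df = pd.DataFrame({"match": ["Y", "N"],
                   "text": ["first sample", "second sample"]})

# dict(df) is keyed by column name only; the row index is not among the keys.
feature_keys = list(dict(df).keys())

# The row numbers live in df.index, separate from the columns.
row_labels = list(df.index)

print(feature_keys)
print(row_labels)
```

So the row numbers are unlikely to be what makes the data look like mixed types.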

Jeff Verdegan

Apr 16, 2020, 4:35:38 PM
to Discuss
Okay, so I'm totally wrong about the root problem.

I should've done this sooner, but I stripped my data down to just a few rows, and it worked fine.

So apparently there's something in the data itself, but only in certain rows that's making it look like "mixed types."

So any help in tracking that down would be much appreciated.

Thanks!
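
One way to narrow this down, as a rough sketch rather than a definitive recipe: count the Python types present in each column, then pull out the rows where text is not a string. The CSV contents below are invented for illustration; only the column names match and text come from this thread.

```python
import io
import pandas as pd

# Hypothetical sample: the last line has only one field, so pandas
# fills its 'text' cell with NaN (a float) when reading it.
csv_text = "match,text\nY,first sample\nN,second sample\nbroken line\n"
df = pd.read_csv(io.StringIO(csv_text))

# Count the Python types found in each column; a column holding more
# than one type is a candidate for the "mixed types" error.
type_counts = {col: df[col].map(type).value_counts().to_dict()
               for col in df.columns}

# Keep only the rows whose 'text' value is not a string.
bad_rows = df[~df["text"].map(lambda v: isinstance(v, str))]

print(type_counts)
print(bad_rows)
```

The index labels of bad_rows point at the offending rows, which can then be checked against the original file.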

Jeff Verdegan

Apr 16, 2020, 5:30:35 PM
to Discuss
It turned out to be the dumbest thing ever, and not related to TF or pandas at all.

At least one of my data rows had a newline embedded in the sample text, so it created an extra line with only one column.
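
For anyone who lands here with the same error, a small sketch of how such lines could be found before the file ever reaches pandas, using only the standard csv module. The sample text and the helper name find_short_rows are made up for illustration, and the expected field count of 2 is an assumption based on the two-column file described earlier in the thread.

```python
import csv
import io

# Hypothetical file contents: line 4 is the tail of a text field that was
# split by an unquoted newline. The quoted field on lines 5-6 is fine,
# because a newline inside quotes is parsed as part of the field.
raw = (
    "match,text\n"
    "Y,first sample\n"
    "N,a sample that was split\n"
    "by a stray newline\n"
    'Y,"quoted text\nwith a newline"\n'
)

def find_short_rows(lines, expected_fields=2):
    """Return (line_number, row) for every row with the wrong field count."""
    reader = csv.reader(lines)
    bad = []
    for row in reader:
        if len(row) != expected_fields:
            # reader.line_num is the number of source lines read so far.
            bad.append((reader.line_num, row))
    return bad

short_rows = find_short_rows(io.StringIO(raw))
print(short_rows)
```

Quoting the text field when the CSV is written is one way to keep embedded newlines from splitting a row in the first place.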

Pardon the intrusion, you may continue with your social distancing exercises.  :-)