Binary columns and missing values


Peter Prettenhofer

Feb 28, 2014, 3:39:44 AM
to pyd...@googlegroups.com
Hi all,

I have a CSV file with lots of binary columns. True values are encoded as 'true', and missing values indicate False. When I use ``pandas.read_csv('file.csv', true_values=['true'])``, the columns have dtype ``object`` because of the missing values. When I explicitly encode False values as 'false', the columns are indeed bool. Is there a way to treat columns with one value plus missing values as boolean? It would be a huge memory saver for me: the properly encoded version has a 4x larger file size but a 10x smaller memory footprint as a DataFrame.

thanks,
 Peter

PS: If it's not possible, is that something people are interested in, and if so, should I prepare a PR for it?
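
For readers who want to see the memory gap between an ``object`` column and a ``bool`` column for themselves, here is a minimal sketch. It uses synthetic Series rather than the original file (the row count ``n`` and the variable names are made up for illustration), so the ratio it prints will not necessarily match the 10x figure above.

```python
import numpy as np
import pandas as pd

n = 100_000  # hypothetical row count, chosen only for illustration

# What a flag column looks like when missing values force dtype object:
# a mix of True and NaN stored as Python objects.
as_object = pd.Series([True, np.nan] * n, dtype=object)

# The same information stored as a real boolean column.
as_bool = pd.Series([True, False] * n, dtype=bool)

# deep=True also counts the Python objects an object column points to.
object_bytes = as_object.memory_usage(index=False, deep=True)
bool_bytes = as_bool.memory_usage(index=False, deep=True)

print("object column bytes:", object_bytes)
print("bool column bytes:  ", bool_bytes)
```

A bool column stores one byte per value, while an object column stores at least a pointer per value, which is where the difference comes from.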

Jeff Reback

Feb 28, 2014, 6:36:03 AM
to pyd...@googlegroups.com

You can do df['col'] = df['col'].fillna(False).astype(bool) after you read it in.
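
A minimal sketch of that post-read conversion, applied to several flag columns at once. The inline CSV text and the column names (``flag_a``, ``flag_b``) are invented for illustration; only ``true_values``, ``fillna`` and ``astype`` come from the thread.

```python
import io

import pandas as pd

# Hypothetical sample data: 'true' marks True, an empty field means False.
CSV_TEXT = "id,flag_a,flag_b\n1,true,\n2,,true\n3,true,true\n"

df = pd.read_csv(io.StringIO(CSV_TEXT), true_values=["true"])

# Replace missing values with False, then cast each flag column to bool.
flag_columns = ["flag_a", "flag_b"]
for col in flag_columns:
    df[col] = df[col].fillna(False).astype(bool)

print(df.dtypes)
```

Note that the whole frame is read with the larger intermediate dtype first, so this shrinks the final DataFrame but not the peak memory during the read.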
--
You received this message because you are subscribed to the Google Groups "PyData" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pydata+un...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Peter Prettenhofer

Feb 28, 2014, 7:02:41 AM
to pyd...@googlegroups.com
Hi Jeff,

Yep, but then the memory footprint of my import process still peaks at 10x. Also, my Python processes hardly release memory back to the OS. I'd rather do it in ``pandas.read_csv`` directly.

thx, 
 Peter



Jeff Reback

Feb 28, 2014, 7:05:42 AM
to pyd...@googlegroups.com

You can specify false_values as well (as '').
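
A hedged sketch of the false_values idea. One assumption to flag loudly: by default pandas treats empty fields as NA, so this sketch passes ``na_filter=False`` on the guess that the empty string can then be matched by ``false_values``. That behavior has not been checked across pandas versions, so the code verifies the resulting dtype and falls back to an explicit comparison if the column did not come out as bool. The sample data and column names are invented.

```python
import io

import pandas as pd

# Hypothetical sample data: 'true' marks True, an empty field means False.
CSV_TEXT = "id,flag\n1,true\n2,\n3,true\n"

df = pd.read_csv(
    io.StringIO(CSV_TEXT),
    true_values=["true"],
    false_values=[""],
    na_filter=False,  # assumption: needed so '' is not turned into NaN first
)

if df["flag"].dtype != bool:
    # Fallback in case the parser did not produce a bool column:
    # anything that is not a true marker counts as False.
    df["flag"] = df["flag"].isin([True, "true"])

print(df.dtypes)
```

Keep in mind that ``na_filter=False`` turns off missing-value detection for every column in the file, so genuinely missing values elsewhere would be read as empty strings.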

Furthermore, you could read in chunks to avoid peak memory usage.
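
A minimal sketch of chunked reading, assuming the conversion to bool is done per chunk so that only one chunk at a time holds the larger intermediate dtype. The helper name ``read_flags_in_chunks``, the sample data and the chunk size are all made up for illustration; ``chunksize`` is the ``pandas.read_csv`` parameter that makes it return an iterator of DataFrames.

```python
import io

import pandas as pd

# Hypothetical sample data: 'true' marks True, an empty field means False.
CSV_TEXT = "id,flag\n1,true\n2,\n3,true\n4,\n5,true\n"


def read_flags_in_chunks(source, flag_columns, chunksize):
    """Read a CSV piece by piece, casting flag columns to bool per chunk."""
    converted = []
    reader = pd.read_csv(source, true_values=["true"], chunksize=chunksize)
    for chunk in reader:
        for col in flag_columns:
            # Missing values mean False; cast to bool before keeping the chunk.
            chunk[col] = chunk[col].fillna(False).astype(bool)
        converted.append(chunk)
    # Stitch the compact chunks back into one DataFrame.
    return pd.concat(converted, ignore_index=True)


df = read_flags_in_chunks(io.StringIO(CSV_TEXT), ["flag"], chunksize=2)
print(df.dtypes)
```

In practice the chunk size would be far larger than 2; it is tiny here only so that the sample data spans several chunks.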

Peter Prettenhofer

Feb 28, 2014, 7:06:46 AM
to pyd...@googlegroups.com
That makes sense, thanks!