On 29/10/2013 12:38, Jeffrey Tratner wrote:
> Why specifically don't you want to have intermediate data structures?
Memory usage. If I'm building a dataframe with 80 million rows in it, I
don't want to first have to create either an 80 million by 10 numpy
array, or a massive list of dicts, just to throw that away. I'm trying to
avoid doubling my memory requirements...
> (You're going to have them regardless because that's what the db drivers
> produce).
Not so: most database adapters (certainly psycopg2!) provide good
interfaces for streaming huge numbers of rows without loading them all
into memory at once. Yes, there will be objects created for each row,
but these are created a few at a time and can be garbage collected
during the load.
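
A minimal sketch of the streaming pattern described above. It uses sqlite3
from the standard library as a stand-in so it runs anywhere; the table and
column names are hypothetical. With psycopg2 the equivalent would be a
named (server-side) cursor, created with connection.cursor(name=...), so
that rows stay on the server until they are fetched.

```python
import sqlite3


def stream_rows(cursor, batch_size=1000):
    """Yield rows one at a time, fetching batch_size rows per call."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            return
        for row in batch:
            yield row


# Stand-in database: an in-memory SQLite table with hypothetical columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    ((i, i * 0.5) for i in range(10000)),
)

cursor = conn.execute("SELECT id, value FROM readings ORDER BY id")
total = 0.0
row_count = 0
# Only one batch of row tuples is alive at any point in this loop.
for row_id, value in stream_rows(cursor, batch_size=500):
    total += value
    row_count += 1
```

Only batch_size row objects exist at a time, so earlier batches can be
garbage collected while later ones are still being read.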
> Have you tried read_sql?
This naively loads all the rows into a list, and we're back to the
double memory usage problem.
> For your specific question: you can create a DataFrame with empty index
> and columns or empty columns with index, but you can't create an 'empty'
> DataFrame
Hmm, I don't think I phrased my question very well. I have two different
scenarios:
1. I know the columns, but want to stream the data into the frame rather
than creating some other data structure up front (the 80 million problem
above)
2. I know both the columns and row (index) labels.
It sounds like you're describing both of these; in each case, what's the
correct way to create the dataframe?
In case 1, I'm particularly interested in ways that don't make it very
slow to iteratively add the 80 million rows.
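
One possible approach for case 1, sketched under the assumption that the
row count is known up front (for example from a SELECT COUNT(*)): allocate
one numpy array per column and fill it a batch at a time. The helper names
frame_from_batches and fake_batches are hypothetical, and fake_batches only
stands in for batches fetched from a database cursor. The DataFrame
constructor may still copy the arrays, so peak memory can briefly exceed
the size of the final frame.

```python
import numpy as np
import pandas as pd


def frame_from_batches(batches, columns, dtypes, n_rows):
    """Build a DataFrame by filling preallocated per-column arrays."""
    arrays = [np.empty(n_rows, dtype=dt) for dt in dtypes]
    start = 0
    for batch in batches:  # each batch is a list of row tuples
        stop = start + len(batch)
        # zip(*batch) transposes the rows of this batch into columns.
        for array, values in zip(arrays, zip(*batch)):
            array[start:stop] = values
        start = stop
    if start != n_rows:
        raise ValueError("expected %d rows, got %d" % (n_rows, start))
    return pd.DataFrame(dict(zip(columns, arrays)), columns=columns)


def fake_batches(n_rows, batch_size):
    """Hypothetical stand-in for batches fetched from a cursor."""
    for start in range(0, n_rows, batch_size):
        stop = min(start + batch_size, n_rows)
        yield [(i, i * 0.5) for i in range(start, stop)]


df = frame_from_batches(
    fake_batches(10, 4), ["id", "value"], ["int64", "float64"], 10
)
```

Assigning a whole batch per slice keeps the Python-level loop at one
iteration per batch and column rather than one per cell.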
In case 2, the index values will be data types and I know the data type
(floats) for the columns.
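
For case 2, a minimal sketch assuming float columns: build a NaN-filled
frame from the known labels, then assign rows as they arrive. The labels
and the incoming rows here are hypothetical placeholders.

```python
import numpy as np
import pandas as pd

index = ["r1", "r2", "r3"]  # hypothetical row labels
columns = ["x", "y"]        # hypothetical column names

# A scalar is broadcast over the given index and columns, giving float64.
df = pd.DataFrame(np.nan, index=index, columns=columns)

# Hypothetical rows arriving in arbitrary order: (label, x, y).
incoming = [("r2", 1.0, 2.0), ("r1", 3.0, 4.0)]
for label, x, y in incoming:
    df.loc[label] = [x, y]
```

Rows that never arrive stay NaN. Label-based assignment one row at a time
is likely to be slow for very large frames, so assigning in batches may be
preferable there.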
Hope you can help :-)
Chris
> - it always has to be filled with something. Since the missing
> value (nan) is a float, you can't prep an empty integer column (empty
> datetime cols can use NaT). You could use np.empty to create columns
> with appropriate dtypes.
>
> You'd probably be better served by creating a dict with your data and
> passing that to the constructor.
>
> On Oct 29, 2013 4:32 AM, "Chris Withers" <ch...@simplistix.co.uk> wrote:
>
> Hi All,
>
> How would I go about creating an empty dataframe with a particular
> set of columns and index values?
>
> I'm trying to do so without creating any unneeded data structures in
> advance (nested dicts of dicts, numpy arrays, etc). I want to
> populate this dataframe from a sql query once it exists.
>
> cheers,
>
> Chris
>
> --
> Simplistix - Content Management, Batch Processing & Python Consulting
> - http://www.simplistix.co.uk