creating an empty data frame


Chris Withers

Oct 29, 2013, 3:37:01 AM
to pyd...@googlegroups.com
Hi All,

How would I go about creating an empty dataframe with a particular set
of columns and index values?

I'm trying to do so without creating any unneeded data structures in
advance (nested dicts of dicts, numpy arrays, etc). I want to populate
this dataframe from a sql query once it exists.

cheers,

Chris

--
Simplistix - Content Management, Batch Processing & Python Consulting
- http://www.simplistix.co.uk

Jeffrey Tratner

Oct 29, 2013, 8:38:29 AM
to pyd...@googlegroups.com

Why specifically don't you want to have intermediate data structures? (You're going to have them regardless because that's what the db drivers produce). Have you tried read_sql?

For your specific question: you can create a DataFrame with empty index and columns or empty columns with index, but you can't create an 'empty' DataFrame - it always has to be filled with something. Since the missing value (nan) is a float, you can't prep an empty integer column (empty datetime cols can use NaT). You could use np.empty to create columns with appropriate dtypes.

You'd probably be better served by creating a dict with your data and passing that to the constructor.
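
For instance, something along these lines (column names and sizes here are just for illustration):

import numpy as np
import pandas as pd

# Pre-typed columns via np.empty: the storage is uninitialized, but the
# frame keeps integer/datetime dtypes instead of falling back to float NaN.
n = 5
df = pd.DataFrame({
    'id': np.empty(n, dtype='int64'),
    'price': np.empty(n, dtype='float64'),
    'when': np.empty(n, dtype='datetime64[ns]'),
})

# Or build a dict of columns first and hand it to the constructor in one go.
data = {'a': [1, 2, 3], 'b': [0.1, 0.2, 0.3]}
df2 = pd.DataFrame(data, columns=['a', 'b'])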


Miki Tebeka

Oct 29, 2013, 8:48:48 AM
to pyd...@googlegroups.com
Like this?

In [1]: pd.DataFrame(columns=['a', 'b', 'c'])
Out[1]:
Empty DataFrame
Columns: [a, b, c]
Index: []

In [2]:

Chris Withers

Oct 30, 2013, 4:25:07 AM
to pyd...@googlegroups.com
Okay, but can I then add a lot of data to that DataFrame without
incurring either performance or memory overheads?

cheers,

Chris

Chris Withers

Oct 30, 2013, 4:40:59 AM
to pyd...@googlegroups.com
On 29/10/2013 12:38, Jeffrey Tratner wrote:
> Why specifically don't you want to have intermediate data structures?

Memory usage. If I'm building a dataframe with 80 million rows in it, I
don't want to first have to create either an 80 million by 10 numpy
array, or a massive list of dicts, just to throw that away. I'm trying to
avoid doubling my memory requirements...

> (You're going to have them regardless because that's what the db drivers
> produce).

Not so: most database adapters (certainly psycopg2!) provide good
interfaces for streaming huge numbers of rows without loading them all
into memory at once. Yes, there will be objects created for each row,
but these are created a few at a time and can be garbage collected
during the load.
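
For example, with psycopg2 a named cursor is server-side, so something
roughly like this streams rows in batches (the connection string, table
name and process() are just placeholders):

import psycopg2

conn = psycopg2.connect("dbname=mydb")        # placeholder DSN
cur = conn.cursor(name='streaming_cursor')    # named => server-side cursor
cur.itersize = 10000                          # rows per network round trip
cur.execute("SELECT * FROM big_table")        # placeholder query

for row in cur:      # rows arrive a few at a time and can be GC'd as we go
    process(row)     # stand-in for whatever populates the frame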

> Have you tried read_sql?

This naively loads all the rows into a list, and we're back to the
double memory usage problem.

> For your specific question: you can create a DataFrame with empty index
> and columns or empty columns with index, but you can't create an 'empty'
> DataFrame

I don't think I phrased my question very well. I have two different
scenarios:

1. I know the columns, but want to stream the data into the frame rather
than creating some other data structure up front (the 80 million problem
above)

2. I know both the columns and row (index) labels.

It sounds like you're describing both of these; in each case, what's the
correct way to create the dataframe?

In case 1, I'm particularly interested in ways that don't make it very
slow to iteratively add the 80 million rows.

In case 2, the index values will be dates, and I know the data type
(floats) for the columns.

Hope you can help :-)

Chris


Francesc Alted

Oct 30, 2013, 6:08:08 AM
to pyd...@googlegroups.com
If I understand correctly, you could solve your problem by creating your
big DataFrame first with the final number of rows (if you don't know it
in advance, just perform a SQL COUNT() first). Then use fetchmany()
inside a loop to fill the DataFrame with values. Something like:

df = pd.DataFrame(...)   # pre-allocated with the final number of rows
CHUNKSIZE = 1000
i = 0
while True:
    result = c.fetchmany(CHUNKSIZE)
    schunk = len(result)
    if schunk == 0:
        break
    df.iloc[i:i + schunk] = result
    i += schunk

Please note that the above is pseudocode. You will need to adapt it to
your database adapter.
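
If it helps, the pre-allocation step could look roughly like this (table
and column names are invented):

import numpy as np
import pandas as pd

c.execute("SELECT COUNT(*) FROM big_table")         # invented table name
nrows = c.fetchone()[0]
df = pd.DataFrame(np.empty((nrows, 3), dtype='f8'),
                  columns=['open', 'high', 'low'])  # invented columns
c.execute("SELECT open, high, low FROM big_table")  # then run the fill loop above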

Now for the shameless plug. In case you want to avoid the conversion
of every cell into a Python object (which is what most of the existing
adapters do), you may want to use our iopro.pyodbc adapter
(http://docs.continuum.io/iopro/pyodbc.html): by using
fetchdictarray(size) or fetchsarray(size), you get numpy-based
containers for the data, avoiding all the Python object overhead. This
typically results in much faster in-memory loading times. More info in:

http://test.continuum.io/blog/iopro-pyodbc-performance
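
Roughly like this (the import, connection string and query below are
placeholders; see the docs above for the exact API):

import pandas as pd
from iopro import pyodbc   # placeholder import path, per the docs above

conn = pyodbc.connect("DSN=mydb")             # placeholder connection string
cur = conn.cursor()
cur.execute("SELECT * FROM big_table")        # placeholder query
# fetchdictarray returns a dict of column name -> numpy array,
# which can be passed straight to the DataFrame constructor.
chunk = cur.fetchdictarray(100000)
df = pd.DataFrame(chunk)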

[Disclaimer: I work for Continuum Analytics]

--
Francesc Alted

Jeff

Oct 30, 2013, 11:27:14 AM
to pyd...@googlegroups.com
Chris,

conversion from SQL to pandas is ATM pretty inefficient memory-wise

Francesc offers a (paid!) solution (shameless plug for open source!).

It is not clear what you are going to do with this data. But here are some tips.

Read the data out of SQL in chunks and convert to an HDF5 store.

This will allow you to then read/calculate using chunking, and even do queries on
your data set in a very efficient manner (space and time-wise).
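
For example, roughly (file name, key and the chunk source are placeholders):

import pandas as pd

store = pd.HDFStore('data.h5')            # placeholder file name
for chunk_df in chunks_of_dataframes():   # placeholder generator of DataFrames
    store.append('prices', chunk_df)      # 'prices' is a placeholder key
subset = store.select('prices', where='index > 1000000')   # query on disk
store.close()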

Most of these features are in 0.12, but enhancements are in 0.13


Here are some links for managing this type of data:

Francesc Alted

Oct 30, 2013, 12:56:18 PM
to pyd...@googlegroups.com
Well, I did offer a solution for avoiding duplication of memory without
any commercial tools. Technically, the only paid solution that I
suggested is an *optimization* that avoids creating a Python object for
every cell read from the database.

But your point is a good one: reading in chunks and storing in HDF5
format is the least memory-consuming approach (though this does not
preclude using a lot of CPU for doing it).

Francesc


--
Francesc Alted

Wes McKinney

Nov 3, 2013, 1:04:55 AM
to pyd...@googlegroups.com
Would it help if I offered a bounty for someone to create a
BSD-licensed fork of pyodbc that efficiently yields pandas DataFrame
objects? This is years overdue and would not be an enormous amount of
work for someone with familiarity with the NumPy C API and tolerance
for a bit of fiddliness with the ODBC C API (the SQLFetchScroll
function and dealing with non-fixed-width data types). Please contact
me on or off list if you are interested in this.

Thanks

Chris Withers

Nov 5, 2013, 2:16:12 AM
to pyd...@googlegroups.com
On 03/11/2013 06:04, Wes McKinney wrote:
> Would it help if I offered a bounty for someone to create a
> BSD-licensed fork of pyodbc that efficiently yields pandas DataFrame
> objects? This is years overdue and would not be an enormous amount of
> work for someone with familiarity with the NumPy C API and tolerance
> for a bit of fiddliness with the ODBC C API (the SQLFetchScroll
> function and dealing with non-fixed-width data types). Please contact
> me on or off list if you are interested in this.

Why ODBC? I'm certainly not on Windows and I don't imagine that many
people are nowadays; it certainly seems a weird choice if your database
is MySQL or Postgres.

Surely something that could pour the output of a dbapi-compliant cursor
(server-side cursor support would be good!) efficiently into a DataFrame
would be more generically useful?

I feel like I'm missing something...

Robert Kern

Nov 5, 2013, 5:30:48 AM
to pyd...@googlegroups.com
On Tue, Nov 5, 2013 at 7:16 AM, Chris Withers <ch...@simplistix.co.uk> wrote:
> Why ODBC? I'm certainly not on Windows and I don't imagine that many people are nowadays, it certainly seems a weird choice if your database is MySQL or Postgres.
>
> Surely something that could pour the output of a dbapi-compliant cursor (server-side cursor support would be good!) efficiently into a Dataframe would be more generically useful?
>
> I feel like I'm missing something...

The DB-API is a Python-level API. All of the numbers are boxed into Python int and float objects, and you have Python's iterator API to get each row. Wes is proposing doing something at the C level to avoid creating all of those Python objects, just the DataFrame and the few numpy arrays inside it. The loop to fetch the data from the cursor would be in C and fill those numpy arrays directly. The choice of ODBC as the C API is just to hit a broad range of databases in one go. It's the DB-API of the C world, more or less. ODBC works on Unices as well as Windows. If you'd like to work on a similar project for MySQL or Postgres in particular, I am sure that would be more than welcome.

--
Robert Kern