Pandas and huge dataset – is it possible?


zlju...@gmail.com

Sep 4, 2018, 7:10:02 PM
to PyData

Hi,

 

I have a dataset in a ClickHouse database that has 3,000,000,000+ (3+ billion, not million) rows and 40+ columns.

Columns are datetimes or integers, except for 9 String columns.

All strings could be converted with Series.astype('category') because there are lots of repeated values.


I have a laptop with an i7 processor and 16 GB of RAM, running 64-bit Windows 10.

 

The database is accessed with clickhouse_driver (https://github.com/mymarilyn/clickhouse-driver), so I execute


query_result = client.execute('select * from table')

df = pd.DataFrame(query_result)


in order to get the dataframe.

It works for 20,000,000 (20 million, not billion) rows, but I can see that jupyter-lab.exe is consuming 10 GB of RAM.

For these 20,000,000 rows, the query and DataFrame creation take about 10 minutes.

 

Can I expect that the whole dataset (which is growing) will fit into a pandas DataFrame in a Jupyter notebook?

What are the practical limitations of pandas/Jupyter?

What could be an alternative approach?

 

Regards.

Tom Augspurger

Sep 4, 2018, 7:16:06 PM
to pyd...@googlegroups.com
Pandas is limited to datasets that fit in memory.

You might look at Dask: http://dask.pydata.org/en/latest/
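
A minimal sketch of that route, assuming the table is reachable through a SQLAlchemy-style URI (e.g. via clickhouse-sqlalchemy) and has a numeric or datetime column to partition on; the connection URL and column names below are placeholders:

import dask.dataframe as dd

# Placeholder URI and columns; read_sql_table pulls the data in partitions
# rather than as one giant in-memory result.
ddf = dd.read_sql_table('table', 'clickhouse://default:@localhost:8123/default',
                        index_col='event_time', npartitions=200)

# Operations are lazy; only the small aggregated result is materialized.
print(ddf['country'].value_counts().compute())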


Stephan Hoyer

Sep 4, 2018, 8:10:36 PM
to pyd...@googlegroups.com
I would take a look at Ibis, which already includes Clickhouse support: https://docs.ibis-project.org/
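
A minimal sketch of the Ibis route, assuming the ClickHouse backend is installed; connection parameters, table and column names are placeholders:

import ibis

# Placeholder connection details for the ClickHouse backend.
con = ibis.clickhouse.connect(host='localhost', port=9000, database='default')

t = con.table('table')  # lazy table expression; nothing is loaded into memory yet
expr = t.group_by('country').aggregate(n=t.count())

# The aggregation runs inside ClickHouse; only the small result
# comes back as a pandas DataFrame.
print(expr.execute())

The appeal of this approach is that the filtering and aggregation are pushed down to the database, so only small results ever reach the laptop.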

Francois Dion

Sep 5, 2018, 10:44:52 AM
to PyData
On Tuesday, September 4, 2018 at 7:10:02 PM UTC-4, zlju...@gmail.com wrote:
I have a dataset in a ClickHouse database that has 3,000,000,000+ (3+ billion, not million) rows and 40+ columns.

I have a laptop with an i7 processor and 16 GB of RAM, running 64-bit Windows 10.


That is a small amount of RAM. I wouldn't expect to fit a huge number of rows/columns. If you want to do all analytics in memory, you need more than what the pandas DataFrame itself requires. And if you end up doing a .copy() or not using inplace=True, expect the memory requirement to double.
 

The database is accessed with clickhouse_driver (https://github.com/mymarilyn/clickhouse-driver), so I execute


I've not tried it, but I suspect using https://github.com/xzkostyan/clickhouse-sqlalchemy would be better; that way you can do df = pd.read_sql('select * from table', con), where con is your SQLAlchemy engine. This would probably use half the memory you are currently using by doing:


query_result = client.execute('select * from table')

df = pd.DataFrame(query_result)
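
For comparison, a sketch of the read_sql route described above (the connection URL is a placeholder and this hasn't been tested against ClickHouse):

import pandas as pd
from sqlalchemy import create_engine

# clickhouse-sqlalchemy registers the 'clickhouse://' dialect; adjust
# user, host, port and database to your setup.
engine = create_engine('clickhouse://default:@localhost:8123/default')

# read_sql builds the DataFrame straight from the database cursor, which may
# avoid holding both the raw result list and the DataFrame in memory at once.
df = pd.read_sql('SELECT * FROM table', engine)

# Alternatively, chunksize returns an iterator of smaller DataFrames.
for chunk in pd.read_sql('SELECT * FROM table', engine, chunksize=1_000_000):
    ...  # process each chunk, keep only what you need, discard the rest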


 

It works for 20,000,000 (20 million, not billion) rows, but I can see that jupyter-lab.exe is consuming 10 GB of RAM.

For these 20,000,000 rows, the query and DataFrame creation take about 10 minutes.


Specify the dtypes at DataFrame creation time. Use the smallest container data type you can, i.e. int8, float32, category. For dates, make sure you build the date format string and pass it to pd.to_datetime; specifying the format makes a huge difference in conversion time. Consider loading only a subset of columns. Do you need all the features?
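
An illustrative sketch of that kind of dtype tightening; the column names, dtypes and date format below are placeholders for the real schema:

import pandas as pd

# Tiny stand-in for the query result.
df = pd.DataFrame({
    'status_code': [200, 404, 200],
    'price': [9.99, 0.0, 12.5],
    'country': ['DE', 'DE', 'FR'],
    'event_time': ['2018-09-01 10:00:00', '2018-09-01 10:05:00', '2018-09-02 09:30:00'],
})

df['status_code'] = df['status_code'].astype('int16')
df['price'] = df['price'].astype('float32')
df['country'] = df['country'].astype('category')
# An explicit format avoids per-row format inference and is much faster.
df['event_time'] = pd.to_datetime(df['event_time'], format='%Y-%m-%d %H:%M:%S')

print(df.dtypes)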

This is about as much memory optimization as you can do. Get the memory usage using https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.memory_usage.html

That will give you the actual memory usage for X rows, and you can then extrapolate the memory usage for Y rows.
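
For example, continuing from the DataFrame in the previous sketch:

# deep=True counts the real size of object/string columns, not just the pointers.
bytes_used = df.memory_usage(deep=True).sum()
per_row = bytes_used / len(df)

# Extrapolate the measured sample to the full 3 billion rows.
print(f"~{per_row:.0f} bytes/row -> ~{per_row * 3_000_000_000 / 1e9:.0f} GB for 3e9 rows")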
 


Can I expect that the whole dataset (which is growing) will fit into a pandas DataFrame in a Jupyter notebook?


Not on your laptop.
 

What are the practical limitations of pandas/Jupyter?


Practical? 1/2 to 1 TB of RAM on a single server is probably the inflection point in terms of cost, but it is possible to get 24 TB of RAM on a single server. Having said that, pandas will not leverage all your cores, so the runtime of many analytic functions grows directly with the amount of data.
 

What could be an alternative approach?


Dask is one option. You deploy Dask workers on multiple servers. You'll have to figure out what makes sense in terms of cost when it comes to how many servers and how much RAM. Dask can also leverage all your cores, so even on a single server it could potentially make things faster (but it is complicated). Either way, distributed computing is always a bit more complex and involved.
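
A rough sketch of what that could look like, assuming the data has been exported to a shared location (e.g. Parquet files) that every worker can read; the paths, scheduler address and column names are placeholders:

from dask.distributed import Client
import dask.dataframe as dd

# Client() with no arguments starts a local multi-process cluster that uses
# all cores; pass a scheduler address to use a multi-machine cluster instead.
client = Client()  # or Client('tcp://scheduler-host:8786')

ddf = dd.read_parquet('/shared/exported_table/*.parquet')

# The computation is split across workers; only the reduced result is collected.
print(ddf.groupby('country')['amount'].mean().compute())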


Regards,
Francois

------------------------------
Francois Dion
Chief Data Scientist
Dion Research LLC
------------------------------

zlju...@gmail.com

Sep 11, 2018, 8:27:46 AM
to PyData
Thanks to all of you for the suggestions.
I am now in the process of testing what suits me best.

Thank you all.
Best regards.

anton madjid

Sep 18, 2018, 6:51:46 PM
to PyData
Maybe this can help you:

Nikhil VJ

Oct 9, 2018, 9:18:45 AM
to PyData
Hi zlju (that's all Google shows me of your name),

Is there any possibility of doing the work you want to do in chunks / segments?

Regular ops like averaging, summing, min, max, count, etc. are easily chunkable.

And even for specific ops you could figure out a multi-step process (sketched after this list):
- First stage: loop through chunks of the raw data, say 1 million rows at a time, and process each chunk into a smaller table.
- Second stage: concatenate the small tables together and process the result to get your final output.
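
A minimal sketch of that two-stage pattern, using a hypothetical CSV export and made-up column names (the same idea works with chunked SQL reads):

import pandas as pd

partials = []

# Stage 1: reduce each chunk of raw data to a small per-group summary.
for chunk in pd.read_csv('raw_data.csv', chunksize=1_000_000):
    partials.append(chunk.groupby('country')['amount'].agg(['sum', 'count']))

# Stage 2: combine the small tables and finish the calculation.
combined = pd.concat(partials).groupby(level=0).sum()
combined['mean'] = combined['sum'] / combined['count']
print(combined)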

Spend some time thinking through this. I had a challenge with chunking: contiguous blocks of data could get broken (for example, timing rows belonging to the same trip_id getting split between two chunks). I figured out a workaround by making a spillover dataframe that retained the rows having the last ID value. In the next loop, this spillover df would be concatenated with the next chunk, where that ID's remaining rows were present. I also made a lookup JSON so that for every unique ID the chunk containing it could be traced back.
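
A rough sketch of that spillover idea, with hypothetical file and column names:

import pandas as pd

spillover = pd.DataFrame()
partials = []

for chunk in pd.read_csv('timings.csv', chunksize=1_000_000):
    # Prepend whatever was held back from the previous chunk.
    chunk = pd.concat([spillover, chunk], ignore_index=True)

    # Hold back all rows of the last trip_id; it may continue in the next chunk.
    last_id = chunk['trip_id'].iloc[-1]
    spillover = chunk[chunk['trip_id'] == last_id]
    complete = chunk[chunk['trip_id'] != last_id]

    partials.append(complete.groupby('trip_id')['duration'].sum())

# The final spillover is complete once the file is exhausted.
partials.append(spillover.groupby('trip_id')['duration'].sum())
summary = pd.concat(partials)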

It takes some more coding, but the data processing gets much faster with a lower memory load if you move to chunking. A big advantage is that the upper limit on how much data you can process is removed, so if the data scales up in the future, the same program on the same hardware will be able to handle it (subject to the limits of your intermediate data, that is; in my case it was a simple dict holding just one key-value pair for every 50 records or so). Plus, in cases of smaller data we don't end up wasting high-end resources. In my case I was able to make my program run on simple personal laptops no matter how big the data got, instead of requiring hefty amounts of RAM that not all my users would have.


--Nikhil VJ