Hi,
I have a dataset in a ClickHouse database that has 3,000,000,000+ (3+ billion, not million) rows and 40+ columns.
The columns are datetime or integer types, except for 9 String columns.
All string columns could be converted with Series.astype('category') because there are lots of repeated values.
I have a laptop with an i7 processor and 16 GB of RAM, running 64-bit Windows 10.
The database is accessed with clickhouse_driver (https://github.com/mymarilyn/clickhouse-driver), so I execute

query_result = client.execute('select * from table')
df = pd.DataFrame(query_result)

in order to get the DataFrame.
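For reference, a more complete, self-contained version of what I run looks roughly like this (the host, table name, and column index are placeholders, not my real ones):

import pandas as pd
from clickhouse_driver import Client

# Connect to the ClickHouse server (placeholder host).
client = Client('localhost')

# Current approach: fetch all rows at once and build the DataFrame.
# Without explicit column names, pandas labels the columns 0, 1, 2, ...
query_result = client.execute('select * from table')
df = pd.DataFrame(query_result)

# The String columns have lots of repeated values, so they could be
# converted to the category dtype to reduce memory, e.g.:
df[3] = df[3].astype('category')  # placeholder column index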
It works for 20,000,000 (million, not billion) rows, but as far as I can see, jupyter-lab.exe is consuming 10 GB of RAM.
For these 20,000,000 rows, the query and DataFrame creation take about 10 minutes.
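To see how much of that 10 GB is the DataFrame itself, I can measure it directly, for example:

# deep=True also counts the Python string objects, which is where most
# of the memory goes for the String columns.
print(df.memory_usage(deep=True).sum() / 1024 ** 3, 'GiB')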
Can I expect that the whole dataset (which is still growing) will fit into a pandas DataFrame in a Jupyter notebook?
What are the practical limitations of pandas/Jupyter?
What could be an alternative approach?
Regards.