From ClickHouse query to the Python Pandas DataFrame

4,935 views
Skip to first unread message

kriticar

unread,
Sep 5, 2018, 3:26:50 AM9/5/18
to ClickHouse
Hi,

what would be the best method to get data from clickhouse to python pandas dataframe?

At the moment, I am using:

from clickhouse_driver import Client
client = Client(host='localhost', port=xxxx, database='yyyy')
sql = 'Select * from table'
cols = [table field1, field2, ..., fieldn]
query_result = client.execute(sql, settings = {'max_execution_time' : 3600})
df = pd.DataFrame(query_result, columns = cols)

It works on my laptop (i7, 16GB). For 1M rows and 40+ columns it takes 30 seconds.
For 10M rows and the same number of columnt, it takes 335 seconds

I tried the same for 100M, but I got the error that max_execution_time has been exceeded.



On https://github.com/mymarilyn/clickhouse-driver there is a section "Block by block results streaming"

settings = {'max_block_size': 100000}
rows_gen = client.execute_iter('QUERY WITH MANY ROWS', settings=settings)

for row in rows_gen:
    print(row)
Looks like I could use block by block approach and put all strings to categorical pandas fields.


What would be the best and the most efficient way to get data in pandas dataframe from the clickhouse query?
How about HDFStore?

Does anyone here has some experience about this subject?

Regards.
Reply all
Reply to author
Forward
0 new messages