From ClickHouse query to the Python Pandas DataFrame

4,936 views

Skip to first unread message

unread,

Sep 5, 2018, 3:26:50 AM9/5/18

to ClickHouse

Hi,

what would be the best method to get data from clickhouse to python pandas dataframe?

At the moment, I am using:

from clickhouse_driver import Client

client = Client(host='localhost', port=xxxx, database='yyyy')

sql = 'Select * from table'

cols = [table field1, field2, ..., fieldn]

query_result = client.execute(sql, settings = {'max_execution_time' : 3600})

df = pd.DataFrame(query_result, columns = cols)

It works on my laptop (i7, 16GB). For 1M rows and 40+ columns it takes 30 seconds.

For 10M rows and the same number of columnt, it takes 335 seconds

I tried the same for 100M, but I got the error that max_execution_time has been exceeded.

On https://github.com/mymarilyn/clickhouse-driver there is a section "Block by block results streaming"

settings = {'max_block_size': 100000}
rows_gen = client.execute_iter('QUERY WITH MANY ROWS', settings=settings)

for row in rows_gen:
    print(row)

Looks like I could use block by block approach and put all strings to categorical pandas fields.

What would be the best and the most efficient way to get data in pandas dataframe from the clickhouse query?

How about HDFStore?

Does anyone here has some experience about this subject?

Regards.

Reply all

Reply to author

Forward

0 new messages