What's an efficient way to insert data into ClickHouse from a Java program?


Yj H

Apr 27, 2018, 7:07:54 AM
to ClickHouse
We need to insert data into ClickHouse from a Java program efficiently.
The data to insert is a collection of Java objects resident on the JVM heap.

We are evaluating the JDBC driver (https://github.com/yandex/clickhouse-jdbc), but we find it is not efficient enough.
This is because it is HTTP-based, not TCP-based: numbers, dates, and so on are represented as text, not in the more efficient binary form.

We wonder if there is a TCP-based JDBC driver.

If not, we may implement one ourselves. Is there any documentation on the TCP protocol of ClickHouse?
The official page https://clickhouse.yandex/docs/en/interfaces/tcp/ does not describe it.
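For context, our insert path is essentially the standard addBatch/executeBatch pattern over the HTTP driver. A minimal sketch (the table name `events` and string-only columns are placeholders, not our real schema):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class BatchInsert {
    // Builds "INSERT INTO <table> (c0, c1, ...) VALUES (?, ?, ...)"
    // for a table with the given number of columns.
    static String buildInsertSql(String table, int columns) {
        StringBuilder cols = new StringBuilder();
        StringBuilder marks = new StringBuilder();
        for (int i = 0; i < columns; i++) {
            if (i > 0) { cols.append(", "); marks.append(", "); }
            cols.append("c").append(i);
            marks.append("?");
        }
        return "INSERT INTO " + table + " (" + cols + ") VALUES (" + marks + ")";
    }

    // Requires a running ClickHouse server and the clickhouse-jdbc
    // driver on the classpath; not invoked here.
    static void insertBatch(String jdbcUrl, List<String[]> rows) throws Exception {
        String sql = buildInsertSql("events", rows.get(0).length);
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            for (String[] row : rows) {
                for (int i = 0; i < row.length; i++) {
                    ps.setString(i + 1, row[i]);
                }
                ps.addBatch();
            }
            ps.executeBatch(); // the whole batch goes out in one request
        }
    }
}
```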


Yj H

Apr 27, 2018, 7:39:05 AM
to ClickHouse
Also, in the HTTP case, is it possible to send gzipped data to the ClickHouse server, so that the server transparently decompresses it before processing?
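What I have in mind is something like the sketch below: gzip the insert body and mark it with a `Content-Encoding: gzip` header on the HTTP request, which the ClickHouse HTTP interface is documented to accept (the URL, table name, and format are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipInsert {
    // Compress the raw insert payload with gzip.
    static byte[] gzip(String payload) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(payload.getBytes(StandardCharsets.UTF_8));
        }
        return buf.toByteArray();
    }

    // Requires a running server; not invoked here.
    static void post(byte[] gzippedBody) throws Exception {
        URL url = new URL(
            "http://localhost:8123/?query=INSERT%20INTO%20events%20FORMAT%20TabSeparated");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        // Tells the server the body is gzip-compressed.
        conn.setRequestProperty("Content-Encoding", "gzip");
        conn.getOutputStream().write(gzippedBody);
        conn.getInputStream().close(); // throws on a non-2xx response
    }
}
```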

Dmitry Berezhnov

Apr 27, 2018, 10:43:24 AM
to ClickHouse
Hi, what is the problem with using the official JDBC driver?
I used it for an educational project and it works very well.
I have also used another HTTP-based driver (and wrote one myself) - it works!

Yj H

Apr 27, 2018, 9:57:23 PM
to ClickHouse
We ran the following benchmark:

Insert a list of 10,000 records (in a Java ArrayList) repeatedly for 100,000 iterations into one MergeTree table; that is, 1 billion records in total.
Each record has 38 columns (mostly String columns) and is 384 bytes on average.

We insert them from a ClickHouse client to a ClickHouse server on the same local machine: 

CPU: Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz, 6 cores
Memory: 64GB
Disk write throughput: 94.78 MB/sec, measured by "time dd if=/dev/zero of=test.dbf bs=8k count=300000 oflag=direct"
Disk read throughput: 141 MB/sec, measured by "dd if=test.dbf bs=8k count=300000 of=/dev/null"

The above benchmark, inserting 1 billion records through the official JDBC driver, took 3 hours to complete.
Is this performance expected? Is there any room to improve?
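For scale: 3 hours for 1 billion rows works out to roughly 93K rows/sec, or about 36 MB/sec of raw row data at the 384-byte average - well under the 94.78 MB/sec disk write throughput measured above, which hints that the bottleneck is text serialization and parsing rather than the disk. A back-of-the-envelope calculation, assuming the 384-byte average is representative:

```java
public class Throughput {
    // Rows inserted per second, given total rows and elapsed seconds.
    static long rowsPerSecond(long rows, long seconds) {
        return rows / seconds;
    }

    // Raw data rate in decimal MB/sec, given an average row size in bytes.
    static double mbPerSecond(long rows, long bytesPerRow, long seconds) {
        return rows * (double) bytesPerRow / seconds / 1e6;
    }
}
```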

Dmitry Berezhnov

Apr 28, 2018, 5:43:55 AM
to ClickHouse
You can try these strategies:
1. Increase the batch size from 10k to 1kk (one million).
2. Store the rows to a file and use clickhouse-client for import.
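For strategy 2, the rows need to be in a format clickhouse-client understands, such as TabSeparated, which requires escaping backslashes, tabs, and newlines inside String values. A minimal sketch of that escaping (the table name in the note below is illustrative):

```java
public class TsvWriter {
    // Escape a String value for ClickHouse's TabSeparated format:
    // backslash, tab, newline, and carriage return must be escaped.
    static String escapeTsv(String v) {
        StringBuilder sb = new StringBuilder(v.length());
        for (int i = 0; i < v.length(); i++) {
            char c = v.charAt(i);
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '\t': sb.append("\\t");  break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // One TabSeparated line: fields joined by tabs, ending with a newline.
    static String toTsvLine(String... fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) sb.append('\t');
            sb.append(escapeTsv(fields[i]));
        }
        return sb.append('\n').toString();
    }
}
```

The resulting lines can be piped straight into `clickhouse-client --query="INSERT INTO events FORMAT TabSeparated"` (e.g. via a ProcessBuilder), so the intermediate file on disk is not strictly necessary.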

Yj H

May 2, 2018, 5:42:27 AM
to ClickHouse
According to my latest benchmark, changing the batch size from 10k to 1kk (one million) does not reduce the insertion time.

Personally, I think there is no need to persist the data into a CSV file on disk just for clickhouse-client to import from. Disk persistence can be avoided from an architectural point of view.

Tristan Burch

May 9, 2018, 6:48:28 PM
to ClickHouse
Hi,

I've built a Java app using the official JDBC driver to insert data into ClickHouse from Kafka. Our average row is ~600 bytes (13 columns, inserting into a ReplicatedReplacingMergeTree) and our average insert is around 27K rows, which takes about 1.5 seconds. Our sustained insert rate is around 250K rows/sec (I've seen it get close to 400K rows/sec when spiking). Even though the driver is HTTP-based, I believe it does do request compression.
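If I recall correctly, that compression is toggled through connection properties, which the driver also accepts as URL query parameters. A sketch of building such a URL; treat the `compress`/`decompress` parameter names as an assumption to verify against the driver's README:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class JdbcUrl {
    // Append driver settings as URL query parameters. With the yandex
    // clickhouse-jdbc driver, "compress" and "decompress" reportedly
    // control response and request compression (verify in the README).
    static String withParams(String base, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(base);
        char sep = base.contains("?") ? '&' : '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(sep).append(e.getKey()).append('=').append(e.getValue());
            sep = '&';
        }
        return sb.toString();
    }
}
```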

-Tristan

Tristan Burch

May 9, 2018, 6:52:21 PM
to ClickHouse
Depending on your ClickHouse setup, it might be limited there. Can you post that as well as your test?

-Tristan

Yj H

May 10, 2018, 7:56:01 AM
to ClickHouse
As mentioned in the previous post, our benchmark inserts into a standalone ClickHouse server on a single machine with the following hardware:

CPU: Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz, 6 cores
Memory: 64GB
Disk write throughput: 94.78 MB/sec, measured by "time dd if=/dev/zero of=test.dbf bs=8k count=300000 oflag=direct"
Disk read throughput: 141 MB/sec, measured by "dd if=test.dbf bs=8k count=300000 of=/dev/null"

The ClickHouse server uses the default configuration.


May I know your ClickHouse cluster setup and hardware specification?

Tristan Burch

May 10, 2018, 10:09:46 AM
to ClickHouse
We run a cluster of 10 shards (2 instances per shard) on r4.8xlarge (32 vCPUs, 244 GB Memory) instances in AWS with EBS for storage (gp2 with 10000 IOPS). 

Can you post the source code for your test?

-Tristan

cos...@gmail.com

May 21, 2018, 11:57:08 PM
to ClickHouse
We are working on exactly this. The native TCP protocol part has been implemented; see https://github.com/houseflys/ClickHouse-Native-JDBC

On Friday, April 27, 2018 at 7:07:54 PM UTC+8, Yj H wrote:

Nguyen Manh Ha

Apr 29, 2019, 2:42:00 PM
to ClickHouse
I am trying the ClickHouse-Native-JDBC driver and the performance is actually worse than the Yandex driver. Not sure where I went wrong.

Within the same test of inserting 1 million records with a batch size of 100,000:
- Time taken to execute a batch using Native-JDBC: around 5-10 seconds
- Using the Yandex driver: average time taken ~2.9 seconds. With the lz4 decompressor enabled, it could be reduced to ~2 seconds.