Loading data to ClickHouse in realtime


Burak Emre Kabakcı

Jun 15, 2016, 1:01:33 PM6/15/16
to ClickHouse
In the documentation, TinyLog tables are described as follows: "In Yandex.Metrica, TinyLog tables are used for intermediary data that is processed in small batches." From my understanding, INSERTs into TinyLog have minimal overhead, and you use TinyLog tables as a commit log for events, similar to Apache Kafka, which also uses ZooKeeper and has similar semantics (although Kafka is row-oriented, while TinyLog divides each row into columns).

How do you move data from a TinyLog table to a MergeTree table in real time (processing micro-batches)? Since ClickHouse doesn't support DELETE yet, I couldn't find a way to do it.

man...@gmail.com

Jun 15, 2016, 4:18:35 PM6/15/16
to ClickHouse
Yes, we store intermediate data in TinyLog* tables and then INSERT it into ReplicatedMergeTree tables.

* More exactly, we are using StripeLog tables. They are not documented right now. They differ from TinyLog in one way: all columns are stored in a single file in "striped" form, so there are fewer files when working with many small tables.

There is a program that keeps a queue of tables to process. It simply SELECTs data from a TinyLog table, does the necessary preprocessing, and INSERTs the result into another table.
ReplicatedMergeTree supports idempotence: if a server fails, the program must retry the INSERT of the same data, possibly on another replica, and the data will be inserted only once.
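The retry loop this relies on can be sketched as follows. This is a minimal sketch, not the actual Metrica program; `send(replica, batch)` is a hypothetical transport callback (e.g. wrapping a ClickHouse client call) that raises on failure. The key point is that the *same* batch is re-sent unchanged, because ReplicatedMergeTree deduplicates identical insert blocks:

```python
import time

def insert_with_retry(batch, replicas, send, max_attempts=5, delay=0.0):
    """Retry the *same* batch until one replica accepts it.

    ReplicatedMergeTree deduplicates identical insert blocks, so
    re-sending the exact same batch after a failure is safe: the data
    ends up in the table exactly once even if an earlier attempt
    partially succeeded. `send` is a hypothetical callback that
    performs the INSERT and raises on failure.
    """
    last_error = None
    for attempt in range(max_attempts):
        # Rotate through replicas so a failed node is skipped on retry.
        replica = replicas[attempt % len(replicas)]
        try:
            send(replica, batch)
            return replica  # the replica that accepted the batch
        except Exception as exc:
            last_error = exc
            time.sleep(delay)
    raise RuntimeError(f"insert failed after {max_attempts} attempts") from last_error
```

Note that the batch must be byte-identical on every attempt; re-forming it (e.g. with a new timestamp column) would defeat the deduplication.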

ClickHouse doesn't support DELETEs directly. Instead, there are (very specific) CollapsingMergeTree tables.

Burak Emre Kabakcı

Jun 15, 2016, 5:30:56 PM6/15/16
to ClickHouse
What I want to do is INSERT rows into ClickHouse every second, directly from the API server, and eventually move the data to a MergeTree table. Should I create a new table for each second and externally perform INSERT INTO fact_table SELECT * FROM tinylogtable, then drop tinylogtable? Creating many temporary tables (one per second per server) doesn't seem like the best idea; creating tables usually has overhead (I don't actually know whether this is an issue for ClickHouse) because the database must create new files and save metadata for each new table.
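The per-second table cycle described above can be sketched as plain SQL generation. This is a sketch under assumptions: the table names (`events_chunk_*`, `fact_table`) are placeholders, and DROP TABLE stands in for the DELETE that ClickHouse lacks:

```python
import time

def chunk_table_name(prefix="events_chunk", ts=None):
    # One temporary table per second, named after the Unix timestamp.
    # `prefix` is a hypothetical naming convention, not from the thread.
    ts = int(time.time()) if ts is None else int(ts)
    return f"{prefix}_{ts}"

def flush_cycle_sql(chunk_table, fact_table):
    # SQL for one flush cycle: copy the chunk into the fact table,
    # then drop the chunk table in place of a DELETE.
    return [
        f"INSERT INTO {fact_table} SELECT * FROM {chunk_table}",
        f"DROP TABLE {chunk_table}",
    ]
```

An external process would run these two statements for each finished chunk, in order, before moving to the next table in its queue.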

What would be the best approach for this use case? Should we have a queue in front of ClickHouse and periodically flush it to TinyLog tables in ClickHouse? Or is it possible to INSERT data into a TinyLog table and periodically flush it to a MergeTree table? DELETE isn't implemented yet and TinyLog tables don't flush data, so I think I'm lost here.

man...@gmail.com

Jun 16, 2016, 2:36:43 PM6/16/16
to ClickHouse
You should form one-second batches in some way and INSERT them into MergeTree.

The simplest way is to form batches in your application's memory and not store them anywhere temporarily.
This is not reliable: if an INSERT fails, your application's memory will grow, and if your application crashes, the buffered data will be lost.
To make it more reliable, you need a queue.
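The in-memory variant can be sketched like this. It is a minimal sketch, not a production implementation: `flush_fn` is a hypothetical callback that performs the actual INSERT, and the class deliberately keeps rows buffered when an INSERT fails, illustrating exactly the trade-off described above (memory grows on failure, data is lost on a crash):

```python
import time

class Batcher:
    """Accumulate rows in memory and flush them as one INSERT batch
    once per `interval` seconds, or earlier when `max_rows` is reached.

    If `flush_fn` raises, the rows stay buffered and are retried on
    the next add() -- so memory grows while ClickHouse is unreachable,
    and everything buffered is lost if the process dies.
    """
    def __init__(self, flush_fn, max_rows=100_000, interval=1.0):
        self.flush_fn = flush_fn
        self.max_rows = max_rows
        self.interval = interval
        self.rows = []
        self.last_flush = time.monotonic()

    def add(self, row):
        self.rows.append(row)
        if (len(self.rows) >= self.max_rows
                or time.monotonic() - self.last_flush >= self.interval):
            self.flush()

    def flush(self):
        if self.rows:
            self.flush_fn(self.rows)  # may raise; rows kept on failure
            self.rows = []
        self.last_flush = time.monotonic()
```

Putting a durable queue (Kafka, or chunk tables in ClickHouse itself) between the application and this flush step is what removes the data-loss window.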

It is not necessary to use ClickHouse for the queue. Kafka is also OK.

One TinyLog or (better) StripeLog temporary table per second is OK.
~500,000 StripeLog tables per server works in production, but server startup will then take tens of minutes.
And please use separate servers for the 'final' data and for that huge number of chunks.

In fact, in Yandex.Metrika, we use a separate cluster for intermediate data, and the application stores each chunk in a StripeLog table on three different nodes.
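Choosing which three nodes hold a chunk can be done deterministically from the chunk id, so a reader can find all copies later. This is only an illustrative sketch; the thread does not describe Metrika's actual placement scheme:

```python
import hashlib

def pick_nodes(chunk_id, cluster, copies=3):
    """Pick `copies` distinct nodes of the intermediate cluster to hold
    a chunk, deterministically from the chunk id.

    Hypothetical placement scheme for illustration: hash the chunk id
    to a starting node, then take the next `copies` nodes in ring order.
    """
    digest = int(hashlib.md5(chunk_id.encode()).hexdigest(), 16)
    start = digest % len(cluster)
    return [cluster[(start + i) % len(cluster)] for i in range(copies)]
```

Because the mapping is deterministic, the writer and the process that later moves chunks into ReplicatedMergeTree agree on the chunk's locations without any extra coordination.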