Inserting streaming data in ClickHouse


nav...@helpshift.com

Dec 15, 2016, 3:29:13 AM12/15/16
to ClickHouse
Hi,
We are evaluating ClickHouse for our use case of inserting a continuous stream of data.

I have created a MergeTree table named "events", which is working fine with continuous inserts.

Now we want to move to distributed mode, so I created a table named "events_distributed" with the Distributed engine, modeled on the "events" table.
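For reference, a minimal sketch of this setup. The cluster name "my_cluster" and the column definitions are assumptions, not from my actual schema:

```sql
-- Local storage table on each node (assumed columns).
CREATE TABLE events
(
    event_date Date,
    event_time DateTime,
    event_type String
) ENGINE = MergeTree(event_date, (event_time, event_type), 8192);

-- The Distributed table stores no data itself; it routes reads and
-- writes to the "events" table on each shard of "my_cluster".
CREATE TABLE events_distributed AS events
ENGINE = Distributed(my_cluster, default, events, rand());
```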

My questions:
1) How will data be moved to different nodes if I am only inserting into "events" (the MergeTree table)?
2) Do I need to run a separate query to insert into the Distributed engine table? If so, how do I handle duplication: keep removing old data from "events", or make the insert query insert only new data?

Please explain if there is a mistake in my understanding.

Thanks in advance.


thanks,
Navdeep


Ryan Waters

Dec 15, 2016, 10:54:09 AM12/15/16
to ClickHouse
The information you're interested in is here:

https://clickhouse.yandex/reference_en.html#Data%20replication

In summary, just insert your data as normal and ZooKeeper + ClickHouse will take care of it with eventual consistency. You do not need to take any extra steps or treat replicated tables/inserts differently from non-replicated ones.
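Concretely, a replicated table is declared with the ReplicatedMergeTree engine. The ZooKeeper path, replica macro, and columns below are illustrative, not taken from this thread:

```sql
-- Illustrative only: path and macros are assumptions.
CREATE TABLE events
(
    event_date Date,
    event_time DateTime,
    event_type String
) ENGINE = ReplicatedMergeTree(
    '/clickhouse/tables/{shard}/events',  -- ZooKeeper path for this table
    '{replica}',                          -- this server's replica name
    event_date, (event_time, event_type), 8192);
```

Inserts then go into this table exactly as they would into a plain MergeTree table; replication to the other servers happens in the background.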

- - -

"Replication is only supported for tables in the MergeTree family. Replication works at the level of an individual table, not the entire server. A server can store both replicated and non-replicated tables at the same time.
...
INSERT and ALTER are replicated ... Replication is not related to sharding in any way. Replication works independently on each shard.
...
There are no quorum writes. You can't write data with confirmation that it was received by more than one replica.
...
For each INSERT query (more precisely, for each inserted block of data ... <= max_insert_block_size = 1048576 rows) <inserts are replicated immediately, with slight additional latency>
...
Replication is asynchronous and multi-master. ... Data is inserted on this server, then sent to the other servers. Because it is asynchronous, recently inserted data appears on the other replicas with some latency.

...if the INSERT query has less than 1048576 rows <or max_insert_block_size>, it is made atomically."
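So a small insert like the following (column names are assumed for illustration) is written as a single block, applied atomically on the local replica, and then shipped asynchronously to the others:

```sql
-- Fewer rows than max_insert_block_size (1048576 by default),
-- so this INSERT is atomic on the replica that receives it.
INSERT INTO events (event_date, event_time, event_type)
VALUES ('2016-12-15', '2016-12-15 03:29:13', 'page_view');
```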