Hi,
First, I would like to thank you for this impressive project. I have enjoyed testing ClickHouse immensely.
I would really appreciate assistance in planning a POC we need to do, or any pointers in the right direction.
The basic requirements are:
- Test a mix of a typical time-series workload and a more traditional event-based workload (smaller records at a fixed interval vs. larger records for arbitrary events)
- Ingest a few thousand time-series data points per second and a few hundred events per second
- Make ingested data queryable in as short a time as possible
- Evict old data to slow(er) storage over time while keeping it accessible
- Accommodate multi-tenancy with separation of "similar" data.
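To make the requirements above a bit more concrete, the kind of layout I currently have in mind for the time-series side is roughly the sketch below (all table and column names are hypothetical, the schema is far from settled, and I'd welcome corrections):

    -- Rough sketch only; names are made up.
    CREATE TABLE metrics
    (
        tenant_id   String,
        metric_name String,
        ts          DateTime,
        value       Float64
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(ts)               -- monthly partitions, hoping they help with eviction later
    ORDER BY (tenant_id, metric_name, ts);  -- tenant-first key to keep "similar" data together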
I'm particularly interested in ingestion and partitioning, and in tips/tricks regarding ReplacingMergeTree and materialized views (rather than down-sampling, aggregations, or roll-ups).
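For context, the pattern I've been experimenting with on the event side is roughly the following, again just a sketch with made-up names, assuming the current custom PARTITION BY and TO-table materialized view syntax: a raw events table, a ReplacingMergeTree keyed per event so only the latest version survives, and a materialized view that forwards rows on insert.

    -- Raw events land here first (names are hypothetical).
    CREATE TABLE events_raw
    (
        tenant_id  String,
        event_id   String,
        event_time DateTime,
        payload    String,
        version    UInt64
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(event_time)
    ORDER BY (tenant_id, event_time, event_id);

    -- "Latest version wins" copy, deduplicated during background merges.
    -- Note: replacement only happens within a partition, and only when merges run.
    CREATE TABLE events_latest
    (
        tenant_id  String,
        event_id   String,
        event_time DateTime,
        payload    String,
        version    UInt64
    )
    ENGINE = ReplacingMergeTree(version)
    PARTITION BY toYYYYMM(event_time)
    ORDER BY (tenant_id, event_id);

    -- Materialized view that forwards every insert into the deduplicating table.
    CREATE MATERIALIZED VIEW events_latest_mv TO events_latest
    AS SELECT * FROM events_raw;

My understanding is that reads against events_latest may still need SELECT ... FINAL (or an argMax over the version column) to deduplicate rows that have not been merged yet; is that the recommended pattern, or is there a better one?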
A few questions that have come up while going through the documentation and this group:
- Is there a sweet spot when it comes to cluster/shard size?
- Are there best practices when it comes to replication/sharding?
- Should servers favor drive speed, threads, memory or all of the above?
- Are the Yandex ingestion machines configured differently from the main query/storage servers, and what are the best practices there?
- Is the JDBC (backed by HTTP) approach enough when it comes to reliable ingestion or are message queue approaches popular?
- What is the internal process at Yandex when it comes to moving records from temporary (simple) table engines to their destination tables?
- Do self joins (ARRAY JOINs) carry a significant performance penalty, or any at all? (There is a small example of what I mean after this list.)
- Is there any way to evict long-tail (rarely used) data to slower storage and mount partitions located on secondary storage (like S3)? (My guess at the manual route is sketched after this list.)
- What sort of setup is ideal to ensure concurrent query processing and system integrity (graceful degradation) under load?
- Will there ever be a published road-map (I know it's pretty "internal" now)?
- I appreciate the HDD/SSD-centric processing approach (over GPU), but I wonder what gains a GPU could possibly bring to ClickHouse.
- Does ClickHouse aim to become a fully fledged member of the Hadoop ecosystem, or is that completely out of scope? (We have used Spark for data-science-related processing; I'm aware of the available plugins.)
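To clarify two of the questions above with concrete examples:

For the eviction question, my guess at the manual route today is to detach old partitions and shuffle the files to cheaper storage by hand, roughly like this (assuming monthly partitions as in the sketches above; the filesystem steps are my assumption):

    -- Stop serving an old partition from the hot table.
    ALTER TABLE metrics DETACH PARTITION 201712;
    -- The detached parts now sit under .../metrics/detached/ and could, I assume,
    -- be archived to slower storage at the filesystem level.
    -- Copying them back into detached/ and re-attaching should make them queryable again:
    ALTER TABLE metrics ATTACH PARTITION 201712;

For the ARRAY JOIN question, the kind of query I mean is simply unrolling a nested array per row, e.g. (assuming a hypothetical Array(String) column "tags" on the raw events table):

    SELECT event_id, tag
    FROM events_raw
    ARRAY JOIN tags AS tag
    WHERE tenant_id = 'acme';

Is that the kind of usage where a penalty would show up?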
Please point me to suitable material if it's available, or share any pointers here.
Very best regards,
-Stefan Baxter