Planning a substantial ClickHouse POC


ste...@activitystream.com

Jun 4, 2017, 8:45:23 PM
to ClickHouse
Hi,

First, I would like to thank you for this impressive project. I have enjoyed testing ClickHouse immensely.

I would really appreciate assistance in planning a POC we need to do, or any pointers in the right direction.

The basic requirements are:
  • Test a mix of a typical time-series workload and a more traditional event-based workload (smaller records at fixed intervals vs. larger records for arbitrary events)
  • Ingest a few thousand time-series data points per second and a few hundred events per second
  • Make ingested data queryable in as short a time as possible
  • Evict old data to slow(er) storage over time while keeping it accessible
  • Accommodate multi-tenancy with separation of "similar" data.
I'm particularly interested in ingestion and partitioning, and in tips/tricks regarding ReplacingMergeTree and materialized views (rather than down-sampling, aggregations or roll-ups); a rough schema sketch of what I have in mind follows below.
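
To make the ReplacingMergeTree / materialized-view part concrete, this is roughly the kind of schema we have been experimenting with (a sketch only; all table and column names are placeholders, and the version column is just whatever monotonic value we attach to a record):

-- Raw events, deduplicated on merges by the version column;
-- the old-style MergeTree syntax partitions by month on event_date.
CREATE TABLE events (
    event_date  Date,
    tenant      String,
    entity_id   String,
    occurred_at DateTime,
    version     UInt64,
    payload     String
) ENGINE = ReplacingMergeTree(event_date, (tenant, entity_id, occurred_at), 8192, version);

-- Materialized view keeping a second copy ordered for per-entity lookups,
-- populated automatically on every INSERT into events.
CREATE MATERIALIZED VIEW events_by_entity
ENGINE = ReplacingMergeTree(event_date, (entity_id, occurred_at), 8192, version)
AS SELECT event_date, tenant, entity_id, occurred_at, version, payload
FROM events;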

A few questions that have come up while going through the documentation and this group:
  • Is there a sweet spot when it comes to cluster/shard size?
  • Are there best practices when it comes to replication/sharding?
  • Should servers favor drive speed, threads, memory or all of the above?
  • Are the Yandex ingestion machines configured differently from main query/storage servers and what are the best practices regarding that?
  • Is the JDBC (backed by HTTP) approach enough when it comes to reliable ingestion or are message queue approaches popular? 
  • What is the internal process at Yandex when it comes to moving records from temporary (simple) table engines to their destination tables? (A sketch of the pattern I mean follows this list.)
  • Do self-joins (array joins) carry much of a performance penalty, or any at all?
  • Is there any way to evict long-tail (rarely used) data to slower storage and mount partitions located on secondary storage (like S3)?
  • What sort of setup is ideal to ensure concurrent query processing and system integrity (graceful degradation) under load?
  • Will there ever be a published road-map (I know it's pretty "internal" now)?
  • I appreciate the HDD/SSD-centric processing approach (over GPU), but I wonder what gains a GPU could possibly bring to ClickHouse.
  • Does ClickHouse aim to become a fully-fledged member of the Hadoop ecosystem, or is that completely out of scope? (We have used Spark for data-science-related processing; I'm aware of the available plugins.)
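
To make the staging-table question concrete, these are roughly the two patterns I have in mind (a sketch only; the names and the Buffer thresholds are made up):

-- Option A: a simple log-engine staging table that is flushed periodically.
CREATE TABLE events_staging AS events ENGINE = Log;

INSERT INTO events SELECT * FROM events_staging;
-- ...then drop and recreate (or otherwise clear) the staging table.

-- Option B: a Buffer table in front of the destination that flushes
-- on time/row/byte thresholds.
CREATE TABLE events_buffer AS events
ENGINE = Buffer(default, events, 16, 10, 100, 10000, 1000000, 10000000, 100000000);

I'd love to know which of these (or something else entirely) is used in practice.
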
Please point me to suitable material if it's available, or throw any kind of pointers our way.

Very best regards,
  -Stefan Baxter

Raji Sridar

Oct 14, 2020, 2:07:11 PM
to ClickHouse
Hi Stefan,

Very interesting use case and questions. What were your findings? Could you please share them if that's OK? I think it would benefit the wider ClickHouse community.

Best Regards
Raji
