Hi,
First, I would like to thank you for this impressive project. I have enjoyed testing ClickHouse immensely.
I would really appreciate assistance in planning a POC we need to do, or any pointers in the right direction.
The basic requirements are:
- Test a mix of a typical time-series workload and a more traditional event-based workload (smaller records at a fixed interval vs. larger records for arbitrary events)
- Ingest a few thousand time-series data points per second and a few hundred events per second
- Make ingested data queryable in as short a time as possible
- Evict old data to slow(er) storage over time while keeping it accessible
- Accommodate multi-tenancy with separation of "similar" data.
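To make the requirements above a bit more concrete, the kind of layout I currently have in mind for the time-series side is roughly the sketch below (all table and column names are hypothetical, the schema is far from settled, and I'd welcome corrections):

    -- Rough sketch only; names are made up.
    CREATE TABLE metrics
    (
        tenant_id   String,
        metric_name String,
        ts          DateTime,
        value       Float64
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(ts)               -- monthly partitions, hoping they help with eviction later
    ORDER BY (tenant_id, metric_name, ts);  -- tenant-first key to keep "similar" data together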
I'm particularly interested in ingestion and partitioning, and in tips/tricks regarding ReplacingMergeTree and materialized views (rather than down-sampling, aggregations, or roll-ups).
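For context, the pattern I've been experimenting with on the event side is roughly the following, again just a sketch with made-up names, assuming the current custom PARTITION BY and TO-table materialized view syntax: a raw events table, a ReplacingMergeTree keyed per event so only the latest version survives, and a materialized view that forwards rows on insert.

    -- Raw events land here first (names are hypothetical).
    CREATE TABLE events_raw
    (
        tenant_id  String,
        event_id   String,
        event_time DateTime,
        payload    String,
        version    UInt64
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(event_time)
    ORDER BY (tenant_id, event_time, event_id);

    -- "Latest version wins" copy, deduplicated during background merges.
    -- Note: replacement only happens within a partition, and only when merges run.
    CREATE TABLE events_latest
    (
        tenant_id  String,
        event_id   String,
        event_time DateTime,
        payload    String,
        version    UInt64
    )
    ENGINE = ReplacingMergeTree(version)
    PARTITION BY toYYYYMM(event_time)
    ORDER BY (tenant_id, event_id);

    -- Materialized view that forwards every insert into the deduplicating table.
    CREATE MATERIALIZED VIEW events_latest_mv TO events_latest
    AS SELECT * FROM events_raw;

My understanding is that reads against events_latest may still need SELECT ... FINAL (or an argMax over the version column) to deduplicate rows that have not been merged yet; is that the recommended pattern, or is there a better one?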
A few questions that have come up while going through the documentation and this group:
- Is there a sweet spot when it comes to cluster/shard size?
- Are there best practices when it comes to replication/sharding?
- Should servers favor drive speed, threads, memory or all of the above?
- Are the Yandex ingestion machines configured differently from the main query/storage servers, and what are the best practices there?
- Is the JDBC (backed by HTTP) approach enough when it comes to reliable ingestion or are message queue approaches popular?
- What is the internal process at Yandex when it comes to moving records from temporary (simple) table engines to their destination tables?
- Do self joins (ARRAY JOINs) carry a significant performance penalty, or any at all? (There is a small example of what I mean after this list.)
- Is there any way to evict long-tail (rarely used) data to slower storage and mount partitions located on secondary storage (like S3)? (My guess at the manual route is sketched after this list.)
- What sort of setup is ideal to ensure concurrent query processing and system integrity (graceful degradation) under load?
- Will there ever be a published road-map (I know it's pretty "internal" now)?
- I appreciate the HDD/SSD-centric processing approach (over GPU), but I wonder what gains a GPU could possibly bring to ClickHouse.
- Does ClickHouse aim to become a fully fledged member of the Hadoop ecosystem, or is that completely out of scope? (We have used Spark for data-science-related processing; I'm aware of the available plugins.)
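To clarify two of the questions above with concrete examples:

For the eviction question, my guess at the manual route today is to detach old partitions and shuffle the files to cheaper storage by hand, roughly like this (assuming monthly partitions as in the sketches above; the filesystem steps are my assumption):

    -- Stop serving an old partition from the hot table.
    ALTER TABLE metrics DETACH PARTITION 201712;
    -- The detached parts now sit under .../metrics/detached/ and could, I assume,
    -- be archived to slower storage at the filesystem level.
    -- Copying them back into detached/ and re-attaching should make them queryable again:
    ALTER TABLE metrics ATTACH PARTITION 201712;

For the ARRAY JOIN question, the kind of query I mean is simply unrolling a nested array per row, e.g. (assuming a hypothetical Array(String) column "tags" on the raw events table):

    SELECT event_id, tag
    FROM events_raw
    ARRAY JOIN tags AS tag
    WHERE tenant_id = 'acme';

Is that the kind of usage where a penalty would show up?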
Please point me to suitable material if it's available, or share any pointers here.
Very best regards,
-Stefan Baxter