Hi Glauber, et al,
I don't post here often, but in this case I believe this could help you understand how ColumnStore (CS) works, and why it is so blisteringly fast with high-volume data.
We append-load/insert into, and query, a single table with 96 columns and 4.87 billion rows, growing every 15 minutes. We have multiple InnoDB tables receiving the streaming data from IoT devices, and every 15 minutes we move the previous 15 minutes' data into the CS table. We have over 100 analytics users on our database, and they run over 300,000 queries a week. Some queries are quite predictable: repeated queries with varying filters and conditions in the WHERE clause. Others are unpredictable and exploratory in nature. With most database engines you would partition and index the tables for every possible query, not always successfully. With CS you don't need to do any of that.
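A minimal sketch of that 15-minute move, assuming hypothetical names (iot_staging as the InnoDB landing table, iot_history as the CS table, recorded_at as the event timestamp):

```sql
-- Capture the boundary once so the INSERT and DELETE see the same window.
SET @cutoff := NOW();

-- Move everything older than the boundary into the ColumnStore table...
INSERT INTO iot_history
SELECT * FROM iot_staging
WHERE recorded_at < @cutoff;

-- ...then purge those rows from the InnoDB staging table.
DELETE FROM iot_staging
WHERE recorded_at < @cutoff;
```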
Understanding the internal workings of the CS storage engine and query optimizer is important. How and where you deploy the hosts and their dbroots is critical. If you have a clustered setup, make sure the dbroots are not on the same storage subsystem as the other storage engines. CS is not especially I/O-heavy, but it is CPU-heavy, so more CPUs are better. Fast disks (SSDs) do help, but they will not be the main performance factor.
Please read the CS architecture documentation (https://mariadb.com/docs/server/architecture/components/enterprise-columnstore/) about segments, extents, extent elimination, extent maps, and partitions. You have control over some of the parameters, but they need to be configured at initialization; they are not something to change later on. To be honest, we kept our installation pretty much vanilla. Then again, we have top-notch hardware (Nutanix on-premises) to run this on, and we learnt a lot through trial and error.
Data types are also an important factor. Making every column VARCHAR/CHAR will only degrade performance, and any calculation or aggregation on FLOAT columns is an absolute no-no. Get the actual data types correct from the start by profiling the columns before you decide on the final table structure.
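To illustrate (column names and sizes are hypothetical; derive your own from profiling), prefer exact, native types over catch-all strings and binary floats:

```sql
-- Hypothetical column definitions: exact, native types instead of
-- catch-all strings and approximate floats.
CREATE TABLE iot_reading (
    device_id   INT UNSIGNED,    -- not VARCHAR(20)
    recorded_at DATETIME,        -- not CHAR(19)
    amount      DECIMAL(12,4)    -- not FLOAT: SUM()/AVG() on binary
                                 -- floats accumulates rounding error
) ENGINE=ColumnStore;
```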
If you can, insert/load the data into CS pre-ordered/pre-sorted; it helps speed up the bulk loads.
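Pre-sorting also plays into extent elimination: if the data arrives ordered on the column most queries filter by, each extent covers a narrow min/max range and whole extents can be skipped. A sketch, assuming hypothetical tables iot_staging and iot_history and a timestamp column recorded_at:

```sql
-- Hypothetical example: load in timestamp order so each ColumnStore
-- extent holds a narrow recorded_at range. The extent map's min/max
-- values then let queries filtering on recorded_at skip whole extents.
INSERT INTO iot_history
SELECT * FROM iot_staging
ORDER BY recorded_at;
```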
And finally, user education is vital. Users need to think 'columnar' with every query they write, meaning no more SELECT * FROM... We find this type of query quite often:
SELECT A, B, C, D, SUM(E)
FROM (SELECT * FROM TABLE_A) t  -- table with 50 columns
GROUP BY A, B, C, D;
This type of query is no good for CS. I am sure you can see why. For MyISAM, Aria, or InnoDB it is fine, because those engines were designed to retrieve entire rows of data, hence SELECT *. The best approach is to name only the exact columns you need in the projection list.
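For example, the query above rewritten columnar-style, so CS only has to touch the five column files it actually needs instead of all 50:

```sql
SELECT A, B, C, D, SUM(E)
FROM TABLE_A          -- reads only columns A, B, C, D and E
GROUP BY A, B, C, D;
```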
Push it to your imaginable limits and then some more. Have fun!
F