Hey Patrick,
Thanks for reaching out! The cardinality limits for our M3DB clusters at Uber have mostly been set by the number of time series you can fetch in a single query rather than by the total number of time series you can store. In production we store over 6 billion time series across multiple clusters, each containing up to hundreds of millions of series, so storing 800K series per namespace is not an issue.
The limitation for us has been at query time: for example, if you wanted to retrieve all 800K time series at once and sum them up, the query nodes would, as you can imagine, run out of memory. We are making performance optimizations in our new query engine (https://github.com/m3db/m3/tree/master/src/query) to address this, such as keeping blocks of data compressed in memory until they need to be evaluated and streaming blocks back to the query engine in small pieces at a time. That said, we've found that the better way to tackle use cases like this is in the ingestion pipeline. Inside Uber, we use our aggregation tier (https://github.com/m3db/m3aggregator) to perform the summation (or another aggregation function) on the data as it is ingested, so that we only store and retrieve a single time series for the aggregated query. This is similar to Prometheus' recording rules.
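For comparison, here is what that kind of pre-aggregation looks like written as a Prometheus recording rule, summing away a high-cardinality dimension at write time so only one series per job needs to be stored and queried (the metric and label names are just illustrative):

```yaml
groups:
  - name: rollups
    rules:
      # Store a single pre-summed series per job instead of one per instance.
      - record: job:http_requests_total:sum
        expr: sum without (instance) (http_requests_total)
```

The aggregation tier plays the same role on the ingest path, so the expensive sum never has to happen at query time.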
Our aggregation tier also lets us tweak the retention of metrics. A common use case is to store the high-cardinality metrics for a shorter period of time, and then drop some of the labels when storing them at a much longer retention.
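As a rough sketch of what dropping labels on ingest does to cardinality (this is illustrative Python, not M3's actual API; the label and metric names are made up):

```python
from collections import defaultdict

def drop_labels(series, drop):
    """Sum together time series whose label sets collide once `drop` labels are removed."""
    out = defaultdict(float)
    for labels, value in series:
        # Key by the remaining labels only; colliding series are summed.
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        out[kept] += value
    return dict(out)

series = [
    ({"name": "requests", "service": "api", "host": "h1"}, 100.0),
    ({"name": "requests", "service": "api", "host": "h2"}, 150.0),
]

# Dropping the high-cardinality "host" label leaves one series per service,
# which is what gets written at the longer retention.
rolled_up = drop_labels(series, drop={"host"})
```

Two per-host series collapse into a single per-service series, so the long-retention namespace only ever sees the lower-cardinality data.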
I hope that answers your question. Let us know if you need anything else.
Cheers,
Martin