Regarding Loki Ingester - Replication factor + WAL + flush

Wiard van Rij

Mar 10, 2024, 11:54:01 AM
to lokiproject
Dear Loki folks,
I'm reaching out to discuss some challenges I've encountered while configuring the ingester component of Loki. After spending some time with the configuration settings, I find myself lacking a comprehensive understanding of the reasoning behind certain choices. Specifically, I'd like to address a few points related to the combination of Write-Ahead Log (WAL), replication factor, and flushing behavior.
Firstly, regarding the use of a WAL in a distributed setup like Loki's, where data streams are spread across multiple ingesters, I'm running into some conceptual difficulties. The purpose of a WAL in a non-distributed environment is clear: to ensure consistent data streams. In a distributed setup I find it less intuitive. When an ingester goes down and then restarts, it won't necessarily receive the same data streams it processed before the downtime. Replaying the WAL then effectively duplicates chunks that may not be flushed any time soon, because the corresponding data streams no longer arrive at that ingester. This seems to introduce unnecessary overhead and complicates the management of ingester resources.
Example memory usage: 
memory-example.png
Secondly, I've observed some unexpected behavior related to the `flush_on_shutdown` setting when the WAL is enabled. According to the documentation, with the WAL enabled, data should not be flushed on shutdown. However, I've noticed that upon exit the ingester still flushes, and it still replays the WAL on startup, which seems counterintuitive. I would expect data flushed during shutdown to be cleared from the WAL, eliminating the need to replay it on startup, but this doesn't seem to be the case in practice.
Ultimately, my goal is to achieve a balance between redundancy and efficient ingester operation. However, the current combination of replication factor, WAL, and flushing behavior appears to introduce significant overhead and complexity. I'm eager to better understand the rationale behind these configuration choices and explore potential solutions or optimizations to improve the efficiency of ingester operation while maintaining redundancy.
Perhaps to summarise my thoughts/idea:
  • Flush on Shutdown Behavior: Ideally, the `flush_on_shutdown` setting should ensure that all data is flushed before shutdown, eliminating the need to replay the WAL upon startup. This would ensure that ingesters only process fresh data upon restart, avoiding the retention of old chunks in memory and reducing unnecessary overhead.
    • Then the WAL still exists in case the ingester did not exit cleanly. 
  • Consideration of WAL and Replication Factor Combination: Currently, the combination of WAL and the replication factor appears to introduce significant overhead. It might be beneficial to reconsider this combination, as having both WAL and a high replication factor could result in unnecessary compute costs. For instance, if a high replication factor provides sufficient redundancy, it might be reasonable to forego the WAL or reduce its usage to optimize resource utilization. I would welcome the project's thoughts on this.
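For concreteness, the settings discussed above can be sketched as the Loki configuration fragment below. This is only an illustration: the values and the `/loki/wal` path are assumptions rather than a tested or recommended setup, and the key names follow the Loki configuration reference as I understand it.

```yaml
# Illustrative sketch only; all values are assumptions, not a tested setup.
common:
  replication_factor: 3          # each stream is written to 3 ingesters

ingester:
  wal:
    enabled: true                # keep a write-ahead log on local disk
    dir: /loki/wal               # hypothetical path
    flush_on_shutdown: true      # ask the ingester to flush chunks on shutdown
    replay_memory_ceiling: 4GB   # cap on memory used during WAL replay (assumed value)
```

The question in the second bullet is essentially whether `wal.enabled` still pays for itself once `replication_factor` is this high.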
I would greatly appreciate any insights or guidance you could provide on these matters. 
P.S. Perhaps this issue I've created is partially relevant: https://github.com/grafana/loki/issues/11900
In the end, perhaps I'm just thinking differently, but if you want, I have no problems joining some project session to explain my use-case better or to just have a chat about my thoughts. Let me know! 🙂 
Thank you for your time and assistance. I look forward to your response.
Best regards,
Wiard van Rij