This is a bit hard to replicate, so bear with me (thought I'd drop by here before filing a GitHub issue with a repro).
I am also making some assumptions, which I will highlight in green, since I am not very familiar with the Envoy model/code around ALS:
If the AccessLogService goes down while Envoy is serving traffic (I saw a bunch of connection errors to the ALS cluster), all of the log messages queue up and eventually the main thread becomes resource exhausted, meaning control plane updates get dropped. Worker threads still serve traffic with no problem.
This results in stale configuration, which may lead to connection errors when clusters/routes are updated.
Just wondering if this is by design? It seems a bit weird to me to have logs compromise production traffic, so this might be unintended.
Is there a knob to tweak this behavior? (i.e. if an ALS message fails, drop it: don't keep it in memory and don't retry.)
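To make the ask concrete, here is a toy sketch of the semantics I have in mind. This is not Envoy code and says nothing about how Envoy actually buffers ALS messages; `DroppingLogBuffer`, `max_entries`, and `send` are hypothetical names made up for illustration. The idea is a bounded buffer where entries are dropped when the buffer is full, and a failed send is counted and discarded instead of being retried:

```python
from collections import deque


class DroppingLogBuffer:
    """Toy model (hypothetical, not Envoy's implementation):
    bounded buffer, drop on overflow, no retry on send failure."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.pending = deque()
        self.dropped = 0

    def submit(self, entry):
        # Refuse new entries once the cap is reached, so memory stays bounded.
        if len(self.pending) >= self.max_entries:
            self.dropped += 1
            return False
        self.pending.append(entry)
        return True

    def flush(self, send):
        # Attempt each pending entry exactly once; a failed entry is
        # counted as dropped and never put back in the buffer.
        sent = 0
        while self.pending:
            entry = self.pending.popleft()
            try:
                send(entry)
                sent += 1
            except ConnectionError:
                self.dropped += 1
        return sent
```

With something like this, an ALS outage would cost log entries but memory use would stay capped at `max_entries`, which is the trade-off I would prefer here.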
Or is this a potential bug/issue that was previously unreported?