I wanted to give an update on the current status of these logs. This morning we re-enabled the logs to accept new entries. Our investigation into the outage found a 3-4x increase in calls to the get-entries endpoint, which caused the ctile pods to crash due to memory constraints. This indicates that our rate limits were set too high for the current infrastructure. We have since increased the memory of our ctile pods and lowered the get-entries rate limit to 10 requests per second per IP (with a maximum of 256 entries per request). We will continue to work on expanding infrastructure resources and will then re-evaluate whether to raise the rate limit again.
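For anyone adjusting a log client to these limits, here is a minimal sketch of how a range of entries could be split into requests of at most 256 entries and paced to at most 10 requests per second. It is illustrative only and is not our tooling: the function names and the caller-supplied `fetch` callable are hypothetical, and the only protocol detail assumed is that get-entries takes inclusive `start` and `end` indices, as in RFC 6962.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")

MAX_ENTRIES_PER_REQUEST = 256  # per-request cap stated above
MAX_REQUESTS_PER_SECOND = 10   # per-IP rate limit stated above


def plan_get_entries_ranges(
    start: int, end: int, batch: int = MAX_ENTRIES_PER_REQUEST
) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) pairs, each covering at most `batch` entries."""
    if batch < 1:
        raise ValueError("batch must be at least 1")
    while start <= end:
        last = min(start + batch - 1, end)
        yield start, last
        start = last + 1


def fetch_throttled(
    ranges: Iterable[Tuple[int, int]],
    fetch: Callable[[int, int], T],  # hypothetical: performs one get-entries call
    rate: float = MAX_REQUESTS_PER_SECOND,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Call `fetch` once per range, spacing calls at least 1/rate seconds apart."""
    min_interval = 1.0 / rate
    next_allowed = clock()
    for first, last in ranges:
        delay = next_allowed - clock()
        if delay > 0:
            sleep(delay)
        next_allowed = clock() + min_interval
        yield fetch(first, last)
```

A client would pass its own HTTP call as `fetch`. Several clients behind one IP share the same limit, so leaving some headroom below 10 requests per second is probably wise.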
The other item we are addressing is why our monitoring alerts were not triggered. We run woodpecker internally for end-to-end testing, and we also have monitoring on the log services themselves. Both feed alerts into the same alerting system, which failed to send the notifications.
Thanks,
Rick