marking this all clear: run times have returned to normal after mitigation
postmortem
root cause
unknown
- no code changes occurred before the incident, after it, or as part of the mitigation
- observable host-level metrics (CPU / IO) were not elevated on any of the affected hosts
what went well
- run-level metrics were extremely helpful for identifying the affected timeframe and validating the fix
- host rotation was quick and easy (already scripted)
- helpful issue created by @matthewfeickert alerting us to the problem
what didn't go well
- detection: slow and entirely manual
- prevention: the actual root cause remains unknown
follow-up
- detection: add automated alerting for elevated run times
- prevention: investigate larger EC2 instance sizes, both for raw performance and to lessen "noisy neighbor" effects
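for the detection follow-up, a minimal sketch of what the automated alert could check: compare the mean run time over a recent window against a known baseline and flag sustained elevation. everything here (the function name, the baseline value, the 1.5x threshold) is a hypothetical illustration, not the actual monitoring setup:

```python
# Hypothetical alerting check: flag when the recent mean run time
# exceeds a baseline by a configurable factor. Thresholds and the
# metric source are assumptions for illustration only.
from statistics import mean

def runtimes_elevated(recent_seconds, baseline_mean, factor=1.5, min_samples=5):
    """Return True if the recent window's mean run time exceeds
    factor * baseline_mean; require min_samples to avoid noisy alerts."""
    if len(recent_seconds) < min_samples:
        return False  # not enough data to alert confidently
    return mean(recent_seconds) > factor * baseline_mean

# Example: baseline ~60s runs; a window averaging ~100s should alert.
print(runtimes_elevated([95, 102, 110, 98, 105], baseline_mean=60.0))  # True
print(runtimes_elevated([58, 61, 59, 62, 60], baseline_mean=60.0))     # False
```

a check like this could run on a schedule against the same run-level metrics that were useful during the incident, so the alert fires on the signal we already trust.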
Anthony