System Outage 6.29 - 9.14

Ahmy Yulrizka

Sep 13, 2014, 4:17:16 AM

to commonsense...@googlegroups.com, All Capone

This morning we had a problem with our API server.

Timeline of the outage
At 6.29 we received a notification from our monitoring system that a small portion of API calls were failing.
After looking further into the problem, we identified that it was caused by a failure in the storage system that is responsible for recording usage and statistics.

At 8.15 we implemented a solution and restarted the storage server.
However, this had a side effect on the main storage. At this point we saw that no users could authenticate,
although some of the API calls to store sensor data were still working.

At 9.14 we identified the problem and restarted the storage server again.
Our monitoring system reported that the problem was solved, and the systems are operational once again.

Efforts to mitigate this issue in the future
We are currently in the process of migrating the usage and statistics into a system that is separate from the API.
This means that in the future, an outage in the usage and statistics system should have less impact on the API.
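
A minimal sketch of this kind of isolation, in Python. It is only an illustration and not the actual implementation: the handler, the store classes and every name below are hypothetical. The idea is that a failure while recording usage is logged and reported, but never causes the API call itself to fail.

```python
import logging

logger = logging.getLogger("api")


class UsageStoreDown(Exception):
    """Raised by the hypothetical usage store when it is unavailable."""


class FailingUsageStore:
    """Stand-in for a usage/statistics store that is currently down."""

    def record(self, user_id, endpoint):
        raise UsageStoreDown("usage storage unavailable")


def record_usage_safely(store, user_id, endpoint):
    """Try to record usage; return False instead of raising on failure."""
    try:
        store.record(user_id, endpoint)
        return True
    except Exception:
        # Assumption: losing a usage record is acceptable, failing the call is not.
        logger.warning("usage recording failed for %s %s", user_id, endpoint)
        return False


def handle_store_sensor_data(main_storage, usage_store, user_id, payload):
    """Hypothetical API handler: the main write does not depend on usage recording."""
    main_storage.append((user_id, payload))
    usage_recorded = record_usage_safely(usage_store, user_id, "store_sensor_data")
    return {"status": "ok", "usage_recorded": usage_recorded}
```

With the usage store down, the sensor data is still written to the main storage (a plain list here) and the call still succeeds; only the usage record is lost.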

Our sincerest apologies for this issue, and thank you for your patience.
