My apologies for the inconvenience. The most recent three "server" issues were caused by the continuous retraining of our own research algorithm, Fugu, a learning-based ABR algorithm that retrains five neural network models every day. For some reason (probably memory overutilization?), it occasionally failed to save one of the five models, which prevented the retraining on the next day and brought down the entire system. I don't currently have a good way to reliably reproduce the symptom, so unfortunately it might happen again in the future.
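The mitigation I have in mind is to verify each checkpoint after writing it and retry on failure, so a single bad save can't silently break the next day's retraining. Here is a rough sketch only; the function name, paths, and retry count are made up for illustration, and it assumes the models are saved as PyTorch checkpoints:

    import os
    import tempfile
    import torch

    def save_model_safely(model, path, retries=3):
        """Save a checkpoint, verify it loads back, then rename it into place.

        If a save fails (e.g., under memory pressure), retry a few times
        instead of leaving tomorrow's retraining with a missing model.
        """
        for attempt in range(retries):
            try:
                # Write to a temporary file in the same directory first.
                fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
                os.close(fd)
                torch.save(model.state_dict(), tmp_path)

                # Verify the checkpoint is readable before replacing the old one.
                torch.load(tmp_path, map_location="cpu")

                os.replace(tmp_path, path)  # atomic rename on POSIX
                return True
            except Exception as e:
                print(f"save attempt {attempt + 1} failed: {e}")
        return False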
In general, Puffer runs many research algorithms behind the scenes without thoroughly profiling them; this greatly reduces the turnaround time compared with industry best practice, at the cost of reliability and availability.
By the way, prior to this, the downtime issues were often caused by something else: the file receiver process that we run on the Puffer server to receive encoded media chunks from other servers could sometimes become CPU-bound and fail to keep up with the three senders. I fixed that issue by introducing two additional file receiver processes, which seems to be working a lot better.
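In other words, the fix was to shard the senders across receiver processes instead of funneling all three through one. The real file receiver is not a Python program, but the idea looks roughly like this (the ports and the one-process-per-sender mapping are made up for illustration):

    import socket
    from multiprocessing import Process

    def run_receiver(port):
        """One receiver process per port; each sender connects to its own port,
        so no single CPU-bound process has to keep up with all three senders."""
        with socket.create_server(("0.0.0.0", port)) as server:
            while True:
                conn, _ = server.accept()
                with conn:
                    # Read incoming media chunks (storage details omitted).
                    while conn.recv(65536):
                        pass

    if __name__ == "__main__":
        # Hypothetical ports: one receiver process per sender.
        procs = [Process(target=run_receiver, args=(p,)) for p in (9000, 9001, 9002)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()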
Best,
Francis