Hey everyone,
If 2020 weren't already destined to be famous for other, far more horrible things than Diplomacy services, then in my mind it would be famous for my botched deployment today.
For posterity, since most of this was only covered in various bcc'd emails and Discord chats, I'll give a not-so-short post-mortem:
- At around 11 am today I wanted to roll out an updated map with a fixed province abbreviation.
- I forgot to check which branch of the server I had checked out, and didn't notice that it was the branch where we were working on our very experimental new feature where players have to declare themselves ready to play before the game starts, or it goes back to staging (without the non-ready players).
- This first caused the server to start throwing errors, because the new code uses task queues that are not yet enabled on the server. This was 100% due to my previous mistake.
- Then it caused all current games (which weren't created with the new feature) to be considered in this "everyone has to declare ready" state. That meant that when they resolved, if not everyone in them had committed their orders they would reset back to staging. This was 100% a bug in the new feature branch that we never thought of :O
- Then people started noticing and sending me emails and IMs. I tried to understand what was happening by reading the code and the emails, and finally decided to just pause the main "game resolution" flow in the server. Unfortunately I was panicky enough by then to pause the wrong frigging queue.
- Finally I understood what was happening, and managed to run a job that marked all the old games as no longer being in the "declare ready" state. This solved most of the problem.
- Then I found 7 games that had been reset to staging, among them one of my own test games that I don't care about. For the remaining 6 games I started gathering the email addresses of everyone I could track down in the log files. (Unfortunately the "reset to staging" process itself was not buggy, so no trace of the removed players remained in the games - only the players who had committed orders when the game reset.)
- I made sure those 6 games were again "started for real" and not staging, so they didn't get new prospective players to confuse things. I'm not 100% sure I did this in time. Time will tell... I also created empty ghost players for the missing ones, as placeholders until I got the real players back in.
- I sent an email (thanks for authoring it, Joren) to all the players I had tracked down, asking them which nation they played. This email was also a bit messy - in some emails I managed to replace the URL text but not the address it pointed to. Not fun to do this while stressing out.
- I built a new process that could accept nations and email addresses, and reinstate them into named games where there were empty placeholders.
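
To illustrate that last step: the sketch below is a rough Python approximation of the idea, not the actual server code. The data layout and all the names in it (`reinstate`, `games`, `nation`, `user`) are made up for the example; it only assumes that each restored game has one member slot per nation and that a ghost placeholder is a slot with no user attached.

```python
def reinstate(games, game_name, nation, email):
    """Put a returning player into the empty placeholder for `nation`.

    Hypothetical layout: `games` maps a game name to a list of member
    dicts, each with a "nation" and a "user" (None for a ghost placeholder).
    """
    members = games.get(game_name)
    if members is None:
        raise KeyError(f"no game named {game_name!r}")
    for member in members:
        if member["nation"] == nation:
            # Refuse to overwrite a real player; only ghosts may be replaced.
            if member["user"] is not None:
                raise ValueError(
                    f"{nation} in {game_name!r} is already taken by {member['user']}"
                )
            member["user"] = email
            return member
    raise ValueError(f"{game_name!r} has no nation {nation!r}")


# Made-up example data: one real player and one ghost placeholder.
games = {
    "Example game": [
        {"nation": "England", "user": "alice@example.com"},
        {"nation": "France", "user": None},  # ghost placeholder
    ],
}
reinstate(games, "Example game", "France", "bob@example.com")
```

Refusing to overwrite a slot that already has a user keeps a mistyped nation in a reply from kicking out a player who was never removed.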
All this while juggling my day job and family while working from home due to the corona situation :O
And now here we are.
It was stress and drama, but hopefully it's done now, or at least most of it.
Cheers,
Martin