Hi David, thanks for your patience.
In the end, this did turn out to be a bug: we were not giving tests a fractional score for subtests (it was either all or nothing). This has been fixed, and the numbers on wpt.fyi have been regenerated. There was no top-level score change from this (i.e. it was generally minor) and no browser benefited significantly more than any other (i.e. they all had some partially-passing tests that are now being counted), so overall we believe it just gives a more accurate read on the status.
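To illustrate the difference, here is a rough sketch only. The function names and the passed/total formula are my assumptions for illustration, not the actual wpt.fyi scoring code, which may weight things differently:

```python
def all_or_nothing_score(subtests):
    """Old (buggy) behaviour: a test scores 1 only if every subtest passes."""
    return 1.0 if subtests and all(subtests) else 0.0


def fractional_score(subtests):
    """Fixed behaviour (assumed formula): fraction of subtests that pass."""
    if not subtests:
        return 0.0
    return sum(subtests) / len(subtests)


# Hypothetical test with four subtests, three of which pass.
results = [True, True, True, False]
print(all_or_nothing_score(results))  # 0.0
print(fractional_score(results))      # 0.75
```

A partially-passing test like this one previously contributed nothing and now contributes partial credit, which is why every browser saw small adjustments.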
There is now a jump between 4549 and 4557 for flexbox, from 0.939 to 0.957, that I believe corresponds to your changes. (Note that it won't align with Chrome releases, because the changes you made were to the tests rather than the Chrome binary, so it's instead about when the tests landed in WPT. A bit of guessing involved :D)