Hi Nicolás,
Sorry for the delay in replying.
Overall, I think your suggestion makes sense, but there are complications.
Platform-specific -expected.txt files are a Blink web_tests 'thing', not a WPT thing. They exist because there are tests which legitimately produce different output on different platforms. Even different Mac versions can output different things!
So in such a world, how do we detect whether differing outputs across platforms come from flakiness or from a legitimate difference? Note that for the autoroller to land a CL, the test must have produced the same output on those platforms twice: once for us to generate expectations and once to pass the CQ. So in the worst case the test is < 1/2 flaky, which is not a high bar, but we have to balance CI resources and latency against detecting such flakes. And just adding an entry to TestExpectations might give the wrong impression: if my test deterministically passes on Mac10.11 but fails on Mac10.12, the TestExpectations entry would now claim that the test is Pass/Failure flaky.
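To put rough numbers on the "same output twice" bar, here is a minimal sketch. It assumes a simplified model that is mine, not how the autoroller or CQ actually work: a test has exactly two possible outputs, runs are independent, there are no retries, and p is the probability of the less common output. The function name is made up for illustration.

```python
def same_output_twice(p: float) -> float:
    """Probability that two independent runs produce identical output.

    Assumed model (a simplification): the test has two possible outputs,
    the less common one appearing with probability p on any given run.
    Both runs match if they both give the common output or both give
    the rare one.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return (1.0 - p) ** 2 + p ** 2


if __name__ == "__main__":
    # Print the chance of slipping through for a few hypothetical flake rates.
    for p in (0.01, 0.1, 0.25, 0.5):
        print(f"flake rate {p:.2f}: matches twice with probability "
              f"{same_output_twice(p):.4f}")
```

Under these assumptions even a test that flips a coin on every run still matches itself half the time, so two matching runs alone say little about stability.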
One could imagine various solutions, including using the upstream WPT stability checks as input data, but they will all be heuristic-y. Others on the team may have more thoughts on detecting legitimate differences vs flakes.
Thanks,
Stephen