First of all, thank you for testing!!
I have an idea that is not related to scaling (it would be in addition to testing the Elo strength of nets, which is still very necessary).
The idea is to test endgame strength.
In fact, Dietrich is already doing this for his specialized net "ender".
He has a large set of endgame positions in PGN format (11- or 12-man positions).
The idea is to play matches against an external engine (usually SF9, with or without TBs).
For instance: 11248 against SF9 from those starting positions (each one played with colors reversed), and so on. Leela can be tested both with and without TBs to assess how much they help.
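A rough sketch of how such a color-reversed schedule could be put together and scored, in Python. This is only an illustration of the idea above, not Dietrich's actual setup: the names `Game`, `color_reversed_schedule` and `score_for`, the engine labels, and the example FEN are all made up here, and actually playing the games is left to whatever match tool you normally use.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Game:
    fen: str    # starting position
    white: str  # label of the engine playing White (hypothetical label)
    black: str  # label of the engine playing Black (hypothetical label)


def color_reversed_schedule(fens: Iterable[str], engine_a: str, engine_b: str) -> List[Game]:
    """Schedule two games per starting position, swapping colors in the second."""
    games: List[Game] = []
    for fen in fens:
        games.append(Game(fen, white=engine_a, black=engine_b))
        games.append(Game(fen, white=engine_b, black=engine_a))
    return games


def score_for(engine: str, games: List[Game], results: List[str]) -> float:
    """Total points for `engine`; results use PGN notation: '1-0', '0-1', '1/2-1/2'."""
    points = 0.0
    for game, result in zip(games, results):
        if result == "1/2-1/2":
            points += 0.5
        elif result == "1-0" and game.white == engine:
            points += 1.0
        elif result == "0-1" and game.black == engine:
            points += 1.0
    return points


# Placeholder position, only to show the shape of the data
# (the real set would be the 11- or 12-man positions from the PGN file).
example_fens = ["8/8/8/4k3/8/8/4P3/4K3 w - - 0 1"]
schedule = color_reversed_schedule(example_fens, "lc0-11248", "sf9")
```

Playing every position from both sides keeps an unbalanced starting position from favoring one engine, so the total score reflects endgame play rather than the luck of the draw.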
We do not know whether test20 plays endgames better or worse than test10. The same goes for test30 (for test30 it is especially interesting to play with TBs, since it has been trained with TB rescoring).
In summary, the idea is to measure endgame strength rather than overall strength (endgames are Leela's weakest point).
The latest nets could be tested, but you could also select a few of them to follow how endgame play evolves with training.
This is a field that has not been explored much, and I think it is much, much cheaper than (and different from) scaling.
I hope this helps. Thanks again!!
You can find the starting positions and an idea of how ender works here: