Thank you for sharing more great work.
I trained a CCRL net using the YAML from the blog, except with the value weight set to 1.0.
If I can find it again (sigh), I'll run a match against yours (value weight 0.25).
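For anyone following along, here is a rough sketch of what that difference amounts to, assuming the value weight simply multiplies the value loss term in a weighted sum with the policy loss. The function name and the loss numbers are made up for illustration and are not taken from the blog YAML or the training code.

```python
def total_loss(policy_loss: float, value_loss: float, value_weight: float) -> float:
    """Weighted sum of the two loss terms (policy weight assumed to be 1.0)."""
    return policy_loss + value_weight * value_loss

# Hypothetical per-batch losses, for illustration only.
policy_loss = 2.0
value_loss = 0.8

blog_total = total_loss(policy_loss, value_loss, value_weight=0.25)
my_total = total_loss(policy_loss, value_loss, value_weight=1.0)

# Fraction of the total loss contributed by the value term in each setup.
blog_value_share = 0.25 * value_loss / blog_total
my_value_share = 1.0 * value_loss / my_total

print(f"value weight 0.25: total={blog_total:.2f}, value share={blog_value_share:.0%}")
print(f"value weight 1.00: total={my_total:.2f}, value share={my_value_share:.0%}")
```

With the same raw losses, the higher weight makes the value head a much larger share of the gradient signal, which is why a head-to-head match between the two nets seems worth running.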
Separately, I'm wondering about RL vs. SL. I tend to think of SL as training from PGN input.
Of course, nets trained without policy info (beyond the single move played) are many hundreds of Elo weaker.
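To make the "only the one move played" point concrete, here is a minimal sketch of the two kinds of policy target: a one-hot target built from the played move alone, and a soft target built from a search visit distribution. The move list, the probabilities, and the helper name are all hypothetical, and this is plain cross-entropy rather than anyone's actual training code.

```python
import math

def cross_entropy(target, predicted):
    """Cross-entropy of the predicted distribution against the target distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

# Hypothetical candidate moves and network policy output for one position.
moves = ["e4", "d4", "Nf3", "c4"]
predicted = [0.40, 0.35, 0.15, 0.10]

# Plain PGN: all we know is that e4 was played, so the target is one-hot.
one_hot_target = [1.0, 0.0, 0.0, 0.0]

# PGN with policy info added, or self-play data: a visit distribution over moves.
visit_target = [0.45, 0.35, 0.12, 0.08]

print("one-hot target loss:", round(cross_entropy(one_hot_target, predicted), 4))
print("visit target loss:  ", round(cross_entropy(visit_target, predicted), 4))
```

The one-hot target tells the net nothing about how close d4 was to e4, while the visit target carries that information in every training sample.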
Another way to think of SL is as training without the rapid cycle that Lc0 uses, where many nets are produced in succession and checked with some kind of regression test.
If a net is trained with policy data (either from PGN files with it added, or from Lc0 self-play games where it is already included) in a longer training run (many more steps and samples), does that count as SL?
Perhaps a name for something in-between would be helpful?