Thank you very much for this global factual overview. I think I understand the importance of policy at some mathematical level: it is required as long as the position evaluation NN has some error to be cancelled out, which makes it a requirement for any self-play training.
If the position evaluation NN were expressive enough to sufficiently approximate the true evaluation function, then it seems to me that policy would not be needed during play. By true evaluation I mean the true outcome probabilities under best chess from a given position, which one could posit to exist, for example if the entire legal position space were known, along with the outcome distribution of every path played by some pair of players of given move quality (e.g. perfect chess, but it could also be other pairs). But the fact that policy during play, and its relation to something in the self-play database that is not yet represented in the evaluation NN, is needed for engine-competition measures of performance might just indicate that work still has to be done there. I know this is not necessarily feasible, and it is possibly a quirky point of view on how lc0 may have been performing all along. So I would think that both the policy-improvement angle and the NN architecture/training set-up might be plausible avenues for improvement. I still need to understand what the policy NN captures about the self-play games, and whether that covers all batches of self-play, from uniform policy to the "best chess" of the last batch considered. I am working on that; it might take some time.
In the meantime, I thought of something I should tell the OP while the iron is still hot in the development. Please do not hard-wire the initial standard position into your neural net executable. Make it as modular as RL versus SL and the other features of lc0 that are currently designed as modular, for example the loss function being delegated to a configuration file, which allows more experiments to be made without an overhaul of the source code.
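To make the request concrete, here is a minimal sketch of what I mean by not hard-wiring the start position. It is not lc0's actual configuration format: the names `SelfPlayConfig`, `root_fens`, `loss` and `load_config` are all hypothetical, made up for illustration. The only point is that the standard position becomes a default value instead of a constant buried in the executable.

```python
import json
from dataclasses import dataclass
from typing import List

# Standard chess starting position in FEN notation.
STANDARD_START_FEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"


@dataclass
class SelfPlayConfig:
    """Hypothetical experiment settings (not lc0's real config schema)."""
    root_fens: List[str]  # positions from which self-play games start
    loss: str             # name of the loss to train with, looked up elsewhere


def load_config(text: str) -> SelfPlayConfig:
    """Parse a JSON config, defaulting to the standard start position."""
    raw = json.loads(text)
    return SelfPlayConfig(
        root_fens=raw.get("root_fens", [STANDARD_START_FEN]),
        loss=raw.get("loss", "policy_plus_value"),
    )


# Example: start self-play from a king-and-pawn vs king endgame instead.
example = '{"root_fens": ["8/8/8/4k3/8/8/4P3/4K3 w - - 0 1"], "loss": "value_only"}'
cfg = load_config(example)
print(cfg.root_fens, cfg.loss)
```

With something like this, switching an experiment to endgame roots would be an edit to a text file and not a change to the source code.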
Why this advice? Because there are some experiments I have wanted to do, as a non-dev but interested "scientist" who understands machine learning concepts and other related mathematical notions. However, some discussions made clear to me what it would take to use endgame positions (e.g. TB classes) as roots for self-play training experiments, both to reduce hardware requirements and to have a common laboratory in which to test all sorts of ideas against a common performance reference system.
In TB land, one has the TB table as solved, from any position: not only those that best chess would explore from non-TB roots in upstream games, but also positions that could arise from any level of play "before" falling into TB classes. This gives all the questions and hypotheses a less costly toy laboratory, and also allows various performance measures to be defined and applied under many types of chess play, from all walks of chess life. Even engines of different mathematical frameworks, like SF, could be part of the experiments. It would be a common, non-tournament-based experimental set-up, with mathematically precise parameters and unlimited control over database sampling (well, we can sample the whole set of positions in the limit, to measure the statements I may have hinted at above).
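Here is a rough sketch of the kind of non-tournament performance measure I have in mind. To keep it runnable without any chess library, a tiny solved game stands in for a TB class (a pile of stones, each player removes 1, 2 or 3, whoever takes the last stone wins). That stand-in is my assumption for illustration only; with real chess the table would come from probing actual endgame tablebases. All function names here are hypothetical.

```python
import random

MOVES = (1, 2, 3)  # legal removals in the toy game


def build_table(max_n):
    """Solve the toy game: wdl[n] is +1 if the side to move wins, -1 if it loses."""
    wdl = [-1]  # n = 0: the previous player took the last stone
    for n in range(1, max_n + 1):
        # A position is as good as the best move's result, seen from our side.
        wdl.append(max(-wdl[n - k] for k in MOVES if k <= n))
    return wdl


def preserved_fraction(policy, wdl):
    """Fraction of non-terminal positions where the policy's move keeps the table value."""
    hits = 0
    for n in range(1, len(wdl)):
        k = policy(n)
        hits += (-wdl[n - k] == wdl[n])
    return hits / (len(wdl) - 1)


def perfect_policy(n):
    """Leave a multiple of 4 when possible; otherwise take one stone."""
    return n % 4 or 1


def make_uniform_policy(seed):
    """Zero-knowledge policy: every legal move is equally likely."""
    rng = random.Random(seed)
    return lambda n: rng.choice([k for k in MOVES if k <= n])


table = build_table(40)
print("perfect:", preserved_fraction(perfect_policy, table))
print("uniform:", preserved_fraction(make_uniform_policy(0), table))
```

The loop in `preserved_fraction` visits every position, which is the "sample the whole set in the limit" case. Any player, whatever its internals, can be plugged in as `policy` and scored against the same table.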
But I was made to understand that I would have to go through many lines of code to reproduce the whole lc0 training process called self-play, with the only change in conditions being the initial position, from which the zero-knowledge prior of uniform move probability given any position would be applied, as per the self-play training method.
Also, even without the above, it might be instructive to test all the parts of your development with that reduced-complexity set-up. It would keep all the other determinants of the game of chess in place, only without having to wait for hardware-constrained runs to spit out results, and it would give you some insight. I assume your intent in going from scratch is to learn from it. Also, human chess is often helped by the clearer chess mechanics that fewer men on the board allow, with each move potentially making a larger change in outcome odds (not all moves, but the best ones are likely to, compared to the more subtle deltas between candidate moves at the opening root).
Another bonus: some relationship could be established between precise mathematical measures of performance in TB land and engine-pair Elo-type performance measures.
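For the Elo side of such a relationship, the standard logistic Elo formula converts between a rating difference and an expected score. The sketch below only shows that conversion; how a TB-land measure would map onto a score fraction is exactly the open question, and the idea of feeding it scores from games started at TB roots is my assumption.

```python
import math


def expected_score(elo_diff):
    """Expected score (between 0 and 1) for a player rated elo_diff points higher."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def elo_diff_from_score(score):
    """Inverse of expected_score: rating difference implied by a score fraction."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return 400.0 * math.log10(score / (1.0 - score))


# Round trip: a 100-point advantage maps to a score and back again.
s = expected_score(100.0)
print(s, elo_diff_from_score(s))
```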
Has any of the above been approached already? If it was rejected, where could I read about it, and perhaps get convinced that this is a dead-end proposition?
I think, anyway, that the hybrid approaches need some characterizing tools from the clearer frameworks that SF11 and lc0 provide. That means the mathematical foundations of both engines need to be considered, and perhaps made comparable... again, TB land...