Refreshing my understanding of policy matters

DBg

Aug 20, 2022, 2:31:47 PM8/20/22
to LCZero
I understand supervised learning with neural nets fairly deeply.
But when it comes to reinforcement learning IN CHESS, I feel dumb,
as if something obvious to the people explaining it on the web has been sitting in my blind spot (and has been for a while now).

Instead of building my understanding on tic-tac-toe or Go and then having to transfer it to chess, I would prefer to do all my head-scratching with chess itself as the visual support. It may not really matter.

Could anybody help me understand better what is happening during play, given a policy head?  I have difficulty with the separation between probability sampling in training versus in play, and with how positions in the chess state space (a graph, in my view) get explored using "tree" search.
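
To make the separation concrete, here is roughly the distinction I think I am being told about, written as a toy sketch. The names (choose_move_training, choose_move_play) and the probs vector are mine, not anything from LC0, and I may well have the distinction wrong:

import numpy as np

# probs is a stand-in for whatever the policy head outputs over the legal
# moves of the current position (a probability vector summing to 1).

def choose_move_training(moves, probs, rng=np.random.default_rng()):
    # Exploration: draw a move at random according to the policy distribution.
    return moves[rng.choice(len(moves), p=probs)]

def choose_move_play(moves, probs):
    # Exploitation: just take the most probable move.
    return moves[int(np.argmax(probs))]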

I can understand MDPs and the optimisation formulas in math form.  However, when it comes to the sampling, in training versus in play, I feel I am missing something I need in order to make sense of all the explanations repeating the same rollout story.  It does not stick (I may be lacking the right common sense).

I get lost with the roll-out thing.  For me it just means sampling the probability distribution of actions given a position, the distribution being embedded in the policy head output vector (it is a vector, right?).  I don't know whether rollouts are only a training matter, or whether the accumulation of such simulations persists and acts as the equivalent of an exhaustive tree search, but built on top of a graph of eval-head evaluations.
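
To show what I picture when people say "rollout" (probably naively): a playout where moves are simply sampled from the policy distribution until the game ends. Toy code only; policy_net is a made-up stub (uniform over legal moves) standing in for the real policy head, and the board handling uses python-chess:

import chess
import numpy as np

def policy_net(board):
    # Stub standing in for the policy head: uniform probabilities over legal moves.
    moves = list(board.legal_moves)
    return moves, np.full(len(moves), 1.0 / len(moves))

def rollout(board, rng=np.random.default_rng()):
    # Sample moves from the policy distribution until the game is over.
    board = board.copy()
    while not board.is_game_over():
        moves, probs = policy_net(board)
        board.push(moves[rng.choice(len(moves), p=probs)])
    return board.result()  # "1-0", "0-1" or "1/2-1/2"

print(rollout(chess.Board()))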

Also, I am trying to tie together sparse chunks of information I have read in the past:
  1. There exists policy data that is not the policy head weights.
  2. There exist supervised trainees of LC0, trained with RL LC0s as the target. Those have sometimes been dubbed "no policy" networks.

About 1: what was that policy data? Is it only a training thing (something that exists during one iteration, without a weight update)?

About 2: perhaps before I tackle any other question, the baseline of a typical PLAY search by one of those "no policy" versions would help me.  Is there even a search?  Is it some trivial policy based only on evaluation differentials across the breadth considered, and if so, what is the stopping criterion?  Or is it just the best successor eval from the root?
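
To pin down what I mean by "just the best successor eval from the root", a guess only; value_net is a made-up stub (random score) standing in for the eval head, and the rest is python-chess:

import chess
import random

def value_net(board):
    # Stub standing in for the value/eval head: a score for the side to move.
    return random.uniform(-1.0, 1.0)

def greedy_move(board):
    # No search at all: evaluate every successor of the root with the eval head
    # and pick the move leading to the position worst for the opponent (1-ply negamax).
    best_move, best_score = None, -float("inf")
    for move in board.legal_moves:
        board.push(move)
        score = -value_net(board)  # value_net sees the opponent to move, so negate
        board.pop()
        if score > best_score:
            best_move, best_score = move, score
    return best_move

print(greedy_move(chess.Board()))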

Whatever the correct understanding of such policy-sampling searches, do all the explored nodes get their full head evaluations computed and contributing to the decision,
and not only some branch leaves (if that even applies)?

The state space of positions being a graph, I have difficulty reconciling that with tree search. I guess I could if we call it a decision-tree search running on top of a graph of positions, with all possible actions connecting the positions: the games being paths, but the decision making looking at some graph of transitions from there. That may seem like a silly distinction to people who do tree traversals for a living, but my visual support is that chess positions live on a graph, with nodes in bijection with positions.

A transposition is then just many edges going into the same position.
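
Since a toy example says it better than my words, this is the graph picture I mean. The names are mine, and a position is identified (approximately) by its FEN with the move counters dropped, so two move orders reaching the same position collapse into one node:

import chess
from collections import defaultdict

def position_key(board):
    # Identify a position independently of the move order that reached it
    # (drop the halfmove and fullmove counters from the FEN).
    return " ".join(board.fen().split()[:4])

def build_graph(board, depth, edges=None):
    # Collect edges position -> successor positions, merging transpositions.
    if edges is None:
        edges = defaultdict(set)
    if depth == 0:
        return edges
    key = position_key(board)
    for move in board.legal_moves:
        board.push(move)
        edges[key].add(position_key(board))
        build_graph(board, depth - 1, edges)
        board.pop()
    return edges

# A transposition shows up as a node with more than one distinct parent:
graph = build_graph(chess.Board(), depth=3)
parents = defaultdict(set)
for parent, children in graph.items():
    for child in children:
        parents[child].add(parent)
print(sum(1 for p in parents.values() if len(p) > 1), "positions reachable from more than one parent")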

Sorry for packing it all in. Please chop as needed.