I've been formulating an idea - I call it "low-entropy-driven exploration" - which would use entropy calculations to bias exploration toward "sharp" areas of the decision tree. My intuition (and a couple of casual simulations) is that search efficiency can be improved by recognizing that some nodes are "cheaper" to search than others. In chess, low-entropy nodes correspond to moves driven by checks, attacks, and sacrifices that lead to forcing sequences.
The driver is not that these sequences are more likely to be better, but that it is much quicker (requires far fewer visits) to determine whether they are good or bad compared to "branchier" nodes. This idea is obvious in other contexts. If I'm looking for my keys, even if I'm 95% sure they are somewhere in my bedroom and only 5% sure they are on my (otherwise clear) kitchen table, the table has "nowhere to hide": one glance will definitively tell me whether they're on it or not. So it's more efficient to spend a few seconds searching the 5% table before dedicating several minutes to searching the 95% bedroom. If you find yourself nodding along with this, stop and ask: how could one possibly reach that decision if the only things they were told about the search space were
* 95% likelihood bedroom
* 5% likelihood kitchen table
* 0 visits each
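The answer is that you couldn't: those three facts say nothing about how long each search takes, which is the whole point. A toy calculation (with entirely made-up costs - say a definitive glance at the table takes 5 seconds and a thorough bedroom search takes 300 - and treating each search as a single atomic check) shows how the ordering falls out once cost is known:

```python
# Made-up costs, just to illustrate the missing ingredient: how long each
# definitive check takes. Not a model of anything real.
p_table, p_bedroom = 0.05, 0.95
table_cost, bedroom_cost = 5, 300   # seconds per definitive check

# Expected total time, stopping as soon as the keys turn up.
table_first   = table_cost + p_bedroom * bedroom_cost    # 5 + 285  = 290 s
bedroom_first = bedroom_cost + p_table * table_cost      # 300 + 0.25 ≈ 300 s
print(table_first, bedroom_first)
```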
Anyway, the basic concept is to add a new variable to the exploration part of exploration+exploitation. This new variable, entropy, would track the distribution of probabilities that a playout through a node reaches a given unexplored (or terminal) leaf of its subtree; the probabilities across all such leaf nodes thus sum to 1. Consider two nodes, each with a subtree containing 10 reachable leaf nodes. The first node (based on current policy/value/visit counts) might have a 10% chance of reaching each of the 10 leaves. The other might have a 91% chance of reaching one leaf and a 1% chance for each of the other 9. The first has a much higher entropy (the mathematical maximum for 10 leaves, in fact) than the second; thus, even if the first node looks a little better than the second (say, a Q-value of 0.3 vs 0.1), the low entropy of the second node should incline Leela to prioritize searching it over the first. Again - NOT because the second node is more likely to be better, but simply because if it *is* truly better (or truly worse), it will take far fewer playouts to learn that fact than it will for the first.
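To make that comparison concrete, here is a minimal sketch (my own illustration, nothing to do with lc0's actual code) that assumes the leaf-reach probabilities for a subtree are already in hand and just measures their Shannon entropy:

```python
import math

def leaf_reach_entropy(leaf_probs):
    """Shannon entropy (in bits) of the probability of reaching each leaf."""
    return -sum(p * math.log2(p) for p in leaf_probs if p > 0.0)

uniform_node = [0.10] * 10              # 10% chance of each of 10 leaves
forcing_node = [0.91] + [0.01] * 9      # one dominant forcing line

print(leaf_reach_entropy(uniform_node))  # ~3.32 bits (the maximum for 10 leaves)
print(leaf_reach_entropy(forcing_node))  # ~0.72 bits (a much "sharper" subtree)
```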
I have no idea what the ultimate math should look like, but I expect that even a small nudge to Leela's exploration term that spends a little more time on sharp/forcing lines, as this strategy would do, would do wonders for her tactical game.
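For what it's worth, here is one entirely made-up way the nudge could look: scale the usual PUCT-style exploration term by a bonus that grows as the subtree's leaf-reach entropy shrinks. The constants, the normalization, and the function names below are my assumptions for illustration, not a proposal for lc0's actual formula:

```python
import math

def puct_with_entropy_bias(q, prior, parent_visits, child_visits,
                           subtree_entropy, max_entropy,
                           c_puct=1.5, entropy_weight=0.2):
    """PUCT-style score with a small bonus for low-entropy ("sharp") subtrees."""
    u = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    # Map entropy to a "sharpness" in [0, 1]: 1 = a single forced line, 0 = maximally branchy.
    sharpness = 1.0 - (subtree_entropy / max_entropy) if max_entropy > 0 else 0.0
    return q + u * (1.0 + entropy_weight * sharpness)

# The two example nodes from above: with these particular numbers the sharp node
# ends up with the higher score despite its lower Q, so it gets visited sooner.
sharp = puct_with_entropy_bias(q=0.1, prior=0.3, parent_visits=100, child_visits=1,
                               subtree_entropy=0.72, max_entropy=math.log2(10))
branchy = puct_with_entropy_bias(q=0.3, prior=0.3, parent_visits=100, child_visits=1,
                                 subtree_entropy=math.log2(10), max_entropy=math.log2(10))
print(sharp, branchy)
```

Because the bonus rides on the exploration term rather than on Q, it fades as visits accumulate and the value estimate takes over, which seems like the right behavior for a "spend a few cheap visits here first" heuristic.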