Pre-AlphaZero deep learning and chess


DB

unread,
Nov 1, 2019, 3:32:58 PM11/1/19
to LCZero
CNN versus DNN

I'm trying to research the literature on (deep) machine learning approaches tried for chess before AlphaGo (or within that project before publication).

Does anybody know whether fully connected deep neural network (DNN) architectures were tried before the convolutional neural network (CNN) approach was adopted?

Was it a question of processing power, or of superfluous parameters (i.e. CNNs do as well as DNNs on chess data sets)?

I'm trying to avoid getting lost in the implementation literature before the gist of my question, which is of a basic nature, is answered.

I'm not up to date on deep machine learning, in the sense that I don't know much about reinforcement learning and its constraints on neural net architecture (is there still a fully connected unsupervised initialisation phase, if there is such a phase at all?). There may be reasons for using CNNs over DNNs other than the options in my question (processing power, or equivalence of results).

The motivation for the question is the input encoding and the local-correlation assumptions underlying CNNs; while I would think that DNNs are the real zero
approach, it would be nice to know whether even DNN approaches behave like CNNs, which would support the CNN bias.

Literature pointers to shorten my search appreciated, or other threads in this forum.

Brian Richardson

unread,
Nov 2, 2019, 6:27:09 AM11/2/19
to LCZero

Brian Richardson

unread,
Nov 2, 2019, 6:28:41 AM11/2/19
to LCZero

Cary Knoop

unread,
Nov 3, 2019, 9:20:47 PM11/3/19
to LCZero

Does anybody know whether fully connected deep neural network (DNN) architectures were tried before the convolutional neural network (CNN) approach was adopted?

A convolutional neural network is a kind of deep neural network.  

AlphaZero's neural network used both convolutional and fully connected layers.
 

Deep Blender

unread,
Nov 4, 2019, 7:30:56 AM11/4/19
to LCZero
The advantage of convolutions is that there are plenty of well understood architectures which use them. This includes ResNet, which is used as a building block of Leela. Those architectures are an important factor when it comes to stable training. On the other hand, fully connected layers may use similar ideas, but they are not as well researched. You would also need a lot more parameters. If I remember correctly (I can't find the source right now), DeepMind tried both approaches and got similar results.
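As an aside, the ResNet building block mentioned above is simple enough to sketch. The following is a minimal numpy illustration of the residual idea (output = input + learned correction), with random placeholder weights and with batch normalization omitted; it is not Leela's actual network code, just the shape of the technique:

```python
import numpy as np

def conv3x3(x, w):
    """Naive 3x3 'same' convolution: x is (C_in, 8, 8), w is (C_out, C_in, 3, 3)."""
    c_out = w.shape[0]
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))   # zero-pad the board edges
    out = np.zeros((c_out, 8, 8))
    for o in range(c_out):
        for i in range(8):
            for j in range(8):
                out[o, i, j] = np.sum(xp[:, i:i+3, j:j+3] * w[o])
    return out

def residual_block(x, w1, w2):
    """y = relu(x + conv(relu(conv(x)))) -- the ResNet skip connection."""
    h = np.maximum(conv3x3(x, w1), 0.0)        # first conv + ReLU
    return np.maximum(x + conv3x3(h, w2), 0.0) # add the input back, then ReLU

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 8, 8))            # 64 feature planes on an 8x8 board
w1 = rng.standard_normal((64, 64, 3, 3)) * 0.01
w2 = rng.standard_normal((64, 64, 3, 3)) * 0.01
y = residual_block(x, w1, w2)
print(y.shape)  # (64, 8, 8): shape is preserved, which is what makes blocks stackable
```

The skip connection is what stabilizes training at depth: with small (or zero) conv weights the block starts out close to an identity map, so stacking many of them does not destroy the signal.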

The choice has no bearing on whether you can call it a "true" zero approach. There are plenty of parameters that have to be tweaked, and the architecture of the neural network is just one of those. Zero just means that no human games were used for the training. Overall, there is still a lot of human intervention to stabilize the learning, but no chess-specific knowledge is directly fed into the neural network.

Graham Jones

unread,
Nov 4, 2019, 11:36:57 AM11/4/19
to LCZero
I think the reason that convolutional nets are more widely used is that they are a better match for current hardware. The problem with fully connected nets is that each weight is used only once. My GPU (GTX 1070) can do about 50 multiply-and-adds (i.e. consume 50 weights) in the time it takes to load one weight. The ratio would be more extreme with tensor cores. In a convolutional net, the same weights are used all over the image (or board). It is this advantage of convolutional nets that led to the things Deep Blender mentioned (more research, etc.).
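The reuse argument can be made concrete with a back-of-the-envelope count. The channel counts below are illustrative choices, not Leela's actual configuration:

```python
# Compare how often each weight is reused in one forward pass of a single layer
# processing an 8x8 board with 64 input and 64 output channels.

board = 8 * 8            # 64 squares
c_in, c_out = 64, 64     # illustrative channel counts

# Convolutional layer: one 3x3 kernel per (in, out) channel pair, slid over all squares.
conv_weights = c_out * c_in * 3 * 3              # 36,864 weights
conv_madds = conv_weights * board                # every weight reused at all 64 squares

# Fully connected layer on the same tensor (64*8*8 inputs -> 64*8*8 outputs).
fc_weights = (c_in * board) * (c_out * board)    # ~16.8 million weights
fc_madds = fc_weights                            # each weight used exactly once

print(f"conv: {conv_weights} weights, reuse factor {conv_madds // conv_weights}")
print(f"fc:   {fc_weights} weights, reuse factor {fc_madds // fc_weights}")
print(f"fc needs {fc_weights // conv_weights}x more weights for one layer")
```

On these illustrative sizes the fully connected layer needs roughly 455 times the weights for the same tensor shapes, and each weight amortizes only one multiply-add, which is exactly the memory-bandwidth problem described above.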

Convolutions also make a lot of sense for images. They probably make good sense for Go, and rather less for chess. If you process a batch of chess positions, you might be able to successfully exploit GPUs with a fully connected net.

Personally, I avoid using the 'zero' concept altogether: it's a brand, leave it at that. It makes more sense to me to talk about generality. Convolutions suit board games played on regular grids. It's not obvious how you'd use them for a card game. Maybe, just maybe, fully connected nets could be shown to be better than convolutional nets, if the task is to play both board games and card games with the 'same' architecture.

Graham

Cary Knoop

unread,
Nov 4, 2019, 11:46:38 AM11/4/19
to LCZero


Personally, I avoid using the 'zero' concept altogether: it's a brand, leave it at that. It makes more sense to me to talk about generality. Convolutions suit board games played on regular grids. It's not obvious how you'd use them for a card game. Maybe, just maybe, fully connected nets could be shown to be better than convolutional nets, if the task is to play both board games and card games with the 'same' architecture.


The "Zero" concept refers to zero subject knowledge, i.e. zero chess or go knowledge.  It does not refer to zero architectural or meta parameter knowledge.   For that, you need to go one step above and use meta-learning.

Dariouch Babaï

unread,
Nov 4, 2019, 5:34:36 PM11/4/19
to lcz...@googlegroups.com

Hello, thanks for your reply. I take it as a notice of interest in the question of CNNs versus fully connected DNNs. Ready to read more?

On 03/11/2019 at 21:20, Cary Knoop wrote:

Does anybody know whether fully connected deep neural network (DNN) architectures were tried before the convolutional neural network (CNN) approach was adopted?

A convolutional neural network is a kind of deep neural network.

Yes, but take the complement of CNNs within DNNs, and call it DNN (fDNN?).

The deep-layer training paradigm shift was for those more general DNNs (more free parameters). It was more than 10 years ago: Hinton et al., and a bit later Bengio et al.; sorry not to be exhaustive, those are the ones I was exposed to, 10 or more years ago.

CNNs have been around since before that, since the 1980s I am pretty sure, already being used for visual discrimination tasks (digit recognition). They were trainable before the term "deep" was introduced in the scientific literature and broke into the commercial world. But the innovation was that even wilder DNNs were now trainable, and proven able to solve many problems that ANNs were thought unable to tackle.

Adjusting terminology:

Because of that innovation, I was reserving "DNN" for the fully connected DNN, excluding CNNs (an abuse), because it is the one closest to zero assumptions about the training data, while CNNs come from vision-problem assumptions and the architecture of the animal visual cortex, where pixels as input are naturally suited to a convolution bias on the connections of the initial multi-convolution-layer network (local lateral correlations at the input encoding level is the assumption, for each convolution layer, I think).

Vision context:

Pixel images for discrimination or detection tasks show high local correlations (positive or negative), so yes, you can generate feature detectors through convolution layers (different kernels, but still local correlations at each layer: neighbouring pixel input neurons have their parameters pooled, with different pooling at different layers, even when the inputs are no longer pixels). We are very good at detecting texture continuity, and edge detection might be part of it, a feature easily found with convolution layers; in animals we even know which layers do which job, and how images are decomposed into the various layers' feature assignments, maybe not all the way to the grandmother cell in humans. This is stuff I learned more than 15 years ago, and you may already know all of it; it is just to emphasize the context of the recent "A.I." efforts and their technological, commercial, widely publicized successes, and also where inertia may come from.


AlphaZero's neural network used both convolutional and fully connected layers.


Only the last layer; otherwise the universal approximation would not hold. The deep part, the inner layers, is never fully inter-connected. That's my current understanding.

Now, I hope we agree that there are two types of DNN: fully connected DNNs on one hand (no pooling, more parameters), and convolutional DNNs, i.e. CNNs, on the other.

Not all DNNs are CNNs. I abusively make the assignment: DNN == DNNs minus CNNs.

You may stop here if interest is waning, or, even more so, if I'm wrong, given my intended meaning, which is now clear, I think. Any pointer supporting the existence of prior experiments or attempts with fully inter-connected inner layers, before the zero Go story, or after it but before going straight to the next board game, chess? If so, I would gladly pause the discussion here, awaiting your reply. The rest will have been good practice in packaging my ideas.

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Otherwise, and if you are still curious, at some later time (I'm not in a hurry; basic questions are nice that way, they stay valid a long time...), I propose some arguable elements for a discussion below. You are welcome to propose rebuttals; pointers to go with them, even better. Take your time, and please don't get stuck on the side points.

1st side point:

I wonder whether, given the high parallelism of the hardware at hand, CNNs were not preferred as best suited to showcase the technology, or out of some kind of professional deformation. Not a central point I want to make; this would just be a psychological explanation for not considering the fully connected initialization approach.

2nd side point:

My other non-central hypothesis is that they were so surprised to see a probability-based approach be so efficient at learning a non-probabilistic game, and learn so fast, in Go, that they decided that all non-probabilistic board games were alike, and that their traditional deterministic yet greedy search-based engines could be beaten by learning as if the games were probabilistic. (There is a tricky point here: what is randomness? Are dice really random, or is it just that we are too lazy to compute the initial conditions with enough precision for the given friction parameters of the landing table?)

Stronger points:

My point is that, for chess, it may not have been shown that a fully connected deep learning initialization would not have done better.

Given Google's processing power, and given that as long as there is at least one fully connected layer any separating function can in theory be approximated, it was ensured that all board games could be blown away by this AlphaGo statistical engine.

But Lc0, the open source project, does not work with the same computing power. The fully connected unsupervised initialization may still be an option to try, leaving the rest of the recipe as is (unless it has been shown that it does not do better, or unless there are fundamental incompatibilities between those recipes and the unsupervised many-parameters phase).

The local-correlation-of-inputs assumption may be natural in Go (I don't know much; that's my next search), but I wonder if chess is different. With only pawns in chess, perhaps, one chess position encoding would be highly connected to its "neighbouring" input; see how careful one has to be with the encoding in relation to the convolution setup. I can't even formulate the assumption correctly without a specific chess position encoding suitable for the input layer.
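For reference on the encoding question: the AlphaZero family represents a position as a stack of binary planes over the board grid, one plane per piece type and colour, plus extra planes for side to move, castling rights and move history (omitted here). A minimal sketch of the piece planes only; the plane ordering is my own arbitrary choice, not Leela's exact format:

```python
import numpy as np

PIECES = "PNBRQKpnbrqk"   # white then black; this ordering is an arbitrary choice

def board_to_planes(board):
    """board: list of 8 strings, rank 8 first, '.' for empty squares.
    Returns a (12, 8, 8) array of one-hot piece planes."""
    planes = np.zeros((12, 8, 8), dtype=np.float32)
    for r, rank in enumerate(board):
        for f, piece in enumerate(rank):
            if piece != ".":
                planes[PIECES.index(piece), r, f] = 1.0
    return planes

start = [
    "rnbqkbnr",
    "pppppppp",
    "........",
    "........",
    "........",
    "........",
    "PPPPPPPP",
    "RNBQKBNR",
]
planes = board_to_planes(start)
print(planes.shape)        # (12, 8, 8)
print(int(planes.sum()))   # 32 pieces on the board
```

The point relevant to the question above: this encoding preserves board geometry, so a convolution's "local correlation" assumption is at least well defined on it, whereas a flat FEN string would scramble adjacency entirely.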

CNNs may actually be the long way around, but Google's processing power did not care; the paradigm shift (in game-engine gladiators this time, not ML) could be had without fussing over my kind of question. The machine was already sitting there for Go, so why not try it as is and make a proof of concept, needing only to beat the best of the old guard (the automaton?).

I will keep searching for the contrary on my own, but if, now that I have made the distinction precise, you have pointers that do show experiments with a fully connected initial condition for the whole network (zero assumptions on the training data and its internal structure), please share them; I will then understand your statement above to mean all layers being fully connected, not just the last.

If the uselessness of the extra connections has been shown for chess (I insist on chess, not Go, not vision problems), then I will stop.

Go has a simpler local move-generation rule set than chess, but a bigger territory, is my current naive understanding (I have never played). Maybe stones can be placed at a distance, which makes me wonder whether the CNN assumption from the start has been sub-optimal for Go as well (obviously it is enough to beat human-biased engines in both games, with the brute force of Google's parallel processing).

Do you know about metric spaces and topology (not of the network, but of the space containing the data)? The initialization by a fully connected deep network (layer-wise), as some sort of multi-dimensional auto-encoder, is a way to re-encode or transform the training set to give it nice separating functions for the final task (discrimination, or probability estimation: contour levels instead of a partition). The feature set accessible to convolution is tuned to vision problems, to things like edge detection or its opposite, continuity detection, this decomposition being passed further into the cortex and somehow made available as meaningful objects by whatever consciousness is. That is 10+ year old knowledge, so you can tell me I'm wrong by obsolescence, with new knowledge (or pointers to it).

For now, the previous group member who replied has given me a set of pointers, mostly populated, on the theory side, by convolutions of various depths and some widening experiments, but always with vision as the underlying task for determining where parameters are superfluous or not, or the equivalence of such wider connections with adjustments to the convolution parameters.

The pointers on the implementations before AlphaZero (Giraffe, DeepChess), in GitHub projects, hold more promise, as I will be able to scrutinize the various input encodings used.

I will make a comparison of Go input encodings versus chess ones, learning about Go along the way. My current hypothesis, while waiting for information on conclusive DNN experiments, is that Go can be shown to be similar in its state definition (position at end of turn, GNU Go encoding) to a vision problem, or that the encoding shows the local correlation assumption to be natural there. I don't think it will be as obvious with chess encodings, whether from classical engine practice or in AlphaGo-goes-to-chess.

Any pointer, or assurance on your part that such a pointer exists, any pointer, I say, to evidence that fully connected approaches are not better than CNNs given the same computing power, and I will stop wasting my time, and yours... although I like fundamental questions more than implementations: fewer details to consider, and spaces are so much fun.

Dariouch Babaï

unread,
Nov 4, 2019, 11:50:26 PM11/4/19
to lcz...@googlegroups.com
I meant: thank you very much for taking the time to give me a reading program.

Are you interested in the question that motivates me? Namely, the neural-net re-encoding of input data to give the training data some topology. Let me elaborate.

The constraints of the CNN on the DNN should translate, I think, into a restriction on the type of input encoding for the training data; I guess some thinking has been done there. Or, without choosing a particular encoding, did the natural existing descriptions from classical chess programming just turn out to fit a convolution architecture assumption? I need to look into those inputs.

My perhaps very personal and possibly wrong understanding was that the initial unsupervised initialization of the DNN (layer-wise, as some sort of auto-encoder at each layer, and so forth) was building the most separating topology possible for the space containing the data manifold, based on all possible (auto-)correlations from the input, for which no prior knowledge was necessary. At the time, it was still back-propagation applied in that context. Having a separable space allows one to find envelopes of categories of interest easily, minimizing non-smooth boundaries (maybe that's too much interpretation).

Like Vapnik's theory leading to SVMs, but without using target classes, only all possible auto-correlations. Is that completely outdated as a concept?

I need to learn how that story changes when using reinforcement learning, or whether the two are not mutually exclusive; I'll see.

But I invite you to comment on my possibly wishful ideation of DNN learning (from some 10 years ago); it might put me in the right perspective while reading your links.

If you have interest and time, of course.

Best regards.
--
You received this message because you are subscribed to a topic in the Google Groups "LCZero" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/lczero/NHCgy-ARsnk/unsubscribe.
To unsubscribe from this group and all its topics, send an email to lczero+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/lczero/097eab7c-d930-4d1e-928b-a6003344bd74%40googlegroups.com.

DB

unread,
Nov 5, 2019, 2:03:57 AM11/5/19
to LCZero
Erratum: in the previous long post, I made a very unfortunate choice of words that could lead to confusion in ML and probability computations.
 

A partition can be a separation of a space into complementary sub-sets. However, the partition function is an important notion in statistical physics and in the ML probabilistic formalism. I think that the partition function can generate all the statistical moments of a probability law upon successive differentiation.
It is also an object important to the theory of certain MCMC algorithms. Best look at Goodfellow, Courville and Bengio's ML book for the rigorous mathematical presentation.
I just don't want to use this term loosely, which I did up there. Better to use "separating hyper-surface" (not really a function, although a function could be used in its specification, such as f = 0 defining the surface).
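For the record, the statistical-physics fact alluded to above, written out for an energy-based distribution with inverse temperature beta (standard notation, not tied to any particular model in this thread):

```latex
p(x) = \frac{e^{-\beta E(x)}}{Z(\beta)}, \qquad
Z(\beta) = \sum_x e^{-\beta E(x)},
\qquad\text{so that}\qquad
\langle E \rangle = -\frac{\partial \log Z}{\partial \beta}, \qquad
\operatorname{Var}(E) = \frac{\partial^2 \log Z}{\partial \beta^2}.
```

Successive derivatives of log Z generate the higher cumulants of the energy in the same way, which is the sense in which the partition function "generates all the moments".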

DB

unread,
Nov 5, 2019, 2:36:00 AM11/5/19
to LCZero

Some points need to be developed. I'm sorry, and hope you manage to read through and continue the discussion.

DeepMind tried both approaches and got similar results.

This is what I want to get my hands on, in detail: at least traces where the input encoding and unsupervised initialization are spelled out.

"DeepMind tried both approaches"
For chess (not Go, not vision tasks)? Are you sure?
 

There should be some trace somewhere, no? Not just the assumption that surely this must have been done. That's generally not a good assumption, especially when not bothering already gives huge results (still Google's processing power at play, and the late arrival of ML and its probability point of view in the chess-engine programming community).


So, please try to remember where. The articles on ResNet and the equivalence statements I have seen so far, from the first reply, were made not on board-game problems but on vision problems, where convolution is appropriate (all animals have such layers in their visual cortex).


Let's not be picky about vocabulary. For me, zero means zero possible human bias: whether there is any or not, we are not taking the chance that some historical human biases in the breadth of exploration of all possible chess games have been restricting opening theory, for example.


While at it: the heuristic compromises to full exhaustive search made by classical engines are also human knowledge, and may not just be dismissing random well-spread outliers, but connected, perhaps even convex in the right re-encoding, learnable manifolds (not a space, but more than just a list or set: some ability to have local metrics, and globally tractable separating hyper-surfaces or boundaries).


That is the intent behind letting the training be an exploration in all directions, as much as possible, for as long as possible. Maybe not "for as long as possible": something has to decide, and not too early, that enough isotropic random-move self-play games have been explored; that is part of the training, I guess, converging to that point under control. Anyway, initially, all directions (assuming we have a space representation of the chess state, or position; otherwise "direction" is meaningless). Convolution is a limited transformation of the raw input into such a space; it may have collapsed entire sub-spaces, or manifolds, by assuming all non-local inner-layer contributions to the next layer to be zero (that is what convolution means).


In theory one hidden layer can approximate any function (such as a winning-termination boundary surface). It is just not efficient, and forces you onto a very tiny set of parameter regions to tweak, because of the inflexibility of the initial encoding (one layer means you already think your encoding is completely learnable, i.e. separable).
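The one-hidden-layer claim is the classical universal approximation theorem. As a toy illustration of the building block it relies on, a single hidden layer of just two ReLU units with hand-set weights already represents a non-linear function exactly; approximating arbitrary continuous functions takes nothing more than many such units summed:

```python
def relu(z):
    return max(z, 0.0)

def abs_net(x):
    """One hidden layer, two ReLU units, hand-set weights: computes |x| exactly,
    since |x| = relu(x) + relu(-x).  Sums of such ridge functions can approximate
    any continuous function on a compact set (width grows with the accuracy)."""
    h1 = relu(1.0 * x)           # hidden unit 1: input weight +1
    h2 = relu(-1.0 * x)          # hidden unit 2: input weight -1
    return 1.0 * h1 + 1.0 * h2   # output weights both +1, no bias needed

for x in (-3.0, -0.5, 0.0, 2.0):
    print(x, abs_net(x))
```

The theorem guarantees existence of such weights, not that gradient descent will find them efficiently, which is exactly the efficiency caveat in the paragraph above.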


Do you really think that FEN or PGN, or bandwidth-compression-minded encodings of those, offer a well separable space to learn from when used at the input layer? That would be very lucky.


The lack of an intelligent mapping of the whole data space before learning to win a game may be a missed easy opportunity to compensate for not having Google's brute-force parallel processing.


Human biases:


Convolution is so ubiquitous because vision problems are as well, especially where commercial forces are at play. Also, convolutions have been around for a long time. I would not be surprised if the Go success was a hastening factor in going straight to chess without any changes, given the spectacular success and how late ML was introduced into chess computing.


Regarding fully connected (well, restrict it a bit: inner-layer-to-inner-layer fully connected): you may have missed the deep learning publications initiated by Hinton et al. and Bengio et al., was it 10 years ago or more? I stopped following ML research some 10 years ago (hence the 10-year time mark), but has their work been proven wrong since then? Or is big processing power making biases in architecture workable, as long as one has a universal approximator in the end?


Make the fancier, more elaborate training algorithms correct the wrong initial bias (convolution), which may have introduced blind spots into the intended zero-knowledge self-exploration: the sub-set of all possible chess games that could not be separably represented given the convolution restrictions. The book Deep Learning, by Goodfellow, Courville and Bengio, is a bit heavy on the matrix math, but it does give a good look at what I am trying to push here.

Deep Blender

unread,
Nov 5, 2019, 6:44:38 AM11/5/19
to LCZero
At the time when AlphaGo/AlphaZero were released, I was curious how they achieved everything. As I wrote, I am pretty certain, but not absolutely sure, that DeepMind tried both convolutions and fully connected networks. I can't remember whether it was for chess or Go. I believe it was mentioned by David Silver in an interview/lecture/presentation. It is not worth the time trying to find it, because we wouldn't know more details anyway.

You seem to be looking for the purest zero approach (please correct me if I am wrong). If that is your goal, looking at the neural network is, from my point of view, the wrong starting point. A reasonably trained and configured neural network is at the end of the day only some kind of knowledge representation. Arguing whether fully connected or convolutional neural networks are more in the spirit of the zero approach is pointless, because you are also going to have non-linearities (e.g. ReLU) and batch normalization in the architecture, which are both clearly human-biased and heavily engineered. It is also known that the input encoding is very important for the training of the neural network, and the input encoding is always human-designed. The neural network is trained with heavily engineered methods, no matter which kind of optimizer is used. And all those aspects are just about the training of the neural network.
Further, and likely even more important for the quality of Leela, Monte Carlo tree search has a huge human bias. The exploration used in Leela is also engineered to work as well as possible. Both of those are core components of Leela, with a far more significant impact than details of the neural network.
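To make the "engineered exploration" concrete: AlphaZero-style MCTS selects a move at each node with the PUCT rule, which mixes the learned value estimate Q with the network's policy prior P and the visit counts. A minimal sketch; the c_puct constant is an illustrative value, not Leela's tuned one:

```python
import math

def puct_select(children, c_puct=1.5):
    """children: list of dicts with prior P, visit count N, total value W.
    Returns the index of the child maximizing Q + U (the PUCT rule)."""
    n_parent = sum(ch["N"] for ch in children)
    best, best_score = 0, -float("inf")
    for i, ch in enumerate(children):
        q = ch["W"] / ch["N"] if ch["N"] > 0 else 0.0               # exploitation term
        u = c_puct * ch["P"] * math.sqrt(n_parent) / (1 + ch["N"])  # exploration term
        if q + u > best_score:
            best, best_score = i, q + u
    return best

# An unvisited move with a high prior beats a visited, mediocre one:
children = [
    {"P": 0.6, "N": 0, "W": 0.0},    # high prior, never visited
    {"P": 0.2, "N": 10, "W": 4.0},   # Q = 0.4
    {"P": 0.2, "N": 5, "W": 1.0},    # Q = 0.2
]
print(puct_select(children))  # 0
```

Every piece of this formula, the c_puct constant, the sqrt, the 1+N denominator, is an engineering choice, which is the human-bias point being made above.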

The core issue is that something like a "zero" architecture or reinforcement learning approach does not exist (we don't even remotely know what it would look like). All you can do at this point is define it into existence, by changing the meaning of certain terms.

Benedetto Romano

unread,
Nov 5, 2019, 8:38:59 AM11/5/19
to LCZero

Fantastic reply, one of the best I have ever read. Compliments ;-)

Deep Blender

unread,
Nov 6, 2019, 7:42:38 AM11/6/19
to LCZero
I am aware that my reply appears quite destructive. But what I am describing is what actually happens in any neural network, including its training. Many people seem to think of them in quite abstract or even mythical terms. But at the end of the day, it is way more accurate to think of them as heavily engineered machines which have been optimized for some tasks.

Dariouch Babaï

unread,
Nov 6, 2019, 1:22:34 PM11/6/19
to lcz...@googlegroups.com
This discussion sub-thread about my initial motivation is totally warranted. I am not a Platonist, or a purist, but I often think in conflicting dichotomies to splice through foggy concepts or problems. I can easily formulate radical purist-looking ideas, but just as scaffolding (mathematical professional deformation: the logical conclusions of radical statements may be easier to learn from, when wrong, than the most subtle and plausible affirmation).

To answer the purist hypothesis: well, I have been ignorant of the past 10 years of ML history (I just became aware of its emergence into the commercial world, through newspapers). I actually always have a cynical hypothesis lurking in my sub-conscious; I don't like to get stuck without an escape hypothesis.
So I am looking in the zero-purist direction without thinking there is an actual existing end to it. I am only asking for a step back, just asking whether the tools at hand, and the "virginity" of the playing field in terms of ML, may have made it easy to rush through without optimizing for less parallel-compute-intensive approaches.

Also, I am not looking only for better gladiators; I want to learn about chess from a scientific viewpoint, more physical science than social science (I guess in English that's called the humanities), whose methodology I view most of past chess history as following (forgive the English style).

Classical chess programming may have introduced more mathematical-looking language, and good gladiators, but from the objective of learning or solving the whole game of chess, they do need some clearing up (or cleaning up).

I view the application of a probabilistic framework to a known finite deterministic problem as a good place to minimize the impact of human history from the start. The whole set of chess games is my scientific curiosity. I don't need zero human input everywhere, only where it might create blind spots in the exploration potential of the whole set of legal chess games.

Now, that was only my philosophical reply. I now need to read the 3-4 days of backlog I have accumulated from this thread (I have been out of touch), and comment on each idea in a new meandering sub-sub-thread, hoping not to get lost.

DB

unread,
Nov 6, 2019, 10:07:46 PM11/6/19
to LCZero
So, your reply may have been about motivation and perspective. Even if I were to adopt yours (I do, at times), this does not have much bearing on my questions, does it?

Are you thinking that dropping the adjectives "abstract" and "mythical" answers or dismisses the relevance of the following question:
have non-local layer-to-layer connections been tried for the game of chess? That is not abstract. Any trace of such experiments? Abstract? Mythical?

Whatever my perspective, these questions stand, don't they?

DB

unread,
Nov 6, 2019, 10:41:33 PM11/6/19
to LCZero
Adding to my previous post.

One of your points is that the architecture of the initial deep network layers does not have as much impact on the game-playing results as the hyper-parameters of the training algorithms do. Is there some methodology that could make this kind of statement more than a hypothesis? Or convincing published or public examples?

Remember, I'm ignorant of the last 10 years, and I may have a professional deformation about what I think I know.

It would be easy to stop this line of questions, with knowledge and pointers.

You may not have such answers; that does not mean that nobody else does.

I am still eager to have my doubts answered before I go into concrete work with some of my own experiments or data analysis (I would not want to reinvent the wheel in vain; I'm lazy that way). If anybody could make Deep Blender's reminiscence more precise, please do make a post here, or give web pointers.

DB

unread,
Nov 6, 2019, 11:06:01 PM11/6/19
to LCZero
Is this series going to help me?

Lessons From Implementing AlphaZero - Oracle Developers - Medium

https://medium.com/oracledevs/lessons-from-implementing-alphazero-7e36e9054191

DB

unread,
Nov 7, 2019, 7:08:21 PM11/7/19
to LCZero
I am answering myself because, upon (trying to) fill some gaps in my knowledge about the AlphaZero approach, I realized that I had already answered some of my questions in my many buried hypothetical answers.

Below, in the excerpt from an earlier post of mine, in the second paragraph, was the hypothesis of a fundamental incompatibility between the two-phase early deep learning developed 10 years ago and the AlphaZero approach.

Yes, there is a fundamental incompatibility, for the simple reason that unsupervised deep-layer initialization is a batch process.

The whole training set is needed to provide the best separable representation from which to start the supervised learning task.

You could view this as: first, a non-task-oriented general mapping of the state space, transformed into a well spread representation of the training data; then, second, a supervised learning task having an easier job, because the global optimum is not hiding in a tiny corner that random initial weights would often reduce it to (that's practical experience with MLPs).

In the AlphaZero approach, given my current and hopefully improving understanding, the mapping and the supervised task are entangled through time. The initial exploration in the self-play training starts with isotropic move probabilities; as some end-points (or surrogate end-points) get visited, the valuation is able to assign probability estimates, which in turn start affecting the exploration policy's move preferences, and upon re-visitation of certain lines the whole couple gets tuned toward some convergence where it has learned how much to explore and what is good.

My point being that initially the mapping is not supervised by end-points, and only upon solidification or confirmation of the path policy, in conjunction with valuation of positions (or lines), does it get influenced by the task (winning the game).

Somehow the interaction between valuation and policy contains in itself what's needed to know that it has explored quiet moves enough? Perhaps hyper-parameters.

Conclusions: 

1) What I thought could have been easily tried was actually inconceivable.

2) However, there remains the question of the non-local weights in the inner layers being held at zero before self-play training, and how much state-space exploration is omitted as a consequence,
          e.g. preventing detection of certain patterns, or correct valuation of patterns whose piece correlations could not be taken into account: move a bishop beyond the typical convolution support, and the correlation (as a sequence) between before and after would be lost, I think.

While I admit that it is not clear to me exactly how non-local lateral weights being stuck at 0, or not, translates into valuation estimates of move lines (or positions), I still think that pieces moving outside the convolution-support hyper-parameter clearly can't be part of the learning, so we might lose some predicting power. How much? Negligible? How can one tell?
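One mitigating fact worth noting here: with stride-1 convolutions the receptive field grows by kernel_size - 1 squares per layer, so after a handful of 3x3 layers, let alone a deep residual tower of them, every output square can in principle depend on every input square, long bishop moves included. Whether the network learns to use that long-range dependence efficiently is a separate question. A quick count:

```python
def receptive_field(n_layers, kernel=3):
    """Receptive field (in squares, per dimension) of a stack of stride-1
    convolutions: each k x k layer adds k - 1 squares."""
    return 1 + n_layers * (kernel - 1)

# How many 3x3 layers until one output square 'sees' the whole 8x8 board?
n = 0
while receptive_field(n) < 8:
    n += 1
print(n, receptive_field(n))  # 4 layers already give a 9x9 receptive field
```

So the zero non-local weights constrain each individual layer, not the composition of layers; depth restores the global view, at the cost of spreading long-range interactions across many layers.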

I guess, perhaps, the MCTS can fix somehow, through valued move trees, what was ignored at the spatial-correlation level.
But then, which is more efficient: the architecture, or the MCTS?


On Monday, November 4, 2019 at 5:34:36 PM UTC-5, DB wrote:


Stronger points:

My point is that, for chess, it may not have been shown that a fully connected deep learning initialization would not have done better.

...
