Does anybody know whether fully connected deep neural network (DNN) architectures were tried before the convolutional neural network (CNN) approach was adopted?
Personally, I avoid using the 'zero' concept altogether: it's a brand, leave it at that. It makes more sense to me to talk about generality. Convolutions suit board games played on regular grids; it is not obvious how you would use them for a card game. Maybe, just maybe, fully connected nets could be shown to be better than convolutional nets if the task is to play both board games and card games with the 'same' architecture.
Hello, thanks for your reply. I take it as a notice of interest in the question of CNNs versus fully connected DNNs. Ready to read more?
Does anybody know whether fully connected deep neural network (DNN) architectures were tried before the convolutional neural network (CNN) approach was adopted?
A convolutional neural network is a kind of deep neural network.
Yes, but take the complement of CNNs within DNNs and call that DNN (fDNN?).
The deep-layer training paradigm shift was for those more general DNNs (more free parameters). It was more than 10 years ago, Hinton et al., and a bit later Bengio et al.; sorry not to be exhaustive, those are the ones I was exposed to, 10 years ago or more.
CNNs have been around since before that, since the 80s I am pretty sure, already being used for visual discrimination tasks (digits?). They were trainable before the term "deep" was introduced in the scientific literature and pierced into the commercial world. But the innovation was that even wilder DNNs were now trainable, and proven able to solve many problems that ANNs were thought unable to tackle.
Adjusting terminology:
Because of that innovation, I was reserving "DNN" for the fully connected DNN, excluding CNNs. That is an abuse, but the fully connected DNN is the one closer to zero assumptions about the training data, while CNNs come from vision-problem assumptions and animal visual cortex architecture, where pixels as input are naturally well suited to a convolution bias on the connections of the initial multi-convolution-layer network (local lateral correlations at the input encoding levels; that is the assumption for each convolution layer, I think).
Vision context:
Pixel images for discrimination or detection tasks show high local correlations (positive or negative), so yes, you can generate feature detectors through convolution layers: different kernels, but still local correlations at each layer. Neighbouring pixel input neurons have their parameters pooled, with different pooling at different layers, even once the inputs are no longer pixels. We are very good at detecting texture continuity, and edge detection might be part of it, a feature easily found with convolution layers; in animals we even know which layers do which job and how images are decomposed into the various layers' feature assignments, maybe not all the way to the grandmother cell in humans. This is stuff I learned more than 15 years ago. You may already know all that; it is just to emphasize the context of the recent "A.I." efforts: technological, commercial, well-diffused successes. And also, where inertia may come from.
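To make that "local correlation" point concrete, here is a minimal numpy sketch (toy 8x8 image, a hand-picked Sobel-like kernel, all values made up for illustration): sliding a small kernel over the image only responds where neighbouring pixels differ, which is exactly the bias that convolution layers build in.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 2-D cross-correlation, 'valid' padding, stride 1."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy image: dark left half, bright right half -> one vertical edge.
image = np.zeros((8, 8))
image[:, 4:] = 1.0

# Sobel-like vertical-edge kernel: responds only where neighbours differ.
edge_kernel = np.array([[-1.0, 0.0, 1.0],
                        [-2.0, 0.0, 2.0],
                        [-1.0, 0.0, 1.0]])

response = conv2d_valid(image, edge_kernel)
print(response)  # non-zero only in the columns straddling the edge
```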
AlphaZero's neural network used both convolutional and fully connected layers.
Only the last layer; otherwise the universal approximation property would not hold. The deep part, the inner layers, is never fully inter-connected. That is my current understanding.
Now, I hope we agree that there are two types of DNN: fully connected DNNs on one hand (no pooling, more parameters), and convolutional DNNs, i.e. CNNs here, on the other. Not all DNNs are CNNs. I abusively make the assignment: DNN == DNN - CNN.
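To give a sense of the scale of that "more parameters", a back-of-the-envelope sketch in plain Python (the channel counts are assumptions, not lc0's actual dimensions): mapping an 8x8 board with 32 channels to 32 channels costs roughly 9 thousand parameters as a 3x3 convolution, versus roughly 4.2 million as a fully connected layer, because weight sharing plus locality remove most of the free parameters.

```python
# Hypothetical layer sizes, for illustration only.
C, K, H, W = 32, 32, 8, 8      # input channels, output channels, board height/width
kernel = 3

conv_params = K * (C * kernel * kernel) + K            # shared 3x3 filters + biases
fc_params   = (H * W * K) * (H * W * C) + (H * W * K)  # every output sees every input

print(f"conv layer : {conv_params:,} parameters")   # ~9k
print(f"dense layer: {fc_params:,} parameters")      # ~4.2 million
```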
You may stop here if interest is waning, or, even more so, if I am wrong, given my intended meaning, which I think is now clear. Any pointer to support the existence of prior experiments or attempts with fully inter-connected inner layers, before the zero Go story, or after it but before going straight to the next board game, chess? If so, I would gladly pause the discussion here, awaiting your reply. The rest will have been good practice for packaging my ideas.
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Otherwise, and if you are still curious, at some later time (I'm not in a hurry; basic questions are so nice that way, they stay valid a long time), I propose some arguable elements for a discussion below. You are welcome to propose rebuttals; pointers to go with them would be even better. Take your time, and please don't get stuck on the side points.
1st side point:
I wonder, given the high parallelism of the hardware at hand, whether CNNs were not preferred simply as best suited to showcase the technology, or out of some kind of professional deformation. This is not a central point I want to make; it would just be a psychological explanation for not considering the fully connected initialization approach.
2nd side point:
My other non-central hypothesis is that they were so surprised to see a probability-based approach be so efficient at learning a non-probabilistic game, and learn Go so fast, that they decided all non-probabilistic board games were alike, and that the traditional deterministic yet greedy search-based engines could be beaten by learning as if the games were probabilistic. (There is a tricky point here: what is randomness? Are dice really random, or are we just too lazy to compute the initial conditions with enough precision for the given friction parameters of the landing table?)
Stronger points:
My point is that, for chess, it may not have been shown that a fully connected deep-learning initialization would not have done better.
Given Google's processing power, and given that as long as there is at least one fully connected layer any separation function can in theory be approximated, it was all but guaranteed that every board game could be blown away by this AlphaGo statistical engine.
But lc0, the open-source project, does not work with the same computing power. The fully connected unsupervised initialization may still be an option to try, leaving the rest of the recipe as is (unless it has been shown that it does not do better, or unless there are fundamental incompatibilities between that recipe and the unsupervised, many-parameters phase).
The local-correlation-of-inputs assumption may be natural in Go (I don't know much; that is my next search), but I wonder whether chess is different. With only pawns in chess, perhaps, one chess position encoding would be highly connected to its "neighbouring" inputs. Note how careful one has to be about the encoding in relation to the convolution setup: I cannot even formulate the assumption correctly without a specific chess position encoding suitable for the input layer.
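For concreteness, here is a minimal sketch of the kind of plane encoding that makes a chess position "image-like" (numpy; the helper name, square convention and piece ordering are my own assumptions, not AlphaZero's or lc0's actual input format):

```python
import numpy as np

PIECES = ["P", "N", "B", "R", "Q", "K", "p", "n", "b", "r", "q", "k"]

def planes_from_piece_map(piece_map):
    """piece_map: dict mapping (rank, file) in 0..7 to a piece letter."""
    planes = np.zeros((len(PIECES), 8, 8), dtype=np.float32)
    for (rank, file), piece in piece_map.items():
        planes[PIECES.index(piece), rank, file] = 1.0
    return planes

# Tiny example: white king e1, black king e8, white pawn e4.
position = {(0, 4): "K", (7, 4): "k", (3, 4): "P"}
x = planes_from_piece_map(position)
print(x.shape)  # (12, 8, 8) -- a stack of "images" that a convolution can scan
```

Whether neighbouring squares in such planes are as strongly correlated as neighbouring pixels in a photograph is exactly the question I am raising.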
CNNs may actually be the long way around, but Google's processing power did not care. The paradigm shift (in game-engine gladiators this time, not ML) could be had without fussing over my kind of question: the machine was already sitting there for Go, so why not try it as is and make a proof of concept? They only needed to beat the best of the old guard (the automaton?).
If the uselessness of the extra connections has been shown for chess (I insist on chess, not Go, not vision problems), then I will stop.
Go has a simpler local move-generation rule set than chess, but a bigger territory, is my current naive understanding (I have never played). Maybe stones can be placed at a distance, which makes me wonder whether the CNN assumption from the start has been sub-optimal for Go as well (obviously it is enough to beat human-biased engines in both games, with the brute force of Google's parallel processing).
Do you know about metric spaces and topology (not the network's topology, but that of the space containing the data)? The initialization by a fully connected deep network (layer-wise), as some sort of multi-dimensional auto-encoder, is a way to re-encode or transform the training set so that it has nice partition functions for the final task (discrimination, or probability estimation: contour levels instead of a partition). The feature set accessible to convolution is tuned to vision problems, to things like edge detection or its opposite, continuity detection, this decomposition being presented further along the cortex and somehow made available as meaningful objects by whatever consciousness is. This is 10+ year-old knowledge, so you can tell me I am wrong by obsolescence, with new knowledge (or pointers to it).
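As a toy illustration of what I mean by that layer-wise, auto-encoder-style initialization (my reading of the Hinton/Bengio-era recipe; the sizes, data and training loop below are made up, and this is not anything lc0 or DeepMind actually does), here is a greedy stacked-autoencoder sketch in numpy:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder_layer(x, hidden, epochs=200, lr=0.05):
    """One tied-weight autoencoder layer: encode with tanh(xW + b), decode with W^T."""
    n, d = x.shape
    W = rng.normal(0, 0.1, (d, hidden))
    b = np.zeros(hidden)
    c = np.zeros(d)
    for _ in range(epochs):
        h = np.tanh(x @ W + b)          # encode
        x_hat = h @ W.T + c             # decode with tied weights
        err = x_hat - x                 # reconstruction error
        dh = (err @ W) * (1 - h ** 2)   # backprop through tanh
        W -= lr * (x.T @ dh + err.T @ h) / n
        b -= lr * dh.mean(axis=0)
        c -= lr * err.mean(axis=0)
    return W, b

# Toy "training positions": 256 random 64-dimensional vectors.
data = rng.normal(size=(256, 64))

layer_sizes = [48, 32]          # assumed widths of the stacked encoders
weights, codes = [], data
for hidden in layer_sizes:
    W, b = train_autoencoder_layer(codes, hidden)
    weights.append((W, b))
    codes = np.tanh(codes @ W + b)   # codes become the next layer's input

print([w.shape for w, _ in weights])  # pretrained stack: (64, 48), (48, 32)
```

Each layer only ever learns to reconstruct its own input; the supervised, game-specific objective would come afterwards, as fine-tuning of the stacked encoders.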
For now, the previous group member who replied has given me a set of pointers, mostly populated, on the theory side, by convolutions of various depths and some widening experiments, but always with vision as the underlying task, to determine where parameters are superfluous or not, or the equivalence of such wider connections with adjustments to the convolution parameters.
The pointers to implementations before AlphaZero (Giraffe, DeepChess), as GitHub projects, hold more promise, as I will be able to scrutinize the various input encodings used.
I will make a comparison of Go input encoding versus chess, learning about Go along the way. My current hypothesis, while waiting for information on conclusive DNN experiments, is that Go can be shown to be similar in its state definition (position at end of turn, GNU Go encoding) to a vision problem, or that the encoding shows the local correlation assumption to be natural there. I don't think it will be as obvious with chess encodings, whether from classical engine practice or in AlphaZero's move from Go to chess.
Any pointer, or assurance on your part that such a pointer exists, any pointer, I say, to evidence that fully connected approaches are not better than CNNs, given the same computing power, and I will stop wasting my time, and yours... although I like fundamental questions more than implementations: fewer details to consider, and spaces are so much fun.
Some points need to be developed; I'm sorry, and I hope you manage to read through and continue the discussion.
DeepMind tried both approaches and got similar results.
"Deepmind tried both approaches"for Chess, (not Go, not Vision tasks), you are sure?
There should be some trace somewhere, no? Not just the assumption that surely this must have been done. That is generally not a good assumption, especially when not bothering already gives huge results (still Google processing power at play, and late awareness of ML and its probability point of view in the chess engine programming community).
So, please try to remember where. The articles on ResNets and the equivalence statements I have seen so far, from the first reply, were made not on board game problems but on vision problems, where convolution is appropriate (all animals have such layers in their visual cortex).
Let's not be picky about vocabulary. For me, "zero" means zero possible human bias: whether there is any or not, we are not taking the chance that some historical human bias in the breadth of exploration of all possible chess games has been restricting opening theory, for example.
While at it, the heuristic compromises to full exhaustive search made by classical engines are also human knowledge, and what they dismiss may not just be randomly scattered outliers, but connected, perhaps even convex in the right re-encoding, learnable manifolds (not a full space, but more than just a list or a set: something with local metrics and globally tractable separating hyper-surfaces or boundaries).
That is the intent behind letting the training be an exploration in all directions, as much as possible, for as long as possible. Maybe not literally for as long as possible, but something has to decide, and not too early, that enough isotropic random-move self-play games have been explored; that is part of the training, I guess, to converge to that point, under control. Anyway, initially, all directions (assuming we have a space representation of the chess state, or position; otherwise "direction" is meaningless). Convolution is a limited transformation of the raw input into such a space; it may have collapsed entire sub-spaces, or manifolds, by assuming that all non-local inner-layer contributions to the next layer are equal to zero (that is what convolution means).
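A small numpy sketch of that last claim, in one dimension for brevity (toy input and kernel values of my choosing): a 3-tap convolution is the same map as a fully connected layer whose weight matrix has been forced to be banded, with zeros away from the local window, and with every row sharing the same three values.

```python
import numpy as np

x = np.arange(8.0)                 # toy 8-long input "row" of the board
k = np.array([0.5, -1.0, 0.5])     # one shared local kernel

# Equivalent dense matrix: shape (6, 8), each row is the kernel shifted by one.
W = np.zeros((6, 8))
for i in range(6):
    W[i, i:i + 3] = k

conv_out  = np.convolve(x, k[::-1], mode="valid")  # cross-correlation via convolve
dense_out = W @ x
print(np.allclose(conv_out, dense_out))            # True: same map, far fewer free weights
```

The point of the exercise is only that the convolutional map lives inside the fully connected family as a heavily constrained special case; whether that constraint is appropriate for chess is the open question.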
In theory one hidden layer can approximate any function (such as the winning-termination boundary surface). It is just not efficient, and it forces you onto a very tiny set of parameter regions to tweak, because of the inflexibility of the initial encoding (one layer means you already believe your encoding is completely learnable, i.e. separable).
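A toy numpy illustration of that trade-off, under a simplification of my own: the hidden weights below are random and only the output layer is fitted by least squares, so it understates what a fully trained hidden layer could do, but it shows a one-hidden-layer net approximating a smooth target only once the layer is made wide.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 400).reshape(-1, 1)
y = np.sin(2 * x).ravel()                        # arbitrary smooth target

for width in (4, 32, 256):
    W = rng.normal(0, 2, (1, width))             # fixed random hidden weights
    b = rng.normal(0, 2, width)
    H = np.tanh(x @ W + b)                       # one hidden layer of features
    coef, *_ = np.linalg.lstsq(H, y, rcond=None)  # fit only the output layer
    err = np.sqrt(np.mean((H @ coef - y) ** 2))
    print(f"width {width:4d}: RMSE {err:.3f}")    # error shrinks as width grows
```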
Do you really think that FEN, or PGN, or bandwidth-compression-minded encodings of those, offers a well-separable space to learn from when used at the input layer? That would be very lucky.
The lack of an intelligent mapping of the whole data space before learning to win a game may be a missed, simple opportunity to compensate for not having Google's brute-force parallel processing.
Human biases:
Convolution is so ubiquitous because vision problems are as well, especially where commercial forces are at play. Also, CNNs have been around for a long time. I would not be surprised if the Go success was a hastening factor in going straight to chess without any changes, given the spectacular success and how late ML was introduced into chess computing.
Regarding fully connected (well, restrict that a bit to inner-layer-to-inner-layer fully connected): you may have missed the deep learning publications initiated by Hinton et al. and Bengio et al., was it 10 years ago or more? I stopped following ML research some 10 years ago (hence the 10-year mark), but has their work been proven wrong since then? Or is big processing power making biases in the architecture workable, as long as one has a universal approximator at the end?
Let the fancier, more elaborate training algorithms correct the wrong initial bias (convolution), which may have introduced blind spots into the intended zero-knowledge self-exploration: a subset of all possible chess games that could not be separably represented given the convolution restrictions. The book Deep Learning, by Goodfellow, Bengio and Courville, is a bit heavy on the matrix math, but it does give a good look at what I am trying to push here.