Thanks y'all, for your participation in that fascinating conversation! I'm including Conal in the "to" line to avoid the race condition where he joins the group after I send this :)
This email ended up longer than anticipated. It has 3 sections:
1. Genetic algorithm finding an FPGA configuration with weird electromagnetic effects.
2. Infinite-width neural networks
3. The training efficiency and quality benefits of finetuning a pretrained model
I mentioned an FPGA configured to produce analog output via electromagnetic interference. I dug around a bit but didn't find that; I think I instead misremembered the details of "An evolved circuit, intrinsic in silicon, entwined with physics" (what a cool title!!), from 1996. The summary is that they used a genetic algorithm to discover an FPGA configuration that could discriminate between a 1 kHz input square wave and a 10 kHz square wave, without access to a clock. Some particularly interesting quotes from the paper:
In the Analysis section:
This circuit is discriminating between inputs of period 1ms and 0.1ms using only 32 cells, each with a propagation delay of less than 5ns, and with no off-chip components whatsoever: a surprising feat. Evolution has been free to explore the full repertoire of behaviours available from the silicon resources provided, even being able to exploit the subtle interactions between adjacent components that are not directly connected. The input/output behaviour of the circuit is a digital one, because that is what maximising the fitness function required, but the complex analogue waveforms seen at the output during the intermediate stages of evolution betray the rich continuous-time continuous-value dynamics that are likely to be internally present.
[The cells] shaded gray cannot be clamped without degrading performance, even though there is no connected path by which they could influence the output as they were not present on the pruned diagram of Fig. 6. They must be influencing the rest of the circuit by some means other than the normal cell-to-cell wires: this probably takes the form of a very localised interaction with immediately neighbouring components. Possible mechanisms include interaction through the power-supply wiring, or electromagnetic coupling.
Earlier in the results section:
By generation 1400, the neat behaviour for the 1kHz input had been abandoned, but now the output was mostly high for the 1kHz input, and mostly low for the 10kHz input... with very strange looking waveforms. This behaviour was then gradually improved. Notice the waveforms at generation 2550 - they would seem utterly absurd to a digital designer. Even though this is a digital FPGA, and we are evolving a recurrent network of logic gates, the gates are not being used to 'do' logic. Logic gates are in fact high-gain arrangements of a few transistors, so that the transistors are usually saturated - corresponding to logic 0 and 1. Evolution does not 'know' that this was the intention of the designers of the FPGA, so just uses whatever behaviour these high-gain groups of transistors happen to exhibit when connected in arbitrary ways (many of which a digital designer must avoid in order to make digital logic a valid model of the system's behaviour). This is not a digital system, but a continuous-time, continuous-value dynamical system made from a recurrent arrangement of high-gain groups of transistors - hence the unusual waveforms.
So, the output of this FPGA is somewhat analog in some of the intermediate generations! But it is not like a DAC, which is what I had incorrectly recalled.
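To make the setup concrete, here's a minimal sketch of the kind of generational loop the paper describes. In the real experiment each genome was a bitstring FPGA configuration scored on the physical chip; the genome length, population size, mutation rate, and the stand-in "match a hidden target" fitness below are all placeholder assumptions of mine, not the paper's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(42)
GENOME_BITS = 64      # placeholder; the real FPGA configuration bitstring was much longer
POP_SIZE = 50
GENERATIONS = 200
MUTATION_RATE = 0.02  # per-bit flip probability

# Stand-in fitness: in the real experiment, each genome was loaded onto the
# actual chip and scored on how well its output separated the 1 kHz and
# 10 kHz inputs. Here we just score agreement with a hidden target bitstring.
target = rng.integers(0, 2, GENOME_BITS)

def fitness(genome):
    return int(np.sum(genome == target))

def mutate(genome):
    flips = rng.random(GENOME_BITS) < MUTATION_RATE
    return np.where(flips, 1 - genome, genome)

def crossover(a, b):
    point = rng.integers(1, GENOME_BITS)  # single-point crossover
    return np.concatenate([a[:point], b[point:]])

population = rng.integers(0, 2, (POP_SIZE, GENOME_BITS))
for generation in range(GENERATIONS):
    scores = np.array([fitness(g) for g in population])
    order = np.argsort(scores)[::-1]
    elite = population[order[: POP_SIZE // 2]]  # keep the best half unchanged
    children = [
        mutate(crossover(elite[rng.integers(len(elite))],
                         elite[rng.integers(len(elite))]))
        for _ in range(POP_SIZE - len(elite))
    ]
    population = np.vstack([elite, np.array(children)])

best = max(population, key=fitness)
```

The spooky part of the paper is exactly that the fitness function was evaluated on real hardware rather than in a clean simulation like this one, so evolution was free to exploit any physical effect that raised the score.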
The paper I saw most recently about infinite-width neural networks is "Neural Tangents: Fast and Easy Infinite Neural Networks in Python", from Google Brain. So, there are certainly people thinking about generalizing neural "networks" beyond the graph metaphors.
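To show what "infinite width" buys you without pulling in their library: for a one-hidden-layer ReLU network with i.i.d. Gaussian weights, the covariance of the hidden units has a closed form in the infinite-width limit (the order-1 arc-cosine kernel). This is just my own numpy illustration of that limit, not the Neural Tangents API:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([1.0, 0.5, -0.2])
y = np.array([0.3, -1.0, 0.8])

def empirical_kernel(x, y, width):
    """Average product of a random ReLU layer's responses to x and y."""
    W = rng.normal(size=(width, len(x)))  # i.i.d. N(0, 1) first-layer weights
    return np.mean(np.maximum(W @ x, 0) * np.maximum(W @ y, 0))

def arccos_kernel(x, y):
    """Analytic infinite-width limit: the order-1 arc-cosine kernel."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    theta = np.arccos(np.clip(x @ y / (nx * ny), -1.0, 1.0))
    return nx * ny * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

# The empirical kernel converges to the closed form as the layer gets wider.
for width in (10, 1_000, 100_000):
    print(width, empirical_kernel(x, y, width), arccos_kernel(x, y))
```

Once you have the kernel in closed form, the "network" disappears entirely and you can do exact Bayesian inference with it, which is the sense in which these papers move beyond the graph metaphor.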
One of the silver linings of Covid is that some conferences now have open "poster session" recordings, like the one linked above. Hopefully this is a tradition that will continue! ICLR has an incredible number of fascinating papers with associated short presentations - https://iclr.cc/virtual_2020/papers.html
I also talked a little bit about the idea of pretraining a language model like BERT or GPT on a huge corpus with a generic task, like prediction of part of the input that was hidden. It turns out that predicting obscured input forces the model to learn lots of latent structure.
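The hiding scheme BERT uses is called masked language modeling. A sketch of the masking step (the 15% selection rate and the 80/10/10 split below are the choices described in the BERT paper; the tiny vocabulary is my own placeholder):

```python
import random

random.seed(0)
VOCAB = ["[MASK]", "the", "cat", "sat", "on", "mat", "dog", "ran"]

def mask_for_mlm(tokens, mask_prob=0.15):
    """BERT-style masking: pick ~15% of positions as prediction targets; of
    those, 80% become [MASK], 10% a random token, 10% are left unchanged."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok  # the model must recover the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = "[MASK]"
            elif r < 0.9:
                inputs[i] = random.choice(VOCAB[1:])
            # else: leave the token visible, but still predict it
    return inputs, labels

tokens = ["the", "cat", "sat", "on", "the", "mat"] * 20
inputs, labels = mask_for_mlm(tokens)
```

To guess the hidden tokens well, the model is forced to learn syntax, word co-occurrence, and a fair amount of world knowledge - the "latent structure" above.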
This pretraining effort can then be reused by finetuning the model on a specific task. What I neglected to mention is that there is an additional benefit beyond the efficiency of reusing the earlier training effort: these finetuned models tend to perform much better than models trained from scratch on a particular task. My understanding is that this is due to two effects:
1. Typically the dataset for a more specialized task will be much smaller than the huge corpora used for pretraining models like BERT. So, finetuning allows the model to reuse all of that structure to solve a more specific problem.
2. There is a concept called "curriculum learning", where the "difficulty" of the training examples is increased as the model becomes increasingly capable. The intuition is straightforward - why waste cycles training the model on challenges it has little hope of accomplishing? This is very similar to how we tend to structure our own learning. Tasks like MLM (masked language modeling, the prediction task mentioned earlier) are often "easier" than the sorts of tasks the model is likely to be finetuned on. So, in a sense it is a two-phase curriculum: the model starts off by doing the "easy" task of soaking up the latent structure of a massive corpus, and then you turn its attention towards the particular challenging task you have in mind. This ends up with much better results than throwing a tabula-rasa model straight into the deep end of learning your particular task. When Google released BERT a couple years ago, it achieved state-of-the-art results on 11 different NLP tasks, by finetuning it on those tasks.
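The pretrain-then-finetune payoff shows up even in a toy linear model - this is purely my own illustration, not anything from the BERT paper. We "pretrain" on lots of data from one task, then finetune on ten examples of a slightly different task, and compare against training on those ten examples from scratch:

```python
import numpy as np

rng = np.random.default_rng(0)

def train(X, y, W=None, lr=0.1, steps=200):
    """Gradient descent on squared loss for a linear model. Passing W starts
    from existing weights ("finetuning"); W=None trains from scratch."""
    n, d = X.shape
    if W is None:
        W = rng.normal(scale=0.01, size=d)
    for _ in range(steps):
        W = W - lr * X.T @ (X @ W - y) / n
    return W

# "Pretraining" task: plenty of data generated by a ground-truth weight vector.
w_true = rng.normal(size=20)
X_pre = rng.normal(size=(1000, 20))
W_pre = train(X_pre, X_pre @ w_true)

# "Downstream" task: only 10 examples of a closely related target.
w_task = w_true + 0.1 * rng.normal(size=20)
X_ft = rng.normal(size=(10, 20))
y_ft = X_ft @ w_task

W_scratch = train(X_ft, y_ft)               # tabula rasa
W_finetuned = train(X_ft, y_ft, W=W_pre)    # reuse the pretrained weights

# For Gaussian test inputs, expected squared error equals weight-space distance.
err_scratch = float(np.sum((W_scratch - w_task) ** 2))
err_finetuned = float(np.sum((W_finetuned - w_task) ** 2))
print(err_scratch, err_finetuned)
```

With only ten examples in twenty dimensions, the scratch model can't pin down the target at all, while the finetuned one starts close to it and only needs a small correction - a cartoon of effect 1 above.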
Fun stuff! Talk with y'all later.