Multi-head Self-Attention vs 2D Convolutions


Cristian Garcia

May 29, 2020, 4:03:13 PM
to LCZero
The grid pattern of the chess board seems like a natural fit for a CNN architecture, but given that we already have a high-level representation of the board, convolutions seem a bit wasteful: it takes several layers for the network to relate two pieces on opposite sides of the board, whereas MHA can do this straight away.
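To make "straight away" concrete, here is a minimal PyTorch-style sketch (names and shapes are illustrative, not LCZero's actual code):

```python
import torch
import torch.nn as nn

squares, channels = 64, 256              # 8x8 board flattened to 64 tokens
x = torch.randn(1, squares, channels)    # (batch, squares, features)

# One self-attention layer relates every square to every other square.
mha = nn.MultiheadAttention(embed_dim=channels, num_heads=8, batch_first=True)
out, attn = mha(x, x, x)                 # attn: (1, 64, 64), full-board receptive field

# Contrast: a 3x3 convolution only mixes a 3x3 neighborhood per layer,
# so relating opposite corners takes roughly 7 stacked layers.
conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
y = conv(x.transpose(1, 2).reshape(1, channels, 8, 8))
```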

Any thoughts on this?

Deep Blender

Jun 2, 2020, 7:53:34 PM
to LCZero
With the success of transformer networks, I expected someone would try to apply some variation to board games. Unfortunately, I haven't seen anything in that direction yet.
Intuitively, it feels like a great opportunity with lots of potential. It is not even necessary to switch completely, as transformers could be mixed with CNNs, e.g. by flattening the CNN representation and concatenating it with the MHA-driven part of the model.
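A minimal sketch of that mixing idea, under my assumptions (PyTorch; every name and shape here is illustrative, nothing from the actual LCZero code):

```python
import torch
import torch.nn as nn

class MixedTrunk(nn.Module):
    """Flatten a conv trunk's output and concatenate it with an MHA branch."""
    def __init__(self, planes=112, c=64, d=64, heads=4):
        super().__init__()
        self.conv = nn.Conv2d(planes, c, 3, padding=1)      # stand-in for the conv tower
        self.embed = nn.Linear(planes, d)                   # per-square token features
        self.mha = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, x):                                   # x: (batch, planes, 8, 8)
        cnn_part = self.conv(x).flatten(1)                  # (batch, c * 64)
        tokens = self.embed(x.flatten(2).transpose(1, 2))   # (batch, 64, d)
        attn_part, _ = self.mha(tokens, tokens, tokens)
        # concatenated vector that the value/policy heads would consume
        return torch.cat([cnn_part, attn_part.flatten(1)], dim=1)
```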
The disadvantage of something like this is certainly the time investment. Finding out whether it works at all shouldn't be too difficult, but finding out whether it works significantly better takes time. And if it were deemed worth it, LCZero would need to be heavily extended, since it doesn't rely on a framework that already supports MHA (at least the last time I looked at the source code, this wasn't the case).
As usual, it would take the initiative of a developer to take on the project.

DBg

Jun 3, 2020, 9:08:34 PM
to LCZero
What are the principles behind MHA, in contrast with the current CNN architecture? I also wonder whether anybody has looked at the actual representation within the common tower sub-network (basically the CNN), in order to measure whether there is actual waste. I am not aware of any such studies... they may exist somewhere, or remain to be done.


high-level representation of the board

I assume you mean the input state space. How high is the level of representation there? Do you mean high-dimensional? Sparse? If the layers are wasteful, and the input space is already expanded enough to expose the boundaries between best games and random (legal) games (or between legal outcomes), why not plug the value and policy heads directly onto it?

I do not know what self-attention means in machine terms; could you share the gist of it? I understand the notions the words appeal to: attention, such a precious thing. Not kidding.
Is this where I should understand your comment as saying MHA could make associations through denser connections? "Right away" would then mean no smoothing. See where I get stuck?

Maybe I need to read about it, but please try a hint; maybe it will click. The "self" is intriguing. It looks as though this is not only architectural, but algorithmic as well: some knowledge of how well the updates are going. I will stop before making more wrong hypotheses (although I like to guess and build hypotheses)...

If there is some kind of self-measure during learning, that would mean advances in systems for measuring network representations; my English may lack the proper term, but your comment does touch on the issue of assessing how the representation has been transformed through the layers of the tower, before the final dense layers. Either way, I am interested to know more about the basis of your suggestion. Thanks for humoring my question.

How many heads in total? And what kind of training sequence of events would that entail?

DBg

Jun 3, 2020, 9:21:37 PM
to LCZero
Bloody uneditable typos. I meant that "right away" would mean making statistical associations from the probably very jagged boundaries (for initial moves leading to one of the outcomes). I kind of view the convolutions as smoothing while keeping a high-level representation, which allows the last dense layers an easier task of assigning probabilities as continuous (order-preserving, if you must) functions of neighboring positions (as represented to the RL heads through the multilayer convolutions). I may be wrong in that view. And perhaps I should study transformers? ... Well, that's it. I will surf a bit. But a contrasting statement could debug me; it does not have to be watertight. Intuition is OK...

DBg

Jun 3, 2020, 9:48:46 PM
to LCZero
Do you have examples where the reality being modeled is not sequential? When you relate distant positions across the board, are you thinking of some sequential scanning?
Or is the chess move sequence itself the object that would allow the natural-language machine-learning framework to be applicable? I have not finished surfing, but I have seen recurrent neural networks; perhaps looking at chess games as histories of moves could bring something. But I am stuck on the "right away". More as I read more...

Cristian Garcia

Jun 4, 2020, 12:50:54 AM
to DBg, LCZero
I'd suggest starting with the paper Attention is All You Need.
More relevant to this: AlphaStar and OpenAI's bots already use attention.


DBg

Jun 4, 2020, 4:52:05 AM
to LCZero
Thank you. I am actually looking there. I realize that there is some economy in architectural complexity. I still need to understand what attention means (for now I think I understand it as solving a dilemma of time scales of analysis: some short-term-memory-based attention for incoming information, but also some longer-term memory for long-range associations; these are the capacities needed for natural-language tasks, the notion of long-range context affecting short-range analysis or interpretation).

I understand that this sort of temporal split of attention (and perhaps some robust pattern of weighing the time scales; hand-waving here) may also be part of continuous-time or real-time video games: digesting the evolving environment with incoming new data, while keeping some plan of action, or learning what to do in the long run, may bring up the same kind of theme. But for a laid-back, abstract-tempo, sequential (turn-based) game, I wonder.

I am not sure that I need to answer the implementation questions to see that it would not be a replacement for the current chess architecture, if I am right about what that new way of approaching dynamic environments has been solving (the apparent conflict of time scales, and careful divided-attention budgeting (weighing) that may have been wasted in a wall-to-wall carpet of attention before)...

My current guess would be that it could help with another level of chess gaming, one that deals with various levels of opponent players, or various styles. That kind of information would be essentially sequential in nature. While the abstract game of chess, being completely determined by the current position, does not need to know the past moves and move-alternating pairs, that history might provide context for another level of learning: adjusting "tactics" or positional preferences according to the opponent-probing that each move from the start has allowed. Not the position-based deterministic game of chess, but the diversity of players in strength or style (not talking about preparing for a specific player; still game-only information): the sequence could be a way to gather information about the strength of the opponent, or their "style" and likely responses, feeding a possibly new level of probability choices in the policy. No longer policy per position, but policy per position, per move history...

I am going to satisfy my curiosity about attention implementations nonetheless. I hope I made sense, and am not too far out of plausibility. I would still like to understand your "right away" across-the-board idea; what prompted that? Is that what a transformer can do: find the right thing to look at right away? Maybe reading more will answer that without getting lost in the details of implementation. Transformers are better suited to long-range associations? That is my current hypothesis for reading... hoping it is not only a question of time scale, though. Nice discussion idea anyway.



DBg

Jun 6, 2020, 2:15:38 AM
to LCZero
I fast-forwarded to my question. I still don't know exactly what multi-head self-attention means in precise terms of architecture and training algorithm.
But to answer my own question about whether time series are the only place where this would apply: I now want to read the following paper first. Here are the link and pasted abstract.

Recent trends of incorporating attention mechanisms in vision have led researchers to reconsider the supremacy of convolutional layers as a primary building block. Beyond helping CNNs to handle long-range dependencies, Ramachandran et al. (2019) showed that attention can completely replace convolution and achieve state-of-the-art performance on vision tasks. This raises the question: do learned attention layers operate similarly to convolutional layers? This work provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice. Specifically, we prove that a multi-head self-attention layer with sufficient number of heads is at least as expressive as any convolutional layer. Our numerical experiments then show that self-attention layers attend to pixel-grid patterns similarly to CNN layers, corroborating our analysis. Our code is publicly available
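To illustrate the expressiveness claim in that abstract, here is a toy sketch of the construction as I understand it (NumPy; purely illustrative, not the authors' code): give one head to each relative offset of a K x K receptive field and let each head copy the feature at its offset; a learned linear map over the concatenated heads then computes exactly what a K x K convolution computes.

```python
import numpy as np

K, H, W, C = 3, 8, 8, 16
offsets = [(di, dj) for di in range(-(K // 2), K // 2 + 1)
                    for dj in range(-(K // 2), K // 2 + 1)]   # 9 heads for a 3x3 field

def hard_attention(x):
    """Head h copies the feature at its fixed relative offset (zero padding)."""
    out = np.zeros((H, W, len(offsets) * C))
    for h, (di, dj) in enumerate(offsets):
        for i in range(H):
            for j in range(W):
                ii, jj = i + di, j + dj
                if 0 <= ii < H and 0 <= jj < W:
                    out[i, j, h * C:(h + 1) * C] = x[ii, jj]
    return out   # a learned linear map on this output equals a 3x3 convolution

x = np.random.randn(H, W, C)
y = hard_attention(x)          # (8, 8, 144): the exact information a 3x3 conv sees
```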

DBg

Jun 6, 2020, 8:07:15 PM
to LCZero
I am not sure that I need to answer the implementation questions to see that it would not be a replacement for the current chess architecture, if I am right ...

Ding ding ding! I have had some good reading from the article linked below, written in matrix notation (more readable to me than code; it is easier to understand the whole picture without indices dancing around like in cartoons indicating concussion; at least the indices are not propagated everywhere, but defined once, and then one can think at the matrix level).

I am not sure that I need to look at the implementation articles before the math articles that establish the precise definitions of attention (so we can filter out what is hype, enthusiast inflation, or just branding pressure).

I was wrong about "it would not be a replacement of the current architecture". I still hold the OP's "right away", "high-level representation", and "wasteful" as needing clarification (please, OP, give me some story).
Also, I still don't get why "self" and why "attention", but it does not matter; I have the math at hand, which settles the fuzz.

I will explain in a further post why that is, using the math from the paper below.

My current understanding is that it would be a practical thing to do, and could actually be made through a smooth transition, keeping all previous results from the history of the design of experiments. The main result of the paper below is that the convolution tower's mathematical function space (the functions spanned by the architecture and its admissible weights) can be shown to be equivalent to a subset of that of multi-head self-attention.

Fixed-time sequences can't be exchanged for a re-ordering of the pixels in an image, with some adjustments to that input redefinition (but no transformation of information at the input level).

So the current architecture, and past ones, with respect to the "residual" tower being fed to the RL heads, is already included as a particular case of multi-head self-attention. I am going to test that and make it clear in another post after a second pass at the article, toward asking any onlooker from lc0 who can speak math in the same notation as the article to provide a translation of all the previous lc0 convolution towers for which data is still accessible. Then I could translate those, using the theorem or its proof, and rewrite them as MHSAs.

That would make all the previous hyper-parameter exploration (for which data is still accessible) part of the hyper-parameter exploration of a hypothetical larger design of experiments. It would at least give some starting subset from which to make improvements, toward empirically testing the OP's claims as possible reasons for improvement, or other causes of improvement. It is actually not yet obvious to me, since 2D convolutions can be written as MHSAs; hence, just having an MHSA architecture is not enough, by definition.

Warning: I may have interpreted the theorem statements too optimistically, hence the second pass and the presentation of the notation, for any courageous math-speaking dev who would like to bridge the notations for me or for themselves using that same paper. The bonus would be a new level of documentation, more abstract, yes, but one that could fit in a few lines; the actual layer diagrams could serve as illustration, but for those who get lost in too many independent items at once, flexible math notation would help a lot. Tensors are ugly when fully dressed...

DBg

Jun 6, 2020, 8:12:03 PM
to LCZero
Fixed-time sequences can't be exchanged for a re-ordering of the pixels in an image, with some adjustments to that input redefinition (but no transformation of information at the input level).

Sorry: they can definitely be replaced by a fixed-size sequence of pixels. So, not a problem; the paper is actually about that, for images. Since the lc0 input is organized as board planes for each piece, the procedure of the paper is totally applicable.

Cristian Garcia

Jun 6, 2020, 8:28:38 PM
to DBg, LCZero
The paper you cite uses relative embeddings, which are a way of regaining the translation-invariance properties you have in convolutional layers. Given the board is only 8x8, I think a regular (learned) positional embedding would be good enough.

My point is more about the fact that convolutions are good at localized pattern extraction, while self-attention is better at global relational "reasoning", which I believe suits this task better. However, since Leela doesn't use a deep learning framework as such, and training takes a lot of time, the experiment is less likely to happen unless DeepMind/OpenAI publishes amazing results using this method.


DBg

Jun 6, 2020, 8:35:47 PM
to LCZero
Following are two links to less mathematically oriented explanations.
There are some very detailed interactive diagrams as well, which might reflect a more recent tradition of describing NN architectures (I have not been following the trends for many years, so the math, being stable over the years, is easier, up to a point).

I encourage more implementation-aware people to go to these sites and assess how plausible my claim is that the new architecture suggested by the OP would be an extension of the current ones, and that it would only require rewriting the current architecture in compatible notation, if that is not already the case but simply not published or documented.

I will concentrate on the theorem of the paper, and perhaps understand where exactly the many heads are assigned, etc., and what the two conditions below actually mean. I have to go from my notion of convolution to the discrete version, which gets assigned various names like receptive field, filter, and pooling; they might all look the same in math. Also, there is still an unavoidable plethora of indices to follow, but they are the strict minimum required to establish the relation between the two functional bases.

Visualization of Self-Attention Maps in Vision

How a self-attention layer can learn convolutional filters?

The two most crucial requirements for a self-attention layer to express a convolution are:

  • having multiple heads to attend to every pixel of a convolutional layer's receptive field,
  • using relative positional encoding to ensure translation equivariance.
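A toy sketch of the second requirement as I understand it (NumPy; illustrative only, not the paper's code): make the attention logits depend only on the relative offset between the query and key squares, so that shifting a pattern across the board shifts the attention with it.

```python
import numpy as np

H, W = 8, 8
rng = np.random.default_rng(0)
# one learned scalar logit per relative offset (random stand-ins here)
rel_logit = {(di, dj): rng.standard_normal()
             for di in range(-H + 1, H) for dj in range(-W + 1, W)}

logits = np.zeros((H * W, H * W))        # 64x64: every square vs every square
for qi in range(H):
    for qj in range(W):
        for ki in range(H):
            for kj in range(W):
                logits[qi * W + qj, ki * W + kj] = rel_logit[(ki - qi, kj - qj)]

# row-wise softmax gives a translation-equivariant attention pattern
A = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
```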

DBg

Jun 6, 2020, 9:02:04 PM
to LCZero
I am sorry, I was writing while you were. I think the point I want to make out of the paper is that the current architecture can already be looked at through what you propose.

Then there is nothing preventing the developers from keeping what they are doing, or doing experiments with other MHSA hyper-parameters that correspond to your idea (which I may come to understand better, but do not yet see precisely, perhaps because "embeddings" is already a loaded term). Let me try to improve my understanding.

Relative embeddings = relative positional encoding? This is not only to get translation invariance, but to keep the order information as part of what will affect the machine's decisions and its training; translation invariance is possibly a corollary, but I don't see it as the objective (although that is how images are filtered through layers, only transmitting changes).

Perhaps in a bag of words there is enough information for the learning tasks, and perhaps the order of words in a sentence is not crucial to certain interpretation tasks (I am not claiming understanding here, just struggling to understand by proposing arguments; feel free to correct).

However, without the relative positional encoding, what I read is the following:

A key property of the self-attention model described above is that it is equivariant to reordering, that is, it gives the same output independently of how the input pixels are shuffled. This is problematic for cases we expect the order of input to matter. 

This may very well be important here, and may depend on the particular properties of the set of images and the nature of the learning task. If there is a lot of information even in averaged projections of all the pixels, so that re-ordering would not obscure the differences between test images belonging to classes well discriminated that way, then yes, maybe order preservation would be superfluous overhead.

I don't think that a representation transformation that blends re-orderings into the same output presented to the RL heads would, for chess, have kept the high-level representation you mentioned. I may have messed up some details here, but am I understanding OK?

Dariouch Babaï

Jun 6, 2020, 9:48:11 PM
to lcz...@googlegroups.com
Actually, I re-read my past post, and maybe I don't understand what output invariance to re-ordering of the pixels means.

As long as the chosen pixel order is fixed, maybe the representation only needs the mutual relationships between the squares. Any ordering before the self-attention multi-heads would still capture the mutual relationships between the squares, if the next transformation does not pool them (and unless the heads are themselves multi-valued, I don't see how the mutual relationships between squares would not be blended into each attention head's one-dimensional output).

So maybe the attention-head definition is my problem... back to the paper.


Cristian Garcia

Jun 6, 2020, 9:51:28 PM
to DBg, LCZero
MHA is invariant to the order of the elements, as you've read; to give order information back, you encode position somehow. The easiest way is just to add (sum) a learned embedding (vector) per position/pixel to the corresponding values at those positions.

Example of what MHA could provide:

If you have a bishop, hopefully the network would learn to attend to pieces on the diagonals from it, or to pieces that are attacking it, while a 3x3 2D convolution can only look within a 3x3 neighborhood. A single attention layer can relate (learn to attend to) each tile of the board with every other tile of the board in a single pass; convolutional networks need more layers to be able to relate all the pieces.
If you look at the equations, notice the attention matrix A is NxN; in chess it would be 64x64.
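A tiny sketch of both points (PyTorch; shapes and names are just for illustration):

```python
import torch
import torch.nn as nn

squares, d = 64, 128
pos_embedding = nn.Parameter(torch.randn(squares, d))  # one learned vector per square

x = torch.randn(1, squares, d)       # piece features, board flattened to 64 tokens
x = x + pos_embedding                # re-inject square identity (order information)

Wq, Wk = nn.Linear(d, d), nn.Linear(d, d)
q, k = Wq(x), Wk(x)
A = torch.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=-1)
print(A.shape)                       # torch.Size([1, 64, 64]): every tile vs every tile
```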


Dariouch Babaï

unread,
Jun 6, 2020, 10:24:08 PM6/6/20
to Cristian Garcia, LCZero
Thank you for helping me getting through this.

Do I understand correctly that you are displacing the training that would have gone through a certain number of CNN layers (in order to cover all the squares of the whole board and their possible statistical interactions) onto the learned embedding per position?

So now I need to understand what that entails. The learned embedding per position would be something fixed before the RL training and the MHA optimization. It would need a training set of positions. The embedding couldn't be learned on the go, concurrently with the RL training; it would have to be based on a carefully chosen training set of positions.

Besides the learning question.

I understood an embedding as the by-product of some supervised batch training, e.g. using some discrimination task, with a CNN architecture containing the embedding once the dense classification head is removed. Would the vector be this entire tower?

If the above is right, this would actually be a compromise with the zero approach, in that you would already have an idea of what the correct training set (and embedding-learning task) would be to get an embedding valid for all of the training.

It might be possible that, with the abundant self-play training data available, one could develop the proper embedding training-set definition to represent chess space in the best way for the RL training to proceed; so not really far from the zero approach.

But why would the training toward finding the embedding be less work than doing it on the go, while the RL proceeds, with a CNN trunk?

Where did I bug this time? Thanks.




Warren D Smith

Jun 7, 2020, 12:21:13 AM
to LCZero
Well, if I had multiple heads, I'd be giving myself some self-attention.

Just saying: these guys have no taste whatever when it comes to naming things.
It is really hard to be that bad. Really, really, really hard.

--
Warren D. Smith
http://RangeVoting.org <-- add your endorsement (by clicking "endorse" as 1st step)

Dariouch Babaï

Jun 7, 2020, 2:19:56 PM
to Warren D Smith, LCZero
Thank you, I only just understood that you were not merely joking, but actually giving me a hint about where my confusion about attention heads came from (I should just look at the equations and forget the branding, the scientific noise for funding overseers).

Because I had internalized the meaning of "heads" from the lc0 definitions of value head and policy head, I prioritized that understanding over just looking at the equations of the attention object in matrix notation. Forget the distractions, particularly the word "head".

Calling an inner layer a "head" may have been a marketing mistake; it is not a head, unless one can grow heads anywhere on the body, or within organs. I have a head in my lungs; wait for it, it was just coughing because it heard me talking about it...

"Attention" could be OK, but no need to mystify it. "Self" is not clear; it seems superfluous at this point in my understanding.

I should work from the equations up, meaning I need to find some way of expressing math in this Google group, or work with images, one per equation; if I need to refer to one, I could just re-paste it as needed. as suggested for communication efficiency with equations?



Dariouch Babaï

Jun 7, 2020, 4:49:19 PM
to Warren D Smith, LCZero
as suggested for communication efficiency with equations?

I meant, "any suggestion for ....."  Sorry, high-level typos seem to be my thing (I must have some heads of their own in my hands).

Dariouch Babaï

Jun 8, 2020, 11:04:27 AM
to Cristian Garcia, LCZero
I have been educating myself some more, and have focused on papers about vision tasks. I don't want to learn the whole NLP field, but I read enough here and there to make progress, and to accept some of the new terminology like "attention" (though it may also have been overloaded; there are many variations, and I have not seen any heads yet, only inner layers, or stems).

It seems that you are right about the self-attention piecewise-defined layers (or multi-stems, or multi-layers). I am starting to understand why "self" was used in the context of NLP, with encoders and decoders: because of the possibility of one sentence being dot-producted (?) with itself.

Also, there is a strong possibility that adding such layers (that is what they are, for the task at hand here, at least in the minimal proposition of adding such a component) would actually allow easier capture of cross-correlations over longer distances, for the long-range pieces; you are also right that these can be captured by convolutions, at bigger cost and in a roundabout way.

As for the papers about performance, it seems that the combination of CNN layers close to the input and attention layers higher up offers the best improvement when long-range interactions are expected to be important, and that this way of introducing attention (I think that would be called feature attention... not sure) would be the right division of labor between inner-layer architectures.

The CNN part would keep the squares' local relationships (otherwise the chessboard becomes a fully connected graph, and the spatial connectivity of the squares would have to be learned somehow; which I guess could be done through some learned embedding injected at the input level, or data augmentation borrowing the representation transformation from another machine-learning task, hoping that task was relevant to the RL tasks... maybe such embeddings already exist... anybody? Where could we find such an embedding, I wonder...).

I, like others, have been thinking that the training may not have been very efficient at capturing the ranges of the variables for the different piece types, but with enough CNN layers that can be captured, the long way around perhaps; that is the gist of your "right away", right? I will not insist on clarifying the statements about a "high-level enough representation" and the CNN layers being a bit wasteful, as there is no way to measure that; we can only try architecture mutations and test for performance, until some metric for the various levels of representation transformation is constructed or proposed that could somehow make sense of whether the state space has been made more decidable (in a probabilistic sense, loosely speaking) or not by the new pre-heads tower.

But long range does not mean no local range. The pawns are still the slowest, and they pretty much unanimously shape the overall broad strokes of chess games (walking on a minefield here, trying not to use chess terms in vain...).

Also, making some room for different weights capturing the cross-correlations per piece type would come naturally with the dynamically updated multi-attention-"head" concatenation weights (and the other attention-prescribing parameters), which convolution layers lack (all input planes are cross-correlated with the same parameters, if I am not mistaken).

Attention Augmented Convolutional Networks
2019
We introduce a novel two-dimensional relative self-attention mechanism that proves competitive in replacing convolutions as a stand-alone computational primitive for image classification. We find in control experiments that the best results are obtained when combining both convolutions and self-attention.

Stand-Alone Self-Attention in Vision Models, 2019:
We introduce a novel two-dimensional relative self-attention mechanism that proves competitive in replacing convolutions as a stand-alone computational primitive for image classification. We find in control experiments that the best results are obtained when combining both convolutions and self-attention.
("Competitive" translates to slightly worse than pure convolution, in the body of the paper.)
Also, here it seems that the convolution stem was parallel to the attention one; the previous one, I think, was serial. I have not looked into the details of how the attention layer was partitioned over the input vector (made multi-piece; OK, heads).

Both seem to point to the need for both ingredients. I may still have some misconceptions; please point them out to me, whoever reads this. But I am mostly reporting what those papers claim, with very little interpretation of their results, which seem clear.
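To fix ideas on the "augmented" combination those papers converge on, here is a hedged sketch (PyTorch; my reading of the idea, with illustrative names and shapes, not the authors' code): concatenate convolutional feature maps with feature maps produced via self-attention.

```python
import torch
import torch.nn as nn

class AugmentedConv(nn.Module):
    """Concatenate conv feature maps with self-attention feature maps (sketch)."""
    def __init__(self, c_in=64, c_conv=48, c_attn=16, heads=4):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_conv, 3, padding=1)   # local pattern extraction
        self.proj = nn.Conv2d(c_in, c_attn, 1)              # project into attention width
        self.attn = nn.MultiheadAttention(c_attn, heads, batch_first=True)

    def forward(self, x):                                   # x: (batch, c_in, 8, 8)
        local = self.conv(x)                                # (batch, c_conv, 8, 8)
        t = self.proj(x).flatten(2).transpose(1, 2)         # (batch, 64, c_attn)
        g, _ = self.attn(t, t, t)                           # global interactions
        g = g.transpose(1, 2).reshape(x.size(0), -1, 8, 8)
        return torch.cat([local, g], dim=1)                 # (batch, c_conv + c_attn, 8, 8)
```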


Dariouch Babaï

Jun 9, 2020, 11:13:55 AM
to Dariouch Babaï, Cristian Garcia, LCZero
One last note about the "stand-alone" paper, from the discussion or conclusion, to put it in perspective with respect to the first one:

While this work primarily focuses on content-based interactions to establish their virtue for vision tasks, in the future, we hope to unify convolution and self-attention to best combine their unique advantages.

Dariouch Babaï

Jun 9, 2020, 12:18:25 PM
to Cristian Garcia, LCZero
Sorry, I re-read my last email/post, and with all my hindsight re-editing I ended up doing worse than if I had dumped it as-is.
Which also means, in hindsight, that either perpetual motion is my only salvation, or that I just correct myself here with the actual verbatim abstracts (in the order of the links below). How is that?
Convolutional networks have been the paradigm of choice in many computer vision applications. The convolution operation however has a significant weakness in that it only operates on a local neighborhood, thus missing global information. Self-attention, on the other hand, has emerged as a recent advance to capture long range interactions, but has mostly been applied to sequence modeling and generative modeling tasks. In this paper, we consider the use of self-attention for discriminative visual tasks as an alternative to convolutions. We introduce a novel two-dimensional relative self-attention mechanism that proves competitive in replacing convolutions as a stand-alone computational primitive for image classification. We find in control experiments that the best results are obtained when combining both convolutions and self-attention. We therefore propose to augment convolutional operators with this self-attention mechanism by concatenating convolutional feature maps with a set of feature maps produced via self-attention. Extensive experiments show that Attention Augmentation leads to consistent improvements in image classification on ImageNet and object detection on COCO across many different models and scales, including ResNets and a state-of-the art mobile constrained network, while keeping the number of parameters similar. In particular, our method achieves a $1.3\%$ top-1 accuracy improvement on ImageNet classification over a ResNet50 baseline and outperforms other attention mechanisms for images such as Squeeze-and-Excitation. It also achieves an improvement of 1.4 mAP in COCO Object Detection on top of a RetinaNet baseline
gan3sh500/attention-augmented-conv
Tensorflow
leaderj1001/Attention-Augmented-Conv2d
Pytorch
Convolutions are a fundamental building block of modern computer vision systems. Recent approaches have argued for going beyond convolutions in order to capture long-range dependencies. These efforts focus on augmenting convolutional models with content-based interactions, such as self-attention and non-local means, to achieve gains on a number of vision tasks. The natural question that arises is whether attention can be a stand-alone primitive for vision models instead of serving as just an augmentation on top of convolutions. In developing and testing a pure self-attention vision model, we verify that self-attention can indeed be an effective stand-alone layer. A simple procedure of replacing all instances of spatial convolutions with a form of self-attention applied to ResNet model produces a fully self-attentional model that outperforms the baseline on ImageNet classification with 12% fewer FLOPS and 29% fewer parameters. On COCO object detection, a pure self-attention model matches the mAP of a baseline RetinaNet while having 39% fewer FLOPS and 34% fewer parameters. Detailed ablation studies demonstrate that self-attention is especially impactful when used in later layers. These results establish that stand-alone self-attention is an important addition to the vision practitioner's toolbox.
JoeRoussy/adaptive-attention-in-cv
 
And I apologize for the confusion caused by browsing the two papers in the same time frame (blaming fatigue, etc., and just uninhibited chaos), plus hindsight post-editing and typing collisions (good excuses, right?). But both papers are insightful for understanding each architectural mutation with regard to the representation the NN builds. We learn about ResNet as a side effect, since that is the basis for the comparison. The appendices are there for the math-inclined among us, to build a better understanding of what is going on in lc0 itself, and to tune everybody's language (I mean, since those papers keep an appendix for those of us limited to that kind of notation, it should be a good language level to tune our violins with).

The remainder of the comments in the previous email about the papers should be free of confusion, but just read the papers when in doubt...


Also, here it seems that the convolution stem was parallel to the attention one; the previous one, I think, was serial. I have not looked into the details of how the attention layer was partitioned over the input vector (made multi-piece; OK, heads).

Both seem to point to the need for both ingredients. I may still have some misconceptions; please point them out to me, whoever reads this. But I am mostly reporting what those papers claim, with very little interpretation of their results, which seem clear.

I strongly suggest anybody give a long look at those papers. Friction and confrontation of ideas is good (just keep the egos out of it). And jargon is a disease to keep in check for any specialist dipping their toes into a quest where many different specialists need to play and act with other specialists to optimize progress. We need to build some bridges, or be bound to always follow DeepMind or Google innovations rather than making our own. Maybe I am preaching to the choir; maybe not yet. I believe in open-source code, data, and science (to plug the three holes in the AlphaZero publications and make it full scientific progress).

PS (for real): one last note about the "stand-alone" paper, from the discussion or conclusion, to put it in perspective with respect to the first one.