Regarding attention


Ravi Jain

Nov 1, 2019, 6:09:25 AM
to tensor2tensor
Why is 
```
Q*(K).t()
```
(where t() means transpose) used in attention, and not, say,
```
Q*(Q+K).t()
```
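To make the question concrete, here is a minimal sketch (assumed toy shapes, random tensors, no scaling or softmax) of the standard score Q*(K).t() next to the proposed Q*(Q+K).t():
```
import torch

# Toy sizes, assumed just for illustration: 4 positions, 8-dim representations.
n, d = 4, 8
Q = torch.randn(n, d)   # queries
K = torch.randn(n, d)   # keys

# Standard (unscaled) dot-product attention scores: Q*(K).t()
scores_standard = Q @ K.t()           # shape (n, n)

# The alternative asked about: Q*(Q+K).t()
scores_alternative = Q @ (Q + K).t()  # shape (n, n)

print(scores_standard.shape, scores_alternative.shape)
```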
For example, suppose we have two pixels, black and white, and we want to represent each ordered combination of them differently:

```
black -> (Q)    white -> (K)

black black -> (QQ)
black white -> (QK)
white black -> (KQ)
white white -> (KK)
```
```
Q*(K).t()
```
will give the same result for
```
black white
```
and 
```
white black
```
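Here is a tiny numeric sketch of that collision, treating black and white as one-dimensional representations with arbitrarily chosen values:
```
import torch

q = torch.tensor([2.0])  # black (arbitrary toy value)
k = torch.tensor([3.0])  # white (arbitrary toy value)

# The plain dot product is symmetric, so the two mixed pairs collide.
print(torch.dot(q, k))  # black white -> 6.0
print(torch.dot(k, q))  # white black -> 6.0
```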

whereas if we do
```
Q*(Q+K).t()
```
then all four would be different. Another option could be
```
Q*(Q-K)
```
but then
```
black black
white white
```
would be the same (both zero), or
```
Q*K*K
```
, but that would be more computationally expensive than
```
Q*(Q+K)
```
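With the same toy values, a quick sketch of how Q*(Q+K) separates all four ordered pairs while Q*(Q-K) sends both identical-pixel pairs to zero (and Q*K*K would simply cost one extra multiplication per score):
```
import torch

q = torch.tensor([2.0])  # black (arbitrary toy value)
k = torch.tensor([3.0])  # white (arbitrary toy value)

pairs = {"black black": (q, q), "black white": (q, k),
         "white black": (k, q), "white white": (k, k)}

for name, (a, b) in pairs.items():
    # a*(a+b): 8, 10, 15, 18 -- all four distinct for these values.
    # a*(a-b): 0, -2, 3, 0   -- black black and white white collide.
    print(name, float(a @ (a + b)), float(a @ (a - b)))
```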

or 

```
(Q+K)
```
but then,
```
black white
white black
```
would be the same,

or

```
(Q-K)
```
but then,

```
white white
black black
```
would be the same (again both zero),
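The same enumeration for the purely additive candidates (Q+K) and (Q-K), again with the assumed toy values:
```
q, k = 2.0, 3.0  # black, white (arbitrary toy values)

pairs = {"black black": (q, q), "black white": (q, k),
         "white black": (k, q), "white white": (k, k)}

for name, (a, b) in pairs.items():
    # a+b: black white and white black both give 5.0.
    # a-b: black black and white white both give 0.0.
    print(name, a + b, a - b)
```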

or only
```
Q
```
or only
```
K
```
but then the score would not depend on the other pixel at all, so the combinations could not be told apart,

or concatenate Q and K together, but then more computation would be required to carry out the subsequent operations, since the size has increased.
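Finally, a sketch of that last point: concatenating Q and K keeps all the information, but it doubles the width that every later operation has to process (toy shapes assumed, with a hypothetical follow-up projection just to show the cost):
```
import torch

n, d = 4, 8
Q = torch.randn(n, d)
K = torch.randn(n, d)

# Concatenating doubles the feature dimension...
QK_cat = torch.cat([Q, K], dim=-1)   # shape (n, 2*d)

# ...so a follow-up projection (hypothetical, for illustration only)
# has roughly twice the multiply-adds of one acting on Q or K alone.
proj = torch.nn.Linear(2 * d, d)
out = proj(QK_cat)
print(QK_cat.shape, out.shape)       # torch.Size([4, 16]) torch.Size([4, 8])
```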