Why is
```
Q*(K).t()
```
(where `t()` means transpose) used in attention, and not, for example,
```
Q*(Q+K).t()
```
Suppose we have two pixels, black and white, and we want to represent each combination of them differently:
```
black       -> (Q)
white       -> (K)
black white -> (QK)
white black -> (KQ)
black black -> (QQ)
white white -> (KK)
```
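For concreteness, here is a minimal sketch (in PyTorch, which the `t()` notation suggests) of the score matrix the question refers to; the 2-D pixel vectors and the identity projections are made-up assumptions, only for illustration:
```python
import torch

# Made-up 2-D vectors standing in for the two pixel values.
X = torch.tensor([[1.0, 2.0],    # black
                  [0.5, 1.5]])   # white

# Identity projections assumed, so the queries and keys are just
# the pixel vectors themselves.
Q, K = X, X

scores = Q @ K.t()   # the Q*(K).t() score matrix from the question
print(scores)
# tensor([[5.0000, 3.5000],
#         [3.5000, 2.5000]])
```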
```
Q*(K).t()
```
will give the same result for
```
black white
```
and
```
white black
```
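A minimal check of this claim, assuming, as in the toy example, that no separate learned projections are applied, so the score for an ordered pair is just the dot product of the two pixel vectors:
```python
import torch

# Same made-up pixel vectors as in the sketch above.
black = torch.tensor([1.0, 2.0])
white = torch.tensor([0.5, 1.5])

# Dot-product score for each ordering of the mixed pair.
score_bw = black @ white   # "black white"
score_wb = white @ black   # "white black"

# The dot product is commutative, so both orderings get the same score.
print(score_bw.item(), score_wb.item())   # 3.5 3.5
```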
whereas if we do,
```
Q*(Q+K).t()
```
then all four would be different (compare the sketch at the end of this question). Other options could be
```
Q*(Q-K)
```
but then
```
black black
white white
```
would be the same, or
```
Q*K*K
```
but that would be more computationally expensive than
```
Q*(Q+K)
```
or
```
(Q+K)
```
but then,
```
black white
white black
```
would be the same,
or
```
(Q-K)
```
but then,
```
white white
black black
```
would be the same,
or only
```
Q
```
or only
```
K
```
but then all four would be the same,
or we could concatenate Q and K together, but that would mean more computation would be required to carry out this operation again, since the size has increased.
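To make the comparison concrete, here is a sketch that evaluates each candidate formulation above on all four ordered pairs, using the same made-up vectors and reading the first pixel of each pair as the query and the second as the key (both are assumptions, not something the toy example pins down):
```python
import torch

# Made-up vectors for the two pixel values (any non-degenerate choice works).
q = torch.tensor([1.0, 2.0])   # black
k = torch.tensor([0.5, 1.5])   # white

pairs = {
    "black white": (q, k),
    "white black": (k, q),
    "black black": (q, q),
    "white white": (k, k),
}

# First vector of each pair is read as the query, second as the key.
for name, (first, second) in pairs.items():
    print(
        name,
        "| Q.K =", (first @ second).item(),
        "| Q.(Q+K) =", (first @ (first + second)).item(),
        "| Q.(Q-K) =", (first @ (first - second)).item(),
        "| Q+K =", (first + second).tolist(),
        "| Q-K =", (first - second).tolist(),
    )

# With these numbers: Q.K collides for "black white" / "white black",
# Q.(Q+K) separates all four pairs, Q.(Q-K) and Q-K collide for the
# two same-colour pairs, and Q+K collides for the two mixed pairs.
```
Concatenating the two vectors instead would keep all four pairs distinct, but it doubles the width of the representation, which is the extra cost the last option points at.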