Why is
```
Q*(K).t()
```
(where `t()` means transpose) used in attention, and not, for example,
```
Q*(Q+K).t()
```
Suppose we have two pixels, black and white, and we want to represent each combination of them differently:
```
black       -> (Q)
white       -> (K)
black white -> (QK)
white black -> (KQ)
black black -> (QQ)
white white -> (KK)
```
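For concreteness, here is a minimal sketch (in PyTorch, which the `t()` notation suggests) of the score matrix the question refers to; the 2-D pixel vectors and the identity projections are made-up assumptions, only for illustration:
```python
import torch

# Made-up 2-D vectors standing in for the two pixel values.
X = torch.tensor([[1.0, 2.0],    # black
                  [0.5, 1.5]])   # white

# Identity projections assumed, so the queries and keys are just
# the pixel vectors themselves.
Q, K = X, X

scores = Q @ K.t()   # the Q*(K).t() score matrix from the question
print(scores)
# tensor([[5.0000, 3.5000],
#         [3.5000, 2.5000]])
```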
```
Q*(K).t()
```
will give the same result for
```
black white
```
and
```
white black
```
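A minimal check of this claim, assuming, as in the toy example, that no separate learned projections are applied, so the score for an ordered pair is just the dot product of the two pixel vectors:
```python
import torch

# Same made-up pixel vectors as in the sketch above.
black = torch.tensor([1.0, 2.0])
white = torch.tensor([0.5, 1.5])

# Dot-product score for each ordering of the mixed pair.
score_bw = black @ white   # "black white"
score_wb = white @ black   # "white black"

# The dot product is commutative, so both orderings get the same score.
print(score_bw.item(), score_wb.item())   # 3.5 3.5
```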
whereas if we do,
```
Q*(Q+K).t()
```
then all four would be different (compare the sketch at the end of this question). Other options could be
```
Q*(Q-K)
```
but then
```
black black
white white
```
would be the same, or
```
Q*K*K
```
but that would be more computationally expensive than
```
Q*(Q+K)
```
or
```
(Q+K)
```
but then,
```
black white
white black
```
would be the same,
or
```
(Q-K)
```
but then,
```
white white
black black
```
would be the same,
or only
```
Q
```
or only
```
K
```
but then all four would be the same,
or we could concatenate Q and K together, but that would mean more computation would be required to carry out this operation again, since the size has increased.
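To make the comparison concrete, here is a sketch that evaluates each candidate formulation above on all four ordered pairs, using the same made-up vectors and reading the first pixel of each pair as the query and the second as the key (both are assumptions, not something the toy example pins down):
```python
import torch

# Made-up vectors for the two pixel values (any non-degenerate choice works).
q = torch.tensor([1.0, 2.0])   # black
k = torch.tensor([0.5, 1.5])   # white

pairs = {
    "black white": (q, k),
    "white black": (k, q),
    "black black": (q, q),
    "white white": (k, k),
}

# First vector of each pair is read as the query, second as the key.
for name, (first, second) in pairs.items():
    print(
        name,
        "| Q.K =", (first @ second).item(),
        "| Q.(Q+K) =", (first @ (first + second)).item(),
        "| Q.(Q-K) =", (first @ (first - second)).item(),
        "| Q+K =", (first + second).tolist(),
        "| Q-K =", (first - second).tolist(),
    )

# With these numbers: Q.K collides for "black white" / "white black",
# Q.(Q+K) separates all four pairs, Q.(Q-K) and Q-K collide for the
# two same-colour pairs, and Q+K collides for the two mixed pairs.
```
Concatenating the two vectors instead would keep all four pairs distinct, but it doubles the width of the representation, which is the extra cost the last option points at.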