Differences between UT without ACT and plain transformer with depth


Sergio Ryan

Jun 20, 2019, 4:34:09 AM
to tensor2tensor
So as far as I understand, UT replaces the "pre-determined depth" in the Transformer (that is, 6, the number of stacked layers suggested in the paper Attention Is All You Need) with recurrent connections. When ACT is not used to dynamically reduce the number of time steps, what is the difference between stacking plain Transformer layers multiple times, as in the paper, and using the recurrent connection, other than the coordinate embedding and transition function?

From the Google AI blog:
> However, now the number of times this transformation is applied to each symbol (i.e. the number of recurrent steps) can either be manually set ahead of time (e.g. to some fixed number or to the input length), or it can be decided dynamically by the Universal Transformer itself. To achieve the latter, we added an adaptive computation mechanism...

I've probably got it wrong. Can somebody clear things up for me?

Sergio Ryan

Jun 20, 2019, 4:47:33 AM
to tensor2tensor
> Note that when running for a fixed number of steps, the Universal Transformer is equivalent to a multi-layer Transformer with tied parameters across its layers.

It seems that it is the same as a Transformer with depth, but with the parameters "shared" across all the plain Transformer layers.

Sorry, this seems like a novice question, but correct me if I'm wrong.

Lukasz Kaiser

Jun 30, 2019, 1:52:34 PM
to Sergio Ryan, tensor2tensor
> It seems that it is the same as a Transformer with depth, but with the parameters "shared" across all the plain Transformer layers.
>
> Sorry, this seems like a novice question, but correct me if I'm wrong.

No - you are perfectly right, this is the main idea! There are a few
more tweaks and the ACT part, but the main part is a depth-shared
Transformer :).

Lukasz
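
The "depth-shared" idea can be illustrated with a minimal sketch. This is a toy, not the tensor2tensor implementation: `make_layer`, `apply_layer`, `stacked_forward`, `universal_forward`, and `count_params` are hypothetical names, and the "layer" is a small residual tanh map standing in for a full Transformer layer. It leaves out the coordinate embedding, the transition function, and ACT mentioned in the thread.

```python
import math
import random


def make_layer(dim, rng):
    """Create parameters for one toy layer: a dim x dim weight matrix."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]


def apply_layer(weights, x):
    """Toy stand-in for a Transformer layer: residual plus tanh(W x)."""
    return [xi + math.tanh(sum(w * xj for w, xj in zip(row, x)))
            for xi, row in zip(x, weights)]


def stacked_forward(layers, x):
    """Plain Transformer style: each depth level has its own parameters."""
    for weights in layers:
        x = apply_layer(weights, x)
    return x


def universal_forward(shared, x, steps):
    """UT without ACT: the same parameters applied a fixed number of times."""
    for _ in range(steps):
        x = apply_layer(shared, x)
    return x


def count_params(layers):
    """Count scalar parameters, counting a reused (tied) layer only once."""
    unique = {id(weights): weights for weights in layers}
    return sum(len(row) for weights in unique.values() for row in weights)


rng = random.Random(0)
dim, depth = 4, 6
x = [1.0, -0.5, 0.25, 2.0]

independent = [make_layer(dim, rng) for _ in range(depth)]  # 6 distinct layers
shared = make_layer(dim, rng)                               # 1 layer, reused
tied_stack = [shared] * depth

# A stack whose layers are all the same object matches the recurrent form.
same = universal_forward(shared, x, depth) == stacked_forward(tied_stack, x)
```

In this toy, the tied stack and the recurrent loop compute the same thing by construction, while the independent stack holds `depth` times as many parameters.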

> On Thursday, June 20, 2019 at 3:34:09 PM UTC+7, Sergio Ryan wrote:
>>
>> So as far as I understand, UT replaces the "pre-determined depth" in the Transformer (that is, 6, the number of stacked layers suggested in the paper Attention Is All You Need) with recurrent connections. When ACT is not used to dynamically reduce the number of time steps, what is the difference between stacking plain Transformer layers multiple times, as in the paper, and using the recurrent connection, other than the coordinate embedding and transition function?
>>
>> From the Google AI blog:
>>>
>>> However, now the number of times this transformation is applied to each symbol (i.e. the number of recurrent steps) can either be manually set ahead of time (e.g. to some fixed number or to the input length), or it can be decided dynamically by the Universal Transformer itself. To achieve the latter, we added an adaptive computation mechanism...
>>
>> https://ai.googleblog.com/2018/08/moving-beyond-translation-with.html
>>
>> I've probably got it wrong. Can somebody clear things up for me?

Sergio Ryan

Jun 30, 2019, 10:36:51 PM
to tensor2tensor
Thank you very much!

Just one last thing: how can a Transformer with a single attention layer that is used multiple times (the Universal Transformer) achieve better accuracy than the plain Transformer, which has multiple attention layers?



Artit Wangperawong

Jul 4, 2019, 8:05:06 PM
to Sergio Ryan, tensor2tensor
Hi Sergio,

That's a good question. I ran some character-level experiments and found that, although the Universal Transformer achieves higher accuracy at the same number of training steps, it took 3.5 times longer to train than the vanilla Transformer on a 6-core CPU. Adding ACT reduced the training time to 2x and improved accuracy further. Lukasz then informed me that 1M steps on a TPU achieved near-perfect accuracy. For reference, see https://arxiv.org/abs/1812.02825

We can therefore ask -- how much of the improvement comes from a better architecture, and how much comes from more training time? To investigate this, I'll report results at a fixed training time, rather than at a fixed number of steps.
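
A fixed-time comparison of this kind could be set up roughly as follows. This is only a sketch under assumptions: `train_for_budget` and `FakeClock` are hypothetical helpers, not part of tensor2tensor, and the per-step costs of 1.0 and 3.5 simply mirror the 3.5x slowdown reported above. The fake clock makes the example deterministic; a real run would use the default `time.monotonic`.

```python
import time


def train_for_budget(train_step, budget_seconds, clock=time.monotonic):
    """Call train_step until the time budget is spent; return steps completed."""
    start = clock()
    steps = 0
    while clock() - start < budget_seconds:
        train_step()
        steps += 1
    return steps


class FakeClock:
    """Deterministic clock for illustration: time moves only when advanced."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


# Same budget for both models; only the cost of one training step differs.
vanilla_clock = FakeClock()
vanilla_steps = train_for_budget(
    lambda: vanilla_clock.advance(1.0), 7.0, clock=vanilla_clock)

ut_clock = FakeClock()
ut_steps = train_for_budget(
    lambda: ut_clock.advance(3.5), 7.0, clock=ut_clock)
```

With equal budgets the slower model completes fewer steps, so accuracy would then be compared at unequal step counts instead of equal ones.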

Lukasz, can you elaborate on the training setup behind the Universal Transformer paper's results (training steps, time to train, etc.)? My current understanding is that all experiments were run with the same number of training steps, rather than the same training time.

Thanks,
Art




Sergio Ryan

Jul 5, 2019, 5:25:29 AM
to tensor2tensor
Does "1M steps" here mean a maximum of 1M iterations for each position (character)? ACT halts the Universal Transformer's iterations, yet actually improves accuracy (and performance)? Without ACT the model seems to overfit, doesn't it? That's a great paper; I'll take a look at it.
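
For reference, per-position halting can be sketched as below. This follows the general ACT recipe (accumulate halting probabilities until they reach 1 - eps, then give the last step the remaining weight); the tensor2tensor implementation may differ in its details. `act_halt` is a hypothetical name, states are scalars for brevity, and the inputs are assumed non-empty. The count it returns is the number of recurrent steps for one position, which is a separate quantity from the number of training steps.

```python
def act_halt(halting_probs, states, eps=0.01):
    """Return (steps_used, weighted_state) for a single position.

    halting_probs[n] is the halting probability emitted at recurrent step n,
    states[n] is the (scalar) state at that step. The list length acts as the
    maximum number of recurrent steps.
    """
    total = 0.0   # halting probability accumulated before the current step
    output = 0.0  # running weighted sum of states
    max_steps = len(halting_probs)
    for n, (h, s) in enumerate(zip(halting_probs, states), start=1):
        if total + h >= 1.0 - eps or n == max_steps:
            # Final step: use the remainder so the weights sum to one.
            output += (1.0 - total) * s
            return n, output
        total += h
        output += h * s


# Two positions with the same states but different halting behaviour.
states = [1.0, 2.0, 3.0, 4.0]
easy_steps, easy_out = act_halt([1.0, 0.5, 0.5, 0.5], states)
hard_steps, hard_out = act_halt([0.25, 0.25, 0.5, 0.5], states)
```

Here the "easy" position stops after one recurrent step and the "hard" one after three, so compute is spent unevenly across positions.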

I agree with you about the training setup, would be useful for benchmarks.

Thank you,
Sergio