SWE-Glu SF: Enhanced Transformer with Rotary Position Embedding (RoPE)

sasha.hydrie

Aug 23, 2024, 1:16:15 AM
to SWE-Glu SF Papers Reading Group

Glu evening,

Our next meeting will be Saturday, August 24th, 2:30 PM @ 848 Divisadero Street. We have been growing 300% week-over-week and are now searching for a new meeting space; if you have any leads, please email us.

This week's paper is "RoFormer: Enhanced Transformer with Rotary Position Embedding" by Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu (2021) arxiv.org/abs/2104.09864

 
Why this is cool:
1. This positional-encoding scheme is behind nearly every frontier model, part of the mythical Noam architecture*
2. The whole thing takes fewer than 20 lines to implement (or there's a repo to experiment with); see the sketch just after this list
3. Add a random Reddit thread and a forced acronym and you get graphs like this
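
Since the claim in (2) is best made by just showing it, here is a minimal NumPy sketch of RoPE (my own illustration, not the authors' reference code): split each query/key vector into 2-D pairs and rotate pair i at position m by the angle m * θ_i, with θ_i = 10000^(-2i/d) as in the paper.

    import numpy as np

    def rope(x, base=10000.0):
        """Apply rotary position embedding to x of shape (seq_len, dim), dim even."""
        seq_len, dim = x.shape
        theta = base ** (-np.arange(0, dim, 2) / dim)          # per-pair frequencies, (dim/2,)
        angles = np.arange(seq_len)[:, None] * theta[None, :]  # m * theta_i, (seq_len, dim/2)
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[:, 0::2], x[:, 1::2]                        # split dimensions into pairs
        out = np.empty_like(x)
        out[:, 0::2] = x1 * cos - x2 * sin                     # rotate each pair by its angle
        out[:, 1::2] = x1 * sin + x2 * cos
        return out

    # Applied to queries and keys before the dot product, the score q_m . k_n
    # then depends on positions only through the relative offset m - n.
    q, k = rope(np.random.randn(8, 64)), rope(np.random.randn(8, 64))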

This week's paper is light on math (compared to last week, at least) but rich in intuition and history! We'll certainly take some time to work through why it works and how we got here. For the attention-challenged among you, this video is a great primer on the paper.

We're trying something new this week. In the spirit of listserv, feel free to respond with your initial thoughts.

  1.  How does the choice of Θ impact the encoding? What properties are required? (See the small frequency sketch after these questions.)

  2. Which parts of the paper are the most sketchy?

  3. Were you impressed by the results? How did RoPE become so dominant? (What happened to ALiBi?)
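
To seed question 1, here is a tiny sketch (my own; the dim of 128 and the bases compared are arbitrary) of what Θ actually controls: each rotary pair i rotates at frequency θ_i = base^(-2i/d), i.e. with wavelength 2π/θ_i tokens, so the base sets how the pairs spread from a few tokens up to tens of thousands.

    import numpy as np

    def rope_wavelengths(dim=128, base=10000.0):
        """Wavelength in tokens of each rotary pair: 2*pi / theta_i."""
        theta = base ** (-np.arange(0, dim, 2) / dim)
        return 2 * np.pi / theta

    # With the paper's base of 10000, wavelengths run from ~6 tokens up to
    # nearly 2*pi*base, so early pairs resolve local order and late pairs
    # encode coarse long-range position.
    for base in (100.0, 10000.0, 1000000.0):
        w = rope_wavelengths(base=base)
        print(f"base={base:>9.0f}  shortest={w.min():6.2f}  longest={w.max():12.1f}")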


Thank you to the 15ish (!) of you who joined us at our third meeting.

Best,
Cheikh and Sasha

P.S. If you are somehow reading this email but not on our listserv, join it here. If you are on our listserv, send it to your friends.

*I still suspect that Cheikh made up this term himself






nottlespike

Aug 23, 2024, 6:56:07 PM
to SWE-Glu SF Papers Reading Group
Hi all, I'm Jason/Kearm. My startup is focused on LLM-powered cybersecurity, but my passion is local and open-source LLMs.

The thing that popped out to me as the most sketchy when reading the paper for the first time was the (in my opinion) very odd choice of V100 GPUs: they do not support FlashAttention-2, nor will they based on the roadmap from Tri Dao's lab, and FlashAttention-2 is standard for most if not all pre-training, and likewise for supervised fine-tuning.

RoPE, as we now know, is one of many critical tools in any LLM "scientist's" toolbox. I would like to propose a look at YaRN (https://arxiv.org/abs/2309.00071) as well, for even longer context, along with the tradeoffs and differences involved.
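
To make those tradeoffs concrete ahead of the discussion, here is a toy sketch (my own, not YaRN itself; the 2048/8192 lengths and dim of 64 are arbitrary) of the two basic knobs that context-extension methods like YaRN build on: squeezing positions (position interpolation) versus raising the RoPE base by roughly the length ratio (roughly the "NTK-aware" Reddit trick). YaRN's actual contribution is blending these per frequency band and adding an attention-temperature correction.

    import numpy as np

    def rope_angles(positions, dim=64, base=10000.0, pos_scale=1.0):
        """Rotation angles m * theta_i for each position m and rotary pair i."""
        theta = base ** (-np.arange(0, dim, 2) / dim)
        return (positions * pos_scale)[:, None] * theta[None, :]

    train_len, target_len = 2048, 8192            # hypothetical pretrain and target lengths
    positions = np.arange(target_len, dtype=float)

    naive  = rope_angles(positions)                                         # plain extrapolation
    interp = rope_angles(positions, pos_scale=train_len / target_len)       # squeeze all positions
    scaled = rope_angles(positions, base=10000.0 * target_len / train_len)  # stretch long wavelengths

    # Interpolation keeps every angle inside the trained range but blurs local order;
    # raising the base leaves the fast (local) pairs essentially untouched and only
    # slows the slow (long-range) pairs, which is the tension YaRN resolves per band.
    for name, a in (("naive", naive), ("interp", interp), ("base-scaled", scaled)):
        print(f"{name:>11}: fastest-pair max angle {a[:, 0].max():7.1f}, "
              f"slowest-pair max angle {a[:, -1].max():.3f}")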

Best regards,
Jason Lu
Co-Founder, Spike Intelligence