Meeting #174 | Vision Transformers Need Registers + Efficient Streaming Language Models with Attention Sinks

Tatev Vardanyan

Nov 9, 2023, 12:55:46 AM
to Machine Learning Reading Group Yerevan
Dear friends,

On Friday, November 10, at 3:30 p.m., YerevaNN's ML Researcher Philipp Guevorguian will present two papers: "Vision Transformers Need Registers" and "Efficient Streaming Language Models with Attention Sinks". See details below.

Date & Time: Friday, November 10, 2023, 3:30 pm
Language: Armenian and/or English [if non-Armenian speakers attend the seminar]
Venue: YSU-Krisp AI Lab

---------
1. Vision Transformers Need Registers

Abstract
Transformers have recently emerged as a powerful tool for learning visual representations. In this paper, we identify and characterize artifacts in feature maps of both supervised and self-supervised ViT networks. The artifacts correspond to high-norm tokens appearing during inference primarily in low-informative background areas of images, which are repurposed for internal computations. We propose a simple yet effective solution based on providing additional tokens to the input sequence of the Vision Transformer to fill that role. We show that this solution fixes that problem entirely for both supervised and self-supervised models, sets a new state of the art for self-supervised visual models on dense visual prediction tasks, enables object discovery methods with larger models, and most importantly leads to smoother feature maps and attention maps for downstream visual processing.

The paper is available on arxiv: https://arxiv.org/pdf/2309.16588.pdf
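
For intuition, here is a minimal sketch (not the authors' code) of the idea described in the abstract: a few extra learnable "register" tokens are appended to the ViT input sequence and simply discarded at the output. The class/module names, dimensions, and register count below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Illustrative ViT encoder with extra register tokens (hypothetical sketch)."""
    def __init__(self, embed_dim=768, num_registers=4, depth=12, num_heads=12):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Extra tokens carrying no image content; the model can use them for
        # internal computation instead of repurposing background patch tokens.
        self.registers = nn.Parameter(torch.zeros(1, num_registers, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patch_tokens):                  # patch_tokens: (B, N, D)
        b = patch_tokens.shape[0]
        cls = self.cls_token.expand(b, -1, -1)
        reg = self.registers.expand(b, -1, -1)
        x = torch.cat([cls, reg, patch_tokens], dim=1)  # prepend [CLS] + registers
        x = self.encoder(x)
        # Registers are dropped at the output; only [CLS] and patch tokens are returned.
        return x[:, 0], x[:, 1 + reg.shape[1]:]
```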

2. Efficient Streaming Language Models with Attention Sinks

Abstract
Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach — but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of the attention sink is due to the strong attention scores towards initial tokens as a "sink", even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence length without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to a 22.2× speedup.

The paper is available on arxiv: https://arxiv.org/pdf/2309.17453.pdf
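
To make the cache policy from the abstract concrete, here is a minimal sketch (not the StreamingLLM implementation) of an eviction rule that always keeps the KV of the first few "attention sink" tokens plus a rolling window of the most recent tokens. The function name, the sink count, and the window size are illustrative assumptions.

```python
import torch

def evict_kv(past_key, past_value, num_sink=4, window=1020):
    """Hypothetical sink-plus-window KV eviction.

    past_key / past_value: tensors of shape (batch, heads, seq_len, head_dim).
    """
    seq_len = past_key.shape[2]
    if seq_len <= num_sink + window:
        return past_key, past_value               # cache still fits, drop nothing
    keep = torch.cat([
        torch.arange(num_sink),                   # initial attention-sink tokens
        torch.arange(seq_len - window, seq_len),  # most recent tokens
    ])
    return past_key[:, :, keep, :], past_value[:, :, keep, :]
```

In a decoding loop, this rule would be applied after each generated token so the cache size stays bounded while the sink tokens are never evicted.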

Regards,
Tatev


invite (1).ics