Starkly Speaking: Adaptive Protein Tokenization

Hannes Stärk

May 17, 2026, 2:37:49 PM

to stark...@googlegroups.com

Hi everyone,

Tomorrow we will talk about:

Speaker: Rohit Dilip, a PhD student in Computer Science at Caltech co-advised by David Van Valen and Georgia Gkioxari. He works on generative modeling for biology, with recent work on flow-matching tokenizers and autoregressive models for protein structures.

Paper:

Adaptive Protein Tokenization

https://arxiv.org/abs/2602.06418 (Rohit Dilip, Ayush Varshney, David Van Valen)

Tokenization is a promising path to multi-modal models capable of jointly understanding protein sequences, structure, and function. Existing protein structure tokenizers create tokens by pooling information from local neighborhoods, an approach that limits their performance on generative and representation tasks. In this work, we present a method for global tokenization of protein structures in which successive tokens contribute increasing levels of detail to a global representation. This change resolves several issues with generative models based on local protein tokenization: it mitigates error accumulation, provides embeddings without sequence-reduction operations, and allows task-specific adaptation of a tokenized sequence's information content. We validate our method on reconstruction, generative, and representation tasks and demonstrate that it matches or outperforms existing models based on local protein structure tokenizers. We show how adaptive tokens enable inference criteria based on information content, which boosts designability. We validate representations generated from our tokenizer on CATH classification tasks and demonstrate that non-linear probing on our tokenized sequences outperforms equivalent probing on representations from other tokenizers. Finally, we demonstrate how our method supports zero-shot protein shrinking and affinity maturation.

Meeting Details:

Every Monday at 9:00 PT / 12:00 ET / 18:00 CE(S)T

https://mit.zoom.us/my/starkhannes

Slack Workspace for discussion and paper voting:

https://join.slack.com/t/logag/shared_invite/zt-2zuxi7gd1-rLUgxg6gnCkhO7WlRsyElg

All information (schedule of upcoming papers, recordings, mailing list):

https://hannes-stark.com/starkly-speaking

Hannes Stärk

Website: https://hannes-stark.com

PhD student at MIT
