GSoC 2026 - LLM Tokenizers Part 2 Interest

Danilo Santiago Fonseca

Feb 12, 2026, 9:50:17 AM
to opencv-gsoc-202x
Hi all,

I'm Danilo, a final-year CS undergrad at UFBA (Brazil) interested in the LLM Tokenizers Part 2 project (Idea #15) for GSoC 2026.

I have strong C++ experience from competitive programming practice and systems projects, and am currently working with PyTorch for my undergraduate thesis. I've been studying language models and transformers over the past year, and I'm particularly interested in implementing tokenization algorithms (such as WordPiece and Unigram) from scratch, moving from theoretical understanding to low-level C++ implementation.
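To give a concrete idea of the kind of implementation I have in mind, here is a minimal sketch of WordPiece's greedy longest-match-first step. It assumes a BERT-style vocabulary with the "##" continuation prefix and an [UNK] fallback, operates on raw bytes rather than Unicode code points, and leaves out normalization, so it is only an illustration, not a proposed design:

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Greedy longest-match-first WordPiece for a single pre-tokenized word.
// vocab holds both word-initial pieces ("un") and continuations ("##aff").
std::vector<std::string> wordpiece(const std::string& word,
                                   const std::unordered_set<std::string>& vocab) {
    std::vector<std::string> tokens;
    size_t start = 0;
    while (start < word.size()) {
        size_t end = word.size();
        std::string piece;
        bool found = false;
        // Shrink the candidate from the right until it appears in the vocabulary.
        while (start < end) {
            piece = word.substr(start, end - start);
            if (start > 0) piece = "##" + piece;  // continuation prefix
            if (vocab.count(piece)) { found = true; break; }
            --end;
        }
        if (!found) return {"[UNK]"};  // no prefix matches: the whole word is unknown
        tokens.push_back(piece);
        start = end;
    }
    return tokens;
}
```

With vocab = {"un", "##aff", "##able"}, "unaffable" splits into ["un", "##aff", "##able"]. A real implementation would map pieces to token IDs and walk code points instead of bytes, but the control flow stays this simple.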

I've reviewed PR #27534 from last year's GSoC and the HuggingFace tokenizers documentation. The scope of expanding beyond BPE, adding customizable normalization, and optimizing performance aligns well with my interests in both ML infrastructure and systems programming.

A few questions arose as I explored the codebase:

1) For the tokenizer.json parser, is full support for the HuggingFace spec envisioned, or only a practical subset targeting common VLMs (e.g., PaliGemma, CLIP)?

2) Are there identified performance bottlenecks in the current BPE implementation to target, or should profiling be part of the project scope?

3) For the regex-based pre-tokenization, is the plan to extend the llama.cpp unicode regex functionality used in PR #27534, or explore a different approach?
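To illustrate why I'm asking in (3): a simplified, ASCII-only stand-in for a GPT-2-style pre-tokenizer pattern can be written with std::regex, as sketched below, but the real patterns shipped in tokenizer.json files rely on Unicode property classes such as \p{L} and \p{N}, which std::regex does not support. The pattern here is only my illustrative approximation:

```cpp
#include <iostream>
#include <regex>
#include <string>
#include <vector>

// ASCII-only approximation of a GPT-2-style pre-tokenizer: an optional leading
// space followed by a run of letters, digits, or punctuation, or bare whitespace.
std::vector<std::string> pre_tokenize(const std::string& text) {
    static const std::regex pattern(R"( ?[A-Za-z]+| ?[0-9]+| ?[^A-Za-z0-9\s]+|\s+)");
    std::vector<std::string> pieces;
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end; it != end; ++it)
        pieces.push_back(it->str());
    return pieces;
}

int main() {
    for (const auto& p : pre_tokenize("Hello, world! 42 tokens"))
        std::cout << '[' << p << ']';
    std::cout << '\n';  // prints [Hello][,][ world][!][ 42][ tokens]
}
```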

Looking forward to contributing to the project!

Best regards,
Danilo