Hi all,
I'm Danilo, a final-year CS undergrad at UFBA (Brazil) interested in the LLM Tokenizers Part 2 project (Idea #15) for GSoC 2026.
I have strong C++ experience from competitive programming practice and systems projects, and am currently working with PyTorch for my undergraduate thesis. I've been studying language models and transformers over the past year, and I'm particularly interested in implementing tokenization algorithms (such as WordPiece and Unigram) from scratch, moving from theoretical understanding to low-level C++ implementation.
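To make that concrete, here is a minimal sketch of WordPiece's greedy longest-match inference loop with a toy in-memory vocabulary (the function name and vocab are mine, purely illustrative of the approach, not taken from the codebase):

```cpp
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

// Greedy longest-match WordPiece inference over a single word:
// repeatedly consume the longest vocab entry that prefixes the
// remainder, marking continuations with "##" as in BERT.
// (Byte-wise substr is fine for this ASCII demo; a real
// implementation must iterate over Unicode codepoints.)
static std::vector<std::string> wordpiece_tokenize(
        const std::string & word,
        const std::unordered_set<std::string> & vocab) {
    std::vector<std::string> pieces;
    size_t start = 0;
    while (start < word.size()) {
        size_t end = word.size();
        bool matched = false;
        // scan from the longest candidate down to one character
        while (end > start) {
            std::string cand = word.substr(start, end - start);
            if (start > 0) {
                cand = "##" + cand;
            }
            if (vocab.count(cand)) {
                pieces.push_back(cand);
                matched = true;
                break;
            }
            --end;
        }
        if (!matched) {
            return {"[UNK]"}; // no prefix found: whole word is unknown
        }
        start = end;
    }
    return pieces;
}

int main() {
    const std::unordered_set<std::string> vocab = {
        "token", "##ize", "##ization", "play", "##ing"
    };
    for (const auto & piece : wordpiece_tokenize("tokenization", vocab)) {
        std::cout << piece << ' ';
    }
    std::cout << '\n'; // prints: token ##ization
}
```

A production version would of course replace the O(n²) substring scan with a trie or hash-based longest-prefix lookup, but this is the core loop I'd want to build on.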
I've reviewed PR #27534 from last year's GSoC and the HuggingFace tokenizers documentation. The scope of expanding beyond BPE, adding customizable normalization, and optimizing performance aligns well with my interests in both ML infrastructure and systems programming.
A few questions arose as I explored the codebase:
1) For the tokenizer.json parser, is support for the full HuggingFace spec envisioned, or only a practical subset targeting common VLMs (e.g., PaliGemma, CLIP)?
2) Are there identified performance bottlenecks in the current BPE implementation to target, or should profiling be part of the project scope?
3) For the regex-based pre-tokenization, is the plan to extend the llama.cpp unicode regex functionality used in PR #27534, or to explore a different approach? (See the small sketch below for the kind of split I have in mind.)
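For context on question 3, the pattern below is a simplified ASCII-only stand-in I wrote for a GPT-2-style pre-tokenizer split, not the actual pattern from the codebase (std::regex has no Unicode category support, which is presumably part of why llama.cpp carries its own unicode regex machinery):

```cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
    // Simplified, ASCII-only stand-in for a GPT-2-style pre-tokenizer
    // pattern: contractions, optionally space-prefixed letter runs,
    // digit runs, punctuation runs, then remaining whitespace.
    const std::regex pattern(
        R"('s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^A-Za-z0-9\s]+|\s+)");
    const std::string text = "Hello world, it's 2026!";
    for (std::sregex_iterator it(text.begin(), text.end(), pattern), end;
         it != end; ++it) {
        std::cout << '[' << it->str() << "] ";
    }
    std::cout << '\n';
    // prints: [Hello] [ world] [,] [ it] ['s] [ 2026] [!]
}
```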
Looking forward to contributing to the project!
Best regards,
Danilo