Dear OpenCV team,
I hope you are doing well.
As we all know, OpenCV is primarily designed for computer vision. However, integrating Large Language Model (LLM) support could significantly enhance its capabilities. Currently, OpenCV can detect and extract text from images using OCR or deep learning models, but these pipelines often produce misrecognized or missing characters and poorly formatted output.
One of the major limitations of OpenCV's DNN module is the lack of native support for tokenization. As a result, running an LLM inside OpenCV requires an external tokenizer library, which adds unnecessary complexity to the pipeline. I strongly believe that addressing this gap would make OpenCV more efficient and impactful in real-world applications, which is why I am eager to contribute to solving this problem.
To better understand tokenization algorithms, I am referring to the Hugging Face documentation:
https://huggingface.co/learn/nlp-course/en/chapter6/5
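To make the discussion concrete, here is a toy sketch of the greedy merge step used by byte-pair-encoding (BPE) style tokenizers, in the spirit of the algorithms covered in the Hugging Face course linked above. This is a deliberately simplified illustration with a hand-written merge table, not OpenCV code and not a real vocabulary; a native tokenizer in the DNN module would implement a similar merge loop in C++.

```python
def bpe_tokenize(word, merges):
    """Split `word` into characters, then apply learned merges in priority order."""
    tokens = list(word)
    for left, right in merges:  # merges are applied in the order they were learned
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == left and tokens[i + 1] == right:
                tokens[i:i + 2] = [left + right]  # merge the adjacent pair in place
            else:
                i += 1
    return tokens

# Tiny hypothetical merge table, for illustration only:
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(bpe_tokenize("lower", merges))  # ['low', 'er']
```

A real tokenizer would also handle byte-level fallback, special tokens, and a vocabulary lookup from token string to ID, which is exactly the machinery that currently has to come from an external library.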
I have also followed Krish Naik's NLP playlist on YouTube in the past, and I have experience fine-tuning models as well.
Additionally, I found this video helpful in understanding the concepts:
https://youtu.be/zduSFxRajkE?si=JF725Ipnzc4R5Nnc
That said, I still need to explore OpenCV's DNN module further to fully understand how a tokenizer could be integrated effectively. If anyone has relevant resources, insights, or experience related to this, I would greatly appreciate it if you could share them with me.
Looking forward to your thoughts!
Best regards,
Anushka Sharma
anushkas...@gmail.com
LinkedIn