Proposal for Better LLMs support in OpenCV (2)


Kunal Tiwari

Apr 6, 2025, 12:37:11 AM
to opencv-gsoc-202x
Hello mentors,

My name is Kunal Tiwari. I am writing to express my interest in contributing to the Better LLMs support in OpenCV (2) project, specifically focusing on optimizing inference performance.

I am a student at the University of Texas at Austin, where I am studying Computer Science. I have experience in machine learning, computer vision, and transformer-based architectures. I am currently part of the Living with Robots lab on campus, where my team and I have been developing a state-of-the-art transformer for human motion prediction; we are also writing a research paper to publish these results.

Outside of school, I’ve worked on projects such as rebuilding the GPT-2 model from scratch, using CNNs for music genre classification, and deploying full-stack ML apps with PyTorch, TensorFlow, and Hugging Face tools.

Here is a quick roadmap of my simplified proposal for enhanced LLM support in the OpenCV DNN module:

Objective:
Improve the OpenCV DNN module to support LLMs more efficiently, with a focus on models compatible with llama.cpp (such as LLaMA, Mistral, and GPT-J). The goal is to optimize inference performance, particularly for autoregressive decoding, and to provide a user-friendly API that facilitates real-time applications.

Key Ideas:

  1. Dynamic Memory Management:

    • Enhance the current static blob allocation mechanism to support dynamic input sizes.

    • Avoid costly reallocations when extending input sequences during token-by-token generation.

    • Ensure efficient handling of varying sequence lengths to reduce memory overhead (a minimal buffer-growth sketch follows this list).

  2. Past Key/Value Caching:

    • Implement a caching mechanism for transformer-based models so that once the past key/value pairs are computed, they can be reused instead of re-calculating them with each new token.

    • This caching should reduce computation time, making per-token inference cost nearly constant after the initial forward pass (see the decode-loop sketch after this list).

  3. Dynamic Sequence Extension & Batch Processing:

    • Provide support for seamless, incremental token generation where the model processes only the new token along with the cached states.

    • Enable batched processing of multiple sequences to maximize throughput, especially important for applications like generating multiple text completions in parallel.

  4. User-Friendly API & Demo Applications:

    • Offer a high-level wrapper that abstracts the complexity of caching and dynamic input handling, making it easier for users to integrate LLM inference into their projects (a possible interface shape is sketched after this list).

    • Develop sample demos (e.g., token streaming for a real-time chatbot, parallel text generation) that illustrate how to use the enhanced DNN module effectively.
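
To make the dynamic memory management idea concrete, here is a minimal sketch of the growth strategy I have in mind for cached key/value blobs. The class name KvCacheBuffer and the geometric-growth policy are my own illustration and not existing OpenCV API; only cv::Mat is used from OpenCV, and a real cache would hold one such buffer per layer and per head.

    // Minimal sketch (hypothetical, not existing OpenCV API): a per-layer key/value
    // buffer whose storage grows geometrically, so appending one token per decode
    // step does not force a reallocation every time.
    #include <opencv2/core.hpp>
    #include <vector>

    class KvCacheBuffer {
    public:
        explicit KvCacheBuffer(int headDim, size_t initialCapacity = 64)
            : headDim_(headDim), length_(0)
        {
            data_.reserve(initialCapacity * static_cast<size_t>(headDim_));
        }

        // Append the key (or value) vector of one new token (1 x headDim, CV_32F).
        void append(const cv::Mat& tokenKv)
        {
            CV_Assert(tokenKv.type() == CV_32F &&
                      tokenKv.total() == static_cast<size_t>(headDim_));
            // std::vector doubles its capacity on demand, so most appends copy only
            // headDim floats; occasional growth moves the whole buffer once.
            data_.insert(data_.end(), tokenKv.ptr<float>(),
                         tokenKv.ptr<float>() + headDim_);
            ++length_;
        }

        // Non-owning (length x headDim) view of the cache; note that a later append
        // may reallocate the storage and invalidate previously returned headers.
        cv::Mat view()
        {
            return cv::Mat(static_cast<int>(length_), headDim_, CV_32F, data_.data());
        }

        size_t length() const { return length_; }

    private:
        int headDim_;
        size_t length_;
        std::vector<float> data_;  // contiguous storage, capacity grows geometrically
    };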
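
For past key/value caching and incremental decoding (ideas 2 and 3), the decode loop I am imagining is sketched below. The blob names "input_ids", "past_key_values", "present_key_values", and "logits" are assumptions about how an exported transformer graph might be named, the cache is shown as a single Mat instead of per-layer tensors, and logits are treated as a 2D (sequence x vocabulary) matrix for readability. Today's DNN module does not yet round-trip the cache like this; enabling that is exactly the proposed work.

    // Sketch of token-by-token decoding with a reused key/value cache. Blob names
    // and tensor shapes are simplified assumptions (see the note above); the point
    // is the structure: after the prompt pass, each step feeds only the newest
    // token plus the cached states.
    #include <opencv2/dnn.hpp>
    #include <cstdint>
    #include <vector>

    std::vector<int64_t> generate(cv::dnn::Net& net, const cv::Mat& promptIds,
                                  int maxNewTokens)
    {
        std::vector<int64_t> generated;

        // 1) Prefill: run the whole prompt once and keep the returned cache.
        net.setInput(promptIds, "input_ids");
        std::vector<cv::Mat> outs;
        net.forward(outs, std::vector<cv::String>{"logits", "present_key_values"});
        cv::Mat logits = outs[0];   // treated as (seqLen x vocab) for readability
        cv::Mat pastKv = outs[1];   // in reality: per-layer key and value tensors

        for (int step = 0; step < maxNewTokens; ++step)
        {
            // 2) Pick the next token (greedy argmax over the last position's logits).
            cv::Point maxLoc;
            cv::minMaxLoc(logits.row(logits.rows - 1), nullptr, nullptr, nullptr, &maxLoc);
            int64_t nextToken = maxLoc.x;
            generated.push_back(nextToken);

            // 3) Decode step: feed only the new token together with the cached K/V,
            //    so the per-step cost stays roughly constant as the sequence grows.
            cv::Mat stepIds = (cv::Mat_<int>(1, 1) << static_cast<int>(nextToken));
            net.setInput(stepIds, "input_ids");
            net.setInput(pastKv, "past_key_values");
            net.forward(outs, std::vector<cv::String>{"logits", "present_key_values"});
            logits = outs[0];
            pastKv = outs[1];       // extended cache, reused on the next iteration
        }
        return generated;
    }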
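
Finally, for the user-friendly API, the interface below is a hypothetical sketch of the abstraction level I would aim for in the demos. The class LLMInference and its methods do not exist in OpenCV today; they are shown only to illustrate how a token-streaming chatbot demo could consume the module without touching the cache machinery.

    // Hypothetical user-facing wrapper (interface sketch only, not existing API).
    // It hides cache management and dynamic input handling behind two calls.
    #include <opencv2/dnn.hpp>
    #include <functional>
    #include <string>

    namespace cv { namespace dnn {

    class LLMInference
    {
    public:
        // Load a converted model; backend/target selection would mirror cv::dnn::Net.
        explicit LLMInference(const std::string& modelPath);

        // Run the prompt once and set up the internal key/value cache.
        void prefill(const std::string& prompt);

        // Generate up to maxNewTokens tokens, invoking the callback as each token
        // is decoded (this is what a real-time chatbot demo would hook into).
        void generateStream(int maxNewTokens,
                            const std::function<void(const std::string&)>& onToken);
    };

    }} // namespace cv::dnn

    // Intended usage in a streaming demo (model file name is illustrative only):
    //
    //   cv::dnn::LLMInference llm("mistral-7b.onnx");
    //   llm.prefill("Describe this image in one sentence:");
    //   llm.generateStream(64, [](const std::string& tok) { std::cout << tok << std::flush; });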

Please let me know if I am on the right track for this project and whether there is any additional information I should be aware of.

Thank you for your time,
Kunal Tiwari