Objective:
Improve the OpenCV DNN module so that it can run large language models (LLMs) efficiently, with a focus on models compatible with llama.cpp (such as LLaMA, Mistral, and GPT-J). The goal is to optimize inference performance, particularly for autoregressive decoding, and to provide a user-friendly API that supports real-time applications.
Key Ideas:
Dynamic Memory Management:
Enhance the current static blob allocation mechanism to support dynamic input sizes.
Avoid costly reallocations when extending input sequences during token-by-token generation.
Ensure efficient handling of varying sequence lengths to reduce memory overhead.
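A minimal sketch of what such preallocation could look like, using only cv::Mat from OpenCV core. LayerKVBuffer, maxSeqLen, and width are illustrative names (not existing OpenCV DNN structures): the buffer is sized once for the maximum sequence length, so appending one token per decoding step never reallocates or copies the cache.

```cpp
// Sketch: per-layer key/value buffer preallocated to the maximum sequence
// length, so token-by-token growth never triggers a reallocation.
// LayerKVBuffer is an illustrative name, not an existing OpenCV DNN type.
#include <opencv2/core.hpp>

struct LayerKVBuffer {
    cv::Mat keys, values;   // maxSeqLen x width, allocated once
    int len = 0;            // rows currently holding valid tokens

    LayerKVBuffer(int maxSeqLen, int width)
        : keys(maxSeqLen, width, CV_32F), values(maxSeqLen, width, CV_32F) {}

    // Copy the new token's key/value row into the next free slot.
    void append(const cv::Mat& k, const cv::Mat& v) {
        CV_Assert(len < keys.rows && k.cols == keys.cols && v.cols == values.cols);
        k.copyTo(keys.row(len));
        v.copyTo(values.row(len));
        ++len;
    }

    // Lightweight views over the valid prefix, for use in attention.
    cv::Mat validKeys()   const { return keys.rowRange(0, len); }
    cv::Mat validValues() const { return values.rowRange(0, len); }
};
```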
Past Key/Value Caching:
Implement a caching mechanism for transformer-based models so that key/value pairs computed for earlier tokens are reused instead of being recomputed for each new token.
This caching should keep the per-token computation cost roughly constant after the initial prompt (prefill) pass; see the sketch below.
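As a rough illustration of that reuse, here is a single-token attention step written against the LayerKVBuffer from the previous sketch; attendWithCache is a hypothetical helper, not an existing cv::dnn function. Only the new token's projections (q, kNew, vNew) are computed in this step; everything cached is read as-is.

```cpp
// Sketch: attention for one newly generated token, reusing the LayerKVBuffer
// from the previous sketch. Past keys/values come straight from the cache;
// only the new token's query/key/value are computed per step.
#include <opencv2/core.hpp>
#include <cmath>

cv::Mat attendWithCache(const cv::Mat& q,     // 1 x width query of the new token
                        const cv::Mat& kNew,  // 1 x width key of the new token
                        const cv::Mat& vNew,  // 1 x width value of the new token
                        LayerKVBuffer& cache)
{
    cache.append(kNew, vNew);                          // extend the cache, never recompute
    cv::Mat K = cache.validKeys();                     // len x width
    cv::Mat V = cache.validValues();                   // len x width

    cv::Mat scores = q * K.t() / std::sqrt((double)q.cols);   // 1 x len
    double maxVal;
    cv::minMaxLoc(scores, nullptr, &maxVal);           // numerically stable softmax
    cv::Mat w;
    cv::exp(scores - maxVal, w);
    w = w / cv::sum(w)[0];

    return w * V;                                      // 1 x width context vector
}
```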
Dynamic Sequence Extension & Batch Processing:
Provide support for seamless, incremental token generation where the model processes only the new token along with the cached states.
Enable batched processing of multiple sequences to maximize throughput, which is especially important for applications such as generating multiple text completions in parallel (see the sketch below).
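The decoding loop below sketches how batched, incremental generation could be driven from user code. BatchedLlmSession and decodeStep are assumed interfaces used purely for illustration (not current cv::dnn API); each iteration issues one forward pass that covers all still-active sequences, feeding only the last token of each.

```cpp
// Sketch: batched greedy decoding over several sequences at once.
// BatchedLlmSession is a hypothetical wrapper around a cv::dnn network with
// per-sequence KV caches; it is not an existing OpenCV class.
#include <algorithm>
#include <cstdint>
#include <vector>

struct BatchedLlmSession {
    // Feeds one new token per sequence; returns per-sequence logits [batch][vocab].
    std::vector<std::vector<float>> decodeStep(const std::vector<int64_t>& lastTokens);
};

void decodeBatch(BatchedLlmSession& session,
                 std::vector<std::vector<int64_t>>& sequences,  // prompts already prefilled
                 int maxNewTokens, int64_t eosId)
{
    std::vector<int64_t> last(sequences.size());
    for (size_t i = 0; i < sequences.size(); ++i)
        last[i] = sequences[i].back();              // last prompt token of each sequence

    std::vector<bool> done(sequences.size(), false);
    for (int step = 0; step < maxNewTokens; ++step) {
        auto logits = session.decodeStep(last);     // one batched forward pass
        bool anyActive = false;
        for (size_t i = 0; i < sequences.size(); ++i) {
            if (done[i]) continue;
            int64_t tok = std::max_element(logits[i].begin(), logits[i].end())
                          - logits[i].begin();      // greedy pick per sequence
            done[i] = (tok == eosId);
            if (!done[i]) { sequences[i].push_back(tok); last[i] = tok; anyActive = true; }
        }
        if (!anyActive) break;                      // every sequence has finished
    }
}
```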
User-Friendly API & Demo Applications:
Offer a high-level wrapper that abstracts the complexity of caching and dynamic input handling, making it easier for users to integrate LLM inference into their projects.
Develop sample demos (e.g., token streaming for a real-time chatbot, parallel text generation) that illustrate how to use the enhanced DNN module effectively.
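To give a feel for the intended ergonomics, a possible shape of the token-streaming demo is sketched below. cv::dnn::LLMInference, loadFromGGUF, and generateStream are hypothetical names chosen to illustrate the wrapper, not existing OpenCV API; caching and dynamic input handling would be hidden behind them.

```cpp
// Sketch of a token-streaming chatbot demo against a hypothetical high-level
// wrapper. All LLMInference names below are illustrative, not existing API.
#include <functional>
#include <iostream>
#include <string>

namespace cv { namespace dnn {
struct LLMInference {                               // assumed wrapper hiding cache details
    static LLMInference loadFromGGUF(const std::string& path);
    void generateStream(const std::string& prompt,
                        const std::function<void(const std::string&)>& onToken);
};
}}

int main()
{
    auto llm = cv::dnn::LLMInference::loadFromGGUF("mistral-7b-q4.gguf");
    std::string prompt;
    while (std::getline(std::cin, prompt)) {
        // Each decoded piece is printed as soon as it is generated,
        // giving the real-time feel expected from a chatbot.
        llm.generateStream(prompt, [](const std::string& piece) {
            std::cout << piece << std::flush;
        });
        std::cout << std::endl;
    }
    return 0;
}
```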