Hello,
I have been exploring the C++ generative AI inference tasks and would like to learn more about the roadmap for these features.
As I understand it, the process converts SafeTensors or PyTorch checkpoints into a FlatBuffers binary containing only weights and metadata, roughly a subset of the TFLite format. This binary is then executed by a minimal runtime (XNNPACK on CPU) that uses TFLite only to parse the FlatBuffers; the runtime itself contains the code that builds a subgraph out of XNNPACK primitives. Please correct me if any of this is inaccurate.
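One concrete detail behind this understanding: a TFLite FlatBuffers file carries the 4-byte file identifier "TFL3" immediately after the 4-byte root-table offset, which is how a parser can cheaply recognize the format before handing it to the runtime. A minimal sketch (the helper name is mine, not from MediaPipe or TFLite):

```python
def looks_like_tflite(blob: bytes) -> bool:
    """Heuristic check for a TFLite-flavored FlatBuffers binary.

    FlatBuffers places an optional 4-byte file identifier at bytes 4..8
    (right after the root-table offset); TFLite models use b"TFL3".
    """
    return len(blob) >= 8 and blob[4:8] == b"TFL3"

# Synthetic header: a placeholder root offset followed by the identifier.
fake_header = (8).to_bytes(4, "little") + b"TFL3"
print(looks_like_tflite(fake_header))   # True
print(looks_like_tflite(b"\x00" * 8))   # False
```

This only inspects the header, of course; actually walking the weights and metadata requires the generated FlatBuffers schema code, which is the part the runtime borrows from TFLite.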
Given this context, I have a couple of questions:
a) Is there a plan to open-source the converter code for the LLM tasks, so that custom or modified LLM models can be integrated? As far as I can tell, this is not available yet (see https://github.com/google/mediapipe/issues/5355: the internals live in a precompiled .so shipped in the pip package, and building that library from source does not seem possible).
b) Given that the relationship with TFLite appears limited to the FlatBuffers format and its loading into the runtime, is this approach intended as a temporary solution until TFLite fully supports LLMs, or is it expected to remain in place?
Thank you for your time and assistance.
Best regards,