Announcing Pixtral 12B - the first-ever multimodal Mistral model
Alan Timm
Sep 18, 2024, 6:51:43 PM
to RSSC-List
Impressive performance from a 12B vision model. It can be run on a GPU, and if one isn't out there already I expect to see a 4-bit quantized version that runs in 8 GB of memory (see the sketch below).
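
Here's a minimal sketch of what 4-bit loading might look like with Hugging Face transformers + bitsandbytes. The repo id "mistral-community/pixtral-12b" and the [INST]/[IMG] prompt tokens are my assumptions, not confirmed names; check the model card for the exact repo and chat template before running.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration, BitsAndBytesConfig

model_id = "mistral-community/pixtral-12b"  # assumed Hub repo name

# NF4 4-bit quantization should bring the 12B weights down to roughly
# 7-8 GB of VRAM, plus some headroom for activations.
quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, quantization_config=quant, device_map="auto"
)

# Ask a question about a single local image at its native resolution.
image = Image.open("chart.png")
prompt = "<s>[INST][IMG]\nWhat does this chart show?[/INST]"  # assumed template
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(out[0], skip_special_tokens=True))
```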
Pixtral is trained to understand both natural images and documents, achieving 52.5% on the MMMU reasoning benchmark, surpassing a number of larger models. The model shows strong abilities in tasks such as chart and figure understanding, document question answering, multimodal reasoning and instruction following. Pixtral is able to ingest images at their natural resolution and aspect ratio, giving the user flexibility on the number of tokens used to process an image. Pixtral is also able to process any number of images in its long context window of 128K tokens. Unlike previous open-source models, Pixtral does not compromise on text benchmark performance to excel in multimodal tasks.
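
Since it keeps the full 128K-token context, you should also be able to feed it several pages or frames in a single prompt. A rough sketch, reusing the processor/model from above and the same assumed [IMG] placeholder tokens:

```python
from PIL import Image

# One [IMG] placeholder per image; the long context leaves room for many pages.
images = [Image.open(p) for p in ("page1.png", "page2.png")]
prompt = "<s>[INST][IMG][IMG]\nSummarize the differences between these two pages.[/INST]"
inputs = processor(text=prompt, images=images, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(out[0], skip_special_tokens=True))
```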