Multimodal image and voice input update.


Jeremy Ellis

Sep 10, 2025, 12:37:02 AM
to Chrome Built-in AI Early Preview Program Discussions
Is there an update on multimodal support in the Chrome Gemini API? I notice it is not mentioned much, but using a complex JSON input I am getting some kind of base64 analysis. Is there a reasonably recent demo covering both image and sound multimodal input? My test code always gets an `[object Object]` error when I try to pass a blob to the model, but using this JSON input as the prompt produced some (inaccurate) output. At least it did not throw an error. It confused the colors and content, but it handled the image and knew it was a small pixelated PNG, say 24x24.

```
{
  "contents": [
    {
      "parts": [
        {
          "text": "Describe the image. Is it a picture of something specific?"
        },
        {
          "inlineData": {
            "mimeType": "image/png",
            "data": "iVBORw0KGgoAAAANSUhEUgAAABAAAAAPCAYAAADtc08vAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsEAAA7BAbiRa+0AAAGHaVRYdFhNTDpjb20uYWRvYmUueG1wAAAAAAA8P3hwYWNrZXQgYmVnaW49J++7vycgaWQ9J1c1TTBNcENlaGlIenJlU3pOVGN6a2M5ZCc/Pg0KPHg6eG1wbWV0YSB4bWxuczp4PSJhZG9iZTpuczptZXRhLyI+PHJkZjpSREYgeG1sbnM6cmRmPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5LzAyLzIyLXJkZi1zeW50YXgtbnMjIj48cmRmOkRlc2NyaXB0aW9uIHJkZjphYm91dD0idXVpZDpmYWY1YmRkNS1iYTNkLTExZGEtYWQzMS1kMzNkNzUxODJmMWIiIHhtbG5zOnRpZmY9Imh0dHA6Ly9ucy5hZG9iZS5jb20vdGlmZi8xLjAvIj48dGlmZjpPcmllbnRhdGlvbj4xPC90aWZmOk9yaWVudGF0aW9uPjwvcmRmOkRlc2NyaXB0aW9uPjwvcmRmOlJERj48L3g6eG1wbWV0YT4NCjw/eHBhY2tldCBlbmQ9J3cnPz4slJgLAAAC30lEQVQ4TyXN0W7bVACA4f8cn2M7jpM0SZuu7ZrBhtjUTaAyJDaBEIg7uJp2wSXwDNyjXfEAvMKeA6FJIEQFQlRMHUJVtUILaVOncR3bsc/x4YLvBT7x9MuPnTUlUaeDWVZIBzoMQEqcqfBDH+X7LIxHRYuIOcpTeH5AXdfIqiwBR5ZmFHlJWZaYusbaitboFWwwZJEVZIua3niXrfuPKI1iNj3jcjZH5nmBg/9HBDIIaFxNEK9z490v2H7wGNXqoMyc/K+f6Q2uce+jzzFEOGvwPry3+aQxBqk1UvmUeY7vS4xtaLUHlMkJV5MXeFrR2Jzs/JDffv+TX37aY33QRnzz2Ttu+Ood5umMKk0QrqHXCVnmGUEQMc0bkqzkzdc3qI3D1BUXSUrjJJ04RFJlPNvb5+m3B3gItNaIIEaP38OLekyuFjw/ukAIATgcgtXVPqO1Nk3jkLoVsLMR0BVLCNps7bxP/9ZDVH+drKy51vG5ubWCbGoCBZ4nsfUSjcNUJXJtY8Tt6xGf3F9Dap/BoIsSFeNRzOjuB0xnC9YHCqNAakllLPgRFkWoDLKqci5Sj25nxPHplOk8JV4Z8u/BDxz+sU9WC26PRxRFxdE/F6S5wB+MqcNVjAE5Sy256xK2hwz7Pb5/9h0nFw2Ty5yr0wMe7t4iikP6ccR41CdoR4iywNSW87RCfP3prmuvvYa1Fq/JmM3OOT1LCOM211diBj2NtQ2ep5AK0kISxCMuk0umL/cRTx6/4YrlEqUDIiVotUOss/haYxsHDRih+PE4R6RTHtxY42VyBWHI0Gshvnp015XLEl8rpBPUpkR5Gk8IpFZYJJ7S7B3N6KJwRvA8mbF5c5vkeIIUyqPVjrDCw2mJVj5lI3gxWfDr4RmetBTlgrc2FDtbHsq3JJXj74MT7nQGyFm2ZLFsKOqGq8KQGcv+8ZzJZcnqsEthQaqQuo5Jpj7F3NIqC97eXGdna5v/AKS6WjOZUKYQAAAAAElFTkSuQmCC"
          }
        }
      ]
    }
  ]
}
```
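For reference, a payload with this shape can be assembled from raw image bytes. This is a hedged sketch of the cloud Gemini REST request format shown above (the helper names `bytesToBase64` and `buildInlineDataRequest` are hypothetical, not part of any API):

```javascript
// Convert raw bytes to base64. btoa() expects a binary string,
// so build one character per byte first.
function bytesToBase64(bytes) {
  let binary = '';
  for (const b of bytes) binary += String.fromCharCode(b);
  return btoa(binary);
}

// Assemble the "contents"/"parts"/"inlineData" request body shown above.
function buildInlineDataRequest(promptText, mimeType, bytes) {
  return {
    contents: [
      {
        parts: [
          { text: promptText },
          { inlineData: { mimeType, data: bytesToBase64(bytes) } },
        ],
      },
    ],
  };
}
```

The four-byte PNG magic prefix (`0x89 0x50 0x4E 0x47`) encodes to `iVBORw==`, which is why base64-encoded PNGs like the one above always start with `iVBOR`.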

Thomas Steiner

Sep 10, 2025, 7:00:24 AM
to Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions
Hi Jeremy,

Here's a zero-nuance, no-error-checking, reduced-to-the-absolute-minimum example: https://tomayac.github.io/throw-aways/multimodal-prompt-api/

[Screenshot: Screenshot 2025-09-10 at 12.58.57.png]
The entire source code is this:

```
<img id="img" src="image.png" alt="" />
<output id="output"></output>
<script type="module">
  const session = await LanguageModel.create({
    expectedInputs: [{ type: 'image' }, { type: 'text', languages: ['en'] }],
  });
  const stream = session.promptStreaming([
    {
      role: 'user',
      content: [
        {
          type: 'image',
          value: img,
        },
        {
          type: 'text',
          value: 'Describe this image in a few words.',
        },
      ],
    },
  ]);
  for await (const chunk of stream) {
    output.append(chunk);
  }
</script>
```
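One defensive touch worth adding (a sketch, not part of the demo above): feature-detect the API before calling create(), since `LanguageModel` is only exposed when the flags or origin trial are active. The helper takes the global scope as a parameter purely to make it easy to test:

```javascript
// Returns true if the built-in Prompt API appears to be available
// in the given scope (defaults to the real global scope).
function hasPromptAPI(scope = globalThis) {
  return (
    typeof scope.LanguageModel !== 'undefined' &&
    typeof scope.LanguageModel.create === 'function'
  );
}

// Usage sketch:
// if (hasPromptAPI()) {
//   const session = await LanguageModel.create({ /* … */ });
// } else {
//   output.textContent = 'Built-in Prompt API not available in this browser.';
// }
```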


Hope this unblocks you!

Cheers,
Tom




--
Thomas Steiner, PhD, Developer Relations Engineer (blog.tomayac.com, toot.cafe/@tomayac)


François Beaufort

Sep 10, 2025, 7:58:53 AM
to Thomas Steiner, Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions

Thomas Steiner

Sep 10, 2025, 8:12:21 AM
to François Beaufort, Thomas Steiner, Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions
Oh, I see. Hence the base64 encoding… (Gemini Code Assist always suggests code like this, even when coding against the built-in API, so I was misled; but you're right, it says Gemini API in the text.) Anyway, in that case, Jeremy: this mailing list is focused on the built-in AI APIs. If you want to consider using them, I hope my example will be helpful, and according to your Discord message you have signed up for the Built-in AI Challenge, so I assume you are interested :-)

Jeremy Ellis

Sep 11, 2025, 10:38:25 PM
to Chrome Built-in AI Early Preview Program Discussions, Thomas Steiner, Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions, François Beaufort
This is getting useful. I think I have been searching incorrectly: as soon as I mention "Chrome Gemini API" I get odd results, so I should be searching for "Chrome built-in AI API" instead? I will see if that helps. Yes, I am in the Built-in AI Challenge, which is why I want to get this code working well enough to understand it. Here is my demo; I want to get image and sound loading working, and to fix a minor issue with the execute stop button, which works for streaming but not for the other prompts.

 https://hpssjellis.github.io/my-examples-of-web-llm/public/webllm00.html

Also, Thomas, thank you for the code, which is as bare-bones as it gets. However, if I try this off-site I get a cross-origin error, even when the image is in the same folder: "SecurityError: Source would taint origin." I will try a few other ways to load the image. At least you have given me a starting point.
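One workaround for the tainting error is to fetch the image as a Blob and pass that to the prompt instead of the <img> element. A sketch, assuming the image is served same-origin or with CORS headers (helper names are hypothetical):

```javascript
// Fetch the image file as a Blob so no cross-origin pixel read
// from an <img>/<canvas> is ever needed.
async function fetchImageBlob(url) {
  const response = await fetch(url); // same-origin, or server must send CORS headers
  if (!response.ok) throw new Error(`Failed to fetch ${url}: HTTP ${response.status}`);
  return response.blob();
}

// Build the message array the Prompt API expects for an image + text prompt.
function buildImageMessage(image, instruction) {
  return [
    {
      role: 'user',
      content: [
        { type: 'image', value: image },
        { type: 'text', value: instruction },
      ],
    },
  ];
}

// Usage sketch:
// const blob = await fetchImageBlob('image.png');
// const stream = session.promptStreaming(buildImageMessage(blob, 'Describe this image.'));
```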


Jeremy Ellis

Sep 11, 2025, 11:06:25 PM
to Chrome Built-in AI Early Preview Program Discussions, Jeremy Ellis, Thomas Steiner, Chrome Built-in AI Early Preview Program Discussions, François Beaufort
Thanks so much, Chrome Team. That is perfect. My demo works both online and offline when I use a blob for the image. My demo is here: https://hpssjellis.github.io/my-examples-of-web-llm/public/bare-stream00.html

Can someone reply as to whether sound can be loaded into the Chrome built-in multimodal model yet? Now that I know what to search for, I will see if I can get translation working offline.

Jeremy Ellis

Sep 12, 2025, 2:15:37 AM
to Chrome Built-in AI Early Preview Program Discussions, Jeremy Ellis, Thomas Steiner, Chrome Built-in AI Early Preview Program Discussions, François Beaufort
No worries about the language detection and translation APIs; I generated examples easily now that I am searching for "Chrome built-in AI".

The only thing I am stumped on is audio-to-text using the multimodal model. I think it is only in origin trials, which I don't want to work with. Can someone confirm that audio-to-text is not yet ready for Chrome 138 with flags?

Thomas Steiner

Sep 12, 2025, 4:29:01 AM
to Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions, Thomas Steiner, François Beaufort
Hey Jeremy,

Audio input for the Prompt API works with the flags for local experimentation, and it's also part of the origin trial for testing with real users. Here's a minimal example: https://chrome.dev/web-ai-demos/mediarecorder-audio-prompt/

Cheers,
Tom

François Beaufort

Sep 12, 2025, 4:35:00 AM
to Thomas Steiner, Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions
FYI some Multimodal built-in AI demos are available at https://chrome.dev/web-ai-demos/io2025.html#multimodal.
And you can find source code at https://github.com/GoogleChromeLabs/web-ai-demos

Connie Leung

Sep 12, 2025, 10:22:05 AM
to Chrome Built-in AI Early Preview Program Discussions, François Beaufort, Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions, Thomas Steiner
Hi Tom,


The demo broke in Chrome stable and I have not had time to debug it.
[Screenshot: Screenshot 2025-09-12 at 10.21.14 PM.png]

François Beaufort

Sep 12, 2025, 10:27:33 AM
to Connie Leung, Chrome Built-in AI Early Preview Program Discussions, Jeremy Ellis, Thomas Steiner
Argh. It looks like we're hitting https://issues.chromium.org/issues/441711146
I wonder if we'll have to apply something similar to https://github.com/GoogleChromeLabs/web-ai-demos/pull/182/files @Thomas Steiner 

Thomas Steiner

Sep 12, 2025, 10:27:33 AM
to Connie Leung, Chrome Built-in AI Early Preview Program Discussions, François Beaufort, Jeremy Ellis, Thomas Steiner
I just tested in Stable, and while it took a short moment, it was working just fine (it only shows the language warning; the error message is from an unrelated extension I have installed):

[Screenshot: Screenshot 2025-09-12 at 16.26.12.png]



Thomas Steiner

Sep 12, 2025, 10:31:54 AM
to François Beaufort, Connie Leung, Chrome Built-in AI Early Preview Program Discussions, Jeremy Ellis, Thomas Steiner
Ah, good instinct. I have the model downloaded, and this bug only hits when the model still needs to be downloaded. That's definitely https://issues.chromium.org/issues/441711146, and you need to manually pass a topK. François, I'll rubber-stamp your PR when it's ready.

Jeremy Ellis

Sep 14, 2025, 3:49:10 PM
to Chrome Built-in AI Early Preview Program Discussions, Thomas Steiner, Connie Leung, Chrome Built-in AI Early Preview Program Discussions, Jeremy Ellis, François Beaufort
I did some of my own checking with both up-to-date Chrome and Chrome Canary, and I could not get audio loading into the multimodal model. I could not find a way to set topK before loading the model. Everything else is going well.

Thomas Steiner

Sep 15, 2025, 3:01:09 AM
to Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions, Thomas Steiner, Connie Leung, François Beaufort
Hey Jeremy,

As per my previous example's rules, here's a zero-nuance, no-error-handling audio demo that hopefully gets you unblocked: https://tomayac.github.io/throw-aways/multimodal-prompt-api-audio/. Note that you can't pass the <audio> element directly (yet); you need to pass, for example, the blob. The topK value can only be set at session creation time and cannot be changed later.
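Sketched out with a hypothetical buildSessionOptions helper (the validation rules are my assumptions, not API requirements; the options object shape matches the examples in this thread). If you set sampling parameters at all, set topK and temperature together at creation time:

```javascript
// Build the options object for LanguageModel.create(). topK and temperature
// are fixed for the lifetime of the session, so validate them up front.
function buildSessionOptions({ topK, temperature }) {
  if (!Number.isInteger(topK) || topK < 1) {
    throw new RangeError('topK must be a positive integer');
  }
  if (typeof temperature !== 'number' || temperature < 0) {
    throw new RangeError('temperature must be a number >= 0');
  }
  return {
    topK,
    temperature,
    expectedInputs: [{ type: 'audio' }, { type: 'text', languages: ['en'] }],
    expectedOutputs: [{ type: 'text', languages: ['en'] }],
  };
}

// Usage sketch:
// const session = await LanguageModel.create(buildSessionOptions({ topK: 3, temperature: 0 }));
```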

Cheers,
Tom

Jeremy Ellis

Sep 16, 2025, 9:40:06 AM
to Chrome Built-in AI Early Preview Program Discussions, Thomas Steiner, Chrome Built-in AI Early Preview Program Discussions, Connie Leung, François Beaufort, Jeremy Ellis
Thanks for trying, Thomas. Even with topK and temperature I am still getting errors that the model is not available. Any other suggestions? Here is my slightly changed code:
```
<audio controls preload="auto"><source src="hello.mp3" /></audio>
<button id="button" type="button">Transcribe</button>

<output id="output"></output>
<script type="module">
  button.addEventListener('click', async () => {
    const blob = await fetch('hello.mp3').then((response) => response.blob());
    output.textContent = '';

    const session = await LanguageModel.create({
      topK: 3, // Example: only consider the 3 most likely next tokens at each step.
      temperature: 0.0, // Added the temperature parameter to fix the error.
      expectedInputs: [{ type: 'audio' }, { type: 'text', languages: ['en'] }],
      expectedOutputs: [{ type: 'text', languages: ['en'] }],
    });
    const stream = session.promptStreaming([
      {
        role: 'user',
        content: [
          {
            type: 'audio',
            value: blob,
          },
          {
            type: 'text',
            value: 'Transcribe this audio file.',
          },
        ],
      },
    ]);
    for await (const chunk of stream) {
      output.append(chunk);
    }
  });
</script>
```

Thomas Steiner

Sep 16, 2025, 9:49:18 AM
to Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions, Thomas Steiner, Connie Leung, François Beaufort
What device are you using? We know for a fact that, for example, 2019 MacBook Pro laptops with Intel chips don't support the audio modality. Do you have access to a different device, just to test this theory?
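For what it's worth, availability can be probed per modality before creating a session, which helps distinguish "unsupported device" from "model not downloaded yet". A sketch (interpretAvailability and checkAudioSupport are hypothetical helpers; LanguageModel.availability() resolves to one of 'unavailable', 'downloadable', 'downloading', or 'available'):

```javascript
// Map an availability string to two simple booleans.
function interpretAvailability(state) {
  return {
    usable: state === 'available',
    needsDownload: state === 'downloadable' || state === 'downloading',
  };
}

// Ask whether this device/browser can handle audio input at all.
async function checkAudioSupport() {
  const state = await LanguageModel.availability({
    expectedInputs: [{ type: 'audio' }],
  });
  return interpretAvailability(state); // { usable, needsDownload }
}
```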

François Beaufort

Sep 16, 2025, 9:50:37 AM
to Thomas Steiner, Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions, Connie Leung
Could you also share a screenshot of chrome://on-device-internals/?

François Beaufort

Sep 16, 2025, 10:04:36 AM
to Thomas Steiner, Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions, Connie Leung


Could you also share a screenshot of chrome://on-device-internals/ with Chrome Canary, please?

Jeremy Ellis

Sep 16, 2025, 5:08:27 PM
to Chrome Built-in AI Early Preview Program Discussions, François Beaufort, Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions, Connie Leung, Thomas Steiner
Multimodal sound seems to be working now (this code was not working a few days ago; perhaps I had something set wrong). I am on Windows 11, Chrome Version 140.0.7339.128 (Official Build) (64-bit). Not sure what changed since Chrome 138+.

This demo works for me and gives much more than just the transcription: https://hpssjellis.github.io/my-examples-of-web-llm/public/sound00.html

It also works when offline. For example:

"The audio is a simple, iconic sound: "Hello, World!". It's spoken clearly, with a neutral, slightly robotic tone. The pronunciation of each word is distinct, and the spacing between them is deliberate. The audio likely represents the first program a programmer writes when learning a new programming language. It's a foundational concept, symbolizing a successful setup and a starting point. The audio is short and straightforward, lacking any musical accompaniment or background sounds. It's purely auditory, focusing solely on the spoken phrase. "

Jeremy Ellis

Sep 16, 2025, 11:03:43 PM
to Chrome Built-in AI Early Preview Program Discussions, Jeremy Ellis, François Beaufort, Chrome Built-in AI Early Preview Program Discussions, Connie Leung, Thomas Steiner
Just to make things interesting: both my desktop (AMD Ryzen 5 5500U with Radeon Graphics) and my newer laptop (13th Gen Intel Core i9-13900H) are running the same version of Windows and the same version of Chrome with the same flags set, yet the Image Sound Multimodal code at https://hpssjellis.github.io/my-examples-of-web-llm/public/sound00.html does not work on my older desktop and runs fine on my newer laptop. That is probably why the last few days have been so frustrating.

P.S. A lot of my TinyML Arduino code has not worked well on the desktop computer either, so my setup might be questionable.

Thomas Steiner

Sep 17, 2025, 4:34:55 AM
to Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions, François Beaufort, Connie Leung, Thomas Steiner
Hi Jeremy,

Glad we could track the problem down to being device-dependent. With regard to the demo: its prompt asks the model to describe, not transcribe, the audio, which is what it does.

Cheers,
Tom

Connie Leung

Sep 17, 2025, 4:40:23 AM
to Thomas Steiner, Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions, François Beaufort
I recorded a video showing these hidden pages, in case you do not know about them. Tom showed them to me at Google I/O Connect China in August:

https://youtu.be/vlUUbGs_AB0?si=PzH0XBkTWGLwRv6O


Jeremy Ellis

Sep 17, 2025, 10:08:08 PM
to Chrome Built-in AI Early Preview Program Discussions, Thomas Steiner, Chrome Built-in AI Early Preview Program Discussions, François Beaufort, Connie Leung, Jeremy Ellis
Thanks, Thomas. Transcribe and describe together work really well. Demo here: sound00.html

Jer  

Thomas Steiner

Sep 18, 2025, 6:44:58 AM
to Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions, Thomas Steiner, François Beaufort, Connie Leung
Glad to hear, Jeremy :-) It's been a bit of a journey, but we solved the various mysteries :-) Happy hacking!