Multimodal image and voice input update.


Jeremy Ellis

Sep 10, 2025, 12:37:02 AM
to Chrome Built-in AI Early Preview Program Discussions
Is there an update on multimodal support in the Chrome Gemini API? I notice it is not mentioned much, but using a complex JSON input I am getting some kind of base64 analysis. Is there a reasonably recent demo covering both image and sound multimodal input? My test code always gets an `[object Object]` error when I try to pass a blob to the model, but using this JSON input as the prompt produced some (inaccurate) output. At least it did not throw an error. It confused the colors and content, but it handled the image and knew it was a small pixelated PNG, say 24x24.

```
{
  "contents": [
    {
      "parts": [
        {
          "text": "Describe the image. Is it a picture of something specific?"
        },
        {
          "inlineData": {
            "mimeType": "image/png",
            "data": "iVBORw0KGgoAAAANSUhEUgAAABAAAAAPCAYAAADtc08vAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsEAAA7BAbiRa+0AAAGHaVRYdFhNTDpjb20uYWRvYmUueG1wAAAAAAA8P3hwYWNrZXQgYmVnaW49J++7vycgaWQ9J1c1TTBNcENlaGlIenJlU3pOVGN6a2M5ZCc/Pg0KPHg6eG1wbWV0YSB4bWxuczp4PSJhZG9iZTpuczptZXRhLyI+PHJkZjpSREYgeG1sbnM6cmRmPSJodHRwOi8vd3d3LnczLm9yZy8xOTk5LzAyLzIyLXJkZi1zeW50YXgtbnMjIj48cmRmOkRlc2NyaXB0aW9uIHJkZjphYm91dD0idXVpZDpmYWY1YmRkNS1iYTNkLTExZGEtYWQzMS1kMzNkNzUxODJmMWIiIHhtbG5zOnRpZmY9Imh0dHA6Ly9ucy5hZG9iZS5jb20vdGlmZi8xLjAvIj48dGlmZjpPcmllbnRhdGlvbj4xPC90aWZmOk9yaWVudGF0aW9uPjwvcmRmOkRlc2NyaXB0aW9uPjwvcmRmOlJERj48L3g6eG1wbWV0YT4NCjw/eHBhY2tldCBlbmQ9J3cnPz4slJgLAAAC30lEQVQ4TyXN0W7bVACA4f8cn2M7jpM0SZuu7ZrBhtjUTaAyJDaBEIg7uJp2wSXwDNyjXfEAvMKeA6FJIEQFQlRMHUJVtUILaVOncR3bsc/x4YLvBT7x9MuPnTUlUaeDWVZIBzoMQEqcqfBDH+X7LIxHRYuIOcpTeH5AXdfIqiwBR5ZmFHlJWZaYusbaitboFWwwZJEVZIua3niXrfuPKI1iNj3jcjZH5nmBg/9HBDIIaFxNEK9z490v2H7wGNXqoMyc/K+f6Q2uce+jzzFEOGvwPry3+aQxBqk1UvmUeY7vS4xtaLUHlMkJV5MXeFrR2Jzs/JDffv+TX37aY33QRnzz2Ttu+Ood5umMKk0QrqHXCVnmGUEQMc0bkqzkzdc3qI3D1BUXSUrjJJ04RFJlPNvb5+m3B3gItNaIIEaP38OLekyuFjw/ukAIATgcgtXVPqO1Nk3jkLoVsLMR0BVLCNps7bxP/9ZDVH+drKy51vG5ubWCbGoCBZ4nsfUSjcNUJXJtY8Tt6xGf3F9Dap/BoIsSFeNRzOjuB0xnC9YHCqNAakllLPgRFkWoDLKqci5Sj25nxPHplOk8JV4Z8u/BDxz+sU9WC26PRxRFxdE/F6S5wB+MqcNVjAE5Sy256xK2hwz7Pb5/9h0nFw2Ty5yr0wMe7t4iikP6ccR41CdoR4iywNSW87RCfP3prmuvvYa1Fq/JmM3OOT1LCOM211diBj2NtQ2ep5AK0kISxCMuk0umL/cRTx6/4YrlEqUDIiVotUOss/haYxsHDRih+PE4R6RTHtxY42VyBWHI0Gshvnp015XLEl8rpBPUpkR5Gk8IpFZYJJ7S7B3N6KJwRvA8mbF5c5vkeIIUyqPVjrDCw2mJVj5lI3gxWfDr4RmetBTlgrc2FDtbHsq3JJXj74MT7nQGyFm2ZLFsKOqGq8KQGcv+8ZzJZcnqsEthQaqQuo5Jpj7F3NIqC97eXGdna5v/AKS6WjOZUKYQAAAAAElFTkSuQmCC"
          }
        }
      ]
    }
  ]
}
```
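For reference, a payload with this shape can be assembled from raw image bytes. This is a hedged sketch of the cloud Gemini REST request format shown above (the helper names `bytesToBase64` and `buildInlineDataRequest` are hypothetical, not part of any API):

```javascript
// Convert raw bytes to base64. btoa() expects a binary string,
// so build one character per byte first.
function bytesToBase64(bytes) {
  let binary = '';
  for (const b of bytes) binary += String.fromCharCode(b);
  return btoa(binary);
}

// Assemble the "contents"/"parts"/"inlineData" request body shown above.
function buildInlineDataRequest(promptText, mimeType, bytes) {
  return {
    contents: [
      {
        parts: [
          { text: promptText },
          { inlineData: { mimeType, data: bytesToBase64(bytes) } },
        ],
      },
    ],
  };
}
```

The four-byte PNG magic prefix (`0x89 0x50 0x4E 0x47`) encodes to `iVBORw==`, which is why base64-encoded PNGs like the one above always start with `iVBOR`.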

Thomas Steiner

Sep 10, 2025, 7:00:24 AM
to Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions
Hi Jeremy,

Here's a zero-nuance, no-error-checking, reduced-to-the-absolute-minimum example: https://tomayac.github.io/throw-aways/multimodal-prompt-api/

[Screenshot: Screenshot 2025-09-10 at 12.58.57.png]
The entire source code is this:

```
<img id="img" src="image.png" alt="" />
<output id="output"></output>
<script type="module">
  const session = await LanguageModel.create({
    expectedInputs: [{ type: 'image' }, { type: 'text', languages: ['en'] }],
  });
  const stream = session.promptStreaming([
    {
      role: 'user',
      content: [
        {
          type: 'image',
          value: img,
        },
        {
          type: 'text',
          value: 'Describe this image in a few words.',
        },
      ],
    },
  ]);
  for await (const chunk of stream) {
    output.append(chunk);
  }
</script>
```
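One defensive touch worth adding (a sketch, not part of the demo above): feature-detect the API before calling create(), since `LanguageModel` is only exposed when the flags or origin trial are active. The helper takes the global scope as a parameter purely to make it easy to test:

```javascript
// Returns true if the built-in Prompt API appears to be available
// in the given scope (defaults to the real global scope).
function hasPromptAPI(scope = globalThis) {
  return (
    typeof scope.LanguageModel !== 'undefined' &&
    typeof scope.LanguageModel.create === 'function'
  );
}

// Usage sketch:
// if (hasPromptAPI()) {
//   const session = await LanguageModel.create({ /* … */ });
// } else {
//   output.textContent = 'Built-in Prompt API not available in this browser.';
// }
```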


Hope this unblocks you!

Cheers,
Tom




--
Thomas Steiner, PhD, Developer Relations Engineer (blog.tomayac.com, toot.cafe/@tomayac)


François Beaufort

Sep 10, 2025, 7:58:53 AM
to Thomas Steiner, Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions

Thomas Steiner

Sep 10, 2025, 8:12:21 AM
to François Beaufort, Thomas Steiner, Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions
Oh, I see. Hence the base64 encoding… (Gemini Code Assist always suggests code like this, even when coding against the built-in API, so I was misled; but you're right, it says Gemini API in the text.) Anyway, in that case, Jeremy: this mailing list is focused on the built-in AI APIs. If you want to consider using them, I hope my example will be helpful, and according to your Discord message you have signed up for the Built-in AI Challenge, so I assume you are interested :-)

Jeremy Ellis

Sep 11, 2025, 10:38:25 PM
to Chrome Built-in AI Early Preview Program Discussions, Thomas Steiner, Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions, François Beaufort
This is getting useful. I think I have been searching incorrectly: as soon as I mention "Chrome Gemini API" I get odd results, so I should be searching for "Chrome built-in AI API" instead? I will see if that helps. Yes, I am in the Built-in AI Challenge, which is why I want to get this code working well enough to understand it. Here is my demo; I want to get image and sound loading working, and to fix a minor issue with the execute stop button, which works for streaming but not for the other prompts.

 https://hpssjellis.github.io/my-examples-of-web-llm/public/webllm00.html

Also, Thomas, thank you for the code, which is as bare-bones as it gets. However, if I try this off-site I get a cross-origin error, even when the image is in the same folder: "SecurityError: Source would taint origin." I will try a few other ways to load the image. At least you have given me a starting point.
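One workaround for the tainting error is to fetch the image as a Blob and pass that to the prompt instead of the <img> element. A sketch, assuming the image is served same-origin or with CORS headers (helper names are hypothetical):

```javascript
// Fetch the image file as a Blob so no cross-origin pixel read
// from an <img>/<canvas> is ever needed.
async function fetchImageBlob(url) {
  const response = await fetch(url); // same-origin, or server must send CORS headers
  if (!response.ok) throw new Error(`Failed to fetch ${url}: HTTP ${response.status}`);
  return response.blob();
}

// Build the message array the Prompt API expects for an image + text prompt.
function buildImageMessage(image, instruction) {
  return [
    {
      role: 'user',
      content: [
        { type: 'image', value: image },
        { type: 'text', value: instruction },
      ],
    },
  ];
}

// Usage sketch:
// const blob = await fetchImageBlob('image.png');
// const stream = session.promptStreaming(buildImageMessage(blob, 'Describe this image.'));
```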


Jeremy Ellis

Sep 11, 2025, 11:06:25 PM
to Chrome Built-in AI Early Preview Program Discussions, Jeremy Ellis, Thomas Steiner, Chrome Built-in AI Early Preview Program Discussions, François Beaufort
Thanks so much, Chrome Team. That is perfect. My demo works both online and offline when I use a blob for the image. My demo is here: https://hpssjellis.github.io/my-examples-of-web-llm/public/bare-stream00.html

Can someone reply as to whether sound can be loaded into the Chrome built-in multimodal model yet? Now that I know what to search for, I will see if I can get translation working offline.

Jeremy Ellis

Sep 12, 2025, 2:15:37 AM
to Chrome Built-in AI Early Preview Program Discussions, Jeremy Ellis, Thomas Steiner, Chrome Built-in AI Early Preview Program Discussions, François Beaufort
No worries about the language detection and translation APIs; I generated examples easily now that I am searching for "Chrome built-in AI".

The only thing I am stumped on is audio-to-text using the multimodal model. I think it is only in origin trials, which I don't want to work with. Can someone confirm that audio-to-text is not yet ready for Chrome 138 with flags?

Thomas Steiner

Sep 12, 2025, 4:29:01 AM
to Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions, Thomas Steiner, François Beaufort
Hey Jeremy,

Audio input for the Prompt API works with the flags for local experimentation, and it's also part of the origin trial for testing with real users. Here's a minimal example: https://chrome.dev/web-ai-demos/mediarecorder-audio-prompt/

Cheers,
Tom

François Beaufort

Sep 12, 2025, 4:35:00 AM
to Thomas Steiner, Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions
FYI some Multimodal built-in AI demos are available at https://chrome.dev/web-ai-demos/io2025.html#multimodal.
And you can find source code at https://github.com/GoogleChromeLabs/web-ai-demos

Connie Leung

Sep 12, 2025, 10:22:05 AM
to Chrome Built-in AI Early Preview Program Discussions, François Beaufort, Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions, Thomas Steiner
Hi Tom,


The demo broke in Chrome stable and I have not had time to debug it.
[Screenshot: Screenshot 2025-09-12 at 10.21.14 PM.png]

François Beaufort

Sep 12, 2025, 10:27:33 AM
to Connie Leung, Chrome Built-in AI Early Preview Program Discussions, Jeremy Ellis, Thomas Steiner
Argh. It looks like we're hitting https://issues.chromium.org/issues/441711146
I wonder if we'll have to apply something similar to https://github.com/GoogleChromeLabs/web-ai-demos/pull/182/files @Thomas Steiner 

Thomas Steiner

Sep 12, 2025, 10:27:33 AM
to Connie Leung, Chrome Built-in AI Early Preview Program Discussions, François Beaufort, Jeremy Ellis, Thomas Steiner
I just tested in Stable, and while it took a short moment, it was working just fine (it only shows the language warning; the error message is from an unrelated extension I have installed):

[Screenshot: Screenshot 2025-09-12 at 16.26.12.png]



Thomas Steiner

Sep 12, 2025, 10:31:54 AM
to François Beaufort, Connie Leung, Chrome Built-in AI Early Preview Program Discussions, Jeremy Ellis, Thomas Steiner
Ah, good instinct. I have the model downloaded, and this bug only hits when the model still needs to be downloaded. That's definitely https://issues.chromium.org/issues/441711146, and you need to manually pass a topK. François, I'll rubber-stamp your PR when it's ready.

Jeremy Ellis

Sep 14, 2025, 3:49:10 PM
to Chrome Built-in AI Early Preview Program Discussions, Thomas Steiner, Connie Leung, Chrome Built-in AI Early Preview Program Discussions, Jeremy Ellis, François Beaufort
I did some of my own checking with both up-to-date Chrome and Chrome Canary, and I could not get audio loading into the multimodal model. I could not find a way to set topK before loading the model. Everything else is going well.

Thomas Steiner

Sep 15, 2025, 3:01:09 AM
to Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions, Thomas Steiner, Connie Leung, François Beaufort
Hey Jeremy,

As per my previous example's rules, here's a zero-nuance, no-error-handling audio demo that hopefully gets you unblocked: https://tomayac.github.io/throw-aways/multimodal-prompt-api-audio/. Note that you can't pass the <audio> element directly (yet); you need to pass, for example, the blob. The topK value can only be set at session creation time and cannot be changed later.
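Sketched out with a hypothetical buildSessionOptions helper (the validation rules are my assumptions, not API requirements; the options object shape matches the examples in this thread). If you set sampling parameters at all, set topK and temperature together at creation time:

```javascript
// Build the options object for LanguageModel.create(). topK and temperature
// are fixed for the lifetime of the session, so validate them up front.
function buildSessionOptions({ topK, temperature }) {
  if (!Number.isInteger(topK) || topK < 1) {
    throw new RangeError('topK must be a positive integer');
  }
  if (typeof temperature !== 'number' || temperature < 0) {
    throw new RangeError('temperature must be a number >= 0');
  }
  return {
    topK,
    temperature,
    expectedInputs: [{ type: 'audio' }, { type: 'text', languages: ['en'] }],
    expectedOutputs: [{ type: 'text', languages: ['en'] }],
  };
}

// Usage sketch:
// const session = await LanguageModel.create(buildSessionOptions({ topK: 3, temperature: 0 }));
```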

Cheers,
Tom

Jeremy Ellis

Sep 16, 2025, 9:40:06 AM
to Chrome Built-in AI Early Preview Program Discussions, Thomas Steiner, Chrome Built-in AI Early Preview Program Discussions, Connie Leung, François Beaufort, Jeremy Ellis
Thanks for trying, Thomas. Even with topK and temperature I am still getting errors that the model is not available. Any other suggestions? Here is my slightly changed code:
```
<audio controls preload="auto"><source src="hello.mp3" /></audio>
<button id="button" type="button">Transcribe</button>

<output id="output"></output>
<script type="module">
  button.addEventListener('click', async () => {
    const blob = await fetch('hello.mp3').then((response) => response.blob());
    output.textContent = '';

    const session = await LanguageModel.create({
      topK: 3, // Example: only consider the 3 most likely next tokens at each step.
      temperature: 0.0, // Added the temperature parameter to fix the error.
      expectedInputs: [{ type: 'audio' }, { type: 'text', languages: ['en'] }],
      expectedOutputs: [{ type: 'text', languages: ['en'] }],
    });
    const stream = session.promptStreaming([
      {
        role: 'user',
        content: [
          {
            type: 'audio',
            value: blob,
          },
          {
            type: 'text',
            value: 'Transcribe this audio file.',
          },
        ],
      },
    ]);
    for await (const chunk of stream) {
      output.append(chunk);
    }
  });
</script>
```

Thomas Steiner

Sep 16, 2025, 9:49:18 AM
to Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions, Thomas Steiner, Connie Leung, François Beaufort
What device are you using? We know for a fact that, for example, 2019 MacBook Pro laptops with Intel chips don't support the audio modality. Do you have access to a different device, just to test this theory?
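For what it's worth, availability can be probed per modality before creating a session, which helps distinguish "unsupported device" from "model not downloaded yet". A sketch (interpretAvailability and checkAudioSupport are hypothetical helpers; LanguageModel.availability() resolves to one of 'unavailable', 'downloadable', 'downloading', or 'available'):

```javascript
// Map an availability string to two simple booleans.
function interpretAvailability(state) {
  return {
    usable: state === 'available',
    needsDownload: state === 'downloadable' || state === 'downloading',
  };
}

// Ask whether this device/browser can handle audio input at all.
async function checkAudioSupport() {
  const state = await LanguageModel.availability({
    expectedInputs: [{ type: 'audio' }],
  });
  return interpretAvailability(state); // { usable, needsDownload }
}
```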

François Beaufort

Sep 16, 2025, 9:50:37 AM
to Thomas Steiner, Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions, Connie Leung
Could you also share a screenshot of chrome://on-device-internals/?

François Beaufort

Sep 16, 2025, 10:04:36 AM
to Thomas Steiner, Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions, Connie Leung


Could you also share a screenshot of chrome://on-device-internals/ with Chrome Canary, please?

Jeremy Ellis

Sep 16, 2025, 5:08:27 PM
to Chrome Built-in AI Early Preview Program Discussions, François Beaufort, Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions, Connie Leung, Thomas Steiner
Multimodal sound seems to be working now (this code was not working a few days ago; perhaps I had something set wrong). I am on Windows 11, Chrome Version 140.0.7339.128 (Official Build) (64-bit). Not sure what changed since Chrome 138+.

This demo works for me and gives much more than just the transcription: https://hpssjellis.github.io/my-examples-of-web-llm/public/sound00.html

It also works when offline. For example:

"The audio is a simple, iconic sound: "Hello, World!". It's spoken clearly, with a neutral, slightly robotic tone. The pronunciation of each word is distinct, and the spacing between them is deliberate. The audio likely represents the first program a programmer writes when learning a new programming language. It's a foundational concept, symbolizing a successful setup and a starting point. The audio is short and straightforward, lacking any musical accompaniment or background sounds. It's purely auditory, focusing solely on the spoken phrase. "

Jeremy Ellis

Sep 16, 2025, 11:03:43 PM
to Chrome Built-in AI Early Preview Program Discussions, Jeremy Ellis, François Beaufort, Chrome Built-in AI Early Preview Program Discussions, Connie Leung, Thomas Steiner
Just to make things interesting: both my desktop (AMD Ryzen 5 5500U with Radeon Graphics) and my newer laptop (13th Gen Intel Core i9-13900H) are running the same version of Windows and the same version of Chrome with the same flags set, yet the Image Sound Multimodal code at https://hpssjellis.github.io/my-examples-of-web-llm/public/sound00.html does not work on my older desktop and runs fine on my newer laptop. That is probably why the last few days have been so frustrating.

P.S. A lot of my TinyML Arduino code has not worked well on the desktop computer either, so my setup might be questionable.

Thomas Steiner

Sep 17, 2025, 4:34:55 AM
to Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions, François Beaufort, Connie Leung, Thomas Steiner
Hi Jeremy,

Glad we could track the problem down to being device-dependent. With regard to the demo: its prompt asks the model to describe, not transcribe, the audio, which is what it does.

Cheers,
Tom

Connie Leung

Sep 17, 2025, 4:40:23 AM
to Thomas Steiner, Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions, François Beaufort
I recorded a video showing these hidden pages, in case you do not know about them. Tom showed them to me at Google I/O Connect China in August:

https://youtu.be/vlUUbGs_AB0?si=PzH0XBkTWGLwRv6O


Jeremy Ellis

Sep 17, 2025, 10:08:08 PM
to Chrome Built-in AI Early Preview Program Discussions, Thomas Steiner, Chrome Built-in AI Early Preview Program Discussions, François Beaufort, Connie Leung, Jeremy Ellis
Thanks, Thomas. Transcribe and describe together work really well. Demo here: sound00.html

Jer  

Thomas Steiner

Sep 18, 2025, 6:44:58 AM
to Jeremy Ellis, Chrome Built-in AI Early Preview Program Discussions, Thomas Steiner, François Beaufort, Connie Leung
Glad to hear, Jeremy :-) It's been a bit of a journey, but we solved the various mysteries :-) Happy hacking!