multimodal error for audio


Jeff Whelpley

unread,
Jun 2, 2025, 5:09:42 PM6/2/25
to Chrome Built-in AI Early Preview Program Discussions
Both when I try out the demo app here:


Or try my own code:

        const session = await LanguageModel.create({
            expectedInputs: [{ type: 'audio' }],
        });

I get the same error:

NotSupportedError: The device is unable to create a session to run the model. Please check the result of availability() first.

When I check availability here:

            const availability = await LanguageModel.availability({
                expectedInputs: [{ type: 'audio' }],
            });

I get back "available". The Prompt API works just fine without multimodal input. I am using the latest Chrome Canary. Maybe there is something else on my desktop that is causing this to fail? Or perhaps it is because I initially downloaded the model without audio and only later changed it to include audio?
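For completeness, the sequence I'm running is roughly this (trimmed sketch of my actual code; error handling is omitted elsewhere):

const opts = { expectedInputs: [{ type: 'audio' }] };

const availability = await LanguageModel.availability(opts);
console.log(availability); // logs "available"

try {
    const session = await LanguageModel.create(opts);
} catch (e) {
    // NotSupportedError: The device is unable to create a session to run the model...
    console.error(e);
}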

Jeff

Connie Leung

unread,
Jun 2, 2025, 8:00:58 PM6/2/25
to Chrome Built-in AI Early Preview Program Discussions, Jeff Whelpley

Hi all,

I just tried, and the same error happened to me. I need this feature to work because my AI sprint project idea is based on Chrome's Prompt API multimodal input.

Thanks,

Connie 

Connie Leung

unread,
Jun 2, 2025, 8:09:56 PM6/2/25
to Chrome Built-in AI Early Preview Program Discussions, Connie Leung, Jeff Whelpley
Hi Jeff
chrome://flags/#prompt-api-for-gemini-nano-multimodal-input

I enabled the above flag and the previous error went away, but another error occurred.

TypeError: Failed to execute 'promptStreaming' on 'LanguageModel': Failed to read the 'role' property from 'LanguageModelMessage': Required member is undefined.
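From the error, it sounds like promptStreaming now expects an array of message objects that each have an explicit role. I haven't worked out the exact shape yet, but presumably something like this (untested guess on my side):

const stream = session.promptStreaming([{
  role: 'user',
  content: [
    { type: 'text', value: 'Describe this audio' },
    { type: 'audio', value: audioBuffer },
  ],
}]);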

Phil Nash

unread,
Jun 3, 2025, 5:13:00 AM6/3/25
to Chrome Built-in AI Early Preview Program Discussions, Connie Leung, Jeff Whelpley
I have also been playing with this today. I discovered that the multimodal prompt API doesn't like inputs like:

model.prompt([
  "Transcribe this audio",
  { type: "audio", content: audioBuffer }
]);

But did respond when I tried:

model.prompt([{
  role: "user",
  content: [
    { type: "text", value: "Transcribe this audio" },
    { type: "audio", value: audioBuffer },
  ],
}]);

Though having said that, I must have done something wrong with my audio because it was only responding with text like "No, no, no, no".

Still, calling it like that worked, and it matches how the API is described in the explainer. Something must have changed, though, as the transcription demo doesn't work any more. I haven't tried images with that prompt format yet, but I hope they would work.
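Presumably an image prompt would follow the same pattern, something like this (untested on my part; assumes the session was created with expectedInputs: [{ type: "image" }] and that imageBitmap is an ImageBitmap or Blob):

model.prompt([{
  role: "user",
  content: [
    { type: "text", value: "Describe this image" },
    { type: "image", value: imageBitmap },
  ],
}]);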

Phil

Thomas Steiner

unread,
Jun 3, 2025, 5:16:44 AM6/3/25
to Phil Nash, François Beaufort, Chrome Built-in AI Early Preview Program Discussions, Connie Leung, Jeff Whelpley
The API shape Phil described is the one to go for. If you are encountering problems, please file a new Chromium bug with a minimal reproduction case if you can. Thank you! I will ask @François Beaufort to update the demo code to that shape.

model.prompt([{
  role: "user",
  content: [
    { type: "text", value: "Transcribe this audio" },
    { type: "audio", value: audioBuffer },
  ],
}]);

Cheers,
Tom

--
Thomas Steiner, PhD—Developer Relations Engineer (blog.tomayac.com, toot.cafe/@tomayac)

Google Germany GmbH, ABC-Str. 19, 20354 Hamburg, Germany
Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891

----- BEGIN PGP SIGNATURE -----
Version: GnuPG v2.4.3 (GNU/Linux)

iFy0uwAntT0bE3xtRa5AfeCheCkthAtTh3reSabiGbl0ck
0fjumBl3DCharaCTersAttH3b0ttom.xKcd.cOm/1181.
----- END PGP SIGNATURE -----

Thomas Steiner

unread,
Jun 3, 2025, 5:41:05 AM6/3/25
to Thomas Steiner, Phil Nash, François Beaufort, Chrome Built-in AI Early Preview Program Discussions, Connie Leung, Jeff Whelpley
There was recently an audio encoder bug (https://crbug.com/421242377, likely not accessible to most of you), but it has been fixed. So if you got "no no no" or similar responses, try again in a future Canary version. Thanks!

François Beaufort

unread,
Jun 3, 2025, 6:05:03 AM6/3/25
to Thomas Steiner, Phil Nash, Chrome Built-in AI Early Preview Program Discussions, Connie Leung, Jeff Whelpley

Jeff Whelpley

unread,
Jun 3, 2025, 9:22:48 PM6/3/25
to François Beaufort, Thomas Steiner, Phil Nash, Chrome Built-in AI Early Preview Program Discussions, Connie Leung
OK, so, I've been playing with this all day and I still can't seem to get multimodal with audio to work. When I try the test app you created here:


in Chrome Canary, I just get this:

Screenshot 2025-06-03 at 9.13.59 PM.png

I've created my own publicly available demo app here:


Transformers.js and the Web Speech API both work in this demo, but if you choose the Chrome Built-in AI model, the prompt returns something really unexpected. The code for this is publicly available here:

https://github.com/jeffwhelpley/angular-speech-recognition/blob/main/src/libs/managers/speech.manager.ts#L60

The key part of the code to look at is where I take the audio stream from my microphone and send it into the prompt:


No matter what audio I record and pass in, the result from the prompt call is the following:

The object you provided appears to be an object literal in JavaScript, also known as a JSON object.

**What it represents:**

* It likely contains a collection of named key-value pairs.
* Each pair represents a property of the object, and its value can be of various data types:
    * **Primitives:** `string`, `number`, `boolean`, `null`, `undefined`
    * **Complex data types:** Arrays, objects, functions

**Example:**

```javascript
const person = {
  name: "John Doe",
  age: 30,
  city: "New York",
  hobbies: ["reading", "coding"]
};
```

**How to use it:**

You can access the properties of an object using the dot notation (`.`). For example:

```javascript
console.log(person.name); // Output: "John Doe"
console.log(person.city); // Output: "New York"
```

**Other properties you might find:**

* `toString()` method: Returns the string representation of the object.
* `hasOwnProperty()` method: Checks if a property is a direct property of the object itself, and not inherited from its prototype (like `name` in the example).
* `toLocaleString()` method: Returns the string representation of the object in a locale-specific format.

**Note:** In JSON, the key-value pairs are typically separated by commas and enclosed in double quotes. The values are also enclosed in double quotes. For example:

```json
{
  "name": "John Doe",
  "age": 30,
  "city": "New York"
}
```

In JavaScript, you can use `JSON.stringify()` to convert objects to JSON strings and `JSON.parse()` to convert JSON strings to JavaScript objects.


Phil Nash

unread,
Jun 3, 2025, 9:48:13 PM6/3/25
to Jeff Whelpley, François Beaufort, Thomas Steiner, Chrome Built-in AI Early Preview Program Discussions, Connie Leung
Are you trying to perform a live transcription on the audio stream? I'm not sure it handles that very well. I've had success with both the demo and my own code, which I haven't pushed to GitHub yet, but will soon.

But what I found worked was to collect chunks of data using the MediaRecorder:

// stream and recorder are declared in an outer scope so the stop handler can use them
stream = await navigator.mediaDevices.getUserMedia({
  audio: true,
  video: false,
});
const mimeType = "audio/webm";
let chunks = [];
recorder = new MediaRecorder(stream, { mimeType });
recorder.addEventListener("dataavailable", (event) => {
  if (typeof event.data === "undefined") return;
  if (event.data.size === 0) return;
  chunks.push(event.data);
});

Then when the recording is complete, put the chunks into a Blob, get an array buffer from the Blob, and then decode the audio data from that array buffer with the Web Audio API, passing the result as the audio to the prompt API.

recorder.addEventListener("stop", async () => {
  let recording = new Blob(chunks, {
    type: mimeType,
  });

  const audioContext = new AudioContext();
  const audioBuffer = await audioContext.decodeAudioData(
    await recording.arrayBuffer()
  );

  const transcription = await transcriber.prompt([
    {
      role: "user",
        content: [
          { type: "text", value: "Transcribe this short audio." },
          { type: "audio", value: audioBuffer },
        ],
      },
    ]);

It looks like you're trying to do something similar using an AudioWorklet instead of MediaRecorder, but you are passing a blob to the prompt API and I think you will have to decode the audio data, as above, rather than just pass the blob. I'm guessing the prompt API is responding by describing JavaScript objects to you because it is just seeing an object it can't read, or even just [object Object], and it's doing what it can with it.
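Something like this should do the conversion (quick sketch, assuming audioBlob is the Blob you're currently passing in):

const audioContext = new AudioContext();
const audioBuffer = await audioContext.decodeAudioData(
  await audioBlob.arrayBuffer()
);
// then pass audioBuffer as { type: "audio", value: audioBuffer } in the prompt content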

I would also make sure you have the latest version of Canary; that might be what's causing the [noise] issues in the other demo. For reference, I'm on 139.0.7218.0 (Official Build) canary (arm64) and it's working for me.



Thomas Steiner

unread,
Jun 4, 2025, 3:47:08 AM6/4/25
to Phil Nash, Jeff Whelpley, François Beaufort, Thomas Steiner, Chrome Built-in AI Early Preview Program Discussions, Connie Leung
Jeff, you're missing one level of array nesting. This is what you have:

const result = await session.prompt({
  role: 'user',
  content: [
    { type: 'text', value: 'Please transcribe the audio' },
    { type: 'audio', value: audioBlob },
  ],
});

Instead try sending this:

const result = await session.prompt([{
  role: 'user',
  content: [
    { type: 'text', value: 'Please transcribe the audio' },
    { type: 'audio', value: audioBlob },
  ],
}]);

Sorry for the message formatting, sending this from mobile. 

Hope this works for you now. 

Cheers,
Tom


Thomas Steiner, PhD—Developer Relations Engineer (blog.tomayac.com, toot.cafe/@tomayac)

Google Germany GmbH, ABC-Str. 19, 20354 Hamburg, Germany
Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891

----- BEGIN PGP SIGNATURE -----
Version: GnuPG v2.4.3 (GNU/Linux)

iFy0uwAntT0bE3xtRa5AfeCheCkthAtTh3reSabiGbl0ck
0fjumBl3DCharaCTersAttH3b0ttom.xKcd.cOm/1181.
----- END PGP SIGNATURE -----

Thomas Steiner

unread,
Jun 4, 2025, 3:49:00 AM6/4/25
to Thomas Steiner, Phil Nash, Jeff Whelpley, François Beaufort, Chrome Built-in AI Early Preview Program Discussions, Connie Leung
Also quick reminder about this article on debugging what the model receives: https://developer.chrome.com/docs/ai/debug-gemini-nano. It was broken for a while, but has been fixed recently. 

Thomas Steiner, PhD—Developer Relations Engineer (blog.tomayac.com, toot.cafe/@tomayac)

Google Germany GmbH, ABC-Str. 19, 20354 Hamburg, Germany
Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891

----- BEGIN PGP SIGNATURE -----
Version: GnuPG v2.4.3 (GNU/Linux)

iFy0uwAntT0bE3xtRa5AfeCheCkthAtTh3reSabiGbl0ck
0fjumBl3DCharaCTersAttH3b0ttom.xKcd.cOm/1181.
----- END PGP SIGNATURE -----

Jeff Whelpley

unread,
Jun 4, 2025, 8:11:22 AM6/4/25
to Thomas Steiner, Phil Nash, François Beaufort, Chrome Built-in AI Early Preview Program Discussions, Connie Leung
Thanks for the tip about the debugger. I missed that before. I modified my code so that I chunk up a few seconds of audio and turn it into a WAV-formatted Blob, as the model expects:

https://github.com/jeffwhelpley/angular-speech-recognition/blob/main/src/libs/managers/speech.manager.ts#L121
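For reference, the WAV conversion is roughly this (simplified sketch of what I'm doing; it assumes mono Float32 samples at a known sample rate, and the function name is just for illustration — the real code is in the repo above):

function float32ToWavBlob(samples, sampleRate) {
  // Build a 44-byte RIFF/WAVE header followed by 16-bit PCM samples (mono)
  const buffer = new ArrayBuffer(44 + samples.length * 2);
  const view = new DataView(buffer);
  const writeString = (offset, str) => {
    for (let i = 0; i < str.length; i++) view.setUint8(offset + i, str.charCodeAt(i));
  };
  writeString(0, 'RIFF');
  view.setUint32(4, 36 + samples.length * 2, true); // file size minus 8 bytes
  writeString(8, 'WAVE');
  writeString(12, 'fmt ');
  view.setUint32(16, 16, true);  // fmt chunk size
  view.setUint16(20, 1, true);   // audio format: PCM
  view.setUint16(22, 1, true);   // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate = sampleRate * block align
  view.setUint16(32, 2, true);   // block align = channels * bytes per sample
  view.setUint16(34, 16, true);  // bits per sample
  writeString(36, 'data');
  view.setUint32(40, samples.length * 2, true);
  // Clamp and convert Float32 [-1, 1] samples to signed 16-bit PCM
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(44 + i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return new Blob([view], { type: 'audio/wav' });
}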

But unfortunately, I am just getting back "[noise]" from the model. The debug log is:

Wed Jun 04 2025 07:46:15 GMT-0400 (Eastern Daylight Time)  ai_language_model.cc(477) Starting on-device session for PromptApi
Wed Jun 04 2025 07:46:45 GMT-0400 (Eastern Daylight Time)  ai_language_model.cc(333) Model generates raw response with PromptApi:
<noise> İki.
Wed Jun 04 2025 07:46:46 GMT-0400 (Eastern Daylight Time)  ai_language_model.cc(232) Executing model with input context of 28 tokens:
<user>Please transcribe the audio<audio><end><model>
Wed Jun 04 2025 07:46:46 GMT-0400 (Eastern Daylight Time)  ai_language_model.cc(333) Model generates raw response with PromptApi:
1 2 3
Wed Jun 04 2025 07:46:47 GMT-0400 (Eastern Daylight Time)  ai_language_model.cc(232) Executing model with input context of 28 tokens:
<user>Please transcribe the audio<audio><end><model>
Wed Jun 04 2025 07:46:48 GMT-0400 (Eastern Daylight Time)  ai_language_model.cc(333) Model generates raw response with PromptApi:
<noise> <noise> <noise> <noise>
Wed Jun 04 2025 07:46:48 GMT-0400 (Eastern Daylight Time)  ai_language_model.cc(232) Executing model with input context of 28 tokens:
<user>Please transcribe the audio<audio><end><model>
Wed Jun 04 2025 07:46:49 GMT-0400 (Eastern Daylight Time)  ai_language_model.cc(333) Model generates raw response with PromptApi:
[noise]

Interestingly, the model did in fact correctly transcribe the audio once (the "1 2 3" is what I was saying at the time), but otherwise it just returns "[noise]".

I am not sure if it is related yet, but I just noticed that both the Transformers.js transcription and the Web Speech API transcription are not working in Chrome Canary 139, even though they work in Chrome 137. I am going to focus today on figuring out that discrepancy between the two Chrome versions. Potentially there is some bug in Chrome Canary that is affecting all of them.

Jeff


Thomas Steiner

unread,
Jun 4, 2025, 12:26:08 PM6/4/25
to Jeff Whelpley, Thomas Steiner, Phil Nash, François Beaufort, Chrome Built-in AI Early Preview Program Discussions, Connie Leung
Thank you very much for the investigation here, Jeff, and glad it worked at least in one case, so we can be sure the model is at least seeing (or rather, hearing) the audio properly. Before, you were inadvertently throwing stringified JSON at it, which it of course tried to make sense of (similar to what I did in https://issues.chromium.org/issues/392661409).

Cheers,
Tom


