Understanding Prompt API Performance


Scott Fortmann-Roe

Jan 27, 2025, 4:16:27 AM
to chrome-ai-dev-...@chromium.org
I would like to make sure I am using the Prompt API in a way that minimizes resource consumption and is as performant as possible for users. However, I don't have a good mental model for the underlying performance characteristics of the API.

Here are some questions whose answers would help guide us. Note that when I use the term "performance" here, I am referring to speed and resource consumption, not to model accuracy.

1) Are there any performance benefits to using prompt() instead of promptStreaming()? (Can the language model operate more effectively when it doesn't need to return results token by token?)

2) Does the topK session parameter have any effect on performance?

3) Does aborting a prompt using an AbortController immediately stop evaluation, or does it continue to use the CPU/GPU in some cases after abort() has been called?

4) If I am done with a session but know I will soon need another one, should I keep a reference to the old session around to keep the model loaded in memory? In general, an overview of how model loading/unloading works would be helpful.

5) Could you comment on the costs/benefits of having one long session versus multiple short sessions? The Explainer has an example generating emojis where a new session is created for each prompt (https://github.com/webmachinelearning/prompt-api?tab=readme-ov-file#n-shot-prompting). Is that the best approach for this use case? Or would it be better to have a single session and keep extending it with new prompts as new emoji requests come in?

Thank you for your help!
Scott


Thomas Beverley

Jan 27, 2025, 6:50:18 AM
to Chrome Built-in AI Early Preview Program Discussions, Scott Fortmann-Roe
From our experiments with projects using the prompt API, we've found that... (Googlers might have a deeper understanding!)

1. prompt() and promptStreaming() seem to be similar in performance, but we've found that promptStreaming() "feels quicker" because the time until users can start reading something on screen is much shorter.
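
For illustration, a minimal sketch of the two call styles (assuming the ai.languageModel API used elsewhere in this thread):

const session = await ai.languageModel.create();

// One-shot: the promise resolves once, with the complete response.
const full = await session.prompt("Summarize this thread in one sentence.");
console.log(full);

// Streaming: partial results arrive as they are generated, so the first
// words appear much sooner even if the total time is about the same.
const stream = session.promptStreaming("Summarize this thread in one sentence.");
for await (const chunk of stream) {
  console.log(chunk); // render each chunk incrementally for perceived speed
}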

2. Tokens are weighted before topK is considered, so the hard work has already been done by that point. This means topK shouldn't have any impact on performance.
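
For reference, topK is fixed at session creation; a minimal sketch, following the Explainer (which expects temperature to be set alongside topK):

// topK and temperature are set when the session is created. Per the
// above, varying topK shouldn't meaningfully change generation speed.
const session = await ai.languageModel.create({
  topK: 3,
  temperature: 0.8,
});
console.log(await session.prompt("Suggest an emoji for a birthday party."));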

3. Calling abort() actually stops the request; you should see this if you have multiple requests lined up, because the next request will start. (Worth noting: the Prompt API doesn't run on the CPU, as the requirements at https://developer.chrome.com/docs/ai/get-started say the machine needs a GPU with 4 GB+ of VRAM, unless you have the BypassPerfRequirement flag enabled.)
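
A minimal abort sketch (the pending promise rejects with an AbortError, so catch it to avoid an unhandled rejection; the timing here is just illustrative):

const controller = new AbortController();
const session = await ai.languageModel.create();
session
  .prompt("Write a long essay about browsers.", { signal: controller.signal })
  .catch((e) => console.log("stopped:", e.name)); // logs "stopped: AbortError"
// Abort shortly after starting; generation stops and the GPU is freed
// for whatever request is queued next.
setTimeout(() => controller.abort(), 100);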

4. I think some of the model framework is kept in memory even after a session is destroyed, but someone at Google might be better suited to explain the underlying mechanics of this one.

5. When you reuse a session, it keeps building up a conversation history within it. This context can be useful, for example if you want to ask follow-up questions, but if the prompts are discrete, it can almost muddy the waters. What I will say is that a session with more input (be it a longer prompt or accumulated history) will become progressively slower, because there are more tokens to process. As each request is effectively one-shot (i.e., the whole context is piped in each time), creating new sessions is better whenever you need a fresh context.
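
To make the trade-off concrete, a rough sketch of the two patterns (the oneShot helper is just an illustrative name):

// Pattern A: one long-lived session. History accumulates, so every new
// prompt re-processes all earlier tokens and gets progressively slower.
const chat = await ai.languageModel.create();
await chat.prompt("My name is Scott.");
await chat.prompt("What's my name?"); // benefits from the accumulated history

// Pattern B: a fresh session per independent request, so each prompt
// starts from a clean, minimal context.
async function oneShot(text) {
  const s = await ai.languageModel.create();
  try {
    return await s.prompt(text);
  } finally {
    s.destroy(); // release the session's resources when done
  }
}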

Tom

Thomas Steiner

Jan 27, 2025, 8:34:21 AM
to Thomas Beverley, Chrome Built-in AI Early Preview Program Discussions, Scott Fortmann-Roe
Hi Scott and Tom,

Thanks for asking those questions, Scott, and for your initial thoughts, Tom. I agree with everything you have said. For the things that require implementation background, I'm waiting to hear from engineering myself and will chime in with more details (or let engineering respond directly).

Cheers,
Tom

PS: I'll send a separate email about this as it was requested some time ago, but here's an article on AI session best practices.



--
Thomas Steiner, PhD, Developer Relations Engineer (blog.tomayac.com, toot.cafe/@tomayac)


Scott Fortmann-Roe

Jan 27, 2025, 10:55:39 AM
to Thomas Steiner, Thomas Beverley, Chrome Built-in AI Early Preview Program Discussions
Thank you both!

Thomas Steiner

Jan 28, 2025, 7:14:08 AM
to Scott Fortmann-Roe, Clark Duvall, Mingyu Lei, Thomas Steiner, Thomas Beverley, Chrome Built-in AI Early Preview Program Discussions
Hi all,

As promised, here are the responses from the engineering team (everything in quotes is verbatim):

For 1), courtesy of @Mingyu Lei:

"The model and the browser side [do] exactly the same thing for these two APIs, the only difference is the renderer will do some aggregation on the partial results before returning the final output for prompt(). So performance wise they should be at the same level." 

For 3), code snippet courtesy of @Mingyu Lei:

It seems like there may be a bug, but engineering hasn't reached a conclusion yet. The following test suggests that aborting a long, complex prompt can cause a follow-up simple prompt to take longer than expected; they are still investigating.

// Create two independent sessions.
let s = await ai.languageModel.create();
let s2 = await ai.languageModel.create();
let c = new AbortController();
// Start a prompt on the first session, then abort it immediately. The
// promise rejects with an AbortError, so catch it to avoid an
// unhandled rejection.
s.prompt("what's the result of 1+2?", { signal: c.signal })
  .catch((e) => console.log("aborted:", e.name));
console.log(Date.now());
c.abort();
// Time how quickly the second session starts streaming afterwards.
let ss = await s2.promptStreaming("what's the result of 1+1?");
for await (const chunk of ss) {
  console.log(Date.now());
  console.log(chunk);
}
 
For 4), courtesy of @Clark Duvall:
  • "The model is loaded when a session is first created
  • The model is unloaded after a delay when the last session is deleted (currently 1 minute, but this may change)
So if a dev wants to be sure the model stays loaded and avoid model load cost for future sessions, the easiest way is to keep a session alive. This is only recommended if they are sure it will be used again, as the model takes up a lot of system resources." 
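
To illustrate the keep-alive approach (a sketch based on the behavior described above; runPrompt and releaseModel are hypothetical names):

// Hold one idle session so the model stays loaded between bursts of
// work. Only do this if you're confident you'll prompt again soon,
// since the loaded model consumes significant system resources.
let keepAlive = await ai.languageModel.create();

async function runPrompt(text) {
  const s = await ai.languageModel.create(); // fast: model is already loaded
  try {
    return await s.prompt(text);
  } finally {
    s.destroy();
  }
}

// When you're sure you're done, release the last session so the model
// can be unloaded after the delay mentioned above.
function releaseModel() {
  keepAlive?.destroy();
  keepAlive = null;
}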

Cheers,
Tom

Scott Fortmann-Roe

Jan 29, 2025, 7:52:55 AM
to Thomas Steiner, Clark Duvall, Mingyu Lei, Thomas Beverley, Chrome Built-in AI Early Preview Program Discussions
Thank you, this is very helpful.