Question on Prompt API session persistence & keeping model loaded


Wilson

Dec 9, 2025, 6:53:56 AM
to Chrome Built-in AI Early Preview Program Discussions

Hi Thomas,


I’ve been testing the Prompt API’s session management behavior (Chrome’s on-device model). I noticed that even if I create a session and leave it idle, the model still unloads after some time. This happens even though I intentionally keep the session object referenced and never call destroy() on it.


From the documentation, the recommendation is to “keep an empty session alive” so the model stays loaded. In practice, however, an idle session does not always seem to count as a “living session,” and the model is unloaded anyway after a timeout.


Before I assume this is expected, I wanted to confirm:

1. Is there currently any supported way to keep the model loaded indefinitely (e.g. holding a session open that does not get timed out or GC’d)?

2. Are idle sessions intentionally treated as “dead” after some period, causing an unload?

3. Is there any official guidance or upcoming API for more explicit warm-up/keep-alive behavior, such as a warmup() or “persistent session” mode?


My use case needs the model to be instantly available for short, unpredictable bursts of work. Reloading the model each time adds noticeable latency, so I’m trying to understand the correct approach based on current implementation and future plans.


Thanks a lot for your time — appreciate any clarification!


Best regards,

Wilson

Thomas Steiner

Dec 9, 2025, 7:06:40 AM
to Wilson, Chrome Built-in AI Early Preview Program Discussions
Hi Wilson,

That's an excellent question. As you have correctly observed, an unused session is discarded after a certain period of inactivity (this period is not specified; it's user-agent-specific behavior). As an immediate mitigation, you could of course send a regular heartbeat "hi" message to the model, but that would be wasteful. Instead, I'd recommend aggressively creating a cloned session that already has the context you need. A signal could be the user hovering their mouse over or near the UI element that requires a session. Do you have any more background on the experience you're building?

The current thinking is that instead of adding something like a `warmUp()` method, it would be better to simply accelerate session startup; this is where the engineering effort is currently focused.
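Sketched in code, that hover heuristic could look like the following. This is a minimal sketch: `LanguageModel.create()` is the Prompt API entry point, but the helper name `warmOnSignal`, the default event list, and the overall shape are illustrative, not API surface.

```javascript
// Sketch: pre-create a Prompt API session as soon as the user shows intent
// (hover, focus), so the model is already loading when the real prompt fires.
// `createSession` would typically wrap `LanguageModel.create(...)`; the
// helper itself and the default event list are illustrative.
function warmOnSignal(target, createSession, events = ['pointerenter', 'focusin']) {
  let sessionPromise = null;
  const warm = () => {
    // Idempotent: only the first signal actually creates the session.
    if (!sessionPromise) sessionPromise = createSession();
    return sessionPromise;
  };
  for (const type of events) target.addEventListener(type, warm, { once: true });
  return warm; // can also be called directly on a stronger signal
}

// Hypothetical usage in a page where the Prompt API is available:
// const warm = warmOnSignal(button, () => LanguageModel.create());
// button.addEventListener('click', async () => {
//   const session = await warm();
//   console.log(await session.prompt(inputField.value));
// });
```

Because the warm-up promise is cached, repeated signals (hover then focus, say) cost nothing beyond the first creation.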

Cheers,
Tom



--
Thomas Steiner, PhD, Developer Relations Engineer (blog.tomayac.com, toot.cafe/@tomayac)

Google Spain, S.L.U.
Torre Picasso, Pl. Pablo Ruiz Picasso, 1, Tetuán, 28020 Madrid, Spain

CIF: B63272603
Registered in the Madrid Mercantile Registry, Section 8, Sheet M-435397, Volume 24227, Folio 25

----- BEGIN PGP SIGNATURE -----
Version: GnuPG v2.4.8 (GNU/Linux)

iFy0uwAntT0bE3xtRa5AfeCheCkthAtTh3reSabiGbl0ck
0fjumBl3DCharaCTersAttH3b0ttom.xKcd.cOm/1181.
----- END PGP SIGNATURE -----

Wilson

unread,
Dec 9, 2025, 7:20:51 AM (7 days ago) Dec 9
to Thomas Steiner, Chrome Built-in AI Early Preview Program Discussions

Thanks Tom — that helps a lot.


To give a bit more background on the experience I’m trying to support:

The model is used in short, intermittent bursts. A user may trigger an AI action, pause for a while, then resume with another quick interaction. These interactions are unpredictable and typically require the model to respond immediately, without noticeable warm-up time. When the model unloads during idle gaps, the reload latency becomes very visible to the user.


This is why I initially explored keeping a minimal session alive — not for preserving context, but purely to keep the model ready. Since idle sessions may be discarded, I’m looking for the most reliable way to avoid surprising latency spikes.


Your suggestion about proactively creating a cloned session makes sense. I can potentially tie that to early UI signals (opening a panel, focusing an input field, etc.). Hover is a nice optimization, though not always guaranteed (touch devices, keyboard users, etc.), so I’m considering broader cues.
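Concretely, what I'm considering is something like the following. A sketch: `WarmSessionPool`, the idle threshold, and the injected clock are my own illustrative names and numbers, chosen to stay below whatever idle timeout the browser applies; nothing here is Prompt API surface except the factory you pass in.

```javascript
// Sketch: cache one session and proactively replace it when it has sat idle
// longer than `maxIdleMs`, on the assumption that the browser may silently
// discard idle sessions. The class name, threshold, and injected clock are
// illustrative; the factory would typically be () => LanguageModel.create().
class WarmSessionPool {
  constructor(createSession, maxIdleMs = 5 * 60 * 1000, now = Date.now) {
    this.createSession = createSession;
    this.maxIdleMs = maxIdleMs;
    this.now = now;
    this.session = null;
    this.lastUsed = 0;
  }

  async acquire() {
    const stale = this.session && this.now() - this.lastUsed > this.maxIdleMs;
    if (!this.session || stale) {
      // Note: a concurrent acquire() during creation could double-create;
      // a production version would also cache the pending promise.
      this.session = await this.createSession();
    }
    this.lastUsed = this.now();
    return this.session;
  }
}
```

The UI signals (panel open, input focus) would simply call `acquire()` early and discard the result, so the session is warm by the time the user actually prompts.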


If session startup becomes significantly faster in the future, that would essentially resolve the experience issue altogether. In the meantime, any recommended best practices for managing these short-lived, high-responsiveness interactions would be very helpful.


Thanks again — really appreciate the insight.


Best,

Wilson

Thomas Steiner

Dec 9, 2025, 7:38:25 AM
to Wilson, Chrome Built-in AI Early Preview Program Discussions, Thomas Steiner
Hi again,

These are great additional details. Assuming the model interactions are indeed unforeseeable in your app, a `warmUp()` function wouldn't help even if it existed, as you wouldn't know when to call it. Other than aggressive, heuristics-driven creation of cloned sessions, I don't have any better advice right now, apart from waiting for the process to simply become faster, which, as I said, is a current engineering focus. The model needs to be unloaded at some point to conserve resources.

Cheers,
Tom

François Beaufort

Dec 9, 2025, 7:56:03 AM
to Thomas Steiner, Wilson, Chrome Built-in AI Early Preview Program Discussions

Wilson

Dec 10, 2025, 2:38:02 AM
to François Beaufort, Thomas Steiner, Chrome Built-in AI Early Preview Program Discussions

Hi François,

Here’s what I observed in my tests:

  • Idle timeout: ~6 minutes of inactivity before the on‑device model is unloaded.
  • Reload latency on resume: ~10 seconds before tokens start streaming again (very noticeable in short, bursty interactions).
  • No unload signal: I haven’t seen a reliable way to detect that the model has been unloaded; the UX impact shows up only when the next request stalls.
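For what it's worth, here is roughly how I measure that reload latency. A sketch: `timeToFirstChunk` is my own helper; it assumes a ReadableStream like the one `session.promptStreaming()` returns, and the injectable clock is only there for testability.

```javascript
// Sketch: with no unload signal available, measure time-to-first-chunk on
// every request; a large spike (seconds instead of sub-second) is a strong
// hint that the model had been unloaded and was reloaded for this request.
// The helper name and clock parameter are illustrative. Note that this
// consumes the first chunk; real code would keep reading from one reader.
async function timeToFirstChunk(stream, clock = () => Date.now()) {
  const start = clock();
  const reader = stream.getReader();
  try {
    await reader.read(); // resolves when the first chunk arrives
  } finally {
    reader.releaseLock();
  }
  return clock() - start;
}

// Hypothetical usage with the Prompt API:
// const stream = session.promptStreaming(userText);
// const latency = await timeToFirstChunk(stream);
// if (latency > 2000) console.warn('likely cold start:', latency, 'ms');
```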

François Beaufort

9:55 AM
to Wilson, Chrome Built-in AI Early Preview Program Discussions, Thomas Steiner
Hey Wilson,

I've shared this issue with the Chrome engineering team.
You can follow progress at https://issuetracker.google.com/issues/468376768

Hopefully we'll have signals soon to better manage the model lifecycle, or at least, as suggested in the bug, a more nuanced strategy like keeping the model loaded while the page using it is foregrounded.