Preliminary RFC: Stabilizing the V8 script compiler cached data format

Vitali Lovich

Jul 19, 2021, 12:02:04 PM
to v8-u...@googlegroups.com
Hi,

I wanted to kick off a discussion and solicit some thoughts on whether it would be operationally feasible to try to stabilize the cached data format of the compiler.

The context is that I work on Cloudflare Workers. We'd like to increase the script size we allow our customers to upload, but we have concerns about the performance impact that will have (specifically script parse time). One mitigation for this would be to leverage the script compiler's cached data & generate the cache whenever the user uploads a script. This way we can precompute the cached data on upload & deliver it alongside the script.
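For reference, the flow I have in mind is roughly the following sketch using the public ScriptCompiler cache APIs (error handling omitted; the helper names are ours, not V8's):

    #include "v8.h"

    // At upload time: compile the script eagerly and serialize its code cache.
    // The caller takes ownership of the returned CachedData and stores its
    // data/length alongside the uploaded script.
    v8::ScriptCompiler::CachedData* ProduceCache(v8::Isolate* isolate,
                                                 v8::Local<v8::String> source_text) {
      v8::ScriptCompiler::Source source(source_text);
      v8::Local<v8::UnboundScript> script =
          v8::ScriptCompiler::CompileUnboundScript(
              isolate, &source, v8::ScriptCompiler::kEagerCompile)
              .ToLocalChecked();
      return v8::ScriptCompiler::CreateCodeCache(script);
    }

    // At run time: hand the previously stored bytes back to the compiler.
    v8::Local<v8::UnboundScript> ConsumeCache(v8::Isolate* isolate,
                                              v8::Local<v8::String> source_text,
                                              const uint8_t* data, int length) {
      auto* cached = new v8::ScriptCompiler::CachedData(
          data, length, v8::ScriptCompiler::CachedData::BufferNotOwned);
      v8::ScriptCompiler::Source source(source_text, cached);  // takes ownership
      return v8::ScriptCompiler::CompileUnboundScript(
                 isolate, &source, v8::ScriptCompiler::kConsumeCodeCache)
          .ToLocalChecked();
    }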

Unfortunately, this approach has a major stumbling block: we track V8 releases as they're published. That means our V8 version changes roughly every week, which would (at best) necessitate regenerating the cache for all scripts on a weekly basis. This adds scalability & implementation complexity concerns (especially since we may have multiple versions of V8 running at one time).

I'm not looking to discuss implementation-specific details, but rather trying to get an overview of opinions from the talented V8 team.
  • I haven't yet examined what the structure of the code cache actually looks like. Are there prohibitive technical blockers that can't really be resolved, making this a non-starter?
  • Are there meaningful maintenance/security/implementation concerns? I'm assuming there are very good reasons why the data is version locked.
  • It's not necessarily a requirement to freeze it for all time (although that would of course be ideal). What is the cadence for this format actually changing (vs no-op version bumps for safety)? Would it be possible to stabilize within a major V8 release (8->9, 9->10, etc) or for 6 month periods?
  • If stabilizing is truly impossible (as I suspect it probably is), would it be technically feasible to implement a cheaper "upgrade" that converts the previous code cache to the current one? It's not ideal, but it could significantly reduce the costs needed to upgrade many scripts at once
I suspect that any improvement here would also apply to Chrome in the form of a more consistent performance experience after an upgrade.

We do have a fallback plan that's workable within the current architecture, but it's got some downsides that would be neat to bypass by stabilizing the format. Appreciate any feedback/insights anyone can offer.

Thanks,
Vitali

joe lewis

Jul 19, 2021, 3:00:52 PM
to v8-u...@googlegroups.com
Hi Vitali,

I’m neither from the V8 team nor an expert in this subject matter. I just wanted to drop an interesting project: Hermes - https://hermesengine.dev, a JavaScript engine by Facebook that is tailored for fast startup times. It achieves this by precompiling JavaScript into bytecode at build time.

So something like this may well be possible.

Best,
Joe

Leszek Swirski

Jul 20, 2021, 6:19:41 AM
to v8-users
Hi Vitali,

Stabilising the cached data format as-is is pretty challenging; the cache as written is pretty much a direct field-by-field serialisation of the internal data structures, so freezing the cache would mean freezing the shapes of those internal objects, effectively making the internal fields an API-level guarantee. Furthermore, it's a backdoor to a stable bytecode format, which is something we've also pushed back on, as it severely limits our ability to work on the interpreter. Even if we wanted the slightly weaker constraint of only guaranteeing backwards compatibility with old bytecode, we'd have to vastly expand our test suite with old bytecodes to try to maintain that compatibility, and even then I'm not sure we could fully guarantee it if some edge case isn't covered in the test suite.

The same story applies to porting code caches from older to newer versions; such a port would require a mapping from old to new, which in turn would require a) some sort of log of what old fields/bytecodes translate to what new ones, and b) heavy testing to make sure that this mapping is valid.

This is a big security problem; the deserialisation is pretty dumb (for performance reasons), and just spits out data onto the V8 heap without e.g. checking that the number of fields matches. Having bugs in the old->new mapping, or in the backwards compatibility, would open up a whole pandora's box of security issues, where one deleted field in an edge case that tests don't cover becomes an out-of-bounds write widget.
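(For completeness, this version locking is already visible to embedders today: if the consuming side sees data produced by a different build, the cache is simply rejected and the script is recompiled from source. A minimal sketch of the check, using the existing CachedData API; the helper itself is just illustrative:)

    // Consume a cache; if it came from a different V8 build (or with different
    // flags), V8 silently recompiles from source and marks the data rejected.
    v8::MaybeLocal<v8::UnboundScript> CompileWithCache(
        v8::Isolate* isolate, v8::Local<v8::String> source_text,
        v8::ScriptCompiler::CachedData* cached_data /* ownership passed */) {
      v8::ScriptCompiler::Source source(source_text, cached_data);
      auto script = v8::ScriptCompiler::CompileUnboundScript(
          isolate, &source, v8::ScriptCompiler::kConsumeCodeCache);
      if (source.GetCachedData()->rejected) {
        // Version/flags mismatch: regenerate the cached data from 'script'.
      }
      return script;
    }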

Given that this would greatly increase our development complexity (maintaining a stable API is already a lot of trouble for us), would be a big source of security issues, and isn't something I'd expect to provide much benefit for Chrome (since we expect websites to change more often than Chrome versions), I don't see us working on (or accepting patches for) a stable or even upgradeable cache.

I'd be curious to know if you've actually observed/measured script parse time being a big problem, or whether you're seeing issues due more to lazy function compilation time. We've done a lot of work on parse time in recent years, so it's not as slow as (some) people assume. We're also prototyping a potential stable & standardisable snapshot format for the results of partial script execution, which could help you if large script "setup" code is the issue, but it wouldn't store compiled bytecode (for the above reasons).

I appreciate that this might be a disappointing answer for you, but having flexibility with internal objects and bytecode is one of the things that allows us to stay performant and secure.

- Leszek

Jakob Kummerow

Jul 21, 2021, 3:48:13 PM
to v8-users
In addition to the excellent technical points that Leszek explained, I just wanted to clarify that we don't have a concept of "major releases 8 -> 9" etc. V8's version numbers are assigned by adding 0.1 every six weeks, rolling from x.9 to (x+1).0. There are zero technical or strategic considerations going into version numbers; it's entirely mechanical and on a time-based schedule. The 8.9 -> 9.0 bump is exactly as significant as 8.8 -> 8.9 or 9.0 -> 9.1. (It's similar to the Linux kernel going from, say, 4.19 to 5.0.)

Using daily V8 snapshots in production is an interesting choice. If that works for you, great; I wouldn't recommend it, given that daily snapshots are quite severely broken every now and then. See https://v8.dev/docs/version-numbers for our official recommendation (simplified TL;DR: use what Chrome Stable uses).

Vitali Lovich

Jul 22, 2021, 7:18:26 PM
to v8-u...@googlegroups.com
Hi Leszek,

Apologies for the delayed reply - I've been a bit swamped at work the past couple of days. Thank you for the excellent details & we'll align our plans accordingly. Some replies inline.

I've replied privately to Jakob's concern as I don't want to derail this conversation.

On Tue, Jul 20, 2021 at 3:19 AM Leszek Swirski <les...@chromium.org> wrote:
Hi Vitali,

Stabilising the cached data format as-is is pretty challenging; the cache as written is pretty much a direct field-by-field serialisation of the internal data structures, so freezing the cache would mean freezing the shapes of those internal objects, effectively making the internal fields an API-level guarantee. Furthermore, it's a backdoor to a stable bytecode format, which is something we've also pushed back on as it severely limits our ability to work on the interpreter; if we wanted to have a slightly weaker constraint of at least guaranteeing backwards compatibility with old bytecode, we'd have to vastly expand our test suite with old bytecodes in order to try to maintain this backwards compatibility, and even then I'm not sure we could fully guarantee if there's some edge case not covered in the test suite. Same story with porting code caches from older to newer versions; such a port would require a mapping from old to new, which would require a) some sort of log of what old fields/bytecodes translate to what new ones, and b) heavy testing to make sure that this mapping is valid. This is a big security problem; the deserialisation is pretty dumb (for performance reasons), and just spits out data onto the V8 heap without e.g. checking if the number of fields match. Having bugs in the old->new mapping, or in the backwards compatibility, would open up a whole pandora's box of security issues, where one deleted field in an edge case that tests don't cover would become an out-of-bounds write widget.

Given that this would greatly increase our development complexity (maintaining a stable API is already a lot of trouble for us), would be a big source of security issues, and I don't expect it to provide much benefit for Chrome (since we expect websites to change more often than Chrome versions), I don't see us either working on (or accepting patches for) a stable or even upgradeable cache.

I'd be curious to know if you've actually observed/measured script parse time being a big problem, or whether you're more seeing issues due to lazy function compilation time. We've done a lot of work on parse time in recent years, so it's not as slow as (some) people assume.
What's the best way to measure script parse time vs lazy function compilation time? It's been a few months since I last looked at this, so my memory is a bit hazy on whether it was instantiating v8::ScriptCompiler::Source, calling v8::ScriptCompiler::CompileUnboundScript, or the combined time of both (although I suspect both count as script parse time?). I do recall that on my laptop, using the code cache basically halved whatever I was measuring on larger scripts, and I suspect I would have looked at the overall time to instantiate the isolate with a script (it was a no-op on smaller scripts, so I suspect we're talking about script parse time).

FWIW, if it's helpful: when I profiled a stress test of isolate construction on my machine with a release build, I saw V8 spending a lot of time deserializing the snapshot (seemingly once for the isolate & then again for the context). Breakdown of the flamegraph:
* ~22% of total runtime to run NewContextFromSnapshot. Within that, ~5% of total runtime was spent just decompressing the snapshot & the rest (17%) was deserializing it. I thought there was only 1 snapshot. Couldn't the decompression happen once in V8System instead?
* 9% of total runtime spent decompressing the snapshot for the isolate (in other words 14% of total runtime was spent decompressing the snapshot).

In our use-case we construct a lot of isolates in the same process. I'm curious if there are opportunities to extend V8 to utilize COW to reduce the memory & CPU impact of deserializing the snapshot multiple times. Is my guess correct that deserialization is actually doing non-trivial things like relocating objects, or do you think there's a 0-copy approach that can be taken with serializing/deserializing the snapshot so that it's prebuilt in the right format (perhaps even without any compression)?

With respect to compression, do you think that maybe the snapshot could be moved to being provided when V8System is constructed so that all isolates deserialize out of the same decompressed snapshot?
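To make the question concrete, what I'm imagining is roughly the sketch below. It reuses the existing CreateParams::snapshot_blob hook and assumes we'd build our own uncompressed snapshot ahead of time (e.g. with v8::SnapshotCreator); the globals are purely illustrative:

    // Load (or decompress) the snapshot once per process; every isolate then
    // deserializes from the same in-memory blob, with no per-isolate
    // decompression step.
    v8::StartupData g_snapshot;  // filled in once at process startup
    v8::ArrayBuffer::Allocator* g_allocator =
        v8::ArrayBuffer::Allocator::NewDefaultAllocator();

    v8::Isolate* NewIsolateFromSharedSnapshot() {
      v8::Isolate::CreateParams params;
      params.snapshot_blob = &g_snapshot;
      params.array_buffer_allocator = g_allocator;  // shared across isolates
      return v8::Isolate::New(params);
    }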

Apologies if these questions are nonsensical. I'm still trying to learn how the internals of V8 hook up together.
 
We're also prototyping a potential stable & standardisable snapshot format for the results of partial script execution, which could help you if you're seeing large script "setup" code being an issue, but it wouldn't store compiled bytecode (for the above reasons).

I appreciate that this might be a disappointing answer for you, but having flexibility with internal objects and bytecode is one of the things that allows us to stay performant and secure.
I fully understand. I'm definitely interested in the snapshot format since presumably anything that helps the web here will also help us. Is there a paper I can reference to read up more on the proposal? I've seen a few in the wild from the broader JS community but nothing about V8's plans here. I have no idea if that will help our workload but it's certainly something we're open to exploring.

Thanks,
Vitali

Jace Mogill

Jul 23, 2021, 8:02:27 PM
to v8-users
All,

Since this topic has already prompted so much work and discussion, I'll chime in as an opinionated but passive community member whose project is also ultimately limited by the ability to reason about the storage sequence of V8 objects.

Startup time is often a driving consideration, but any program which spends a substantial portion of execution time converting JSON into a native V8 representation would benefit from having alternatives at runtime.

I am the author of the Extended Memory Semantics module (https://github.com/mogill/ems/), which enables a JS (or Python or C) program with petabytes of persistent data to start with no overhead, because none of the data is read from storage or parsed from JSON into native representations until it is referenced at runtime.  The downside, of course, is that the parsing cost is paid over and over again at runtime.  Depending on the use case this approach is somewhere between ideal and pathological.

Complementary data- and task-parallel JSON lexing implementations already exist (https://github.com/simdjson/simdjson, https://github.com/mogill/parallel-xml2json), but there's no way to store the results in a way V8 can use, and thus an extra copy in/out step is needed.

I see several "bookend" options which may be combined to varying degrees:
  - A lowest common denominator data storage sequence is defined by V8
  - Applications gain the ability to introspect about V8's object storage sequences at runtime
  - Applications can tell V8 how data is stored in memory and V8 can adapt to an existing storage sequence

I'd be only too happy to keep some form of this conversation going if it meant it would result in alternatives to copy in/out semantics for all V8 data.

               -J

Leszek Swirski

Jul 26, 2021, 5:19:42 AM
to v8-u...@googlegroups.com, Marja Hölttä
On Fri, Jul 23, 2021 at 1:18 AM Vitali Lovich <vlo...@gmail.com> wrote:
What's the best way to measure script parse time vs lazy function compilation time? It's been a few months since I last looked at this, so my memory is a bit hazy on whether it was instantiating v8::ScriptCompiler::Source, calling v8::ScriptCompiler::CompileUnboundScript, or the combined time of both (although I suspect both count as script parse time?). I do recall that on my laptop, using the code cache basically halved whatever I was measuring on larger scripts, and I suspect I would have looked at the overall time to instantiate the isolate with a script (it was a no-op on smaller scripts, so I suspect we're talking about script parse time).

The best way is to run with --runtime-call-stats; this will give you detailed scoped timers for almost everything we do, including compilation. Script deserialisation is certainly faster than script compilation, so I'm not surprised it has a big impact when the two are compared against each other; I'm more curious how it compares to overall worklet runtime.
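(From an embedder, assuming runtime call stats support is compiled into your build, the flag can also be set programmatically before creating the isolates you want to measure:)

    // Equivalent to passing --runtime-call-stats on the d8 command line.
    v8::V8::SetFlagsFromString("--runtime-call-stats");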
 
FWIW, if It's helpful, when I profiled a stress test of isolate construction on my machine with a release build, I saw V8 spending a lot of time deserializing the snapshot (seemingly once for the isolate & then again for the context).

Yeah, the isolate snapshot is the ~immutable, context-independent one (think of things like the "undefined" value), which is deserialized once per isolate, and the context snapshot holds the things that are mutable (think of things like the "Math" object) and have to be fresh for each new context. Note that these snapshots use the same mechanism as the code cache snapshot, but are otherwise entirely distinct.
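(In API terms the split corresponds to the two creation calls; an illustrative sketch:)

    void NewIsolateAndContext(const v8::Isolate::CreateParams& create_params) {
      // Isolate snapshot: deserialized once per isolate.
      v8::Isolate* isolate = v8::Isolate::New(create_params);
      {
        v8::Isolate::Scope isolate_scope(isolate);
        v8::HandleScope handle_scope(isolate);
        // Context snapshot: deserialized again for every new context.
        v8::Local<v8::Context> context = v8::Context::New(isolate);
        (void)context;
      }
      isolate->Dispose();
    }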
 
Breakdown of the flamegraph:
* ~22% of total runtime to run NewContextFromSnapshot. Within that ~5% of total runtime was spent just decompressing the snapshot & the rest was deserializing it (17%). I thought there was only 1 snapshot. Couldn't the decompression happen once in V8System instead?

It's possible that the decompression could be once per isolate, although there is the memory impact to consider.
 
* 9% of total runtime spent decompressing the snapshot for the isolate (in other words 14% of total runtime was spent decompressing the snapshot).

In our use-case we construct a lot of isolates in the same process. I'm curious if there's opportunities to extend V8 to utilize COW to reduce the memory & CPU impact of deserializing the snapshot multiple times. Is my guess correct that deserialization is actually doing non-trivial things like relocating objects or do you think there's a 0-copy approach that can be taken with serializing/deserializing the snapshot so that it's prebuilt in the right format (perhaps even without any compression)?

There's definitely relocations happening during deserialisation; for the isolate, we've wanted to share the "read-only space" which contains immutable immortal objects (like "undefined"), but under pointer compression this has technical issues because of limited guarantees when using mmap (IIRC). I imagine COW for the context snapshot would have similar issues, combined with the COW getting immediately defeated as soon as the GC runs (because it has to mutate the data to set mark bits). It's a direction worth exploring, but hasn't been enough of a priority for us.

Another thing we're considering looking into is deserializing the context snapshot lazily, so that unused functions/classes never get deserialized in the first place. Again, not something we've had time to prioritise, but something we're much more likely to work on at some point in the future, since it becomes more web relevant every time new functionality is introduced.

I fully understand. I'm definitely interested in the snapshot format since presumably anything that helps the web here will also help us. Is there a paper I can reference to read up more on the proposal? I've seen a few in the wild from the broader JS community but nothing about V8's plans here. I have no idea if that will help our workload but it's certainly something we're open to exploring.

You're probably thinking of BinaryAST, which is unrelated to this. We haven't talked much about web snapshots yet, because it's still very preliminary, very prototypy, and we don't want to make any promises or guarantees around it ever materialising. +Marja Hölttä is leading this effort; she'll know the current state.

Leszek Swirski

Jul 26, 2021, 5:27:21 AM
to v8-u...@googlegroups.com
On Sat, Jul 24, 2021 at 2:02 AM Jace Mogill <ja...@mogill.com> wrote:
Complementary data and task parallel JSON lexing implementations already exists (https://github.com/simdjson/simdjson, https://github.com/mogill/parallel-xml2json) but there's no way to store the results in a way V8 can use, and thus an extra copy in/out step is needed.

Note that V8's JSON parsing generates mutable, introspectable JavaScript objects, with all their prototypal inheritance, monkey patching possibility, iteration semantics, garbage collection compatibility, interaction with V8's inline caches for property access, etc. The actual JSON _parsing_ part of V8's JSON parser is quite competitive; it's the rest of this stuff that takes the additional time.
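(Concretely, the entry point is v8::JSON::Parse, and what comes back is an ordinary, fully materialised JS value; a minimal sketch:)

    v8::Local<v8::Value> ParseJson(v8::Local<v8::Context> context,
                                   v8::Local<v8::String> json_text) {
      // The result is a normal JS object graph: it participates in GC, gets
      // hidden classes and inline caches, and can be mutated or monkey-patched.
      return v8::JSON::Parse(context, json_text).ToLocalChecked();
    }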
 
I see several "bookend" options which may be combined to varying degrees:
  - A lowest common denominator data storage sequence is defined by V8
  - Applications gain the ability to introspect about V8's object storage sequences at runtime
  - Applications can tell V8 how data is stored in memory and V8 can adapt to an existing storage sequence

Since you are already using the C++ API, you can do this today by creating object templates and registering interceptors on them that defer to your existing external storage. This will make the JSON parsing fast, at the cost of property access and memory management being slow, but there's no way for these to _not_ be slow without integrating with the rest of the system, at which point you're back to paying all the costs of V8's existing JSON parsing.
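(A minimal sketch of that shape, where LookupExternal() stands in for your existing external storage and is purely hypothetical:)

    #include <string>

    std::string LookupExternal(const char* key);  // hypothetical external store

    // Serve property reads on an object from external storage via a named
    // property interceptor.
    void ExternalGetter(v8::Local<v8::Name> name,
                        const v8::PropertyCallbackInfo<v8::Value>& info) {
      v8::Isolate* isolate = info.GetIsolate();
      v8::String::Utf8Value key(isolate, name);
      std::string value = LookupExternal(*key);
      info.GetReturnValue().Set(
          v8::String::NewFromUtf8(isolate, value.c_str()).ToLocalChecked());
    }

    v8::Local<v8::ObjectTemplate> MakeExternalObjectTemplate(v8::Isolate* isolate) {
      v8::Local<v8::ObjectTemplate> templ = v8::ObjectTemplate::New(isolate);
      templ->SetHandler(v8::NamedPropertyHandlerConfiguration(ExternalGetter));
      return templ;
    }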

Jace Mogill

Jul 27, 2021, 8:16:38 PM
to v8-users
Leszek,

Thank you for turning me on to the possibility of using an interceptor on a predefined object.  This may be the gateway to several other approaches for integrating parallelism into V8 that had previously stalled for me.  I'll admit the last time I wrote code against the V8 headers was v0.9, and my full-time job now has nothing to do with this, but I'm approaching this as a long-term effort and your response has breathed new life into an old idea.

Many of my previous interactions with the Node/V8 community have ranged from awful to baffling, so I want to reiterate my appreciation for your response, which goes beyond the technical advice.

         -J