Utility to check if a given stream can parse as Javascript (ORB)


Łukasz Anforowicz

Aug 11, 2021, 3:21:43 PM
to v8-dev
Hello v8-dev@,

Could you please help me with my questions below (related to parsing Javascript)?  Please let me know if I should try another email alias instead (I wasn't quite sure where to start asking questions).

Context:
  • ORB proposes to parse an HTTP response body to verify whether it can be parsed as JavaScript (blocking no-cors HTTP responses whose body doesn't represent JavaScript, because earlier ORB steps have already verified that the response doesn't represent other valid no-cors scenarios like audio/image/video/stylesheet/etc.).
  • AFAICT, public v8 APIs provide a way to compile a script (e.g. v8::ScriptCompiler::CompileUnboundScript which takes a string as input, and a v8::ScriptCompiler::StartStreaming which takes a stream as input).  OTOH, v8/src/parsing/parser.cc doesn't seem to be exposed via the public API.
Questions:
  • Would it be possible and/or reasonable to provide a public v8 API for checking if a stream can be parsed as Javascript?
    • Assumption: No cache integration is needed (the parsing will happen outside of a renderer process;  no compilation will be done).
    • Requirement: For JSON, the parser should indicate that the input is not valid JavaScript (e.g. for JSON objects, and for JSON lists that terminate without invoking any list methods)
    • I am happy to tackle this work, but I may need some guidance and hand-holding regarding some of the details.
  • Is it fair to describe Javascript parsing as risky from a security perspective?  (e.g. something to avoid in a NetworkService process and consider doing in a Utility process instead)
    • On one hand, the input is a text stream (no binary offsets) and the output is just a boolean (definitely-not-JavaScript vs. the-prefix-still-parses-as-JavaScript).  And I imagine that the essence of the parser just mechanically transcribes the BNF rules for JavaScript.  OTOH, parsers can get fairly complex, and so the act of parsing might be seen as violating the Rule-of-2.
--
Thanks,

Lukasz

Hannes Payer

Aug 12, 2021, 3:31:08 AM
to v8-...@googlegroups.com
Hi Lukasz,

To understand your question correctly: You want an API which returns true if the JavaScript input is valid, right?

I think this surgery should be possible but I am deferring to the parser owners. @Leszek Swirski @Toon Verwaest WDYT? Maybe that's even a nice testing mode for JS language features.

The parser is quite complicated which is a problem from a security perspective. That's a Rule-of-2 violation.

-Hannes

--
--
v8-dev mailing list
v8-...@googlegroups.com
http://groups.google.com/group/v8-dev
---
You received this message because you are subscribed to the Google Groups "v8-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to v8-dev+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/v8-dev/d4dd45ff-3b73-4d4b-883d-d2e8ba4123e7n%40googlegroups.com.


--

 

Hannes Payer | V8 | Google Germany GmbH | Erika-Mann Str. 33, 80636 München 

Registergericht und -nummer: Hamburg, HRB 86891 | Sitz der Gesellschaft: Hamburg | Geschäftsführer: Matthew Scott Sucherman, Paul Terence Manicle

Mathias Bynens

Aug 12, 2021, 3:37:14 AM
to v8-...@googlegroups.com
Another complication is that V8 currently doesn’t throw early (“parse”) errors for regular expression literals (issue 896). This would have to be resolved before we can accurately validate whether a given input is valid JS or not.

Leszek Swirski

Aug 12, 2021, 3:47:39 AM
to v8-...@googlegroups.com
That's also the case for invalid LHS on assignment, e.g. `lhs() = 5`, which should be an early error but for web compat we make it a runtime error.

Overall, this could be something we expose, but:
  1. There's a couple of additional complications around JS standards incompatible errors (like the two aforementioned ones), some of which are intentional
  2. There's the rule-of-2 violation
  3. This breaks streaming compilation (since the full body of the resource has to be available for parsing before it is sent to the renderer)
  4. Parsing JS ain't cheap, and doing so as part of the network process, presumably before sending anything to the renderer, is quite a cost
  5. We don't have a way of distinguishing valid JS from valid JSON during parse, so we'd effectively need to parse twice
  6. Standards can change, and syntax can change with it, so whether or not something is blocked will be version dependent
  7. The DX of blocking a script just because of a parse error may be suboptimal

Łukasz Anforowicz

Aug 12, 2021, 2:10:55 PM
to v8-...@googlegroups.com, Charlie Reis
Thank you very much for the feedback - much appreciated.  I've tried to reply to some of the feedback inline, below.

Let me step back a little bit, and observe that distinguishing JS from non-JS might not necessarily require full-fidelity JS parsing (to catch, say, 95% of non-JS responses).  On one hand it might be undesirable to introduce additional sniffing/parsing heuristics (defined and evolving separately from the JS parser and spec), but maybe such heuristics would be useful for catching PDF, ZIP, MSWORD, and other non-JS files that exhibit some "obvious" signs that they are non-JS?  Maybe we can brainstorm together on what such heuristics could look like?  I've tried to gather some notes in a doc here, but let me copy them below for your convenience:

ORB-with-html/json/xml-sniffing shows that some security benefits of ORB may be realized without full-fidelity JS sniffing/parsing.  Let’s explore various considerations that may lead to discovery of other alternative approaches to sniffing.


  • Pri1 requirement: Avoid breaking existing websites.

    • Requirement: Never block responses with JS body

      • Requirement: never block HTML/JS polyglots

      • Requirement: skip elements that are okay both in HTML and JS:

        • <!-- … --> comments

        • Whitespace

    • Non-requirement (?): Never block responses with *future* JS bodies

      • We don’t want to break _existing_ websites.  But maybe we can force _future_ websites to label their JS as JavaScript MIME type.  Therefore, we don’t have a requirement to robustly handle _future_ JS versions/specs (and/or _future_ image formats like JXL).

  • Pri2 requirement: Block as many responses as possible

    • Requirement: Block as many non-JS responses as possible (after earlier ORB steps rule out that we are dealing with an image, audio, video, or stylesheet):

      • Requirement: Block responses starting with: %PDF-

      • Requirement: Block zip files - files starting with 50 4B 03 04, or 50 4B 05 06, or 50 4B 07 08

      • Requirement: Block MS Office files - files starting with D0 CF 11 E0 A1 B1 1A E1 (source: MS-XLS spec + Microsoft Compound File Binary File Format spec)

      • Requirement: Block CSV files

      • Requirement: Block XML files - files starting with: <?xml

      • Requirement: Block ProtoBuf (binary and text encoding)

      • Requirement: Block responses beginning with JSON parser breaker - examples: )]}' {}&& while(1);

      • Requirement: Block HTML files: whitespace + HTML comments followed by a HTML element

  • Pri2 requirement: Do not regress performance / latency / etc

    • Requirement: Make the final decision based on the 1st 1024 bytes of the response body.

  • Assumption: UTF-8 (maybe it is okay if UTF-16 encoding of Javascript is not recognized as Javascript?)


We probably don't want the sniffer to have PDF-specific or ZIP-specific knowledge.  But maybe there are some generic heuristics that would detect PDF and ZIP as non-JS?

Not-quite-working heuristic: Maybe ASCII control characters should mean: non-JS (except LF and CR and other WhiteSpace and LineTerminator characters)?  This is a bit problematic because SourceCharacter in JS BNF allows any Unicode code point.  OTOH, maybe this only matters inside JS comments or string literals?
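For concreteness, the magic-number requirements above could be sketched roughly like this (a minimal illustration; all names are invented, and this is not Chromium's actual sniffer code):

```javascript
// Illustrative sketch only (invented names, not Chromium's sniffer):
// reject response bodies that start with a known non-JS magic number.
const NON_JS_MAGIC_PREFIXES = [
  [0x25, 0x50, 0x44, 0x46, 0x2d],                    // "%PDF-"
  [0x50, 0x4b, 0x03, 0x04],                          // ZIP
  [0x50, 0x4b, 0x05, 0x06],                          // ZIP (empty archive)
  [0x50, 0x4b, 0x07, 0x08],                          // ZIP (spanned archive)
  [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1],  // MS Office (Compound File Binary)
  [0x3c, 0x3f, 0x78, 0x6d, 0x6c],                    // "<?xml"
];

function startsWithNonJsMagic(bodyBytes) {
  return NON_JS_MAGIC_PREFIXES.some(prefix =>
    prefix.length <= bodyBytes.length &&
    prefix.every((byte, i) => bodyBytes[i] === byte));
}
```

For example, `startsWithNonJsMagic(Buffer.from('%PDF-1.7'))` is true, while a body starting with `console.log(` matches no prefix.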

Thanks,

Lukasz

On Thu, Aug 12, 2021 at 12:47 AM Leszek Swirski <les...@chromium.org> wrote:
That's also the case for invalid LHS on assignment, e.g. `lhs() = 5`, which should be an early error but for web compat we make it a runtime error.

Overall, this could be something we expose, but:
  1. There's a couple of additional complications around JS standards incompatible errors (like the two aforementioned ones), some of which are intentional
  2. There's the rule-of-2 violation
  3. This breaks streaming compilation (since the full body of the resource has to be available for parsing before it is sent to the renderer)
Having to present the full body of a resource is indeed problematic (because it requires gathering the entire response body before CORB/ORB can pass/expose the body to the renderer process).
  4. Parsing JS ain't cheap, and doing so as part of the network process, presumably before sending anything to the renderer, is quite a cost
  5. We don't have a way of distinguishing valid JS from valid JSON during parse, so we'd effectively need to parse twice
This seems solvable with something like:
When the sniffer sees:
[ 123, 456, “long string taking X bytes”,
then it should block the response when the Content-Type is a JSON MIME type, but otherwise it should allow the response (trading off security for backward compatibility).

When the sniffer sees:
{ “foo”:
then it should block the response, because such a prefix never results in valid Javascript. (Although the JSON object syntax is exactly Javascript's object-initializer syntax, a Javascript object-initializer expression is not valid as a standalone Javascript statement.)
 
  6. Standards can change, and syntax can change with it, so whether or not something is blocked will be version dependent
That is a fair point.  OTOH, we might have some flexibility here, because A) if CORB/ORB blocks only non-javascript responses, then this should have very little impact on web pages that work fine and B) we mostly care about avoiding breaking _existing_ websites (and therefore might be okay ignoring _future_ Javascript spec changes and forcing _future_ scripts to always be served with a correct MIME type).
  7. The DX of blocking a script just because of a parse error may be suboptimal
Ack.  This is something that I didn't have in focus, so thanks for bringing this up.  Maybe (?) it is okay to say that to get good Developer eXperience one has to label their scripts with the correct JavaScript MIME type.

On Thu, Aug 12, 2021 at 9:37 AM 'Mathias Bynens' via v8-dev <v8-...@googlegroups.com> wrote:
Another complication is that V8 currently doesn’t throw early (“parse”) errors for regular expression literals (issue 896). This would have to be resolved before we can accurately validate whether a given input is valid JS or not.

On Thu, Aug 12, 2021 at 9:31 AM 'Hannes Payer' via v8-dev <v8-...@googlegroups.com> wrote:
Hi Lukasz,

To understand your question correctly: You want an API which returns true if the JavaScript input is valid, right?

Yes.  I am not sure at this point whether the input is 1) a string containing the whole response body, 2) a string containing a prefix of the response body (e.g. the 1st 1024 bytes), or 3) a stream. 

I think this surgery should be possible but I am deferring to the parser owners. @Leszek Swirski @Toon Verwaest WDYT? Maybe that's even a nice testing mode for JS language features.

The parser is quite complicated which is a problem from a security perspective. That's a Rule-of-2 violation.

Ack. 


--
Thanks,

Lukasz

Jakob Kummerow

Aug 12, 2021, 6:11:54 PM
to v8-dev, Charlie Reis

ORB-with-html/json/xml-sniffing shows that some security benefits of ORB may be realized without full-fidelity JS sniffing/parsing. 


You may call it a security benefit to block "obvious" parser breakers like )]}', but in general, any "when in doubt, don't block it" strategy won't be much of an obstacle to intentional attacks. For instance, once Mr. Bad Guy has learned that the sniffer only looks at the first 1024 characters, they can send a response whose first 1024 characters lead to a "well, it might be valid JS" judgement (such as a JS comment, or long string, or whatever). OTOH any "when in doubt, block it" strategy runs the risk of breaking existing websites in those doubtful cases.
 
 (Although the JSON object syntax is exactly Javascript's object-initializer syntax, a Javascript object-initializer expression is not valid as a standalone Javascript statement.)

There is (at least) one subtlety here: JS is more permissive than the official JSON spec. The latter requires quotes around property names, the former doesn't. I.e. {"foo": is indeed never valid JS, but {foo: is (the brace opens a code block, and foo is a label). Also, the colon is essential for rejecting the former snippet, because {"foo"; is valid JS (code block plus ignored string á la "use strict";), so this is a concrete example where the 1024-char prefix issue is relevant.
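These examples can be double-checked against an engine's own parser. A rough probe, using `new Function` (note this wraps the source in a function body, so it only approximates parsing the text as a top-level Script):

```javascript
// Rough validity probe: ask the engine to parse (not run) the source.
// Caveat: new Function() parses the text as a function body, which is
// only an approximation of parsing it as a top-level Script.
function parsesAsJs(source) {
  try {
    new Function(source);  // parses/compiles, but does not execute
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;  // anything other than a parse error is unexpected here
  }
}
```

With this probe, `parsesAsJs('{foo: 1}')` is true (code block plus label), `parsesAsJs('{"foo": 1}')` is false (a string cannot be a label), and `parsesAsJs('{"foo";}')` is true (code block plus expression statement).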
 
When the sniffer sees:
     [ 123, 456, “long string taking X bytes”,
then it should block the response when the Content-Type is a JSON MIME type

I don't follow. When the Content-Type is JSON, and the actual contents are valid JSON, why should that be blocked?

Łukasz Anforowicz

Aug 12, 2021, 6:18:21 PM
to v8-...@googlegroups.com, Charlie Reis
On Thu, Aug 12, 2021 at 3:11 PM Jakob Kummerow <jkum...@chromium.org> wrote:

ORB-with-html/json/xml-sniffing shows that some security benefits of ORB may be realized without full-fidelity JS sniffing/parsing. 


You may call it a security benefit to block "obvious" parser breakers like )]}', but in general, any "when in doubt, don't block it" strategy won't be much of an obstacle to intentional attacks. For instance, once Mr. Bad Guy has learned that the sniffer only looks at the first 1024 characters, they can send a response whose first 1024 characters lead to a "well, it might be valid JS" judgement (such as a JS comment, or long string, or whatever). OTOH any "when in doubt, block it" strategy runs the risk of breaking existing websites in those doubtful cases.

In the CORB threat model the attacker does *not* control the responses - CORB tries to prevent https://attacker.com (with either Spectre or a compromised renderer) from being able to read no-cors responses from https://victim.com.
 
 (Although the JSON object syntax is exactly Javascript's object-initializer syntax, a Javascript object-initializer expression is not valid as a standalone Javascript statement.)

There is (at least) one subtlety here: JS is more permissive than the official JSON spec. The latter requires quotes around property names, the former doesn't. I.e. {"foo": is indeed never valid JS, but {foo: is (the brace opens a code block, and foo is a label). Also, the colon is essential for rejecting the former snippet, because {"foo"; is valid JS (code block plus ignored string á la "use strict";), so this is a concrete example where the 1024-char prefix issue is relevant.
 
When the sniffer sees:
     [ 123, 456, “long string taking X bytes”,
then it should block the response when the Content-Type is a JSON MIME type

I don't follow. When the Content-Type is JSON, and the actual contents are valid JSON, why should that be blocked?

Correct.  There is no way to read cross-origin JSON via a "no-cors" fetch.  The only way to read cross-origin JSON is via CORS-mediated fetch (where the victim has to opt-in by responding with "Access-Control-Allow-Origin: ...").


Łukasz Anforowicz

Aug 12, 2021, 6:26:06 PM
to v8-...@googlegroups.com, Charlie Reis
On Thu, Aug 12, 2021 at 3:18 PM Łukasz Anforowicz <luk...@google.com> wrote:


On Thu, Aug 12, 2021 at 3:11 PM Jakob Kummerow <jkum...@chromium.org> wrote:

ORB-with-html/json/xml-sniffing shows that some security benefits of ORB may be realized without full-fidelity JS sniffing/parsing. 


You may call it a security benefit to block "obvious" parser breakers like )]}', but in general, any "when in doubt, don't block it" strategy won't be much of an obstacle to intentional attacks. For instance, once Mr. Bad Guy has learned that the sniffer only looks at the first 1024 characters, they can send a response whose first 1024 characters lead to a "well, it might be valid JS" judgement (such as a JS comment, or long string, or whatever). OTOH any "when in doubt, block it" strategy runs the risk of breaking existing websites in those doubtful cases.

In the CORB threat model the attacker does *not* control the responses - CORB tries to prevent https://attacker.com (with either Spectre or a compromised renderer) from being able to read no-cors responses from https://victim.com.
 
 (Although the JSON object syntax is exactly Javascript's object-initializer syntax, a Javascript object-initializer expression is not valid as a standalone Javascript statement.)

There is (at least) one subtlety here: JS is more permissive than the official JSON spec. The latter requires quotes around property names, the former doesn't. I.e. {"foo": is indeed never valid JS, but {foo: is (the brace opens a code block, and foo is a label). Also, the colon is essential for rejecting the former snippet, because {"foo"; is valid JS (code block plus ignored string á la "use strict";), so this is a concrete example where the 1024-char prefix issue is relevant.
 
When the sniffer sees:
     [ 123, 456, “long string taking X bytes”,
then it should block the response when the Content-Type is a JSON MIME type

I don't follow. When the Content-Type is JSON, and the actual contents are valid JSON, why should that be blocked?

Correct.  There is no way to read cross-origin JSON via a "no-cors" fetch.  The only way to read cross-origin JSON is via CORS-mediated fetch (where the victim has to opt-in by responding with "Access-Control-Allow-Origin: ...").

Maybe another way to look at it is:
  • Only JavaScript (and images/audio/video/stylesheets) can be sent in no-cors mode (i.e. without CORS).  Non-JavaScript (and non-image/video/etc), no-cors, cross-origin responses can be blocked.
  • If the response sniffs as JSON (Content-Type=JSON and First1024bytes=JSON) then it is *not* Javascript.  Therefore we can block the response (and prevent disclosing https://victim.com/secret.json to a no-cors fetch from https://attacker.com).
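A minimal sketch of that decision rule (illustrative only; the names are invented, and a real sniffer would need an incremental JSON parse that tolerates a truncated 1024-byte prefix, rather than `JSON.parse`):

```javascript
// Illustrative only: block when the MIME type says JSON *and* the body
// prefix actually parses as JSON. Real code would parse incrementally,
// since a 1024-byte prefix of a large JSON body won't JSON.parse cleanly.
function shouldBlockAsJson(contentType, bodyPrefix) {
  const mime = contentType.split(';')[0].trim().toLowerCase();
  const isJsonMime = /(\/|\+)json$/.test(mime);  // e.g. application/json
  if (!isJsonMime) return false;
  try {
    JSON.parse(bodyPrefix);
    return true;   // Content-Type=JSON and body sniffs as JSON: block
  } catch {
    return false;  // inconclusive: don't block
  }
}
```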
 



--
Thanks,

Lukasz



Toon Verwaest

Aug 17, 2021, 9:59:29 AM
to v8-...@googlegroups.com, Charlie Reis
Thinking out loud: One idea could be to have a separate sandboxed compiler process in which we compile incoming JS code. That could reject the source if it doesn't compile; or compile it to a script that just throws with no additional info about the actual source.

That process could implement streaming compilation; so we don't block streaming until later, we don't double parse, we still have a sandbox (not in the network process). There might even be benefits for caching as a compromised renderer cannot look at the compilation artefacts until it receives them.

If we fully compile and create a code cache from the compilation result we don't need a new API on the V8 side, but do additional serialization/deserialization work. That should be faster than reparsing though. The upper limit of the cost would essentially be the cost of serializing / deserializing a code cache for each script.


Łukasz Anforowicz

Aug 17, 2021, 8:29:28 PM
to v8-...@googlegroups.com, Charlie Reis
On Tue, Aug 17, 2021 at 6:59 AM Toon Verwaest <verw...@chromium.org> wrote:
Thinking out loud: One idea could be to have a separate sandboxed compiler process in which we compile incoming JS code. That could reject the source if it doesn't compile; or compile it to a script that just throws with no additional info about the actual source.

That process could implement streaming compilation; so we don't block streaming until later, we don't double parse, we still have a sandbox (not in the network process). There might even be benefits for caching as a compromised renderer cannot look at the compilation artefacts until it receives them.

If we fully compile and create a code cache from the compilation result we don't need a new API on the V8 side, but do additional serialization/deserialization work. That should be faster than reparsing though. The upper limit of the cost would essentially be the cost of serializing / deserializing a code cache for each script.

This seems like an interesting idea.  I wonder if compilation (no evaluation / running of scripts) would be considered safe enough to handle in a single (not origin/site-bound/locked) process.

One thing that I don't fully understand (For both full-JS-parsing and partial/hackish-non-JS-detection approaches) is if the encoding (e.g. UTF8 vs UTF16-LE vs Win-1250) has to be known and communicated upfront to the parser/sniffer?  Or maybe the input to the decoder needs to be already in UTF8?  Or maybe something in //net or //network layers can already handle this aspect of the problem (e.g. ensuring UTF8 in URLLoader::DidRead)?

Also - when trying to explore the partial/hackish-non-JS-detection idea, I wondered if the very first character in a script may only come from a relatively limited set of characters?  Let's assume that the sniffer can skip whitespace (space, tab, CR, LF, LS, PS) and html/xml comments (e.g. <!-- ... -->) - AFAICT the very next character has to be either:
  • The start of a reserved keyword like "if", "let", etc. (all lowercase ASCII)
  • The start of an identifier (any Unicode code point with the Unicode property “ID_Start”)
  • The start of a unary expression: + - ~ !
  • The start of a string literal, string template, or a regexp literal (or non-HTML comment): " ' ` /
  • The start of a numeric literal: 0-9
  • An opening paren, bracket or brace: ( [ {
  • Not quite sure if a dot or an equal sign can appear as the very first character: . =
This would reject PDFs (starts with %) and HTML/XML (starts with <), but still would accept ZIP files (first character is a 0x50 - capital P) and MSOffice files (first character is a 0xD0 which according to Unicode has ID_Start property set to true).  Rejecting ZIP and MSOffice files would require going beyond the first character - maybe rejecting control characters like 0x11 or 0x03 outside of comments (not sure if at this point the sniffer's heuristics are starting to get too complex).
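The first-character heuristic above might look something like this (a sketch under stated simplifications: it skips only whitespace, not the `<!--` comments mentioned above, and approximates the Unicode ID_Start property with ASCII):

```javascript
// Sketch of the "first meaningful character" heuristic. Assumptions:
// only whitespace is skipped (no <!-- --> comment handling), and
// identifier starts are approximated by ASCII (real ID_Start is Unicode).
function firstCharLooksLikeJs(text) {
  const rest = text.replace(/^\s+/, '');
  if (rest === '') return true;  // nothing but whitespace: inconclusive
  const c = rest[0];
  return /[A-Za-z_$0-9]/.test(c) ||   // keyword, identifier, or number
         '+-~!"\'`/([{'.includes(c);  // unary op, literal/comment start, bracket
}
```

This rejects `%PDF-` and `<?xml`, and accepts ZIP (first byte is an ASCII `P`); note that the ASCII approximation would also reject the MS Office 0xD0 byte, which a spec-faithful ID_Start check (as discussed above) would have to accept.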

Toon Verwaest

Aug 18, 2021, 9:18:08 AM
to v8-...@googlegroups.com, Charlie Reis
On Wed, Aug 18, 2021 at 2:29 AM 'Łukasz Anforowicz' via v8-dev <v8-...@googlegroups.com> wrote:


On Tue, Aug 17, 2021 at 6:59 AM Toon Verwaest <verw...@chromium.org> wrote:
Thinking out loud: One idea could be to have a separate sandboxed compiler process in which we compile incoming JS code. That could reject the source if it doesn't compile; or compile it to a script that just throws with no additional info about the actual source.

That process could implement streaming compilation; so we don't block streaming until later, we don't double parse, we still have a sandbox (not in the network process). There might even be benefits for caching as a compromised renderer cannot look at the compilation artefacts until it receives them.

If we fully compile and create a code cache from the compilation result we don't need a new API on the V8 side, but do additional serialization/deserialization work. That should be faster than reparsing though. The upper limit of the cost would essentially be the cost of serializing / deserializing a code cache for each script.

This seems like an interesting idea.  I wonder if compilation (no evaluation / running of scripts) would be considered safe enough to handle in a single (not origin/site-bound/locked) process.

The parser/compiler aren't tiny, so it's not unlikely there's a bug. Such bugs are certainly much harder to exploit in a controlled way than full-blown JS OOB access, though. I could imagine a security bug replacing scripts in another site (assuming the process is sandboxed so well that it can't do much else), which would be terrible; and it's unclear to me how easy that would be.
 

One thing that I don't fully understand (For both full-JS-parsing and partial/hackish-non-JS-detection approaches) is if the encoding (e.g. UTF8 vs UTF16-LE vs Win-1250) has to be known and communicated upfront to the parser/sniffer?  Or maybe the input to the decoder needs to be already in UTF8?  Or maybe something in //net or //network layers can already handle this aspect of the problem (e.g. ensuring UTF8 in URLLoader::DidRead)?

There's some encoding guessing happening before we streaming compile (https://source.chromium.org/chromium/chromium/src/+/main:third_party/blink/renderer/bindings/core/v8/script_streamer.cc;l=584;drc=f0b502c3c977f47c58b49506629b2dd8353e4c59;bpv=1;bpt=1) and some afterwards; and if we initially compiled with the wrong encoding we discard and redo iirc. Presumably compilation failed anyway if the encoding was wrong; but this presumably also doesn't happen too often.
 

Also - when trying to explore the partial/hackish-non-JS-detection idea, I wondered if the very first character in a script may only come from a relatively limited set of characters?  Let's assume that the sniffer can skip whitespace (space, tab, CR, LF, LS, PS) and html/xml comments (e.g. <!-- ... -->) - AFAICT the very next character has to be either:
  • The start of a reserved keyword like "if", "let", etc. (all lowercase ASCII)
  • The start of an identifier (any Unicode code point with the Unicode property “ID_Start”)
  • The start of a unary expression: + - ~ !
  • The start of a string literal, string template, or a regexp literal (or non-HTML comment): " ' ` /
  • The start of a numeric literal: 0-9
  • An opening paren, bracket or brace: ( [ {
  • Not quite sure if a dot or an equal sign can appear as the very first character: . =
This would reject PDFs (starts with %) and HTML/XML (starts with <), but still would accept ZIP files (first character is a 0x50 - capital P) and MSOffice files (first character is a 0xD0 which according to Unicode has ID_Start property set to true).  Rejecting ZIP and MSOffice files would require going beyond the first character - maybe rejecting control characters like 0x11 or 0x03 outside of comments (not sure if at this point the sniffer's heuristics are starting to get too complex).

That was my initial thought too for e.g. PDF. You'd be blacklisting files you don't want to leak vs. whitelisting JS though, which isn't entirely ideal security-wise. It might still be better than the alternatives, though, if we otherwise either end up slowing down the web (repeat parsing, interference with streaming) or potentially introduce new security issues through a shared compiler process.
 

Marja Hölttä

Sep 1, 2021, 10:39:40 AM
to v8-...@googlegroups.com, Charlie Reis
A random side note: it's also possible to make V8's recursive descent parser run out of stack using valid JS, e.g., let a = [[[[[..[ 0 ]]]]]..] or other similar constructs (deep enough). Meaning you probably don't want to call into the parser in a process where you don't want this to happen.

Re: encodings, when I worked on script streaming I noticed it's pretty common that scripts advertised as UTF-8 are not valid UTF-8 (e.g., have invalid chars inside comments), and Chrome is currently pretty lenient about those.





Marja Hölttä

Sep 1, 2021, 11:46:25 AM
to v8-...@googlegroups.com, Charlie Reis
Wait, no, we do handle running out of stack in a robust way, and the "does this parse" check should just return false then (even though the code might be valid JS). Please ignore that part of my comment :)

Daniel Vogelheim

May 31, 2022, 8:45:12 AM
to v8-dev
Hi all,

Apologies for reviving this thread, but this problem is coming up again. I think the answer of parsing in a separate process would work, but I'd really like to find a simpler solution. As far as I can see, the underlying security requirements should be much less strict than the current ORB proposal implies. An approximation should do just fine. For example, for media formats we just look for a "magic number" (e.g. a 3-byte constant for JPEG files); so I don't think we need a full parse of the input.

Here is how I'd like to simplify this:
- Run only the JS scanner. (Including charset + comment processing.)
- Take the first N tokens. I suspect N=3 would be enough.
- Check the token list against a set of permissible token sequences.

Even for small N a complete list of permissible sequences might be rather large. It might be worth approximating it.
In either case, this method easily handles pretty much all of the requirements from Lukasz' earlier mail (except "while(1);", which needs N>=5). It does leave some ambiguity towards JSON, but IMHO that's tolerable.
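As a toy illustration of the first-N-tokens idea (the mini-scanner and the non-JS opening list below are invented for this sketch; a real version would use V8's scanner and, ideally, a generated table of permissible sequences):

```javascript
// Toy token-prefix check. Invented for illustration; not V8's scanner.
const tokenize = (src) =>
  src.match(/[A-Za-z_$][\w$]*|\d+|[()[\]{};,:'"]|\S/g) || [];

// Known non-JS openings, expressed as token sequences.
const NON_JS_TOKEN_OPENINGS = [
  [')', ']', '}'],  // the `)]}'` JSON parser breaker (first 3 tokens)
];

function blockedByTokenPrefix(src, n = 3) {
  const head = tokenize(src).slice(0, n);
  return NON_JS_TOKEN_OPENINGS.some(seq =>
    seq.length <= head.length &&
    seq.every((tok, i) => head[i] === tok));
}
```

Here a body starting with `)]}'` is blocked, while `while(1);` (tokens `while`, `(`, `1`, ...) does not match any listed opening at N=3, consistent with the observation that it needs N>=5.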

Would this make sense from a V8 perspective?

Is it possible to generate a list of possible token sequences from the JS grammar, or would one have to do that manually? (For, say, N=3)

The question of standardization has also come up. Could TC39 maybe be convinced to adopt such a JavaScript sniffer, since it's fundamentally an operation on JS syntax? (That would hopefully prevent the sniffer and the actual syntax from getting out of sync as JS evolves.)

Any thoughts?

Daniel

Leszek Swirski

unread,
May 31, 2022, 12:00:03 PM5/31/22
to v8-...@googlegroups.com
I want to note one thing here, kind of a side observation really: while(1); is valid JS, it's just an infinite loop. Do we also want to guard against common patterns like this?

- Leszek

Łukasz Anforowicz

unread,
May 31, 2022, 1:34:21 PM5/31/22
to v8-...@googlegroups.com
On Tue, May 31, 2022 at 9:00 AM Leszek Swirski <les...@chromium.org> wrote:
I want to note one thing here, kind of a side observation really: while(1); is valid JS, it's just an infinite loop. Do we also want to guard against common patterns like this?

FWIW today CORB explicitly detects and blocks `while(1);` (the code here has some extra comments and details).  OTOH, 1) I am not sure if detecting `while(1);` is a hard requirement (maybe detecting JS-parser-breakers is sufficient), and 2) I am not sure if/how `while(1);`-related considerations impact the main points and questions from Daniel.
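For reference, the CORB-style "parser breaker" check amounts to a small prefix blocklist. The entries below are illustrative examples only; Chromium's actual implementation has more variants and corner cases.

```python
# Sketch of CORB-style parser-breaker detection: a blocklist of known
# prefixes, checked after stripping leading whitespace. Entries are
# illustrative, not the authoritative Chromium list.
PARSER_BREAKERS = (
    ")]}'",        # common JSON-hijacking guard
    "{}&&",        # another guard seen in the wild
    "for(;;);",    # infinite-loop guard used by some large sites
    "while(1);",   # valid JS, but used purely as a parser breaker
)

def has_parser_breaker(body: str) -> bool:
    # str.startswith accepts a tuple of candidate prefixes.
    return body.lstrip().startswith(PARSER_BREAKERS)
```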

Leszek Swirski

unread,
Jun 1, 2022, 4:42:27 AM6/1/22
to v8-...@googlegroups.com
As I understand it, the intention here is that false-positives for "is JS" are acceptable, and that it's up to the victim site to avoid prefixes that might be JS, but aren't. With that, what's the benefit of a full JS parse over a list of known non-JS prefixes like the one we already have?

Łukasz Anforowicz

unread,
Jun 1, 2022, 11:17:32 AM6/1/22
to v8-...@googlegroups.com
On Wed, Jun 1, 2022 at 1:42 AM Leszek Swirski <les...@chromium.org> wrote:
As I understand it, the intention here is that false-positives for "is JS" are acceptable, and that it's up to the victim site to avoid prefixes that might be JS, but aren't. With that, what's the benefit of a full JS parse over a list of known non-JS prefixes like the one we already have?

Benefit of full JS parse over a list of known non-JS prefixes: Stricter is-it-JS checking = more non-JS things get blocked = improved security.  Still, there is a balance here - some heuristics (like the ones proposed by Daniel) are almost as secure as full JS parse (while being easier to implement and having less of a performance impact).

Leszek Swirski

unread,
Jun 1, 2022, 11:34:57 AM6/1/22
to v8-...@googlegroups.com
On Wed, Jun 1, 2022 at 5:17 PM 'Łukasz Anforowicz' via v8-dev <v8-...@googlegroups.com> wrote:
Benefit of full JS parse over a list of known non-JS prefixes: Stricter is-it-JS checking = more non-JS things get blocked = improved security.  Still, there is a balance here - some heuristics (like the ones proposed by Daniel) are almost as secure as full JS parse (while being easier to implement and having less of a performance impact).

Makes sense, I'm just asking to make sure that we strike the right balance between security improvements and complexity/performance issues; even a JS tokenizer without a full parser is quite a complexity investment (it needs e.g. a full regexp parser), plus the language grammar is sufficiently broad that I expect exhaustively enumerating all possible combinations of even just 3-5 tokens to yield a prohibitively large list (setting aside maintainability in the face of ever-updating standards).

Do we have a measure of how much non-JS coverage the current heuristics give, on real-world examples of JSON files? Or perhaps, a measure of how many different prefixes there are that we could blocklist? Do we know at what point the improved security has diminishing returns?

- Leszek

Łukasz Anforowicz

unread,
Jun 1, 2022, 12:45:09 PM6/1/22
to v8-...@googlegroups.com
Examples of response bodies that we would want to block, but that wouldn't get blocked without full JS parsing/verification (assume that the responses below are served as text/html or application/octet-stream):
  • PDF
  • ProtoBuf
  • Microsoft Word
  • CSV files



Leszek Swirski

unread,
Jun 2, 2022, 3:46:15 AM6/2/22
to v8-...@googlegroups.com
Can we not detect these via some magic number sniffing? I'm fundamentally concerned about an allowlist approach for JS over a blocklist approach for non-JS.

Note that CSV is sadly valid JS, so that won't be blocked at all.
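A magic-number check for the formats from the earlier list could look like the sketch below. The signatures are well known, but note the gaps: modern Word files (.docx) are ZIP containers and share a signature with many other formats, and ProtoBuf and CSV have no magic number at all, so sniffing alone can never provide positive evidence for them.

```python
# Magic-number sniffing sketch. Signature-to-format mapping is simplified
# for illustration.
MAGIC_NUMBERS = {
    b'%PDF-': 'pdf',
    b'PK\x03\x04': 'zip container (docx, xlsx, jar, ...)',
    b'\xd0\xcf\x11\xe0': 'legacy MS Office (doc, xls, ppt)',
}

def sniff(prefix: bytes):
    """Return a format name if a known signature matches, else None."""
    for magic, fmt in MAGIC_NUMBERS.items():
        if prefix.startswith(magic):
            return fmt
    return None  # unknown: no positive evidence either way
```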


Daniel Vogelheim

unread,
Jun 2, 2022, 11:21:06 AM6/2/22
to v8-dev
On Tuesday, May 31, 2022 at 6:00:03 PM UTC+2 les...@chromium.org wrote:
I want to note one thing here, kind of a side observation really: while(1); is valid JS, it's just an infinite loop. Do we also want to guard against common patterns like this?

No, I don't think this is a hard requirement. I'm not even sure how much of a common pattern it actually is.


On Wednesday, June 1, 2022 at 10:42:27 AM UTC+2 les...@chromium.org wrote:
> As I understand it, the intention here is that false-positives for "is JS" are acceptable, and that it's up to the victim site to avoid prefixes that might be JS, but aren't. With that, what's the benefit of a full JS parse over a list of known non-JS prefixes like the one we already have?

Admittedly, the whole ORB/CORB thing is a bit weird. What we really want sites to do is to properly label their resources with the correct mime types, because then the entire problem goes away. But because historically browsers don't (always) check mime types, we want some "backup" solution for sites that aren't cooperative. The given "parser breakers" are interesting because they're in use by some sites. (IMHO, "while (1);" is the worst example of them, because that is actually valid JS. But apparently it is being used.)




Daniel Vogelheim

unread,
Jun 2, 2022, 11:35:56 AM6/2/22
to v8-dev
On Thursday, June 2, 2022 at 9:46:15 AM UTC+2 les...@chromium.org wrote:
Can we not detect these via some magic number sniffing? I'm fundamentally concerned about an allowlist approach for JS over a blocklist approach for non-JS.

This is pretty much the heart of the issue: the entire point of the CORB-to-ORB transition is to go from a "blocklist" to an "allowlist", based on the observation that blocklists ultimately never seem to work. In particular, we don't want a pass-by-default model, where anything we don't recognize automatically passes. That does lead us to an allowlist, in some form. Elsewhere, I summarized (my understanding of) the ORB security requirements as this: for "no-cors" requests, we want to have some positive evidence that the data we're receiving is in a format suitable for the request type.

Being able to drop unknown stuff by default is really the core benefit of ORB.

I do think we have quite a bit of leeway to decide what form of "positive evidence" we'll accept. The current draft specifies a full JS parse, which I think is way over the top. But I do think we need something that tells us with some probability whether a given byte sequence looks like JS or not. The only hard criterion is that actually valid JS should pass, because otherwise we'll break websites left and right. (To that end, "while (1);" was arguably a terrible example.) (Caveat: Those are my opinions. Other browsers might have stronger opinions.)


IMHO, checking for "parser breakers", the way CORB does, is a convenient temporary solution, because we already know it's web compatible.

IMHO, a full parse (in the network process, or triggered by the network process) is crazy, and I'd really like to have something more lightweight.

Which leads me to the proposal to only use the scanner to look for a few tokens. And ideally for TC39 to adopt some sort of SmellsLikeJavaScript abstract operation that other standards could point to.


Leszek Swirski

unread,
Jun 3, 2022, 4:54:55 AM6/3/22
to v8-...@googlegroups.com, Shu-yu Guo
Ok, if allowlisting vs blocklisting is the heart of the issue, I can accept that this is a design requirement.

So, re: parse vs. scan -- I'm not sure this is a sufficient simplification. In particular, if memory serves, our parse cost is roughly 50% scanner and 50% token interpretation + AST building, so you'll get at best a ~2x speedup over a full parse (or over a pre-parse? I don't remember the exact breakdown). Particularly there's a cost to identifying keywords vs identifiers, but we could probably drop that and ignore keywords. Parsing strings and regexps has some cost, but you could maybe make them cheaper with stronger approximations (race to closing quotes, that sort of thing). Then, I wouldn't check if the token combination is a definitely valid one, just whether the tokenizer failed at all + some simple token-based heuristics (like brace matching, simple patterns). Tokenizer failure would most likely catch almost all binary formats; non-binary formats are likely too JS-compatible (some raw JSON, a lot of YAML, and I think all CSV are valid JS) and would still need to rely on a more blocklist-style approach with said token heuristics.

Getting a TC39-approved version of this... well, any spec work is hard. +Shu-yu Guo.

Daniel Vogelheim

unread,
Jun 3, 2022, 11:37:16 AM6/3/22
to v8-...@googlegroups.com, Shu-yu Guo
On Fri, Jun 3, 2022 at 10:54 AM Leszek Swirski <les...@chromium.org> wrote:
Ok, if allowlisting vs blocklisting is the heart of the issue, I can accept that this is a design requirement.

So, re: parse vs. scan -- I'm not sure this is a sufficient simplification. In particular, if memory serves, our parse cost is roughly 50% scanner and 50% token interpretation + AST building, so you'll get at best a ~2x speedup over a full parse (or over a pre-parse? I don't remember the exact breakdown). Particularly there's a cost to identifying keywords vs identifiers, but we could probably drop that and ignore keywords. Parsing strings and regexps has some cost, but you could maybe make them cheaper with stronger approximations (race to closing quotes, that sort of thing). Then, I wouldn't check if the token combination is a definitely valid one, just whether the tokenizer failed at all + some simple token-based heuristics (like brace matching, simple patterns). Tokenizer failure would most likely catch almost all binary formats; non-binary formats are likely too JS-compatible (some raw JSON, a lot of YAML, and I think all CSV are valid JS) and would still need to rely on a more blocklist-style approach with said token heuristics.

Thank you. This is a very helpful response.

My main idea for simplification was to reduce the amount of data scanned. Like, just the first 3 tokens or so. We can either reduce cost by reducing the cost of the operation (parsing > pre-parsing > only scanning), or by reducing the input size (whole file > 1kB prefix > just a few bytes). Or both.

Scanning + brace matching sounds very enticing. That's very doable, and would indeed filter out pretty much any binary format, and nearly all "parser breakers".

It'd also be much better than parsing in terms of code complexity. The V8 scanner is one medium-size file + headers, and only loosely coupled to the rest of the engine. (Mainly the input stream and the AstValueFactory.) The parser is a good bit larger and tied to much more infrastructure (the whole AST).
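A minimal brace-matching pass over a prefix might look like the sketch below. It tracks string literals so brackets inside them are ignored, but deliberately skips comments and regexp literals; it's an assumption-laden illustration, not V8 code.

```python
# Brace-matching heuristic sketch: reject a prefix whose closers don't match
# the most recent opener. This catches ")]}'" and most binary junk while
# accepting any well-bracketed prefix.
PAIRS = {')': '(', ']': '[', '}': '{'}

def braces_balance(prefix: str) -> bool:
    stack, in_string, i = [], None, 0
    while i < len(prefix):
        c = prefix[i]
        if in_string:
            if c == '\\':
                i += 1                  # skip the escaped character
            elif c == in_string:
                in_string = None        # string literal ends
        elif c in '"\'`':
            in_string = c               # string literal starts
        elif c in '([{':
            stack.append(c)
        elif c in PAIRS:
            if not stack or stack.pop() != PAIRS[c]:
                return False            # mismatched closer: reject
        i += 1
    return True                         # prefix may still be JS
```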

I wonder what simple heuristics one could have. I think most operators can't be the first token. Or be followed by another operator. Or an identifier can't be followed by another identifier. Would be good to validate that, though. I think, a while ago, Nikos had a script to extract a cover grammar from the TC39 spec. Maybe that can be hacked up to extract simple, impossible sequences or sets of relevant token classes.


Getting a TC39 approved version of this... well, any spec word is hard. +Shu-yu Guo.

Very true, unfortunately. :)
 