--
--
v8-dev mailing list
v8-...@googlegroups.com
http://groups.google.com/group/v8-dev
---
You received this message because you are subscribed to the Google Groups "v8-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to v8-dev+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/v8-dev/d4dd45ff-3b73-4d4b-883d-d2e8ba4123e7n%40googlegroups.com.
| Hannes Payer | | V8 | | Google Germany GmbH | | Erika-Mann Str. 33, 80636 München |
ORB-with-html/json/xml-sniffing shows that some security benefits of ORB may be realized without full-fidelity JS sniffing/parsing. Let’s explore various considerations that may lead to discovery of other alternative approaches to sniffing.
Pri1 requirement: Avoid breaking existing websites.
- Requirement: Never block responses with a JS body.
- Requirement: Never block HTML/JS polyglots.
- Requirement: Skip elements that are okay in both HTML and JS:
  - <!-- … --> comments
  - Whitespace
- Non-requirement (?): Never block responses with *future* JS bodies.
  We don’t want to break _existing_ websites. But maybe we can force _future_ websites to label their JS with a JavaScript MIME type. Therefore, we don’t have a requirement to robustly handle _future_ JS versions/specs (and/or _future_ image formats like JXL).
Pri2 requirement: Block as many responses as possible.
- Requirement: Block as many non-JS responses as possible (after earlier ORB steps rule out that we are dealing with an image, audio, video, or stylesheet):
  - Requirement: Block responses starting with: %PDF-
  - Requirement: Block zip files - files starting with 50 4B 03 04, or 50 4B 05 06, or 50 4B 07 08
  - Requirement: Block MS Office files - files starting with D0 CF 11 E0 A1 B1 1A E1 (source: MS-XLS spec + Microsoft Compound File Binary File Format spec)
  - Requirement: Block CSV files
  - Requirement: Block XML files - files starting with: <?xml
  - Requirement: Block ProtoBuf (binary and text encoding)
  - Requirement: Block responses beginning with a JSON parser breaker - examples: )]}' {}&& while(1);
  - Requirement: Block HTML files: whitespace + HTML comments followed by an HTML element
Pri2 requirement: Do not regress performance / latency / etc.
- Requirement: Make the final decision based on the first 1024 bytes of the response body.
- Assumption: UTF-8 (maybe it is okay if a UTF-16 encoding of JavaScript is not recognized as JavaScript?)
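As a rough illustration of the byte-prefix requirements listed above, here is a minimal Python sketch. The names `BLOCKED_PREFIXES`, `SNIFF_LIMIT` and `has_blocked_prefix` are hypothetical and not taken from any ORB or Chromium implementation. Only the prefixes spelled out above are covered; CSV, ProtoBuf and HTML detection are left out because the list gives no byte pattern for them.

```python
# Hypothetical prefix table; the byte values are the ones listed in the
# requirements above, nothing else is implied.
BLOCKED_PREFIXES = (
    b"%PDF-",                              # PDF
    b"PK\x03\x04",                         # ZIP: 50 4B 03 04
    b"PK\x05\x06",                         # ZIP: 50 4B 05 06
    b"PK\x07\x08",                         # ZIP: 50 4B 07 08
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1",   # MS Office compound file
    b"<?xml",                              # XML
    b")]}'",                               # JSON parser breaker
    b"{}&&",                               # JSON parser breaker
    b"while(1);",                          # listed above as a parser breaker
)

# The requirements ask for a decision based on the first 1024 bytes.
SNIFF_LIMIT = 1024


def has_blocked_prefix(body: bytes) -> bool:
    """Return True if the response body starts with one of the listed prefixes."""
    head = body[:SNIFF_LIMIT]
    # bytes.startswith accepts a tuple and checks each candidate in turn.
    return head.startswith(BLOCKED_PREFIXES)
```

For example, `has_blocked_prefix(b"%PDF-1.7")` reports a match, while a body such as `b"var x = 1;"` does not.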
That's also the case for invalid LHS on assignment, e.g. `lhs() = 5`, which should be an early error but for web compat we make it a runtime error.

Overall, this could be something we expose, but:
- There are a couple of additional complications around JS standards incompatible errors (like the two aforementioned ones), some of which are intentional
- There's the rule-of-2 violation
- This breaks streaming compilation (since the full body of the resource has to be available for parsing before it is sent to the renderer)
- Parsing JS ain't cheap, and doing so as part of the network process, presumably before sending anything to the renderer, is quite a cost
- We don't have a way of distinguishing valid JS from valid JSON during parse, so we'd effectively need to parse twice
When the sniffer sees:
[ 123, 456, "long string taking X bytes",
then it should block the response when the Content-Type is a JSON MIME type, but otherwise it should allow the response (trading off security for backward compatibility).

When the sniffer sees:
{ "foo":
then it should block the response, because such a prefix never results in valid JavaScript. (Although the JSON object syntax is exactly JavaScript's object-initializer syntax, a JavaScript object-initializer expression is not valid as a standalone JavaScript statement.)
- Standards can change, and syntax can change with it, so whether or not something is blocked will be version dependent
- The DX of blocking a script just because of a parse error may be suboptimal
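The `{ "foo":` rule described above can be sketched as a small prefix check. This is only an illustration under stated assumptions: the function name `looks_like_json_object` and the regular expression are hypothetical, the input is assumed to be ASCII-compatible bytes, and only double-quoted property names are recognized.

```python
import re

# Optional whitespace, `{`, a double-quoted string (with backslash escapes),
# then `:`. The colon matters: without it the prefix could still be a JS
# block containing a string expression statement.
_JSON_OBJECT_PREFIX = re.compile(rb'\s*\{\s*"(?:[^"\\]|\\.)*"\s*:')


def looks_like_json_object(head: bytes) -> bool:
    """Return True if the prefix opens a JSON object with a quoted key."""
    return _JSON_OBJECT_PREFIX.match(head) is not None
```

With this check, `b'{ "foo": 1 }'` is flagged, while `b'{foo: 1}'` (a block with a label) and `b'{"foo"; }'` (a block with a string statement) are not.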
On Thu, Aug 12, 2021 at 9:37 AM 'Mathias Bynens' via v8-dev <v8-...@googlegroups.com> wrote:
Another complication is that V8 currently doesn’t throw early (“parse”) errors for regular expression literals (issue 896). This would have to be resolved before we can accurately validate whether a given input is valid JS or not.

On Thu, Aug 12, 2021 at 9:31 AM 'Hannes Payer' via v8-dev <v8-...@googlegroups.com> wrote:
Hi Lukasz,

To understand your question correctly: You want an API which returns true if the JavaScript input is valid, right?

I think this surgery should be possible but I am deferring to the parser owners. @Leszek Swirski @Toon Verwaest WDYT? Maybe that's even a nice testing mode for JS language features.

The parser is quite complicated, which is a problem from a security perspective. That's a Rule-of-2 violation.
ORB-with-html/json/xml-sniffing shows that some security benefits of ORB may be realized without full-fidelity JS sniffing/parsing.
You may call it a security benefit to block "obvious" parser breakers like )]}', but in general, any "when in doubt, don't block it" strategy won't be much of an obstacle to intentional attacks. For instance, once Mr. Bad Guy has learned that the sniffer only looks at the first 1024 characters, they can send a response whose first 1024 characters lead to a "well, it might be valid JS" judgement (such as a JS comment, or long string, or whatever). OTOH any "when in doubt, block it" strategy runs the risk of breaking existing websites in those doubtful cases.
(Although the JSON object syntax is exactly Javascript's object-initializer syntax, a Javascript object-initializer expression is not valid as a standalone Javascript statement.)

There is (at least) one subtlety here: JS is more permissive than the official JSON spec. The latter requires quotes around property names, the former doesn't. I.e. {"foo": is indeed never valid JS, but {foo: is (the brace opens a code block, and foo is a label). Also, the colon is essential for rejecting the former snippet, because {"foo"; is valid JS (code block plus ignored string à la "use strict";), so this is a concrete example where the 1024-char prefix issue is relevant.

When the sniffer sees:
[ 123, 456, "long string taking X bytes",
then it should block the response when the Content-Type is a JSON MIME type

I don't follow. When the Content-Type is JSON, and the actual contents are valid JSON, why should that be blocked?
On Thu, Aug 12, 2021 at 3:11 PM Jakob Kummerow <jkum...@chromium.org> wrote:
ORB-with-html/json/xml-sniffing shows that some security benefits of ORB may be realized without full-fidelity JS sniffing/parsing.
You may call it a security benefit to block "obvious" parser breakers like )]}', but in general, any "when in doubt, don't block it" strategy won't be much of an obstacle to intentional attacks. For instance, once Mr. Bad Guy has learned that the sniffer only looks at the first 1024 characters, they can send a response whose first 1024 characters lead to a "well, it might be valid JS" judgement (such as a JS comment, or long string, or whatever). OTOH any "when in doubt, block it" strategy runs the risk of breaking existing websites in those doubtful cases.

In the CORB threat model the attacker does *not* control the responses - CORB tries to prevent https://attacker.com (with either Spectre or a compromised renderer) from being able to read no-cors responses from https://victim.com.

(Although the JSON object syntax is exactly Javascript's object-initializer syntax, a Javascript object-initializer expression is not valid as a standalone Javascript statement.)
There is (at least) one subtlety here: JS is more permissive than the official JSON spec. The latter requires quotes around property names, the former doesn't. I.e. {"foo": is indeed never valid JS, but {foo: is (the brace opens a code block, and foo is a label). Also, the colon is essential for rejecting the former snippet, because {"foo"; is valid JS (code block plus ignored string à la "use strict";), so this is a concrete example where the 1024-char prefix issue is relevant.

When the sniffer sees:
[ 123, 456, "long string taking X bytes",
then it should block the response when the Content-Type is a JSON MIME type
I don't follow. When the Content-Type is JSON, and the actual contents are valid JSON, why should that be blocked?

Correct. There is no way to read cross-origin JSON via a "no-cors" fetch. The only way to read cross-origin JSON is via CORS-mediated fetch (where the victim has to opt-in by responding with "Access-Control-Allow-Origin: ...").
Thanks,
Lukasz
Thinking out loud: One idea could be to have a separate sandboxed compiler process in which we compile incoming JS code. That could reject the source if it doesn't compile; or compile it to a script that just throws with no additional info about the actual source.

That process could implement streaming compilation; so we don't block streaming until later, we don't double parse, we still have a sandbox (not in the network process). There might even be benefits for caching as a compromised renderer cannot look at the compilation artefacts until it receives them.

If we fully compile and create a code cache from the compilation result we don't need a new API on the V8 side, but do additional serialization/deserialization work. That should be faster than reparsing though. The upper limit of the cost would essentially be the cost of serializing / deserializing a code cache for each script.
On Tue, Aug 17, 2021 at 6:59 AM Toon Verwaest <verw...@chromium.org> wrote:
Thinking out loud: One idea could be to have a separate sandboxed compiler process in which we compile incoming JS code. That could reject the source if it doesn't compile; or compile it to a script that just throws with no additional info about the actual source.
That process could implement streaming compilation; so we don't block streaming until later, we don't double parse, we still have a sandbox (not in the network process). There might even be benefits for caching as a compromised renderer cannot look at the compilation artefacts until it receives them.
If we fully compile and create a code cache from the compilation result we don't need a new API on the V8 side, but do additional serialization/deserialization work. That should be faster than reparsing though. The upper limit of the cost would essentially be the cost of serializing / deserializing a code cache for each script.

This seems like an interesting idea. I wonder if compilation (no evaluation / running of scripts) would be considered safe enough to handle in a single (not origin/site-bound/locked) process.
One thing that I don't fully understand (For both full-JS-parsing and partial/hackish-non-JS-detection approaches) is if the encoding (e.g. UTF8 vs UTF16-LE vs Win-1250) has to be known and communicated upfront to the parser/sniffer? Or maybe the input to the decoder needs to be already in UTF8? Or maybe something in //net or //network layers can already handle this aspect of the problem (e.g. ensuring UTF8 in URLLoader::DidRead)?
Also - when trying to explore the partial/hackish-non-JS-detection idea, I wondered if the very first character in a script may only come from a relatively limited set of characters? Let's assume that the sniffer can skip whitespace (space, tab, CR, LF, LS, PS) and html/xml comments (e.g. <!-- ... -->) - AFAICT the very next character has to be either:
- The start of a reserved keyword like "if", "let", etc. (all lowercase ASCII)
- The start of an identifier (any Unicode code point with the Unicode property “ID_Start”)
- The start of a unary expression: + - ~ !
- The start of a string literal, string template, or a regexp literal (or non-HTML comment): " ' ` /
- The start of a numeric literal: 0-9
- An opening paren, bracket or brace: ( [ {
- Not quite sure if a dot or an equal sign can appear as the very first character: . =
This would reject PDFs (starts with %) and HTML/XML (starts with <), but still would accept ZIP files (first character is a 0x50 - capital P) and MSOffice files (first character is a 0xD0 which according to Unicode has ID_Start property set to true). Rejecting ZIP and MSOffice files would require going beyond the first character - maybe rejecting control characters like 0x11 or 0x03 outside of comments (not sure if at this point the sniffer's heuristics are starting to get too complex).
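The first-character idea above can be sketched as follows. This is a minimal illustration, not a proposal for the actual sniffer: the names `skip_trivia` and `first_char_may_start_js` are hypothetical, the input is assumed to be already decoded text (the encoding question raised above is left open), and Python's `str.isidentifier()` is used as an approximation of the Unicode ID_Start property (it is based on XID_Start, which differs slightly).

```python
# Whitespace the message above proposes to skip: space, tab, CR, LF, LS, PS.
WHITESPACE = " \t\r\n\u2028\u2029"

# Unary operators, string/template/regexp/comment starts, opening brackets.
# A leading dot is included because `.5` is a numeric literal; `=` is left
# out, which is an assumption of this sketch.
ALLOWED_PUNCTUATION = set("+-~!\"'`/([{.")


def skip_trivia(text: str) -> int:
    """Return the index of the first character after whitespace and <!-- ... --> comments."""
    i = 0
    while i < len(text):
        if text[i] in WHITESPACE:
            i += 1
        elif text.startswith("<!--", i):
            end = text.find("-->", i + 4)
            if end == -1:
                return len(text)  # unterminated comment: nothing left to inspect
            i = end + 3
        else:
            break
    return i


def first_char_may_start_js(text: str) -> bool:
    """Return True if the first significant character is on the list above."""
    i = skip_trivia(text)
    if i == len(text):
        return True  # only whitespace/comments so far: no evidence against JS
    ch = text[i]
    if ch in ALLOWED_PUNCTUATION or ch in "0123456789":
        return True
    # Keywords and identifiers; `$` and `_` are valid JS identifier starts.
    return ch in "$_" or ch.isidentifier()
```

As the message notes, this rejects `%PDF-` and `<html>` but still accepts a ZIP header (it starts with `P`) and a leading U+00D0. The sketch follows the list above as-is, so a script starting with a character that is not on it (for example a leading `;`) would be rejected.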
I want to note one thing here, kind of a side observation really: while(1); is valid JS, it's just an infinite loop. Do we also want to guard against common patterns like this?
As I understand it, the intention here is that false-positives for "is JS" are acceptable, and that it's up to the victim site to avoid prefixes that might be JS, but aren't. With that, what's the benefit of a full JS parse over a list of known non-JS prefixes like the one we already have?
Benefit of full JS parse over a list of known non-JS prefixes: Stricter is-it-JS checking = more non-JS things get blocked = improved security. Still, there is a balance here - some heuristics (like the ones proposed by Daniel) are almost as secure as full JS parse (while being easier to implement and having less of a performance impact).
- Leszek
I want to note one thing here, kind of a side observation really: while(1); is valid JS, it's just an infinite loop. Do we also want to guard against common patterns like this?
Can we not detect these via some magic number sniffing? I'm fundamentally concerned about an allowlist approach for JS over a blocklist approach for non-JS.
This is pretty much the heart of the issue: The entire thing of CORB to ORB transition is to go from "blocklist" to "allowlist", based on the observation that block lists ultimately never seem to work. In particular, we don't want to pass things by default, where anything we don't know automatically passes. That does lead us to an allowlist, in some form. Elsewhere, I summarized (my understanding of) the ORB security requirements as this: For "no-cors" requests, we want to have some positive evidence that the data we're receiving is in a format suitable for the request type.
Being able to drop unknown stuff by default is really the core benefit of ORB.
I do think we have quite a bit of leeway to decide what form of "positive evidence" we'll accept. The current draft specifies a full JS parse, which I think is way over the top. But I do think we need something that tells us with some probability whether a given byte sequence looks like JS or not. The only hard criterion is that actually valid JS should pass, because otherwise we'll break websites left and right. (To that end, "while (1);" was arguably a terrible example.) (Caveat: Those are my opinions. Other browsers might have stronger opinions.)
IMHO, checking for "parser breakers", the way CORB does, is a convenient temporary solution, because we already know it's web compatible.
IMHO, a full parse (in the network process, or triggered by the network process) is crazy, and I'd really like to have something more lightweight.
Which leads me to the proposal to only use the scanner to look for a few tokens. And ideally for TC39 to adopt some sort of SmellsLikeJavaScript abstract operation that other standards could point to.
Ok, if allowlisting vs blocklisting is the heart of the issue, I can accept that this is a design requirement.

So, re: parse vs. scan -- I'm not sure this is a sufficient simplification. In particular, if memory serves, our parse cost is roughly 50% scanner and 50% token interpretation + AST building, so you'll get at best a ~2x speedup over a full parse (or over a pre-parse? I don't remember the exact breakdown). In particular, there's a cost to identifying keywords vs identifiers, but we could probably drop that and ignore keywords. Parsing strings and regexps has some cost, but you could maybe make them cheaper with stronger approximations (race to closing quotes, that sort of thing). Then, I wouldn't check if the token combination is a definitely valid one, just whether the tokenizer failed at all + some simple token-based heuristics (like brace matching, simple patterns). Tokenizer failure would most likely catch almost all binary formats; non-binary formats are likely too JS-compatible (some raw JSON, a lot of YAML, and I think all CSV, is valid JS) and would still need to rely on a more blocklist-style approach with said token heuristics.
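To make the "race to closing quotes" approximation and the "tokenizer failure catches binary formats" point a bit more concrete, here is a very rough Python sketch. It is not V8's scanner and does not reflect its API; the name `scan_fails` is hypothetical. It skips string-like literals and comments without interpreting them, and treats a raw control character anywhere else as a tokenizer failure. Regexp literals and nested template expressions are not handled properly, so it can get out of sync on such input.

```python
# Control characters that are legal JS whitespace / line terminators.
ALLOWED_CONTROLS = "\t\n\r\x0b\x0c"


def scan_fails(text: str) -> bool:
    """Return True when the rough scan hits a control character outside strings and comments."""
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch in "\"'`":
            # Race to the closing quote; a backslash skips the escaped character.
            i += 1
            while i < n and text[i] != ch:
                i += 2 if text[i] == "\\" else 1
            i += 1  # step past the closing quote (or past the end if truncated)
        elif text.startswith("//", i):
            end = text.find("\n", i)
            i = n if end == -1 else end + 1
        elif text.startswith("/*", i):
            end = text.find("*/", i + 2)
            i = n if end == -1 else end + 2
        elif ord(ch) < 0x20 and ch not in ALLOWED_CONTROLS:
            return True  # e.g. 0x03 or 0x11 from a binary file header
        else:
            i += 1
    return False
```

On a ZIP header such as `"PK\x03\x04"` the scan fails at the 0x03 byte, while the same byte inside a string literal or a comment is skipped. Text formats such as JSON, YAML or CSV pass this check, which matches the point above that they would still need separate heuristics.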