Early discussion for a proposal: A highly performant (to parse) binary format for HTML/CSS

Apoorv Singal

Jul 11, 2024, 1:28:16 PM
to Chromium-dev
Hey, I had a simple thought: a binary format for HTML/CSS that could be much faster for browsers to parse, while keeping everything else untouched.

As of now, the process of receiving and parsing HTML in the browser looks something like this:
1. backend compresses raw html
2. backend sends compressed html over https
3. browser decompresses it to get raw html
4. browser parses raw html

Imagine a very simple binary format, where all standard HTML tags and their properties have an id assigned to them, and the syntax looks maybe like this:

// defining a tag <tag ...>...</tag>
[32 bits] length of body (current index in file + this = end of current tag)
[1 bit]   0 = standard html tag, 1 = custom tag
[16 bits] tag id, or a null-terminated custom tag string (depending on the flag)
[1 bit]   0 = standard prop, 1 = custom prop
[?]       prop id, or a null-terminated custom prop name (depending on the flag)
[?]       null-terminated prop value

and children are simply a prop too.
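
To make the layout concrete, here's a rough sketch of an encoder for a single tag (names like `STANDARD_TAG_IDS` and the byte-level choices are just illustrative; a real encoder would pack the flag bits properly rather than spend a whole byte on them):

```js
// Illustrative sketch only: serialises one tag into the layout described above,
// using whole bytes for the flag bits to keep the example simple.
const STANDARD_TAG_IDS = new Map([['html', 1], ['div', 2], ['span', 3], ['p', 4]]);

function encodeTag(tagName, props, bodyBytes) {
  const parts = [];
  const isCustom = !STANDARD_TAG_IDS.has(tagName);

  parts.push(Buffer.from([isCustom ? 1 : 0]));        // standard vs custom tag flag

  if (isCustom) {
    parts.push(Buffer.from(tagName + '\0'));          // null-terminated custom tag name
  } else {
    const id = Buffer.alloc(2);
    id.writeUInt16LE(STANDARD_TAG_IDS.get(tagName));  // 16-bit standard tag id
    parts.push(id);
  }

  for (const [name, value] of Object.entries(props)) {
    parts.push(Buffer.from([1]));                     // custom prop flag (sketch only)
    parts.push(Buffer.from(name + '\0'));             // null-terminated prop name
    parts.push(Buffer.from(value + '\0'));            // null-terminated prop value
  }

  parts.push(bodyBytes);                              // children (treated as just another prop above; appended here for simplicity)

  const body = Buffer.concat(parts);
  const length = Buffer.alloc(4);
  length.writeUInt32LE(body.length);                  // 32-bit body length prefix
  return Buffer.concat([length, body]);
}
```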

There's no loss of information going from html to bhtml; a normal html file can be reconstructed from it exactly as it was, meaning more or less nothing in the browser would need to change other than the parsing step to add the new format.

bhtml would have a significantly smaller size than standard html, but the thing that would actually make it pretty amazing is the reduction in document parsing time in browsers. The following flow (steps 4 and 5 specifically, maybe 3 too) would be way more performant on the browser's end:
1. backend converts raw html to bhtml
2. backend compresses bhtml
3. backend sends compressed bhtml over https
4. browser decompresses it to get bhtml
5. browser parses bhtml

The standard could look something like this: whenever requesting a document, the browser sends an extra `text/bhtml` in the `Accept` header to indicate support for bhtml, and the server can react by sending bhtml for that request.
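
As a rough sketch of that negotiation (`renderHtml` and `toBhtml` are hypothetical helpers here, and `text/bhtml` isn't a registered media type):

```js
// Minimal sketch of server-side negotiation on the Accept header.
const http = require('http');

http.createServer((req, res) => {
  const wantsBhtml = (req.headers['accept'] || '').includes('text/bhtml');
  const html = renderHtml(req);               // hypothetical: produce the page's HTML

  if (wantsBhtml) {
    res.setHeader('Content-Type', 'text/bhtml');
    res.end(toBhtml(html));                   // hypothetical: HTML -> bhtml conversion
  } else {
    res.setHeader('Content-Type', 'text/html');
    res.end(html);                            // fall back to plain HTML
  }
}).listen(8080);
```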

First post here, would appreciate any feedback haha.
Thanks!

Reilly Grant

Jul 11, 2024, 1:32:45 PM
to ad...@libkakashi.dev, Chromium-dev
This is essentially what we get from the Brotli compression algorithm, as the static dictionary has been designed to compress HTML/CSS well.
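
For a rough sense of what that looks like in practice, here's a minimal sketch using Node's built-in Brotli bindings (the sample markup is made up):

```js
// Compress a small HTML snippet with Brotli and compare sizes.
const zlib = require('zlib');

const html = '<!DOCTYPE html><html><head><title>Example</title></head>' +
             '<body><div class="content"><p>Hello</p></div></body></html>';

const compressed = zlib.brotliCompressSync(Buffer.from(html));
console.log(`raw: ${Buffer.byteLength(html)} bytes, brotli: ${compressed.length} bytes`);
```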
Reilly Grant | Software Engineer | rei...@chromium.org | Google Chrome

Apoorv Singal

Jul 11, 2024, 7:07:02 PM
to Chromium-dev, Reilly Grant, Chromium-dev, ad...@libkakashi.dev
Yea, all LZ-based (or pretty much all) compression algorithms build dictionaries of repeated sets of bytes, but that's converted back to its original form and then parsed, right? So the browser still has a parser for standard html that works post-decompression? Imagine, loosely speaking, that the compressed version itself can be parsed directly, and is way more efficient to parse than standard html.

Peter Kasting

Jul 11, 2024, 8:11:57 PM
to ad...@libkakashi.dev, Chromium-dev, Reilly Grant
On Thu, Jul 11, 2024 at 4:05 PM Apoorv Singal <ad...@libkakashi.dev> wrote:
> that's converted back to its original form and then parsed, right? So the browser still has a parser for standard html that works post-decompression? Imagine, loosely speaking, that the compressed version itself can be parsed directly, and is way more efficient to parse than standard html.

Citation needed re: "way more efficient". While that might seem intuitively true, it is not clear to me that it must be so in practice.

PK

Julian Pastarmov

Jul 12, 2024, 5:11:38 AM
to Chromium-dev, Peter Kasting, Chromium-dev, Reilly Grant, ad...@libkakashi.dev
That's what most BASIC implementations did back in the 8-bit era. However, the main motivation back then was not so much to improve the performance of interpreting the language, but rather to store the code more efficiently in times when 64 KB was a luxury. In a similar sense, compressing for wire transmission still makes sense these days. I doubt, however, that for a modern RAM/CPU, having to compare 128 or 256 bits (which is roughly the length of most tokens) makes much difference compared to comparing 32 bits. All the rest of dealing with HTML/CSS, like layout computation and event handling, is going to eat most of the time; the benefit of small tokens will be dwarfed completely, while the downside of lost readability of the raw data will be real.

Apoorv Singal

Jul 12, 2024, 9:31:29 PM
to Chromium-dev, Julian Pastarmov, Peter Kasting, Chromium-dev, Reilly Grant, ad...@libkakashi.dev
> Citation needed re: "way more efficient". While that might seem intuitively true, it is not clear to me that it must be so in practice.
True, I'm not sure what the actual numbers would look like, but imagine these two pseudocode snippets:
```js
// Plain-text tag: scan character by character until whitespace to find the tag name.
// (`stream` and WHITESPACE stand in for the parser's input handling.)
let tag = '';

for (let char = stream.readChar(); char !== WHITESPACE; char = stream.readChar()) {
  tag += char;
}
// parse props
// ...
document.createElement(tag);
```

```js
// bhtml tag: one flag, then either a table lookup or the old character scan.
let tag = '';
let isCustomTag = stream.readBool();

if (!isCustomTag) { // easily more than 95% hit rate (humble guess)
  tag = standardHTMLTags[stream.readUint16()]; // 16-bit tag id, per the format above
} else {
  // similar for loop as earlier
}
```
Optimisations along these lines, but in tens more situations like this, for both HTML and CSS. Also, a lot of nuances, like how every text file is a valid html file and the parser just has to figure out the best structure while never throwing an error, would either not need to be handled or could be handled in a website's build process.
I'd have to write an actual implementation to get proper numbers though, maybe over some weekend, if after this discussion it actually seems like an optimisation worth the effort of implementation/adoption.


> I doubt, however, that for a modern RAM/CPU, having to compare 128 or 256 bits (which is roughly the length of most tokens) makes much difference compared to comparing 32 bits.
Comparing more bits isn't the issue; the issue is that those strings of bits are of variable length. If every HTML tag were 256 bits, going from 256 to 32 wouldn't make much difference, but having to find the end of a string has a lot of overhead.
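Loosely, it's the difference between these two reads (a sketch; `bytes` is a Uint8Array over the input and `view` a DataView over the same buffer):

```js
// Variable-length token: scan byte by byte until a delimiter, then build a string.
function readTagName(bytes, offset) {
  let end = offset;
  while (bytes[end] !== 0x20 && bytes[end] !== 0x3e) end++; // ' ' or '>'
  return { name: String.fromCharCode(...bytes.subarray(offset, end)), next: end };
}

// Fixed-width token: a single 16-bit read, no scanning and no string building.
function readTagId(view, offset) {
  return { id: view.getUint16(offset, true), next: offset + 2 };
}
```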


> All the rest of dealing with HTML/CSS, like layout computation and event handling, is going to eat most of the time; the benefit of small tokens will be dwarfed completely, while the downside of lost readability of the raw data will be real.
Yea, tbh that could be the case. There won't be any loss of readability other than custom indentation details and comments, but the performance improvement would probably only be around 1-2% of the loading time. I just ran a quick profiler on this page for the first 2 seconds of loading, and parsing HTML took 71.9 ms. I'm guessing it could go down to, let's say, around 20 ms. That's more or less all it would do.
