CDP - how to determine that a page is fully loaded before requesting Chrome to render PDF?

ziggy

Mar 2, 2021, 8:51:34 PM
to headless-dev, chromi...@cromium.org

I am re-posting to the headless-dev group. I posted to the Chromium-dev group but it didn't show up.


This post is related to the posts linked below.

https://groups.google.com/a/chromium.org/g/chromium-dev/c/LXZQz6UpVZI

https://groups.google.com/a/chromium.org/g/headless-dev/c/KW0UwYcrEb0

https://groups.google.com/a/chromium.org/g/headless-dev/c/rxoUQ4S9jpA

My goal is to be able to customize the header and footer when rendering to PDF. Right now I am experimenting with a C++ solution that connects to the remote debugging port and implements rendering to PDF.

I am looking to port to C++ the function below, implemented by the https://github.com/Szpadel/chrome-headless-render-pdf tool, which uses the remote debugging interface.

async renderPdf(url, options) {
    const client = await CDP({host: this.host, port: this.port});
    this.log(`Opening ${url}`);
    const {Page, Emulation, LayerTree} = client;
    await Page.enable();
    await LayerTree.enable();

    // Subscribe before navigating so that neither event can be missed.
    const loaded = this.cbToPromise(Page.loadEventFired);
    const jsDone = this.cbToPromise(Emulation.virtualTimeBudgetExpired);

    await Page.navigate({url});
    await Emulation.setVirtualTimePolicy({policy: 'pauseIfNetworkFetchesPending', budget: 5000});

    await this.profileScope('Wait for load', async () => {
        await loaded;
    });

    await this.profileScope('Wait for js execution', async () => {
        await jsDone;
    });

    // Resolve after 100 ms without a layerPainted event, or after 5 s at most.
    await this.profileScope('Wait for animations', async () => {
        await new Promise((resolve) => {
            setTimeout(resolve, 5000); // max waiting time
            let timeout = setTimeout(resolve, 100);
            LayerTree.layerPainted(() => {
                clearTimeout(timeout);
                timeout = setTimeout(resolve, 100);
            });
        });
    });

    // printToPDF returns the document as a base64 string.
    const pdf = await Page.printToPDF(options);
    const buff = Buffer.from(pdf.data, 'base64');
    client.close();
    return buff;
}

To render a page to PDF, the solution must navigate to the page, determine that the page content has loaded, and then request Chrome to render the page to PDF.

I am looking for official state machine diagrams or interaction diagrams that I can follow to reliably determine the "page fully loaded" event. Can anybody point me to where I can find such information?

In the meantime I used the https://github.com/Szpadel/chrome-headless-render-pdf tool to capture traffic between the client and Chrome using Wireshark and https://github.com/wendigo/chrome-protocol-proxy . See the attached files with the message exchanges. You can "cat connection*" the files in a Git Bash or PowerShell window to see the content in color; see the example Capture.PNG file.

The above function seems to monitor the following three events to determine when the page has finished loading.

        const loaded = this.cbToPromise(Page.loadEventFired);
        const jsDone = this.cbToPromise(Emulation.virtualTimeBudgetExpired);
        LayerTree.layerPainted(() => { ....

The Page.loadEventFired event is fairly obvious.

Emulation.virtualTimeBudgetExpired - I don't fully understand the context.

LayerTree.layerPainted - it seems to be an internal event based on the LayerTree.layerTreeDidChange and LayerTree.layerPainted events from the Chrome browser.

I am looking for comments on the above and/or directions on how to reliably determine the "page fully loaded" event. Chrome doesn't seem to generate such an event, and it appears to be the client's responsibility.
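The quoted function calls this.cbToPromise(...) without showing its body. A minimal sketch of what such a helper might look like follows; the implementation is an assumption, not the tool's actual code. It only relies on the event method accepting a listener callback, as Page.loadEventFired does in the quoted code.

```javascript
// Hypothetical helper (assumed, not copied from the tool): wrap a
// callback-style event subscription in a promise that resolves with the
// payload of the first event delivered to the listener.
function cbToPromise(subscribe) {
    return new Promise((resolve) => {
        // The executor runs synchronously, so the listener is registered
        // as soon as cbToPromise() is called.
        subscribe((params) => resolve(params));
    });
}
```

Because the listener is registered immediately, creating the promise before Page.navigate (as the quoted function does) means the event cannot slip by between navigation and the await.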

connection-file.log
connection-bbcnews.log
connection-google.log
wireshark.txt
Capture.PNG

Isaac Dawson

Mar 2, 2021, 9:00:36 PM
to ziggy, headless-dev, chromi...@cromium.org
Ah yes, the mythical 'page is done loading event'. This does not exist with today's SPAs, I'm afraid. The best solution I've come up with is:
- Listen to all network traffic.
- Wait for Page.loadEventFired.
- Wait for the # of open request counts to be zero then...
- Start listening to DOM events (add/remove children/attributes) and set a limit on how long you want to wait for the page to 'stabilize'.
- Pray it's actually done loading and do your instrumentation/screenshots.

-Isaac
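The "open request count" step in the list above could be sketched roughly as follows. This is an illustration under assumptions rather than a tested recipe: createRequestTracker is a made-up name, and the handlers are meant to be attached to the CDP events Network.requestWillBeSent, Network.loadingFinished and Network.loadingFailed after Network.enable().

```javascript
// Hypothetical tracker for in-flight requests, fed by CDP Network events.
function createRequestTracker() {
    const inflight = new Set();   // requestIds that have started but not ended
    let waiters = [];             // resolvers waiting for the set to empty

    const settle = (requestId) => {
        inflight.delete(requestId);
        if (inflight.size === 0) {
            waiters.forEach((resolve) => resolve());
            waiters = [];
        }
    };

    return {
        // A redirect re-announces the same requestId; the Set absorbs that.
        onRequestWillBeSent({requestId}) { inflight.add(requestId); },
        onLoadingFinished({requestId}) { settle(requestId); },
        onLoadingFailed({requestId}) { settle(requestId); },
        pending() { return inflight.size; },
        // Resolves immediately if nothing is in flight, else when the count hits zero.
        whenIdle() {
            if (inflight.size === 0) return Promise.resolve();
            return new Promise((resolve) => waiters.push(resolve));
        },
    };
}

// Possible wiring (assumes a connected client exposing the Network domain):
//   const tracker = createRequestTracker();
//   Network.requestWillBeSent(tracker.onRequestWillBeSent);
//   Network.loadingFinished(tracker.onLoadingFinished);
//   Network.loadingFailed(tracker.onLoadingFailed);
//   await Network.enable();
```

A count of zero only says that nothing is in flight at that instant; a script may start a new request a moment later, which is why the list follows this step with a bounded wait for the page to stabilize.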



ziggy

Mar 2, 2021, 9:08:59 PM
to headless-dev, isaac....@gmail.com, headless-dev, chromi...@cromium.org, ziggy
Thanks, Isaac. Sounds like a perfect job for the browser: insulating clients from such complexities and breaking changes.

Can you clarify "Wait for the # of open request counts to be zero then..."? I am planning to request a single URL at a time, at least for now.

-Zbigniew

Isaac Dawson

Mar 2, 2021, 9:12:56 PM
to ziggy, headless-dev, chromi...@cromium.org
No problem! So, *you* might be waiting for a single url, but all those third party resources being loaded by the page are coming in asynchronously. The DOMContentLoaded event may fire, but it could still be loading a 3mb JS file that actually provides the user with a usable interface. Unless you are talking about loading a single static HTML file?
-Isaac

ziggy

Mar 2, 2021, 9:25:33 PM
to headless-dev, isaac....@gmail.com, headless-dev, chromi...@cromium.org, ziggy
Yes, this capability will be used by the Mbox Mail Viewer tool to render mostly static HTML files. Regular mail from users is fairly static text, but mail from businesses or organizations is not.
-zbigniew

ziggy

Mar 2, 2021, 9:36:40 PM
to headless-dev, ziggy, isaac....@gmail.com, headless-dev, chromi...@cromium.org
Currently Mbox Mail Viewer execs Chrome or MSEdge to render mails to PDF using the command line options --headless, --print-to-pdf and --print-to-pdf-no-header. It is up to the browser to determine when page loading is done.

I developed a patch to enhance --print-to-pdf to allow customizing the header and footer, but the patch is going nowhere. I also enhanced https://github.com/Szpadel/chrome-headless-render-pdf to support customization of the header and footer, but that requires users to install Node.js, which for most users is a showstopper. I was playing lately with --remote-debugging-pipe, but it doesn't seem to work. So my next best option that should work with Google Chrome is to use --remote-debugging-port :-).

ziggy

Mar 3, 2021, 7:24:03 PM
to headless-dev, ziggy, isaac....@gmail.com, headless-dev, chromi...@cromium.org
Hi Isaac,

OK, sounds like I am not the first to face the mythical 'page is done loading event'. I hope there is a comprehensive working example of how a client should implement such functionality; otherwise every user has to keep re-implementing the same feature. I understand that it may not be simple to detect the end of page loading for pages with dynamic content, and therefore a best-effort solution is needed. In my humble opinion, the solution should be documented once by the domain experts and re-used by many.

Alex Clarke

Mar 8, 2021, 5:26:45 AM
to ziggy, headless-dev, isaac....@gmail.com, chromi...@cromium.org
Page.loadEventFired isn't sufficient. Virtual time is supposed to be the solution to this; the idea is that virtual time does not advance while there are outstanding network requests. I wouldn't be surprised if there are corner cases where it doesn't work. I'd suggest trying a larger virtual time budget if the default isn't enough.
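As a rough illustration of the virtual time approach, the budget could be made a parameter so that a larger value is easy to try. The function name waitForVirtualTime is made up for this sketch; the Emulation.setVirtualTimePolicy call, the 'pauseIfNetworkFetchesPending' policy and the Emulation.virtualTimeBudgetExpired event are the ones used in the function quoted at the top of the thread.

```javascript
// Sketch only: grant a virtual time budget and wait until it is used up.
// `Emulation` is assumed to be the Emulation domain object of a connected
// CDP client, as destructured from `client` in the quoted renderPdf().
async function waitForVirtualTime(Emulation, budgetMs) {
    // Register the listener first so the expiry event cannot be missed.
    const expired = new Promise((resolve) => {
        Emulation.virtualTimeBudgetExpired(() => resolve());
    });
    await Emulation.setVirtualTimePolicy({
        policy: 'pauseIfNetworkFetchesPending', // clock pauses while fetches are pending
        budget: budgetMs,                       // virtual milliseconds to grant
    });
    await expired;
}

// Possible use, e.g. doubling the 5000 ms budget from the quoted code:
//   await waitForVirtualTime(Emulation, 10000);
```

Since the budget is counted in virtual milliseconds and the clock pauses during network fetches, the wall-clock time this takes can be shorter or longer than the number passed in.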

ziggy

Mar 8, 2021, 1:27:35 PM
to headless-dev, Alex Clarke, headless-dev, isaac....@gmail.com, chromi...@cromium.org, ziggy
Thanks, that helps. I am new to Chromium, JS and CDP, so it is challenging to figure out the best approach. I understand now and agree with you that Emulation.virtualTimeBudgetExpired is the sound, best-effort solution. The event is marked as Experimental; I hope it will not be deprecated.

The last step executed by the https://github.com/Szpadel/chrome-headless-render-pdf tool is to monitor LayerTree.layerPainted events and delay the printToPDF request until there have been no such events for at least 100 milliseconds; at least that is my understanding of the JS CDP code below.

I captured and analyzed the exchange of messages between the client and the browser for a number of URLs. After the Emulation.virtualTimeBudgetExpired event, I see mostly LayerTree.layerTreeDidChange events and few if any LayerTree.layerPainted events. Therefore, the printToPDF request is typically delayed by exactly 100 milliseconds.

        await this.profileScope('Wait for animations', async () => {
            await new Promise((resolve) => {
                setTimeout(resolve, 5000); // max waiting time
                let timeout = setTimeout(resolve, 100);
                LayerTree.layerPainted(() => {
                    clearTimeout(timeout);
                    timeout = setTimeout(resolve, 100);
                });
            });
        });

    async profileScope(msg, cb) {
        const start = process.hrtime();
        await cb();
        this.log(msg, `took ${Math.round(this.getPerfTime(start))}ms`);
    }
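The "no events for 100 ms, capped at 5 s" logic in the code above can be pulled out into a small standalone helper. The version below is a sketch with a made-up name (waitForQuiet), not the tool's code. Unlike the original, it cancels both timers once it resolves and reports which condition ended the wait.

```javascript
// Resolve once `subscribe`'s listener has been silent for `quietMs`,
// or after `maxMs` at the latest. Resolves with 'quiet' or 'timeout'.
function waitForQuiet(subscribe, quietMs, maxMs) {
    return new Promise((resolve) => {
        let done = false;
        let quietTimer;
        const finish = (reason) => {
            done = true;
            clearTimeout(quietTimer);
            clearTimeout(maxTimer);
            resolve(reason);
        };
        const maxTimer = setTimeout(() => finish('timeout'), maxMs);
        // (Re)start the quiet-period countdown; called once up front and
        // again on every event.
        const arm = () => {
            if (done) return; // ignore events that arrive after we resolved
            clearTimeout(quietTimer);
            quietTimer = setTimeout(() => finish('quiet'), quietMs);
        };
        subscribe(arm);
        arm();
    });
}

// Equivalent of the quoted 'Wait for animations' step (assumes LayerTree
// from a connected client, with LayerTree.enable() already called):
//   await waitForQuiet(LayerTree.layerPainted, 100, 5000);
```

With no layerPainted events at all, this resolves after the first 100 ms countdown, which matches the fixed 100 ms delay observed in the captures.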