Printing to PDF in headless mode can take a very long time.

ziggy

May 11, 2023, 9:03:03 PM
to headless-dev
Updated my initial attempt to post after upgrading Chrome Browser to the latest version.

I am the developer of the free MBox Mail Viewer. MBox Viewer relies on Chrome to print/export mails to PDF. It creates an html file from the mail content and runs Chrome to convert that html file to PDF using the --print-to-pdf headless option.

Printing some mails in the old headless mode takes an extremely long time, for example 10 minutes. This appears to be due to invalid links, and possibly to links asking for user input.

I just learned about the new headless mode in Chrome, and it seems to help with extreme cases such as the attached html file. Instead of 10 minutes, Chrome running in the new headless mode completes navigation and generates the PDF in 3-6 seconds. That is a great improvement. However, in both the old and the new headless mode, it still takes a fairly long time (20-40 seconds) to print many business-type mails to PDF. When I open the same html documents directly in Chrome, the key data appears to load within a second or two. That suggests the print time could be cut significantly.

I tried the --timeout and --virtual-time-budget options to force cancellation of all navigation and print the document, but neither worked for me. The --timeout option can extend the print time, but it doesn't seem to stop navigation to reduce it. The --virtual-time-budget option doesn't seem to have much impact on the print time.

My understanding of --timeout and --virtual-time-budget options is as follows:

--timeout - Hard limit. Issues a stop after the specified number of milliseconds. This cancels all navigation and causes the DOMContentLoaded event to fire. Print time should be exactly as specified by --timeout, in milliseconds. It appears that my understanding is not correct: --timeout seems to be ignored while document loading has not completed.

--virtual-time-budget - Soft limit. If the document is fully loaded before the virtual time budget expires, the document is printed immediately.

The html file can be opened in Chrome and the user can print it to PDF before navigation has completed. I was not able to do the same in headless mode.

I am looking for advice on how to reduce the print-to-PDF time caused by expired ??? links in headless mode. When the links are valid, the print time seems reasonable, up to a few seconds.

I attached the batch file and an example html file exhibiting the issue. Below are the results of printing in the old headless mode with various command line arguments.

Also, I have a question about the best way to detect support for the new headless mode programmatically.

Do I need to read the HKEY_CURRENT_USER\SOFTWARE\Chromium\BLBeacon entry in the registry to determine the Chrome version?
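
One possible approach to the version question, as a minimal sketch: take whatever version string is available (the BLBeacon registry value, or the versioned directory name under the Application folder) and compare its major number against a threshold. The helper names `parse_chrome_version` and `supports_new_headless` are hypothetical, and the threshold of 112 is an assumption based on the release in which --headless=new was announced; it should be checked against the Chrome builds actually targeted.

```python
import re

# Assumption: --headless=new is treated as available from this major version on.
# Adjust the threshold to whatever release you have verified yourself.
ASSUMED_MIN_MAJOR_FOR_NEW_HEADLESS = 112


def parse_chrome_version(text):
    """Extract a 4-part Chrome version such as 113.0.5672.64 from a string.

    Returns a tuple of ints, or None when no version is found.
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def supports_new_headless(version_text, min_major=ASSUMED_MIN_MAJOR_FOR_NEW_HEADLESS):
    """Return True when the major version is at least min_major."""
    version = parse_chrome_version(version_text)
    return version is not None and version[0] >= min_major
```

Because the parser only looks for the dotted four-part number, the same function accepts either the raw registry value or a full installation path that ends in the version directory.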

Thank you,

P.S. I plan to attach the files once my post is accepted. My initial post with the attachments was deleted immediately, with no reason given. This may be related to the attachments.

My Chrome version details:

C:\Program Files (x86)\Google\Chrome\Application\113.0.5672.64

Chrome is up to date
Version 113.0.5672.64 (Official Build) (64-bit)

HKEY_CURRENT_USER\SOFTWARE\Chromium\BLBeacon  113.0.5672.64


###############################
Headless print to PDF

startTime=16:42:05.72

C:\Temp\PrintToPDF>"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" --headless --disable-gpu --print-to-pdf-no-header --print-to-pdf="C:\temp\PrintToPDF\PTT89594.htm.pdf" "C:\temp\PrintToPDF\PTT89594.htm"

endTime=16:52:39.19

Total print-to-pdf time 634 seconds

####################################
Headless print to PDF

startTime=16:21:03.41

C:\Temp\PrintToPDF>"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" --headless --timeout=10000 --disable-gpu --print-to-pdf-no-header --print-to-pdf="C:\temp\PrintToPDF\PTT89594.htm.pdf" "C:\temp\PrintToPDF\PTT89594.htm"

endTime=16:31:44.61

"Total print-to-pdf time 641 seconds"

#######################################
Headless print to PDF

startTime=17:05:06.13

C:\Temp\PrintToPDF>"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" --headless --virtual-time-budget=10000 --disable-gpu --print-to-pdf-no-header --print-to-pdf="C:\temp\PrintToPDF\PTT89594.htm.pdf" "C:\temp\PrintToPDF\PTT89594.htm"

endTime=17:15:39.62

Total print-to-pdf time 633 seconds

##########################################





Andrey Kosyakov

May 11, 2023, 9:29:43 PM
to ziggy, headless-dev
Hi Zbigniew,

My understanding is that the slowdown in the scenario you describe likely comes from the page loading phase rather than from actual printing. For generic / large scale PDF printing in production, we would recommend using Puppeteer or another browser automation tool to drive headless Chrome through loading and printing the page, rather than relying on the built-in --print-to-pdf command. This gives you much more flexibility in how to load the page and in when and how to apply timeouts, and even lets you intercept and cancel network requests. If you only care about local/inline page content, you can speed up loading by using network emulation to simulate the browser being offline.
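
As a rough illustration of the offline idea, the sketch below only builds the sequence of DevTools Protocol (CDP) messages that an automation client could send; it does not open a connection to Chrome. The function name `offline_print_commands` is hypothetical. The method names (Network.emulateNetworkConditions, Page.navigate, Page.printToPDF) are taken from the protocol as I understand it at the time of this thread, so the current protocol reference should be checked before relying on them.

```python
from itertools import count


def offline_print_commands(page_url):
    """Return CDP commands, in sending order, for an offline load and a print.

    A real client would wait for the Page.loadEventFired event (or its own
    deadline) after Page.navigate before sending Page.printToPDF.
    """
    ids = count(1)  # CDP replies are matched to commands by this id

    def command(method, **params):
        return {"id": next(ids), "method": method, "params": params}

    return [
        command("Network.enable"),
        # offline=True simulates the browser having no network connection.
        command(
            "Network.emulateNetworkConditions",
            offline=True,
            latency=0,
            downloadThroughput=-1,
            uploadThroughput=-1,
        ),
        command("Page.enable"),
        command("Page.navigate", url=page_url),
        command("Page.printToPDF", displayHeaderFooter=False),
    ]
```

Each dictionary would be serialized with `json.dumps` and sent as one message; Puppeteer and similar tools wrap this same exchange behind a higher-level API.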

The built-in --print-to-pdf implementation is fairly simple: it waits either for the page 'load' event to be dispatched (which may indeed be deferred by pending subresource requests) or for the time specified in --timeout, which is given in milliseconds (and probably isn't the best approach when printing arbitrary pages). You can check whether the time before the load event is the issue by opening the page interactively and using the DevTools Network or Performance panel (the "load" event is shown as a red line marker).

As for --virtual-time-budget, it's not quite what you expect it to be: it enables virtual time mode, which essentially "squashes" all timers set by setTimeout()/setInterval() in the page so that they fire immediately. It may help when page JS is slow because of setTimeout(), but if you're bound by CPU or network when loading the page, it won't have much effect. It is also highly experimental and won't quite work in --headless=new if out-of-process iframes are involved.

Your attachment did not come through unfortunately. Could you please file a bug and attach the page in question there?

Best regards,
Andrey.

ziggy

May 11, 2023, 10:32:27 PM
to headless-dev, ca...@chromium.org, headless-dev, ziggy
Hi Andrey,

I replied with 2 attachments, PTT89594.htm.txt and print2pdf.cmd.txt, but it looks like my post eventually disappeared.

Are such files accepted as attachments? I hope html files can be attached to the bug report.

Below is my lost post.

I agree that the issue is related to loading the file and not to the actual printing. I agree that Puppeteer should be used for large scale printing. However, MBox Viewer is a simple tool that lets users view mails and print them to PDF. Most users will likely print only a few mails, for example all mails in a particular conversation; some users may print more, but that is likely not the typical case. Leveraging Puppeteer introduces complexity for typical users.

An interesting fact is that in the new headless mode it takes a few seconds to complete the print, versus 10 minutes in the old mode. Something has changed that could possibly be applied to the old mode.

I will also try to file a bug report/potential enhancement request.

Respectfully,
Zbigniew

Andrey Kosyakov

May 12, 2023, 2:12:28 PM
to ziggy, headless-dev
Hi Zbigniew,

On Thu, May 11, 2023 at 7:32 PM ziggy <zbigniew...@gmail.com> wrote:
> I replied with 2 attachments, PTT89594.htm.txt and print2pdf.cmd.txt, but it looks like my post eventually disappeared.
> Are such files accepted as attachments? I hope html files can be attached to the bug report.

Thanks for the repro; it worked for me because you sent a copy directly to me. I think Groups may filter out attachments. We really prefer these as attachments to issues filed at crbug.com/new.

> An interesting fact is that in the new headless mode it takes a few seconds to complete the print, versus 10 minutes in the old mode. Something has changed that could possibly be applied to the old mode.
>
> I will also try to file a bug report/potential enhancement request.

Fixed it for you :-)

./out/Release/chrome --headless --print-to-pdf=q.pdf --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36" ~/samples/PTT89594.html 
 
So the difference WRT --headless=new happens to be just the User-Agent string, which matches the desktop version in --headless=new. Some images in the document you provided are loaded from www.swansonvitamins.com, which appears to discriminate between clients based on the User-Agent string.
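
A minimal sketch of how a launcher could apply the same workaround when building the command line. It assumes that the old headless User-Agent differs from the desktop one only by a "HeadlessChrome/" product token in place of "Chrome/"; the helper names `desktop_user_agent` and `print_to_pdf_args` are hypothetical.

```python
def desktop_user_agent(headless_user_agent):
    """Replace the HeadlessChrome product token with the desktop Chrome token."""
    return headless_user_agent.replace("HeadlessChrome/", "Chrome/")


def print_to_pdf_args(chrome_path, html_path, pdf_path, user_agent):
    """Build the argument list for an old-headless print with an explicit User-Agent.

    Returning a list (rather than one command string) lets the process API
    handle quoting of the spaces inside the User-Agent value.
    """
    return [
        chrome_path,
        "--headless",
        "--print-to-pdf=" + pdf_path,
        "--user-agent=" + user_agent,
        html_path,
    ]
```

The resulting list can be passed directly to a process-spawning call; whether a given site then serves its images promptly still depends on that site's own behaviour.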

That said, it looks like --timeout should help in this case, but it actually doesn't. We'll look into that, but please file the issue!

Best regards,
Andrey.

ziggy

May 12, 2023, 3:53:51 PM
to headless-dev, ca...@chromium.org, headless-dev, ziggy
Hi Andrey,

Thanks for taking the time to investigate.

I created one bug report, issue 1444963, yesterday, and I was going to ask you how the --timeout flag is supposed to work before creating a second one. From your response it looks like there is a problem with --timeout, so I created the second bug report.

Before posting, I was experimenting with the following strategy in the MBox Viewer application to deal with long print times:

1. Start chrome --headless --print-to-pdf= ... and start an internal 6-second timer.
2. If print-to-pdf completes before the internal timer expires, cancel the internal timer; otherwise
3. kill chrome and start a new chrome as chrome --headless --timeout=6000 --print-to-pdf=

It was not working because --timeout was not working as I expected. Relying on a hard timeout is not the ideal solution, but it should work fairly well. In my experience, the core of the content loads fairly quickly, assuming a fast internet connection.
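
The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not MBox Viewer's actual code: `run_with_deadline` and `print_with_fallback` are hypothetical names, the caller supplies both Chrome argument lists, and the sketch kills only the process it started (Chrome's helper processes are not handled).

```python
import subprocess


def run_with_deadline(args, deadline_seconds):
    """Run a command; kill it and return False if it exceeds the deadline.

    Returns True only when the process exits with code 0 before the deadline.
    """
    process = subprocess.Popen(args)
    try:
        return process.wait(timeout=deadline_seconds) == 0
    except subprocess.TimeoutExpired:
        process.kill()  # step 3: stop the run that took too long
        process.wait()  # reap the killed process so it does not linger
        return False


def print_with_fallback(first_args, fallback_args, first_deadline, fallback_deadline):
    """Try the plain print first; on failure or timeout, try the fallback command.

    Returns "first", "fallback", or "failed" to say which attempt succeeded.
    """
    if run_with_deadline(first_args, first_deadline):
        return "first"
    if run_with_deadline(fallback_args, fallback_deadline):
        return "fallback"
    return "failed"
```

Here `first_args` would be the plain --headless --print-to-pdf command and `fallback_args` the variant carrying --timeout=6000; the external deadline on the fallback guards against the flag not behaving as a hard limit.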

I don't think I will leverage your --user-agent= fix :-)) but thank you.

Thanks again for your help,
Zbigniew

ziggy

May 12, 2023, 6:58:32 PM
to headless-dev, ziggy, ca...@chromium.org, headless-dev
Hi Andrey,

It looks like there is some activity on my two reported issues, and good information has been provided. It looks like the solution is to revert the --timeout implementation to the original one, i.e. to implement the hard timeout. Deciding when navigation is complete is tricky and never perfect, even when using CDP, so having a hard timeout option is useful.

In the past I attempted to leverage CDP to have greater control over printing to PDF in headless mode. My attempt to use --remote-debugging-pipe on Windows didn't work.

I now think that the --remote-debugging-port option is probably the better solution anyway. I would like to use a C++ solution to connect and use CDP. Unfortunately, all the free example solutions I found are based on Java, JavaScript, Go, Python, etc. and not on C++, which is understandable. I think implementing a small subset of CDP to support headless printing is manageable in C++. What I need is to start a dedicated thread within MBox Viewer, use a TCP socket to connect to the debugging port, and talk to Chrome using CDP. Is a secure socket required to connect to Chrome? That would add some complexity, I guess, but should be manageable. I hope I can find a free implementation of secure sockets in C++. I am trying to reduce the number of moving parts in any solution, to limit complexity and maintain reliability.
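
On the secure-socket question: to my knowledge, --remote-debugging-port serves plain HTTP and a plain WebSocket (ws://, not wss://) on the loopback interface, so the extra layer a client needs on top of the TCP socket is the WebSocket handshake and framing rather than TLS. The sketch below, in Python for brevity rather than C++, covers only the two JSON pieces around that transport: reading the WebSocket URL from the /json/version discovery response, and decoding the base64 PDF returned by Page.printToPDF. The function names are hypothetical and the sample response is illustrative, with a made-up browser id.

```python
import base64
import json
from urllib.parse import urlsplit

# Illustrative sample of a /json/version response body; the browser id is made up.
SAMPLE_VERSION_JSON = (
    '{"Browser": "Chrome/113.0.5672.64", '
    '"webSocketDebuggerUrl": '
    '"ws://127.0.0.1:9222/devtools/browser/00000000-0000-0000-0000-000000000000"}'
)


def websocket_endpoint(version_json):
    """Extract (scheme, host, port, path) from a /json/version response body."""
    url = json.loads(version_json)["webSocketDebuggerUrl"]
    parts = urlsplit(url)
    return parts.scheme, parts.hostname, parts.port, parts.path


def pdf_bytes_from_reply(reply_json):
    """Decode the base64 PDF carried in the reply to a Page.printToPDF command."""
    reply = json.loads(reply_json)
    if "error" in reply:
        raise RuntimeError(reply["error"].get("message", "CDP error"))
    return base64.b64decode(reply["result"]["data"])
```

A C++ client would perform the same two steps with any JSON and base64 routines; the scheme returned by `websocket_endpoint` is a quick way to confirm whether TLS is involved for a given Chrome build.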

Suggestions are welcome.

Regards,
Zbigniew

Diogo Almeida

Feb 16, 2024, 1:34:49 PM
to headless-dev, Jerry Lee Daniel, zbigniew...@gmail.com, ca...@chromium.org, headless-dev
In headless mode, Chrome does not wait for the iframe tag to load, so the saved page is broken.
