Re: Dompdf command line out of memory with huge html

BrianS

Jan 11, 2013, 10:35:06 AM
to dom...@googlegroups.com
The complexity of the document will increase the amount of memory required. If you can't simplify the document structure then you might try breaking it into parts and combining the resulting PDFs using something like pdftk.


On Friday, January 11, 2013 10:11:22 AM UTC-5, Elia C. wrote:
Hi,

I'm using dompdf-0.6.3 from the command line on a 20MB HTML page (about 300 PDF pages), but the process ends with the message "Killed", and when I look in dmesg I see:

[109791.446010] Out of memory: Kill process 8631 (php) score 447 or sacrifice child
[109791.446014] Killed process 8631 (php) total-vm:3227856kB, anon-rss:2917716kB, file-rss:72kB


I have set the memory limit to 3000M in /etc/php5/cli/php.ini on a machine with 5GB of memory.
I need to convert this file (execution time is not a problem). What can I do?

Elia C.

Jan 15, 2013, 3:35:35 AM
to dom...@googlegroups.com
I cannot break it into parts because I have a table of contents with page number references. I tried to simplify the structure and launched the program from the command line (it ran for more than 14 hours), but nothing happened. Any other suggestions?

BrianS

Jan 15, 2013, 11:16:21 AM
to dom...@googlegroups.com
When running from the command line the resource usage restrictions are usually removed, so your script will run until it uses up the available memory. I've never tried to process a 20MB page, so I can't say how long or how much memory that would take, but again it depends on the complexity of the document. Typically the number one document structure that causes resource problems is tables, so if your document (being so large) has lots of large tables you'll likely run into issues. If you can post your document online we can take a look and see if there's anything you can do to improve the resource usage.
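
To see which limits actually apply to a command-line run, a small check like the following may help. It is only a sketch: the helper name `describe_limits` is made up for illustration, and it reports the settings of whichever php.ini the CLI binary loads. On the CLI `max_execution_time` is typically 0 (unlimited), while `memory_limit` still comes from php.ini.

```php
<?php
// Hypothetical helper: report the limits the current PHP process runs under.
function describe_limits()
{
    return array(
        'sapi'               => PHP_SAPI,                       // "cli" on the command line
        'memory_limit'       => ini_get('memory_limit'),        // e.g. "3000M", or "-1" for no limit
        'max_execution_time' => ini_get('max_execution_time'),  // seconds, 0 means no limit
        'peak_usage_mb'      => round(memory_get_peak_usage(true) / 1048576, 1),
    );
}

foreach (describe_limits() as $name => $value) {
    echo $name, ': ', $value, PHP_EOL;
}
```

Calling it again right after `$dompdf->render()` would show how much memory a given document really needs.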

Elia C.

Jan 23, 2013, 4:37:59 AM
to dom...@googlegroups.com
I have attached the HTML file to this post. Even if I simplify the HTML structure and remove the tables, it can't generate the PDF. Please help me; try it yourself.
Or, if someone has already had this problem, please tell me the solution, or suggest another way to make this PDF or another library.
Thanks
catalog-online.html.zip

BrianS

Jan 23, 2013, 12:42:09 PM
to dom...@googlegroups.com
I'm having similar issues rendering the document. And that's without the CSS. I think it's just the overall size of the document that's a problem for dompdf. We're considering ways of decreasing the memory footprint during rendering (at the cost of execution time), but nothing has been coded yet and won't make it into the next release.

I wouldn't give up on the possibility of splitting the document into parts and then joining them using something like pdftk. Since you're doing the TOC in the manner I outlined in another post you can just modify it to handle multiple documents. Here's an overview of how you might make this work:
  • run a loop to render each category.
  • each category is rendered by its own dompdf object ($dompdf_cat1, $dompdf_cat2)
  • use the same code to capture the page numbering as before, but add to the current_page the last page of the previous category. cat1 would add 1 (to account for the TOC), cat2 would add the total number of pages from cat1, etc.
  • create a dompdf object to hold the toc ($dompdf_toc)
  • render the toc as outlined in the earlier post and run the script to add in the page numbers
  • output the rendered documents and join them together using pdftk
Note: you may be able to save some memory by using a single dompdf object, rendering the current content, and destroying the dompdf object when done. The $GLOBALS content will still be available so long as you do this all within the same script. You could decrease memory usage and processing time even further by using two scripts (a controller and a renderer), but the logic gets more complicated as you need to find a way to pass the $GLOBALS variable between the two scripts.
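
The page arithmetic in the steps above can be sketched without dompdf at all. This is a minimal illustration, assuming the page count of every part is known in output order (TOC first); `page_offsets` and the counts are made up for the example.

```php
<?php
// Hypothetical helper: given the page count of each part in output order,
// return the number of pages that come before each part in the joined PDF.
function page_offsets(array $page_counts)
{
    $offsets = array();
    $pages_so_far = 0;
    foreach ($page_counts as $name => $count) {
        $offsets[$name] = $pages_so_far;
        $pages_so_far += $count;
    }
    return $offsets;
}

// Made-up counts: a 1-page TOC followed by two categories.
$offsets = page_offsets(array('toc' => 1, 'cat1' => 12, 'cat2' => 7));

// Local page 3 of cat2 lands on page 13 + 3 of the joined document.
echo $offsets['cat2'] + 3, PHP_EOL;
```

Each part's offset is what gets added to its local page number, both in the page footer and in the TOC entries that point into that part.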

Give it a try and see if you can get it working. If not I can try to find some time to pull together some code for you to look at.


And, though I hate for you to give up on dompdf, you might have better luck with some other rendering libraries. I don't know that we can offer an easier solution than what I've outlined above until we can spend more time optimizing dompdf or implementing some functionality for low-memory environments.

Elia C.

Jan 24, 2013, 7:12:39 AM
to dom...@googlegroups.com
I'll try, thank you so much for the answer.

I love dompdf, but I must finish this document and it is stressing me a lot.

BrianS

Jan 24, 2013, 10:35:26 AM
to dom...@googlegroups.com
If you have any trouble figuring out how to get this done post back here. As I said I'm happy to try to work up a sample.

Elia C.

Jan 25, 2013, 10:57:11 AM
to dom...@googlegroups.com
As you can imagine, the attached file is the product of a script. More precisely, it is the output of a Magento module that I have modified to use the dompdf library.
Magento uses the MVC pattern and the view part is very fragmented, so it's hard for me to render each category separately. If you have any other suggestions I can try them.

BrianS

Jan 25, 2013, 11:59:49 AM
to dom...@googlegroups.com
Before you get too far down this path I'll just remind you that it doesn't hurt to also check out other libraries. We'll miss you, but totally understand.

Your document is fairly consistent in its structure. The most straightforward way to split your document apart is to loop through it line by line to grab the various parts.

$lines = explode("\n", $html); // double quotes, so "\n" is a real newline
$sections = array(0 => '');    // section 0 collects the lines before the first split point
$section_num = 0;
foreach ($lines as $line) {
  // start a new page when we reach lines with the following text in them
  if (strpos($line, 'category-family') !== false) {
    $section_num++;
    $sections[$section_num] = '';
  }
  $sections[$section_num] .= "\r\n" . trim($line);
  // the first content section (the TOC) doesn't start with anything specific, so we'll end it at a known point
  if (strpos($line, 'col-main') !== false) {
    $section_num++;
    $sections[$section_num] = '';
  }
}
unset($lines);
// Using PHP 5.3+? You can force garbage collection here to try and free up some memory (though at the cost of run time). Call gc_collect_cycles().

// $sections[0] is the start of your HTML (the HTML head plus starting BODY content)
// $sections[1] is your TOC
// all other indices are your document contents

// now render each section
// render $sections[1] last, since it needs to have the page numbers filled out
for ($section_num = 2; $section_num < count($sections); $section_num++) {
  $dompdf = new DOMPDF;
  $dompdf->load_html($sections[0].$sections[$section_num]);
  $dompdf->render();
  file_put_contents('somefile.sec'.$section_num.'.pdf',$dompdf->output());
  unset($dompdf);
  // Using PHP 5.3+? You can force garbage collection here to try and free up some memory (though at the cost of run time). Call gc_collect_cycles().
}
$dompdf = new DOMPDF;
$dompdf->load_html($sections[0].$sections[1]);
$dompdf->render();
file_put_contents('somefile.sec1.pdf',$dompdf->output());
unset($dompdf);

// finally, join all the parts
$exec_cmd = 'pdftk';
for ($section_num = 1; $section_num < count($sections); $section_num++) {
  $exec_cmd .= ' ' .'somefile.sec'.$section_num.'.pdf';
}
$exec_cmd .= ' cat output ' . 'somefile.pdf';
exec($exec_cmd, $pdftk_output, $pdftk_return_code);

This is ugly, and I haven't tested or even read carefully what I just wrote so it may be buggy. The HTML will also be a little off, but I think it should still render ok. Also, you may have to adjust the code to take into account your knowledge of the document structure. Finally, doing things this way will mean that each section will start on a new page so you have to decide if that's ok.
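
If the file names might ever contain spaces or shell metacharacters, the join command can be built with every argument quoted. The following is a sketch only: `build_pdftk_command` is a made-up helper, and the `pdftk ... cat output ...` form is the one used in the code above.

```php
<?php
// Hypothetical helper: build the pdftk join command with every file name quoted.
function build_pdftk_command(array $part_files, $output_file)
{
    $quoted = array_map('escapeshellarg', $part_files);
    return 'pdftk ' . implode(' ', $quoted) . ' cat output ' . escapeshellarg($output_file);
}

$cmd = build_pdftk_command(array('somefile.sec1.pdf', 'somefile.sec2.pdf'), 'somefile.pdf');
echo $cmd, PHP_EOL;

// To run it: exec($cmd, $pdftk_output, $pdftk_return_code);
// By the usual command-line convention a non-zero return code signals a failed join.
```

The order of `$part_files` is the page order of the result, so the TOC file goes first.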

This is just one possible method. You could also use regular expressions to parse the content or even load your document into a DOM and use that to parse it. You'll have to figure out what way is easiest for you to work with.
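
As a rough illustration of the regular-expression route, the sketch below splits the HTML in front of every line that contains a marker string. The helper name `split_before_marker` is hypothetical, and it assumes, as the line-by-line version does, that each section starts on a line containing 'category-family'.

```php
<?php
// Hypothetical helper: split $html before every line that contains $marker,
// keeping the marker line at the start of its own chunk.
function split_before_marker($html, $marker)
{
    // ^ with the m flag matches at each line start; the lookahead keeps the
    // match zero-length, so no text is consumed by the split.
    $pattern = '/^(?=.*' . preg_quote($marker, '/') . ')/m';
    return preg_split($pattern, $html, -1, PREG_SPLIT_NO_EMPTY);
}

$sample = "head\n<div class=\"category-family\">A</div>\nbody A\n"
        . "<div class=\"category-family\">B</div>\n";
$chunks = split_before_marker($sample, 'category-family');
echo count($chunks), PHP_EOL;
```

On a very large input with long lines, preg_split() can return false if a PCRE limit is hit, so the result is worth checking before looping over it.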

Elia C.

Jan 29, 2013, 6:52:43 AM
to dom...@googlegroups.com
OK, by breaking up the file I can now render the individual parts and join them together with pdftk.
Now the problem is the current page number, the total page count and the TOC. How can I set them correctly?

BrianS

Feb 2, 2013, 8:57:35 PM
to dom...@googlegroups.com
As I mentioned before, you'll need to take into account earlier sections generated by dompdf. You can do this by storing the number of processed pages in the global variable. Just add that to the page number/page count with each inline script.

Elia C.

Feb 4, 2013, 12:52:00 PM
to dom...@googlegroups.com
This is what I'm doing:

In controller:

$page = 1;
for ($i = 1; $i < count($parts); $i++) {

    $parts[$i] = str_replace('%page_number%', $page, $parts[$i]);

    $dompdf = new DOMPDF();
    $dompdf->set_paper('a4');
    $dompdf->load_html($parts[$i]);
    $dompdf->render();
    $page += $dompdf->get_canvas()->get_page_count();         
    unset($dompdf);
    gc_collect_cycles();

}

where each part contains this at the top:

<script type="text/php">
            $pdf->set_page_number(%page_number%);
</script>      

but this code sets the wrong page number, even though the value of $page is correct. Where am I going wrong? Could you write a few lines of code?
Thanks

BrianS

Feb 6, 2013, 1:24:45 PM
to dom...@googlegroups.com
Make sure you force your top-of-page script to be part of the document body. If, after splitting your document, you haven't added an explicit <body> element around each page then the scripts will be moved to the document head as part of the HTML load process.
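
One way to do that is to wrap every fragment in a complete document before handing it to load_html(). This is a minimal sketch: `wrap_part` is a made-up helper, and the `set_page_number()` call is copied from the post above rather than verified here.

```php
<?php
// Hypothetical helper: wrap one fragment in a full document so the inline
// script sits inside <body> instead of being moved into <head>.
function wrap_part($head_html, $inline_script, $fragment)
{
    return "<html>\n<head>\n" . $head_html . "\n</head>\n<body>\n"
         . $inline_script . "\n" . $fragment . "\n</body>\n</html>";
}

// The page number 13 is an example value; in the controller it would be $page.
$script = '<script type="text/php">$pdf->set_page_number(13);</script>';
$html = wrap_part('<title>Catalog</title>', $script, '<div class="category-family">Category A</div>');
echo $html, PHP_EOL;
```

In the controller loop this would replace the bare `$parts[$i]` passed to load_html().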