I am using FPDI to watermark PDFs that our company sells. One PDF that I downgraded from 1.7 to 1.4 through Acrobat looks pretty much the same as the original, but after my watermark function runs, the cover page, which has a black background, has white space along its right and bottom edges. In short, the PDF no longer looks good enough to sell after the whole process.
I hit this same limitation in a project I'm currently working on, and ended up creating my own parser based on TCPDF's parser. It works with a modified version of FPDI called TCPDI and an unmodified copy of FPDF_TPL. It works with TCPDF 6, and supports up to at least PDF 1.6 (I haven't got a 1.7 PDF handy to try, but I'll be hunting one down shortly to make sure it works).
If you're still wanting to do this, please feel free to try out TCPDI / tcpdi_parser - if you encounter any issues, please report them via Github and I'll look into them. Basic installation and usage instructions can be found in the TCPDI README.
I did look into a paid license for FPDI, but I struggled immensely to compile the evaluation version and get it running, and lost hope and confidence. TCPDI lacked any real installation path other than forking or cloning, and every option seemed patchy at best on PHP 7.4.
TL;DR For simple PDF text and metadata extraction, use pdfparser. For advanced options, try pdftotext and pdfinfo from Poppler. To join or split PDF files, encrypt them or apply watermarks, use pdftk. To make a JPEG or PNG screenshot of a PDF, use ImageMagick or pdftocairo.
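As a rough illustration of the command-line tools named above, here is a minimal Python sketch that builds the argument lists for `pdftotext`, `pdfinfo` and `pdftk` and runs them only if the tool is installed. The helper names (`text_cmd`, `info_cmd`, `stamp_cmd`, `run_if_available`) and the file names are hypothetical; the article itself drives these tools from PHP.

```python
import shutil
import subprocess


def text_cmd(pdf, txt):
    # Poppler: extract plain text, keeping the page layout where possible.
    return ["pdftotext", "-layout", pdf, txt]


def info_cmd(pdf):
    # Poppler: print document metadata (title, page count, PDF version, ...).
    return ["pdfinfo", pdf]


def stamp_cmd(pdf, watermark, out):
    # pdftk: stamp the first page of `watermark` on top of every page of `pdf`.
    return ["pdftk", pdf, "stamp", watermark, "output", out]


def run_if_available(argv):
    """Run argv and return the CompletedProcess, or None if the tool is missing."""
    if shutil.which(argv[0]) is None:
        return None
    return subprocess.run(argv, capture_output=True, text=True, check=False)
```

Calling `run_if_available(info_cmd("book.pdf"))` would return `None` on a machine without Poppler instead of raising, which makes it easy to fall back to another tool.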
In the previous article I described several tools that can be used together with PHP to create PDF files. Back then, the choice was not easy and we had a lot of criteria to consider while picking the best tool. Today we will browse possibilities to read and edit existing PDF files.
The library is convenient because it can parse either an existing file or a string of PDF data. It allows you to extract metadata and plain text from a document, along with other objects (images, fonts). However, encrypted files are not yet supported. You can test the library at its demo page.
This is a library made by the creator of TCPDF, a well-known library generating PDF files. This parser draws less interest than the first one, though the author has over 15 years of experience handling PDFs.
I got familiar with this library when I received a bug report for a watermarking module in some e-book system. The module received a PDF, parsed it using FPDI, generated a watermark with FPDF and stamped it over all pages.
The need to extract plain text from a document led me to the Apache PDFBox library. It is written in Java and, as I described before, it offers some very nice features. However, in the PHP world we can only access a CLI wrapper for that library which has a limited set of options.
Software engineer since 2008. Experienced with complex systems for payments, media, advertising and education. Been a scrum master and a team leader. I love fintech, data processing and SQL optimization. Sometimes I talk at meetups.
What is the goal of this interesting subproject?
Will it also be possible to import the newest PDF standards, such as PDF 1.7+?
Will it be possible to manipulate imported pages (copy, move, add text and so on)?
While I can't speak for the status of tcpdf_import / tcpdf_parser, I can tell you that I've just released a pair of projects: TCPDI and tcpdi_parser, which work together with fpdf_tpl to provide PDF importing for TCPDF.
I've put these together because FPDI doesn't support PDFs above 1.4 unless you pay for a commercial addon, and I noticed that tcpdf_parser could cope with at least some PDFs above 1.4. To that end, tcpdi_parser is based on tcpdf_parser, with fixes and enhancements as well as some changes to make it virtually a drop-in replacement for FPDI's parser; TCPDI is essentially a clone of FPDI except using tcpdi_parser instead of FPDI's parser.
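To decide up front whether a file is likely to exceed the 1.4 limit mentioned above, one option is to read the version from the `%PDF-x.y` header. The following is a minimal Python sketch, not part of TCPDI or FPDI; the function names and the `MAX_FREE_FPDI_VERSION` constant are made up for illustration. Note that a PDF's document catalog may declare a newer version than the header, which this sketch does not check.

```python
import re

# Highest version the free FPDI parser is described as handling in this post.
MAX_FREE_FPDI_VERSION = (1, 4)


def pdf_header_version(data: bytes):
    """Return (major, minor) from the %PDF-x.y header, or None if not found."""
    # The header normally sits at offset 0, but some files carry leading
    # junk, so look through the first 1024 bytes.
    match = re.search(rb"%PDF-(\d)\.(\d)", data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def needs_alternative_parser(data: bytes) -> bool:
    """True when the header declares a version above the 1.4 limit."""
    version = pdf_header_version(data)
    return version is not None and version > MAX_FREE_FPDI_VERSION
```

In practice you would pass in the first kilobyte of the file, read in binary mode, and route anything above 1.4 to a different parser.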
You can get both projects from my Github: pauln/tcpdi and pauln/tcpdi_parser - with brief instructions provided in TCPDI's README. You'll also need to grab a copy of fpdf_tpl, if you don't already have it - the link is in the TCPDI README.
The tcpdf_parser.php class is almost complete and can be used in its current state. Only some advanced functions are not yet implemented, such as decoding encrypted documents and support for uncommon filters (tcpdf_filters.php).
The pauln approach doesn't make much sense to me because it duplicates the old code, which is also broken. I suggest that pauln create a new class, such as tcpdf_import, and implement the FPDI-like approach there, using the external tcpdf_parser class.
Q1: Do you think it is really possible to parse and import PDF objects into TCPDF?
When I looked over the parsed structures (up to PDF v1.7), I found it to be an extremely difficult task - just consider the virtually unlimited number of fonts that could be included, all the layout objects, ... I guess importing such complex structures and syncing them with TCPDF variables would require thousands of lines of code and could end up being very slow...
Q1: I think it is possible to parse and import a PDF document to TCPDF but this requires some effort.
As a starting point, it is not necessary to map every single property; we can just import pages as XObjects, as FPDI does. Then we can extend the class to extract and import more features.
Q2: To decrypt a PDF document you will probably still need the same external libraries that TCPDF currently uses to encrypt (in some cases). It is also possible to recode everything in PHP, but sometimes it is more practical to use an external open-source library.
I recently moved my Moodle from Amazon's AWS to a new VPS server and am now experiencing many problems. The biggest is that most PDFs appear corrupted, even when newly uploaded. I've looked into every possibility I can think of and believe there is an Assignment plugin (the FPDI parser) that is not working properly with this new server (running PHP 5.6 on CentOS 6.8 and Moodle 3.1.2+).
Was the moodledata directory transferred in a binary mode or an ascii mode?
You say all the PDFs are corrupted? Or just the ones you've checked out?
You might have to query the DB using the "humanly recognizable" name of a file
to find its contenthash value and then manually copy out of moodledata/filedir/
a file that you think is corrupted.
The example provided shows it was a PDF in an assignment submission and
the humanly recognizable name was "Hasher Alam Task.pdf"
Here's the query to find that file's reference in the DB:
I am having the exact same problem on my Moodle installation. When I try to display PDF files (after a migration) they are corrupt - just as explained in the first post. When I copy a file to the root folder and then access it as test.pdf, all looks good.
We never heard back from @Naomi if what I had suggested worked or not. The only thing in common - and remember just going on information provided here ... without specifics .... issue after a migration.
When I see the term 'migration' it means a site move .... from one location to another. Locations could be localhost -> server (provider) or one provider to another provider .... as an example. That could be accomplished in a variety of ways .... but the 'best' way is to transfer files server to server, using a tool that transfers in binary mode rather than ASCII mode (which FTP can fall into).
That is to say, some files could be corrupted if they were transferred by downloading and then turning around and uploading (FTP ASCII mode). That's two transfers ... down then up. Twice the chance for corruption on transfer.
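To see why an ASCII-mode transfer damages binary files, here is a minimal Python sketch that simulates the line-ending conversion an ASCII-mode transfer to a Unix host performs. The function name and the sample bytes are invented for illustration; real FTP clients and servers may convert in either direction depending on the platforms involved.

```python
def ascii_mode_to_unix(data: bytes) -> bytes:
    """Simulate an ASCII-mode transfer to a Unix host: every CR LF becomes LF."""
    return data.replace(b"\r\n", b"\n")


# A stand-in for binary PDF content. The byte pair 0x0D 0x0A in the middle is
# image or stream data, not a line ending, but the conversion cannot tell.
original = b"%PDF-1.4\r\n\x00\x0d\x0a\xff stream bytes \r\n%%EOF"
damaged = ascii_mode_to_unix(original)

# Three CR bytes were silently dropped, so every byte offset recorded in the
# PDF's cross-reference table after the first one now points to the wrong place.
print(len(original) - len(damaged))
```

Text files survive this because only their line endings change; a PDF or PNG does not, because the same byte pair can occur inside compressed data.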
But then again, is the new server setup exactly like the old? Some folks have reported they simply can not use Moodle's Annotation which involves unoconv and libreoffice in a headless mode on their system.
Unlike other 'helpers' (du/ghostscript), which Moodle can use only after the Moodle admin user has entered a path, the path to unoconv was populated in advance. But if the new server doesn't have unoconv available and it can't be installed for some reason, then that path should be removed.
The odd thing is that new files of whatever kind work perfectly, while files from before the migration don't. When I copy a file from moodledata/filedir/etc to the webroot, my browser serves the file as it should - but when viewed through Moodle, PDFs and PNGs (the file types I have tested) come out mangled.
What are the ownerships/permissions on moodledata - recursively? If it is in a secure location, and you are not running suPHP or something like SELinux without a rule to handle a location Apache doesn't know about, one could be quite liberal with permissions.
I have tested a lot of things, and the only thing I have found is this: if I compare two supposedly identical files - one downloaded via FTP, which is OK, and one downloaded through Moodle, which is broken - the one from Moodle is missing some characters (^M, i.e. carriage returns). I used vimdiff to compare the files.
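The same comparison can be scripted instead of done by eye in vimdiff. The Python sketch below counts carriage returns in two copies of a file and checks a copy against a contenthash value. It assumes that Moodle's contenthash is the SHA-1 of the file's contents (which is also the file's name under moodledata/filedir/); the function names are hypothetical.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Hex SHA-1 digest of the given bytes."""
    return hashlib.sha1(data).hexdigest()


def matches_contenthash(data: bytes, contenthash: str) -> bool:
    """True if the bytes hash to the contenthash recorded for the file.

    Assumption: contenthash is the SHA-1 of the file contents.
    """
    return sha1_hex(data) == contenthash.lower()


def carriage_return_count(data: bytes) -> int:
    """Number of CR (^M, 0x0D) bytes in the data."""
    return data.count(b"\r")


def missing_carriage_returns(good: bytes, suspect: bytes) -> int:
    """How many CR bytes the suspect copy has lost compared with the good one."""
    return carriage_return_count(good) - carriage_return_count(suspect)
```

If the copy on disk still matches its contenthash but the copy served through Moodle does not, the damage is happening at serve time rather than during the migration.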
I have a scanned image that I saved as a PDF file. When I submitted the PDF file to an assignment through the PDF submission plugin, an error was shown, saying "TCPDF ERROR: This document probably uses a compression technique which is not supported by the free parser shipped with FPDI."
I'm interested, too. I just saw the same error for the first time today when trying to grade a PDF that had been submitted to an Assignment activity. We are on 2.6 - no PDF Submission plugin. I was able to open the PDF with no problem. The error was generated when I clicked the "Grade" icon for that submission.
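As far as I know, this error is usually associated with PDFs that use the cross-reference streams and object streams introduced in PDF 1.5, though that is an assumption and not something stated in the error itself. Here is a rough Python heuristic that scans a file for those markers before it is submitted; the function name is hypothetical, and a plain byte search like this can miss cases or give false positives.

```python
import re

# Dictionaries of cross-reference streams and object streams carry
# "/Type /XRef" or "/Type /ObjStm" (whitespace between the names is optional).
_COMPRESSED_STRUCTURE = re.compile(rb"/Type\s*/(XRef|ObjStm)\b")


def probably_uses_compressed_structure(data: bytes) -> bool:
    """Heuristic: True if the raw bytes contain an XRef or ObjStm stream marker."""
    return _COMPRESSED_STRUCTURE.search(data) is not None
```

A file flagged this way could be re-saved at PDF 1.4 compatibility by the tool that produced it before being uploaded.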
Thanks for the tracker reference, Mary, this allowed us to work around the issue. On our Moodle 2.6, once we'd disabled PDF annotation site-wide, we could access the single grade page for affected students again.
This is implemented by using html.parser.HTMLParser from the Python standard library. The whole HTML 5 specification is not supported, and neither is CSS, but bug reports & contributions are very welcome to improve this. cf. Supported HTML features below for details on its current limitations.
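For readers unfamiliar with `html.parser.HTMLParser`, here is a minimal sketch of how a subclass receives tags and text as callbacks. The class name `TextAndTags` is made up for illustration and is not the library's actual implementation.

```python
from html.parser import HTMLParser


class TextAndTags(HTMLParser):
    """Collect start-tag names and non-blank text chunks in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; attrs is a list of (name, value) pairs.
        self.tags.append(tag)

    def handle_data(self, data):
        # Called for text between tags; skip whitespace-only chunks.
        if data.strip():
            self.text.append(data.strip())


parser = TextAndTags()
parser.feed("<p>Hello <b>PDF</b> world</p>")
parser.close()
print(parser.tags)  # ['p', 'b']
print(parser.text)  # ['Hello', 'PDF', 'world']
```

A renderer built this way only understands the tags it has handlers for, which is why partial HTML support is the natural outcome.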